Comparing strings from the back?


Steven D'Aprano

That sort of string isn't a normal thing to be comparing, though.

Here's an idea. Someone who's doing a lot of arguing in this thread
should take Python, find the string comparison routine, and hack in some
statistics-gathering. Then run *real code* on it.

Where's the fun in that?

:p
 

Oscar Benjamin

<snip>

After further thought, and giving consideration to the arguments given by
people here, I'm now satisfied to say that for equal-length strings,
string equality is best described as O(N).

1) If the strings are equal, a == b will always compare all N
characters in each string.

2) If the strings are unequal, a == b will *at worst* compare
all N characters.

3) Following usual practice in this field, worst case is the
one which conventionally is meant when discussing Big Oh
behaviour. See, for example, "Introduction To Algorithms"
by Cormen, Leiserson and Rivest.

Would you say, then, that dict insertion is O(N)?

Also of some interest is the best case: O(1) for unequal strings (they
differ at the first character) and O(N) for equal strings.

Also of interest is the case that has caused the majority of the
discussion, the average case. I am now satisfied that the average number
of comparisons for unequal strings is O(1). To be precise, it is bounded
below by 1 comparison (you always have to compare at least one pair of
characters) and bounded above by 2 comparisons.

I find this idea of separating into the comparison of equal strings versus the
comparison of unequal strings rather odd. If the strings you compare come from
a distribution where they are guaranteed to be equal (or unequal) then you can
just use the O(0) comparison method.

Since string comparison is only useful if the strings can be equal or unequal,
the average case depends on how often they are equal/unequal as well as the
average complexity of both. For random strings the frequency of equal strings
decreases very fast as N increases so that the comparison of random strings is
O(1).

(I'm talking about the average here -- the actual number of comparisons
can range all the way up to N, but the average is <= 2.)

If I've done the maths right, the exact value for the average is:

((M-1)*sum( (N-i)*M**i for i in range(0, N) ) + N)/(M**N)

I'm not sure where the extra N comes from --------^ but otherwise good.

I would have written that as:

(1 - p) * sum(i * p**(i-1) for i in range(1, N+1))

where p is the probability of a match (1/M for M equally likely characters) or
in closed form:

(1 - p**N*(1 + N*(1 - p))) / (1 - p)
for random strings of length N taken from an alphabet of size M.

For M = 2, that average approaches but never exceeds 2 as N increases;
for M = 3, the average approaches 1.5, for M = 4 it approaches 1.333...
and so forth.

It approaches 1 / (1 - p) or, if you prefer: M / (M - 1)
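
As a quick numerical sanity check (my own addition, assuming uniformly random
characters so that p = 1/M): the sum form, the closed form above and Steven's
original expression can be compared directly, with the N*p**N term for the
all-equal case accounting for the difference between them.

def oscar_sum(N, M):
    p = 1.0 / M
    return (1 - p) * sum(i * p**(i - 1) for i in range(1, N + 1))

def oscar_closed(N, M):
    p = 1.0 / M
    return (1 - p**N * (1 + N * (1 - p))) / (1 - p)

def steven(N, M):
    return ((M - 1) * sum((N - i) * M**i for i in range(N)) + N) / float(M**N)

for N in range(1, 20):
    for M in range(2, 6):
        p = 1.0 / M
        # Oscar's sum and closed form agree, and adding the all-equal
        # contribution N*p**N recovers Steven's expression.
        assert abs(oscar_sum(N, M) - oscar_closed(N, M)) < 1e-9
        assert abs(oscar_closed(N, M) + N * p**N - steven(N, M)) < 1e-9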

Oscar
 

Oscar Benjamin

Since string comparison is only useful if the strings can be equal or unequal,
the average case depends on how often they are equal/unequal as well as the
average complexity of both. For random strings the frequency of equal strings
decreases very fast as N increases so that the comparison of random strings is
O(1).


I'm not sure where the extra N comes from --------^ but otherwise good.

Ok, I see it's for the case where they're equal.

Oscar
 

Steven D'Aprano

Would you say, then, that dict insertion is O(N)?

Pedantically, yes.

But since we're allowed to state (or even imply *wink*) whatever
assumptions we like, we're allowed to assume "in the absence of
significant numbers of hash collisions" and come up with amortized O(1)
for dict insertions and lookups.

(Provided, of course, that your computer has an infinite amount of
unfragmented memory and the OS never starts paging your dict to disk.
Another unstated assumption that gets glossed over when we talk about
complexity analysis -- on real world computers, for big enough N,
*everything* is O(2**N) or worse.)

Big Oh analysis, despite the formal mathematics used, is not an exact
science. Essentially, it is a way of bringing some vague order to
hand-wavy estimates of complexity, and the apparent mathematical rigour is
built on some awfully shaky foundations. But despite that, it actually is
useful.

Coming back to strings... given that in any real-world application, you
are likely to have some string comparisons on equal strings and some on
unequal strings, and more importantly you don't know which are which
ahead of time, which attitude is less likely to give you a nasty surprise
when you run your code?

"I have many millions of 100K strings to compare against other 100K
strings, and string comparisons are O(1) so that will be fast."

"I have many millions of 100K strings to compare against other 100K
strings, and string comparisons are O(N) so that will be slow, better
find another algorithm."


Remember too that "for small enough N, everything is O(1)". Getting hung
up on Big Oh is just as much a mistake as ignoring it completely.

I find this idea of separating into the comparison of equal strings
versus the comparison of unequal strings rather odd. If the strings you
compare come from a distribution where they are guaranteed to be equal
(or unequal) then you can just use the O(0) comparison method.

If you know that they're (un)equal, you don't need to call == at all.

If you know that "most" strings will be unequal, then you might be
justified in treating comparisons as O(1) "most of the time" and not
stress about the occasional slow call. But of course most of the time you
don't know this, which is why it is appropriate to treat string
comparisons as O(N) rather than O(1), since that's the upper bound.

Since string comparison is only useful if the strings can be equal or
unequal, the average case depends on how often they are equal/unequal as
well as the average complexity of both. For random strings the frequency
of equal strings decreases very fast as N increases so that the
comparison of random strings is O(1).

But that is not an upper bound, and Big Oh analysis is strictly defined
in terms of upper bounds.


I'm not sure where the extra N comes from --------^ but otherwise good.

The extra N comes from the case where you compare string S with itself,
which takes exactly N comparisons.
 

Oscar Benjamin

Pedantically, yes.

But since we're allowed to state (or even imply *wink*) whatever
assumptions we like, we're allowed to assume "in the absence of
significant numbers of hash collisions" and come up with amortized O(1)
for dict insertions and lookups.

(Provided, of course, that your computer has an infinite amount of
unfragmented memory and the OS never starts paging your dict to disk.
Another unstated assumption that gets glossed over when we talk about
complexity analysis -- on real world computers, for big enough N,
*everything* is O(2**N) or worse.)

Big Oh analysis, despite the formal mathematics used, is not an exact
science. Essentially, it is a way of bringing some vague order to
hand-wavy estimates of complexity, and the apparent mathematical rigour is
built on some awfully shaky foundations. But despite that, it actually is
useful.

Coming back to strings... given that in any real-world application, you
are likely to have some string comparisons on equal strings and some on
unequal strings, and more importantly you don't know which are which
ahead of time, which attitude is less likely to give you a nasty surprise
when you run your code?

"I have many millions of 100K strings to compare against other 100K
strings, and string comparisons are O(1) so that will be fast."

"I have many millions of 100K strings to compare against other 100K
strings, and string comparisons are O(N) so that will be slow, better
find another algorithm."

True. I can't think of a situation where I've used string comparisons
directly in any text-heavy code. Rather, I would use a dict or a set (or a
regex), and hash(str) is always O(N).

Remember too that "for small enough N, everything is O(1)". Getting hung
up on Big Oh is just as much a mistake as ignoring it completely.

I can't think of a situation in my own work where O(N) vs O(1) string
comparisons would cause a significant problem (except perhaps in libraries
that I use but didn't write). However, I can find a number of cases where I
compare numpy.ndarrays for equality. For example, I found

if np.all(a == b):

in some code that I recently wrote. Although np.all() short-circuits, a==b
does not, so that line forces O(N) behaviour onto a situation where the average
case can be better. Unfortunately numpy doesn't seem to provide a
short-circuit equals() function. array_equal() is what I want but it does the
same as the above. In future, I'll consider using something like

def cmparray(a, b):
    return (a.shape == b.shape and a.dtype == b.dtype
            and buffer(a) == buffer(b))

to take advantage of (what I assume are) short-circuit buffer comparisons.
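
For what it's worth, a rough Python 3 analogue might look like this (a sketch
only: buffer() is gone in Python 3 and memoryview is its nearest replacement,
and whether memoryview equality really short-circuits on the first differing
byte is an implementation detail I haven't verified):

import numpy as np

def cmparray(a, b):
    # Cheap metadata checks first, then compare the underlying bytes.
    return (a.shape == b.shape and a.dtype == b.dtype
            and memoryview(a) == memoryview(b))

a = np.arange(1000)
b = a.copy()
b[0] = -1                      # differ at the very first element
print(cmparray(a, b))          # False
print(cmparray(a, a.copy()))   # True
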
But that is not an upper bound, and Big Oh analysis is strictly defined
in terms of upper bounds.

It is an upper bound, but it is an upper bound on the *expectation value*
assuming a particular distribution of inputs, rather than an upper bound on
all possible inputs.

The average is actually bounded by 1 / (1 - p) where p is the probability that
two characters match. This bound can be arbitrarily large as p approaches 1 as
would be the case if, say, one character was much more likely than others. The
particular assumption that you have made p = 1/M where M is the number of
characters is actually the *smallest* possible value of p. For non-uniform
real data (English words for example) p is significantly greater than 1/M but
in a strict bounds sense we should say that 1/M <= p <= 1.
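
To put a number on that (my own illustration): for characters drawn
independently from a distribution with frequencies f_i, the probability that
two characters match is p = sum(f_i**2), so a skewed alphabet pushes the
1 / (1 - p) bound up quickly.

def match_prob(freqs):
    # Probability that two independently drawn characters are equal.
    return sum(f * f for f in freqs)

for freqs in ([0.5, 0.5], [0.99, 0.01]):
    p = match_prob(freqs)
    print("p = %.4f   bound = %.1f" % (p, 1 / (1 - p)))
# p = 0.5000   bound = 2.0
# p = 0.9802   bound = 50.5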

Oscar
 

Gelonida N

On Wed, 05 Sep 2012 22:47:14 +0000, Oscar Benjamin wrote:

I may have been overly-conservative earlier when I said that on average
string equality has to compare half the characters. I thought I had
remembered that from a computer science textbook, but I can't find that
reference now, so possibly I was thinking of something else. (String
searching perhaps?).

Yeah, I think you mixed it up with searching for an entry in an unsorted
list of length N.


That's one of the most common N/2 cases, one that you hit daily with many
straightforward implementations.
 

Gelonida N

On Thu, 06 Sep 2012 06:07:38 -0400, Dave Angel wrote:


Also of some interest is the best case: O(1) for unequal strings (they
differ at the first character) and O(N) for equal strings.

The worst case is O(N), or N characters;
the average case is O(1), or two characters.

For denial of service attacks, or for systems that are NEVER allowed to
fail, the worst case is important.

For most other cases the average complexity counts.

However, I still wonder for how many applications the complexity of
string comparisons would be the limiting factor.
 

Steven D'Aprano

and of course if you ever do find an application where that worst case
matters there's an easy way round it: just call intern() on all the
strings when they are created.


There are other good reasons for interning strings, but speeding up
"astring == bstring" is not one of them.


[steve@ando ~]$ python -m timeit -s "s = 'abc'*100000" -s \
"t = s[:-1] + 'x'" "s == t"
1000 loops, best of 3: 910 usec per loop
[steve@ando ~]$ python -m timeit -s "s = 'abc'*100000" -s \
"t = s[:-1] + 'x'" -s "intern(s); intern(t)" "s == t"
1000 loops, best of 3: 914 usec per loop

No significant difference.

To state the bleedin' obvious, the computational effort required to
*compare two strings* pre-supposes that those strings already exist, and
is *completely independent* of the complexity of creating the strings.

For the comparison to be the limiting factor you have to be doing a lot
of comparisons

Correct.

on the same string (otherwise creating the string would be the limiting
factor),

Not necessarily.

Look, it's really hard to come up with a realistic, non-contrived example
where string comparisons are a significant bottleneck in a non-toy
application. So let me first off say that *nearly always* you're not
going to care whether "s == t" looks at all N characters or just the
first 2 (or 20, or 100). This argument is rather academic (the best sort
of argument!). Until you start getting up into truly huge strings, we're
arguing about how many angels can dance on a CPU cache.

But for the record, in principle string comparisons *could* be the
bottleneck. Example: you have 10000 strings, which are each created once
and stored in a list. Then you iterate over the list, comparing every
string against every other string. And due to some weird vagary of the
data, the strings are nearly all equal.

(Why would you do this? I don't know, maybe it's a programmers' challenge
found on the Internet, make up your own scenario...)

Total number of strings created: 10000.
Total number of strings compared: 100000000.

The overhead of creating the strings is trivial compared to the overhead
of comparing them, and since each string is only created once anyway,
interning them is just a waste of time.
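
A scaled-down toy version of that scenario (the sizes are arbitrary, just big
enough for the effect to show and small enough to run in a few seconds):

import time

n, length = 1000, 1000   # arbitrary toy sizes
t0 = time.time()
# Nearly identical strings: they differ, if at all, only in the last character.
strings = ["x" * (length - 1) + ("a" if i % 2 else "b") for i in range(n)]
created = time.time() - t0

t0 = time.time()
equal = sum(s == t for s in strings for t in strings)   # n*n comparisons
compared = time.time() - t0

print("creating %d strings: %.3fs   %d comparisons: %.3fs   (%d equal)"
      % (n, created, n * n, compared, equal))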

so at the expense of a single dictionary
insertion when the string is created you can get guaranteed O(1) on all
the comparisons.

What interning buys you is that "s == t" is an O(1) pointer compare if
they are equal. But if s and t differ in the last character, __eq__ will
still inspect every character. There is no way to tell Python "all
strings are interned, if s is not t then s != t as well".
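
A small demonstration of the part that interning does buy you (sys.intern is
where intern() lives in Python 3):

import sys

# Two equal strings built at run time are normally distinct objects...
a = "".join(["comparing strings ", "from the back"])
b = "".join(["comparing strings fr", "om the back"])
print(a == b, a is b)    # True False

# ...but after interning, equal strings share one object, so an equality
# test can succeed on the pointer comparison alone.
a = sys.intern(a)
b = sys.intern(b)
print(a is b)            # True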
 

Oscar Benjamin

What interning buys you is that "s == t" is an O(1) pointer compare if
they are equal. But if s and t differ in the last character, __eq__ will
still inspect every character. There is no way to tell Python "all
strings are interned, if s is not t then s != t as well".

I thought that if *both* strings were interned then a pointer comparison could
decide if they were unequal without needing to check the characters.

Have I misunderstood how intern() works?

Oscar
 

Chris Angelico

I thought that if *both* strings were interned then a pointer comparison could
decide if they were unequal without needing to check the characters.

Have I misunderstood how intern() works?

In a language where _all_ strings are guaranteed to be interned (such
as Lua, I think), you do indeed gain this. Pointer inequality implies
string inequality. But when interning is optional (as in Python), you
cannot depend on that, unless there's some way of recognizing interned
strings. Of course, that may indeed be the case; a simple bit flag
"this string has been interned" would suffice, and if both strings are
interned AND their pointers differ, THEN you can be sure the strings
differ.

I have no idea whether or not CPython version X.Y.Z does this. The
value of such an optimization really depends on how likely strings are
to be interned; for instance, if the compiler automatically interns
all the names of builtins, this could be quite beneficial. Otherwise,
probably not; most Python scripts don't bother interning anything.

ChrisA
 

Oscar Benjamin

In a language where _all_ strings are guaranteed to be interned (such
as Lua, I think), you do indeed gain this. Pointer inequality implies
string inequality. But when interning is optional (as in Python), you
cannot depend on that, unless there's some way of recognizing interned
strings. Of course, that may indeed be the case; a simple bit flag
"this string has been interned" would suffice, and if both strings are
interned AND their pointers differ, THEN you can be sure the strings
differ.

I have no idea whether or not CPython version X.Y.Z does this. The
value of such an optimization really depends on how likely strings are
to be interned; for instance, if the compiler automatically interns
all the names of builtins, this could be quite beneficial. Otherwise,
probably not; most Python scripts don't bother interning anything.

I haven't looked at the source but my understanding was precisely that there
is an intern() bit and that not only the builtins module but all the literals
in any byte-compiled module are interned.

Oscar
 

Chris Angelico

s/literals/identifiers/

You can see the interned flag in the PyUnicodeObject struct here:
http://hg.python.org/cpython/file/3ffd6ad93fe4/Include/unicodeobject.h#l303

Ah, yep, so that's there. In that case, it's possible to have that
optimization. However, I may be misreading this, but it seems the only
Unicode comparison function is a rich compare, which is unable to take
advantage of a known difference:
http://hg.python.org/cpython/file/b48ef168d8c5/Objects/unicodeobject.c#l6114

Different pointers prove the strings differ, but don't tell you which
is to be sorted earlier. You could use this if you roll your own
comparison in C; or, if you already know the strings are interned, you
can use 'is' / 'is not'. But that seems to be the extent of it.

ChrisA
 

Dan Goodman

Let's assume you're testing two strings for equality. You've already
done the obvious quick tests (i.e. they're the same length), and you're
down to the O(n) part of comparing every character.

I'm wondering if it might be faster to start at the ends of the strings
instead of at the beginning? If the strings are indeed equal, it's the
same amount of work starting from either end. But, if it turns out that
for real-life situations, the ends of strings have more entropy than the
beginnings, the odds are you'll discover that they're unequal quicker by
starting at the end.

From the rest of the thread, it looks like in most situations it won't
make much difference as typically very few characters need to be
compared if they are unequal.

However, if you were in a situation with many strings which were almost
equal, the most general way to improve the situation might be store a
hash of the string along with the string, i.e. store (hash(x), x) and
then compare equality of this tuple. Almost all of the time, if the
strings are unequal the hash will be unequal. Or, as someone else
suggested, use interned versions of the strings. This is basically the
same solution but even better. In this case, your startup costs will be
higher (creating the strings) but your comparisons will always be instant.
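
A minimal illustration of the (hash(x), x) idea (my own toy example): tuple
comparison works element by element and stops at the first difference, so
when the hashes differ the strings themselves are never touched.

a = "spam" * 100000 + "a"
b = "spam" * 100000 + "b"

ta = (hash(a), a)
tb = (hash(b), b)

# The integer hashes differ for almost all unequal strings, so this
# comparison is usually decided without scanning 400001 characters.
print(ta == tb)            # False
print(ta == (hash(a), a))  # True: equal hashes, then a full string compare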

Dan
 

Oscar Benjamin

From the rest of the thread, it looks like in most situations it won't
make much difference as typically very few characters need to be
compared if they are unequal.

However, if you were in a situation with many strings which were almost
equal, the most general way to improve the situation might be store a
hash of the string along with the string, i.e. store (hash(x), x) and
then compare equality of this tuple. Almost all of the time, if the
strings are unequal the hash will be unequal. Or, as someone else
suggested, use interned versions of the strings. This is basically the
same solution but even better. In this case, your startup costs will be
higher (creating the strings) but your comparisons will always be instant.

Computing the hash always requires iterating over all characters in the
string, so it is best case O(N), whereas string comparison is best case (and
often average case) O(1).

Also, so far as I know the hash value once computed is stored on the string
object itself [1] and used for subsequent string comparisons so there's no
need for you to do that in your code.

Oscar

[1] http://hg.python.org/cpython/file/71d94e79b0c3/Include/unicodeobject.h#l293
 

Dan Goodman

Computing the hash always requires iterating over all characters in the
string, so it is best case O(N), whereas string comparison is best case (and
often average case) O(1).

Yep, but you already have O(N) costs just creating the strings in the
first place, so it's absorbed into that. It's only worth doing if you do
many comparisons.

Also, so far as I know the hash value once computed is stored on the string
object itself [1] and used for subsequent string comparisons so there's no
need for you to do that in your code.

Cool, Python is very clever as always! :)

Dan
 

Dan Goodman

From the rest of the thread, it looks like in most situations it won't
make much difference as typically very few characters need to be
compared if they are unequal.

However, if you were in a situation with many strings which were almost
equal, the most general way to improve the situation might be store a
hash of the string along with the string, i.e. store (hash(x), x) and
then compare equality of this tuple. Almost all of the time, if the
strings are unequal the hash will be unequal. Or, as someone else
suggested, use interned versions of the strings. This is basically the
same solution but even better. In this case, your startup costs will be
higher (creating the strings) but your comparisons will always be instant.

Just had another thought about this. Although it's unlikely to be
necessary in practice since (a) it's rarely necessary at all, and (b)
when it is, hashing and optionally interning seems like the better
approach, I had another idea that would be more general. Rather than
starting from the beginning or the end, why not do something like: check
the first and last character, then the len/2 character, then the len/4,
then 3*len/4, then len/8, 3*len/8, etc. You'd need to be a bit clever
about making sure you hit every character but I'm sure someone's already
got an efficient algorithm for this. You could probably even make this
cache efficient by working on cache line length blocks. Almost certainly
entirely unnecessary, but I like the original question and it's a nice
theoretical problem.
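
One possible way to generate such an order, as a rough sketch (it visits
every index, but it is certainly not tuned for cache-line-sized blocks):

def probe_order(n):
    # Yield 0..n-1 starting with the endpoints, then the midpoints of
    # ever smaller intervals, so entropy anywhere in the string is seen
    # early.
    if n == 0:
        return
    yield 0
    if n > 1:
        yield n - 1
    seen = {0, n - 1}
    intervals = [(0, n - 1)]
    while intervals:
        nxt = []
        for lo, hi in intervals:
            mid = (lo + hi) // 2
            if mid not in seen:
                seen.add(mid)
                yield mid
            if mid - lo > 1:
                nxt.append((lo, mid))
            if hi - mid > 1:
                nxt.append((mid, hi))
        intervals = nxt

print(list(probe_order(9)))   # [0, 8, 4, 2, 6, 1, 3, 5, 7]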

Dan
 

Oscar Benjamin

Just had another thought about this. Although it's unlikely to be
necessary in practice since (a) it's rarely necessary at all, and (b)
when it is, hashing and optionally interning seems like the better
approach, I had another idea that would be more general. Rather than
starting from the beginning or the end, why not do something like: check
the first and last character, then the len/2 character, then the len/4,
then 3*len/4, then len/8, 3*len/8, etc. You'd need to be a bit clever
about making sure you hit every character but I'm sure someone's already
got an efficient algorithm for this. You could probably even make this
cache efficient by working on cache line length blocks. Almost certainly
entirely unnecessary, but I like the original question and it's a nice
theoretical problem.

It's not totally theoretical in the sense that the reasoning applies to all
sequence comparisons. If you needed to compare lists of objects where the
comparison of each pair of elements was an expensive operation then you would
want to think carefully about what order you used. Also in general you can't
hash/intern all sequences.

If I was going to change the order of comparisons for all strings then I would
use a random order. This is essentially how dict gets away with claiming to
have O(1) lookup. There are sequences of inputs that can cause every possible
hash collision to occur but because the hash function acts as a kind of
randomisation filter the pathological sequences are very unlikely to occur
unless someone is going out of their way. The clever way that Python 3.3
prevents someone from even doing this on purpose is just to introduce
additional per-process randomisation.

The difference between dict lookup and string comparison is that string
comparison always compares the characters in the same order and it corresponds
to the natural ordering of the data. This means that some perfectly natural use
cases like comparing file-paths can have close to worst case behaviour. If
string/sequence comparison occurs in a random order then there can be no use
case where the likely strings would induce close to worst case behaviour
unless you really are just comparing lots of almost identical sequences.
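
A sketch of what a randomised-order equality test could look like (purely
illustrative: shuffling the index list is itself O(N), so a real
implementation would want something cheaper, such as a random start offset
and stride):

import random

def eq_random_order(a, b):
    # Compare equal-length sequences in a random order so that no fixed
    # input pattern (e.g. long shared prefixes in file paths) can reliably
    # trigger the worst case.
    if len(a) != len(b):
        return False
    indices = list(range(len(a)))
    random.shuffle(indices)
    return all(a[i] == b[i] for i in indices)

print(eq_random_order("/usr/local/lib/spam", "/usr/local/lib/eggs"))  # False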

Oscar
 

Terry Reedy

On 11 September 2012 10:51, Duncan Booth <[email protected]> wrote:

Oscar Benjamin <[email protected]> wrote:


I don't think you've misunderstood how it works, but so far as I can see
the code doesn't attempt to short-circuit the "not equal but interned" case.
The comparison code doesn't look at interning at all, it only looks for
identity as a shortcut.


It also doesn't seem to check if the hash values have been set. I guess
the cached hash value is only used in contexts where the hash is
explicitly desired.

I believe the internal use of interning and hash comparison has varied
from release to release. However, the main use of string comparison is
for dict keys, especially the internal dicts for namespaces and
attributes. Since the dict lookup code needs hash values anyway, to find
slots with possible conflicts, I am sure it does not use the generic
comparison operators but starts with hash comparisons.

Terry Jan Reedy
 

Prasad, Ramit

Dwight said:
Why don' you just time it,eit lops through incrementing thmax input/

What? Without context I have no idea what this means.


Ramit

 
