Actually, I'm not. I'm stating exactly what assumptions I'm making to
get my calculation. I'm comparing *random* character strings or
bitstrings.
Excuse me, you are not. You are comparing English words, which are
highly non-random.
You, on the other hand, are making vague assumptions which you do not
care to formalize, and yet you claim that "the number of comparisons is
equally likely to be 1, 2, 3, ..., N. The average then is". Without any
explanation for this. At all.
I will accept that my explanation was not good enough, but I strongly
disagree that I gave no explanation at all.
Is your assumption that we're comparing words that have the common
prefix "Herr Professor Frederick "?
No, I am pointing out that *your* assumption that most string comparisons
will halt close to the beginning of the string is an invalid assumption.
Your assumption only holds for some non-random strings.
[...]
You're wrong. It's not counting the number of string comparisons,
I didn't say it was.
it's counting the number of character comparisons.
Correct. But it counts them only over an extremely limited subset of
all possible string comparisons: the very few that happen when sorting
lists of English words with one specific algorithm, timsort.
Which is exactly what we're talking about in this thread, are we not?
Sorting the list is just an example where lots of strings get compared.
It is not good enough to extract a non-random subset of strings (only
English words) and then shrink the data set even further by comparing
only an extremely non-uniform subset of those strings: specifically the
few string comparisons that happen to occur under timsort.
Whatever figure you have calculated by taking this non-random selection,
it is irrelevant to the question of the average performance of string
equality given random strings.
Again, it's not counting the number of string comparisons. It's counting
the average number of character comparisons per string comparison. The
total number of string comparisons has nothing to do with it.
I didn't say anything about counting string comparisons. But you are
wrong -- the numbers you get are *strongly* influenced by the number of
string comparisons performed. That is just one of the reasons why the
results you get are biased.
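
To pin down what is being counted, here is a minimal model of a single
equality test on two equal-length strings. It is a sketch only -- the
helper name char_comparisons is hypothetical, and CPython's actual
implementation short-circuits on object identity and on length before
it ever looks at characters -- but the counting idea is the same:

    def char_comparisons(a, b):
        # Count the character comparisons made by a left-to-right
        # equality test that stops at the first mismatch.
        # Assumes len(a) == len(b).
        count = 0
        for x, y in zip(a, b):
            count += 1
            if x != y:
                break
        return count

So char_comparisons('fruit', 'apple') is 1 (the first characters
already differ), while char_comparisons('apple', 'apple') is 5.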
Let me spell it out for you: given a list of five letter words:
['fruit', 'apple', 'claim', 'pears', ... , 'toast']
sorting the list may compare:
'fruit' with 'apple'
'apple' with 'claim'
but almost certainly will not compare:
'apple' with 'toast'
and it definitely won't compare:
'zzzax' with 'zzzaq'
because strings like those never get into the list in the first place.
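
You can check which pairs a sort actually looks at by tracing the
comparison function. A sketch, using the example words above:

    from functools import cmp_to_key

    compared = []

    def tracing_cmp(a, b):
        # Record every pair that the sort actually compares.
        compared.append((a, b))
        return (a > b) - (a < b)

    words = ['fruit', 'apple', 'claim', 'pears', 'toast']
    sorted(words, key=cmp_to_key(tracing_cmp))
    print(compared)  # a handful of pairs, nowhere near all 25 possibilities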
For the sake of simple calculations, let's pretend that there are only
1000 five-letter strings possible. We want to know how many character
comparisons are done on average when testing two random five-letter
strings for equality. To calculate this average, you must compare every
string to every other string, *including* itself.
This gives 1000*1000 = one million equality tests to be performed. For
each equality test, we count the number of character comparisons
performed, sum all those counts, and divide by 1e6. That is the average
number of char comparisons for random strings of five letters.
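
A sketch of that calculation, with 1000 random five-letter strings
standing in for the pretend universe (enumerating all 26**5 real
strings would make the cross-product impractically large):

    import random
    import string

    def char_comparisons(a, b):
        count = 0
        for x, y in zip(a, b):
            count += 1
            if x != y:
                break
        return count

    pool = [''.join(random.choices(string.ascii_lowercase, k=5))
            for _ in range(1000)]

    # Every string against every string, *including* itself.
    total = sum(char_comparisons(a, b) for a in pool for b in pool)
    print(total / len(pool) ** 2)
    # Expect roughly 1 + 1/26 + 1/26**2 + ..., i.e. about 1.04.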
But that's not what you do. First you eliminate 900 out of the 1000
possible strings by only sampling English words -- strings like "xxoij"
cannot possibly be selected. That leaves 100 words, and hence a highly
non-random subset of 100*100 = 10000 equality tests.
But you haven't finished. Instead of checking all 10000 equality tests,
you then decide to check only the 2000 tests that happen to occur when
using the timsort algorithm to sort a list.
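
Put the two averages side by side and the bias is easy to demonstrate.
A sketch, reusing the hypothetical char_comparisons helper from above
on the small example list:

    from functools import cmp_to_key

    def char_comparisons(a, b):
        count = 0
        for x, y in zip(a, b):
            count += 1
            if x != y:
                break
        return count

    words = ['fruit', 'apple', 'claim', 'pears', 'toast']

    # Average over the full population: every word against every word.
    full = [char_comparisons(a, b) for a in words for b in words]

    # Average over only the tests the sort happens to perform.
    sampled = []
    def tracing_cmp(a, b):
        sampled.append(char_comparisons(a, b))
        return (a > b) - (a < b)
    sorted(words, key=cmp_to_key(tracing_cmp))

    print(sum(full) / len(full))        # 1.8 for these five words
    print(sum(sampled) / len(sampled))  # 1.0 -- the sort never tests
                                        # a word against itself

Two different averages from the same five words: the sample the sort
draws is not the population.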
So from the population of one million possible equality tests, you throw
away the vast majority, then average the *strongly non-random* few that
are remaining. Naturally the result you get is strongly biased, and it
has nothing to do with the average Big Oh behaviour of equality on random
strings.