Would you say, then, that dict insertion is O(N)?
Pedantically, yes.
But since we're allowed to state (or even imply *wink*) whatever
assumptions we like, we're allowed to assume "in the absence of
significant numbers of hash collisions" and come up with amortized O(1)
for dict insertions and lookups.
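For what it's worth, here's a rough sketch of that amortized behaviour (the batch size is arbitrary and a single run is not a rigorous benchmark):

import time

d = {}
batch = 100_000
for i in range(5):
    start = time.perf_counter()
    # Insert a batch of fresh integer keys; assuming no pathological hash
    # collisions, the per-insert cost stays roughly flat even as the dict
    # grows and occasionally resizes -- the amortized O(1) behaviour.
    for k in range(i * batch, (i + 1) * batch):
        d[k] = None
    elapsed = time.perf_counter() - start
    print(f"inserts {i * batch}..{(i + 1) * batch}: "
          f"{elapsed / batch * 1e9:.0f} ns per insert")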
(Provided, of course, that your computer has an infinite amount of
unfragmented memory and the OS never starts paging your dict to disk.
Another unstated assumption that gets glossed over when we talk about
complexity analysis -- on real-world computers, for big enough N,
*everything* is O(2**N) or worse.)
Big Oh analysis, despite the formal mathematics used, is not an exact
science. Essentially, it is a way of bringing some vague order to hand-
wavy estimates of complexity, and the apparent mathematical rigour is
built on some awfully shaky foundations. But despite that, it actually is
useful.
Coming back to strings... given that in any real-world application, you
are likely to have some string comparisons on equal strings and some on
unequal strings, and more importantly you don't know which are which
ahead of time, which attitude is less likely to give you a nasty surprise
when you run your code?
"I have many millions of 100K strings to compare against other 100K
strings, and string comparisons are O(1) so that will be fast."
"I have many millions of 100K strings to compare against other 100K
strings, and string comparisons are O(N) so that will be slow, better
find another algorithm."
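If it helps make that concrete, here's a quick sketch (figures will vary by machine, and a couple of timeit calls are not a proper benchmark):

from timeit import timeit

a = "x" * 100_000
b = "x" * 100_000        # equal to a, but a distinct object
c = "y" + "x" * 99_999   # same length, differs at the first character

# Comparing the equal pair has to walk all 100K characters; the unequal
# pair bails out at the first mismatch.
print("equal:  ", timeit("a == b", globals=globals(), number=10_000))
print("unequal:", timeit("a == c", globals=globals(), number=10_000))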
Remember too that "for small enough N, everything is O(1)". Getting hung
up on Big Oh is just as much a mistake as ignoring it completely.
I find this idea of separating the comparison of equal strings from the
comparison of unequal strings rather odd. If the strings you compare come
from a distribution where they are guaranteed to be equal (or unequal),
then you can just use the O(0) comparison method.
If you know that they're (un)equal, you don't need to call == at all.
If you know that "most" strings will be unequal, then you might be
justified in treating comparisons as O(1) "most of the time" and not
stress about the occasional slow call. But of course most of the time you
don't know this, which is why it is appropriate to treat string
comparisons as O(N) rather than O(1), since that's the upper bound.
Since string comparison is only useful if the strings can be either equal
or unequal, the average case depends on how often they are equal or
unequal, as well as on the average cost of each case. For random strings,
the probability of equality falls off very rapidly as N increases, so the
expected cost of comparing random strings is O(1).
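A sketch of that argument, counting how many characters a naive equality
test would actually look at (chars_examined is just a made-up counting
helper, not how == is implemented):

import random
import string

def chars_examined(a, b):
    # Number of character comparisons a left-to-right equality test makes.
    for i, (x, y) in enumerate(zip(a, b), start=1):
        if x != y:
            return i
    return min(len(a), len(b))

random.seed(0)
N, trials, total = 10_000, 1_000, 0
for _ in range(trials):
    a = "".join(random.choices(string.ascii_lowercase, k=N))
    b = "".join(random.choices(string.ascii_lowercase, k=N))
    total += chars_examined(a, b)

# For a k-symbol alphabet the expectation is about k/(k-1), plus a
# vanishing N/k**N term from the all-equal case -- roughly 1.04 here,
# independent of N.
print("average characters examined:", total / trials)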
But that is not an upper bound, and Big Oh analysis is strictly defined
in terms of upper bounds.
I'm not sure where the extra N comes from, but otherwise good.
The extra N comes from the case where you compare string S with itself,
which takes exactly N comparisons.
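A sketch of that worst case, if you want to see it (the copy is
deliberate, since an implementation may short-circuit when both operands
are the very same object):

from timeit import timeit

for n in (10_000, 100_000, 1_000_000):
    s = "x" * n
    t = "x" * n   # equal value, distinct object
    # Comparing equal-but-distinct strings must examine every character,
    # so the time grows roughly linearly with n.
    print(n, timeit("s == t", globals=globals(), number=10_000))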