Well, here's the way it works in my mind:
I can store a set of a zillion strings (or a dict with a zillion
string keys), but every time I test "if new_string in seen_strings",
the computer hashes the new_string using some kind of "short hash",
checks the set for matching buckets (I'm assuming this is how python
tests set membership --- is that right?),
So far so good. That applies to all objects, not just strings.
then checks any
hash-matches for string equality. Testing string equality is slower
than integer equality, and strings (unless they are really short)
take up a lot more memory than long integers.
But presumably you have to keep the string around anyway. It's going to
be somewhere; you can't just throw the string away and garbage collect
it. The dict doesn't store a copy of the string, it stores a reference to
it, and extra references don't cost much.
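To see that machinery in action, here's a rough sketch (NoisyStr is just an
illustrative toy, not how str is really implemented) showing that a membership
test computes the hash first, only falls back to an equality test when the
matching bucket has something in it, and that the container holds a reference
to the element rather than a copy:

class NoisyStr(str):
    # A str subclass that reports when its hash and equality methods run.
    def __hash__(self):
        print("__hash__ called on %r" % str(self))
        return super(NoisyStr, self).__hash__()
    def __eq__(self, other):
        print("__eq__ called on %r" % str(self))
        return super(NoisyStr, self).__eq__(other)

seen = set([NoisyStr("spam"), NoisyStr("eggs")])   # hashes each element once
print(NoisyStr("spam") in seen)  # one __hash__ call, then __eq__ on the match
print(NoisyStr("ham") in seen)   # one __hash__ call, usually no __eq__ at all

s = "some enormous string " * 10000
d = {s}                          # the set stores a reference, not a copy
print(next(iter(d)) is s)        # True: same object, no second copy of the text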
As for string equality, in pseudo-code it looks something like this:
def __eq__(self, other):
    # Consider only the case where other is a string as well.
    if self is other:            # object identity
        return True
    if len(self) != len(other):  # checking the length is fast
        return False
    # iterate over both strings in lock-step
    for a, b in zip(self, other):
        if a != b: return False
    return True
only written in fast C code. So the equality test bails out as quickly as
possible, and the only time you have to look at every character is if the
two strings are equal but not the same object.
MD5 hashing, on the other hand, *always* has to look at every character,
and perform quite a complicated set of computations. It's expensive.
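Here's a rough way to see both effects at once (assuming Python 2.7, like the
session further down; the exact figures will vary from machine to machine):

from timeit import Timer

setup = """
import md5
a = 'x' * 100000
b = 'y' + a[1:]   # same length as a, but differs at the very first character
c = 'x' + a[1:]   # equal to a, but a distinct object
"""
# == bails out at the first differing character...
print(min(Timer("a == b", setup).repeat(3, 1000)))
# ...and only scans the whole string when the two really are equal...
print(min(Timer("a == c", setup).repeat(3, 1000)))
# ...while md5 has to digest every character every single time.
print(min(Timer("md5.md5(a).hexdigest()", setup).repeat(3, 1000)))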
So for that kind of thing, I tend to store MD5 hashes or something
similar. Is that wrong?
First rule of optimization: Don't do it.
Second rule of optimization (for experts only): Don't do it yet.
You're making the cardinal mistake of optimization: optimizing not what *is*
slow, but what you *assume* will be slow. (Or substitute memory consumption
for speed, if you prefer.) Python is not C, and your intuitions for what's fast and
slow (low and high memory consumption) are likely to be way off. Consider
trying to store an MD5 hash instead of the string "Hello World!". If I
try this in Python 2.7, I get this:
py> import md5, sys
py> s = "Hello World!"
py> n = int(md5.md5(s).hexdigest(), 16)
py> sys.getsizeof(s)
33
py> sys.getsizeof(n)
30
I save a lousy 3 bytes. But in order to save those 3 bytes, I have to
generate an md5 object (32 bytes) and a hex string (53 bytes), both of
which take time to create and then time to garbage collect.
Actually, I don't even save those 3 bytes, since for nearly all
reasonable applications, I'll need to keep the string around somewhere.
So instead of saving three bytes, it's costing me thirty bytes for the
hash, plus who-knows-what needed to manage the bookkeeping to link that
hash to the actual string.
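If you want to repeat that measurement on Python 3, where the md5 module no
longer exists, a rough equivalent using hashlib looks like this (the exact
byte counts you get will differ from the 2.7 numbers above, so check them
yourself rather than taking my word for it):

import hashlib, sys

s = "Hello World!"
n = int(hashlib.md5(s.encode("utf-8")).hexdigest(), 16)

print(sys.getsizeof(s))   # size of the string object
print(sys.getsizeof(n))   # size of the 128-bit int object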
Ah, but maybe we can save time hashing them?
py> from timeit import Timer
py> setup = "from __main__ import s, n"
py> t1 = Timer("hash(s)", setup)
py> t2 = Timer("hash(n)", setup)
py> min(t1.repeat(5))
0.14668607711791992
py> min(t2.repeat(5))
0.15973114967346191
Times are in seconds for one million iterations of the code being timed.
Or alternatively, it costs about 0.14 microseconds to hash the string,
and about 0.15 microseconds to hash the MD5 hash. **Not** to calculate
the MD5 hash in the first place (that takes *much* longer):
py> t3 = Timer("md5.md5(s).hexdigest()", "import md5; s = 'Hello World!'")
py> min(t3.repeat(5))
2.3326950073242188
but merely to perform a lightweight hash on the long int so as to find
which bucket of the dict it belongs to.
So, yes, chances are **very** good that your supposed optimizations in
this area are actually pessimizations. If you have not measured where the
bottlenecks in your code are, you have no idea which parts of the code
need to be optimized.
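If you do want to measure it, the benchmark for this particular idea is short
to write. Something along these lines (a sketch only, again using Python 2.7's
md5 module; the made-up keys below stand in for whatever your real data looks
like):

from timeit import Timer

setup = """
import md5
keys = ['key number %d' % i for i in range(100000)]
probe = 'key number %d' % 50000   # equal to one of the keys, but a distinct object
string_set = set(keys)
hash_set = set(int(md5.md5(k).hexdigest(), 16) for k in keys)
"""
# The straightforward way: test the string itself.
t_plain = Timer("probe in string_set", setup)
# The "optimized" way: MD5 the string, then test the resulting integer.
t_md5 = Timer("int(md5.md5(probe).hexdigest(), 16) in hash_set", setup)

print(min(t_plain.repeat(3, 100000)))
print(min(t_md5.repeat(3, 100000)))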
Now, it is conceivable that there may be some circumstances where your
strategy makes sense. I'm guessing you would need something like this
before you even come close to saving memory and/or time (there is a rough
benchmark sketch after the list):
- rather than short keys like "Hello World!", you are dealing with
keys that are typically many tens of thousands of characters long;
- most of the strings are the same length and have very similar
prefixes (e.g. the first 3000 chars are the same);
- rather than "zillions" of them, there are few enough of them that
the chance of an MD5 collision is insignificant;
(Any MD5 collision is going to play havoc with your strategy of
using hashes as a proxy for the real string.)
- and you can arrange matters so that you never need to MD5 hash a
string twice.
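If your data really did look like that, the measurement is still easy to set
up. A sketch of what I'd run (the 3,000-character prefix, the roughly
33,000-character keys and the pre-computed digest are all assumptions standing
in for your real data; in particular the digest is assumed to arrive for free,
per the last point above):

from timeit import Timer

setup = """
import md5
prefix = 'x' * 3000                                  # long shared prefix
keys = [prefix + ('%06d' % i) * 5000 for i in range(1000)]  # ~33,000-char keys
tail = ('%06d' % 500) * 5000
string_set = set(keys)
digest_set = set(md5.md5(k).digest() for k in keys)
probe_digest = md5.md5(prefix + tail).digest()       # assume this comes for free
"""
# A fresh probe string is built each time, so its built-in hash is not cached.
t_strings = Timer("(prefix + tail) in string_set", setup)
# The digest is assumed to be pre-computed (the last condition above).
t_digests = Timer("probe_digest in digest_set", setup)

print(min(t_strings.repeat(3, 1000)))
print(min(t_digests.repeat(3, 1000)))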
Is my guess closer to the actuality than yours? I don't know. I haven't
measured it either. But I know that Python is a high-level language with
lots of high-level data structures like dicts which trade off time and
memory for programmer convenience, and that I'd want to see some real
benchmarks proving that my application was too slow before giving up that
convenience with a complicated strategy like this.