hashing strings to integers for sqlite3 keys

Adam Funk

I'm using Python 3.3 and the sqlite3 module in the standard library.
I'm processing a lot of strings from input files (among other things,
values of headers in e-mail & news messages) and suppressing
duplicates using a table of seen strings in the database.

It seems to me --- from past experience with other things, where
testing integers for equality is faster than testing strings, as well
as from reading the SQLite3 documentation about INTEGER PRIMARY KEY
--- that the SELECT tests should be faster if I am looking up an
INTEGER PRIMARY KEY value rather than TEXT PRIMARY KEY. Is that
right?

If so, what sort of hashing function should I use? The "maxint" for
SQLite3 is a lot smaller than the size of even MD5 hashes. The only
thing I've thought of so far is to use MD5 or SHA-something modulo the
maxint value. (Security isn't an issue --- i.e., I'm not worried
about someone trying to create a hash collision.)

Thanks,
Adam
 
Peter Otten

Adam said:
I'm using Python 3.3 and the sqlite3 module in the standard library.
I'm processing a lot of strings from input files (among other things,
values of headers in e-mail & news messages) and suppressing
duplicates using a table of seen strings in the database.

It seems to me --- from past experience with other things, where
testing integers for equality is faster than testing strings, as well
as from reading the SQLite3 documentation about INTEGER PRIMARY KEY
--- that the SELECT tests should be faster if I am looking up an
INTEGER PRIMARY KEY value rather than TEXT PRIMARY KEY. Is that
right?

My gut feeling tells me that this would matter more for join operations than
lookup of a value. If you plan to do joins you could use an autoinc integer
as the primary key and an additional string key for lookup.

If so, what sort of hashing function should I use? The "maxint" for
SQLite3 is a lot smaller than the size of even MD5 hashes. The only
thing I've thought of so far is to use MD5 or SHA-something modulo the
maxint value. (Security isn't an issue --- i.e., I'm not worried
about someone trying to create a hash collision.)

Start with the cheapest operation you can think of,

md5(s) % MAXINT

or even

hash(s) % MAXINT # don't forget to set PYTHONHASHSEED

then compare performance with just

s

and only if you can demonstrate a significant speedup keep the complication
in your code.
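
Spelled out runnably, those two candidates might look something like
this (a sketch --- MAXINT here is taken to be SQLite's largest signed
64-bit value, and the UTF-8 encoding is an arbitrary choice):

import hashlib

MAXINT = 2**63 - 1  # SQLite INTEGERs are signed 64-bit

def md5_key(s):
    # Full 128-bit MD5, folded into SQLite's integer range.
    digest = hashlib.md5(s.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % MAXINT

def hash_key(s):
    # Built-in hash; set PYTHONHASHSEED if the values must be
    # stable across interpreter runs.
    return hash(s) % MAXINT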

If you find such a speedup I'd like to see the numbers because this cries
PREMATURE OPTIMIZATION...
 
Chris Angelico

I'm using Python 3.3 and the sqlite3 module in the standard library.
I'm processing a lot of strings from input files (among other things,
values of headers in e-mail & news messages) and suppressing
duplicates using a table of seen strings in the database.

It seems to me --- from past experience with other things, where
testing integers for equality is faster than testing strings, as well
as from reading the SQLite3 documentation about INTEGER PRIMARY KEY
--- that the SELECT tests should be faster if I am looking up an
INTEGER PRIMARY KEY value rather than TEXT PRIMARY KEY. Is that
right?

It might be faster to use an integer primary key, but the possibility
of even a single collision means you can't guarantee uniqueness
without a separate check. I don't know sqlite3 well enough to say, but
based on what I know of PostgreSQL, it's usually best to make your
schema mimic your logical structure, rather than warping it for the
sake of performance. With a good indexing function, the performance of
a textual PK won't be all that much worse than an integral one, and
everything you do will read correctly in the code - no fiddling around
with hashes and collision checks.

Stick with the TEXT PRIMARY KEY and let the database do the database's
job. If you're processing a really large number of strings, you might
want to consider moving from sqlite3 to PostgreSQL anyway (I've used
psycopg2 quite happily), as you'll get better concurrency; and that
might solve your performance problem as well, as Pg plays very nicely
with caches.

ChrisA
 
Tim Chase

I'm using Python 3.3 and the sqlite3 module in the standard library.
I'm processing a lot of strings from input files (among other
things, values of headers in e-mail & news messages) and suppressing
duplicates using a table of seen strings in the database.

It seems to me --- from past experience with other things, where
testing integers for equality is faster than testing strings, as
well as from reading the SQLite3 documentation about INTEGER
PRIMARY KEY --- that the SELECT tests should be faster if I am
looking up an INTEGER PRIMARY KEY value rather than TEXT PRIMARY
KEY. Is that right?

If sqlite can handle the absurd length of a Python long, you *can* do
it as ints:

703993777145756967576188115661016000849227759454L

That's a pretty honkin' huge int for a DB key, but you can use it.
And its length is capped regardless of the length of the underlying
string.

If so, what sort of hashing function should I use? The "maxint" for
SQLite3 is a lot smaller than the size of even MD5 hashes. The only
thing I've thought of so far is to use MD5 or SHA-something modulo
the maxint value. (Security isn't an issue --- i.e., I'm not
worried about someone trying to create a hash collision.)

You could truncate that to something like the low 32 bits (say, the
big int modulo 2**32), which would give you a number that fits
comfortably in sqlite's int.
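
For instance (a sketch, using hashlib rather than the old md5 module):

import hashlib

def md5_as_int(s):
    # The full 128-bit MD5 as one big integer.
    return int(hashlib.md5(s.encode("utf-8")).hexdigest(), 16)

def md5_as_int32(s):
    # Keep only the low 32 bits --- small enough for any
    # SQLite INTEGER column.
    return md5_as_int(s) % 2**32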

-tkc
 
Adam Funk

My gut feeling tells me that this would matter more for join operations than
lookup of a value. If you plan to do joins you could use an autoinc integer
as the primary key and an additional string key for lookup.

I'm not doing any join operations. I'm using sqlite3 for storing big
piles of data & persistence between runs --- not really "proper
relational database use". In this particular case, I'm getting header
values out of messages & doing this:

for this_string in these_strings:
    if not already_seen(this_string):
        process(this_string)
    # ignore if already seen
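
Here already_seen boils down to a single INSERT against the table of
seen values (a sketch --- the table and column names are made up):

import sqlite3

conn = sqlite3.connect("seen.db")
conn.execute("CREATE TABLE IF NOT EXISTS seen (value TEXT PRIMARY KEY)")

def already_seen(s):
    # INSERT OR IGNORE changes no rows when the value is already there.
    cur = conn.execute("INSERT OR IGNORE INTO seen (value) VALUES (?)", (s,))
    return cur.rowcount == 0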

....
and only if you can demonstrate a significant speedup keep the complication
in your code.

If you find such a speedup I'd like to see the numbers because this cries
PREMATURE OPTIMIZATION...

On further reflection, I think I asked for that. In fact, the table
I'm using only has one column for the hashes --- I wasn't going to
store the strings at all in order to save disk space (maybe my mind is
stuck in the 1980s).
 
Adam Funk

It might be faster to use an integer primary key, but the possibility
of even a single collision means you can't guarantee uniqueness
without a separate check. I don't know sqlite3 well enough to say, but
based on what I know of PostgreSQL, it's usually best to make your
schema mimic your logical structure, rather than warping it for the
sake of performance. With a good indexing function, the performance of
a textual PK won't be all that much worse than an integral one, and
everything you do will read correctly in the code - no fiddling around
with hashes and collision checks.

Stick with the TEXT PRIMARY KEY and let the database do the database's
job. If you're processing a really large number of strings, you might
want to consider moving from sqlite3 to PostgreSQL anyway (I've used
psycopg2 quite happily), as you'll get better concurrency; and that
might solve your performance problem as well, as Pg plays very nicely
with caches.

Well, actually I'm thinking about doing away with checking for
duplicates at this stage, since the substrings that I pick out of the
deduplicated header values go into another table as the TEXT PRIMARY
KEY anyway, with deduplication there. So I think this stage reeks of
premature optimization.
 
Adam Funk

If sqlite can handle the absurd length of a Python long, you *can* do
it as ints:

It can't. SQLite3 INTEGER is an 8-byte signed one.

https://www.sqlite.org/datatype3.html

But after reading the other replies to my question, I've concluded
that what I was trying to do is pointless.

703993777145756967576188115661016000849227759454L

That ties in with a related question I've been wondering about lately
(using MD5s & SHAs for other things) --- getting a hash value (which
is internally numeric, rather than string, right?) out as a hex string
& then converting that to an int looks inefficient to me --- is there
any better way to get an int? (I haven't seen any other way in the
API.)
 
Chris Angelico

On further reflection, I think I asked for that. In fact, the table
I'm using only has one column for the hashes --- I wasn't going to
store the strings at all in order to save disk space (maybe my mind is
stuck in the 1980s).

That's a problem, then, because you will see hash collisions. Maybe
not often, but they definitely will occur if you have enough strings
(look up the birthday paradox - with a 32-bit arbitrarily selected
integer (such as a good crypto hash that you then truncate to 32
bits), you have a 50% chance of a collision at just 77,000 strings).

Do you have enough RAM to hold all the strings directly? Just load 'em
all up into a Python set. Set operations are fast, clean, and easy.
Your already_seen function becomes a simple 'in' check. These days you
can get 16GB or 32GB of RAM in a PC inexpensively enough; with an
average string size of 80 characters, and assuming Python 3.3+, that's
about 128 bytes each - close enough, and a nice figure. 16GB divided
by 128 gives 128M strings - obviously you won't get all of that, but
that's your ball-park. Anything less than, say, a hundred million
strings, and you can dump the lot into memory. Easy!
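
The whole check then collapses to something like:

seen_strings = set()

def already_seen(s):
    # One hash lookup; add as a side effect so each string is
    # processed at most once.
    if s in seen_strings:
        return True
    seen_strings.add(s)
    return False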

ChrisA
 
Chris Angelico

That ties in with a related question I've been wondering about lately
(using MD5s & SHAs for other things) --- getting a hash value (which
is internally numeric, rather than string, right?) out as a hex string
& then converting that to an int looks inefficient to me --- is there
any better way to get an int? (I haven't seen any other way in the
API.)

I don't know that there is, at least not with hashlib. You might be
able to use digest() followed by the struct module, but it's no less
convoluted. It's the same in several other languages' hashing
functions; the result is a string, not an integer.
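
For what it's worth, the digest()-plus-struct route might look like
this (splitting the 16-byte MD5 digest into two 64-bit halves):

import hashlib
import struct

digest = hashlib.md5(b"Hello World!").digest()  # 16 raw bytes
high, low = struct.unpack(">QQ", digest)        # two big-endian 64-bit ints
value = (high << 64) | low                      # the full 128-bit integer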

ChrisA
 
Adam Funk

That's a problem, then, because you will see hash collisions. Maybe
not often, but they definitely will occur if you have enough strings
(look up the birthday paradox - with a 32-bit arbitrarily selected
integer (such as a good crypto hash that you then truncate to 32
bits), you have a 50% chance of a collision at just 77,000 strings).

Ah yes, there's a handy table for that:

https://en.wikipedia.org/wiki/Birthday_attack#Mathematics

Do you have enough RAM to hold all the strings directly? Just load 'em
all up into a Python set. Set operations are fast, clean, and easy.
Your already_seen function becomes a simple 'in' check. These days you
can get 16GB or 32GB of RAM in a PC inexpensively enough; with an
average string size of 80 characters, and assuming Python 3.3+, that's
about 128 bytes each - close enough, and a nice figure. 16GB divided
by 128 gives 128M strings - obviously you won't get all of that, but
that's your ball-park. Anything less than, say, a hundred million
strings, and you can dump the lot into memory. Easy!

Good point, & since (as I explained in my other post) the substrings
are being deduplicated in their own table anyway it's probably not
worth bothering with persistence between runs for this bit.
 
Adam Funk

I don't know that there is, at least not with hashlib. You might be
able to use digest() followed by the struct module, but it's no less
convoluted. It's the same in several other languages' hashing
functions; the result is a string, not an integer.

Well, J*v* returns a byte array, so I used to do this:

digester = MessageDigest.getInstance("MD5");
...
digester.reset();
byte[] digest = digester.digest(bytes);
return new BigInteger(+1, digest);

I dunno why language designers don't make it easy to get a single big
number directly out of these things.


I just had a look at the struct module's fearsome documentation &
think it would present a good shoot(self, foot) opportunity.
 
Chris Angelico

Well, J*v* returns a byte array...

I counted byte arrays along with strings. Whether it's notionally a
string of bytes or characters makes no difference - it's not an
integer.
I dunno why language designers don't make it easy to get a single big
number directly out of these things.

It's probably because these sorts of hashes are usually done on large
puddles of memory, to create a smaller puddle of memory. How you
interpret the resulting puddle is up to you; maybe you want to think
of it as a number, maybe as a string, but really it's just a sequence
of bytes.

ChrisA
 
Peter Otten

Adam said:
I don't know that there is, at least not with hashlib. You might be
able to use digest() followed by the struct module, but it's no less
convoluted. It's the same in several other languages' hashing
functions; the result is a string, not an integer.

Well, J*v* returns a byte array, so I used to do this:

digester = MessageDigest.getInstance("MD5");
...
digester.reset();
byte[] digest = digester.digest(bytes);
return new BigInteger(+1, digest);

In Python 3 there's int.from_bytes():

int.from_bytes(hashlib.md5(b"...").digest(), "little")
538059071683667711846616050503420899184350089339

I dunno why language designers don't make it easy to get a single big
number directly out of these things.

You hardly ever need to manipulate the numerical value of the digest. And on
its way into the database it will be re-serialized anyway.
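
(sqlite3 will take the raw digest bytes directly as a BLOB, so there is
no need for an integer detour at all --- a sketch:)

import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seen (digest BLOB PRIMARY KEY)")
digest = hashlib.md5(b"some header value").digest()
conn.execute("INSERT OR IGNORE INTO seen (digest) VALUES (?)", (digest,))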
 
Adam Funk

Adam Funk wrote:
Well, J*v* returns a byte array, so I used to do this:

digester = MessageDigest.getInstance("MD5");
...
digester.reset();
byte[] digest = digester.digest(bytes);
return new BigInteger(+1, digest);

In Python 3 there's int.from_bytes():

int.from_bytes(hashlib.md5(b"...").digest(), "little")
538059071683667711846616050503420899184350089339

Excellent, thanks for pointing that out. I've just recently started
using Python 3 instead of 2, & appreciate pointers to new things like
that. The only thing that really bugs me in Python 3 is that execfile
has been removed (I find it useful for testing things interactively).

You hardly ever need to manipulate the numerical value of the digest. And on
its way into the database it will be re-serialized anyway.

I now agree that my original plan to hash strings for the SQLite3
table was pointless, so I've changed the subject header. :)

I have had good reason to use int hashes in the past, however. I was
doing some experiments with Andrei Broder's "sketches of shingles"
technique for finding partial duplication between documents, & you
need integers for that so you can do modulo arithmetic.
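
Roughly like this (a sketch --- the helper name and parameters are made
up, just to show where the modulo arithmetic comes in):

def shingle_sketch(tokens, n=4, modulus=2**31 - 1):
    # Hash each n-token shingle to an int; Broder-style sketches
    # compare documents by their smallest residues.
    hashes = (hash(tuple(tokens[i:i + n])) % modulus
              for i in range(len(tokens) - n + 1))
    return sorted(hashes)[:20]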

I've also used hashes of strings for other things involving
deduplication or fast lookups (because integer equality is faster than
string equality). I guess if it's just for deduplication, though, a
set of byte arrays is as good as a set of ints?
 
Adam Funk

On 2014-05-22, Peter Otten wrote:

Excellent, thanks for pointing that out. I've just recently started
using Python 3 instead of 2, & appreciate pointers to new things like
that.

BTW, I just tested that & it should be "big" for consistency with the
hexdigest:

Python 3.3.2+ (default, Feb 28 2014, 00:52:16)
[GCC 4.8.1] on linux
Type "help", "copyright", "credits" or "license" for more information.

int.from_bytes(hashlib.md5(b"...").digest(), "big")
1315090007003469710610607131160586168131917298749

Thanks.
 
Chris Angelico

I've also used hashes of strings for other things involving
deduplication or fast lookups (because integer equality is faster than
string equality). I guess if it's just for deduplication, though, a
set of byte arrays is as good as a set of ints?

Want a better way to do that? Use a set or dict and let Python do the
hashing. It'll be every bit as fast as explicit hashing, plus you get
the bonus of simplicity.

ChrisA
 
Terry Reedy

that. The only thing that really bugs me in Python 3 is that execfile
has been removed (I find it useful for testing things interactively).

The spelling has been changed to exec(open(...).read(), ... . If you use
it a lot, add a customized def execfile(filename, ... to your site
module or local utils module.

Execfile was a separate statement *only* because exec was a statement.
Once exec was changed to a function taking arguments, that
justification disappeared.
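
A minimal replacement might be:

def execfile(filename, globs=None, locs=None):
    # Compiling first keeps the real filename in tracebacks.
    with open(filename) as f:
        code = compile(f.read(), filename, "exec")
    exec(code, globs, locs)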

Terry Jan Reedy
 
Adam Funk

Want a better way to do that? Use a set or dict and let Python do the
hashing. It'll be every bit as fast as explicit hashing, plus you get
the bonus of simplicity.

Well, here's the way it works in my mind:

I can store a set of a zillion strings (or a dict with a zillion
string keys), but every time I test "if new_string in
seen_strings", the computer hashes the new_string using some kind
of "short hash", checks the set for matching buckets (I'm assuming
this is how python tests set membership --- is that right?), then
checks any hash-matches for string equality. Testing string
equality is slower than integer equality, and strings (unless they
are really short) take up a lot more memory than long integers.

So for that kind of thing, I tend to store MD5 hashes or something
similar. Is that wrong?
 
Steven D'Aprano

Well, here's the way it works in my mind:

I can store a set of a zillion strings (or a dict with a zillion
string keys), but every time I test "if new_string in seen_strings",
the computer hashes the new_string using some kind of "short hash",
checks the set for matching buckets (I'm assuming this is how python
tests set membership --- is that right?),

So far so good. That applies to all objects, not just strings.

then checks any
hash-matches for string equality. Testing string equality is slower
than integer equality, and strings (unless they are really short)
take up a lot more memory than long integers.

But presumably you have to keep the string around anyway. It's going to
be somewhere, you can't just throw the string away and garbage collect
it. The dict doesn't store a copy of the string, it stores a reference to
it, and extra references don't cost much.

As for string equality, in pseudo-code it looks something like this:

def __eq__(self, other):
    # Consider only the case where other is a string as well.
    if self is other:  # object identity
        return True
    if len(self) != len(other):  # checking the length is fast
        return False
    # iterate over both strings in lock-step
    for a, b in zip(self, other):
        if a != b:
            return False
    return True

only written in fast C code. So the equality test bails out as quickly as
possible, and the only time you have to look at every character is if the
two strings are equal but not the same object.

MD5 hashing, on the other hand, *always* has to look at every character,
and perform quite a complicated set of computations. It's expensive.

So for that kind of thing, I tend to store MD5 hashes or something
similar. Is that wrong?

First rule of optimization: Don't do it.
Second rule of optimization (for experts only): Don't do it yet.

You're making the cardinal mistake of optimization: optimizing not what
is slow, but what you *assume* will be slow. (Or substitute memory
consumption for speed, if you prefer.) Python is not C, and your
intuitions for what's fast and slow (or high and low memory consumption)
are likely to be way off. Consider trying to store an MD5 hash instead
of the string "Hello World!". If I try this in Python 2.7, I get this:


py> import md5, sys
py> s = "Hello World!"
py> n = int(md5.md5(s).hexdigest(), 16)
py> sys.getsizeof(s)
33
py> sys.getsizeof(n)
30


I save a lousy 3 bytes. But in order to save that 3 bytes, I have to
generate an md5 object (32 bytes) and a hex string (53 bytes), both of
which take time to create and then time to garbage collect.

Actually, I don't even save those 3 bytes, since for nearly all
reasonable applications, I'll need to keep the string around somewhere.
So instead of saving three bytes, it's costing me thirty bytes for the
hash, plus who-knows-what needed to manage the bookkeeping linking that
hash to the actual string.

Ah, but maybe we can save time hashing them?

py> from timeit import Timer
py> setup = "from __main__ import s, n"
py> t1 = Timer("hash(s)", setup)
py> t2 = Timer("hash(n)", setup)
py> min(t1.repeat(5))
0.14668607711791992
py> min(t2.repeat(5))
0.15973114967346191

Times are in seconds for one million iterations of the code being timed.
Or alternatively, it costs about 0.14 microseconds to hash the string,
and about 0.15 microseconds to hash the MD5 hash. **Not** to calculate
the MD5 hash in the first place, that takes *much* longer:

py> t3 = Timer("md5.md5(s).hexdigest()", "import md5; s = 'Hello World!'")
py> min(t3.repeat(5))
2.3326950073242188

but merely to perform a lightweight hash on the long int so as to find
which bucket of the dict it belongs to.

So, yes, chances are **very** good that your supposed optimizations in
this area are actually pessimizations. If you have not measured where the
bottlenecks in your code are, you have no idea which parts of the code
need to be optimized.

Now, it is conceivable that there may be some circumstances where your
strategy makes sense. I'm guessing you would need something like this
before you even come close to saving memory and/or time:

- rather than short keys like "Hello World!", you are dealing with
keys that are typically many tens of thousands of characters long;

- most of the strings are the same length and have very similar
prefixes (e.g. the first 3000 chars are the same);

- rather than "zillions" of them, there are few enough of them that
the chance of an MD5 collision is insignificant;

(Any MD5 collision is going to play havoc with your strategy of
using hashes as a proxy for the real string.)

- and you can arrange matters so that you never need to MD5 hash a
string twice.

Is my guess closer to the actuality than yours? I don't know. I haven't
measured it either. But I know that Python is a high-level language with
lots of high-level data structures like dicts which trade off time and
memory for programmer convenience, and that I'd want to see some real
benchmarks proving that my application was too slow before giving up that
convenience with a complicated strategy like this.
 
