Storing large strings for future equality checks

A

Abu Yahya

A small application that I'm making requires me to store very long
strings (>1000 characters) in a database.

I will need to use these strings later to compare for equality to
incoming strings from another application. I will also want to add some
of the incoming strings to the storage, if they meet certain criteria.

For my application, I get a feeling that storing these strings in my
table will be a waste of space, and will impact performance due to
retrieval and storage times, as well as comparison times.

I considered using an SHA-512 hash of these strings and storing them in
the database. However, while these will save on storage space, it will
take time to do the hashing before comparing an incoming string. So I'm
still wasting time. (Collisions due to hashing will not be a problem,
since an occasional false positive will not be fatal for my application).

What would be the best approach?
 
M

markspace

I considered using an SHA-512 hash of these strings and storing them in
the database. However, while these will save on storage space, it will
take time to do the hashing before comparing an incoming string. So I'm
still wasting time. (Collisions due to hashing will not be a problem,
since an occasional false positive will not be fatal for my application).


You have to store the whole string. Even if the SHA-512 hash codes are
equal, it could be that the strings are different. You'll have to
eventually compare the raw string, even if the SHA is used as a
quick-out case.

No one can really tell what is "faster" or "wasting time" until you can
better characterize the usage patterns. How big can these strings get?
How often will you get an actual duplicate? What's the penalty when
you need to add a new string? You'll need to implement a few
algorithms, profile them and then make a decision based on actual data.

For Java, I'd store the strings in a WeakHashMap or similar to allow
them to be cached, but tossed away when more storage is needed. Also
you should look into getting some DB caching library, much easier than
implementing this yourself (sorry I can't personally recommend any).
 
D

David Kerber

[This followup was posted to comp.lang.java.databases and a copy was
sent to the cited author.]

A small application that I'm making requires me to store very long
strings (>1000 characters) in a database.

Unless you're storing millions of these strings or using Access, I'd say
just store and use the string. I think you'll find that the speed
penalty is nearly unnoticeable.

D
 
W

Willem

markspace wrote:
) On 6/8/2011 9:35 AM, Abu Yahya wrote:
)> I considered using an SHA-512 hash of these strings and storing them in
)> the database. However, while these will save on storage space, it will
)> take time to do the hashing before comparing an incoming string. So I'm
)> still wasting time. (Collisions due to hashing will not be a problem,
)> since an occasional false positive will not be fatal for my application).
)
) You have to store the whole string. Even if the SHA-512 hash codes are
) equal, it could be that the strings are different. You'll have to
) eventually compare the raw string, even if the SHA is used as a
) quick-out case.

No he doesn't. Read again. Especially the last bit between parentheses.


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 
G

Gene Wirchenko

A small application that I'm making requires me to store very long
strings (>1000 characters) in a database.

I will need to use these strings later to compare for equality to
incoming strings from another application. I will also want to add some
of the incoming strings to the storage, if they meet certain criteria.

For my application, I get a feeling that storing these strings in my
table will be a waste of space, and will impact performance due to
retrieval and storage times, as well as comparison times.

Your feeling is irrelevant. You should benchmark. If you do
not, you may end up jumping through hoops for something that is
unneeded (though you may never find out it that it is unneeded).
I considered using an SHA-512 hash of these strings and storing them in
the database. However, while these will save on storage space, it will
take time to do the hashing before comparing an incoming string. So I'm
still wasting time. (Collisions due to hashing will not be a problem,
since an occasional false positive will not be fatal for my application).

What does "occasional" mean?
What would be the best approach?

If it is a small app, why are you worrying about it so much?
Create a straightforward design, code it, and see if it is fast
enough. If it is, keep it. If it is not, then and only then, concern
yourself with how to speed it up.

Quit wasting time posting here, and benchmark something. <BEG>

Sincerely,

Gene Wirchenko
 
A

Abu Yahya

You have to store the whole string. Even if the SHA-512 hash codes are
equal, it could be that the strings are different. You'll have to
eventually compare the raw string, even if the SHA is used as a
quick-out case.

You're right about comparing the whole strings anyway, but, for this
application I'm creating, I wouldn't mind very very rare incorrect results.
No one can really tell what is "faster" or "wasting time" until you can
better characterize the usage patterns. How big can these strings get?
How often will you get an actual duplicate? What's the penalty when you
need to add a new string? You'll need to implement a few algorithms,
profile them and then make a decision based on actual data.

That makes sense. I'd need to analyze my input data and then run some
empirical tests.
For Java, I'd store the strings in a WeakHashMap or similar to allow
them to be cached, but tossed away when more storage is needed. Also you
should look into getting some DB caching library, much easier than
implementing this yourself (sorry I can't personally recommend any).

The WeakHashMap idea looks useful. As for a DB caching library, would
you recommend this as a replacement for the WeakHashMap, or as a complement?

Thanks for the useful pointers.
 
A

Abu Yahya

markspace wrote:
) On 6/8/2011 9:35 AM, Abu Yahya wrote:
)> I considered using an SHA-512 hash of these strings and storing them in
)> the database. However, while these will save on storage space, it will
)> take time to do the hashing before comparing an incoming string. So I'm
)> still wasting time. (Collisions due to hashing will not be a problem,
)> since an occasional false positive will not be fatal for my application).
)
) You have to store the whole string. Even if the SHA-512 hash codes are
) equal, it could be that the strings are different. You'll have to
) eventually compare the raw string, even if the SHA is used as a
) quick-out case.

No he doesn't. Read again. Especially the last bit between parentheses.

Yeah, it seems he missed that last part or read it incorrectly.
 
A

Abu Yahya

[This followup was posted to comp.lang.java.databases and a copy was
sent to the cited author.]

A small application that I'm making requires me to store very long
strings (>1000 characters) in a database.

Unless you're storing millions of these strings or using Access, I'd say
just store and use the string. I think you'll find that the speed
penalty is nearly unnoticeable.

Thanks - simply storing them the way they are does seem to be the best
way forward.
 
L

Lothar Kimmeringer

F'up to cljp

Abu said:
I considered using an SHA-512 hash of these strings and storing them in
the database. However, while these will save on storage space, it will
take time to do the hashing before comparing an incoming string. So I'm
still wasting time. (Collisions due to hashing will not be a problem,
since an occasional false positive will not be fatal for my application).

If you write seldom and read often, why not using two columns:

string_hashcode
sha1_hashcode

If the first is equal, you can calculate the sha1-hash for the string
to be checked and if that is equal as well, you can consider the
string as equal. That both hashes collide I expect to be very
very unlikely (which is why I changed the other alg to sha-1, that
should be considerably more performant than sha512).

So calculation of the more complex algorithm is only done while
storing to the database and when checking a string that is already
in the database. If you have that case very often you still might
get a better performance with String.hashcode and SHA1 than with
just SHA512.
What would be the best approach?

There is no single best approach, only an optimal one. Which
one it is dependend on what defines one way to be better than
the other (in terms of performance, storage-space, collision-
rates, etc).


Regards, Lothar
--
Lothar Kimmeringer E-Mail: (e-mail address removed)
PGP-encrypted mails preferred (Key-ID: 0x8BC3CD81)

Always remember: The answer is forty-two, there can only be wrong
questions!
 
A

Abu Yahya

What does "occasional" mean?

The application will be doing some probabilistic evaluation using the
data it is storing, and needs to be fairly close to the actual /most/ of
the time. Because it would anyway be impossible to be correct 100% of
the time, using a wrong value for a particular piece of input data will
not matter, provided it does not happen too often. (In my text above,
"rare" would have been a better choice of word than "occasional")
 
L

Lothar Kimmeringer

David said:
Unless you're storing millions of these strings or using Access, I'd say
just store and use the string. I think you'll find that the speed
penalty is nearly unnoticeable.

Depends on the database. Some of them force you to use CLOBS for
text-columns with more than 255 characters. CLOBS are a PITA in
terms of indexing.


Regards, Lothar
--
Lothar Kimmeringer E-Mail: (e-mail address removed)
PGP-encrypted mails preferred (Key-ID: 0x8BC3CD81)

Always remember: The answer is forty-two, there can only be wrong
questions!
 
R

Robert Klemme

A small application that I'm making requires me to store very long
strings (>1000 characters) in a database.

I will need to use these strings later to compare for equality to
incoming strings from another application. I will also want to add some
of the incoming strings to the storage, if they meet certain criteria.

For my application, I get a feeling that storing these strings in my
table will be a waste of space, and will impact performance due to
retrieval and storage times, as well as comparison times.

I considered using an SHA-512 hash of these strings and storing them in
the database. However, while these will save on storage space, it will
take time to do the hashing before comparing an incoming string. So I'm
still wasting time. (Collisions due to hashing will not be a problem,
since an occasional false positive will not be fatal for my application).

What would be the best approach?

Just out of curiosity: what do those strings represent? What do you do
with them?

Kind regards

robert
 
M

Martin Gregorie

If you write seldom and read often, why not using two columns:

string_hashcode
sha1_hashcode
You haven't given us a maximum size for the string or the name of the
target database, so its possible that implementation constraints will
prevent the strings from fitting into a character() column and you'll
have to use a character varying, text, clob etc. instead.

If this type isn't allowed as the table's primary key, you can use the
hash code as the primary key. Since it doesn't matter if some strings
have clashing hash codes this can't cause a problem.
 
T

Tom Anderson

A small application that I'm making requires me to store very long
strings (>1000 characters) in a database.

So >1000, but by any chance <2000? <4000?
I will need to use these strings later to compare for equality to
incoming strings from another application. I will also want to add some
of the incoming strings to the storage, if they meet certain criteria.

For my application, I get a feeling that storing these strings in my
table will be a waste of space, and will impact performance due to
retrieval and storage times, as well as comparison times.

I assume by 'table' you mean an in-memory hashtable. I wouldn't be so
quick to discount this - how many strings do you have? If you had 25 000
2000-character strings, that would be 100 MB; that's not an exorbitant
amount. Access would be pretty quick.
I considered using an SHA-512 hash of these strings and storing them in
the database. However, while these will save on storage space, it will
take time to do the hashing before comparing an incoming string. So I'm
still wasting time. (Collisions due to hashing will not be a problem,
since an occasional false positive will not be fatal for my
application).

If you're using a database, don't bother with a hash, just put the whole
string in, index the column, and then do equality queries on it. A B-tree
index will deal with this kind of query pretty efficiently.

If you wanted to pursue in-memory approaches, i'd suggest constructing a
trie of some sort. Tries are very fast at retrieval, don't need to access
the whole key to identify a miss, and only need to access the whole key
once to identify a hit - you always walk through the key from beginning to
end (perhaps skipping some characters in some kinds of tree), stopping as
soon as the key doesn't match, and only reaching the end if it does match.
I haven't found a good overview of the current state of the art in tries,
but look up Patricia tries, Judy tries, burst tries, fusion tries, and HAT
tries. You could consider only storing a prefix of the strings - the first
100 characters, say - in the trie, to save memory, with the leaves having
pointers to where the complete strings are stored on disk.

tom
 
H

Harry Tuttle

Lothar Kimmeringer, 08.06.2011 20:31:
Depends on the database. Some of them force you to use CLOBS for
text-columns with more than 255 characters. CLOBS are a PITA in
terms of indexing.

No serious DBMS requires you to use a CLOB for columns with more than 255 characters.

The least flexible "big" name is Oracle, which has a limit of 4000 bytes. Beyond that a CLOB is needed.

Most others have a much larger limit on VARCHAR columns (32k, 64k, 1GB, ...)
 
H

Hallvard B Furuseth

Gene said:
Your feeling is irrelevant. You should benchmark. If you do
not, you may end up jumping through hoops for something that is
unneeded (though you may never find out it that it is unneeded).

Indeed. No point in using a lot of time to solve a non-problem. But
after the benchmark, the decision can depend on who the application is
for. If it scales poorly, that can bite other users with different
input data.

OTOH if he'll be the only user and he finds that full strings and SHA
are both too slow: Another approach would be to use a faster hash, count
hash collisions, and don't bother with more if the count is acceptable.

Or try tries, as Tom Anderson suggested.
 
M

Michael Wojcik

Gene said:
What does "occasional" mean?

Assuming the application doesn't accidentally run afoul of a
hitherto-unknown flaw in SHA-512 - a tremendously unlikely event - it
means "once in every N/2**256 uses", where N is the current number of
hashes in the database. (2**256 because of the Birthday Paradox; we're
interested in the probability of two arbitrary colliding pre-images.)

Or, in other words, "probably not before the heat death of the
universe". (Or false vacuum decay, zombie apocalypse, Rapture, etc.)

Worrying about the time to hash the string is silly. It's linear in
the length of the string, so it's roughly equivalent to the time taken
to do a few comparisons (where "a few" depends on how long, on
average, the matching prefix of the new string and the candidates is).

However, as various others have pointed out, this is premature
optimization. There's no reason to use any design other than the
obvious until a problem is demonstrated.
 
J

Joshua Maurice

A small application that I'm making requires me to store very long
strings (>1000 characters) in a database.

I will need to use these strings later to compare for equality to
incoming strings from another application. I will also want to add some
of the incoming strings to the storage, if they meet certain criteria.

For my application, I get a feeling that storing these strings in my
table will be a waste of space, and will impact performance due to
retrieval and storage times, as well as comparison times.

I considered using an SHA-512 hash of these strings and storing them in
the database. However, while these will save on storage space, it will
take time to do the hashing before comparing an incoming string. So I'm
still wasting time. (Collisions due to hashing will not be a problem,
since an occasional false positive will not be fatal for my application).

What would be the best approach?

If it's that relevant that you're asking, measure first to see if it's
a problem. If you're that concerned that it will be, then code a
number of reasonable alternatives and measure.

Presumably you need to do a Map lookup on the incoming strings. I
thought about some itern scheme, but that won't work if you're
receiving a lot of incoming new strings. Storing hashs could work. Do
you need to store the strings in a database? If you can store them
locally, maybe a trie?
http://en.wikipedia.org/wiki/Trie
I somewhat doubt (maybe?) that you're going to get much better lookup
performance than a trie (but of course I would measure too).
 
H

Harry Tuttle

Lothar Kimmeringer, 08.06.2011 20:31:
Depends on the database. Some of them force you to use CLOBS for
text-columns with more than 255 characters. CLOBS are a PITA in
terms of indexing.

No serious DBMS requires you to use a CLOB for columns with more than 255 characters.

The least flexible "big" name is Oracle, which has a limit of 4000 bytes. Beyond that a CLOB is needed.

Most others have a much larger limit on VARCHAR columns (32k, 64k, 1GB, ...)
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,992
Messages
2,570,220
Members
46,807
Latest member
ryef

Latest Threads

Top