Extremely large Hashtable


David McDivitt

From: "Chris Uppal said:
Subject: Re: Extreamly large Hashtable
Date: Sat, 7 May 2005 07:58:03 +0100



If a sufficiently good hash function is available, and space is not a problem,
then a big hash table can /always/ beat a big binary lookup. Hashtables can be
made O(1) (with more-or-less any desired constant factor, trading space for
speed), whereas binary lookup is O(log n).


A binary search is the fastest and most easily implemented scheme, when it
can be used. This is because binary searches are at the core of any other
lookup scheme, including databases going through an index. By laying out a
sorted flat file and doing a low-level binary search directly on that flat
file, much overhead is eliminated. If records are large, then a key and
offset value can be obtained from the flat file, and the offset used to
fetch the actual data from another file. For variable-length records, obtain
the offset and length from the first file and use them to fetch from the
second file. Binary searches done this way are greatly helped by the
underlying operating system, which caches disk blocks, removing that load
from the application completely.
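A minimal sketch of the flat-file lookup described above, assuming fixed-width index records of a 16-byte key followed by an 8-byte offset into a separate data file (the record layout, sizes, and class name are illustrative, not from the original post):

import java.io.IOException;
import java.io.RandomAccessFile;

public class FlatFileIndex {
    // Assumed layout: each index record is a fixed-width 16-byte key
    // followed by an 8-byte offset into the data file.
    private static final int KEY_LEN = 16;
    private static final int REC_LEN = KEY_LEN + 8;

    // Binary search the sorted index file; return the data-file offset
    // for the key, or -1 if the key is not present.
    static long lookup(RandomAccessFile index, byte[] key) throws IOException {
        long lo = 0;
        long hi = index.length() / REC_LEN - 1;
        byte[] buf = new byte[KEY_LEN];
        while (lo <= hi) {
            long mid = (lo + hi) >>> 1;
            index.seek(mid * REC_LEN);
            index.readFully(buf);
            int cmp = compare(buf, key);
            if (cmp == 0) {
                return index.readLong();   // the offset follows the key
            } else if (cmp < 0) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return -1;
    }

    // Unsigned byte-wise comparison; must match the order the file was sorted in.
    private static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < KEY_LEN; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) return d;
        }
        return 0;
    }
}

The frequently probed records near the middle of the file end up in the operating system's page cache, which is the caching effect mentioned above.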

The original question asked for alternatives to a ridiculously large hash
table. A small or reasonably sized hash table would certainly be better than
a binary search on disk.
 

Eric Sosman

Boudewijn said:
I believe the OP was talking about keeping the hashtable in memory, not the
entire database.

That's possible; the O.P. has not divulged much about
what he's trying to do. I thought he intended to keep the
whole thing in memory because he specifically mentioned the
row size; if he were intending just to store the keys and
a disk pointer he'd have talked about the key size. Also,
he seemed to have a speed fetish, and speed is incompatible
with I/O ...
Boudewijn said:
Four million times 8 bytes (4-byte hashcode, 4-byte index) at a load of 75% is
still less than 40 MB. What is your guess based on?

I may have guessed wrong; let me try again. The Hashtable
itself contains four million references to Map.Entry objects;
in a 64-bit JVM (on the load-all-the-data assumption) they'll
take eight bytes apiece for 32 MB.

Each Map.Entry object holds references to a key and a value
(sixteen bytes) plus some amount of overhead to express its own
"objectivity" -- I don't know how much that is, but let's assume
it's something like sixteen bytes. That's thirty-two bytes for
each of the four million Map.Entry objects, or 128 MB.

Grand total for metadata: about 160 MB, "just a trifle" less
than my original guess. Even if the Map.Entry overhead is more
like thirty-two bytes and even if the Hashtable itself stores
more than just the Map.Entry reference, the total won't come
close to my 7 GB blunder ... (Memo to self: Stop trying to do
back-of-the-envelope calculations on envelope-less E-mail ;-)
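Spelling out the arithmetic (the 8-byte reference and 16-byte per-object overhead are Eric's own assumptions about a 64-bit JVM, not measured figures):

public class EnvelopeMath {
    public static void main(String[] args) {
        long entries = 4_000_000L;
        long tableRefs = entries * 8;             // 8-byte references to Map.Entry: 32 MB
        long entryObjects = entries * (16 + 16);  // key+value refs plus assumed overhead: 128 MB
        System.out.println("Metadata estimate: about "
                + (tableRefs + entryObjects) / 1_000_000 + " MB");
    }
}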
 

John C. Bollinger

David McDivitt wrote:

["anthony" wrote:]
I hear you, but I mean within the confines of the language what would
be the fastest way of accessing in-memory data of such a large list of
values.

[David McDivitt wrote:]
That would be a binary search of already sorted keys. Probably no more
than seven or eight tests would be required to find any key.

[Chris Uppal wrote:]
If a sufficiently good hash function is available, and space is not a problem,
then a big hash table can /always/ beat a big binary lookup. Hashtables can be
made O(1) (with more-or-less any desired constant factor, trading space for
speed), whereas binary lookup is O(log n).

[David McDivitt wrote:]
A binary search is the fastest and most easily implemented scheme, when it
can be used. This is because binary searches are at the core of any other
lookup scheme, including databases going through an index.

Your claim is much too strong. Binary searching is a widely used
technique, but it is manifestly false that a binary search is used in
any way for lookups against unsorted data. It is also not generally
true that binary searching is used as part of looking up data via a hash
table. Furthermore, I'm uncertain whether you did not understand Chris's
comments about asymptotic complexity or whether you simply chose to
ignore them, but they are irrefutable: in the limit of many data and
given a reasonable hashing function, a hash table will *always*
outperform a binary search.

As for ease of implementation, generic binary searching and hash lookups
are both available in the Java platform library, so I don't see much
difference there. If anything, it's a bit trickier to ensure that the
underlying data is and remains sorted (for a binary search) than it is
to just add objects to a HashMap.
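Both library routes are only a few lines; a rough sketch of the comparison (the four-million-key data set and the probe key are invented for illustration, and the HashMap half wants a few hundred MB of heap):

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class LookupComparison {
    public static void main(String[] args) {
        int n = 4_000_000;

        // Binary search: the keys must be kept sorted.
        long[] sortedKeys = new long[n];
        for (int i = 0; i < n; i++) sortedKeys[i] = i * 2L;    // already in order
        int pos = Arrays.binarySearch(sortedKeys, 1_234_568L); // O(log n), ~22 probes
        System.out.println("binarySearch found: " + (pos >= 0));

        // Hash lookup: no ordering requirement, expected O(1) probes.
        Map<Long, String> map = new HashMap<>(n * 4 / 3);      // pre-size so 4M entries never resize
        for (int i = 0; i < n; i++) map.put(i * 2L, "row " + i);
        System.out.println("HashMap found: " + map.containsKey(1_234_568L));
    }
}

Arrays.binarySearch needs roughly log2(4,000,000), about 22, comparisons per lookup; the HashMap does an expected constant number, at the cost of the extra table and entry objects.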
 

Bjorn Borud

["William Brogden" <[email protected]>]
|
| If you just need to verify that the name is known, look into the
| extremely compact way that spell checkers store words. Essentially
| you would have a spell checker full of names.

yes, two things one may want to look up are Bloom Filters and Finite
State Automata.

-Bjørn
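A minimal sketch of the Bloom-filter idea (the bit-array size, the number of hash functions, and the trick of deriving them from String.hashCode plus a seed are arbitrary illustrative choices; real sizing would come from the expected number of names and the acceptable false-positive rate):

import java.util.BitSet;

// Space-efficient membership test: may report false positives,
// never false negatives -- fine for "is this name known?" checks.
public class NameBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    public NameBloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    public void add(String name) {
        for (int i = 0; i < hashes; i++) {
            bits.set(index(name, i));
        }
    }

    public boolean mightContain(String name) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(index(name, i))) return false;
        }
        return true;
    }

    // Derive the i-th hash from the String hash and a seed.
    private int index(String name, int seed) {
        int h = name.hashCode() * 31 + seed * 0x9E3779B1;
        return Math.floorMod(h, size);
    }

    public static void main(String[] args) {
        NameBloomFilter filter = new NameBloomFilter(40_000_000, 7); // ~5 MB of bits
        filter.add("alice");
        filter.add("bob");
        System.out.println(filter.mightContain("alice")); // true
        System.out.println(filter.mightContain("zoe"));   // almost certainly false
    }
}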
 

Lee Fesperman

David said:
A binary search is the fastest and most easily implemented scheme, when it
can be used. This is because binary searches are at the core of any other
lookup scheme, including databases going through an index. By laying out a
sorted flat file and doing a low-level binary search directly on that flat
file, much overhead is eliminated. If records are large, then a key and
offset value can be obtained from the flat file, and the offset used to
fetch the actual data from another file. For variable-length records, obtain
the offset and length from the first file and use them to fetch from the
second file. Binary searches done this way are greatly helped by the
underlying operating system, which caches disk blocks, removing that load
from the application completely.

Databases on disk going through an index (rather than a hash) will generally use a
b-tree scheme these days. A b-tree is not really a binary search and is much faster for
searching disk, even more so with a good cache. For example, maintaining the top of the
tree in memory yields significant improvements, because b-trees tend to stay balanced (a
big issue with binary trees). They also don't normally sort the data records themselves,
which allows multiple indexes over the same data. Generally, b-trees are preferred
because they allow range searching and ordered sub-sets, which hashing doesn't. All of
this is more appropriate for database purposes.
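In memory, the range searching and ordered sub-sets Lee mentions are exactly what Java's TreeMap (a red-black tree) offers and HashMap does not; a quick sketch with invented keys:

import java.util.SortedMap;
import java.util.TreeMap;

public class RangeQueryDemo {
    public static void main(String[] args) {
        // TreeMap keeps keys ordered and does O(log n) lookups,
        // but it supports range queries that a hash table cannot.
        TreeMap<String, Integer> index = new TreeMap<>();
        index.put("adams", 101);
        index.put("baker", 102);
        index.put("barnes", 103);
        index.put("carter", 104);

        // All keys in the half-open range ["b", "c"): baker, barnes.
        SortedMap<String, Integer> bs = index.subMap("b", "c");
        System.out.println(bs.keySet());

        // Ordered iteration and nearest-key queries also come for free.
        System.out.println(index.ceilingKey("bar"));  // "barnes"
    }
}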
 

Chris Uppal

David said:
A binary search is the fastest and most easily implemented scheme, when it
can be used.

I don't know if you really mean this as it sounds, but taken literally it is
quite simply false.

This is because binary searches are at the core of any other
lookup scheme, including databases going through an index.

The same goes for this.

It /is/ true that tree-based indexing is the norm in high-performance DB
indexing, but note that:

a) the "binary" search in question isn't a binary chop over a sorted array --
it'll be a complex implementation of some sophisticated balanced tree
algorithm.

b) the reason for using a balanced tree for the index rather than hashing has
more to do with the efficiency with which elements can be added and removed,
than lookup efficiency. In fact a hashtable (with a good hash) will beat a
balanced tree implementation for lookup.

By laying out a
sorted flat file and doing a low-level binary search directly on that flat
file, much overhead is eliminated.

Very inefficient. Comparing just the case of a hashtable laid out as a large
flat file, vs. the same data laid out in sort order and accessed via a binary
chop, the hashtable will be /far/ faster. The problem is that the sorted
version will require O(log n) disk seeks to find the wanted element, whereas
the hashtable will typically require just one disk seek to do the same. Disk
seeks are /slow/; doing just 10 of them consumes a substantial fraction of a
second.

OTOH, the hashtable would have to be larger, in order to keep hash collisions
to an acceptable minimum. But it's surprising how little extra space is needed
in this particular case. A hashtable with a load as high as 90% works fine for
this, which is not what you'd normally expect. It's because the access time is
dominated by disk-seek time, and one seek followed by a read of a block of data
from the disk will normally pull in quite a few keys (depending on key size, of
course). Provided that a hash collision can be resolved without needing to
read another block, the cost of the extra comparisons is essentially zero.
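A rough sketch of that one-seek lookup, assuming the table file is an array of fixed-size 4 KB buckets, the hash of the key picks the bucket, and every key that could collide fits in that one bucket (the bucket size, the record layout, and the no-overflow simplification are all assumptions for illustration):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;

public class OnDiskHashTable {
    // Assumed layout: the file is an array of 4 KB buckets; each bucket holds
    // up to 256 records of (8-byte key, 8-byte value); a zero key marks an empty slot.
    private static final int BUCKET_BYTES = 4096;
    private static final int RECORDS_PER_BUCKET = BUCKET_BYTES / 16;

    // One seek plus one block read per lookup; collisions are resolved by
    // scanning the records that arrived in the same read.
    static long lookup(RandomAccessFile file, long key) throws IOException {
        long buckets = file.length() / BUCKET_BYTES;
        long bucket = Math.floorMod(Long.hashCode(key), buckets);

        byte[] block = new byte[BUCKET_BYTES];
        file.seek(bucket * BUCKET_BYTES);   // the single disk seek
        file.readFully(block);              // pulls in every key that could collide

        ByteBuffer buf = ByteBuffer.wrap(block);
        for (int i = 0; i < RECORDS_PER_BUCKET; i++) {
            long k = buf.getLong();
            long v = buf.getLong();
            if (k == key) return v;         // extra comparisons happen in memory, ~free
            if (k == 0) break;              // empty slot: key is absent
        }
        return -1;                          // not found (or the bucket overflowed)
    }
}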

Note that both implementations would suffer from the problem that inserting or
deleting records could be /very/ expensive. The sorted table would suffer
rather worse from this problem than the hashtable, and the jiggery-pokery
required to reduce the problems would be harder to implement for the sorted
table, but neither is really suitable for anything except very special
purposes.

-- chris
 
