PJH> The startup time is O(n). For most other data structures it's worse (for
PJH> trees it's O(n * log n), for example). The insertion time for each
PJH> element may be a bit long (on my test system, inserting a hash member
PJH> with a short key takes about 2µs - 10 times as long as inserting an
PJH> array member), but that's the same for small hashes as for large hashes,
PJH> so it's not an argument against large hashes specifically.
PJH> Of course loading a hash with 100 million members into memory and then
PJH> doing a single lookup would be silly. But again that's true for any data
PJH> structure. The time to build the data structure must amortize itself
PJH> over the lifetime of the structure. So a huge hash makes only sense if
PJH> the time to build it is small compared to the time saved by doing fast
PJH> lookups. Adding 100 million records into a database takes even longer
PJH> (a *lot* longer), but if it lives long enough it may eventually amortize
PJH> itself.
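For what it's worth, the per-insert figures are easy enough to check;
this is roughly the micro-benchmark I'd run (a sketch using the core
Benchmark module; the key format and iteration count are made up, and
sub-call overhead will blur the absolute numbers):

    use strict;
    use warnings;
    use Benchmark qw(timethese);

    my (%h, @a);
    my $i = 0;
    timethese(1_000_000, {
        # each iteration inserts one member with a short, distinct key
        hash_insert  => sub { $h{ 'k' . $i++ } = 1 },
        # baseline: append one element to an array
        array_insert => sub { push @a, 1 },
    });
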
A tiny MySQL database will happily chug through the same number of
records as a 100GB hash, has a very cheap startup once the records
are in persistent storage, is accessible from multiple processes and
network locations, and has other nice properties like better queries
and deterministic enumeration. I don't buy your argument, which
boils down to "fast lookups are worth a lot of trouble, like locking
yourself down to a single process, starting up slowly every time, and
using up a lot of memory." It doesn't make technical or business sense
to propose it as a solution unless you're certain there is no need for a
database and anticipate no need for long-term maintenance or extensions.
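And to make the comparison concrete, the lookup side of the database
route is only a handful of lines (a sketch; the DSN, credentials,
table and column names are invented for illustration, not anything
you proposed):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('DBI:mysql:database=test', 'user', 'password',
                           { RaiseError => 1 });
    my $sth = $dbh->prepare('SELECT v FROM kv WHERE k = ?');
    $sth->execute('some_key');          # single-key lookup
    my ($value) = $sth->fetchrow_array;
    $sth->finish;
    $dbh->disconnect;

No 100GB process image, no rebuild at every startup, and any number of
clients can do the same thing at once.
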
PJH> You can also walk the whole hash in hash order. If you have different
PJH> requirements, don't use a hash. But again that's mostly independent of
PJH> the hash size.
Well, no. The hash size can make it impossible to construct a list of
the keys if memory runs out. So it certainly matters, and you have to be
VERY careful to avoid making a copy of the keys. A trie, by comparison,
can be comfortably walked without using much memory.
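Concretely, the care I mean is the difference between these two loops
(a sketch, with stand-in data):

    use strict;
    use warnings;

    my %huge_hash = map { "key$_" => $_ } 1 .. 1_000_000;   # stand-in data

    # This materialises the complete key list before the loop even
    # starts, which is exactly the copy you cannot afford on a huge hash:
    # for my $key (keys %huge_hash) { ... }

    # This walks the hash one entry at a time through its internal
    # iterator, without building the key list:
    while (my ($key, $value) = each %huge_hash) {
        # do something with $key and $value
    }
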
PJH> You forgot that it may also hurt a lot. So the advice to always
PJH> partition large hashes without looking at the specific use case may
PJH> backfire badly.
I think all along I've shown great willingness to look at specific use cases.
PJH> That doesn't have much to do with the scalability of Perl hashes (within
PJH> RAM limits). You delegated a task to a program written for that task in
PJH> C instead of writing it in Perl and discovered that it was faster at it
PJH> than your Perl program. Big surprise.
But Peter, we're talking about how to approach a problem here. The tool
matters, certainly: you want to delegate to tools written for managing
large amounts of data. Perl hashes are not such a tool. They are a
general-purpose storage structure.
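If what people actually like is the hash *interface*, they can keep it
and still delegate the storage to something built for the job, for
example Berkeley DB through a tied hash (a sketch, assuming DB_File is
available; the file name is mine):

    use strict;
    use warnings;
    use DB_File;
    use Fcntl qw(O_RDWR O_CREAT);

    my %records;
    tie %records, 'DB_File', 'records.db', O_RDWR | O_CREAT, 0666, $DB_HASH
        or die "Cannot tie records.db: $!";

    # stored via Berkeley DB on disk, not in an in-memory hash
    $records{some_key} = 'some value';
    print "$records{some_key}\n";

    untie %records;
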
PJH> Other than that I suspect that you just ran out of RAM. Perl is a memory
PJH> hog and adds a large overhead. But again, that's perl's weakness in
PJH> general and not specific to hashes (much less large hashes).
I think you'll find most hash table implementations in any language have
that problem. They must overallocate by quite a bit, for instance, to
amortize the malloc costs.
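Anyone who wants to put a number on that overhead can do it in a couple
of lines (a sketch; Devel::Size is a CPAN module, and the key format
and count are arbitrary):

    use strict;
    use warnings;
    use Devel::Size qw(total_size);

    my %h = map { sprintf('key%08d', $_) => 1 } 1 .. 100_000;
    printf "%.1f bytes per entry (keys, values and buckets included)\n",
           total_size(\%h) / 100_000;

The figure will come out far larger than the raw payload, which is the
overhead we're both talking about.
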
PJH> There are use cases where hashes are appropriate and some where
PJH> they are not. This depends on a lot of factors, but the hash
PJH> size is of relatively small importance.
I agree as long as the resulting hash size is manageable in terms of the
specific needs of the user. But when it becomes a problem (AKA the
point at which the CompSci major calls the CompEng major to complain the
system is slow), it's a *big* problem. So I'm very suspicious of large
hash tables.
PJH> In fact, hashes scale better than almost any other data structure.
That's definitely not true. They make terrible arrays, for one thing.
PJH> I strongly disagree with "small hashes good, large hashes bad".
I don't think I said that. I use verbs, not grunts.
Ted