I hear you, but I mean within the confines of the language what would
be the fastest way of accessing in-memory data of such a large list of
values.
STILL no specifics ... Well, I'll do what I can.
"Within the confines of the language" -- well, you've
already said you want to use a Hashtable, so that's that.
HashMap might be better -- or not; depends on what you're
doing, and I still don't know enough.
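
For what it's worth, the two classes are used the same way through
the Map interface; the difference I have in mind is that Hashtable
synchronizes every call and HashMap does not. A minimal sketch, and
nothing more: the Long row id and the byte[] payload are my own
stand-ins, since I still don't know what a row actually looks like.

```java
import java.util.HashMap;
import java.util.Hashtable;
import java.util.Map;

class LookupSketch {
    // Hypothetical row layout: a long id mapped to an opaque byte[] payload.
    static Map<Long, byte[]> load(boolean threadSafe, int rows, int bytesPerRow) {
        Map<Long, byte[]> table;
        if (threadSafe) {
            // Hashtable synchronizes every get/put.
            table = new Hashtable<>(rows * 2);
        } else {
            // HashMap does no locking; the caller must handle concurrent writers.
            table = new HashMap<>(rows * 2);
        }
        for (long id = 0; id < rows; id++) {
            table.put(id, new byte[bytesPerRow]);
        }
        return table;
    }

    public static void main(String[] args) {
        Map<Long, byte[]> table = load(false, 1000, 16);
        System.out.println("row 42 has " + table.get(42L).length + " bytes");
    }
}
```

Which one wins for you depends on how many threads touch the table
and whether anything writes to it after loading, which is exactly
the sort of detail I'm missing.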
40 GB of "raw data" plus a few more GB of object
references and other such "metadata" -- this implies a
JVM with 64-bit addressing.
You'll also need about 64 GB of RAM to hold all that
data, metadata, your classes, the JVM, and the O/S. The
machine isn't going to do much of anything besides serving
up the data for you.
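
If you want to see how far a given JVM is from that, you can ask it
for its heap ceiling. A rough sketch: the 4 GB overhead figure is
only my guess at "a few more GB", and the class and method names
are made up for illustration.

```java
class HeapCheck {
    static final long GB = 1024L * 1024 * 1024;

    // True if a heap of maxHeapBytes could hold the raw data plus overhead.
    static boolean fits(long maxHeapBytes, long rawGb, long overheadGb) {
        return maxHeapBytes >= (rawGb + overheadGb) * GB;
    }

    public static void main(String[] args) {
        // maxMemory() reports the most heap this JVM will try to use.
        long max = Runtime.getRuntime().maxMemory();
        System.out.println("max heap bytes: " + max);
        // 40 GB raw data plus an assumed 4 GB of references and metadata.
        System.out.println("fits 40 GB + 4 GB: " + fits(max, 40, 4));
    }
}
```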
You'll probably need a 64-bit O/S to manage that much
memory -- it's possible for an O/S to manage more memory
than it can address, but that style of thing seems to have
fallen out of favor. Solaris, AIX, some Linux distros,
maybe that brand-new Windows (if you can find a 64-bit
JVM for it). Dunno about Mac; dunno about *BSD; not sure
about zLinux. OS/2 fans need not apply.
Alternatively, you could break up the data into, say,
sixteen chunks of a quarter-million rows each (2.5 GB apiece)
and spread the load across sixteen machines with 4 GB of
RAM running 32-bit JVMs and the O/S of your choice. A
seventeenth machine could field the query and route it,
or you could submit each query to all sixteen servers and
"batch" the results. This could be pretty fast if you
tie the machines together in something like an InfiniBand
fabric.
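
The routing on that seventeenth machine could be as dumb as a
modulus. A sketch under assumptions of my own: rows are identified
by a numeric key, and a row lives on server "key mod 16". Nothing
you've said tells me either is true of your data.

```java
class ShardRouter {
    // Map a row key to one of shardCount servers (assumed placement rule).
    static int shardFor(long rowKey, int shardCount) {
        // floorMod keeps the result in 0..shardCount-1 even for negative keys.
        return (int) Math.floorMod(rowKey, (long) shardCount);
    }

    public static void main(String[] args) {
        int shards = 16;
        long totalRows = 16L * 250_000; // sixteen chunks of a quarter-million rows
        int[] rowsPerShard = new int[shards];
        for (long key = 0; key < totalRows; key++) {
            rowsPerShard[shardFor(key, shards)]++;
        }
        System.out.println("rows on shard 0: " + rowsPerShard[0]);
    }
}
```

With keys that aren't evenly spread you'd want to hash them first,
or the chunks won't come out the same size.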
Still not fast enough? Okay: Deploy a hundred such
machines each handling 40,000 rows, and architect the
routing for high parallelism. Or use a hundred machines
each handling 400,000 rows (so each row appears on ten
different machines) so you can route each query to the
least-busy machine that can satisfy it.
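
One way the ten-copies layout might be arranged, purely as a
sketch: cut the rows into a hundred blocks of 40,000 and put block
b on machines b, b+1, ..., b+9 (wrapping around), so every machine
ends up with ten blocks, i.e. 400,000 rows. The placement rule and
the "queries in flight" counters are inventions of mine, not
anything a particular product does.

```java
class ReplicaRouter {
    static final int MACHINES = 100;
    static final int COPIES = 10;
    static final int ROWS_PER_BLOCK = 40_000;

    // Machines holding a row (row numbers assumed to start at 0).
    static int[] replicasFor(long row) {
        int block = (int) ((row / ROWS_PER_BLOCK) % MACHINES);
        int[] replicas = new int[COPIES];
        for (int i = 0; i < COPIES; i++) {
            replicas[i] = (block + i) % MACHINES;
        }
        return replicas;
    }

    // Pick the replica with the fewest queries currently in flight.
    static int leastBusy(long row, int[] inFlight) {
        int best = -1;
        for (int m : replicasFor(row)) {
            if (best < 0 || inFlight[m] < inFlight[best]) {
                best = m;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int[] inFlight = new int[MACHINES];
        java.util.Arrays.fill(inFlight, 5);
        inFlight[7] = 1; // pretend machine 7 is nearly idle
        System.out.println("route row 123 to machine " + leastBusy(123, inFlight));
    }
}
```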
Still not enough? No problem: Break open the piggy
bank, and Sun or IBM or SGI or Cray or somebody will build
you a great big computing grid populated with the biggest,
baddest iron they make. It'll help if you live near a
hydroelectric or nuclear power plant; this solution may
require a little more electricity than ordinary office
wiring can supply.
Do you see yet, andrew, why "What's the fastest" is the
wrong question? You've removed all other constraints, so
you get answers that may be in some sense "right" but are
in no sense "useful" -- I think I probably exceeded your
likely budget several paragraphs ago. The right question
is something like "How do I keep the average (or worst case,
or 90th percentile) response below X milliseconds, and can
it be done with fewer than Y systems on a budget of Z?"
That's something people can deal sensibly with -- but these
"The sky's the limit" questions get nowhere, and rapidly.