xhoster
Ted Zlatanov said:
On Fri, 10 Aug 2007 22:28:34 +0000 (UTC) Ilya Zakharevich wrote:
IZ> [A complimentary Cc of this posting was sent to
IZ> Ted Zlatanov
IZ> Good. So what you suggest, is 1000 passes over a 4GB file. Good luck!
IZ> And why do you think this would decrease the load on head seeks?
IZ> Either the data fits in memory (then database is not needed), or it is
IZ> read from disk (which would, IMO, imply the same amount of seeks with
IZ> database as with any other file-based operation).
Look, databases are optimized to store large amounts of data
efficiently.
For some not very general meanings of "efficiently", sure. They generally
expand the data quite a bit upon storage; they aren't very good at straight
retrieval unless you have just the right index structures in place and your
queries have high selectivity; most of them put a huge amount of effort
into transactionality and concurrency, which may not be needed here but
imposes a high overhead whether you use it or not. One of the major
gene-chip companies was very proud that in one of their upgrades, they
started using a database instead of plain files for storing the data. And
then their customers were very pleased when, in a following upgrade, they
abandoned that and went back to using plain files for the bulk data and
using the database just for the small DoE metadata.
You can always create a hand-tuned program that will do
one task (e.g. transposing a huge text file) well, but you're missing
the big picture: future uses of the data. I really doubt the only thing
anyone will ever want with that data is to transpose it.
And I really doubt that any single database design is going to support
everything that anyone may ever want to do with the data, either.
IZ> One needs not a database, but a program with built-in caching
IZ> optimized for non-random access to 2-dimensional arrays. AFAIK,
IZ> imagemagick is mostly memory-based. On the other side of the spectrum,
IZ> GIMP is based on tile-caching algorithms; if there were a way to
IZ> easily hook into this algorithm (with no screen display involved), one
IZ> could handle much larger datasets.
You and everyone else are overcomplicating this.
Rewrite the original input file for fixed-length records.
Actually, that is just what I initially recommended.
Then you just
need to seek to a particular offset to read a record, and the problem
becomes transposing a matrix piece by piece. This is fairly simple.
I think you are missing the big picture. Once you make a seekable file
format, that probably does away with the need to transpose the data in the
first place--whatever operation you wanted to do with the transposition can
probably be done on the seekable file instead.
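
For what it's worth, here is a minimal sketch of that idea in Python. The
file names, field width, and column count are made up for illustration; the
point is only that once every record has the same length, any cell (and so
any whole column) is one seek away, and a separate transposition pass stops
being necessary.

    # Sketch only: FIELD_W, NFIELDS, and the file names are assumptions,
    # and it assumes no field is longer than FIELD_W bytes.
    FIELD_W = 12                      # bytes per field, space-padded
    NFIELDS = 1000                    # columns per row
    REC_LEN = FIELD_W * NFIELDS + 1   # +1 for the trailing newline

    def rewrite_fixed(src="matrix.txt", dst="matrix.fix"):
        """Pad every field of a whitespace-separated file to FIELD_W bytes."""
        with open(src) as fin, open(dst, "w") as fout:
            for line in fin:
                fields = line.split()
                fout.write("".join(f.ljust(FIELD_W) for f in fields) + "\n")

    def read_column(col, path="matrix.fix"):
        """Read one column straight out of the fixed-length file by seeking."""
        values = []
        with open(path, "rb") as fh:
            fh.seek(0, 2)                      # jump to end to get file size
            nrows = fh.tell() // REC_LEN
            for row in range(nrows):
                fh.seek(row * REC_LEN + col * FIELD_W)
                values.append(fh.read(FIELD_W).decode().strip())
        return values

Reading a column this way costs one seek per row, which is roughly what a
piecewise transpose would have to pay anyway, so most of the time you can
skip writing the transposed file at all.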
Xho