Rainer Weikusat
Rainer Weikusat said:
Rainer Weikusat said:
Martijn Lievaart said:
man sort
Indeed, I tried sort first; it works. It is more of a scalability
question, really.
This is a really bad idea because sort will reorder the complete input
lines, including the data part, possibly/probably multiple times per
input line, and this means a lot of copying of data which doesn't need
to be copied, since only the IDs are supposed to be sorted.
As GNU sort is rather optimized, I would benchmark this before making
blanket statements like this.
'Rather optimized' usually means the code is seriously convoluted
because it used to run faster on some piece of ancient hardware in
1997 for a single test case because of that. And no matter how
'optimized', a sort program needs to sort its input. Which involves
reordering it. Completely. In case of files which are too large for
the memory of a modern computer, this involves a whole lot of copying
data around, as sorted runs have to be written to temporary files and
merged back together.
I suggest that you make some benchmarks before making blanket
statements like the one above.
On some random computer I just used for that, sorting a 1080M file
(4000000 lines)
Since I'm a curious person, I also tried this with the 'complete'
algorithm, namely: sort the lines, remove the IDs, and concatenate the
results. Something like

sort -k1 -S 50% mob-4 | perl -pe 'chop; s/^[^\t]+\t//;' >out

is actually drastically faster than any 'pure Perl' solution. But this
requires keeping the whole file in memory. As soon as sort can't do
that anymore, its performance becomes relatively abysmal while the
code which keeps only the IDs works decently on a larger dataset.
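
The 'code which keeps only the IDs' isn't quoted here. A minimal
sketch of that idea, assuming input lines of the form ID<TAB>data and
producing the same concatenated output as the pipeline above, could
look like the following (call it sort-ids-only.pl; the name and all
details are illustrative, not necessarily the code used for the
timings):

#!/usr/bin/perl
# Sketch: sort a big ID<TAB>data file by ID while keeping only the IDs
# (plus a byte offset and length per line) in memory. The data parts
# are read back from the file in a second pass.
use strict;
use warnings;

my $file = shift // die "usage: $0 file\n";
open(my $fh, '<', $file) or die "open $file: $!\n";
binmode($fh);    # work on raw bytes so the recorded offsets are exact

my @index;
my $offset = 0;
while (my $line = <$fh>) {
    my $len = length($line);
    my ($id) = $line =~ /^([^\t]+)\t/
        or die "line without ID at offset $offset\n";
    push @index, [$id, $offset, $len];    # ID plus where the line lives
    $offset += $len;
}

# Second pass: visit the lines in ID order, strip the ID and the
# newline, and print the data parts back to back (like the pipeline).
for my $rec (sort { $a->[0] cmp $b->[0] } @index) {
    my (undef, $off, $len) = @$rec;
    seek($fh, $off, 0) or die "seek: $!\n";
    read($fh, my $line, $len) == $len or die "short read at $off\n";
    chomp $line;
    $line =~ s/^[^\t]+\t//;
    print $line;
}

The point is that only the IDs plus two integers per line stay in
memory; whether the seek-heavy second pass beats the pipeline on a
given disk is exactly the kind of thing that needs benchmarking.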
But this is nevertheless somewhat of a moot discussion: something
'complete' which has been written in C will doubtless outperform the
pipeline easily.
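
For anyone who wants to repeat the comparison, something along these
lines should do, with mob-4 standing in for whatever large test file
is at hand and sort-ids-only.pl being the sketch above:

time sort -k1 -S 50% mob-4 | perl -pe 'chop; s/^[^\t]+\t//;' >out-pipeline
time perl sort-ids-only.pl mob-4 >out-idsort
cmp out-pipeline out-idsort

If IDs repeat, the two outputs may legitimately differ, because sort
-k1 also compares the data behind equal IDs while the sketch orders by
the ID alone; with unique IDs cmp should stay silent.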