A
aliman
Hi all,
Apologies I'm sure this has been asked many times, but I'm trying to
figure out the most efficient way to do a complex sort on very large
files.
I've read the recipe at [1] and understand that the way to sort a
large file is to break it into chunks, sort each chunk and write
sorted chunks to disk, then use heapq.merge to combine the chunks as
you read them.
What I'm having trouble figuring out is what to do when I want to sort
by one key ascending then another key descending (a "complex sort").
I understand that sorts are stable, so I could just repeat the whole
sort process once for each key in turn, but that would involve going
to and from disk once for each step in the sort, and I'm wondering if
there is a better way.
I also thought you could apply the complex sort to each chunk before
writing it to disk, so each chunk was completely sorted, but then the
heapq.merge wouldn't work properly, because afaik you can only give it
one key.
Any help much appreciated (I may well be missing something glaringly
obvious).
Cheers,
Alistair
[1] http://code.activestate.com/recipes/576755-sorting-big-files-the-python-26-way/
Apologies I'm sure this has been asked many times, but I'm trying to
figure out the most efficient way to do a complex sort on very large
files.
I've read the recipe at [1] and understand that the way to sort a
large file is to break it into chunks, sort each chunk and write
sorted chunks to disk, then use heapq.merge to combine the chunks as
you read them.
What I'm having trouble figuring out is what to do when I want to sort
by one key ascending then another key descending (a "complex sort").
I understand that sorts are stable, so I could just repeat the whole
sort process once for each key in turn, but that would involve going
to and from disk once for each step in the sort, and I'm wondering if
there is a better way.
I also thought you could apply the complex sort to each chunk before
writing it to disk, so each chunk was completely sorted, but then the
heapq.merge wouldn't work properly, because afaik you can only give it
one key.
Any help much appreciated (I may well be missing something glaringly
obvious).
Cheers,
Alistair
[1] http://code.activestate.com/recipes/576755-sorting-big-files-the-python-26-way/