Complex sort on big files

aliman

Hi all,

Apologies I'm sure this has been asked many times, but I'm trying to
figure out the most efficient way to do a complex sort on very large
files.

I've read the recipe at [1] and understand that the way to sort a
large file is to break it into chunks, sort each chunk and write
sorted chunks to disk, then use heapq.merge to combine the chunks as
you read them.
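
Something like this, I suppose (a rough sketch; the chunk size and
temp-file handling are just for illustration, and heapq.merge only
accepts a key argument on Python 3.5+):

import heapq
import tempfile
from itertools import islice

def external_sort(path, key=None, chunk_size=100000):
    # read the input in chunks, sort each chunk, spill it to a temp file
    chunks = []
    with open(path) as infile:
        while True:
            lines = list(islice(infile, chunk_size))
            if not lines:
                break
            lines.sort(key=key)
            tmp = tempfile.TemporaryFile(mode='w+')
            tmp.writelines(lines)
            tmp.seek(0)
            chunks.append(tmp)
    # lazily merge the sorted chunk files back together
    return heapq.merge(*chunks, key=key)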

What I'm having trouble figuring out is what to do when I want to sort
by one key ascending then another key descending (a "complex sort").

I understand that sorts are stable, so I could just repeat the whole
sort process once for each key in turn, but that would involve going
to and from disk once for each step in the sort, and I'm wondering if
there is a better way.
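
In memory the stable multi-pass trick is simple enough (toy data; sort
on the secondary key first, then on the primary key):

rows = [("b", 2), ("a", 1), ("a", 3), ("b", 1)]
rows.sort(key=lambda r: r[1], reverse=True)  # secondary key, descending
rows.sort(key=lambda r: r[0])                # primary key, ascending; stability keeps the pass above
print(rows)  # [('a', 3), ('a', 1), ('b', 2), ('b', 1)]

but on a big file each of those passes would mean another full
chunk-and-merge trip through the disk.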

I also thought you could apply the complex sort to each chunk before
writing it to disk, so each chunk was completely sorted, but then the
heapq.merge wouldn't work properly, because afaik you can only give it
one key.

Any help much appreciated (I may well be missing something glaringly
obvious).

Cheers,

Alistair

[1] http://code.activestate.com/recipes/576755-sorting-big-files-the-python-26-way/
 

Peter Otten

aliman said:
Apologies I'm sure this has been asked many times, but I'm trying to
figure out the most efficient way to do a complex sort on very large
files.

I've read the recipe at [1] and understand that the way to sort a
large file is to break it into chunks, sort each chunk and write
sorted chunks to disk, then use heapq.merge to combine the chunks as
you read them.

What I'm having trouble figuring out is what to do when I want to sort
by one key ascending then another key descending (a "complex sort").

I understand that sorts are stable, so I could just repeat the whole
sort process once for each key in turn, but that would involve going
to and from disk once for each step in the sort, and I'm wondering if
there is a better way.

I also thought you could apply the complex sort to each chunk before
writing it to disk, so each chunk was completely sorted, but then the
heapq.merge wouldn't work properly, because afaik you can only give it
one key.

You can make that key as complex as needed:
>>> class Key(object):
...     def __init__(self, obj):
...         self.asc = obj[1]
...         self.desc = obj[2]
...     def __cmp__(self, other):
...         return cmp(self.asc, other.asc) or -cmp(self.desc, other.desc)
...
>>> sorted(["abc", "aba", "bbb", "aaa", "aab"], key=Key)
['aab', 'aaa', 'abc', 'bbb', 'aba']

See also

http://docs.python.org/library/functools.html#functools.total_ordering
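
On Python 3, where __cmp__ is gone, the same idea could be written with
total_ordering, and because heapq.merge compares the items it is given
directly, wrapping each record in such a Key lets you merge chunks that
were sorted with the compound key without needing a key argument at all.
A sketch (the two "chunks" are just in-memory lists standing in for
sorted chunk files):

import heapq
from functools import total_ordering

@total_ordering
class Key:
    """Order by field 1 ascending, then field 2 descending."""
    def __init__(self, record):
        self.record = record
        self.asc = record[1]
        self.desc = record[2]
    def __eq__(self, other):
        return (self.asc, self.desc) == (other.asc, other.desc)
    def __lt__(self, other):
        if self.asc != other.asc:
            return self.asc < other.asc
        return self.desc > other.desc  # reversed comparison = descending

records = ["abc", "aba", "bbb", "aaa", "aab", "bab"]
chunks = [sorted(records[:3], key=Key), sorted(records[3:], key=Key)]

merged = heapq.merge(*(map(Key, chunk) for chunk in chunks))
print([k.record for k in merged])
# ['aab', 'bab', 'aaa', 'abc', 'bbb', 'aba']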
 
 

sturlamolden

aliman said:
I understand that sorts are stable, so I could just repeat the whole
sort process once for each key in turn, but that would involve going
to and from disk once for each step in the sort, and I'm wondering if
there is a better way.

I would consider memory-mapping the file and sorting it inline.
Sorting a binary file of bytes with NumPy is as easy as this:

import numpy as np
f = np.memmap(filename, mode='r+', dtype=np.uint8)  # 'r+' opens the existing file read/write
f.sort(kind='quicksort')
del f

(You can define dtype for any C data type or struct.)
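
For example, a file of fixed-size records can be mapped with a
structured dtype and sorted in place on one field (the record layout
and file name below are made up for illustration):

import numpy as np

rec = np.dtype([('key', np.uint32), ('value', np.float64)])

# stand-in for the big binary file you already have on disk
np.array([(3, 0.5), (1, 2.5), (2, 1.0)], dtype=rec).tofile('records.bin')

f = np.memmap('records.bin', mode='r+', dtype=rec)
f.sort(order='key')  # rewrites the file with records ordered by 'key'
f.flush()
del f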

If the file is really big, use 64-bit Python.

With memory mapping you don't have to worry about processing the file
in chunks, because the operating system will take care of those details.

I am not sure how to achieve this (inline file sort) with standard
library mmap and timsort, so I'll leave that out.


Sturla
 

Roy Smith

Wow.

I was going to suggest using the unix command-line sort utility via
popen() or subprocess. My arguments were that it's written in C, has 30
years of optimizing in it, etc, etc, etc. It almost certainly has to be
faster than anything you could do in Python.

Then I tried the experiment. I generated a file of 1 million random
integers in the range 0 to 5000. I wrote a little sorting program:

numbers = [int(line) for line in open('numbers')]
numbers.sort()
for i in numbers:
    print i

and ran it on my MacBook Pro (8 Gig, 2 x 2.4 GHz cores), Python 2.6.1.

$ time ./sort.py > py-sort
real 0m2.706s
user 0m2.491s
sys 0m0.057s

and did the same with the unix utility:

$ time sort -n numbers > cli-sort
real 0m5.123s
user 0m4.745s
sys 0m0.063s

Python took just about half the time. Certainly knocked my socks off.
Hard to believe, actually.
 

Steven D'Aprano

Roy said:
Wow.

I was going to suggest using the unix command-line sort utility via
popen() or subprocess. My arguments were that it's written in C, has 30
years of optimizing in it, etc, etc, etc. It almost certainly has to be
faster than anything you could do in Python.

Then I tried the experiment. I generated a file of 1 million random
integers in the range 0 to 5000. I wrote a little sorting program: [...]
Python took just about half the time. Certainly knocked my socks off.
Hard to believe, actually.

One million integers isn't very big. If each integer fits in a four-byte
long, that's less than 4MB. That's almost small enough to fit in your CPU's
cache, with room left over for the first few chapters of "War And Peace"
*wink*

So you're comparing Python's timsort, which is Awesome with a capital AWE
but only works on data that fits in memory, versus something which can also
work on files too big to fit into memory.

Try generating a twenty gigabyte file of data, and sort that. Actually,
don't, because just *reading it in* to Python will probably fail, and very
possibly lock your PC up for the duration.

Unix sort does an external R-Way merge sort: if you have more data than
memory, it slices the data up into a bunch of smaller pieces, each of which
will fit in memory, sorts each one to a temporary file, then merges the
lot. It does a lot more work on such big files, because it *takes* a lot
more work.

For something as small as one million numbers, chances are the Unix sort
falls back on a heapsort or a quicksort, which will be pretty fast, but it
ain't no timsort.

So yes, Python's timsort is awesome, but so is Unix's sort, just in
different ways.
 

sturlamolden

aliman said:
I've read the recipe at [1] and understand that the way to sort a
large file is to break it into chunks, sort each chunk and write
sorted chunks to disk, then use heapq.merge to combine the chunks as
you read them.

Or just memory map the file (mmap.mmap) and do an inline .sort() on
the bytearray (Python 3.2). With Python 2.7, use e.g. numpy.memmap
instead. If the file is large, use 64-bit Python. You don't have to
process the file in chunks as the operating system will take care of
those details.

Sturla
 

John Nagle

sturlamolden said:
I've read the recipe at [1] and understand that the way to sort a
large file is to break it into chunks, sort each chunk and write
sorted chunks to disk, then use heapq.merge to combine the chunks as
you read them.

Or just memory map the file (mmap.mmap) and do an inline .sort() on
the bytearray (Python 3.2). With Python 2.7, use e.g. numpy.memmap
instead. If the file is large, use 64-bit Python. You don't have to
process the file in chunks as the operating system will take care of
those details.

Sturla

No, no, no. If the file is too big to fit in memory, trying to
page it will just cause thrashing as the file pages in and out from
disk.

The UNIX sort program is probably good enough. There are better
approaches if you have many gigabytes to sort (see Syncsort, a
commercial product), but few people need them.

John Nagle
 
