Implementing file reading in C/Python

Johannes Bauer

Marc said:
Yours took ~37 minutes for 2 GiB here. This "just" ~15 minutes:

Ah, ok... when implementing your suggestions in the other post, I did
not get such a drastic performance increase. I really will have a look
at it and try to locate where I'm wasting the time.

Thanks a lot,
Kind regards,
Johannes
 
Johannes Bauer

mk said:
This probably doesn't contribute much, but have you tried using Python
profiler? You might have *something* wrong that eats up a lot of time in
the code.

No - I didn't know there was a profiler, and I haven't found anything
meaningful yet (there seems to be a profiling C interface, but that
won't get me anywhere). Is that a separate tool or something? Could you
provide a link?
The factor of 30 indeed does not seem right -- I have done somewhat
similar stuff (calculating Levenshtein distance [edit distance] on words
read from very large files), coded the same algorithm in pure Python and
C++ (using linked lists in C++) and Python version was 2.5 times slower.

Yup, that was about what I had expected (and what I could well live
with, it's a tradeoff).

Thanks,
Kind regards,
Johannes
 
rurpy

mk said:
The factor of 30 indeed does not seem right -- I have done somewhat
similar stuff (calculating Levenshtein distance [edit distance] on words
read from very large files), coded the same algorithm in pure Python and
C++ (using linked lists in C++) and Python version was 2.5 times slower.

Yup, that was about what I had expected (and what I could well live
with, it's a tradeoff).

The rule of thumb I use is that Python is generally 5 to 50 times
slower than C. It is considered blasphemy to say it in this group,
but Python is slow. It does of course have many compensating
advantages that make it worthwhile when runtime speed is not of
primary importance.
 
MRAB

Marc said:
As this was horribly slow (20 minutes for a 2 GB file) I coded the whole
thing in C also:

Yours took ~37 minutes for 2 GiB here. This "just" ~15 minutes:

#!/usr/bin/env python
from __future__ import division, with_statement
import os
import sys
from collections import defaultdict
from functools import partial
from itertools import imap


def iter_max_values(blocks, block_count):
    for i, block in enumerate(blocks):
        histogram = defaultdict(int)
        for byte in block:
            histogram[byte] += 1

        yield max((count, value)
                  for value, count in histogram.iteritems())[1]
[snip]
Would it be faster if histogram was a list initialised to [0] * 256?
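
Something like this, I suppose (untested sketch -- note that iterating
over a Python 2 byte string yields one-character strings, so every byte
goes through ord()):

def iter_max_values(blocks, block_count):
    for i, block in enumerate(blocks):
        histogram = [0] * 256
        for byte in block:
            histogram[ord(byte)] += 1

        # index of the largest count == most frequent byte value
        yield histogram.index(max(histogram))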
 
bearophileHUGS

Johannes Bauer, I was about to start writing a faster version. I think
with some care and Psyco you can get to about 5 times slower than C, or
something like that.

To do that you need to use almost the same code as the C version,
with a list of 256 ints for the frequencies, not using max() but a
manual loop, not using itertools or generators, maybe splitting the
code into two functions to allow Psyco to optimize better, maybe using
another array("...") for the frequencies too. The data can be read into
an array.array("B"), and so on.
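
A rough, untested sketch of that (the file name, the block size and the
psyco.bind() call are just placeholders):

from array import array

try:
    import psyco
except ImportError:
    psyco = None


def block_mode(block):
    # A plain list of 256 counters and manual loops instead of max() and
    # generators -- the kind of code Psyco compiles well.
    frequencies = [0] * 256
    for value in block:
        frequencies[value] += 1
    best_value = 0
    best_count = frequencies[0]
    for value in xrange(1, 256):
        if frequencies[value] > best_count:
            best_count = frequencies[value]
            best_value = value
    return best_value


def block_modes(filename, block_size=1024 * 1024):
    modes = []
    stream = open(filename, 'rb')
    try:
        while True:
            block = array('B')
            try:
                block.fromfile(stream, block_size)
            except EOFError:
                pass  # short last block; whatever was read is kept
            if not block:
                break
            modes.append(block_mode(block))
    finally:
        stream.close()
    return modes


if psyco is not None:
    psyco.bind(block_mode)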

But I think all this work is a waste of time. I like Python, but that
C code, after some cleaning and polishing, looks fine for this job. Of
course there are other languages that may give you slightly nicer code
for this program, like D, and there may be ways to use numpy too to
speed up the computation of the mode, but none of that looks very
important this time.

Bye,
bearophile
 
Sion Arrowsmith

Grant Edwards said:
If I were you, I'd try mmap()ing the file instead of reading it
into string objects one chunk at a time.

You've snipped the bit further on in that sentence where the OP
says that the file of interest is 2GB. Do you still want to try
mmap'ing it?
 
Marc 'BlackJack' Rintsch

Marc said:
def iter_max_values(blocks, block_count):
    for i, block in enumerate(blocks):
        histogram = defaultdict(int)
        for byte in block:
            histogram[byte] += 1

        yield max((count, value)
                  for value, count in histogram.iteritems())[1]
[snip]
Would it be faster if histogram was a list initialised to [0] * 256?

Don't know. Then for every byte in the 2 GiB we have to call `ord()`.
Maybe the speedup from the list compensates for this, maybe not.

I think that having to do something with *every* byte of that really
large file *at Python level* is the main problem here. In C that's just
some primitive numbers. Python has all the object overhead.
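
(One way to keep the inner loop in C, untested here: let str.count scan
the block once per possible byte value and only loop over the 256 values
in Python. Whether 256 C-level passes beat one Python-level pass would
need measuring.)

def most_common_byte(block):
    # 256 scans of the block, each done at C speed by str.count.
    counts = [block.count(chr(value)) for value in xrange(256)]
    return counts.index(max(counts))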

Ciao,
Marc 'BlackJack' Rintsch
 
John Machin

The factor of 30 indeed does not seem right -- I have done somewhat
similar stuff (calculating Levenshtein distance [edit distance] on words
read from very large files), coded the same algorithm in pure Python and
C++ (using linked lists in C++) and Python version was 2.5 times slower.

Levenshtein distance using linked lists? That's novel. Care to
divulge?

And if C++ is using linked lists and Python isn't, it's not really the
same algorithm, is it?

Cheers,
John
 
Rhamphoryncus

Marc 'BlackJack' Rintsch wrote:
def iter_max_values(blocks, block_count):
    for i, block in enumerate(blocks):
        histogram = defaultdict(int)
        for byte in block:
            histogram[byte] += 1
        yield max((count, value)
                  for value, count in histogram.iteritems())[1]
[snip]
Would it be faster if histogram was a list initialised to [0] * 256?

Don't know. Then for every byte in the 2 GiB we have to call `ord()`.
Maybe the speedup from the list compensates for this, maybe not.

I think that having to do something with *every* byte of that really
large file *at Python level* is the main problem here. In C that's just
some primitive numbers. Python has all the object overhead.

struct's B format might help here. Also, struct.unpack_from could
probably be combined with mmap to avoid copying the input. Not to
mention that the 0..256 ints are all saved and won't be allocated/
deallocated.
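
An untested sketch of that combination (file name and block size are
placeholders; needs Python 2.5+ for struct.unpack_from, and it skips a
short block at the end of the file for brevity):

import mmap
import struct

BLOCK_SIZE = 1024 * 1024

stream = open('image.bin', 'rb')
try:
    mapped = mmap.mmap(stream.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        for offset in range(0, len(mapped) - BLOCK_SIZE + 1, BLOCK_SIZE):
            # unpack_from reads straight out of the mapping, no intermediate
            # read() string, and the ints 0..255 come from CPython's cache.
            values = struct.unpack_from('%dB' % BLOCK_SIZE, mapped, offset)
            histogram = [0] * 256
            for value in values:
                histogram[value] += 1
            most_common = histogram.index(max(histogram))
    finally:
        mapped.close()
finally:
    stream.close()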
 
Sion Arrowsmith

Grant Edwards said:
Sure. The larger the file, the more you gain from mmap'ing it.
2GB should easily fit within the process's virtual memory
space.

Assuming you're in a 64-bit world. Me, I've only got 2GB of address
space available to play in -- mmap'ing all of it is out of the question.

But I suppose that mmap'ing it a chunk at a time instead of reading it
a chunk at a time might be worth considering.
 
sturlamolden

You've snipped the bit further on in that sentence where the OP
says that the file of interest is 2GB. Do you still want to try
mmap'ing it?

Python's mmap object does not take an offset parameter. If it did, one
could mmap smaller portions of the file.
 
Sion Arrowsmith

In case the cancel didn't get through:

Sion Arrowsmith said:
Assuming you're in a 64bit world. Me, I've only got 2GB of address
space available to play in -- mmap'ing all of it is out of the question.

And today's moral is: try it before posting. Yeah, I can map a 2GB
file no problem, complete with associated 2GB+ allocated VM. The
addressing is clearly not working how I was expecting it to.
 
sturlamolden

And today's moral is: try it before posting. Yeah, I can map a 2GB
file no problem, complete with associated 2GB+ allocated VM. The
addressing is clearly not working how I was expecting it to.

The virtual memory space of a 32 bit process is 4 GB.
 
Hrvoje Niksic

sturlamolden said:
Python's mmap object does not take an offset parameter. If it did, one
could mmap smaller portions of the file.

As of 2.6 it does, but that might not be of much use if you're using
2.5.x or earlier. If you speak Python/C and really need offset, you
could backport the mmap module from 2.6 and compile it under a
different name for 2.5.
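
For example (untested, Python 2.6; the file name, window size and
process() are placeholders -- offsets have to be multiples of
mmap.ALLOCATIONGRANULARITY, which 64 MiB windows satisfy):

import mmap

WINDOW = 64 * 1024 * 1024


def process(window):
    # Stand-in for the real per-window work; `window` can be indexed
    # and sliced like a string.
    return window[:16]


stream = open('image.bin', 'rb')
try:
    stream.seek(0, 2)
    size = stream.tell()
    for offset in range(0, size, WINDOW):
        length = min(WINDOW, size - offset)
        window = mmap.mmap(stream.fileno(), length,
                           access=mmap.ACCESS_READ, offset=offset)
        try:
            process(window)
        finally:
            window.close()
finally:
    stream.close()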
 
Steve Holden

sturlamolden said:
The virtual memory space of a 32 bit process is 4 GB.
I believe, though, that in some environments 2GB of that is mapped onto
the operating system, to allow system calls to access OS memory
structures without any VM remapping being required - see

http://blogs.technet.com/markrussinovich/archive/2008/11/17/3155406.aspx.

Things have, however, improved if we are to believe what we read in

http://www.tenouk.com/WinVirtualAddressSpace.html

The very idea of mapping part of a process's virtual address space onto
an area in which "low-level system code resides, so writing to this
region may corrupt the system, with potentially catastrophic
consequences" seems to be asking for trouble to me. It's surprising
things used to don't go wrong with Windows all the time, really. Oh,
wait a minute, they did, didn't they? Still do for that matter ...

getting-sicker-of-vista-by-the-minute-ly yr's - steve
 
Marc 'BlackJack' Rintsch

The very idea of mapping part of a process's virtual address space onto
an area in which "low-level system code resides, so writing to this
region may corrupt the system, with potentially catastrophic
consequences" seems to be asking for trouble to me.

That's why those regions are usually "write protected" and "no execution
allowed" from the code in the user area of the virtual address space.

Ciao,
Marc 'BlackJack' Rintsch
 
David Bolen

Johannes Bauer said:
Yup, I changed the Python code to behave the same way the C code did -
however overall it's not much of an improvement: Takes about 15 minutes
to execute (still factor 23).

Not sure this is completely fair if you're only looking for a pure
Python solution, but to be honest, looping through a gazillion
individual bytes of information sort of begs for trying to offload
that into a library that can execute faster, while maintaining the
convenience of Python outside of the pure number crunching.

I'd assume numeric/numpy might have applicable functions, but I don't
use those libraries much, whereas I've been using OpenCV recently for
a lot of image processing work, and it has matrix/histogram support,
which seems to be a good match for your needs.

For example, assuming the OpenCV library and ctypes-opencv wrapper, add
the following before the file I/O loop:

from opencv import *

# Histogram for each file chunk
hist = cvCreateHist([256], CV_HIST_ARRAY, [(0,256)])

then, replace (using one of your posted methods as a sample):

datamap = { }
for i in data:
    datamap[i] = datamap.get(i, 0) + 1

array = sorted([(b, a) for (a, b) in datamap.items()], reverse=True)
most = ord(array[0][1])

with:

matrix = cvMat(1, len(data), CV_8UC1, data)
cvCalcHist([matrix], hist)
most = cvGetMinMaxHistValue(hist,
                            min_val = False, max_val = False,
                            min_idx = False, max_idx = True)

should give you your results in a fraction of the time. I didn't run
with a full size data file, but for a smaller one using smaller chunks
the OpenCV variant ran in about 1/10 of the time, and that was while
leaving all the other remaining Python code in place.

Note that it may not give identical results to some of your other
methods in the case of multiple values with the same counts, as the
OpenCV histogram min/max call will always pick the lower value in such
cases, whereas some of your code (such as above) will pick the upper
value, and your original code depended on the order of information
returned by dict.items.

This sort of small dedicated high performance choke point is probably
also perfect for something like Pyrex/Cython, although that would
require a compiler to build the extension for the histogram code.
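
For what it's worth, numpy can do the same histogram-and-maximum step
without OpenCV; an untested sketch (file name and chunk size are
placeholders, `data` stands in for one chunk as read in your loop):

import numpy

data = open('image.bin', 'rb').read(1024 * 1024)

values = numpy.frombuffer(data, dtype=numpy.uint8)
counts = numpy.bincount(values)   # frequency of each byte value
most = int(counts.argmax())       # most frequent byte value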

-- David
 
mk

John said:
The factor of 30 indeed does not seem right -- I have done somewhat
similar stuff (calculating Levenshtein distance [edit distance] on words
read from very large files), coded the same algorithm in pure Python and
C++ (using linked lists in C++) and Python version was 2.5 times slower.
Levenshtein distance using linked lists? That's novel. Care to
divulge?

I meant: using linked lists to store words that are compared. I found
using vectors was slow.

Regards,
mk
 
