I have a file which is very large eg over 200Mb , and i am going to use
python to code a "tail"
command to get the last few lines of the file. What is a good algorithm
for this type of task in python for very big files?
Initially, i thought of reading everything into an array from the file
and just get the last few elements (lines) but since it's a very big
file, don't think is efficient.
Well, 200mb isn't all that big these days. But it's easy to code:
# untested code
input = open(filename)
tail = input.readlines()[:tailcount]
input.close()
and you're done. However, it will go through a lot of memory. Fastest
is probably working through it backwards, but that may take multiple
tries to get everything you want:
# untested code
input = open(filename)
blocksize = tailcount * expected_line_length
tail = []
while len(tail) < tailcount:
input.seek(-blocksize, EOF)
tail = input.read().split('\n')
blocksize *= 2
input.close()
tail = tail[:tailcount]
It would probably be more efficient to read blocks backwards and paste
them together, but I'm not going to get into that.
Ok, I'll have a go (only tested slightly ;-)
... f = open(fname, 'rb')
... f.seek(0, 2)
... bufpos = f.tell() # pos from file beg == file length
... buf = ['']
... for nl in xrange(nitems):
... while len(buf)<2:
... chunk = min(chunk, bufpos)
... bufpos = bufpos-chunk
... f.seek(bufpos)
... buf = (f.read(chunk)+buf[0]).split(splitter)
... if buf== ['']: break
... if bufpos==0: break
... if len(buf)>1: yield buf.pop(); continue
... if bufpos==0: yield buf.pop(); break
...
20 lines from the tail of november's python-dev archive
lives in the mimelib project's hidden CVS on SF, but that seems pretty
silly.
Basically I'm just going to add the test script, setup.py, generated
html docs and a few additional unit tests, along with svn:external refs
to pull in Lib/email from the appropriate Python svn tree. This way,
I'll be able to create standalone email packages from the sandbox (which
I need to do because I plan on fixing a few outstanding email bugs).
-Barry
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 307 bytes
Desc: This is a digitally signed message part
Url :
http://mail.python.org/pipermail/python-dev/attachments/20051130/e88db51d/attachment.pgp
Might want to throw away the first item returned by frsplit, unless it is !='' (indicating a
last line with no \n). Splitting with os.linesep is a problematical default, since e.g. it
wouldn't work with the above archive, since it has unix endings, and I didn't download it
in a manner that would convert it.
Regards,
Bengt Richter