efficient 'tail' implementation

s99999999s2003 · Dec 8, 2005

hi

I have a file which is very large eg over 200Mb , and i am going to use
python to code a "tail"
command to get the last few lines of the file. What is a good algorithm
for this type of task in python for very big files?
Initially, i thought of reading everything into an array from the file
and just get the last few elements (lines) but since it's a very big
file, don't think is efficient.
thanks

bonono · Dec 8, 2005

[email protected] said:
hi

I have a file which is very large eg over 200Mb , and i am going to use
python to code a "tail"
command to get the last few lines of the file. What is a good algorithm
for this type of task in python for very big files?
Initially, i thought of reading everything into an array from the file
and just get the last few elements (lines) but since it's a very big
file, don't think is efficient.
thanks

I don't think this is a python specific issue but a generic problem for
all "file as byte stream" system. The problem is, "line" is not a
property of the file, but its content(some big iron system use
"records" for lines and can be addressed with O(1))

So the simplest is just read and drop until the one you want.

for x in f:
if x_is_what_I_want: something

If you really want, you can do the reverse lookup like this :

f.seek(0,EOF)
x = f.tell()

then loop byte by byte backward till you find you stuff. The is quite
cumbersome and may not be faster, depending on your content.

Mike Meyer · Dec 8, 2005

I have a file which is very large eg over 200Mb , and i am going to use
python to code a "tail"
command to get the last few lines of the file. What is a good algorithm
for this type of task in python for very big files?
Initially, i thought of reading everything into an array from the file
and just get the last few elements (lines) but since it's a very big
file, don't think is efficient.

Well, 200mb isn't all that big these days. But it's easy to code:

# untested code
input = open(filename)
tail = input.readlines()[:tailcount]
input.close()

and you're done. However, it will go through a lot of memory. Fastest
is probably working through it backwards, but that may take multiple
tries to get everything you want:

# untested code
input = open(filename)
blocksize = tailcount * expected_line_length
tail = []
while len(tail) < tailcount:
input.seek(-blocksize, EOF)
tail = input.read().split('\n')
blocksize *= 2
input.close()
tail = tail[:tailcount]

It would probably be more efficient to read blocks backwards and paste
them together, but I'm not going to get into that.

<mike

bonono · Dec 8, 2005

Mike said:
It would probably be more efficient to read blocks backwards and paste
them together, but I'm not going to get into that.

That actually is a pretty good idea. just reverse the buffer and do a
split, the last line becomes the first line and so on. The logic then
would be no different than reading from beginning of file. Just need to
keep the last "half line" of the reversed buffer if the wanted one
happens to be across buffer boundary.

Gerald Klix · Dec 8, 2005

As long as memory mapped files are available, the fastest
method is to map the whole file into memory and use the
mappings rfind method to search for an end of line.

The following code snippets may be usefull:
reportFile = open( filename )
length = os.fstat( reportFile.fileno() ).st_size
if length == 0:
# Don't map zero length files, windows will barf
continue
try:
mapping = mmap.mmap( reportFile.fileno(), length,
mmap.MAP_PRIVATE, mmap.PROT_READ )
except AttributeError:
mapping = mmap.mmap(
reportFile.fileno(),
0, None,
mmap.ACCESS_READ )

Then you can use
mapping.rfind( os.linesep )
to find the end of the but last line and so on.

This is very fast, because nearly all work is done by are rfind, which
is implemented in C and the OS' paging logic.

HTH,
Gerald

Jerry Sievers · Dec 8, 2005

hi

I have a file which is very large eg over 200Mb , and i am going to
use python to code a "tail" command to get the last few lines of the
file. What is a good algorithm for this type of task in python for
very big files? Initially, i thought of reading everything into an
array from the file and just get the last few elements (lines) but
since it's a very big file, don't think is efficient. thanks

First of all, what makes you think that tail on your system isn't
already optinized for this? Devil's advocate here... I have no clue
really.

Anyway, if you must roll your own;

Determine some reasonable max line size,, multiply this by a few
larger than the numbers of lines that you want to tail, seek to EOF
and then back to the position in the file where this chunk would
start, get and split that chunk into lines and now output the last 3
or however many you need.

If the read fails to hold enough lines, seek back a bit further and do
same but you'll have to be prepared to concat the second and Nth last
chunks together.

Have fun

Nick Craig-Wood · Dec 8, 2005

Gerald Klix said:
As long as memory mapped files are available, the fastest
method is to map the whole file into memory and use the
mappings rfind method to search for an end of line.

Excellent idea.

It'll blow up for large >2GB files on a 32bit OS though.

Nick Craig-Wood · Dec 8, 2005

Gerald Klix said:
As long as memory mapped files are available, the fastest
method is to map the whole file into memory and use the
mappings rfind method to search for an end of line.

Actually mmap doesn't appear to have an rfind method :-(

Here is a tested solution using mmap using your code. Inefficient if
number of lines to be tailed is too big.

import os
import sys
import mmap

def main(nlines, filename):
reportFile = open( filename )
length = os.fstat( reportFile.fileno() ).st_size
if length == 0:
# Don't map zero length files, windows will barf
return
try:
mapping = mmap.mmap( reportFile.fileno(), length,
mmap.MAP_PRIVATE, mmap.PROT_READ )
except AttributeError:
mapping = mmap.mmap(
reportFile.fileno(),
0, None,
mmap.ACCESS_READ )
search = 1024
lines = []
while 1:
if search > length:
search = length
tail = mapping[length-search:]
lines = tail.split(os.linesep)
if len(lines) >= nlines or search == length:
break
search *= 2
lines = lines[-nlines-1:]
print "\n".join(lines)

if __name__ == "__main__":
if len(sys.argv) != 3:
print "Syntax: %s n file" % sys.argv[0]
else:
main(int(sys.argv[1]), sys.argv[2])

Neal Becker · Dec 8, 2005

hi

I have a file which is very large eg over 200Mb , and i am going to use
python to code a "tail"
command to get the last few lines of the file. What is a good algorithm
for this type of task in python for very big files?
Initially, i thought of reading everything into an array from the file
and just get the last few elements (lines) but since it's a very big
file, don't think is efficient.
thanks

You should look at pyinotify. I assume we're talking linux here.

Bengt Richter · Dec 9, 2005

[email protected] said:
[email protected] said:

I have a file which is very large eg over 200Mb , and i am going to use
python to code a "tail"
command to get the last few lines of the file. What is a good algorithm
for this type of task in python for very big files?
Initially, i thought of reading everything into an array from the file
and just get the last few elements (lines) but since it's a very big
file, don't think is efficient.

Click to expand...

Well, 200mb isn't all that big these days. But it's easy to code:

# untested code
input = open(filename)
tail = input.readlines()[:tailcount]
input.close()

and you're done. However, it will go through a lot of memory. Fastest
is probably working through it backwards, but that may take multiple
tries to get everything you want:

# untested code
input = open(filename)
blocksize = tailcount * expected_line_length
tail = []
while len(tail) < tailcount:
input.seek(-blocksize, EOF)
tail = input.read().split('\n')
blocksize *= 2
input.close()
tail = tail[:tailcount]

It would probably be more efficient to read blocks backwards and paste
them together, but I'm not going to get into that.

Ok, I'll have a go (only tested slightly ;-)
... f = open(fname, 'rb')
... f.seek(0, 2)
... bufpos = f.tell() # pos from file beg == file length
... buf = ['']
... for nl in xrange(nitems):
... while len(buf)<2:
... chunk = min(chunk, bufpos)
... bufpos = bufpos-chunk
... f.seek(bufpos)
... buf = (f.read(chunk)+buf[0]).split(splitter)
... if buf== ['']: break
... if bufpos==0: break
... if len(buf)>1: yield buf.pop(); continue
... if bufpos==0: yield buf.pop(); break
...

20 lines from the tail of november's python-dev archive
lives in the mimelib project's hidden CVS on SF, but that seems pretty
silly.

Basically I'm just going to add the test script, setup.py, generated
html docs and a few additional unit tests, along with svn:external refs
to pull in Lib/email from the appropriate Python svn tree. This way,
I'll be able to create standalone email packages from the sandbox (which
I need to do because I plan on fixing a few outstanding email bugs).

-Barry

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 307 bytes
Desc: This is a digitally signed message part
Url : http://mail.python.org/pipermail/python-dev/attachments/20051130/e88db51d/attachment.pgp

Might want to throw away the first item returned by frsplit, unless it is !='' (indicating a
last line with no \n). Splitting with os.linesep is a problematical default, since e.g. it
wouldn't work with the above archive, since it has unix endings, and I didn't download it
in a manner that would convert it.

Regards,
Bengt Richter

Magnus Lycka · Dec 9, 2005

hi

I have a file which is very large eg over 200Mb , and i am going to use
python to code a "tail"
command to get the last few lines of the file. What is a good algorithm
for this type of task in python for very big files?
Initially, i thought of reading everything into an array from the file
and just get the last few elements (lines) but since it's a very big
file, don't think is efficient.
thanks

To read the last x bytes of a file, you could do:

Maybe that's a start. I didn't try it on a anything bigger than 16MB,
but it was more or less instantaneous for 16Megs.

If you want the last X lines and know that lines are no more than N
chars, f.seek(l-X*N); f.readlines()[-X:] should give you what you
need... (I think...I didn't test it.)

Colin J. Williams · Dec 9, 2005

hi

I have a file which is very large eg over 200Mb , and i am going to use
python to code a "tail"
command to get the last few lines of the file. What is a good algorithm
for this type of task in python for very big files?
Initially, i thought of reading everything into an array from the file
and just get the last few elements (lines) but since it's a very big
file, don't think is efficient.
thanks

If your file is seekable you could consider using file's seek function.

This would appear to give you a way of just working with the last, say 1mb.

Colin W.

Marius Gedminas · Dec 15, 2005

Magnus said:
To read the last x bytes of a file, you could do:

You don't need fstat/st_size, you can ask seek to move to an offset
relative to the end of the file:

My attempts in playing with tail-recursion in python	1	Aug 28, 2013
Python's tail -f implementation for remote files	1	May 28, 2008
tail recursion optimization	9	Jun 27, 2011
lists as an efficient implementation of large two-dimensionalarrays(!)	0	Feb 2, 2010
Any idea to emulate tail -f	5	May 5, 2009
Having trouble with tail -f standard input	3	Aug 21, 2008
Efficient way of looging in python	2	Apr 25, 2013
Red Black Tree implementation?	14	May 2, 2013

efficient 'tail' implementation

s99999999s2003

bonono

Mike Meyer

bonono

Gerald Klix

Jerry Sievers

Nick Craig-Wood

Nick Craig-Wood

Neal Becker

Bengt Richter

Magnus Lycka

Colin J. Williams

Marius Gedminas

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads