Python 3.2 bug? Reading the last line of a file

tkpmep

The following function, which returns the last line of a file, works
perfectly well under Python 2.7.1 but fails reliably under Python 3.2.
Is this a bug, or am I doing something wrong? Any help would be
greatly appreciated.


import os

def lastLine(filename):
    '''
    Returns the last line of a file
    file.seek takes an optional 'whence' argument which allows you to
    start looking at the end, so you can just work back from there till
    you hit the first newline that has anything after it
    Works perfectly under Python 2.7, but not under 3.2!
    '''
    offset = -50
    with open(filename) as f:
        while offset > -1024:
            offset *= 2
            f.seek(offset, os.SEEK_END)
            lines = f.readlines()
            if len(lines) > 1:
                return lines[-1]

If I execute this with a valid filename fn, I get the following error
message:
Traceback (most recent call last):
  File "<pyshell#12>", line 1, in <module>
    lastLine(fn)
  File "<pyshell#11>", line 13, in lastLine
    f.seek(offset, os.SEEK_END)
io.UnsupportedOperation: can't do nonzero end-relative seeks

Sincerely

Thomas Philips
 
MRAB

You're opening the file in text mode, and seeking relative to the end
of the file is not allowed in text mode, presumably because the file
contents have to be decoded, and, in general, seeking to an arbitrary
position within a sequence of encoded bytes can have undefined results
when you attempt to decode to Unicode starting from that position.

The strange thing is that you _are_ allowed to seek relative to the
start of the file.

Try opening the file in binary mode and doing the decoding yourself,
catching the UnicodeDecodeError exceptions if/when they occur.
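
A minimal sketch of that suggestion, not a definitive implementation. The name last_line and the encoding and chunk_size parameters are made up for illustration, and the sketch assumes an ASCII-compatible encoding such as UTF-8 (so that newline bytes can be found in the raw data) and a positive chunk_size:

```python
import os

def last_line(filename, encoding="utf-8", chunk_size=1024):
    """Return the last line of a file as str, without its line ending.

    Hypothetical helper: reads a growing window from the end of the file
    in binary mode, then decodes only the final line.
    """
    with open(filename, "rb") as f:
        f.seek(0, os.SEEK_END)
        file_size = f.tell()
        window = min(chunk_size, file_size)
        while True:
            # Absolute seeks are always allowed; in binary mode an
            # end-relative seek would work here too.
            f.seek(file_size - window)
            lines = f.read().splitlines()
            # Two or more pieces mean the last one is a complete line;
            # so does having read the whole file.
            if len(lines) > 1 or window == file_size:
                break
            window = min(window * 2, file_size)
    if not lines:
        return ""
    try:
        return lines[-1].decode(encoding)
    except UnicodeDecodeError:
        # Fall back to replacement characters rather than failing outright.
        return lines[-1].decode(encoding, errors="replace")
```

Because the window only ever starts in the middle of the first piece, a multi-byte character cut in half at the window boundary never ends up in the line that gets decoded.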
 
Ian Kelly

I think that with text files seek() is only really meant to be called
with values returned from tell(), which may include the decoder state
in its return value.
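
A small sketch of that usage pattern, assuming a scratch file created just for the demonstration (the path and file contents are made up): a value from tell() can be handed back to seek(), while an arbitrary end-relative offset is rejected in text mode.

```python
import io
import os
import tempfile

# Hypothetical scratch file containing multi-byte UTF-8 characters.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("première ligne\ndeuxième ligne\n")

seek_error = None
with open(path, encoding="utf-8") as f:
    f.readline()
    cookie = f.tell()      # opaque number; only meaningful to seek()
    rest = f.read()
    f.seek(cookie)         # fine: the value came from tell()
    again = f.read()       # same text as `rest`
    try:
        f.seek(-5, os.SEEK_END)   # arbitrary end-relative offset
    except io.UnsupportedOperation as exc:
        seek_error = exc   # text mode refuses this seek
```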
 
MRAB

What do you mean by "may include the decoder state in its return value"?

It does make sense that the values returned from tell() won't be in the
middle of an encoded sequence of bytes.
 
tkpmep

Thanks for the guidance. It was indeed an issue with reading in
binary vs. text mode, and I do now succeed in reading the last line,
except that I now seem unable to split it, as I demonstrate below.
Here's what I get when I read the last line in text mode using 2.7.1
and in binary mode using 3.2, respectively, under IDLE:

2.7.1
Name 31/12/2009 0 0 0

3.2
b'Name\t31/12/2009\t0\t0\t0\r\n'

If, under 2.7.1, I read the file in text mode and write
x = lastLine(fn)
I can then cleanly split the line to get its contents
x.split('\t')
['Name', '31/12/2009', '0', '0', '0\n']

but under 3.2, with its binary read, I get
x.split('\t')
Traceback (most recent call last):
  File "<pyshell#26>", line 1, in <module>
    x.split('\t')
TypeError: Type str doesn't support the buffer API

If I remove the '\t', the split now works and I get a list of bytes
literals
x.split()
[b'Name', b'31/12/2009', b'0', b'0', b'0']

Looking through the docs did not clarify my understanding of the
issue. Why can I not split on '\t' when reading in binary mode?

Sincerely

Thomas Philips
 
MRAB

x.split('\t') tries to split on '\t', a string (str), but x is a
bytestring (bytes).

Do x.split(b'\t') instead.
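
A short illustration of the difference, using the line quoted in the question. The exact wording of the TypeError varies between Python 3 versions, so the sketch only captures it rather than assuming a particular message:

```python
x = b'Name\t31/12/2009\t0\t0\t0\r\n'

message = None
try:
    x.split('\t')              # str separator on a bytes object
except TypeError as exc:
    message = str(exc)         # wording differs across Python 3 versions

by_tab = x.split(b'\t')        # bytes separator: works
by_whitespace = x.split()      # no separator: splits on ASCII whitespace

print(by_tab)         # [b'Name', b'31/12/2009', b'0', b'0', b'0\r\n']
print(by_whitespace)  # [b'Name', b'31/12/2009', b'0', b'0', b'0']
```

Calling split() with no argument works because no separator of the wrong type is passed; runs of ASCII whitespace, including the trailing '\r\n', are treated as the delimiter.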
 
Ethan Furman

Thanks for the guidance - it was indeed an issue with reading in
binary vs. text., and I do now succeed in reading the last line,
except that I now seem unable to split it, as I demonstrate below.
Here's what I get when I read the last line in text mode using 2.7.1
and in binary mode using 3.2 respectively under IDLE:

3.2
b'Name\t31/12/2009\t0\t0\t0\r\n'

under 3.2, with its binary read, I get
--> x.split('\t')
Traceback (most recent call last):
  File "<pyshell#26>", line 1, in <module>
    x.split('\t')
TypeError: Type str doesn't support the buffer API

You are trying to split a bytes object with a str object -- the two are
not compatible. Try splitting with the bytes object b'\t'.

~Ethan~
 
Ethan Furman

MRAB said:
Thanks for the guidance - it was indeed an issue with reading in
binary vs. text., and I do now succeed in reading the last line,
except that I now seem unable to split it, as I demonstrate below.
Here's what I get when I read the last line in text mode using 2.7.1
and in binary mode using 3.2 respectively under IDLE:

2.7.1
Name 31/12/2009 0 0 0

3.2
b'Name\t31/12/2009\t0\t0\t0\r\n'

if, under 2.7.1 I read the file in text mode and write
x = lastLine(fn)
I can then cleanly split the line to get its contents
x.split('\t')
['Name', '31/12/2009', '0', '0', '0\n']

but under 3.2, with its binary read, I get
x.split('\t')
Traceback (most recent call last):
  File "<pyshell#26>", line 1, in <module>
    x.split('\t')
TypeError: Type str doesn't support the buffer API

If I remove the '\t', the split now works and I get a list of bytes
literals
x.split()
[b'Name', b'31/12/2009', b'0', b'0', b'0']

Looking through the docs did not clarify my understanding of the
issue. Why can I not split on '\t' when reading in binary mode?
x.split('\t') tries to split on '\t', a string (str), but x is a
bytestring (bytes).

Do x.split(b'\t') instead.

<nitpick>
Instances of the bytes class are more appropriately called 'bytes
objects' rather than 'bytestrings', as they are really immutable
sequences of integers. Accessing a single element of a bytes object
does not return a bytes object, but rather the integer at that
location; i.e.

--> b'xyz'[1]
121

Contrast that with the str type where

--> 'xyz'[1]
'y'
</nitpick>
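
To round out the nitpick, a brief sketch of how indexing and slicing differ on a bytes object in Python 3 (the variable names are arbitrary):

```python
data = b'xyz'

element = data[1]        # indexing gives an int: 121
piece = data[1:2]        # slicing gives a bytes object: b'y'
as_ints = list(data)     # [120, 121, 122]
rebuilt = bytes([121])   # back from an int to a one-byte bytes: b'y'
as_char = chr(element)   # 'y', valid here because the byte is ASCII
```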

~Ethan~
 
Ian Kelly

What do you mean by "may include the decoder state in its return value"?

It does make sense that the values returned from tell() won't be in the
middle of an encoded sequence of bytes.

If you take a look at the source code, tell() returns a long that
includes decoder state data in the upper bytes. For example:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\python32\lib\codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
  File "c:\python32\lib\encodings\utf_16.py", line 61, in _buffer_decode
    codecs.utf_16_ex_decode(input, errors, 0, final)
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 6-6: truncated data

The problem of course is the initial space, throwing off the decoder.
We can try to seek past it:
'\ufeff\u0302a'

But notice that since we're not reading from the beginning of the
file, the BOM has now been interpreted as data. However:
'\u0302a'

And you can see that instead of reading from position
73786976294838206465 it has read from position 1 starting in the "read
a BOM" state. Note that I wouldn't recommend doing anything remotely
like this in production code, not least because the value that I
passed into seek() is platform-dependent. This is just a
demonstration of how the seek() value can include decoder state.

Cheers,
Ian
 
Jussi Piitulainen

Looking through the docs did not clarify my understanding of the
issue. Why can I not split on '\t' when reading in binary mode?

You can split on b'\t' to get a list of bytes objects, which you can
then decode if you want them as strings.

You can decode the bytes to get a string and then split on '\t' to get
strings.
>>> b'tic\ttac\ttoe'.split(b'\t')
[b'tic', b'tac', b'toe']
>>> b'tic\ttac\ttoe'.decode('utf-8').split('\t')
['tic', 'tac', 'toe']
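
Applied to the line from the question, the decode-then-split route might look like the sketch below. The UTF-8 encoding, the day/month/year date format, and the reading of the last three fields as integers are assumptions about the data, not something stated in the thread:

```python
from datetime import datetime

raw = b'Name\t31/12/2009\t0\t0\t0\r\n'

# Decode first (UTF-8 is assumed), drop the line ending, then split on tabs.
fields = raw.decode('utf-8').rstrip('\r\n').split('\t')

name = fields[0]
# The dd/mm/yyyy format is a guess based on the sample value.
date = datetime.strptime(fields[1], '%d/%m/%Y').date()
values = [int(v) for v in fields[2:]]

print(fields)   # ['Name', '31/12/2009', '0', '0', '0']
print(values)   # [0, 0, 0]
```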
 
tkpmep

This is exactly what I want to do. I can then pick up various
elements of the list and turn them into floats, ints, etc. I have
never used decode, and will look it up in the docs to better understand
it. I can't thank everyone enough for the generous serving of help and
guidance. I certainly would not have discovered all this on my own.

Sincerely


Thomas Philips
 
