readlines() reading incorrect number of lines?

Wojciech Gryc · Dec 20, 2007

Hi,

I'm currently using Python to deal with a fairly large text file (800
MB), which I know has about 85,000 lines of text. I can confirm this
because (1) I built the file myself, and (2) running a basic Java
program to count lines yields a number in that range.

However, when I use Python's various methods -- readline(),
readlines(), or xreadlines() and loop through the lines of the file,
the program exits at 16,000 lines. No error output or anything --
it seems the end of the loop was reached, and the code was executed
successfully.

I'm baffled and confused, and would be grateful for any advice as to
what I'm doing wrong, or why this may be happening.

Thank you,
Wojciech Gryc

John Machin · Dec 20, 2007

What platform, what version of python?

One possibility: you are running this on Windows and the file contains
Ctrl-Z aka chr(26) aka '\x1a'.
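
One way to check this hypothesis without relying on text mode is to scan the raw bytes for 0x1A. The following is a minimal sketch in Python 3 syntax; the function name first_ctrl_z_offset and the default block size are illustrative and do not come from the thread.

```python
def first_ctrl_z_offset(path, block_size=1024 * 1024):
    """Return the byte offset of the first 0x1A byte in path, or -1 if none."""
    offset = 0
    # Binary mode returns the bytes as stored, so 0x1A is not treated as end of file.
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                return -1
            pos = block.find(b"\x1a")
            if pos != -1:
                return offset + pos
            offset += len(block)
```

If the offset returned is close to where the text-mode read stopped, the Ctrl-Z explanation is very likely the right one.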

Wojciech Gryc · Dec 20, 2007

Hi,

Python 2.5, on Windows XP. Actually, I think you may be right about
\x1a -- there's a few lines that definitely have some strange
character sequences, so this would make sense... Would you happen to
know how I can actually fix this (e.g. replace the character)? Since
Python doesn't see the rest of the file, I don't even know how to get
to it to fix the problem... Due to the nature of the data I'm working
with, manual editing is also not an option.

Thanks,
Wojciech

John Machin · Dec 20, 2007

Please don't top-post.

Quick hack to remove all occurrences of '\x1a' (untested):

fin = open('old_file', 'rb') # note b BINARY
fout = open('new_file', 'wb')
blksz = 1024 * 1024
while True:
    blk = fin.read(blksz)
    if not blk: break
    fout.write(blk.replace('\x1a', ''))
fout.close()
fin.close()
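
For readers on Python 3, the same block-by-block idea might look like the sketch below. The function name strip_ctrl_z and the returned count are additions for illustration, not part of the hack above. Because the pattern is a single byte, it cannot straddle a block boundary, so no overlap handling is needed.

```python
def strip_ctrl_z(src_path, dst_path, block_size=1024 * 1024):
    """Copy src_path to dst_path without 0x1A bytes; return how many were dropped."""
    removed = 0
    # Both files are opened in binary mode so every byte is seen and written as-is.
    with open(src_path, "rb") as fin, open(dst_path, "wb") as fout:
        while True:
            block = fin.read(block_size)
            if not block:
                break
            removed += block.count(b"\x1a")
            fout.write(block.replace(b"\x1a", b""))
    return removed
```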

You may, however, want to investigate the "strange character sequences"
that have somehow appeared in your file after you built it yourself.

HTH,
John

Steven D'Aprano · Dec 20, 2007

[Fixing top-posting.]

Open the file in binary mode:

open(filename, 'rb')

and Windows should do no special handling of Ctrl-Z characters.
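
As a quick check that binary mode sees the whole file, the lines can be counted without any text-mode translation. This is a small sketch in Python 3 syntax; count_lines_binary is a hypothetical helper name.

```python
def count_lines_binary(path):
    """Count lines by iterating over the file in binary mode."""
    count = 0
    with open(path, "rb") as f:
        # Binary iteration splits on b"\n" only; other bytes pass through untouched.
        for _ in f:
            count += 1
    return count
```

For the file described in the original post, the result should land near the expected 85,000 rather than 16,000.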

John Machin · Dec 20, 2007

I don't know whether it's a bug or a feature or just a dark corner,
but using mode='rU' does no special handling of Ctrl-Z either.

>>> x = 'foo\r\n\x1abar\r\n'
>>> f = open('udcray.txt', 'wb')
>>> f.write(x)
>>> f.close()
>>> open('udcray.txt', 'r').readlines()
['foo\n']
>>> open('udcray.txt', 'rU').readlines()
['foo\n', '\x1abar\n']
>>> for line in open('udcray.txt', 'rU'):
...     print repr(line)
...
'foo\n'
'\x1abar\n'

Using 'rU' should make the OP's task of finding the strange character
sequences a bit easier -- he won't have to read a block at a time and
worry about the guff straddling a block boundary.
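
A line-oriented search for the odd bytes could be sketched as follows, in Python 3 syntax and using binary mode rather than 'rU'. The names SUSPECT and lines_with_control_bytes are hypothetical, and treating every byte below 0x20 except tab, LF and CR as suspect is an assumption about what counts as "strange" in this file.

```python
# Byte values below 0x20, excluding tab, line feed and carriage return.
SUSPECT = frozenset(range(0x20)) - {0x09, 0x0A, 0x0D}

def lines_with_control_bytes(path):
    """Yield (line_number, sorted suspect byte values) for each affected line."""
    with open(path, "rb") as f:
        # Iterating a binary file yields one bytes object per b"\n"-terminated line.
        for lineno, line in enumerate(f, start=1):
            found = SUSPECT.intersection(line)
            if found:
                yield lineno, sorted(found)
```

Printing the yielded pairs gives the line numbers to inspect, which helps when manual editing of the whole file is not an option.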

Gerry · Dec 21, 2007

Something I've occasionally found helpful with problem text files is
to build a histogram of character counts, something like this:

"""
chist.py
print a histogram of character frequencies in a named input file
"""

import sys

whitespace = ' \t\n\r\v\f'
lowercase = 'abcdefghijklmnopqrstuvwxyz'
uppercase = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
letters = lowercase + uppercase
ascii_lowercase = lowercase
ascii_uppercase = uppercase
ascii_letters = ascii_lowercase + ascii_uppercase
digits = '0123456789'
hexdigits = digits + 'abcdef' + 'ABCDEF'
octdigits = '01234567'
punctuation = """!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~"""
printable = digits + letters + punctuation

# the file name is the only command-line argument
try:
    fname = sys.argv[1]
except IndexError:
    print "usage is chist yourfilename"
    sys.exit()

# maps character code -> number of occurrences
chars = {}

# binary mode, so Ctrl-Z does not end the read early on Windows
f = open(fname, "rb")
lines = f.readlines()
f.close()
for line in lines:
    for c in line:
        try:
            chars[ord(c)] += 1
        except KeyError:
            chars[ord(c)] = 1

ords = chars.keys()
ords.sort()

# one row per character code; non-printable characters are shown as UNP
for o in ords:
    if chr(o) in printable:
        c = chr(o)
    else:
        c = "UNP"

    print "%5d %-5s %10d" % (o, c, chars[o])
print "_" * 50

Gerry

Gabriel Genellina · Dec 27, 2007

whitespace = ' \t\n\r\v\f'
lowercase = 'abcdefghijklmnopqrstuvwxyz'
uppercase = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
letters = lowercase + uppercase
ascii_lowercase = lowercase
ascii_uppercase = uppercase
ascii_letters = ascii_lowercase + ascii_uppercase
digits = '0123456789'
hexdigits = digits + 'abcdef' + 'ABCDEF'
octdigits = '01234567'
punctuation = """!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~"""
printable = digits + letters + punctuation

You do know that most, if not all, of those sets are available as
attributes of the string module, don't you?
You could replace all the lines above with "from string import printable",
as it's the only constant used.
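
Building on that suggestion, a Python 3 version of the histogram might lean on the string module and collections.Counter, as in the sketch below. The function names byte_histogram and format_histogram are illustrative. One detail to be aware of: string.printable also contains whitespace, which the hand-written printable in the script did not, so this sketch assembles the set from string.digits, string.ascii_letters and string.punctuation to keep the same "UNP" labels.

```python
import string
from collections import Counter

def byte_histogram(path, block_size=1024 * 1024):
    """Count how often each byte value occurs in the file at path."""
    counts = Counter()
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            counts.update(block)  # iterating bytes yields integers 0..255
    return counts

def format_histogram(counts):
    """Return one formatted row per byte value, lowest value first."""
    # Same character set as the script's own "printable": no whitespace.
    visible = set(string.digits + string.ascii_letters + string.punctuation)
    rows = []
    for value in sorted(counts):
        ch = chr(value)
        label = ch if ch in visible else "UNP"
        rows.append("%5d %-5s %10d" % (value, label, counts[value]))
    return rows
```

Reading in blocks rather than with readlines() keeps memory use flat, which matters for an 800 MB input.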
