read text file byte by byte

sjdevnull · Dec 14, 2009

The OP hasn't told us what version of Python he's using on what OS. On
Windows, text mode will compress the end-of-line sequence into a single
"\n". In Python 3.x, f.read(1) will read one character, which may be more
than one byte depending on the encoding.

The 3.1 documentation specifies that file.read returns bytes:

file.read([size])
Read at most size bytes from the file (less if the read hits EOF
before obtaining size bytes). If the size argument is negative or
omitted, read all data until EOF is reached. The bytes are returned as
a string object. An empty string is returned when EOF is encountered
immediately. (For certain files, like ttys, it makes sense to continue
reading after an EOF is hit.) Note that this method may call the
underlying C function fread() more than once in an effort to acquire
as close to size bytes as possible. Also note that when in
non-blocking mode, less data than was requested may be returned, even if
no size parameter was given.

Does it need fixing?

Dennis Lee Bieber · Dec 14, 2009

I'm still running 2.5 (Maybe next spring I'll see if all the third
party libraries I have exist in 2.6 versions)... BUT...

"... are returned as a string object..." Aren't "strings" in 3.x now
unicode? Which would imply, to me, that the interpretation of the
contents will not be plain bytes.

sjdevnull · Dec 14, 2009

I'm not even concerned (yet) about how the data is interpreted after
it's read. First I'm trying to clarify what exactly gets read.

The post I was replying to said "In Python 3.x, f.read(1) will read
one character, which may be more than one byte depending on the
encoding."

That seems at odds with the documentation saying "Read at most size
bytes from the file"--the fact that it's documented to read "size"
bytes rather than "size" (possibly multibyte) characters is emphasized
by the later language saying that the underlying C fread() call may be
called enough times to read as close to size bytes as possible.

If the poster I was replying to is correct, it seems like a
documentation update is in order. As a long-time programmer, I would
be very surprised to make a call to f.read(X) and have it return more
than X bytes if I hadn't read this here.
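
A minimal sketch of the difference being discussed, assuming Python 3 and a UTF-8 encoded stream. The sample data and variable names are made up for illustration. In text mode read(1) returns one character, which here takes two bytes; in binary mode read(1) returns exactly one byte.

```python
import io

# Hypothetical sample: U+00E9 (e with acute accent) is two bytes in UTF-8.
raw = "\u00e9a".encode("utf-8")          # b'\xc3\xa9a'

# Text mode: read(1) returns one character, however many bytes it needs.
text_stream = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8")
first_char = text_stream.read(1)

# Binary mode: read(1) returns exactly one byte.
byte_stream = io.BytesIO(raw)
first_byte = byte_stream.read(1)

print(ascii(first_char), len(first_char.encode("utf-8")))
print(repr(first_byte), len(first_byte))
```

So the "size" argument counts characters or bytes depending on how the stream was opened, which is why the wording in the quoted documentation is confusing.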

Nobody · Dec 14, 2009

The 3.1 documentation specifies that file.read returns bytes:

Does it need fixing?

There are no file objects in 3.x. The file() function no longer
exists. The return value from open() will be an instance of
_io.<something> depending upon the mode, e.g. _io.TextIOWrapper for 'r',
_io.BufferedReader for 'rb', _io.BufferedRandom for 'w+b', etc.

http://docs.python.org/3.1/library/io.html

io.IOBase.read() doesn't exist, io.RawIOBase.read(n) reads n bytes,
io.TextIOBase.read(n) reads n characters.
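
This can be checked with a short sketch, assuming CPython 3 and a throwaway temporary file (the path and the type_names dictionary are only for illustration). It records the class of the object that open() returns for each of the modes mentioned above.

```python
import os
import tempfile

# Create a throwaway file so each open() call has something to work on.
fd, path = tempfile.mkstemp()
os.close(fd)

type_names = {}
for mode in ("r", "rb", "w+b"):
    with open(path, mode) as f:
        type_names[mode] = type(f).__name__

os.remove(path)

for mode, name in type_names.items():
    print(mode, "->", name)
```

The class names are what CPython reports; other implementations may expose them under a different module.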

Nobody · Dec 14, 2009

It's still more efficient to read in blocks, even if you're going to
process the bytes one at a time.

That's fine for a file. If you're reading from a pipe, socket, etc, you
typically want to take what you can get when you can get it (although this
is easier said than done in Python), rather than waiting for a complete
"block". This is often a primary reason for choosing a stream cipher over
a block cipher, as it eliminates the need to add and remove padding for
intermittent data flows.
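
As a rough illustration of "take what you can get", here is a sketch that assumes a pipe created with os.pipe() as a stand-in for any intermittent stream. os.read() is a thin wrapper over the operating system's read call, so it returns whatever is available up to the requested size instead of waiting for a full block.

```python
import os

# A pipe stands in for any stream whose data arrives intermittently.
read_fd, write_fd = os.pipe()

# Only 3 bytes are available so far.
os.write(write_fd, b"abc")

# os.read() asks for up to 1024 bytes but returns as soon as some data
# is available, rather than waiting for the full amount.
chunk = os.read(read_fd, 1024)

os.close(write_fd)
os.close(read_fd)

print(chunk)
```

By contrast, read(1024) on a buffered file object wrapped around the same descriptor would normally keep reading until it has 1024 bytes or hits EOF.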

Gabriel Genellina · Dec 15, 2009

Nobody said:
There are no file objects in 3.x. The file() function no longer
exists. The return value from open() will be an instance of
_io.<something> depending upon the mode, e.g. _io.TextIOWrapper for 'r',
_io.BufferedReader for 'rb', _io.BufferedRandom for 'w+b', etc.

http://docs.python.org/3.1/library/io.html

io.IOBase.read() doesn't exist, io.RawIOBase.read(n) reads n bytes,
io.TextIOBase.read(n) reads n characters.

So basically this section [1] should not exist, or be completely rewritten?
At least the references to C stdio library seem wrong to me.

[1] http://docs.python.org/3.1/library/stdtypes.html#file-objects

sjdevnull · Dec 15, 2009

There are no file objects in 3.x.

Then the documentation definitely needs fixing; the excerpt I posted
earlier is from the 3.1 documentation's section about file objects:
http://docs.python.org/3.1/library/stdtypes.html#file-objects

Which begins:

"5.9 File Objects

File objects are implemented using C’s stdio package and can be
created with the built-in open() function. File objects are also
returned by some other built-in functions and methods, such as
os.popen() and os.fdopen() and the makefile() method of socket objects."

(It goes on to describe the read method's operation on bytes that I
quoted upthread.)

Sadly I'm not familiar enough with 3.x to suggest an appropriate edit.

Terry Reedy · Dec 15, 2009

So basically this section [1] should not exist, or be completely rewritten?
At least the references to C stdio library seem wrong to me.

[1] http://docs.python.org/3.1/library/stdtypes.html#file-objects

I agree.
http://bugs.python.org/issue7508

Terry Jan Reedy

daved170 · Dec 15, 2009

Thank you all.
Dennis, I really liked your solution for the issue, but I have two
questions about it:
1) My original file is a text file, not a binary one.

Do you need to process the bytes in the file as they are? Or do you
accept changes in line-endings (M$ Windows "text" files use <cr><lf> as
line ending, but if you read it in Python as "text" <cr><lf> is
converted to a single <lf>)?

2) I need to read 1 byte each time. I didn't see that in your example
code.

You've never explained why you need to READ 1 byte at a time, vs
reading a block (I chose 1KB) and processing each byte IN THE BLOCK.
After all, if you do use 1 byte I/O, your program is going to be very
slow, as each read is blocking (suspends) while asking the O/S for the
next character in the file (this depends upon the underlying I/O library
implementation -- I suspect any modern I/O system is still reading some
block size [256 to 4K] and then returning parts of that block as
needed). OTOH, reading a block at a time makes for one suspension and
then a lot of data to be processed however you want.

You originally stated that you want to "scramble" the bytes -- if
you mean to implement some sort of encryption algorithm you should know
that most of them work in blocks as the "key" is longer than one byte.

My sample reads in chunks, then the scramble function XORs each byte
with the corresponding byte in the supplied key string, finally
rejoining all the now individual bytes into a single chunk for
subsequent output.
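
The sample itself is not reproduced in this thread, so the following is only a sketch of the approach described above, written for Python 3. The names scramble, scramble_stream and CHUNK_SIZE are hypothetical, and letting the key repeat and carry over between chunks is an assumption rather than something the description confirms. In-memory streams stand in for files opened in 'rb' and 'wb' mode.

```python
import io
from itertools import cycle

CHUNK_SIZE = 1024  # 1 KB blocks, as mentioned above


def scramble(chunk, key_bytes):
    """XOR each byte of chunk with the next byte from key_bytes."""
    return bytes(b ^ k for b, k in zip(chunk, key_bytes))


def scramble_stream(src, dst, key):
    """Read src in blocks, scramble each block, and write it to dst."""
    key_bytes = cycle(key)  # assumption: the key repeats across chunks
    while True:
        chunk = src.read(CHUNK_SIZE)
        if not chunk:       # an empty bytes object means EOF
            break
        dst.write(scramble(chunk, key_bytes))


# Usage with in-memory streams standing in for real binary files.
original = b"Hello World" * 200
key = b"secret"

scrambled = io.BytesIO()
scramble_stream(io.BytesIO(original), scrambled, key)

# XOR with the same key stream undoes the scrambling.
restored = io.BytesIO()
scramble_stream(io.BytesIO(scrambled.getvalue()), restored, key)
```

Because XOR is its own inverse, running the scrambled data through the same function with the same key gives back the original bytes.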

Hi all,
Having reread your comments and the code you posted, I realize that
I was mistaken.
I don't need to read the file byte by byte; you are all right. I do need
to scramble each byte, so I'll do as you said: I'll read blocks and
scramble each byte in the block.
And now for my last question on this subject.
Let's say that my file contains the following line: "Hello World".
I read it using read(1024) as you suggested in your sample.
Now, how can I XOR it with 0xFF, for example?
Thanks again,
Dave
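
One possible answer, as a sketch that assumes Python 3 and a file opened in binary mode ('rb'). The bytes literal below is a stand-in for what read(1024) would return for that line.

```python
data = b"Hello World"   # stand-in for the result of f.read(1024) in 'rb' mode

# Iterating over a bytes object yields integers 0..255 in Python 3,
# so each one can be XORed directly and the results packed back into bytes.
scrambled = bytes(b ^ 0xFF for b in data)

# XOR with the same value is its own inverse.
restored = bytes(b ^ 0xFF for b in scrambled)

print(scrambled.hex())
print(restored)
```

In Python 2, where read() returns a str, the rough equivalent would be ''.join(chr(ord(c) ^ 0xFF) for c in data).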

sjdevnull · Dec 15, 2009

I agree.
http://bugs.python.org/issue7508

Terry Jan Reedy

Thanks, Terry.

Nobody · Dec 16, 2009

So basically this section [1] should not exist, or be completely rewritten?

It should probably be changed to refer to "the file interface" or similar,
rather than removed altogether.

The io documentation may be unnecessary detail for many users. It would be
better to provide a more general overview, with a link to the io
package for those interested in the precise details. Also, removing the
section on "file objects" altogether is likely to be unhelpful for people
migrating from 2.x to 3.x.
