read text file byte by byte

sjdevnull

The OP hasn't told us what version of Python he's using on what OS.  On  
Windows, text mode will compress the end-of-line sequence into a single  
"\n".  In Python 3.x, f.read(1) will read one character, which may be more  
than one byte depending on the encoding.
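
The end-of-line translation described above can be checked with a small sketch. It assumes Python 3, where text mode with the default newline=None translates "\r\n" to "\n" on every platform, not only on Windows; the temporary file is just a stand-in for the OP's file.

```python
import os
import tempfile

# Raw content with Windows-style line endings (10 bytes).
raw = b"one\r\ntwo\r\n"

# Write the bytes untouched to a temporary file.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(raw)

# Binary mode returns exactly the bytes that are in the file.
with open(path, "rb") as f:
    binary_data = f.read()

# Text mode (default newline=None) folds each "\r\n" into a single "\n".
with open(path, "r") as f:
    text_data = f.read()

os.remove(path)

print(len(binary_data), repr(binary_data))
print(len(text_data), repr(text_data))
```

If the goal is to scramble the file's bytes exactly as stored, opening it with "rb" avoids this translation altogether.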

The 3.1 documentation specifies that file.read returns bytes:

file.read([size])
Read at most size bytes from the file (less if the read hits EOF
before obtaining size bytes). If the size argument is negative or
omitted, read all data until EOF is reached. The bytes are returned as
a string object. An empty string is returned when EOF is encountered
immediately. (For certain files, like ttys, it makes sense to continue
reading after an EOF is hit.) Note that this method may call the
underlying C function fread() more than once in an effort to acquire
as close to size bytes as possible. Also note that when in non-
blocking mode, less data than was requested may be returned, even if
no size parameter was given.

Does it need fixing?
 
Dennis Lee Bieber

I'm still running 2.5 (Maybe next spring I'll see if all the third
party libraries I have exist in 2.6 versions)... BUT...

"... are returned as a string object..." Aren't "strings" in 3.x now
unicode? Which would imply, to me, that the interpretation of the
contents will not be plain bytes.
 
sjdevnull

I'm not even concerned (yet) about how the data is interpreted after
it's read. First I'm trying to clarify what exactly gets read.

The post I was replying to said "In Python 3.x, f.read(1) will read
one character, which may be more than one byte depending on the
encoding."

That seems at odds with the documentation saying "Read at most size
bytes from the file"--the fact that it's documented to read "size"
bytes rather than "size" (possibly multibyte) characters is emphasized
by the later language saying that the underlying C fread() call may be
called enough times to read as close to size bytes as possible.

If the poster I was replying to is correct, it seems like a
documentation update is in order. As a long-time programmer, I would
be very surprised to make a call to f.read(X) and have it return more
than X bytes if I hadn't read this here.
 
Nobody

The 3.1 documentation specifies that file.read returns bytes:
Does it need fixing?

There are no file objects in 3.x. The file() function no longer
exists. The return value from open() will be an instance of
_io.<something> depending upon the mode, e.g. _io.TextIOWrapper for 'r',
_io.BufferedReader for 'rb', _io.BufferedRandom for 'w+b', etc.

http://docs.python.org/3.1/library/io.html

io.IOBase.read() doesn't exist, io.RawIOBase.read(n) reads n bytes,
io.TextIOBase.read(n) reads n characters.
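
A minimal sketch of the distinction, assuming Python 3 and a UTF-8 encoded temporary file (the file and its contents are made up for illustration). It shows the class open() returns for each mode and that read(3) means three characters in text mode but three bytes in binary mode.

```python
import io
import os
import tempfile

# "naïve" is 5 characters but 6 bytes in UTF-8 ("ï" takes two bytes).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write("naïve".encode("utf-8"))

# Text mode: read(3) returns 3 characters as a str.
with open(path, "r", encoding="utf-8") as text_file:
    text_type = type(text_file)
    text_chunk = text_file.read(3)

# Binary mode: read(3) returns 3 bytes as a bytes object.
with open(path, "rb") as binary_file:
    binary_type = type(binary_file)
    binary_chunk = binary_file.read(3)

# Binary read/write mode; note that 'w+b' truncates the file on open.
with open(path, "w+b") as random_file:
    random_type = type(random_file)

os.remove(path)

print(text_type.__name__, repr(text_chunk))
print(binary_type.__name__, repr(binary_chunk))
print(random_type.__name__)
```

The three characters read in text mode consumed four bytes of the file, while the binary read stopped in the middle of the two-byte "ï".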
 
Nobody

It's still more efficient to read in blocks, even if you're going to
process the bytes one at a time.

That's fine for a file. If you're reading from a pipe, socket, etc, you
typically want to take what you can get when you can get it (although this
is easier said than done in Python), rather than waiting for a complete
"block". This is often a primary reason for choosing a stream cipher over
a block cipher, as it eliminates the need to add and remove padding for
intermittent data flows.
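
As a rough illustration of "take what you can get", here is a sketch using os.pipe() and os.read(), assuming Python 3. The request is for up to 1024 bytes, but the call returns as soon as some data is available instead of waiting for a full block.

```python
import os

# Create an anonymous pipe; both ends are plain file descriptors.
read_fd, write_fd = os.pipe()

# The writer sends only three bytes.
os.write(write_fd, b"abc")

# Ask for up to 1024 bytes. os.read() returns what is currently
# available rather than blocking until 1024 bytes have arrived.
data = os.read(read_fd, 1024)

os.close(write_fd)
os.close(read_fd)

print(repr(data))
```

A buffered file object wrapped around the same descriptor may instead keep reading until it has the requested amount, which is why low-level os.read() is used here.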
 
Gabriel Genellina

So basically this section [1] should not exist, or be completely rewritten?
At least the references to C stdio library seem wrong to me.

[1] http://docs.python.org/3.1/library/stdtypes.html#file-objects
 
sjdevnull

There are no file objects in 3.x.

Then the documentation definitely needs fixing; the excerpt I posted
earlier is from the 3.1 documentation's section about file objects:
http://docs.python.org/3.1/library/stdtypes.html#file-objects

Which begins:

"5.9 File Objects

File objects are implemented using C’s stdio package and can be
created with the built-in open() function. File objects are also
returned by some other built-in functions and methods, such as
os.popen() and os.fdopen() and the makefile() method of socket objects."

(It goes on to describe the read method's operation on bytes that I
quoted upthread.)

Sadly I'm not familiar enough with 3.x to suggest an appropriate edit.
 
Terry Reedy

So basically this section [1] should not exist, or be completely rewritten?
At least the references to C stdio library seem wrong to me.

[1] http://docs.python.org/3.1/library/stdtypes.html#file-objects

I agree.
http://bugs.python.org/issue7508

Terry Jan Reedy
 
daved170

Thank you all.
Dennis, I really liked your solution for the issue, but I have two
questions about it:
1) My original file is a text file, not binary.

        Do you need to process the bytes in the file as they are? Or do you
accept changes in line endings (M$ Windows "text" files use <cr><lf> as
the line ending, but if you read one in Python as "text", <cr><lf> is
converted to a single <lf>)?
2) I need to read each time 1 byte. I didn't see that on your example
code.

        You've never explained why you need to READ 1 byte at a time, vs
reading a block (I chose 1KB) and processing each byte IN THE BLOCK.
After all, if you do use 1 byte I/O, your program is going to be very
slow, as each read is blocking (suspends) while asking the O/S for the
next character in the file (this depends upon the underlying I/O library
implementation -- I suspect any modern I/O system is still reading some
block size [256 to 4K] and then returning parts of that block as
needed). OTOH, reading a block at a time makes for one suspension and
then a lot of data to be processed however you want.

        You originally stated that you want to "scramble" the bytes -- if
you mean to implement some sort of encryption algorithm you should know
that most of them work in blocks as the "key" is longer than one byte.

        My sample reads in chunks, then the scramble function XORs each byte
with the corresponding byte in the supplied key string, finally
rejoining all the now individual bytes into a single chunk for
subsequent output.
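
Dennis's sample itself is not reproduced in this thread, so the following is only a sketch of the approach he describes, written for Python 3. The names scramble and scramble_stream, the key value, and the in-memory streams are all assumptions made for illustration.

```python
import io
from itertools import cycle


def scramble(chunk: bytes, key: bytes) -> bytes:
    """XOR each byte of chunk with the corresponding key byte, repeating the key."""
    return bytes(b ^ k for b, k in zip(chunk, cycle(key)))


def scramble_stream(src, dst, key: bytes, block_size: int = 1024) -> None:
    """Read src in blocks, scramble each block, and write the result to dst.

    The key restarts at the beginning of every block, so the same
    block_size must be used to reverse the operation.
    """
    while True:
        chunk = src.read(block_size)
        if not chunk:  # empty bytes object means end of file
            break
        dst.write(scramble(chunk, key))


# In-memory streams stand in for files opened with "rb" and "wb".
original = b"Hello World\r\n" * 200
key = b"secret"  # hypothetical key

scrambled_out = io.BytesIO()
scramble_stream(io.BytesIO(original), scrambled_out, key)
scrambled = scrambled_out.getvalue()

# XOR is its own inverse, so scrambling again restores the data.
restored_out = io.BytesIO()
scramble_stream(io.BytesIO(scrambled), restored_out, key)
restored = restored_out.getvalue()

print(scrambled != original, restored == original)
```

Each read returns up to 1024 bytes, and every byte inside the block is still processed individually, which is the point made above about not needing one-byte reads.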

Hi All,
As I reread your comments and the code you posted, I realize that
I was mistaken.
I don't need to read the file byte by byte; you're all right. I do need
to scramble each byte, so I'll do as you said: I'll read blocks and
scramble each byte in the block.
And now for my last question on this subject.
Let's say that my file contains the following line: "Hello World".
I read it using read(1024) as you suggested in your sample.
Now, how can I XOR it with 0xFF, for example?
Thanks again
Dave
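
One possible answer to the XOR question, as a sketch that assumes Python 3 and a file opened in binary mode, so that read(1024) returns a bytes object. The literal below stands in for the chunk that read(1024) would return.

```python
# Stand-in for the chunk returned by f.read(1024) on a file opened with "rb".
chunk = b"Hello World"

# Iterating over a bytes object yields integers 0..255, so each one
# can be XORed with 0xFF directly and the results packed back into bytes.
scrambled = bytes(b ^ 0xFF for b in chunk)

# Applying the same XOR a second time restores the original data.
restored = bytes(b ^ 0xFF for b in scrambled)

print(scrambled.hex())
print(restored)
```

In Python 2, where read() returns a str of single-byte characters, the equivalent expression would be ''.join(chr(ord(c) ^ 0xFF) for c in chunk).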
 
Nobody

So basically this section [1] should not exist, or be completely rewritten?

It should probably be changed to refer to "the file interface" or similar,
rather than removed altogether.

The io documentation may be unnecessary detail for many users. It would be
better to provide a more general overview, with a link to the io
package for those interested in the precise details. Also, removing the
section on "file objects" altogether is likely to be unhelpful for people
migrating from 2.x to 3.x.
 
