Possible read()/readline() bug?


Mike Kent

Before I file a bug report against Python 2.5.2, I want to run this by
the newsgroup to make sure I'm not being stupid.

I have a text file of fixed-length records I want to read in random
order. That file is being changed in real-time by another process,
and my process wants to see the changes to the file. What I'm seeing
is that, once I've opened the file and read a record, all subsequent
seeks to and reads of that same record will return the same data as
the first read of the record, so long as I don't close and reopen the
file. This indicates some sort of buffering and caching is going on.

Consider the following:

$ echo "hi" >foo.txt # Create my test file
$ python2.5 # Run Python
Python 2.5.2 (r252:60911, Sep 22 2008, 16:13:07)
[GCC 3.4.6 20060404 (Red Hat 3.4.6-9)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> f = open('foo.txt')   # Open my test file
>>> f.seek(0)             # Seek to the beginning of the file
>>> f.readline()          # Read the line; I get the data I expected
'hi\n'
>>> # At this point, in another shell I execute 'echo "bye" >foo.txt'.
>>> # 'foo.txt' on disk now contains 'bye\n'.
>>> f.seek(0)             # Seek to the beginning of the still-open file
>>> f.readline()          # I don't get 'bye\n'; I get the stale data
'hi\n'
>>> f.close()             # Now I close the file...
>>> f = open('foo.txt')   # ... and reopen it
>>> f.seek(0)             # Seek to the beginning of the file
>>> f.readline()          # Now I get the expected data
'bye\n'

It seems pretty clear to me that this is wrong. If there is any
caching going on, it should clearly be discarded if I do a seek. Note
that it's not just readline() that's returning me the wrong, cached
data, as I've also tried this with read(), and I get the same
results. It's not acceptable that I have to close and reopen the file
before every read when I'm doing random record access.

So, is this a bug, or am I being stupid?
 

pruebauno

Before I file a bug report against Python 2.5.2, I want to run this by
the newsgroup to make sure I'm not being stupid.

I have a text file of fixed-length records I want to read in random
order. That file is being changed in real-time by another process,
and my process wants to see the changes to the file. What I'm seeing
is that, once I've opened the file and read a record, all subsequent
seeks to and reads of that same record will return the same data as
the first read of the record, so long as I don't close and reopen the
file. This indicates some sort of buffering and caching is going on.

Consider the following:

$ echo "hi" >foo.txt # Create my test file
$ python2.5 # Run Python
Python 2.5.2 (r252:60911, Sep 22 2008, 16:13:07)
[GCC 3.4.6 20060404 (Red Hat 3.4.6-9)] on linux2
Type "help", "copyright", "credits" or "license" for more information.

It seems pretty clear to me that this is wrong. If there is any
caching going on, it should clearly be discarded if I do a seek. Note
that it's not just readline() that's returning me the wrong, cached
data, as I've also tried this with read(), and I get the same
results. It's not acceptable that I have to close and reopen the file
before every read when I'm doing random record access.

So, is this a bug, or am I being stupid?

This has to do with how the OS file system operates. It is the
equivalent of doing:

echo "hi" >foo.txt
vi foo.txt
in another session type: echo "bye" > foo.txt

the text in the vi session doesn't change.

you can even type 'rm foo.txt' and vi will still have the text there.
 

pruebauno

Before I file a bug report against Python 2.5.2, I want to run this by
the newsgroup to make sure I'm not being stupid.
I have a text file of fixed-length records I want to read in random
order. That file is being changed in real-time by another process,
and my process wants to see the changes to the file. What I'm seeing
is that, once I've opened the file and read a record, all subsequent
seeks to and reads of that same record will return the same data as
the first read of the record, so long as I don't close and reopen the
file. This indicates some sort of buffering and caching is going on.
Consider the following:
$ echo "hi" >foo.txt # Create my test file
$ python2.5 # Run Python
Python 2.5.2 (r252:60911, Sep 22 2008, 16:13:07)
[GCC 3.4.6 20060404 (Red Hat 3.4.6-9)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> f = open('foo.txt')   # Open my test file
>>> f.seek(0)             # Seek to the beginning of the file
>>> f.readline()          # Read the line; I get the data I expected
'hi\n'
>>> # At this point, in another shell I execute 'echo "bye" >foo.txt'.
>>> # 'foo.txt' on disk now contains 'bye\n'.
>>> f.seek(0)             # Seek to the beginning of the still-open file
>>> f.readline()          # I don't get 'bye\n'; I get the stale data
'hi\n'
>>> f.close()             # Now I close the file...
>>> f = open('foo.txt')   # ... and reopen it
>>> f.seek(0)             # Seek to the beginning of the file
>>> f.readline()          # Now I get the expected data
'bye\n'

It seems pretty clear to me that this is wrong. If there is any
caching going on, it should clearly be discarded if I do a seek. Note
that it's not just readline() that's returning me the wrong, cached
data, as I've also tried this with read(), and I get the same
results. It's not acceptable that I have to close and reopen the file
before every read when I'm doing random record access.
So, is this a bug, or am I being stupid?

This has to do with how the OS file system operates. It is the
equivalent of doing:

echo "hi" >foo.txt
vi foo.txt
in another session type: echo "bye" > foo.txt

the text in the vi session doesn't change.

you can even type 'rm foo.txt' and vi will still have the text there.

Actually disregard what I said. vi loads everything in memory. You
might want to try:

f = open('foo.txt','r',0)

and see if that fixes your problem.
 

Terry Reedy

Mike said:
Before I file a bug report against Python 2.5.2, I want to run this by
the newsgroup to make sure I'm not [missing something].

Good idea ;-). What you are missing is a rereading of the fine manual
to see what you missed the first time. I recommend this *whenever* you
are having a vexing problem.
I have a text file of fixed-length records I want to read in random
order. That file is being changed in real-time by another process,
and my process wants to see the changes to the file. What I'm seeing
is that, once I've opened the file and read a record, all subsequent
seeks to and reads of that same record will return the same data as
the first read of the record, so long as I don't close and reopen the
file. This indicates some sort of buffering and caching is going on.

In particular, for 2.x
"open( filename[, mode[, bufsize]])
....
The optional bufsize argument specifies the file's desired buffer size:
0 means unbuffered, 1 means line buffered, any other positive value
means use a buffer of (approximately) that size. A negative bufsize
means to use the system default, which is usually line buffered for tty
devices and fully buffered for other files. If omitted, the system
default is used."

Give open('foo.txt', 'r', 0) a try and see if it makes a difference.

There is a slight change in 3.0. "Pass 0 to switch buffering off (only
allowed in binary mode)" I presume the restriction is because 't' mode
is for automatic decoding to unicode and buffering is used for that.
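A minimal sketch of the suggestion (the file name and the in-process rewrite are stand-ins for the OP's second shell; note that in Python 3 an unbuffered open is only allowed in binary mode):

```python
import os

# Demo file name is made up; the rewrite below stands in for another
# process running: echo "bye" > foo_demo.txt
with open("foo_demo.txt", "w") as f:
    f.write("hi\n")

f = open("foo_demo.txt", "rb", 0)    # third argument 0 = unbuffered
first = f.read()                     # b'hi\n'

with open("foo_demo.txt", "w") as g: # truncate and rewrite in place
    g.write("bye\n")

f.seek(0)
second = f.read()                    # no buffer, so the change is visible
f.close()
os.remove("foo_demo.txt")
print(first, second)
```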

tjr
 

kdwyer

Before I file a bug report against Python 2.5.2, I want to run this by
the newsgroup to make sure I'm not being stupid.

I have a text file of fixed-length records I want to read in random
order. That file is being changed in real-time by another process,
and my process wants to see the changes to the file. What I'm seeing
is that, once I've opened the file and read a record, all subsequent
seeks to and reads of that same record will return the same data as
the first read of the record, so long as I don't close and reopen the
file. This indicates some sort of buffering and caching is going on.

Consider the following:

$ echo "hi" >foo.txt # Create my test file
$ python2.5 # Run Python
Python 2.5.2 (r252:60911, Sep 22 2008, 16:13:07)
[GCC 3.4.6 20060404 (Red Hat 3.4.6-9)] on linux2
Type "help", "copyright", "credits" or "license" for more information.

It seems pretty clear to me that this is wrong. If there is any
caching going on, it should clearly be discarded if I do a seek. Note
that it's not just readline() that's returning me the wrong, cached
data, as I've also tried this with read(), and I get the same
results. It's not acceptable that I have to close and reopen the file
before every read when I'm doing random record access.

So, is this a bug, or am I being stupid?

Hello Mike,

I'm guessing that this is not a bug. I'm no expert, but I'd guess
that the open(file, mode) function simply loads the file into memory,
and that further operations (such as seek or read) are performed on
the in-memory data rather than the data on disk. Thus changes to the
file are only observed after a fresh open operation.

This behaviour is probably enforced by the C library on the machine
that you are using. If you want to be able to pick up data changes
like this then you're better off using a database package that has
support for concurrent access, locking and transactions.
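For instance, a hedged sketch using the standard library's sqlite3 (the table and column names are invented for the demo; a real setup would point both processes at the same on-disk database file rather than ':memory:'):

```python
import sqlite3

# ':memory:' keeps the demo self-contained; concurrent processes would
# instead share a database file, with sqlite handling the locking.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, data TEXT)")
conn.execute("INSERT INTO records (id, data) VALUES (0, 'hi')")
conn.commit()

# Stand-in for the writer process updating record 0.
conn.execute("UPDATE records SET data = 'bye' WHERE id = 0")
conn.commit()

# Each SELECT sees the last committed value; no stale stdio buffer.
row = conn.execute("SELECT data FROM records WHERE id = 0").fetchone()
print(row[0])
conn.close()
```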

Cheers,

Kev
 

Steven D'Aprano

Mike said:
Before I file a bug report against Python 2.5.2, I want to run this by
the newsgroup to make sure I'm not [missing something].

Good idea ;-). What you are missing is a rereading of the fine manual
to see what you missed the first time. I recommend this *whenever* you
are having a vexing problem.

With respect Terry, I think what you have missed is the reason why the OP
thinks this is a bug. He's not surprised that buffering is going on:

"This indicates some sort of buffering and caching is going on."

but he thinks that the buffering should be discarded when you seek:

"It seems pretty clear to me that this is wrong. If there is any
caching going on, it should clearly be discarded if I do a seek. Note
that it's not just readline() that's returning me the wrong, cached
data, as I've also tried this with read(), and I get the same
results. It's not acceptable that I have to close and reopen the file
before every read when I'm doing random record access."


I think Mike has a point: if a cache is out of sync with the actual data,
then the cache needs to be thrown away. A bad cache is worse than no
cache at all.

Surely dealing with files that are being actively changed by other
processes is hard. I'm not sure that the solution is anything other than
"well, don't do that then". How do other programming languages and Unix
tools behave? (Windows generally only allows a single process to read or
write to a file at once.)

Additionally, I wonder whether what Mike is seeing is some side-effect of
file-system caching. Perhaps the bytes written to the file by echo are
only written to disk when the file is closed? I don't know, I'm just
hypothesizing.
 

Terry Reedy

Steven said:
Mike said:
Before I file a bug report against Python 2.5.2, I want to run this by
the newsgroup to make sure I'm not [missing something].
Good idea ;-). What you are missing is a rereading of the fine manual
to see what you missed the first time. I recommend this *whenever* you
are having a vexing problem.

With respect Terry, I think what you have missed is the reason why the OP
thinks this is a bug.

I think not. I read and responded carefully ;-) I stand by my answer:
the OP should read the doc and try buffer=0 to see if that solves his
problem.
He's not surprised that buffering is going on:
"This indicates some sort of buffering and caching is going on."

If one reads the open() doc section on buffering, one will *know* that
the reading is buffered and that this is very intentional, and that one
can turn it off.
but he thinks that the buffering should be discarded when you seek:

"It seems pretty clear to me that this is wrong. If there is any
caching going on, it should clearly be discarded if I do a seek.

I don't think Python has any control over this, certainly not in a
platform-independent way, and not after the file has been opened.

For normal sane file reading, discarding after every seek would be very
wrong. Buffering is an *optional* efficiency measure which normally is
the right thing to do and so is the default but which can be disabled
when it is not IF ONE READS THE DOC.
Note
that it's not just readline() that's returning me the wrong, cached
data, as I've also tried this with read(), and I get the same
results. It's not acceptable that I have to close and reopen the file
before every read when I'm doing random record access."

And he does not have to do such a thing.
I think Mike has a point: if a cache is out of sync with the actual data,
then the cache needs to be thrown away. A bad cache is worse than no
cache at all.

Right. I told him what to try. If *that* does not work, he can report
back.

Python is not doing the caching. This is OS stuff.
Surely dealing with files that are being actively changed by other
processes is hard.

Tail, which sequentially reads what other process(es) sequentially
write, works fine.
I'm not sure that the solution is anything other than
"well, don't do that then".

Mixed random access is a different matter. There is a reason DBMSes run
file access through one process.
How do other programming languages and Unix
tools behave? (Windows generally only allows a single process to read or
write to a file at once.)

Additionally, I wonder whether what Mike is seeing is some side-effect of
file-system caching. Perhaps the bytes written to the file by echo are
only written to disk when the file is closed? I don't know, I'm just
hypothesizing.

When echo closes, I expect the disk block will be flushed, which means
added to the pool of blocks ready to be read or written when the disk
driver gets cpu time and gets around to any particular block. Depending
of the file system and driver, blocks may get sorted by disk address to
minimize inter-access seek times (the elevator algorithm).

Terry Jan Reedy
 

Carl Banks

Before I file a bug report against Python 2.5.2, I want to run this by
the newsgroup to make sure I'm not being stupid.

I have a text file of fixed-length records I want to read in random
order.  That file is being changed in real-time by another process,
and my process wants to see the changes to the file.  What I'm seeing
is that, once I've opened the file and read a record, all subsequent
seeks to and reads of that same record will return the same data as
the first read of the record, so long as I don't close and reopen the
file.  This indicates some sort of buffering and caching is going on.

Consider the following:

$ echo "hi" >foo.txt  # Create my test file
$ python2.5              # Run Python
Python 2.5.2 (r252:60911, Sep 22 2008, 16:13:07)
[GCC 3.4.6 20060404 (Red Hat 3.4.6-9)] on linux2
Type "help", "copyright", "credits" or "license" for more information.

I thought this might be a case where the shell unlinks foo.txt and
creates a new file... but it doesn't for me, and I still get the same
behavior as you. It is indeed the buffering that's causing this.

It seems pretty clear to me that this is wrong.  If there is any
caching going on, it should clearly be discarded if I do a seek.

I totally disagree. If you need to discard the buffers, there's a way
to do it: flush(). If you force seek() to discard perfectly good
buffers you will hurt performance when not dealing with volatile data.

Anyway, in Python 2.x, the behavior of the various file methods is
documented as reflecting the underlying C stdio library. In fact, the
documentation for fseek specifically says it sets the file's current
position "like stdio's fseek()". Whatever stdio does is what Python
does. So even if this behavior were a bug, it would be a bug in
stdio, not in Python.

 Note
that it's not just readline() that's returning me the wrong, cached
data, as I've also tried this with read(), and I get the same
results.  It's not acceptable that I have to close and reopen the file
before every read when I'm doing random record access.

You can call f.flush() to force it to discard the cache. Or use
unbuffered I/O. Better yet, get rid of file I/O altogether and use a
memory-mapped file.
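A sketch of the mmap idea for fixed-length records (the record length and file name are invented for the demo; one caveat is that a mapping's size is fixed when it is created, so this suits in-place record rewrites rather than the truncate-and-recreate that 'echo >' performs):

```python
import mmap
import os

RECLEN = 4  # assumed record length for the demo

with open("records_demo.bin", "wb") as f:
    f.write(b"hi..bye.")            # two 4-byte records

f = open("records_demo.bin", "r+b")
mm = mmap.mmap(f.fileno(), 0)       # map the whole file

def record(n):
    # Slice record n straight out of the mapping; no stdio buffer involved.
    return mm[n * RECLEN:(n + 1) * RECLEN]

rec0, rec1 = record(0), record(1)
mm.close()
f.close()
os.remove("records_demo.bin")
print(rec0, rec1)
```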

So, is this a bug, or am I being stupid?

Well, it's not a bug, so....

Seriously, I advise you not to submit a bug report. Doesn't mean
you're stupid, maybe you didn't know about unbuffered I/O or the
flush() method. That just means you're uneducated. :) But please
leave seek() out of it.


Carl Banks
 

Mike Kent

To followup on this:

Terry: Yes, I did in fact miss the 'buffer' parameter to open.
Setting the buffer parameter to 0 did in fact fix the test code that I
gave above, but oddly, did not fix my actual production code; it
continues to get the data as first read, rather than what is currently
on the disk. I'm still investigating why.

Carl: I tried the above test code, without 'buffer=0' in the open, but
with a flush added before reads in the appropriate places. The flush
made no difference; readline continued to return the old data rather
than what was actually on the disk. So, flush isn't the answer. I
suppose that means that, when the document states it flushes the
buffer, it's referring to the output buffer, not the input buffer.
 

pruebauno

To followup on this:

Terry: Yes, I did in fact miss the 'buffer' parameter to open.
Setting the buffer parameter to 0 did in fact fix the test code that I
gave above, but oddly, did not fix my actual production code; it
continues to get the data as first read, rather than what is currently
on the disk. I'm still investigating why.

Carl: I tried the above test code, without 'buffer=0' in the open, but
with a flush added before reads in the appropriate places. The flush
made no difference; readline continued to return the old data rather
than what was actually on the disk. So, flush isn't the answer. I
suppose that means that, when the document states it flushes the
buffer, it's referring to the output buffer, not the input buffer.

Something odd is going on for sure. I had a couple of theories but
then I tested it on both Windows XP and AIX and could not reproduce
the problem even using the default buffer setting. As soon as I do a
seek and read it gives me the new data. I wonder if other people can
test this out on different operating systems and file systems and
detect a pattern.
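To make that easy, here is a self-contained version of the reproduction (the file name is invented, Python itself plays the part of the second shell, and the final value may legitimately differ by platform and Python version, which is exactly the data being asked for):

```python
import os
import sys

path = "repro_demo.txt"
with open(path, "w") as f:
    f.write("hi\n")

f = open(path)                 # default buffered open, as in the report
before = f.readline()

with open(path, "w") as g:     # stand-in for: echo "bye" >repro_demo.txt
    g.write("bye\n")

f.seek(0)
after = f.readline()           # stale 'hi\n' or fresh 'bye\n', OS-dependent
f.close()
os.remove(path)
print(sys.platform, repr(before), repr(after))
```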
 

M.-A. Lemburg

The C lib uses a buffer for reading files and you are seeing the
effects of this.

Try using f = open('foo.txt', 'r', 0)

http://www.python.org/doc/2.5.2/lib/built-in-funcs.html#l2h-54
Consider the following:

$ echo "hi" >foo.txt # Create my test file
$ python2.5 # Run Python
Python 2.5.2 (r252:60911, Sep 22 2008, 16:13:07)
[GCC 3.4.6 20060404 (Red Hat 3.4.6-9)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> f = open('foo.txt')   # Open my test file
>>> f.seek(0)             # Seek to the beginning of the file
>>> f.readline()          # Read the line; I get the data I expected
'hi\n'
>>> # At this point, in another shell I execute 'echo "bye" >foo.txt'.
>>> # 'foo.txt' on disk now contains 'bye\n'.
>>> f.seek(0)             # Seek to the beginning of the still-open file
>>> f.readline()          # I don't get 'bye\n'; I get the stale data
'hi\n'
>>> f.close()             # Now I close the file...
>>> f = open('foo.txt')   # ... and reopen it
>>> f.seek(0)             # Seek to the beginning of the file
>>> f.readline()          # Now I get the expected data
'bye\n'

It seems pretty clear to me that this is wrong. If there is any
caching going on, it should clearly be discarded if I do a seek. Note
that it's not just readline() that's returning me the wrong, cached
data, as I've also tried this with read(), and I get the same
results. It's not acceptable that I have to close and reopen the file
before every read when I'm doing random record access.

So, is this a bug, or am I being stupid?

Hello Mike,

I'm guessing that this is not a bug. I'm no expert, but I'd guess
that the open(file, mode) function simply loads the file into memory,
and that further operations (such as seek or read) are performed on
the in-memory data rather than the data on disk. Thus changes to the
file are only observed after a fresh open operation.

This behaviour is probably enforced by the C library on the machine
that you are using. If you want to be able to pick up data changes
like this then you're better off using a database package that has
support for concurrent access, locking and transactions.

Cheers,

Kev

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, Oct 23 2008)

:::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! ::::


eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
 

Joshua Kugler

Mike said:
To followup on this:

Terry: Yes, I did in fact miss the 'buffer' parameter to open.
Setting the buffer parameter to 0 did in fact fix the test code that I
gave above, but oddly, did not fix my actual production code; it
continues to get the data as first read, rather than what is currently
on the disk. I'm still investigating why.

What OS is your test code on? What OS is your production code on? As
mentioned, read()/readline() will mirror the OS's underlying stdio.

j
 

Terry Reedy

Mike said:
To followup on this:

Terry: Yes, I did in fact miss the 'buffer' parameter to open.
Setting the buffer parameter to 0 did in fact fix the test code that I
gave above, but oddly, did not fix my actual production code; it
continues to get the data as first read, rather than what is currently
on the disk. I'm still investigating why.

Same hardware and OS?
How do you know what is currently 'on disk'? Even with 'buffering'
turned off, the disk is read and written in 'blocks'. 512 bytes was
common on unix. I suspect it is larger on Linux now. (4k on Windows,
typically). You *might* be seeing something as deep as the driver for a
particular disk. Good luck.
Carl: I tried the above test code, without 'buffer=0' in the open, but
with a flush added before reads in the appropriate places. The flush
made no difference; readline continued to return the old data rather
than what was actually on the disk. So, flush isn't the answer. I
suppose that means that, when the document states it flushes the
buffer, it's referring to the output buffer, not the input buffer.

Yes, I checked the C99 reference.
 
