File corruption on Windows - possible bug

Jeremy Jones

I've written a piece of code that iterates through a list of items and
determines the filename to write some piece of data to based on
something in the item itself. Here is a small example piece of code to
show the type of thing I'm doing::

#################################
file_dict = {}

a_list = [("a", "a%s" % i) for i in range(2500)]
b_list = [("b", "b%s" % i) for i in range(2500)]
c_list = [("c", "c%s" % i) for i in range(2500)]
d_list = [("d", "d%s" % i) for i in range(2500)]


joined_list = a_list + b_list + c_list + d_list

for key, value in joined_list:
    outfile = file_dict.setdefault(key, open("%s.txt" % key, "w"))
    outfile.write("%s\n" % value)

for f in file_dict.values():
    f.close()
#################################

Problem is, when I run this on Windows, I get 14,520 null ("\x00")
characters at the front of the file and each file is 16,390 bytes long.
When I run this script on Linux, each file is 13,890 bytes and contains
no "\x00" characters. This piece of code::

#################################
import cStringIO

file_dict = {}

a_list = [("a", "a%s" % i) for i in range(2500)]
b_list = [("b", "b%s" % i) for i in range(2500)]
c_list = [("c", "c%s" % i) for i in range(2500)]
d_list = [("d", "d%s" % i) for i in range(2500)]


joined_list = a_list + b_list + c_list + d_list

for key, value in joined_list:
    #outfile = file_dict.setdefault(key, open("%s.txt" % key, "w"))
    outfile = file_dict.setdefault(key, cStringIO.StringIO())
    outfile.write("%s\n" % value)

for key, io_string in file_dict.items():
    outfile = open("%s.txt" % key, "w")
    io_string.seek(0)
    outfile.write(io_string.read())
    outfile.close()
#################################

results in files of 16,390 bytes on Windows and 13,890 bytes on Linux,
with no "\x00" characters on either platform (the size difference is due
to line endings). I'm still doing a setdefault on the dictionary to
create an object if the key doesn't exist, but I'm using a cStringIO
object rather than a file object, treating it just like a file and
writing it out to disk afterward.

Does anyone have any idea as to why this is writing over 14,000 "\x00"
characters to my file to start off with where printable characters
should go and then writing the remainder of the file correctly?


Jeremy Jones
 

Duncan Booth

Jeremy said:
Here is a small example piece of code to
show the type of thing I'm doing::

#################################
file_dict = {}

a_list = [("a", "a%s" % i) for i in range(2500)]
b_list = [("b", "b%s" % i) for i in range(2500)]
c_list = [("c", "c%s" % i) for i in range(2500)]
d_list = [("d", "d%s" % i) for i in range(2500)]


joined_list = a_list + b_list + c_list + d_list

for key, value in joined_list:
    outfile = file_dict.setdefault(key, open("%s.txt" % key, "w"))
    outfile.write("%s\n" % value)

for f in file_dict.values():
    f.close()
#################################

Problem is, when I run this on Windows, I get 14,520 null ("\x00")
characters at the front of the file and each file is 16,390 bytes long.

Your call to setdefault opens the file for writing every time it is
called, but only the first handle is ever used to write. I presume you
get a nasty interaction between the handle you are writing through and
the later handles, which each open the file in a destructive ("w") mode:
every reopen truncates the file to zero length while the first handle's
position keeps advancing, so its buffered writes land past the new end
of file and the gap gets filled with "\x00" bytes.
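The eager evaluation at the heart of this can be shown with a small illustrative snippet (not from the thread; `make_value` is a stand-in for the `open(...)` call):

```python
# Illustration: dict.setdefault evaluates its default argument every
# time it is called, even when the key is already present.
calls = []

def make_value(key):
    calls.append(key)        # record each time the "default" is built
    return key.upper()

d = {}
d.setdefault("a", make_value("a"))   # key absent: default is stored
d.setdefault("a", make_value("a"))   # key present: default is built anyway, then discarded
print(calls)                         # -> ['a', 'a']
```

With `open("%s.txt" % key, "w")` as the default, that second evaluation is a second destructive open of the same file.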

The fix is simply to open each file once instead of 2500 times, e.g.
(untested code):

for key, value in joined_list:
    if key in file_dict:
        outfile = file_dict[key]
    else:
        outfile = file_dict[key] = open("%s.txt" % key, "w")
    outfile.write("%s\n" % value)
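A self-contained, runnable version of the same open-once pattern (modern Python spelling, writing into a temporary directory; an illustration, not Duncan's exact code):

```python
import os
import tempfile

# Open-once version of the loop: each file handle is created exactly
# once, so no later open(..., "w") can truncate a file mid-write.
joined_list = [(k, "%s%s" % (k, i)) for k in "abcd" for i in range(2500)]

tmpdir = tempfile.mkdtemp()
file_dict = {}
for key, value in joined_list:
    if key in file_dict:
        outfile = file_dict[key]
    else:
        outfile = file_dict[key] = open(os.path.join(tmpdir, "%s.txt" % key), "w")
    outfile.write("%s\n" % value)

for f in file_dict.values():
    f.close()

sizes = sorted(os.path.getsize(os.path.join(tmpdir, "%s.txt" % k)) for k in "abcd")
print(sizes)   # 13,890 bytes each with "\n" endings, matching the Linux figure
```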
 

Bengt Richter

I've written a piece of code that iterates through a list of items and
determines the filename to write some piece of data to based on
something in the item itself. Here is a small example piece of code to
show the type of thing I'm doing::

#################################
file_dict = {}

a_list = [("a", "a%s" % i) for i in range(2500)]
b_list = [("b", "b%s" % i) for i in range(2500)]
c_list = [("c", "c%s" % i) for i in range(2500)]
d_list = [("d", "d%s" % i) for i in range(2500)]


joined_list = a_list + b_list + c_list + d_list

for key, value in joined_list:
    outfile = file_dict.setdefault(key, open("%s.txt" % key, "w"))

You are opening files multiply, since the open is a default value
expression that is always evaluated. Try replacing the above line with
the following two lines:

    try: outfile = file_dict[key]
    except KeyError: outfile = file_dict[key] = open("%s.txt" % key, 'w')

    outfile.write("%s\n" % value)

for f in file_dict.values():
    f.close()
#################################

Problem is, when I run this on Windows, I get 14,520 null ("\x00")
characters at the front of the file and each file is 16,390 bytes long.
When I run this script on Linux, each file is 13,890 bytes and contains
no "\x00" characters.

I don't want to think about the _exact_ explanation, but try the above
(untested ;-) and see if the symptoms change ;-)
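For the curious, the likely mechanism can be sketched in a few lines (my illustrative reconstruction, not Bengt's code): a second open in "w" mode truncates the file, but the first handle's position is still past the new end of file, so when its buffer is flushed the operating system zero-fills the gap:

```python
import os
import tempfile

# Sketch of the suspected mechanism behind the "\x00" padding.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

f1 = open(path, "w")
f1.write("x" * 100)
f1.flush()                  # file is now 100 bytes of "x"
f2 = open(path, "w")        # destructive open: file truncated to 0 bytes
f2.close()
f1.write("y")               # f1 still believes it is at offset 100
f1.close()                  # flush writes "y" at offset 100; bytes 0-99 become "\x00"

data = open(path, "rb").read()
print(len(data), data.count(b"\x00"))   # 101 bytes, 100 of them NUL
```

In the original script the same thing happens 2499 times per key, which is how over 14,000 NUL bytes pile up at the front of each file.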

Regards,
Bengt Richter
 
