file corruption on windows - possible bug

Jeremy Jones · May 9, 2005

I've written a piece of code that iterates through a list of items and
determines the filename to write some piece of data to based on
something in the item itself. Here is a small example piece of code to
show the type of thing I'm doing::

#################################
file_dict = {}

a_list = [("a", "a%s" % i) for i in range(2500)]
b_list = [("b", "b%s" % i) for i in range(2500)]
c_list = [("c", "c%s" % i) for i in range(2500)]
d_list = [("d", "d%s" % i) for i in range(2500)]

joined_list = a_list + b_list + c_list + d_list

for key, value in joined_list:
    outfile = file_dict.setdefault(key, open("%s.txt" % key, "w"))
    outfile.write("%s\n" % value)

for f in file_dict.values():
    f.close()
#################################

The problem is that when I run this on Windows, each file is 16,390
bytes long and starts with 14,520 null ("\x00") characters. When I run
the same script on Linux, each file is 13,890 bytes and contains no
"\x00" characters. This piece of code::

#################################
import cStringIO

file_dict = {}

a_list = [("a", "a%s" % i) for i in range(2500)]
b_list = [("b", "b%s" % i) for i in range(2500)]
c_list = [("c", "c%s" % i) for i in range(2500)]
d_list = [("d", "d%s" % i) for i in range(2500)]

joined_list = a_list + b_list + c_list + d_list

for key, value in joined_list:
    #outfile = file_dict.setdefault(key, open("%s.txt" % key, "w"))
    outfile = file_dict.setdefault(key, cStringIO.StringIO())
    outfile.write("%s\n" % value)

for key, io_string in file_dict.items():
    outfile = open("%s.txt" % key, "w")
    io_string.seek(0)
    outfile.write(io_string.read())
    outfile.close()
#################################

results in files of 16,390 bytes on Windows and 13,890 bytes on Linux,
with no "\x00" characters on either platform (the size difference
between Windows and Linux is due to line endings). I'm still doing a
setdefault on the dictionary to create an object if the key doesn't
exist, but I'm using a cStringIO object rather than a file object. So
I'm treating it just like a file and writing it out later.
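
As a quick sanity check on those sizes, here is a small sketch of the arithmetic. It assumes each line is the key letter, the number, and a line ending, and that Windows text mode writes "\r\n" where Linux writes "\n". The names unix_size and windows_size are made up for illustration.

```python
# Bytes per file when every line ends in a single "\n" character.
unix_size = sum(len("a%s\n" % i) for i in range(2500))

# Windows text mode expands "\n" to "\r\n": one extra byte per line.
windows_size = unix_size + 2500

print(unix_size)     # 13890
print(windows_size)  # 16390
```

Both figures match the sizes reported above, so the 2,500-byte gap is fully accounted for by line endings.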

Does anyone have any idea as to why this is writing over 14,000 "\x00"
characters to my file to start off with where printable characters
should go and then writing the remainder of the file correctly?

Jeremy Jones

Duncan Booth · May 9, 2005

Jeremy said:
Here is a small example piece of code to
show the type of thing I'm doing::

#################################
file_dict = {}

a_list = [("a", "a%s" % i) for i in range(2500)]
b_list = [("b", "b%s" % i) for i in range(2500)]
c_list = [("c", "c%s" % i) for i in range(2500)]
d_list = [("d", "d%s" % i) for i in range(2500)]

joined_list = a_list + b_list + c_list + d_list

for key, value in joined_list:
    outfile = file_dict.setdefault(key, open("%s.txt" % key, "w"))
    outfile.write("%s\n" % value)

for f in file_dict.values():
    f.close()
#################################

Problem is, when I run this on Windows, I get 14,520 null ("\x00")
characters at the front of the file and each file is 16,390 bytes long.

Your call to setdefault is opening the file for writing every time it is
called, but using only the first handle to write to the file. I presume you
get a nasty interaction between the file handle you are using to write and
the other file handles which open the file in a destructive ("w") mode.
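
A minimal sketch of that behaviour, without touching any files: make_default is a hypothetical stand-in for open() that only records each call. Python evaluates the arguments before calling setdefault, so the stand-in runs on every pass even when the key is already present.

```python
calls = []

def make_default(key):
    # Hypothetical stand-in for open(...): records every call it receives.
    calls.append(key)
    return "handle-for-%s" % key

cache = {}
for key in ["a", "a", "a", "b"]:
    # make_default(key) is evaluated before setdefault runs,
    # whether or not the key is already in the dictionary.
    handle = cache.setdefault(key, make_default(key))

print(len(calls))  # 4: one call per iteration
print(len(cache))  # 2: only the first result per key is stored
```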

The fix is simply to open each file only once instead of 2500 times, e.g.
(untested code):

for key, value in joined_list:
    if key in file_dict:
        outfile = file_dict[key]
    else:
        outfile = file_dict[key] = open("%s.txt" % key, "w")
    outfile.write("%s\n" % value)
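
For anyone who wants to try the idea end to end, here is a self-contained sketch of the same approach. It assumes a modern Python (it uses a with statement) and writes into a temporary directory. The function name write_grouped and the variable names are made up for illustration and are not part of the original script.

```python
import os
import tempfile

def write_grouped(pairs, directory):
    """Write each value to <directory>/<key>.txt, opening each file once."""
    file_dict = {}
    try:
        for key, value in pairs:
            if key in file_dict:
                outfile = file_dict[key]
            else:
                # First time this key is seen: open its file exactly once.
                path = os.path.join(directory, "%s.txt" % key)
                outfile = file_dict[key] = open(path, "w")
            outfile.write("%s\n" % value)
    finally:
        # Close every handle, even if a write failed part way through.
        for f in file_dict.values():
            f.close()
    return sorted(file_dict)

pairs = [(k, "%s%s" % (k, i)) for k in "abcd" for i in range(2500)]
out_dir = tempfile.mkdtemp()
keys = write_grouped(pairs, out_dir)

# Read one file back in binary mode to check for stray null bytes.
with open(os.path.join(out_dir, "a.txt"), "rb") as f:
    data = f.read()
print(b"\x00" in data)
```

Reading the file back in binary mode keeps the check independent of line-ending translation.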

Bengt Richter · May 10, 2005

Jeremy said:
I've written a piece of code that iterates through a list of items and
determines the filename to write some piece of data to based on
something in the item itself. Here is a small example piece of code to
show the type of thing I'm doing::

#################################
file_dict = {}

a_list = [("a", "a%s" % i) for i in range(2500)]
b_list = [("b", "b%s" % i) for i in range(2500)]
c_list = [("c", "c%s" % i) for i in range(2500)]
d_list = [("d", "d%s" % i) for i in range(2500)]

joined_list = a_list + b_list + c_list + d_list

for key, value in joined_list:
    outfile = file_dict.setdefault(key, open("%s.txt" % key, "w"))

You are opening each file many times, since the open() call is a default-value expression that is
always evaluated. Try replacing the above line with the following two lines:
    try: outfile = file_dict[key]
    except KeyError: outfile = file_dict[key] = open("%s.txt" % key, 'w')

    outfile.write("%s\n" % value)

for f in file_dict.values():
    f.close()
#################################

Problem is, when I run this on Windows, I get 14,520 null ("\x00")
characters at the front of the file and each file is 16,390 bytes long.
When I run this script on Linux, each file is 13,890 bytes and contains
no "\x00" characters. This piece of code::

I don't want to think about the _exact_ explanation, but try the above (untested ;-)
and see if the symptoms change ;-)

Regards,
Bengt Richter
