Python 3.1.2 and marshal

raj

Hi,

I am using 64-bit Python on an x86_64 platform (Fedora 13). I have
some code that uses the Python marshal module to serialize some
objects to files. However, in moving the code to Python 3, I have come
across a problem: if more than one object has been serialized to a
file, only the first object can be de-serialized. Trying to
de-serialize the second object raises an EOFError. De-serialization of
multiple objects works fine in Python 2.x. I went through the Python 3
documentation to see whether marshal's behaviour has changed, but
haven't found anything to that effect. Does anyone else see this
problem? Here is some example code:

bash-4.1$ cat marshaltest.py
import marshal

numlines = 1
numwords = 25

stream = open('fails.mar','wb')
marshal.dump(numlines, stream)
marshal.dump(numwords, stream)
stream.close()

tmpstream = open('fails.mar', 'rb')
value1 = marshal.load(tmpstream)
value2 = marshal.load(tmpstream)

print(value1 == numlines)
print(value2 == numwords)


Here are the results of running this code:

bash-4.1$ python2.7 marshaltest.py
True
True

bash-4.1$ python3.1 marshaltest.py
Traceback (most recent call last):
  File "marshaltest.py", line 13, in <module>
    value2 = marshal.load(tmpstream)
EOFError: EOF read where object expected

Interestingly, the file created by Python 3.1 is readable by both
Python 2.7 and Python 2.6, and both objects are read successfully.

Cheers,
raj
 
Thomas Jollans

Interesting. I modified your script a bit:

0:pts/2:/tmp% cat marshtest.py
from __future__ import print_function
import marshal
import sys
if sys.version_info[0] == 3:
    bytehex = lambda i: '%02X ' % i
else:
    bytehex = lambda c: '%02X ' % ord(c)

numlines = 1
numwords = 25

stream = open('fails.mar','wb')
marshal.dump(numlines, stream)
marshal.dump(numwords, stream)
stream.close()

tmpstream = open('fails.mar', 'rb')

for byte in tmpstream.read():
    sys.stdout.write(bytehex(byte))

sys.stdout.write('\n')
tmpstream.seek(0)

print('pos:', tmpstream.tell())
value1 = marshal.load(tmpstream)
print('val:', value1)
print('pos:', tmpstream.tell())
value2 = marshal.load(tmpstream)
print('val:', value2)
print('pos:', tmpstream.tell())

print(value1 == numlines)
print(value2 == numwords)
0:pts/2:/tmp% python2.6 marshtest.py
69 01 00 00 00 69 19 00 00 00
pos: 0
val: 1
pos: 5
val: 25
pos: 10
True
True
0:pts/2:/tmp% python3.1 marshtest.py
69 01 00 00 00 69 19 00 00 00
pos: 0
val: 1
pos: 10
Traceback (most recent call last):
  File "marshtest.py", line 29, in <module>
    value2 = marshal.load(tmpstream)
EOFError: EOF read where object expected
1:pts/2:/tmp%

So, the contents of the file are identical, but Python 3 reads the whole
file, whereas Python 2 reads only the data it uses.
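
As a rough illustration of why the file position matters, the ten bytes in the dump can be decoded by hand. This is only a sketch: it assumes each record is a one-byte type code ('i', 0x69) followed by a 4-byte little-endian signed integer, which is how the dump appears to be laid out, and it works on the literal bytes shown rather than on the marshal module itself.

```python
import struct

# Bytes copied from the hex dump printed by both interpreters.
data = bytes.fromhex("69 01 00 00 00 69 19 00 00 00")

# Assumption: every record is a 1-byte type code followed by a
# 4-byte little-endian signed integer, with no padding ("<ci").
records = list(struct.iter_unpack("<ci", data))

for type_code, value in records:
    print(type_code, value)
```

Under that assumption each object occupies 5 bytes, so a reader that consumes only what it needs stops at offset 5 after the first value, which matches the Python 2.6 output.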

This looks like a simple optimisation: read the whole file at once,
instead of byte-by-byte, to improve performance when reading large
objects (such as Python modules).

The question is: was storing multiple objects in sequence an intended
use of the marshal module? I doubt it. You can always wrap your data in
tuples or use pickle.
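
A minimal sketch of the tuple approach, so that the file holds exactly one marshalled object and a single load() call retrieves everything. The file name single.mar and the temporary directory are placeholders chosen for this example.

```python
import marshal
import os
import tempfile

numlines = 1
numwords = 25

# Hypothetical file location, used only for this example.
path = os.path.join(tempfile.mkdtemp(), "single.mar")

# Write both values as one tuple: the file contains a single object.
with open(path, "wb") as stream:
    marshal.dump((numlines, numwords), stream)

# One load() call returns the tuple, so it does not matter how much
# of the file the reader consumes.
with open(path, "rb") as stream:
    value1, value2 = marshal.load(stream)

print(value1 == numlines)
print(value2 == numwords)
```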
 
raj

> So, the contents of the file are identical, but Python 3 reads the whole
> file, whereas Python 2 reads only the data it uses.
>
> This looks like a simple optimisation: read the whole file at once,
> instead of byte-by-byte, to improve performance when reading large
> objects (such as Python modules).

Good analysis and a nice catch. Thanks. It is likely that the intent
is to optimize performance.

> The question is: was storing multiple objects in sequence an intended
> use of the marshal module?

The documentation (http://docs.python.org/py3k/library/marshal.html)
for marshal itself states (emphasis mine):

marshal.load(file)

    Read *one value* from the open file and return it. If no valid
    value is read (e.g. because the data has a different Python version’s
    incompatible marshal format), raise EOFError, ValueError or TypeError.
    The file must be an open file object opened in binary mode ('rb' or
    'r+b').

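This suggests that support for reading multiple values is intended: if
each call reads only one value, the rest of the file should remain
available to later calls.

> I doubt it. You can always wrap your data in
> tuples or use pickle.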

The code that I am moving to 3.x dates back to the Python 1.5 days,
when marshal was significantly faster than pickle and Zope was
evolutionarily at the Bobo stage :). I have switched the current code
to pickle, which makes more sense. The pickle files are a bit larger
and loading them is a bit slower, but not enough to make a noticeable
difference for my use case. Thanks.

raj
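
For reference, a minimal sketch of what the pickle version of the test script might look like. It is not the actual code from the application: the file name works.pkl and the temporary directory are placeholders. Each pickle.load() call reads one pickled object, so successive calls on the same file return the objects in the order they were written.

```python
import os
import pickle
import tempfile

numlines = 1
numwords = 25

# Hypothetical file location, used only for this example.
path = os.path.join(tempfile.mkdtemp(), "works.pkl")

# Write two objects back to back into the same file.
with open(path, "wb") as stream:
    pickle.dump(numlines, stream)
    pickle.dump(numwords, stream)

# Read them back one at a time, in the same order.
with open(path, "rb") as stream:
    value1 = pickle.load(stream)
    value2 = pickle.load(stream)

print(value1 == numlines)
print(value2 == numwords)
```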
 