pickle alternative


simonwittber

I've written a simple module which serializes these Python types:

IntType, TupleType, StringType, FloatType, LongType, ListType, DictType

It's available for perusal here:

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/415503

It appears to work faster than pickle; however, the decode process is
much slower (5x) than the encode process. Has anyone got any tips on
ways I might speed this up?


Sw.
 

Andrew Dalke

simonwittber said:
I've written a simple module which serializes these Python types:

IntType, TupleType, StringType, FloatType, LongType, ListType, DictType

For simple data types consider "marshal" as an alternative to "pickle".
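A minimal round trip (marshal's dumps/loads parallel pickle's):

import marshal

data = [1, 2.5, "three", (4, 5L), {"six": 7}]
s = marshal.dumps(data)           # serialize to a byte string
assert marshal.loads(s) == data   # and back again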
It appears to work faster than pickle; however, the decode process is
much slower (5x) than the encode process. Has anyone got any tips on
ways I might speed this up?


def dec_int_type(data):
    value = int(unpack('!i', data.read(4))[0])
    return value

That 'int' isn't needed -- unpack returns an int, not a string
representation of the int.
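You can see this in the interpreter:

>>> from struct import unpack
>>> unpack('!i', '\x00\x00\x00\x07')
(7,)
>>> type(unpack('!i', '\x00\x00\x00\x07')[0])
<type 'int'>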

BTW, your code won't work on 64 bit machines.

def enc_long_type(obj):
    return "%s%s%s" % ("B", pack("!L", len(str(obj))), str(obj))

There's no need to compute str(long) twice -- for large longs
it takes a lot of work to convert to base 10. For that matter,
it's faster to convert to hex, and the hex form is more compact.
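Something along these lines (a sketch, assuming the recipe's
"from struct import pack"):

def enc_long_type(obj):
    # compute the hex form once; [2:-1] strips the "0x" prefix and the
    # trailing "L" that Python 2's hex() puts on longs
    # (assumes a non-negative long, like the fuller sketch later in the thread)
    s = hex(obj)[2:-1]
    return "B" + pack("!L", len(s)) + s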

Every decode you do requires several function calls. While
less elegant, you'll likely get better performance (test it!)
if you minimize that; try something like this

import struct
from cStringIO import StringIO

def decode(data):
    return _decode(StringIO(data).read)

def _decode(read, unpack = struct.unpack):
    code = read(1)
    if not code:
        raise IOError("reached the end of the file")
    if code == "I":
        return unpack("!i", read(4))[0]
    if code == "F":
        return unpack("!f", read(4))[0]
    if code == "L":
        count = unpack("!i", read(4))[0]   # unpack returns a tuple; take the int
        return [_decode(read) for i in range(count)]
    if code == "D":
        count = unpack("!i", read(4))[0]
        # keys and values are interleaved: decode 2*count items, then pair them
        items = [_decode(read) for i in range(count * 2)]
        return dict(zip(items[0::2], items[1::2]))
    ...



Andrew
(e-mail address removed)
 

simonwittber

For simple data types consider "marshal" as an alternative to "pickle".
From the marshal documentation:
Warning: The marshal module is not intended to be secure against
erroneous or maliciously constructed data. Never unmarshal data
received from an untrusted or unauthenticated source.
BTW, your code won't work on 64 bit machines.

Any idea how this might be solved? The number of bytes used has to be
consistent across platforms. I guess this means I cannot use the struct
module?
There's no need to compute str(long) twice -- for large longs
it takes a lot of work to convert to base 10. For that matter,
it's faster to convert to hex, and the hex form is more compact.

Thanks for the tip.

Sw.
 

Andrew Dalke

simonwittber said:
Warning: The marshal module is not intended to be secure against
erroneous or maliciously constructed data. Never unmarshal data
received from an untrusted or unauthenticated source.

Ahh, I had forgotten that. Though I can't recall what an attack
might be, I think it's because the C code hasn't been fully vetted
for unexpected error conditions.
Any idea how this might be solved? The number of bytes used has to be
consistent across platforms. I guess this means I cannot use the struct
module?

How do you want to solve it? Should a 64 bit machine be able to read
a data stream made on a 32 bit machine? What about vice versa? How
are floats interconverted?

You could preface the output stream with a description of the encoding
used: version number, size of float, size of int (which should always
be sizeof float these days, I think). Read these then use that
information to figure out which decode/dispatch function to use.
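A sketch of what such a header could look like (the names and the
one-byte fields are illustrative, not part of the recipe):

import struct

HEADER_FMT = "!BBB"   # version, native int size, native float size

def write_header(write, version=1):
    write(struct.pack(HEADER_FMT, version,
                      struct.calcsize("l"),    # size of a native Python int (a C long)
                      struct.calcsize("d")))   # size of a native Python float (a C double)

def read_header(read):
    # returns the writer's encoding parameters so the reader can pick
    # a matching decode/dispatch function
    return struct.unpack(HEADER_FMT, read(struct.calcsize(HEADER_FMT)))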

Andrew
(e-mail address removed)
 

simonwittber

Ahh, I had forgotten that. Though I can't recall what an attack
might be, I think it's because the C code hasn't been fully vetted
for unexpected error conditions.

I tried out the marshal module anyway.

marshal can serialize small structures very quickly. However, using the
test value below:

value = [r for r in xrange(1000000)] + [{1:2,3:4,5:6},{"simon":"wittber"}]

marshal took 7.90 seconds to serialize it into a 5000061-byte string;
decoding took 0.08 seconds.

The aforementioned recipe took 2.53 seconds to serialize it into a
5000087-byte string; decoding took 5.16 seconds, which is much longer
than marshal!!

Sw.
 

Andrew Dalke

simonwittber said:
marshal can serialize small structures very quickly. However, using the
test value below:

value = [r for r in xrange(1000000)] + [{1:2,3:4,5:6},{"simon":"wittber"}]

marshal took 7.90 seconds to serialize it into a 5000061-byte string;
decoding took 0.08 seconds.

Strange. Here's what I found:
>>> value = [r for r in xrange(1000000)] + [{1:2,3:4,5:6},{"simon":"wittber"}]
>>> import time, marshal
>>> t1=time.time();s=marshal.dumps(value);t2=time.time()
>>> t2-t1
0.22474002838134766
>>> len(s)
5000061
>>> t1=time.time();new_value=marshal.loads(s);t2=time.time()
>>> t2-t1
0.3606879711151123
>>> new_value == value
True

I can't reproduce your large times for marshal.dumps. Could you
post your test code?

Andrew
(e-mail address removed)
 

simonwittber

I can't reproduce your large times for marshal.dumps. Could you
post your test code?


Certainly:

import sencode
import marshal
import time

value = [r for r in xrange(1000000)] + [{1:2,3:4,5:6},{"simon":"wittber"}]

t = time.clock()
x = marshal.dumps(value)
print "marshal enc T:", time.clock() - t

t = time.clock()
x = marshal.loads(x)
print "marshal dec T:", time.clock() - t

t = time.clock()
x = sencode.dumps(value)
print "sencode enc T:", time.clock() - t
t = time.clock()
x = sencode.loads(x)
print "sencode dec T:", time.clock() - t
 

Andrew Dalke

simonwittber posted his test code.

I took the code from the cookbook, called it "sencode", and
added these two lines:

dumps = encode
loads = decode


I then ran your test code (unchanged except that my newsreader
folded the "value = ..." line) and got

marshal enc T: 0.21
marshal dec T: 0.4
sencode enc T: 7.76
sencode dec T: 11.56

This is with Python 2.3; the stock one provided by Apple
for my Mac.

I expected the numbers to be like this because the marshal
code is used to make and read the .pyc files and is supposed
to be pretty fast.

BTW, I tried the performance approach I outlined earlier.
The numbers aren't much better

marshal enc T: 0.2
marshal dec T: 0.38
sencode2 enc T: 7.16
sencode2 dec T: 9.49


I changed the format a little bit; dicts are treated a bit
differently.


from struct import pack, unpack
from cStringIO import StringIO

class EncodeError(Exception):
    pass

class DecodeError(Exception):
    pass

def encode(data):
    f = StringIO()
    _encode(data, f.write)
    return f.getvalue()

def _encode(data, write, pack = pack):
    # The original code uses the equivalent of "type(data) is list";
    # I preserve that behavior

    T = type(data)

    if T is int:
        write("I")
        write(pack("!i", data))
    elif T is list:
        write("L")
        write(pack("!L", len(data)))
        # Assumes len and 'for ... in' aren't lying
        for item in data:
            _encode(item, write)
    elif T is tuple:
        write("T")
        write(pack("!L", len(data)))
        # Assumes len and 'for ... in' aren't lying
        for item in data:
            _encode(item, write)
    elif T is str:
        write("S")
        write(pack("!L", len(data)))
        write(data)
    elif T is long:
        # store longs as hex digits; strip the "0x" prefix and trailing "L"
        s = hex(data)[2:-1]
        write("B")
        write(pack("!i", len(s)))
        write(s)
    elif T is type(None):
        write("N")
    elif T is float:
        write("F")
        write(pack("!f", data))
    elif T is dict:
        write("D")
        write(pack("!L", len(data)))
        for k, v in data.items():
            _encode(k, write)
            _encode(v, write)
    else:
        raise EncodeError((data, T))


def decode(s):
    """
    Decode a binary string into the original Python types.
    """
    buffer = StringIO(s)
    return _decode(buffer.read)

def _decode(read, unpack = unpack):
    code = read(1)
    if code == "I":
        return unpack("!i", read(4))[0]
    if code == "D":
        # keys and values are interleaved: decode 2*size items, then pair them
        size = unpack("!L", read(4))[0]
        x = [_decode(read) for i in range(size*2)]
        return dict(zip(x[0::2], x[1::2]))
    if code == "T":
        size = unpack("!L", read(4))[0]
        return tuple([_decode(read) for i in range(size)])
    if code == "L":
        size = unpack("!L", read(4))[0]
        return [_decode(read) for i in range(size)]
    if code == "N":
        return None
    if code == "S":
        size = unpack("!L", read(4))[0]
        return read(size)
    if code == "F":
        return unpack("!f", read(4))[0]
    if code == "B":
        size = unpack("!L", read(4))[0]
        return long(read(size), 16)
    raise DecodeError(code)


dumps = encode
loads = decode


I wonder if this could be improved by a "struct2" module
which could compile a pack/unpack format once. Eg,

float_struct = struct2.struct("!f")

float_struct.pack(f)
return float_struct.unpack('?\x80\x00\x00')[0]

which might be the same as

return float_struct.unpack1('?\x80\x00\x00')
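A pure-Python mock shows the shape of the idea (struct2 is hypothetical
here, and this version only caches the format string; the real win
would need the format parsed once, in C):

import struct

class Struct2(object):
    def __init__(self, fmt):
        self.fmt = fmt
        self.size = struct.calcsize(fmt)   # precomputed record size
    def pack(self, *values):
        return struct.pack(self.fmt, *values)
    def unpack(self, data):
        return struct.unpack(self.fmt, data)
    def unpack1(self, data):
        # convenience for single-value formats: no [0] at the call site
        return struct.unpack(self.fmt, data)[0]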



Andrew
(e-mail address removed)
 

simonwittber

Andrew said:
This is with Python 2.3; the stock one provided by Apple
for my Mac.

Ahh, that is the difference. I'm running Python 2.4. I've checked my
benchmarks on a friend's machine, also running Python 2.4, and got the
same results as on my machine.
I expected the numbers to be like this because the marshal
code is used to make and read the .pyc files and is supposed
to be pretty fast.

It would appear that the new version 1 format introduced in Python 2.4
is much slower than version 0, when using the dumps function.
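For anyone who wants to check, marshal.dumps grew an optional version
argument in 2.4, so the two formats can be timed side by side (a quick
sketch):

import time, marshal
value = [r for r in xrange(1000000)] + [{1:2,3:4,5:6},{"simon":"wittber"}]
for version in (0, 1):
    t = time.clock()
    marshal.dumps(value, version)   # force a specific format version
    print "marshal version", version, "enc T:", time.clock() - t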

Thanks for your feedback Andrew!

Sw.
 

Andrew Dalke

simonwittber said:
It would appear that the new version 1 format introduced in Python 2.4
is much slower than version 0, when using the dumps function.

Interesting. Hadn't noticed that change. Is dump(StringIO()) as
slow?

Andrew
(e-mail address removed)
 

Reinhold Birkenfeld

Ahh, that is the difference. I'm running Python 2.4. I've checked my
benchmarks on a friend's machine, also running Python 2.4, and got the
same results as on my machine.


It would appear that the new version 1 format introduced in Python 2.4
is much slower than version 0, when using the dumps function.

Not so for me. My benchmarks suggest no change between 2.3 and 2.4.

Reinhold
 

mdoukidis

Running stest.py produced these results for me:

marshal enc T: 12.5195908977
marshal dec T: 0.134508715493
sencode enc T: 3.75118904777
sencode dec T: 5.86602012267
11.9369997978
0.109000205994
True

Python 2.4.1 (#65, Mar 30 2005, 09:13:57) [MSC v.1310 32 bit (Intel)]
on win32

Notice the slow "marshal enc"oding.
Overall this recipe is faster than marshal for me.

Mark
 

Paul Rubin

It appears to work faster than pickle; however, the decode process is
much slower (5x) than the encode process. Has anyone got any tips on
ways I might speed this up?

I think you should implement it as a C extension and/or write a PEP.
This has been an unfilled need in Python for a while (SF RFE 467384).

Note that using marshal is inappropriate, not only for security
reasons, but because marshal is explicitly NOT guaranteed to
interoperate across differing Python versions. You cannot assume that
an object marshalled in Python 2.4 will unmarshal correctly in 2.5.
 

simonwittber

I think you should implement it as a C extension and/or write a PEP.
This has been an unfilled need in Python for a while (SF RFE 467384).

I've submitted a proto-PEP to python-dev. It's coming up against many of
the same objections as the RFE.

Sw.
 

Paul Rubin

I've submitted a proto-PEP to python-dev. It's coming up against many of
the same objections as the RFE.

See also bug# 471893 where jhylton suggests a PEP. Something really
ought to be done about this.
 

simonwittber

See also bug# 471893 where jhylton suggests a PEP. Something really
ought to be done about this.

I know this, you know this... I don't understand why the suggestion is
meeting so much resistance. This is something I needed for a real-world
system which moves lots of data around to untrusted clients. Surely
other people have had similar needs? Pickle and xmlrpclib simply are
not up to the task, but perhaps Joe Programmer is content to use a
pickle and not care about the security issues.

Oh well. I'm not sure what I can say to make the case any clearer...


Sw.
 

Paul Rubin

I know this, you know this... I don't understand why the suggestion is
meeting so much resistance. This is something I needed for a real-world
system which moves lots of data around to untrusted clients. Surely
other people have had similar needs? Pickle and xmlrpclib simply are
not up to the task, but perhaps Joe Programmer is content to use a
pickle and not care about the security issues.

I don't think there's serious objection to a PEP, but I don't read
python-dev. Maybe there was objection to some specific technical
point in your PEP. Why don't you post it here?
 

simonwittber

Ok, I've attached the proto PEP below.

Comments on the proto PEP and the implementation are appreciated.

Sw.



Title: Secure, standard serialization of simple Python types.

Abstract

This PEP suggests the addition of a module to the standard library,
which provides a serialization class for simple Python types.


Copyright

This document is placed in the public domain.


Motivation

The standard library currently provides two modules which are used
for object serialization. Pickle is not secure by its very nature,
and the marshal module is clearly marked as being not secure in the
documentation. The marshal module does not guarantee compatibility
between Python versions. The proposed module will only serialize
simple built-in Python types, and provide compatibility across
Python versions.

See RFE 467384 (on SourceForge) for more discussion on the above
issues.


Specification

The proposed module should use the same API as the marshal module.

dump(value, file)
    # serialize value and write it to the open file object
load(file)
    # read data from the file object, unserialize it, and return an object
dumps(value)
    # return the string that would be written to the file by dump
loads(string)
    # unserialize the string and return an object

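A round trip with the proposed API would look like this (illustrative,
using the gherkin module named below):

import gherkin

record = {"name": "example", "version": (1, 0), "tags": ["a", "b"]}
s = gherkin.dumps(record)
assert gherkin.loads(s) == record

f = open("record.dat", "wb")
gherkin.dump(record, f)    # the file-object form of the same API
f.close()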

Reference Implementation

http://metaplay.dyndns.org:82/~simon/gherkin.py.txt


Rationale

The marshal documentation explicitly states that it is unsuitable
for unmarshalling untrusted data. It also explicitly states that
the format is not compatible across Python versions.

Pickle is compatible across versions, but also unsafe for loading
untrusted data. Exploits demonstrating pickle vulnerability exist.
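For example, the classic demonstration that unpickling can execute
arbitrary code:

import pickle, os

class Evil(object):
    def __reduce__(self):
        # pickle faithfully records this callable and its arguments...
        return (os.system, ("echo owned",))

# ...and loads() calls it:
pickle.loads(pickle.dumps(Evil()))   # runs "echo owned"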

xmlrpclib provides serialization functions, but is unsuitable when
serializing large data structures, or when high performance is a
requirement. If performance is an issue, a C-based accelerator
module can be installed. If size is an issue, gzip can be used;
however, this creates a mutually exclusive size/performance
trade-off.

Other existing formats, such as JSON and Bencode (BitTorrent), do
not handle some marginally complex Python structures and/or all of
the simple Python types.

Time and space efficiency and security do not have to be mutually
exclusive features of a serializer. The standard library does not
currently provide a serializer which works safely with untrusted
data and is also time and space efficient. The proposed gherkin
module goes some way towards achieving this. The format is simple
enough to easily write interoperable implementations across
platforms.
 
