How to use a 5 or 6 bit integer in Python?


Glen Wheeler

Hello all,

My program uses many millions of integers, and Python is allocating
way too much memory for these. I can't afford the performance hit of
going to disk, so I figured I'd write a C extension to define a new type.
Problem is, my C knowledge is years old, and regardless of my
attempts distutils will not recognise my installation of the MS
compiler.
So I am wondering: is there any easier way to use a 5 or 6 bit integer
in Python? Even a regular 8-bit integer would be fine. I only need to
represent either 32 or 64 distinct numbers.

thanks,
Glen
 

Bengt Richter

Glen Wheeler wrote:
> Hello all,
>
> My program uses many millions of integers, and Python is allocating
> way too much memory for these. I can't afford the performance hit of
> going to disk, so I figured I'd write a C extension to define a new type.
> Problem is, my C knowledge is years old, and regardless of my
> attempts distutils will not recognise my installation of the MS
> compiler.
> So I am wondering: is there any easier way to use a 5 or 6 bit integer
> in Python? Even a regular 8-bit integer would be fine. I only need to
> represent either 32 or 64 distinct numbers.
You can store them efficiently in an array, e.g., for unsigned bytes
(or 'b' in place of 'B' for signed):
>>> import array
>>> bytes = array.array('B', range(10))
>>> bytes
array('B', [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> bytes[3]
3

We can only speculate on further help, until you tell us what you are doing ;-)

Regards,
Bengt Richter
 

Paul Rubin

Glen Wheeler said:
> My program uses many millions of integers, and Python is allocating
> way too much memory for these. I can't afford the performance hit of
> going to disk, so I figured I'd write a C extension to define a new type.
> Problem is, my C knowledge is years old, and regardless of my
> attempts distutils will not recognise my installation of the MS
> compiler.
> So I am wondering: is there any easier way to use a 5 or 6 bit integer
> in Python? Even a regular 8-bit integer would be fine. I only need to
> represent either 32 or 64 distinct numbers.

Look at the docs for the array module.
 

Ben Finney

Glen Wheeler wrote:
> My program uses many millions of integers, and Python is allocating
> way too much memory for these.

How have you measured this? As pointed out by others, Python doesn't
keep a separate copy of every occurrence of a small integer; it refers
to the same cached object multiple times.

Given this, I'm curious how you've measured the memory usage caused by
your program using millions of integers, and why you think reducing the
size of those integers will help significantly.
 

Rainer Deyke

Glen said:
> Hello all,
>
> My program uses many millions of integers, and Python is allocating
> way too much memory for these. I can't afford the performance hit of
> going to disk, so I figured I'd write a C extension to define a new type.
> Problem is, my C knowledge is years old, and regardless of my
> attempts distutils will not recognise my installation of the MS
> compiler.
> So I am wondering: is there any easier way to use a 5 or 6 bit integer
> in Python? Even a regular 8-bit integer would be fine. I only need to
> represent either 32 or 64 distinct numbers.

You're trying to solve the wrong problem. Python caches small integers, so
you've only got a very small number of distinct integer objects. This is a
good thing, because each Python object has an eight-byte overhead.
Introducing a new type wouldn't result in any memory savings.

The real overhead comes from the references to those objects. Each
reference is four bytes (on 32-bit computers), and there is no way to
decrease this. Your best option is to use high-level objects that contain
many integers internally but don't store them as individual Python
references. The array module is one choice, Numarray is another. If
immutability isn't an issue, you could even use strings (with each
character treated as an eight-bit integer).
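
For illustration, a minimal sketch of the three options (sizes in the
comments are rough figures for a 32-bit build of this era, not
measurements):

import array

values = [i % 64 for i in range(1000000)]     # a million small ints

as_list = list(values)                        # 4 bytes per reference, plus
                                              # the shared int objects
as_array = array.array('B', values)           # 1 byte per element, no
                                              # per-element object
as_str = ''.join([chr(v) for v in values])    # also 1 byte per element,
                                              # and immutable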
 

Glen Wheeler

Rainer Deyke wrote:
> You're trying to solve the wrong problem. Python caches small integers, so
> you've only got a very small number of distinct integer objects. This is a
> good thing, because each Python object has an eight-byte overhead.
> Introducing a new type wouldn't result in any memory savings.

Thanks. I had thought as much, but was not totally sure.
> The real overhead comes from the references to those objects. Each
> reference is four bytes (on 32-bit computers), and there is no way to
> decrease this. Your best option is to use high-level objects that contain
> many integers internally but don't store them as individual Python
> references. The array module is one choice, Numarray is another. If
> immutability isn't an issue, you could even use strings (with each
> character treated as an eight-bit integer).

So the references to the integers are what cause the massive
overhead.
I've read up on Numarray and the array module in the docs, but
didn't see how I could adapt my current program to use these modules
because of how my data is currently organised.
I have one constantly changing dict with many millions of keys
(tuples of ints), each of which maps to a tuple of ints and two more
dicts.
Every single container type contains only integers. Am I correct in
assuming that an array cannot be used as a key for a dictionary, since
arrays are mutable?

Thanks for your help,
Glen
 

Glen Wheeler

Ben Finney wrote:
> How have you measured this? As pointed out by others, Python doesn't
> keep a separate copy of every occurrence of a small integer; it refers
> to the same cached object multiple times.
>
> Given this, I'm curious how you've measured the memory usage caused by
> your program using millions of integers, and why you think reducing the
> size of those integers will help significantly.

I realise now that I was wrong in assuming this. However, Python
does use more memory than it needs to store the data I'm
manipulating; this is probably down to Python's container types,
dictionaries and tuples.
Reducing the amount of memory allocated by any method is my goal
right now, along with increasing the speed at which my program runs.
I'm using one dictionary to store all of the data, with tuples of
ints as keys and a tuple as the data associated with each key. The
data tuples contain a few ints and two more dictionaries.
Can you think of a way to reduce the memory used here?

Thanks,
Glen
 

Bengt Richter

Glen Wheeler wrote:
> Thanks. I had thought as much, but was not totally sure.
>
> So the references to the integers are what cause the massive
> overhead.
> I've read up on Numarray and the array module in the docs, but
> didn't see how I could adapt my current program to use these modules
> because of how my data is currently organised.
> I have one constantly changing dict with many millions of keys
> (tuples of ints), each of which maps to a tuple of ints and two more
> dicts.
> Every single container type contains only integers. Am I correct in
> assuming that an array cannot be used as a key for a dictionary, since
> arrays are mutable?

Yes, but you might be able to do better than tuples, depending on what
the ordering means and whether the same number can repeat in a tuple, etc.

If we can tease out a little more about your problem, maybe we can help better ;-)
E.g., how do your tuple-keys come into being and how many times are they re-used?
If there is no nesting, you could create a string key just by ''.join([chr(el) for el in tup])
which wouldn't be as compact as a compressed bit string, but would be smaller than the tuple.
If you were lucky, a change could gain you both time and space.
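
A minimal sketch of that string-key idea (illustrative helper names;
assumes every element fits in range(256)):

def key_from_tuple(tup):
    # encode a tuple of small ints as a compact string key
    return ''.join([chr(el) for el in tup])

def tuple_from_key(key):
    # recover the original tuple when it is needed again
    return tuple([ord(ch) for ch in key])

big = {}
big[key_from_tuple((3, 17, 42))] = 'payload'
assert tuple_from_key(key_from_tuple((3, 17, 42))) == (3, 17, 42)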

I'm curious what your dicts and their int tuples represent in the real world ;-)

Regards,
Bengt Richter
 

Gabriel Genellina

Glen Wheeler wrote:
> I have one constantly changing dict with many millions of keys
> (tuples of ints), each of which maps to a tuple of ints and two more
> dicts.

How big are the integers? If 16 bits each are enough, you could replace
those 2-element tuples with a plain integer: (a, b) -> a << 16 | b.
You don't waste the space of a tuple object (and its construction time)
plus two references for each dictionary key, though you eventually create
many more integer instances, and there is a penalty on each dictionary
access since you have to compute the key.
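
A minimal sketch of that packing (illustrative helper names; assumes both
values fit in 16 unsigned bits):

def pack_pair(a, b):
    # fold two 16-bit ints into a single dictionary key
    return (a << 16) | b

def unpack_pair(key):
    return key >> 16, key & 0xFFFF

d = {}
d[pack_pair(40000, 123)] = 'payload'
assert unpack_pair(pack_pair(40000, 123)) == (40000, 123)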


Gabriel Genellina
Softlab SRL
 

Rainer Deyke

Glen said:
> I have one constantly changing dict with many millions of keys
> (tuples of ints), each of which maps to a tuple of ints and two more
> dicts.

That makes things a bit harder. I expect that the big dict itself requires
at least sixteen bytes per entry. How big are your tuples? How big are the
dicts in the data tuples? What data do they contain? Are any of the tuples
or dicts shared?

Encoding the key tuples as strings is easy (''.join([chr(x) for x in
the_tuple])). Encoding the ints in the data tuples is just as easy. I'm
not sure if those are enough to solve your problem.
 

Glen Wheeler

Rainer Deyke wrote:
> That makes things a bit harder. I expect that the big dict itself requires
> at least sixteen bytes per entry. How big are your tuples?

The key tuples, for the final calculation, will range from 1 element
to 64 elements, with the majority at around 40-50 elements in size.

> How big are the dicts in the data tuples? What data do they contain?

For the first dict: increasingly smaller as the calculation
continues (i.e. as the key for that entry increases). Initially they
will have 64 keys (integers), with each integer having a 6-element
tuple of integers as its data. All integers are from 0-63.
For the second dict: trivially small, with a maximum of 32
elements; it starts at 0 and averages around 7. Keys are kept as
integers from 0-63 and data ranges from -1 to 30, but the average case
will see the data at 7, as a single integer.

> Are any of the tuples or dicts shared?

Unfortunately, no. This would be a major speedup for my program, as
copying the dictionaries and tuples takes not only a lot of memory but
a lot of time too.
But the truth is that I need these to be unique on a key-by-key
basis (context from the big dict), so I can't see any way around it.

> Encoding the key tuples as strings is easy (''.join([chr(x) for x in
> the_tuple])). Encoding the ints in the data tuples is just as easy. I'm
> not sure if those are enough to solve your problem.

I'll be sure to do this. Will it increase the hash speed for the
big dictionary significantly? Will lookup times or memory storage
requirements decrease?

Thanks,
Glen
 

Borcis

Glen said:
> I'm using one dictionary to store all of the data, with tuples of
> ints as keys and a tuple as the data associated with each key. The
> data tuples contain a few ints and two more dictionaries.
> Can you think of a way to reduce the memory used here?

What's the length of the key tuples, and are there bounds on the
component ints? Is the number of ints in the data tuples fixed? What's
in the subdictionaries, typically? "Millions of 5 bit integers" hints
at the possibility of a radical change of representation, but a radical
change of representation can only be devised if we know what's stable:
the task.
 

Rainer Deyke

Glen said:
> I'll be sure to do this. Will it increase the hash speed for the
> big dictionary significantly? Will lookup times or memory storage
> requirements decrease?

Using strings instead of tuples of integers should decrease memory usage.
A character in a string is stored as a single byte, whereas each tuple
element needs four bytes (a reference, on 32-bit computers).
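
One rough way to check this (sys.getsizeof arrived in Python versions
later than this thread; exact numbers vary by build):

import sys

t = tuple(range(50))                  # 50 small ints as a tuple
s = ''.join([chr(x) for x in t])      # the same data as a string

print(sys.getsizeof(t))   # tuple header plus one reference per element
print(sys.getsizeof(s))   # string header plus one byte per character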
 

Derrick 'dman' Hudson

Glen wrote:
> I've read up on Numarray and the array module in the docs, but
> didn't see how I could adapt my current program to use these modules
> because of how my data is currently organised.

You may need to reorganize the data in order to save space!
> I have one constantly changing dict with many millions of keys
> (tuples of ints), each of which maps to a tuple of ints and two more
> dicts.

If you state the actual task and computation you are performing then
people can suggest alternate data structures and algorithms. Simply
describing your current data structure leaves people shooting in the
dark trying to suggest alternate ways of storing the data.

Usually time and space (memory) trade off against each other --
less computation requires more storage, and less storage requires more
computation. However, there are cases where a poor choice of data
structure and/or algorithm means a restructuring can improve both
speed and memory use.

In order to construct and evaluate data structures and algorithms for
suitability, a concrete understanding of what task the program needs to
perform and what sort of data it will handle is essential.

Presently you, Glen, are the only one with this knowledge of the task
and problem space. Share that knowledge with the group and the group
will then be capable of better assisting you.

-D
 

Glen Wheeler

Borcis wrote:
> What's the length of the key tuples, and are there bounds on the
> component ints?

Component ints will either be 0-63 or 0-31. In a single run of the
program, this will not change.
The key tuples range in length from 1-64, depending on what stage
the program is currently at. Every tuple will have the same length at
some point; the greatest possible difference in length between any two
tuples is 1.

> Is the number of ints in the data tuples fixed?

Yes, there are always two ints, and another tuple of ints. One int
is arbitrary, and can be in the thousands but not much higher. The
other int is from 0-5. The tuple of ints is less than 6 elements long
and will contain either 0-63 or 0-31, depending on the parameters given
to the program at run time.

> What's in the subdictionaries, typically? "Millions of 5 bit integers"
> hints at the possibility of a radical change of representation, but a
> radical change of representation can only be devised if we know what's
> stable: the task.

Well, the subdictionaries' contents are described in another reply
of mine in this same thread. There are indeed many millions of
integers being stored at any one time, or at least many millions of
references to those integers.
Are you saying that to save any amount of memory, something in the
large dictionary must be constant? I am sure everything in there is
dynamic in nature, i.e. it will change with each iteration over the
keys in the large dictionary.
 

Glen Wheeler

Bengt Richter wrote:
> If we can tease out a little more about your problem, maybe we can help better ;-)
> E.g., how do your tuple-keys come into being and how many times are they re-used?
> If there is no nesting, you could create a string key just by ''.join([chr(el) for el in tup])
> which wouldn't be as compact as a compressed bit string, but would be smaller than the tuple.
> If you were lucky, a change could gain you both time and space.

I'll be implementing that this weekend, as my employer is expecting
results before Christmas. Damn deadlines.

> I'm curious what your dicts and their int tuples represent in the real world ;-)

Well, I'm a researcher for an Australian university (University of
Wollongong) and my current task is to enumerate every possible 6 digit
binary Gray code.
Most recently people have done the 5 digit BGCs, something which I
have also accomplished, but never the 6 digit codes.
For a graphical representation, think of a cube in n-dimensions,
where n represents the number of digits in the code. A binary Gray
code around that cube represents a path starting from any arbitrary
point and then visiting every vertex exactly once.
The applications of such research are enormous, and I'll not go
over them here, but that is my aim: to find the number of ways a 6
dimensional fly could walk a 6d cube, visiting every vertex exactly
once.
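
For readers who haven't met them: the standard binary-reflected Gray
code, just one of the many codes being counted here, can be computed
directly. A small sketch:

def gray(i):
    # i-th codeword of the binary-reflected Gray code
    return i ^ (i >> 1)

# each 3-bit word differs from its successor in exactly one bit
for i in range(8):
    print(bin(gray(i)))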
I've actually been working on this problem for many months now and
have gotten so close. The previous best estimate for calculation time
(with a program written in C, I might add) was 2.8 * 10^15 years.
I've got the thing down to about 10^4 years. If the calculation
becomes tractable on a supercomputer, e.g. Barossa hosted at
www.ac3.com.au, then I'll run it on that.

I hope that's sated your curiosity for now :). If you'd like any
more information, or if anyone else would like to know anything about
this, then I'll be happy to correspond with them. I don't know how
on-topic it is for c.l.py.
 

Hartmut Goebel

Hi,

Glen said:
> Component ints will either be 0-63 or 0-31. In a single run of the
> program, this will not change.
> The key tuples range in length from 1-64, depending on what stage
> the program is currently at. Every tuple will have the same length at
> some point; the greatest possible difference in length between any two
> tuples is 1.

I'm afraid I'll mix up some of your data structure, but here are my
thoughts:

- Pack the key tuples into a string using pack; since the values range
from 0-63, each fits in 6 bits, so you can even pack four integers into
three bytes (see the sketch below).

- If you need easy access to e.g. the last element of the tuple, keep it
separate. E.g.: key = ('xdef', 12)

- As far as I remember, you wrote something about max. 64 entries. What
about using an array/list here? An array may need a lot less memory
(depending on the number of elements, of course).
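
A minimal sketch of that 6-bit packing (illustrative helper name;
assumes all values are in range(64)):

def pack6(values):
    # pack 6-bit values into a string: four values per three bytes
    bits, nbits, out = 0, 0, []
    for v in values:
        bits = (bits << 6) | v
        nbits += 6
        while nbits >= 8:
            nbits -= 8
            out.append(chr((bits >> nbits) & 0xFF))
    if nbits:
        out.append(chr((bits << (8 - nbits)) & 0xFF))  # pad the last byte
    return ''.join(out)

assert len(pack6([1, 2, 3, 4])) == 3   # 24 bits -> 3 bytes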

Only my 2 cents.

--
Regards
Hartmut Goebel

| Hartmut Goebel | We build the crazy compilers |
| (e-mail address removed) | Compiler Manufacturer |
 

Oren Tirosh

Rainer Deyke wrote:
> Encoding the key tuples as strings is easy (''.join([chr(x) for x in
> the_tuple])). Encoding the ints in the data tuples is just as easy. I'm
> not sure if those are enough to solve your problem.

Using strings is a really, really good idea. Strings are compact, and
Python dictionaries are highly optimized for use with string keys, with
tricks like caching of the string hash and faster code paths for special
cases like dicts that have only string keys.

There is a much faster way to convert tuples of integers to strings.
Instead of using the Python tuple type, use arrays of bytes and then
apply the array object's .tostring() method:
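
For example (a small sketch; later Python versions rename this method
to .tobytes()):

>>> import array
>>> a = array.array('B', [3, 17, 42])
>>> a.tostring()
'\x03\x11*'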

It's about 35 times faster than ''.join([chr(x) for x in the_array])

Oren
 

Oren Tirosh

Glen Wheeler wrote:
> [...] with each integer having a 6-element
> tuple of integers as its data. All integers are from 0-63.
> For the second dict: trivially small, with a maximum of 32
> elements; it starts at 0 and averages around 7. Keys are kept as
> integers from 0-63 and data ranges from -1 to 30 [...]

It's your lucky day... All your integers are from -1 to 63, which fits
nicely in Python's range of "privileged" integers from -1 to 100.
These are handled much faster: they come from a static table of integer
objects, without the overhead of allocation and deallocation.
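
A quick way to see the caching at work (a CPython implementation detail;
the exact cached range varies between versions):

>>> a = 100
>>> b = 99 + 1
>>> a is b          # the same cached object, not merely equal values
True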

Oren
 

Anton Vredegoor

Glen Wheeler said:
> Well, I'm a researcher for an Australian university (University of
> Wollongong) and my current task is to enumerate every possible 6 digit
> binary Gray code.

It is possible to write a function that turns every integer into its
"reflected" form, so that the normal enumeration of the integers
corresponds to a Gray code sequence (successors differ in only one bit
position). This can be done without enumerating them all.

Here's some code from a book by Kreher and Stinson (Combinatorial
Algorithms) which I translated into Python from the pseudo-code in the
book. I made some adaptations and came up with this:

import sets   # Python 2's sets module; newer Pythons have the built-in set

def unrank(n, r):
    # Return the rank-r Gray codeword of length n as the list of
    # positions holding a 1 bit.
    T, b1 = [], 0
    for i in range(n-1, -1, -1):
        b0 = r / (2**i)
        if b0 != b1:
            T.append(n-i-1)
            b1 = b0
        r -= b0 * 2**i
    return T

def rank(n, T):
    # Inverse of unrank: recover the rank of the codeword whose
    # 1-bit positions are given in T.
    S = sets.Set(T)
    r, b = 0, 0
    for i in range(n-1, -1, -1):
        if n-i-1 in S:
            b = 1 - b
        if b == 1:
            r += 2**i
    return r

def reflect(n, r):
    # The rank-r codeword as a plain integer.
    S = sets.Set(unrank(n, r))
    L = ["01"[i in S] for i in range(n)]
    return int("".join(L), 2)

def test():
    n = 2**6
    T = [None]
    for i in range(100):
        U = unrank(n, i)
        print i, reflect(n, i), U
        assert rank(n, U) == i
        # successive codewords differ in exactly one bit position
        assert len(sets.Set(T) ^ sets.Set(U)) == 1
        T = U

if __name__ == '__main__':
    test()

Glen Wheeler wrote:
> I hope that's sated your curiosity for now :). If you'd like any
> more information, or if anyone else would like to know anything about
> this, then I'll be happy to correspond with them. I don't know how
> on-topic it is for c.l.py.

If it's interesting and provides an excuse for exercising our
programming skills in our favorite language, it's on topic. It's even
on topic if it's outrageously off-topic enough, so take your pick :)

I guess I must have missed something, because if the problem's solution
can be looked up in a book it's probably not what you are looking
for. So why is it not enough to have an indexing function, and why do
you need to have all the values literally?

Anton
 
