Internals and complexity of types, containers and algorithms

H

Harald Luessen

Hi, I am new to python and I miss some understanding of the internals
of some types and containers. With my C/C++ background I hope to get
some hints to chose the best data structure for my programs. Here are
some questions:

- Is there a brief description (not source) how the types tuple,
string, list and dict are represented internally. Is a list behind
the scenes just a double linked list or an array or a mixture of
these things? Is a dict a tree or a hash array and what is the
collision mechanism? Is the string an array with some header info?

- What is the big-O complexity of the most common algorithms for
these types and containers? Is it O(n), O(n*log(n)) or O(n**2)?
I mean inserting, appending (front or back), finding something
and so on.

- The same questions for important and common library containers
if you can name some.

- Is this information somewhere in the web? Why is it not written
in the documentation?

- When I want to use a representation of a game board like chess
in C/C++ I use an array[64] or bigger of int or char for the pieces.
What data structure or type would be useful in Python when the
performance ist most important? Is it list or string or an array
from a library or what else?

Harald
 
?

=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=

- Is there a brief description (not source) how the types tuple,
string, list and dict are represented internally. Is a list behind
the scenes just a double linked list or an array or a mixture of
these things?

Sure, see below:

- tuples are represented as arrays, with a single block for the
entire objects (object header, tuple size, and data)
- list are represented as arrays, with two memory blocks:
one for object header and sizes, and the other one for the
"guts", i.e. the actual data. The list uses over-allocation,
to avoid resizing on each addition.
- strings are implemented as arrays, with a single block for
the entire string. In addition to header, size, and data,
it also contains a cached hash and a pointer to the interned
version of the string (if any).
- dicts are implemented as hash tables, with open addressing.
Is a dict a tree or a hash array and what is the
collision mechanism?

It's a hash array. Not sure what you mean by "collision
mechanism" - perhaps "open addressing" is the answer?
In case you also want to know the probing:

j(n+1) = 5*j(n) + 1 + perturb(n)
perturb(n+1) = perturb(n) >> 5
Is the string an array with some header info?

Exactly so.
- What is the big-O complexity of the most common algorithms for
these types and containers? Is it O(n), O(n*log(n)) or O(n**2)?
I mean inserting, appending (front or back), finding something
and so on.

O(1) for indexed access in lists, tuples, and strings, and for
computing the length of any container.
O(n) for finding in lists, tuples, and strings. O(1) for inserting
and finding in dictionaries, assuming few collisions, except
when rehashing occurs. Amortized O(1) for inserting into lists.
O(n) for removing from lists.
O(1) for inserting inot
- Is this information somewhere in the web?

Most likely.
Why is it not written in the documentation?

Because nobody has contributed such documentation.
- When I want to use a representation of a game board like chess
in C/C++ I use an array[64] or bigger of int or char for the pieces.
What data structure or type would be useful in Python when the
performance ist most important? Is it list or string or an array
from a library or what else?

Depends on whether and how you want to modify the chessboard.
If you do want to modify, use a list.

Regards,
Martin
 
J

James Stroud

Harald said:
Hi, I am new to python and I miss some understanding of the internals
of some types and containers. With my C/C++ background I hope to get
some hints to chose the best data structure for my programs. Here are
some questions:

This depends on how you define "best". If you want speed and
optimization, you can use the numpy package built with ATLAS tuned for a
specific machine.

Beyond speed, "best" in the python community usually means "most suited"
from an idiomatic perspective and from the perspective of structure that
lends itself to long term maintainability because [C]python data
structures seem to undergo optimizations in their implementation at each
revision.

Your best bet is probably to forget about implementation and write code
that makes sense. For example, some have suggested a tuple-keyed
dictionary to represent a chess board:

board = ((c,r) for r in xrange(1, 9) for c in 'abcdefgh')
starting = 'RNBQKBNR' + 'P' * 8 + ' ' * 32 + 'p' * 8 + 'rnbqkbnr'
position = dict(zip(board, starting))

Of course we see here the chief problem with algebraic notation: it
begins numbering at 1 which is painfully reminiscient of fortran.

James
 
H

Harald Luessen

This depends on how you define "best". If you want speed and
optimization, you can use the numpy package built with ATLAS tuned for a
specific machine.

Are there arrays in numpy that can efficiently be used for
other things than matrix arithmetic? Are they faster than lists
but still easy to use?

no_piece = 0
wpawn = 1
....
board[square] = no_piece
board[square+8] = pawn
....

could be a typical pawn move.
Beyond speed, "best" in the python community usually means "most suited"
from an idiomatic perspective and from the perspective of structure that
lends itself to long term maintainability because [C]python data
structures seem to undergo optimizations in their implementation at each
revision.

I like the python way of "best" code. But in this particular
question the emphasis was on performance and speed.
Your best bet is probably to forget about implementation and write code
that makes sense. For example, some have suggested a tuple-keyed
dictionary to represent a chess board:

board = ((c,r) for r in xrange(1, 9) for c in 'abcdefgh')
starting = 'RNBQKBNR' + 'P' * 8 + ' ' * 32 + 'p' * 8 + 'rnbqkbnr'
position = dict(zip(board, starting))

I am not new to board game programming and I am used to
an array approach or even bitboards. Therefore the dictionary
looks a little bit strange to me.

Harald
 
H

Harald Luessen

Sure, see below:

- tuples are represented as arrays, with a single block for the
entire objects (object header, tuple size, and data)
- list are represented as arrays, with two memory blocks:
one for object header and sizes, and the other one for the
"guts", i.e. the actual data. The list uses over-allocation,
to avoid resizing on each addition.
- strings are implemented as arrays, with a single block for
the entire string. In addition to header, size, and data,
it also contains a cached hash and a pointer to the interned
version of the string (if any).
- dicts are implemented as hash tables, with open addressing.
.... and more interesting things ...

Thank you. That was the information I was looking for.
I just forgot to ask for sets.

Harald
 
J

Josiah Carlson

Harald said:
... and more interesting things ...

Thank you. That was the information I was looking for.
I just forgot to ask for sets.

Generally, sets are dictionaries without values (and some tweaks related
to handling set operations).

- Josiah
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,996
Messages
2,570,238
Members
46,826
Latest member
robinsontor

Latest Threads

Top