Steven D'Aprano said:
If you have a million items, then the odds that those
million items happen to be sorted (the worst-case order) are 1 in a
million factorial. That's a rather small number, small enough that we can
discount it from ever happening, not in a million lifetimes of the
Universe.
If the items are coming from some inherently random process, then I
agree. But, not all data sources are random.
Imagine you had a web site, and wanted to store various bits of data
from the stream of input requests. You decided that the data structure
you were going to use was a balanced tree. If your keys were, say, a
hash of the client's IP address, then it's a pretty good guess they're
random. But, what if the keys were the time the request was received?
Those are obviously not random, and using them as keys in a balanced
tree would give you sub-optimal performance.
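To see why in-order keys are hostile to tree indexes, here's a minimal
sketch (plain Python, nothing to do with any real database's index code):
in a naive, unbalanced BST, monotonically increasing keys degenerate into
a linked list, and a self-balancing tree only avoids that by spending
rebalancing work concentrated on one edge of the tree.

    import random

    class Node:
        def __init__(self, key):
            self.key = key
            self.left = None
            self.right = None

    def insert(root, key):
        # Naive, unbalanced BST insert -- no rebalancing at all.
        if root is None:
            return Node(key)
        if key < root.key:
            root.left = insert(root.left, key)
        else:
            root.right = insert(root.right, key)
        return root

    def depth(root):
        if root is None:
            return 0
        return 1 + max(depth(root.left), depth(root.right))

    n = 500
    sequential = list(range(n))            # e.g. request timestamps
    shuffled = random.sample(range(n), n)  # e.g. hashed IP addresses

    tree = None
    for k in sequential:
        tree = insert(tree, k)
    print("depth with sequential keys:", depth(tree))  # n -- a linked list

    tree = None
    for k in shuffled:
        tree = insert(tree, k)
    print("depth with random keys:", depth(tree))      # roughly 2*log2(n)

A balanced tree keeps the depth logarithmic either way, but with the
sequential keys it pays for that with extra rebalancing work, which is
the kind of overhead described next.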
This is not a hypothetical scenario. Songza uses MongoDB as our
database. The indexes are balanced trees. One of our biggest
collections has a multi-component key, one of the components being the
request time truncated to the hour. Insertion time into that collection
has an obvious sawtooth shape, with performance degrading as each hour
progresses and jumping back up as the time rolls over to the next hour.
This is due (we believe) to the constant rebalancing of the index trees.
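For illustration only (the field names and values here are made up, not
our actual schema), the time component of that key looks roughly like
this:

    from datetime import datetime

    def index_key(request_time, other_component):
        # Truncate the request time to the hour; every request within the
        # same hour shares this prefix, so inserts cluster in one region
        # of the index until the hour rolls over.
        hour = request_time.replace(minute=0, second=0, microsecond=0)
        return (hour, other_component)

    print(index_key(datetime(2014, 3, 9, 14, 37, 22), "example"))
    # (datetime.datetime(2014, 3, 9, 14, 0), 'example')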
Almost-sorted data sets are very common. For example, here's a pipeline
I run a lot (from memory, could have gotten some detail wrong):
grep pattern logfile | awk '{print $7}' | sed 's/:[0-9][0-9]$//' | sort | uniq -c
Field 7 is the timestamp for when a request came in. What I'm doing
here is counting how many requests of a certain type came in during each
minute of the day. Logging isn't exactly in chronological order, so I
need to sort the times before I count them. But, it's in *almost*
chronological order; a sort that had pathologically bad behavior on
almost-sorted data would be unusable here.
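For what it's worth, Python's own sort handles this case about as well as
you could hope. Here's a rough sketch that counts comparisons (the exact
numbers are illustrative and will vary from run to run): on almost-sorted
input it does far less work than on shuffled input, whereas something like
a naive first-element-pivot quicksort would go quadratic on exactly this
kind of data.

    import random

    class CountingInt:
        """Wrap an int so we can count how many comparisons a sort makes."""
        comparisons = 0
        def __init__(self, value):
            self.value = value
        def __lt__(self, other):
            CountingInt.comparisons += 1
            return self.value < other.value

    def comparisons_to_sort(values):
        CountingInt.comparisons = 0
        sorted(CountingInt(v) for v in values)
        return CountingInt.comparisons

    n = 100_000

    # Almost chronological: sorted, with a handful of entries displaced,
    # roughly like log lines that arrive slightly out of order.
    almost_sorted = list(range(n))
    for _ in range(100):
        i, j = random.randrange(n), random.randrange(n)
        almost_sorted[i], almost_sorted[j] = almost_sorted[j], almost_sorted[i]

    shuffled = random.sample(range(n), n)

    print("almost sorted:", comparisons_to_sort(almost_sorted))
    # far fewer -- the sort exploits the existing runs
    print("shuffled:     ", comparisons_to_sort(shuffled))
    # roughly n * log2(n)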