Tuples and immutability


Roy Smith

Marko Rauhamaa said:
The O(f(n)) notation has no meaning when n is limited.

This thing is not just pedantry. The discussion was how a balanced tree
fares in comparison with hash tables. A simplistic O(1) vs O(log n) was
presented as a proof that balanced trees don't have a chance in practice
or theory. I wasn't so easily convinced.


Marko

On the other hand, log n, for n = 1 million, is just 20. It's not hard
to imagine a hash function which costs 20x what a node traversal does,
in which case, the log n lookup is ahead for all n < 1 million.

Looking at the Songza source, I see we have one user-defined hash
function:

def __hash__(self):
    return hash((self.song,
                 self.user,
                 self.add_time,
                 self.delete_time,
                 self.play_first))

where those things are 2 bson.ObjectId's, 2 datetimes, and a boolean. I
wouldn't be surprised if that falls into the 20x node traversal bucket.
In this case, convenience was more important than speed, so it doesn't
matter.
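
Out of curiosity, a rough way to check how expensive such a composite
__hash__ is would be an informal timing like the one below. The values are
hypothetical stand-ins (plain strings instead of bson.ObjectId), not the
real Songza objects:

import timeit
from datetime import datetime

# Hypothetical stand-ins for the real fields (bson.ObjectId in the actual code).
song, user = "objectid-one", "objectid-two"
add_time, delete_time = datetime.now(), datetime.now()
play_first = True

hash_cost = timeit.timeit(
    lambda: hash((song, user, add_time, delete_time, play_first)),
    number=100000)
compare_cost = timeit.timeit(lambda: song < user, number=100000)
print("composite hash:", hash_cost, "single compare:", compare_cost)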

The hash vs. tree argument can get very complicated. For example, if
your tree is not completely resident in memory, the cost of paging in a
node will swamp everything else, and improving lookup speed will boil
down to reducing the number of I/O operations you need. An expensive
hash plus a linear walk through a collision chain which was resident in
a single memory block will beat traversing two nodes, each of which had
to be paged in separately.
 

Rustom Mody

You are right that you and Steven have had a hard time communicating.
You are part of "you and Steven"; it would be at least polite to
consider that part of the reason for the difficulty has to do with your
style. It can be brief and contrarian, which puts people off. Perhaps
if you tried to understand the gap and bridge it more, people would be
less inclined to think that you were trying to widen the gap.


Hi Ned

As you know, on the whole I am thankful that you keep some order on this list.

I find it strange and one-sided, however, that you pull up Marko and not Steven, given:

1. Steven's response is almost entirely vituperative, and that is in response to

2. Being pointed out that a finite-input table-lookup being called a
hash-function is a rather nonsensical claim and goes counter
to the basic tenets of asymptotic notation. (In CS unlike in math 'asymptote' is always infinity) IOW

3. If I start off with "I am going to out-pedant you..." and then goof up my
attempted pedantry, whose fault is it?
 

Chris Angelico

Roy Smith said:
The hash vs. tree argument can get very complicated. For example, if
your tree is not completely resident in memory, the cost of paging in a
node will swamp everything else, and improving lookup speed will boil
down to reducing the number of I/O operations you need. An expensive
hash plus a linear walk through a collision chain which was resident in
a single memory block will beat traversing two nodes, each of which had
to be paged in separately.

Indeed, which is broadly an extension of the "cache locality" argument.

I've never actually yearned for any of the advanced operations a tree
can give (over a hash table). Usually, by the time I'm looking for
that sort of thing, I really want an on-disk database - that solves
all the problems of paging (the DBMS will make sure it reads in a
minimum of index and data pages), and a good SQL database can handle
multiple indexes in a space-efficient way. Plus, what you said about
log 1,000,000? By the time you're looking at a million records, you
probably (a) need to have them on disk somewhere anyway, and (b) don't
want to read them all into RAM before you begin.

ChrisA
 

Chris Angelico

Roy Smith said:
Looking at the Songza source, I see we have one user-defined hash
function:

def __hash__(self):
    return hash((self.song,
                 self.user,
                 self.add_time,
                 self.delete_time,
                 self.play_first))

where those things are 2 bson.ObjectId's, 2 datetimes, and a boolean. I
wouldn't be surprised if that falls into the 20x node traversal bucket.
In this case, convenience was more important than speed, so it doesn't
matter.

The only difference between a tree and a hash here is that the tree
might be able to short-cut the comparisons. But if there are a whole
bunch of songs with the same "song" and "user", then the tree has to
compare (song->song? same; user->user? same; add_time->add_time?
left/right) multiple times. So what you're saying is that the hash
table has consistency but the tree could do better or could do worse.
(Imagine, worst case, all one million records have the same
song/user/add_time and you need to do twenty comparisons involving
four fields. That's gotta be worse than one hashing of five fields.)
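
A rough way to see that cost from Python: count field comparisons while
binary-searching a sorted list, as a stand-in for the tree's ~log2(n)
levels. The field values below are made up, not Songza's:

import bisect

class Counted(str):
    """A string that counts how many times it is compared."""
    calls = 0
    def __eq__(self, other):
        Counted.calls += 1
        return str.__eq__(self, other)
    def __lt__(self, other):
        Counted.calls += 1
        return str.__lt__(self, other)
    __hash__ = str.__hash__

# 100,000 records sharing song/user/add_time, differing only in the last field.
records = sorted((Counted("song"), Counted("user"), Counted("time"), i)
                 for i in range(100000))

Counted.calls = 0
bisect.bisect_left(records, (Counted("song"), Counted("user"), Counted("time"), 99999))
print(Counted.calls)   # roughly three field comparisons at each of ~17 levels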

ChrisA
 

Tim Chase

Chris Angelico said:
Imagine, worst case, all one million records have the same
song/user/add_time and you need to do twenty comparisons involving
four fields. That's gotta be worse than one hashing of five fields.

And if you have one million songs that are indistinguishable, you
have Pop Top-40 playlists. And that is *definitely* worse than
hashes OR trees. :)

-tkc
 

Chris Angelico

Tim Chase said:
And if you have one million songs that are indistinguishable, you
have Pop Top-40 playlists. And that is *definitely* worse than
hashes OR trees. :)

LOL!

That's not a Comp Sci problem, I'm afraid. But there is a solution:


A different sort of music.

ChrisA
 

Marko Rauhamaa

Chris Angelico said:
The only difference between a tree and a hash here is that the tree
might be able to short-cut the comparisons. But if there are a whole
bunch of songs with the same "song" and "user", then the tree has to
compare (song->song? same; user->user? same; add_time->add_time?
left/right) multiple times. So what you're saying is that the hash
table has consistency but the tree could do better or could do worse.
(Imagine, worst case, all one million records have the same
song/user/add_time and you need to do twenty comparisons involving
four fields. That's gotta be worse than one hashing of five fields.)

You are right that in the end there probably is an equality test
involving all fields of the key. However, the intermediate comparisons
are often dealt with much more immediately. For example, to compare the
words "hello" and "world", you only need to look at the first letter.
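
That short-circuiting is easy to check informally: comparing two strings
that differ in the first character takes about the same time no matter how
long they are. A quick, unscientific timing:

import timeit

short_a, short_b = "hello", "world"
long_a = "h" + "x" * 1000000
long_b = "w" + "x" * 1000000

# Both comparisons should take roughly the same time, because the
# comparison stops at the first differing character.
print(timeit.timeit(lambda: short_a < short_b, number=1000000))
print(timeit.timeit(lambda: long_a < long_b, number=1000000))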

Anyway, this whole debate is rather unnecessary since every developer is
supposed to have both weapons in their arsenal.


Marko
 

Chris Angelico

Marko Rauhamaa said:
You are right that in the end there probably is an equality test
involving all fields of the key. However, the intermediate comparisons
are often dealt with much more immediately. For example, to compare the
words "hello" and "world", you only need to look at the first letter.

Right, which is what I meant about the consistency of a hash table.
You know, with a hash table, that you're going to need to calculate
the hash across everything. The tree lets you cheaply add
disambiguation fields, knowing they'll be used only if they're needed.

Oddly enough, which structure counts as the consistent one inverts when
you consider behaviour under attack. If an attacker might be able to
generate hash collisions, the tree (assuming it's self-balancing) will
have its worst case much closer to its average/typical case, which means
the tree is the more consistent of the two. Strange how that goes,
sometimes.

Marko Rauhamaa said:
Anyway, this whole debate is rather unnecessary since every developer is
supposed to have both weapons in their arsenal.

Supposed to have? What does that mean, a language isn't ISO-compliant
unless it provides both? With a high level language like Python, using
the provided hash table will almost always cream any hand-built tree,
no matter how advantageous the data is to the tree.

ChrisA
 

Marko Rauhamaa

Chris Angelico said:
Supposed to have? What does that mean, a language isn't ISO-compliant
unless it provides both?

It's an ancient, fundamental data structure, right up there with dynamic
lists. There's no reason it shouldn't be available in every programming
environment.

Chris Angelico said:
With a high level language like Python, using the provided hash table
will almost always cream any hand-built tree, no matter how
advantageous the data is to the tree.

The main thing is that there are use cases where order is essential. That's
why I have had to implement an AVL tree in Python myself. No biggie,
but a C implementation would probably be much faster. Also, a standard
version would likely be better reviewed and tested, and have all the
Pythonic accessors in place.
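
For anyone without a tree handy, a very rough stand-in (emphatically not
Marko's AVL code) is a sorted list driven by the stdlib bisect module:
insertion degrades to O(n) because of the list shift, but lookups stay
O(log n), and the ordered operations a dict can't offer, such as range
scans, still work:

import bisect

class SortedDictSketch:
    """Toy ordered mapping over parallel sorted lists; not a balanced tree."""

    def __init__(self):
        self._keys, self._values = [], []

    def __setitem__(self, key, value):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._values[i] = value
        else:
            self._keys.insert(i, key)      # O(n) shift; an AVL tree does this in O(log n)
            self._values.insert(i, value)

    def __getitem__(self, key):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        raise KeyError(key)

    def range(self, lo, hi):
        """All (key, value) pairs with lo <= key < hi, in key order."""
        start = bisect.bisect_left(self._keys, lo)
        stop = bisect.bisect_left(self._keys, hi)
        return list(zip(self._keys[start:stop], self._values[start:stop]))

d = SortedDictSketch()
d["b"], d["a"], d["c"] = 2, 1, 3
print(d.range("a", "c"))    # [('a', 1), ('b', 2)]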


Marko
 

Roy Smith

Marko Rauhamaa said:
Anyway, this whole debate is rather unnecessary since every developer is
supposed to have both weapons in their arsenal.

The problem with having a choice is that it opens up the possibility of
making the wrong one :)

As this discussion has shown, figuring out whether a hash table or a
tree is better for a given problem is non-trivial. My guess is that if
you gave 1000 typical developers both data structures and let them pick
freely, the number of cases where it really mattered and the developer
picked the right one would be approximately equal to the number of cases
where they picked the wrong one.
 

Chris Angelico

Roy Smith said:
As this discussion has shown, figuring out whether a hash table or a
tree is better for a given problem is non-trivial. My guess is that if
you gave 1000 typical developers both data structures and let them pick
freely, the number of cases where it really mattered and the developer
picked the right one would be approximately equal to the number of cases
where they picked the wrong one.

And both would be utterly dwarfed by the cases where it doesn't
matter, and everyone's paying a cost (having to choose) for no
benefit.

ChrisA
 

Dennis Lee Bieber

Steven D'Aprano said:
Proof: I create a hash table that accepts unsigned bytes as keys, where
the hash is the value of the byte. So byte 0x42 hashes to 0x42, or
decimal 66. I give the hash table 256 slots, numbered from 0 to 255.
Every hash takes constant time, there are no collisions at all, lookups,
insertions and deletions all take constant time regardless of how full
the table is. The best, worst and average cases are not just O(1) but
also Ω(1).

What you've defined, however, can be replaced with straight list
indexing <G>; no hash function needed.

If there is a finite (and small) number of possible keys, the list will
likely beat the hash.
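
A sketch of that degenerate case: with only 256 possible byte keys, the
"hash table" collapses into plain list indexing, with no hash function and
no collision handling at all:

table = [None] * 256            # one slot per possible unsigned-byte key

def insert(key, value):
    table[key] = value          # "hash(key)" is simply key

def lookup(key):
    return table[key]

insert(0x42, "some value")
print(lookup(0x42))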

OTOH, when the keys exceed reasonable list storage, the hash becomes
useful -- but one now has to be concerned with collisions.

Back in the day, I was somewhat pleased to discover that the Amiga
directory structure was a real world example of something I'd only seen as
a class assignment in my data structures class: a hashed-head
multiple-linked list.
 

Dan Stromberg

Roy Smith said:
On the other hand, log n, for n = 1 million, is just 20. It's not hard
to imagine a hash function which costs 20x what a node traversal does,
in which case, the log n lookup is ahead for all n < 1 million.

FWIW, both the hash table and the tree will have constants. So a tree
would be c*20 in its most significant term, and the hash table would
be d*1 in its. The real-world performance depends quite a bit on
those constants at small values of n. I don't really consider a
million all that big, but the meaning of "big" of course depends.
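
One way to picture those constants, with made-up numbers in the spirit of
Roy's 20x figure: if one tree comparison costs c and a hash lookup costs a
flat d, the tree stays ahead while c*log2(n) < d, i.e. for n below
2**(d/c):

c = 1.0    # hypothetical cost of one tree node comparison
d = 20.0   # hypothetical flat cost of hashing the whole key

crossover = 2 ** (d / c)
print(crossover)    # 1048576.0 -- the tree wins below roughly a million keys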
 

Steven D'Aprano

Roy Smith said:
The problem with having a choice is that it opens up the possibility of
making the wrong one :)

As this discussion has shown, figuring out whether a hash table or a
tree is better for a given problem is non-trivial. My guess is that if
you gave 1000 typical developers both data structures and let them pick
freely, the number of cases where it really mattered and the developer
picked the right one would be approximately equal to the number of cases
where they picked the wrong one.

You're very optimistic.

In my experience, the average developer has an amazing talent for
pessimising code when they think they are optimising it.
 

Chris Angelico

Steven D'Aprano said:
In my experience, the average developer has an amazing talent for
pessimising code when they think they are optimising it.

I remember a number of incidents from personal experience when I was a
*very* average developer. One time, writing C++ code, I looked at the
disassembly and decided the compiler was doing a terrible job. No no,
I could make this so much better by using the 80x86 "REP MOVSW"
command (or commands, depending on your point of view). That would be
so much better than all those separate operations the silly compiler
was doing! Roughly an hour of fiddling later, making sure it all still
worked correctly, I discover that... hmm, it's not actually any
faster. Turns out the 80x86 string opcodes are really inefficient;
they're short (a one-byte command that says "read a
byte/word/doubleword from DS:SI, write it to ES:DI, and increment or
decrement SI and DI"), but not fast. In my defense, I at least did
measure before-and-after, and learned that I should back out that
change :)

ChrisA
 

Steven D'Aprano

Rustom Mody said:
2. Being pointed out that a finite-input table-lookup being called a
hash-function is a rather nonsensical claim and goes counter to the
basic tenets of asymptotic notation. (In CS unlike in math 'asymptote'
is always infinity) IOW

That's backwards. In maths "asymptote" always implies infinity. Within
any finite range, there must be a minimum distance that a line approaches
a curve, beyond which it gets no closer. But that's not an asymptote: an
asymptote requires that the line approaches the curve arbitrarily
closely, which requires taking the limit approaching infinity.

But in computer science, while it may be possible to ignore all real-
world considerations and imagine what happens when the size of your list
approaches infinity, that's not terribly common or useful. In reality,
all lists and hash tables contain only a finite number of slots, many
data types only have a finite number of possible keys, and beyond a
certain point the Big Oh analysis breaks down. Computer scientists are
not so foolish as to not understand this.

Big Oh analysis doesn't require behaviour as the input approaches
infinity, but rather as the input exceeds a certain (usually unstated)
size. "For large enough N, this function scales as O(N**2)" is a typical
conclusion. Not "As N approaches infinity, this function scales as
O(N**2)".

Many data types used as keys only have a fixed number of possible values
(256 bytes, 65536 short ints, etc.) and even those with no fixed limit
still have a practical finite limit. There's only so much matter in the
universe, so talking about limits as the amount of data approaches
infinity is nonsense. Where would you store it?

My example may have been extreme in its simplicity, but if you find
yourself in the lucky circumstance that you can afford a slot for every
possible key, and have a perfect hash function that guarantees no
collisions, then you will have the same conclusion: guaranteed O(1)
lookups, insertions and deletions.

http://en.wikipedia.org/wiki/Perfect_hash_function
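
A toy way to play with the idea behind that link: for a key set that is
fixed up front, hunt for a table size at which Python's own hash happens to
map every key to a distinct slot, after which every lookup is a single
probe. (Real perfect-hash construction is far cleverer, and since string
hashing is salted per process, the size found will vary between runs.)

def collision_free_size(keys):
    """Smallest table size at which hash(key) % size is distinct for every key."""
    size = len(keys)
    while True:
        if len({hash(k) % size for k in keys}) == len(keys):
            return size
        size += 1

months = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]
size = collision_free_size(months)
table = [None] * size
for m in months:
    table[hash(m) % size] = m   # collision-free for this fixed key set
print(size, table)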
 

Roy Smith

Dan Stromberg said:
FWIW, both the hash table and the tree will have constants. So a tree
would be c*20 in its most significant term, and the hash table would
be d*1 in its. The real-world performance depends quite a bit on
those constants at small values of n. I don't really consider a
million all that big, but the meaning of "big" of course depends.

Well, the largest disk volume I can configure in AWS is 1 TB. So, I
guess we can take that to be "big".

Assuming 1-character strings, and no overhead (both strange assumptions,
but it makes the math easier), that's 10^12 nodes in our binary tree.
That's still only 40 layers deep. Log n is your friend.
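
The arithmetic checks out for a reasonably balanced tree:

import math
print(math.log2(10 ** 12))   # ~39.86, so about 40 levels deep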
 

Ian Kelly

Chris Angelico said:
I remember a number of incidents from personal experience when I was a
*very* average developer. One time, writing C++ code, I looked at the
disassembly and decided the compiler was doing a terrible job. No no,
I could make this so much better by using the 80x86 "REP MOVSW"
command (or commands, depending on your point of view). That would be
so much better than all those separate operations the silly compiler
was doing! Roughly an hour of fiddling later, making sure it all still
worked correctly, I discover that... hmm, it's not actually any
faster. Turns out the 80x86 string opcodes are really inefficient;
they're short (a one-byte command that says "read a
byte/word/doubleword from DS:SI, write it to ES:DI, and increment or
decrement SI and DI"), but not fast. In my defense, I at least did
measure before-and-after, and learned that I should back out that
change :)

Better to have tried and failed, though, than to have simply accepted
what the compiler was doing with no verification at all.
 
