Steven D'Aprano said:
If you have a million items, then the odds that those
million items happen to be sorted (the worst-case order) are 1 in a
million factorial. That's a rather small number, small enough that we can
discount it from ever happening, not in a million lifetimes of the
Universe.
If the items are coming from some inherently random process, then I
agree. But, not all data sources are random.
Imagine you had a web site, and wanted to store various bits of data
from the stream of input requests. You decided that the data structure
you were going to use was a balanced tree. If your keys were, say, a
hash of the client's IP address, then it's a pretty good guess they're
random. But, what if the keys were the time the request was received?
Those are obviously not random, and using them as keys in a balanced
tree would give you sub-optimal performance.
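To see why in-order keys are hostile to tree indexes, here's a minimal
sketch (plain Python, nothing to do with any real database's index code):
in a naive, unbalanced BST, monotonically increasing keys degenerate into
a linked list, and a self-balancing tree only avoids that by spending
rebalancing work concentrated on one edge of the tree.

    import random

    class Node:
        def __init__(self, key):
            self.key = key
            self.left = None
            self.right = None

    def insert(root, key):
        # Naive, unbalanced BST insert -- no rebalancing at all.
        if root is None:
            return Node(key)
        if key < root.key:
            root.left = insert(root.left, key)
        else:
            root.right = insert(root.right, key)
        return root

    def depth(root):
        if root is None:
            return 0
        return 1 + max(depth(root.left), depth(root.right))

    n = 500
    sequential = list(range(n))            # e.g. request timestamps
    shuffled = random.sample(range(n), n)  # e.g. hashed IP addresses

    tree = None
    for k in sequential:
        tree = insert(tree, k)
    print("depth with sequential keys:", depth(tree))  # n -- a linked list

    tree = None
    for k in shuffled:
        tree = insert(tree, k)
    print("depth with random keys:", depth(tree))      # roughly 2*log2(n)

A balanced tree keeps the depth logarithmic either way, but with the
sequential keys it pays for that with extra rebalancing work, which is
the kind of overhead described next.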
This is not a hypothetical scenario. Songza uses MongoDB as our
database. The indexes are balanced trees. One of our biggest
collections has a multi-component key, one of the components being the
request time truncated to the hour. Insertion time into that collection
has an obvious sawtooth shape, with performance degrading as each hour
progresses and jumping back up as the time rolls over to the next hour.
This is due (we believe) to the constant rebalancing of the index trees.
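For illustration only (the field names and values here are made up, not
our actual schema), the time component of that key looks roughly like
this:

    from datetime import datetime

    def index_key(request_time, other_component):
        # Truncate the request time to the hour; every request within the
        # same hour shares this prefix, so inserts cluster in one region
        # of the index until the hour rolls over.
        hour = request_time.replace(minute=0, second=0, microsecond=0)
        return (hour, other_component)

    print(index_key(datetime(2014, 3, 9, 14, 37, 22), "example"))
    # (datetime.datetime(2014, 3, 9, 14, 0), 'example')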
Almost-sorted data sets are very common. For example, here's a pipeline
I run a lot (from memory, could have gotten some detail wrong):
grep pattern logfile | awk '{print $7}' | sed 's/:[0-9][0-9]$//' | sort | uniq -c
Field 7 is the timestamp for when a request came in. What I'm doing
here is counting how many requests of a certain type came in during each
minute of the day. Logging isn't exactly in chronological order, so I
need to sort the times before I count them. But, it's in *almost*
chronological order; a sort that had pathologically bad behavior on
almost-sorted data would be unusable here.
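For what it's worth, Python's own sort handles this case about as well as
you could hope. Here's a rough sketch that counts comparisons (the exact
numbers are illustrative and will vary from run to run): on almost-sorted
input it does far less work than on shuffled input, whereas something like
a naive first-element-pivot quicksort would go quadratic on exactly this
kind of data.

    import random

    class CountingInt:
        """Wrap an int so we can count how many comparisons a sort makes."""
        comparisons = 0
        def __init__(self, value):
            self.value = value
        def __lt__(self, other):
            CountingInt.comparisons += 1
            return self.value < other.value

    def comparisons_to_sort(values):
        CountingInt.comparisons = 0
        sorted(CountingInt(v) for v in values)
        return CountingInt.comparisons

    n = 100_000

    # Almost chronological: sorted, with a handful of entries displaced,
    # roughly like log lines that arrive slightly out of order.
    almost_sorted = list(range(n))
    for _ in range(100):
        i, j = random.randrange(n), random.randrange(n)
        almost_sorted[i], almost_sorted[j] = almost_sorted[j], almost_sorted[i]

    shuffled = random.sample(range(n), n)

    print("almost sorted:", comparisons_to_sort(almost_sorted))
    # far fewer -- the sort exploits the existing runs
    print("shuffled:     ", comparisons_to_sort(shuffled))
    # roughly n * log2(n)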