Why is it that even computer science undergrads
are required to learn the basics of Big-Oh and
all that?
So that they know how to correctly interpret what Big O notation means,
instead of misinterpreting it. Big O notation doesn't tell you everything
you need to know to predict the behaviour of an algorithm. It doesn't even
tell you most of what you need to know about its behaviour. Only actual
*measurement* will tell you what you need to know.
Perhaps you should actually sit down and look at the assumptions,
simplifications, short-cuts and trade-offs that computer scientists make
when they estimate an algorithm's Big O behaviour. It might shock you out
of your faith that Big O is the be-all and end-all of algorithm planning.
For all but the simplest algorithm, it is impractical to actually count
all the operations -- and even if you did, the knowledge wouldn't help
you, because you don't know how long each operation takes to get executed.
That is platform specific.
So you simplify. You pretend that paging never happens. That memory
allocations take zero time. That setting up and tearing down data
structures takes insignificant time. That if there is an N**2 term, it
always swamps an N term. You assume that code performance is independent
of the CPU.
You assume that some operations (e.g. comparisons) take no time, and
others (e.g. moving data) are expensive.
Those assumptions are sometimes wildly wrong. I've been seriously bitten
following textbook algorithms written for C and Pascal: they assume that
comparisons are cheap and that swapping elements is expensive. But in
Python, swapping elements is cheap and comparisons are expensive, because
of all the heavy object-oriented machinery involved. Your classic textbook
algorithm is not guaranteed to survive contact with the real world: you
have to try it and see.
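If you want to see that for yourself, here is a rough sketch (mine, not
anything from a textbook) that times a comparison of two objects with a
user-defined __lt__ against a plain element swap:

import timeit

setup = """
class Heavy:
    def __init__(self, x):
        self.x = x
    def __lt__(self, other):
        return self.x < other.x

data = [Heavy(i) for i in range(1000)]
"""

# Comparing two "heavy" objects goes through Python-level method dispatch;
# swapping just rebinds two references.
compares = timeit.timeit("data[10] < data[20]", setup=setup, number=100000)
swaps = timeit.timeit("data[10], data[20] = data[20], data[10]",
                      setup=setup, number=100000)
print("compare:", compares, "swap:", swaps)

I'd expect the comparison to come out noticeably more expensive than the
swap -- the reverse of the C/Pascal assumption -- but the whole point is
that you measure rather than assume.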
Given all the assumptions, it is a wonder that Big O estimates are ever
useful at all, not that they are sometimes misleading.
[snip]
I might as well interpret "blindingly fast" as meaning O(1). - Why not?
Surely marketing might also have reasoned like
this: "It's O(1), so its blindingly fast". But I *want*, nay, I *must*
know whether it is O(N) or O(1).
You might _want_, but you don't _need_ to know which it is, not in every
case. In general, there are many circumstances where it makes no
sense to worry about Big O behaviour. What does your expected data look like?
If your data never gets above N=2, then who cares whether it is O(1)=1,
O(N)=2, O(N**2)=4 or O(2**N)=4? They are all about as fast.
Even bubble sort will sort a three-element list fast enough -- and
probably faster than more complicated sorts. Why spend all the time
setting up the machinery for a merge sort for three elements?
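If you doubt it, a throwaway experiment along these lines settles it
quickly (my own sketch; tiny_sort is just a name I made up for a naive
insertion sort):

import timeit

def tiny_sort(lst):
    # plain insertion sort -- O(N**2) in general, but N is 3 here
    for i in range(1, len(lst)):
        item = lst[i]
        j = i
        while j > 0 and lst[j - 1] > item:
            lst[j] = lst[j - 1]
            j -= 1
        lst[j] = item
    return lst

print(timeit.timeit("tiny_sort([3, 1, 2])",
                    setup="from __main__ import tiny_sort", number=100000))
print(timeit.timeit("sorted([3, 1, 2])", number=100000))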
[snip]
That's why Knuth liked it so much?
That's why Aho, Hopcroft and Ullman liked it so much? That's why Gonnet
and Baeza-Yates liked it so much?
Two reasons: it is useful for telling you how a single algorithm will
scale as the input increases, just as I said.
And, unlike more accurate ways of calculating the speed of an algorithm
from first principles, it is actually possible to do Big O calculations.
No doubt the state of the art of algorithm measurements has advanced since
I was an undergraduate, but certain fundamental facts remain: in order to
calculate that Big O, you have to close your eyes to all the realities
of practical code execution, and only consider an idealised calculation.
Even when your calculation gives you constants of proportionality and
other coefficients, Big O notation demands you throw that information away.
But by doing so, you lose valuable information. An O(N**2) algorithm that
scales like 1e-6 * N**2 will be faster than an O(N) algorithm that scales
as 1e6 * N, until N reaches one million million. By tossing away those
coefficients, you wrongly expect the first algorithm to be slower than the
second, and choose the wrong algorithm.
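The arithmetic is easy to check: the break-even point is where
1e-6 * N**2 equals 1e6 * N, i.e. N = 1e12.

# Quadratic with a tiny constant versus linear with a huge constant:
# they cross at N = 1e12 (one million million).
for N in (10**6, 10**12, 10**13):
    print(N, 1e-6 * N**2, 1e6 * N)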
And that's really very useful information indeed.
Yes it is. Did I ever say it wasn't?
Since, given such
information for the basic data types and operations, as implemented by
the language and its standard libraries, I stand a real chance of being
able to determine the computational complexity of the
*particular* *combination* of data types and algorithms of my own small
utility or of a critical piece of my wonderful and large application, on
which the future of my company depends, with some confidence and
accuracy.
Yes, zero is a real chance.
[snip]
Depending on the "load factor" of the hash tables. So we would want to
ask, if we have very large lists indeed, how much space needs to be
invested to keep the load factor so low that we can say that the
membership test is O(1).
And knowing that hash tables are O(1) will not tell you that, will it?
There is only one practical way of telling: do the experiment. Keep
loading up that hash table until you start getting lots of collisions.
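Something along these lines would do (a sketch of mine, not the code I
used further down). CPython doesn't report collision counts directly, so
you watch the membership-test time instead as the set grows:

import random
import timeit

for n in (10**3, 10**4, 10**5, 10**6):
    s = set(range(n))
    probe = random.randrange(n)
    # Time a single membership test, repeated many times.
    t = timeit.timeit(lambda: probe in s, number=100000)
    print(n, t)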
Do A-B and A&B have to walk the entire hash
table (which must be larger than the sets, because of a load factor <
1)? Also: the conversion of lists to sets needs the insertion of N
elements into those hash tables. That alone already makes the overall
algorithm *at*least* O(N). So forget about O(log N).
Yes, inserting N items into a hash table takes at least N inserts. But if
those inserts are fast enough, you won't even notice the time it takes to
do it, compared to the rest of your algorithm. In many algorithms, you
don't even care about the time it takes to put items in your hash table,
because that isn't part of the problem you are trying to solve.
So in a real, practical sense, it may be that your algorithm gets dominated
by the O(log N) term even though there is technically an O(N) term in
there. Are Python dicts like that? I have no idea. But I know how to find
out: choose a problem domain I care about ("dicts with less than one
million items") and do the experiment.
So here, indeed, we have the kind of reasoning that one ought to be able
to deliver, based on what's in the Python documentation. Luckily, you
have that kind of knowledge of both how sets are implemented and what
Big-Oh attaches to the hash table operation of "look up".
In order to *enable* SUCH reasoning for *everyone*,
starting from the module interface documentation only, one clearly needs
something along the lines of what I was suggesting...
I don't object to having that Big O information available, except
insofar as it can be misleading, but I take issue with your position that
such information is necessary.
[snip]
Curious: N=10 takes less time than N=1?
Yes, funny how real-world results aren't as clean and neat as they are in
theory. There are those awkward assumptions coming to bite you again. I've
done two additional tests, and get:
N = 1 0.000085043907166
N = 10 0.000106656551361
N = 1 0.000497949123383
N = 10 0.000124049186707
Remember, these results are averaged over twenty trials. So why is it
quicker to work with sets of size 10 than sets of size 1? Big O
notation will never tell you, because it ignores the implementation
details that really make a difference.
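For the record, the harness is nothing exotic. This is not the exact code
I ran, but something in the same spirit would do: twenty trials per size,
averaged.

import time

def trial(n):
    list_a = list(range(n))
    list_b = list(range(n // 2, n + n // 2))
    start = time.time()
    A, B = set(list_a), set(list_b)
    A - B
    A & B
    return time.time() - start

for n in (1, 10, 100, 1000, 10000, 100000, 1000000):
    times = [trial(n) for _ in range(20)]
    print("N =", n, sum(times) / len(times))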
Why do we have such a comparatively large jump here, from N=100 to
N=1000? Did hash tables overflow during conversion or something like
that?
Who knows? Maybe Python was doing some garbage collection the first time I
ran it. I've modified my code to print a scale factor, and here is another
run:
N = 1 0.00113509893417
N = 10 0.000106143951416 (x 0.093511)
N = 100 0.00265134572983 (x 24.978774)
N = 1000 0.0057701587677 (x 2.176313)
N = 10000 0.0551437973976 (x 9.556721)
N = 100000 0.668345856667 (x 12.120055)
N = 1000000 8.6285964489 (x 12.910376)
An increase from N=1 to 1000000 (that's a factor of one million) leads to
an increase in execution time by a factor of about 7600.
You will notice that the individual numbers vary significantly from trial
to trial, but the overall pattern is surprisingly consistent.
Doesn't look quite O(n). Not yet...
No it doesn't.
That's a mistake -- it is nowhere near O(log N). My bad. Closer to
O(sqrt N), if anything.
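If you want to put a number on it: assuming time grows roughly like N**k,
then k = log(t2/t1) / log(N2/N1). For the two end points above (and taking
the noisy N=1 figure at face value) that gives k of about 0.65 over this
range -- nowhere near logarithmic, and nearer a square root than linear.

import math

# Fit an exponent to the end points of the measured run above.
k = math.log(8.6285964489 / 0.00113509893417) / math.log(10**6)
print(k)   # about 0.65 over this range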
How could that be? *Every* element of A and B must be touched, if only to
be copied: that can't make it O(log N).
And, no doubt, if you had *really enormous* lists, oh, I don't know, maybe
a trillion items, you would see that O(N) behaviour. But until then, the
overall performance is dominated by the lower-order terms with larger
coefficients.
Also, the conversion of lists
to sets must be at least O(N). And N isn't the right measure anyway. It
would probably have to be in terms of |A| and |B|. For example, if |A|
is very small, as compared to |B|, then A-B and A & B can be determined
rather quickly by only considering elements of A.
Both lists have the same number of elements, so double N.
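Fair point about |A| and |B|, though: written the obvious way, the work is
driven by the size of the set you iterate over. A sketch of the idea (mine,
not CPython's implementation):

def diff_and_intersect(A, B):
    # Build A-B and A&B by walking only A and doing membership
    # tests against B.
    only_a, both = set(), set()
    for item in A:
        if item in B:
            both.add(item)
        else:
            only_a.add(item)
    return only_a, both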
[snip]
You must distinguish questions of principle from questions of muddling
through, like this testing bit you've done.
Your "question of principle" gives you completely misleading answers.
Remember, you were the one who predicted that lists would have to be
faster than sets. Your prediction failed miserably.
It would take me some time to
even be *sure* how to interpret the result.
What's to interpret? I know exactly how fast the function will run, on
average, on my hardware. I can even extrapolate to larger sizes of N,
although I would be very careful not to extrapolate too far. (I predict
less than 10 minutes to work on a pair of 10,000,000 element lists, and
less than two hours to work on 100,000,000 element lists.)
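One way to sanity-check those predictions: take the last observed
per-decade factor from the table above (about 12.9x for each tenfold
increase in N) and apply it once, then twice.

t_1e6 = 8.6285964489           # measured time at N = 10**6
print(t_1e6 * 12.9)            # about 111 seconds for N = 10**7
print(t_1e6 * 12.9 ** 2)       # about 1436 seconds for N = 10**8

Both figures sit comfortably inside the predictions above.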
I would never want to say
"it looks O(log N) to me", as you do, and leave it at that. Rather, I
might say, as you do, "it looks O(log N) to me", *but* then try to
figure out, given my knowledge of the implementation (performance wise,
based on information that is sadly missing in the Python documentation),
*why* that might be.
Fine. You have the source code, knock yourself out.
Then, if my experiments says "it looks like O(log
N)" AND if my basic knowledge of the implementation of set and list
primitives says "it should be O(log N)" as well, I would venture, with
some *confidence*, to claim: "it actually IS O(log N)"....
You do not compare the convert-it-to-sets approach
to the single list-walk either.
No I did not, because I didn't have a function to do it. You've got my
source code. Knock yourself out to use it to test any function you like.
Supposing the OP had actually sorted
lists to begin with, then a single, simultaneous walk of the lists would
be about as fast as it can get. Very, very likely *faster* than
conversion to sets would be...
Please let us know how you go with that. It should be really interesting
to see how well your prediction copes with the real world.
(Hint: another of those awkward little implementation details... how much
work is being done in C code, and how much in pure Python? Just something
for you to think about. And remember, an O(N) algorithm in Python will be
faster than an O(N**2) algorithm in C... or is that slower?)
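For anyone who wants to actually run that comparison, here is a sketch of
the single simultaneous walk over two already-sorted, duplicate-free
lists, producing A-B and A&B in one pass -- in pure Python, which is
exactly where that last hint starts to bite:

def sorted_walk(a, b):
    # Assumes a and b are sorted and contain no duplicates.
    only_a, both = [], []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            only_a.append(a[i])   # in a but not in b
            i += 1
        elif a[i] > b[j]:
            j += 1                # in b but not in a; not needed here
        else:
            both.append(a[i])     # in both
            i += 1
            j += 1
    only_a.extend(a[i:])          # leftovers of a can't be in b
    return only_a, both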