Before I begin my analysis of others' solutions, let me share two
things I learned, in the hope that they will help someone else:
1) I procrastinate. (I didn't learn this - I already knew it.) I
discovered this manifesting itself in my code as I repeatedly put off
the hard part of actually creating the playlist, and ended up
refactoring my classes about 6 times, and spending a full half hour
on just cleaning up the names of the songs. I learned the importance
of doing a 'red thread' spike - lay down the core functionality
first, get it working, and then make it pretty later.
If you run out of time on a project, it's arguably better to have
SOMETHING working than a really nice solution that's only half
implemented and doesn't do anything.
2) Premature Optimization is the Root of All Evil[1] - I had plans to
implement my solution in the most brain-dead simple way possible, and
only then (if necessary) optimize it. Somewhere along the way I
wandered off the path, and ended up using 'ugly' hashes of hashes for
speed's sake and a single recursive function. In the end, I did need to
do a bit of optimization to get it to run reasonably (see below). But
even before that, the damage to my code was severe enough that when I
wanted to try out a variation on the algorithm, my own code was ugly
enough to deter me from messing with it. Even after my optimization,
however, the algorithm was pretty brain-dead stupid, and STILL
finished in a tenth of a second.
Moral - don't kill yourself on optimization until you're sure it's
necessary.
So, on to the code analysis. Every solution needed to do four things
to succeed:
1) Decide the parameters for the playlist to create (user input).
2) Load in the information for a song library.
3) Find a valid "Barrel of Monkeys" playlist based on the library and
the playlist parameters.
4) Display the resulting solution.
Let's look at the solutions for each of these, spending the most time
on #3.
==========================
Decide the Playlist Parameters
==========================
Ilmari, Pedro and I all picked songs at random from the library to
make a path between. None of our solutions had any customization in
terms of types of playlist to create. So much for the value of the
Highline quiz ... it seems that UI is still an afterthought when it
comes to solving a problem.
Brian's solution does support different kinds of playlists (any path,
shortest path, and best time fill) but he didn't have time to create
a user interface, so his solution hard-codes a few start and end
points and then solves them.
James' solution assumed that you would supply two command-line
arguments, the names of the first and last songs. I hadn't known about
the Enumerable#find method, so it was nice to see it used here
(finding the first song which had the supplied substring somewhere in
the song title). Beyond that, his solution had no further parameters
needed for a playlist.
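For anyone else who hadn't met Enumerable#find, here is a minimal
sketch of that kind of lookup. The LookupSong struct, the SONG_LIST
data and the find_song helper are made up for illustration; this is
not James' actual code.

```ruby
# Hypothetical stand-in for a Song class; only a name is needed here.
LookupSong = Struct.new(:name)

SONG_LIST = [
  LookupSong.new("Que Sera"),
  LookupSong.new("Limbo Jazz"),
  LookupSong.new("Zaar")
]

# Return the first song whose title contains the fragment (ignoring
# case), or nil when nothing matches.
def find_song(songs, fragment)
  songs.find { |song| song.name.downcase.include?(fragment.downcase) }
end

start  = find_song(SONG_LIST, "sera")
finish = find_song(SONG_LIST, "zaa")
puts "#{start.name} -> #{finish.name}"
```

In a script, the two fragments would presumably come from ARGV[0] and
ARGV[1] rather than being hard-coded.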
Dave didn't let you pick the specific songs to start and end with
(you pick a letter for the first song to start with and a letter
for the last song to end with), but otherwise went all out on the
options - his solution supports finding playlists with minimum,
maximum, and target number of songs and/or playlist duration. The
console input provides defaults while entering, making things
particularly convenient. So, apparently the Highline quiz was
valuable after all.
==========================
Load in the Song Library
==========================
The provided song library was an XML file. There were three
approaches to handling this, with very different speed results:
Gavin, James, Ilmari and Brian all read the XML file using REXML, and
then used its access methods to create instances of a Song class.
(Ilmari actually used zlib to load the gzipped version directly, with
roughly no performance hit. Nice!)
Time to load the XML file into an REXML Document: ~4s
Time to scrape the information from that into Song instances:
  Gavin:  28s (I did a lot of name cleaning during song
          initialization)
  James:  12s
  Ilmari: 4.5s
  Brian:  2.6s
Dave converted the library into YAML (how, I'm not sure) and loaded
that directly.
Time to load the YAML file of Songs: 1.5s
Pedro went hardcore and simply used regular expressions on each line
of the XML file.
Time to read the file and use regexp to find information: 0.2s
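To give a flavour of the line-by-line regexp approach, here is a rough
sketch. The XML layout in SAMPLE_XML is an assumption made up for this
example (one element per line, with name and duration attributes), and
the Track struct and scan_library method are hypothetical; Pedro's
real expressions may look quite different.

```ruby
# ASSUMPTION: one element per line, with name/duration attributes.
# The real library file may be laid out differently.
SAMPLE_XML = <<~XML
  <Library>
    <Artist name="Peter Gabriel">
      <Song name="Zaar" id="1" duration="293355"/>
    </Artist>
    <Artist name="Ace of Base">
      <Song name="Que Sera" id="2" duration="227422"/>
    </Artist>
  </Library>
XML

Track = Struct.new(:artist, :name, :duration_ms)

# Walk the text one line at a time, remembering the most recent artist
# and emitting a Track for every song line. No XML parser is involved,
# so entities such as &amp; are left exactly as they appear in the file.
def scan_library(xml_text)
  artist = nil
  tracks = []
  xml_text.each_line do |line|
    if (match = line.match(/<Artist\s+name="([^"]*)"/))
      artist = match[1]
    elsif (match = line.match(/<Song\s+name="([^"]*)".*?duration="(\d+)"/))
      tracks << Track.new(artist, match[1], match[2].to_i)
    end
  end
  tracks
end

TRACKS = scan_library(SAMPLE_XML)
TRACKS.each { |t| puts "#{t.artist} :: #{t.name} :: #{t.duration_ms}" }
```

The speed comes at a price: a sketch like this breaks as soon as the
file's line layout changes, which a real XML parser would shrug off.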
Gavin and Ilmari both dumped their libraries to a binary file using
Marshal after parsing it the first time. I know that for me, this
turned out to be a great decision, shaving 30+ seconds off every test
iteration I had to do.
Time to load a Marshalled file of songs: 0.1s
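The caching trick can be sketched roughly as follows. CachedSong,
load_songs and the file name are hypothetical names for this example,
and the block stands in for whatever slow parsing a real solution
does; neither my code nor Ilmari's is reproduced here.

```ruby
require "tmpdir"

# Hypothetical stand-in for a parsed song.
CachedSong = Struct.new(:name, :artist, :duration_ms)

# Return the songs from the cache file when it exists; otherwise run
# the slow parse (the block) once and write the cache for next time.
def load_songs(cache_path)
  if File.exist?(cache_path)
    # Only load Marshal data from files you wrote yourself.
    File.open(cache_path, "rb") { |file| Marshal.load(file) }
  else
    songs = yield
    File.open(cache_path, "wb") { |file| Marshal.dump(songs, file) }
    songs
  end
end

PARSE_LOG = []   # one entry per real run of the "slow parse"
LOADED    = []   # what each call to load_songs returned

Dir.mktmpdir do |dir|
  cache = File.join(dir, "songs.marshal")
  2.times do
    songs = load_songs(cache) do
      PARSE_LOG << :parsed
      [CachedSong.new("Zaar", "Peter Gabriel", 293_355)]
    end
    LOADED << songs
  end
end

puts "parsed #{PARSE_LOG.size} time(s), loaded #{LOADED.size} time(s)"
```

The second call never touches the block, which is where the saved
seconds come from.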
It appears that I was the only one who tried to do something about
all the terribly-named songs in the song library I supplied. I used
one of the solutions from the English Numerals quiz to turn integers
into English words. I tried to intelligently remove various "remix"
and "live" and "edit" variations on the name. In the end, I wish I
hadn't - the English numerals were a nice touch (and let me use songs
with names like "18" or "'74-'75"), but the rest was an exercise in
futility. Any reasonable playlist system should assume that it's
going to have a really nice, cleaned playlist fed to it. Regexp
heuristics to clean up messy names just aren't a substitute for a few
hours spent fixing one's ID3 tags.
==========================
Create the Playlist
==========================
So, as the astute may have noticed, this quiz is really just a path-
finding problem in fancy clothes. Just as Mapquest creates directions from
point A to point B using roads connecting intersections, so this quiz
is about finding your way from song A to song B. There are just a
hell of a lot more roads connecting the intersections.
I'll cover my approach in depth, and then look at the code other
participants supplied.
As I mentioned above, I decided that I would first try a really brain-
dead algorithm and see how it performed. (I intended to try a
solution on my own and later do research into optimized algorithms in
this field, but never got around to the latter.)
After reading in all the songs, I partitioned them off into a Hash
that mapped each letter of the alphabet to the songs that started
with that letter. For each of those songs, I then figured out all the
unique end letters that you could get to. For simplicity's sake,
whenever there were multiple songs with the same start/end letters
(e.g. "All Out of Love" and "A Whiter Shade of Pale") I simply picked
one of them at random and threw out the rest.
I also threw out all the cases where songs started with and ended
with the same letter. These might be very useful in determining
specific-length playlists, but I didn't want to deal with them.
The end result looked something like this:
@song_links = {
  'a' => { 'c' => <Song 'Aerodynamic'>, 'd' => <Song 'All I Need'>, ... },
  'b' => { 'a' => <Song 'Bawitdaba'>, 'e' => <Song 'Brother of Mine'>, ... },
  ...
}
This hash of hashes was then my roadmap, telling me what
intersections I could get to for each letter, and what songs I
travelled along to take that path.
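Building a roadmap like that could be sketched as below. This is a
reconstruction for illustration, not my original code: LinkSong,
build_song_links and the sample titles are names chosen for this
example, and "first letter" here simply means the first a-z character
in the title.

```ruby
LinkSong = Struct.new(:name) do
  # Letters only, lowercased, so punctuation and digits are ignored.
  def letters
    name.downcase.scan(/[a-z]/)
  end

  def first_letter
    letters.first
  end

  def last_letter
    letters.last
  end
end

# Build { start_letter => { end_letter => song } }.
def build_song_links(songs)
  links = Hash.new { |hash, letter| hash[letter] = {} }
  # Shuffling first means the song kept for each start/end pair is an
  # arbitrary one of the candidates.
  songs.shuffle.each do |song|
    from = song.first_letter
    to   = song.last_letter
    next if from.nil? || from == to   # no letters, or same start and end
    links[from][to] ||= song          # keep one song per pair
  end
  links
end

DEMO_SONGS = [
  "All Out of Love", "A Whiter Shade of Pale", "Aerodynamic",
  "Bawitdaba", "Africa"
].map { |title| LinkSong.new(title) }

DEMO_LINKS = build_song_links(DEMO_SONGS)
DEMO_LINKS.each do |from, ends|
  ends.each { |to, song| puts "#{from} -> #{to}: #{song.name}" }
end
```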
To walk 50 steps, you first have to walk 49 and then take the last
step. Similarly, I decided that I would write a recursive function
that checked to see if two songs could be linked together. If they
could, that was the path. If not, take one 'step' forward - see if a
path to the last song existed from each of the songs which connected
to the first.
My first brain-dead approach just let me wander along each
possibility until I found each match, and stored all the successes.
(My idea was to find all the solutions and then select the shortest
one from the list.) I quickly ran into neverending loops as I visited
the same song again and again. So, I started passing along a string
of all the letters I had already tried. Before taking a step to a new
song, I checked to make sure it wasn't in this list.
NOW I WAS COOKING! Along with some debug output, I watched as my code
began to traverse the possibility path. And watched, and watched. I started
thinking about it.
With 5000 songs evenly distributed across all the letters, and paths
that can go 26 levels deep, my back-of-the-envelope calculations led
me to realize that there were (very roughly) something like
878406105516319579477535398778457892291056669879055897379772997878407474
708480000000000 possible paths. Ooops. That would take a while to
search the entire possibility tree. After waiting for 20 minutes, I
realized I might have to wait a really long time.
So, I made one more adjustment - whenever I found a valid path, I
stored the length of that path in an instance variable. Any time my
recursive function tried to go deeper than that, I bailed out. (If
you know one route to get to the mall that takes 3 miles, and you start
down a strange road looking for a shortcut, after you've gone more
than 3 miles you know it's time to head home. You might be able to
get to the mall by driving across the country and back, but it's not
the solution you're looking for.)
And WHAM...my brain-dead solution went from taking longer than the
universe has existed, to finding solutions in less than a second.
Usually less than 1/10th of a second.
Take THAT, optimized algorithms!
[As a disclaimer, in the cases where no such path exists the code has
to end up searching the entire possibility space, so...it essentially
hangs when it can't find the right path.]
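The recursive walk described above, with the visited-letter check and
the best-length cutoff, might look something like the sketch below. It
is a simplified reconstruction, not my actual code: TOY_LINKS maps
letters straight to song titles, shortest_chain is a name invented for
this example, and the best result travels in a small hash instead of
an instance variable.

```ruby
# Toy roadmap: { start_letter => { end_letter => title } }.
TOY_LINKS = {
  "a" => { "s" => "Angeleyes", "l" => "A Bomb In The Hotel" },
  "s" => { "e" => "Second Chance", "z" => "Santa Cruz" },
  "e" => { "n" => "Eden" },
  "n" => { "h" => "North" },
  "h" => { "y" => "Hold On Loosely" },
  "l" => { "z" => "Limbo Jazz" }
}

# Depth-first search for the shortest chain of titles that leads from
# the letter `from` to the letter `to`; returns nil when none exists.
def shortest_chain(links, from, to, visited = [from], path = [], best = { path: nil })
  # Bail out once this branch can no longer beat the best chain so far.
  return best[:path] if best[:path] && path.size + 1 >= best[:path].size

  (links[from] || {}).each do |next_letter, title|
    if next_letter == to
      best[:path] = path + [title]
      break   # every other step from here gives a longer chain
    elsif !visited.include?(next_letter)
      shortest_chain(links, next_letter, to,
                     visited + [next_letter], path + [title], best)
    end
  end
  best[:path]
end

p shortest_chain(TOY_LINKS, "a", "z")   # => ["Angeleyes", "Santa Cruz"]
```

On this toy map the search first wanders down the long a-s-e-n-h-y
dead end, then finds a two-song chain, and the cutoff stops it from
exploring the "l" branch to any depth that could not win.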
I had visions of using the same sort of solution to find a specific-
length playlist: I would accumulate time as I recursed, and bail out
once I had passed the limit. And then I went to bed instead.
So, on to looking at others' solutions, as best I can.
Pedro's solution appears to use recursion, like mine. I can't quite
tell how it's preventing infinite circular lists, or how it goes so
fast. It stores a list of possible solutions, and when finished it
yields the list of all found solutions to a passed-in proc to
determine which solution is correct.
So, for example, the line:
result=search(songs, first, last, &min_dur)
picks the playlist with the shortest duration from those it found,
while:
result=search(songs, first, last, &min_len)
picks the playlist with the fewest songs.
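The idea of handing the "which one is best?" decision to a proc can be
sketched like this. Tune, MIN_DUR, MIN_LEN and choose are hypothetical
names for this example; choose stands in for the real search and
simply receives ready-made candidate playlists.

```ruby
Tune = Struct.new(:name, :duration)

# Each chooser receives every candidate playlist and returns one.
MIN_DUR = proc { |playlists| playlists.min_by { |pl| pl.sum(&:duration) } }
MIN_LEN = proc { |playlists| playlists.min_by(&:size) }

# Stand-in for a search method: the candidates are given up front and
# the passed-in block decides which of them wins.
def choose(candidates, &chooser)
  chooser.call(candidates)
end

CANDIDATES = [
  [Tune.new("A", 100), Tune.new("B", 100), Tune.new("C", 100)],  # 3 songs, 300 total
  [Tune.new("D", 250), Tune.new("E", 200)]                       # 2 songs, 450 total
]

puts choose(CANDIDATES, &MIN_DUR).map(&:name).join(", ")
puts choose(CANDIDATES, &MIN_LEN).map(&:name).join(", ")
```

The search code never needs to know what "best" means, which is what
makes the design pleasant to extend.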
Because Pedro's solution didn't do any song name cleaning, it can
produce some interesting results:
Proudest Monkey - Dave Matthews Band - 551 ms ===> 'Round The World
With The Rubber Duck - C.W. McCall - 247 ms
Proudest Monkey - Dave Matthews Band - 551 ms
You May Be Right - Billy Joel - 255 ms
Teacher, Teacher - .38 Special - 193 ms
Rock Me - ABBA - 185 ms
Eden - 10,000 Maniacs - 250 ms
North - Afro Celt Sound System - 409 ms
Here Without You - 3 Doors Down - 238 ms
Under Attack - ABBA - 227 ms
Kelly Watch The Stars - Air [French Band] - 226 ms
Saguaro - A Small, Good Thing - 353 ms
Ostrichism - A Small, Good Thing - 152 ms
Mamma Mia - ABBA - 212 ms
All I Need - Air [French Band] - 268 ms
Does Your Mother Know - ABBA - 195 ms
When You're Falling - Afro Celt Sound System - 314 ms
God Must Have Spent A Little.. - Alabama - 203 ms
... to calm a turbulent soul - Falling You - 469 ms
Losing Grip - Avril Lavigne - 233 ms
Preacher in this Ring, Part I - Bruce Hornsby - 302 ms
Intergalactic - Beastie Boys - 231 ms
CDJ - Pizzicato Five - 344 ms
Jumpin' Jumpin' - Destiny's Child - 229 ms
'Round The World With The Rubber Duck - C.W. McCall - 247 ms
(Note the linkage of songs with apostrophes or periods.)
However it traverses the results, its speediness seems to be
offset by results that aren't quite as short as possible. For
example (after I hacked Pedro's code to ignore everything but spaces
and letters in a title) it thinks that the shortest playlist between
"Que Sera" and "Zaar" is:
Que Sera - Ace of Base - 227 ms ===> Zaar - Peter Gabriel - 178 ms
Que Sera - Ace of Base - 227 ms
Angeleyes - ABBA - 259 ms
Second Chance - .38 Special - 271 ms
Eden - 10,000 Maniacs - 250 ms
North - Afro Celt Sound System - 409 ms
Hold On Loosely - .38 Special - 279 ms
You May Be Right - Billy Joel - 255 ms
Teacher Teacher - .38 Special - 193 ms
Release It Instrumental - Afro Celt Sound System - 387 ms
Little Bird - Annie Lennox - 279 ms
Dont Talk - 10,000 Maniacs - 321 ms
Keepin Up - Alabama - 185 ms
Pluto - Björk - 199 ms
Ostrichism - A Small, Good Thing - 152 ms
Mountain Music - Alabama - 252 ms
Caught Up In You - .38 Special - 276 ms
Unknown Track - Boards of Canada - 311 ms
in the Morning - LL Cool J - 222 ms
Get Me Off - Basement Jaxx - 289 ms
Fine and Mellow - Billie Holiday - 220 ms
Waltz - Gabriel Yared - 118 ms
Zaar - Peter Gabriel - 178 ms
while my code discovered this path (after I stole the code from James
to pick start and end songs):
Looking for a path between 'Que Sera' and 'Zaar'
#0 - Ace of Base :: Que Sera :: 3:47
#1 - Eric Serra :: A Bomb In The Hotel :: 2:15
#2 - Duke Ellington :: Limbo Jazz :: 5:14
#3 - Peter Gabriel :: Zaar :: 2:59
+0.2s to create playlist
James did a nice two-way search for his solution - instead of walking
an ever-broadening possibility tree going from the start to the
finish, he has the finish also start branching out at the same time
until they meet. From a geometric perspective, a circle centered on
the start point and touching the end point covers twice the combined
area of two circles, one centered on each point, which just touch each
other (pi*d^2 versus 2*pi*(d/2)^2 for points a distance d apart).
This would seem to me to be an obvious benefit from both a speed and
memory standpoint. (I had originally thought I would do something
like this, but couldn't come up with a clear idea of how to implement
it.)
I particularly like the incredible simplicity of James' code that
does the work:
until (join_node = start.touch?(finish))
  start.grow
  finish.grow
end
Ruby code can be so simple and expressive when done right! Another
nice line (from Brian's code):
connections = starts & endings
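A two-way search of this kind, on a toy graph of letters, might be
sketched as follows. This is not James' or Brian's implementation:
EDGES, grow and two_way_search are names made up for this example, and
the array intersection (&) plays the role of the "do they touch yet?"
test.

```ruby
# Toy letter graph: [from_letter, to_letter, title].
EDGES = [
  ["a", "l", "A Bomb In The Hotel"],
  ["l", "z", "Limbo Jazz"],
  ["a", "s", "Angeleyes"],
  ["s", "e", "Second Chance"],
  ["e", "n", "Eden"]
]

# Expand a frontier by one level. `parents` maps each reached letter to
# the [neighbour, title] it was reached through; from_idx and to_idx
# choose the direction in which the edges are followed.
def grow(frontier, parents, edges, from_idx, to_idx)
  next_frontier = []
  edges.each do |edge|
    from, to = edge[from_idx], edge[to_idx]
    next unless frontier.include?(from) && !parents.key?(to)
    parents[to] = [from, edge[2]]
    next_frontier << to
  end
  next_frontier
end

def two_way_search(edges, start, goal)
  fwd = { start => nil }   # letters reachable from the start
  bwd = { goal => nil }    # letters from which the goal is reachable
  fwd_frontier = [start]
  bwd_frontier = [goal]
  forward_turn = true

  until (meet = (fwd.keys & bwd.keys).first)
    return nil if fwd_frontier.empty? || bwd_frontier.empty?
    if forward_turn
      fwd_frontier = grow(fwd_frontier, fwd, edges, 0, 1)
    else
      bwd_frontier = grow(bwd_frontier, bwd, edges, 1, 0)
    end
    forward_turn = !forward_turn
  end

  titles = []
  letter = meet
  while fwd[letter]              # walk back towards the start
    letter, title = fwd[letter]
    titles.unshift(title)
  end
  letter = meet
  while bwd[letter]              # walk on towards the goal
    letter, title = bwd[letter]
    titles.push(title)
  end
  titles
end

p two_way_search(EDGES, "a", "z")
```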
Unfortunately, when I asked James' code to find the same path as
above, it took a very long time (almost half an hour) to produce a
result. I wish I had more domain knowledge (and time) to research why
this was the case, with an algorithm that I thought would be a better
performer. It does come up with a nice terse path, however:
1: Que Sera by Ace of Base <<227422>>
2: All Through the Night by Cyndi Lauper <<269740>>
3: These Are Days by 10,000 Maniacs <<293067>>
4: Santa Cruz by David Qualey <<131053>>
5: Zaar by Peter Gabriel <<293355>>
I wanted to play more with Brian's and Dave's solutions, but they also
took a very long time to finish. The first time I ran Dave's solution
it took over 2 hours and 400MB of RAM, and eventually dumped out
every playlist it could find that matched the criteria I supplied. (A
1-3 song playlist between two letters that I forget, with no time
constraints.)
As noted earlier, Dave's solution gives you the power to limit number
of songs (min, max, and target) and playlist duration (again, min,
max, and target). I wish I had thought of this - being able to take
the number of recursive depths from 26 to something like 10 helps
tremendously in terms of search space. Dave also notes that his code
is a depth-first recursion, and seeing his exclamation point in the
comment
# (recursively, depth-first(!))
makes me realize that when searching for a shortest path, a
breadth-first search would almost certainly prove a better performer.
Damn.
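For comparison, a breadth-first version can be sketched in a few
lines. BFS_LINKS and bfs_chain are hypothetical names, the map is a
toy one, and none of this comes from a submitted solution. Because
every letter one step away is examined before any letter two steps
away, the first chain found is already one with the fewest songs.

```ruby
# Toy roadmap: { start_letter => { end_letter => title } }.
BFS_LINKS = {
  "a" => { "s" => "Angeleyes", "l" => "A Bomb In The Hotel" },
  "s" => { "e" => "Second Chance" },
  "e" => { "n" => "Eden" },
  "n" => { "h" => "North" },
  "l" => { "z" => "Limbo Jazz" }
}

def bfs_chain(links, from, to)
  queue = [[from, []]]        # each entry: [letter, titles used so far]
  seen  = { from => true }
  until queue.empty?
    letter, titles = queue.shift
    (links[letter] || {}).each do |next_letter, title|
      return titles + [title] if next_letter == to
      next if seen[next_letter]
      seen[next_letter] = true
      queue << [next_letter, titles + [title]]
    end
  end
  nil                         # the whole reachable map was explored
end

p bfs_chain(BFS_LINKS, "a", "z")
```

A depth-first walk over the same map would head down the "s" branch
as far as it goes before ever looking at "l".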
One thing I found particularly interesting in Dave's code was his way
to specify Infinity as a starting point for a minimum-value algorithm:
result.inject(1.0/0.0) do |memo, pl|
  [ memo, (pl.total_duration - target_duration).abs ].min
end
In JavaScript I've done stuff like that starting out with values like
var theMin = Number.MAX_VALUE;
or
var theMin = Infinity;
and had wished for a way to do the same in Ruby. I still wish for a
nice portable Infinity constant, but Dave's solution here will do
nicely for me for the future. Thanks Dave!
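For what it's worth, Ruby 1.9.2 and later ship a Float::INFINITY
constant, so on a newer interpreter the same idiom can be written as
in the sketch below. TimedList and smallest_gap are hypothetical names
used only to make the example self-contained.

```ruby
# Hypothetical stand-in for a playlist that knows its total duration.
TimedList = Struct.new(:total_duration)

# Smallest distance between any playlist's duration and the target.
# Starting from infinity means the first real value always wins.
def smallest_gap(playlists, target_duration)
  playlists.inject(Float::INFINITY) do |memo, pl|
    [memo, (pl.total_duration - target_duration).abs].min
  end
end

lists = [TimedList.new(1_700_000), TimedList.new(1_850_000)]
puts smallest_gap(lists, 1_800_000)   # => 50000
```

With an empty list the method simply hands back infinity, which is a
reasonable "nothing found" answer for a minimum.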
Brian's solution quickly finds a few shortest playlists, but
when it goes on to find an a->z playlist that's as close to 30
minutes as possible, it ... well, it's been over 3 hours and it still
hasn't finished, though it occasionally spits out progress which
looks like it found a new playlist that is a slightly better match
than the last.
Brian's code looks quite nice, however. Brian genericized his
solution to allow for any type of minimization, using his
BasicMatchEvaluator class. While the code is not documented, it would
appear that his PlaytimeEvaluator subclass uses the square of the
difference between the desired time and the actual playlist to
determine fitness. But if that's all it did, it would require finding
every solution - as it is, the #continue? method looks to be designed
to provide a customizable way to determine early on if a path is a
'dead end'.
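To make the idea concrete, here is a guess at what such an evaluator
could look like. It is emphatically not Brian's code: the class name
TargetTimeEvaluator, its score method and the rule inside continue?
are all assumptions made for this sketch.

```ruby
# Hypothetical evaluator: scores playlists by how far their total time
# is from a target, and says when extending one is pointless.
class TargetTimeEvaluator
  def initialize(target_ms)
    @target_ms = target_ms
  end

  # Lower is better: squared distance between total time and target.
  def score(durations)
    (durations.sum - @target_ms)**2
  end

  # Adding a song only ever adds time, so once a playlist has reached
  # the target, any longer version of it can only score worse.
  def continue?(durations)
    durations.sum < @target_ms
  end
end

evaluator = TargetTimeEvaluator.new(600_000)
puts evaluator.score([300_000, 200_000])       # 100_000 ms short, squared
puts evaluator.continue?([300_000, 200_000])   # still under the target
puts evaluator.continue?([400_000, 250_000])   # already past it
```

A search would call continue? before recursing and score whenever it
reaches the final song, so whole subtrees get skipped early.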
Ilmari's solution is fast! I don't have the expertise to analyze his
implementation of "Dijkstra's graph search algorithm with a block-
customizable edge cost", but whatever it is, it works. Every solution
took less than 1 second, usually under 1/2 a second.
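For readers in the same boat, here is a small sketch of Dijkstra's
algorithm where a block decides what each song "costs". It is not
Ilmari's implementation: DSong, cheapest_chain and the sample data are
made up for this example, a linear scan stands in for a proper
priority queue, and the block is assumed never to return a negative
cost.

```ruby
DSong = Struct.new(:name, :from, :to, :duration)

DIJKSTRA_SONGS = [
  DSong.new("A Bomb In The Hotel",        "a", "l", 135),
  DSong.new("Limbo Jazz",                 "l", "z", 314),
  DSong.new("And Now Little Green Bag",   "a", "g", 15),
  DSong.new("Good Friends Are for Keeps", "g", "s", 68),
  DSong.new("Santa Cruz",                 "s", "z", 131)
]

# Cheapest chain of songs from start_letter to goal_letter, where the
# block supplies the (non-negative) cost of using each song.
def cheapest_chain(songs, start_letter, goal_letter)
  cost = Hash.new(Float::INFINITY)
  cost[start_letter] = 0
  via = {}                     # letter => [previous_letter, song]
  done = {}
  pending = [start_letter]

  until pending.empty?
    letter = pending.min_by { |l| cost[l] }   # cheapest unsettled letter
    pending.delete(letter)
    break if letter == goal_letter
    done[letter] = true
    songs.each do |song|
      next unless song.from == letter && !done[song.to]
      new_cost = cost[letter] + yield(song)
      next unless new_cost < cost[song.to]
      cost[song.to] = new_cost
      via[song.to] = [letter, song]
      pending << song.to unless pending.include?(song.to)
    end
  end

  return nil unless start_letter == goal_letter || via.key?(goal_letter)
  path = []
  letter = goal_letter
  while via[letter]
    letter, song = via[letter]
    path.unshift(song)
  end
  path
end

by_time  = cheapest_chain(DIJKSTRA_SONGS, "a", "z") { |song| song.duration }
by_count = cheapest_chain(DIJKSTRA_SONGS, "a", "z") { 1 }
puts by_time.map(&:name).join(" / ")
puts by_count.map(&:name).join(" / ")
```

Swapping the block is all it takes to go from "shortest total time"
to "fewest songs", which is presumably the appeal of a
block-customizable edge cost.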
One optimization that I had thought of, but that nobody implemented
(as far as I could tell) was to find 'gaps'. For example, no song in
the supplied library ended with a 'q', so all songs that started
with 'q' could be removed. Similarly for songs beginning with 'x'.
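A sketch of that pruning step, under the assumption that the start
song is handled separately (a song nobody can lead into may still be
the first song of a playlist). GapSong and drop_unreachable are names
invented for this example.

```ruby
GapSong = Struct.new(:name) do
  def letters
    name.downcase.scan(/[a-z]/)
  end

  def first_letter
    letters.first
  end

  def last_letter
    letters.last
  end
end

# Keep only the songs that some other title can lead into, i.e. whose
# first letter is the last letter of at least one song in the list.
def drop_unreachable(songs)
  end_letters = songs.map(&:last_letter).uniq
  songs.select { |song| end_letters.include?(song.first_letter) }
end

GAP_SONGS = ["Que Sera", "Angeleyes", "Santa Cruz", "Zaar", "Rock Me"]
            .map { |title| GapSong.new(title) }

puts drop_unreachable(GAP_SONGS).map(&:name).join(", ")
```

Removing songs can open up new gaps, so a thorough version would
repeat the pass until nothing more is dropped.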
==========================
Show the Result!
==========================
Finally, we get to the end - showing off the end result.
A few people (Gavin, Brian, Ilmari) took the messy milliseconds
duration and made them into something more human readable. Ilmari's
solution stands out in this department as particularly attractive:
Trying to find playlist from "Que Sera" to "Zaar"
Found playlist:
Que Sera (3:47) by Ace of Base
And Now Little Green Bag (0:15) by Steven Wright
Good Friends Are for Keeps (1:08) by The Carpenters
Santa Cruz (2:11) by David Qualey
Zaar (4:53) by Peter Gabriel
(Needs a fixed-width font to display as intended.)
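Turning milliseconds into minutes and seconds takes very little code.
A minimal sketch, with format_duration as a made-up helper name rather
than anything from a submitted solution:

```ruby
# Convert a duration in milliseconds to "m:ss", dropping the fraction.
def format_duration(ms)
  total_seconds = ms / 1000
  format("%d:%02d", total_seconds / 60, total_seconds % 60)
end

puts format_duration(227_422)   # => 3:47
puts format_duration(131_053)   # => 2:11
```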
==========================
Summary
==========================
As I noted, I spent far too much time working on petty details, and
didn't get to personally spend as much time playing with algorithms
and research as I would have liked. I'm quite pleased to see all the
solutions everyone submitted, because they showcase various nice Ruby-
isms and programming styles. (Blocks and procs, or custom classes,
used to turn a specific algorithm into a nice generic solution that
can use all sorts of criteria for what a good playlist is.)
I've focused on performance a lot during this summary. The point was
certainly not to write the fastest solution possible; however, in
playing with these sorts of NP-complete[2] problems you sometimes
HAVE to think about speed just to get a solution in a time that isn't
measured in years.
I hope you all had fun playing with this very common, very difficult
problem, even if it was disguised in sheep's clothing.
[1] http://c2.com/cgi/wiki?PrematureOptimization
[2] http://en.wikipedia.org/wiki/NP-complete : Some flavors of this
quiz (such as "shortest playlist") are not NP-complete, but I think
other flavors ("playlist closest to 30 minutes") are.