Parsing Tracklisting - discussion

moltar

I am writing a subroutine to parse DJ mix tracklistings into artist, track name and additional
information. I think it could even make a great module if it all works out well. So far my solution is pretty
ugly, which is why I decided to raise the discussion here with hundreds of Perl gurus.

A tracklist consists of several lines of artist and song names; sometimes additional information is
provided, such as label, "remix by", etc. Examples are at the end of this message. I want to parse each line into
separate fields ($artist, $track, $extra).

I had several ideas to tackle this problem.

1) A set of ifs to handle individual cases. Obviously one regex cannot cut it.
2) An array of regexes going from the most common to the least common track listing patterns. Loop and
try each pattern against the line.
3) More not-so-bright ideas that I won't mention.

So far my solution looks something like the following. It's still in the works. It's clumsy and dirty. I am
looking for something more logical and elegant.

sub parse_tracklist_item {
    my $line = shift;

    if ($line =~ / [-:]+ /) {
        # plug - tuff rinse - blue angel
        # dj krush - meiso ( dj crystral's drug deal vocal mix) - mo' wax
        return ($line, $1, $2, $3) if $line =~ /^(.+?) [-:]+ ([^(]+?) [-:]+ ([^\(]+)?/;
        # Artist - Track (Extra)
        # J-Cut - They Don’t Know (Advanced Dub)
        return ($line, $1, $2, $3) if $line =~ /^(.+?) [-:]+ ([^(]+) [\(\[]\s*([^)]+)\s*[\)\]]( - .*)?/;
        # Duo Infernele - Positive Vibes
        # T Power - Delta
        return ($line, $1, $2, $3) if $line =~ /^(.+?) [-:]+ (.+?)$/;
    } elsif ($line =~ /-/) {
        # Qwest feat. Jamal-Where My Thugs At? (A-Sides Remix)
        # NRG-I Need Your Lovin (Remix)
        return ($line, $1, $2, $3) if ($line =~ /^([^\-]+)\-([^\(]+)\s*\(?([^)]+)?\)?/);
    }
    # Get The Record Straight (Pieter K)
    # Amen Bizness (Macc + Chris Inperspective)
    # Are You Someone'S Prayer? (Fanu): Miracle Rmx.
    return ($line, $1, $2, $3) if ($line =~ /^([^\(]+) \(([^)]+)\)(?:\:(.*))?/);

    # stricken roots / fex
    # the way (evil nine remix) / dylan rhymes + pearlshot accapella
    if ($line =~ m|([^(]+)(?:\s*\(([^)]+)\)\s*)? / (.+)|) {
        return ($line, $1, $3, $2);
    }

    # Nothing matched: hand back the whole line as the only field.
    return ($line, $line);
}

_____________________________________________________________________
Here are a few real examples of track listing variations that I found around the net. This of course does
not cover everything people can come up with.

Zinc - Star Of Polaris (Bingo Dub)
J-Cut - They Don’t Know (Advanced Dub)

* 01 - plug - tuff rinse - blue angel
* 02 - dj krush - meiso ( dj crystral's drug deal vocal mix) - mo' wax

Duo Infernele - Positive Vibes
T Power - Delta

Are You Someone'S Prayer? (Fanu): Miracle Rmx.
Amen Bizness (Macc + Chris Inperspective)
Get The Record Straight (Pieter K)

Omni Trio>>Rollin Heights>>Moving Shadow 044
ED Rush & Nico>>Bludclot>>No U Turn
DJ Hype>>Shot In The Dark>> Sub Base Records 020

intersperce - equanimity - looking good records (9m 43s)
makoto - voices - good looking records (4m 41s)

Qwest feat. Jamal-Where My Thugs At? (A-Sides Remix)
Kane and Dynamic Duo-Doc Stoppa
NRG-I Need Your Lovin (Remix)
N20 Jungle Tools 3.5-Crunch

Hive/Echo/Tejada-FearAndLoathing
Loxy/Keaton-JudgementDay

Cause4Concern-Volcam-Subtitles
Hexer-UndergroundResistance-KrushGrooves

08_Break - Submerged :: SUBTITLES039
09_Ryme Tyme & Trace - Move 2005 (Universal Project Remix) :: 1210008

ASC - Alternate Souce [Gamma Ray MP3 release]
Twister - Watercolour [Scientific MP3 release]

Future Prophecies - Voice of Loneliness
Ink and J-Dubb - War Machine

Charlie - The Prodigy - XL Recordings
Hold It Down - 2 Bad Mice - Moving Shadow

Technical Itch - Soldiers - Penetration - 29.00
Stakka & Gridlock - Hit n Run - Cargo - 32.10

Black Sun Empire :: B’Negative (Ill Skillz rmx) :: Ill Skillz
Blame :: Cyberun :: 720 Degrees

4hero : 9 by 9 (MIST vocal rmx)
MIST & Jenna G : Lover
High Contrast : Music is everything (Influx datum rmx)

D.Kay :: Platinum (Ill.Skillz RMX)
Jammin :: Kinda Funky (Shimon RMX)
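
Looking at the examples above, most lines are a short list of fields joined by one separator style (" - ", " :: ", ">>", " : " or a bare hyphen), sometimes with a track number in front. Here is a minimal sketch of a pre-processing step built on that observation. The sub name split_tracklist_line and the separator list are my own assumptions for illustration. It only splits a line into fields; it does not try to decide which field is the artist.

```perl
use strict;
use warnings;

# Hypothetical helper: strip a leading track number, then split the line
# on the first separator style that actually occurs in it.
sub split_tracklist_line {
    my ($line) = @_;

    $line =~ s/^\s+|\s+$//g;                     # trim surrounding whitespace
    $line =~ s/^\*?\s*\d{1,2}(?:\s+-\s+|_)//;    # "* 01 - plug ..." or "08_Break ..."

    # Separators are tried in order. The bare hyphen comes last because it
    # also occurs inside names such as "J-Cut"; the lookahead keeps it from
    # splitting inside parentheses, as in "(A-Sides Remix)".
    my @separators = (
        qr/\s*>>\s*/,
        qr/\s+::\s+/,
        qr/\s+-\s+/,
        qr/\s+:\s+/,
        qr/-(?![^(]*\))/,
    );

    for my $sep (@separators) {
        my @fields = split $sep, $line;
        return @fields if @fields > 1;
    }
    return ($line);    # no separator recognised: return the line whole
}
```

The order of the separators is a guess based on the samples only. A line that mixes styles, such as "08_Break - Submerged :: SUBTITLES039", is split on "::" alone, so the artist and track stay joined in the first field.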
 
Mark Clements

moltar said:
sub parse_tracklist_item {
    my $line = shift;

    if ($line =~ / [-:]+ /) {
        # plug - tuff rinse - blue angel
        # dj krush - meiso ( dj crystral's drug deal vocal mix) - mo' wax
        return ($line, $1, $2, $3) if $line =~ /^(.+?) [-:]+ ([^(]+?) [-:]+ ([^\(]+)?/;
        # Artist - Track (Extra)
        # J-Cut - They Don’t Know (Advanced Dub)
        return ($line, $1, $2, $3) if $line =~ /^(.+?) [-:]+ ([^(]+) [\(\[]\s*([^)]+)\s*[\)\]]( - .*)?/;
        # Duo Infernele - Positive Vibes
        # T Power - Delta
        return ($line, $1, $2, $3) if $line =~ /^(.+?) [-:]+ (.+?)$/;
    } elsif ($line =~ /-/) {
        # Qwest feat. Jamal-Where My Thugs At? (A-Sides Remix)
        # NRG-I Need Your Lovin (Remix)
        return ($line, $1, $2, $3) if ($line =~ /^([^\-]+)\-([^\(]+)\s*\(?([^)]+)?\)?/);
    }

<snip>
There are a number of ways of doing this. I would probably put the regexes and capture positions
into external configuration, using something like Config::Properties or similar.

So in the properties file you would have:

mp3.matchers.1.regex=^(.+?) [-:]+ ([^(]+?) [-:]+ ([^\(]+)?
mp3.matchers.2.regex=^(.+?) [-:]+ ([^(]+) [\(\[]\s*([^)]+)\s*[\)\]]( - .*)?

I wrote Config::PropertiesSequence a while ago to handle multiple numbered properties like
this, but as far as I know nobody uses it but me. You could also use XML, but that might be a
bit heavy-duty for the task at hand. There are a number of configuration file modules on CPAN
you could look at.

Whichever method you use, you can then compile these strings as regexes and iterate over them.

push @mp3TitleRegexs, qr($_) foreach @rawRegexString;

....

foreach my $mp3TitleRegex (@mp3TitleRegexs) {
    if ($line =~ $mp3TitleRegex) {
        # do whatever
    }
}
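
To make the "regexes and positions" idea a bit more concrete, here is a minimal sketch with the table written inline instead of loaded from a properties file. The sub name parse_with_matchers, the hash keys, and the three patterns (simplified from the ones in the original post) are assumptions for illustration; nothing here comes from Config::Properties itself.

```perl
use strict;
use warnings;

# Each matcher pairs a pattern string with the names of its captures, in
# order. Patterns run from most specific to most general; first hit wins.
my @matchers = (
    {   # Artist - Track - Label
        regex  => '^(.+?) [-:]+ ([^(]+?) [-:]+ ([^(]+)$',
        fields => [qw(artist track extra)],
    },
    {   # Artist - Track (Extra)  or  Artist - Track [Extra]
        regex  => '^(.+?) [-:]+ ([^(]+?) [(\[]\s*([^)\]]+?)\s*[)\]]',
        fields => [qw(artist track extra)],
    },
    {   # Artist - Track
        regex  => '^(.+?) [-:]+ (.+)$',
        fields => [qw(artist track)],
    },
);

# Compile the strings once, up front.
for my $m (@matchers) {
    my $pattern = $m->{regex};
    $m->{compiled} = qr/$pattern/;
}

sub parse_with_matchers {
    my ($line) = @_;
    for my $m (@matchers) {
        my @captures = $line =~ $m->{compiled} or next;
        my %parsed;
        @parsed{ @{ $m->{fields} } } = @captures;    # name each capture
        return \%parsed;
    }
    return;    # nothing matched
}
```

With this layout, a format that puts the track before the artist only needs a different fields list, not different code. A line that matches nothing returns undef in scalar context, so the caller can fall back to something else.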



Mark
 
Bob

Very interesting topic. I tried doing this and gave up: too many
possible combinations and permutations. For example, how do you know
which is the artist and which is the title, and how can you be sure?

It appears to me that you are just using the filename. What about
examining the tags within the file? And then the other problem is
making sure that the MP* tags and the file names match. A tool to
sync all of these types of information, and check names against some
external source (e.g. Amazon, CDDB, etc.), has long been a goal of mine.
But this is really a complicated task.

If you create a module, I would love to know about it.

The external config file sounds like a great idea, because then you
could add, remove, mix and match as you want. You could even comment
some out for testing, and you can leave the source code alone. You could
maybe have different config files for different 'standards', or groups
of music, speech recordings, etc.
 
Mark Clements

Bob said:
It appears to me that you are just using the filename. What about
examining the tags within the file? And then the other problem is
making sure that the MP* tags and the file names match. A tool to
sync all of these types of information, and check names against some
external source (e.g. Amazon, CDDB, etc.), has long been a goal of mine.
But this is really a complicated task.

There are a number of modules on CPAN that do this, though I haven't
used any of them. Check out

Audio::File
CDDB
CDDB::File

for instance.

search.cpan.org

Mark
 
Moltar

Bob said:
Very interesting topic. I tried doing this and gave up. Too many
possible combinations and permutations. For example, how do you know
which is the artist, and which is the title, and how can you be sure?

Well, there is more to the story. I am actually writing this for a web app. I was thinking of letting the
user pick what is an artist and what is a track name. For general use, though, it would just guess wrong
sometimes. I don't think there is a safe way to get this 100% right. I'd love to have 100%
correctness, but I know it's just not possible without human intervention.

Bob said:
It appears to me that you are just using the filename. What about
examining the tags within the file? And then the other problem is
making sure that the MP* tags and the file names match. A tool to
sync all of these types of information, and check names against some
external source (e.g. Amazon, CDDB, etc.), has long been a goal of mine.
But this is really a complicated task.

I am not using filenames either. It could be used for filename parsing, but the input I was after is just
copy-paste from a webpage. Many DJs upload their mixes online along with a tracklist for each mix.
It's just a list without the filenames. But I was thinking of adapting it to file names if it's worth it.

Here are a few examples:
http://www.artelectro.net/mixcentral/phpBB2/music_page.php?song_id=1047
http://www.artelectro.net/mixcentral/mix-1045.html
http://www.artelectro.net/mixcentral/mix-1042.html
and even like this: http://www.artelectro.net/mixcentral/mix-1039.html

Bob said:
If you create a module, I would love to know about it.

I will definitely report back here if it becomes a module.

Bob said:
The external config file sounds like a great idea, because then you
could add, remove, mix and match as you want. You could even comment
some out for testing, and you can leave the source code alone. You could
maybe have different config files for different 'standards', or groups
of music, speech recordings, etc.

Yes, it does sound good! I think I will do just that.
 
Moltar

Why can't I just use a plain text file with one regex per line?
Are there any obstacles?
Is Config::Properties faster?
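
For reference, the plain-file approach could look something like the minimal sketch below. It assumes a made-up file format where blank lines and lines starting with # are skipped; load_regex_file is a hypothetical name. Two obstacles show up straight away: a pattern that itself begins with # cannot be written in this format, and a typo in one pattern would kill the program at compile time unless it is trapped, which is what the eval is for.

```perl
use strict;
use warnings;

# Hypothetical loader: read one pattern per line from a filehandle,
# skipping blank lines and "#" comments, and compile each with qr//.
sub load_regex_file {
    my ($fh) = @_;
    my @regexes;
    while ( my $pattern = <$fh> ) {
        chomp $pattern;
        next if $pattern =~ /^\s*(?:#|$)/;    # comment or blank line

        # qr// dies on an invalid pattern; trap that and keep going.
        my $compiled = eval { qr/$pattern/ };
        if ( !defined $compiled ) {
            warn "Skipping bad pattern on line $.: $@";
            next;
        }
        push @regexes, $compiled;
    }
    return @regexes;
}
```

The caller opens the file, passes the handle in, and loops over the returned list in order. This format has no place to say which capture is the artist and which is the track, so every pattern in the file would have to use the same capture order.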
 
