Python parsing iTunes XML/COM

william tanksley

I'm trying to convert the URLs contained in iTunes' XML file into a
form comparable with the filenames returned by iTunes' COM interface.

I'm writing a podcast sorter in Python; I'm using iTunes under Windows
right now. iTunes' COM provides most of my data input and all of my
mp3/aac editing capabilities; the one thing I can't access through COM
is the Release Date, which is my primary sorting field. So I read
everything in through COM, then read all the release dates from the
iTunes XML file, then try to join the two together... But so far I
have zero success.
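
For the XML side, one possible sketch (Python 3, which postdates this thread) reads the library as a property list with the standard plistlib module. The key names "Tracks", "Location" and "Release Date" are assumed to match what iTunes writes, and the sample data is made up for illustration:

```python
import datetime
import plistlib

# Hypothetical stand-in for the iTunes library XML; a real script would
# open the file in binary mode and call plistlib.load() on it instead.
sample_xml = plistlib.dumps({
    "Tracks": {
        "101": {
            "Name": "Episode 1",
            "Location": "file://localhost/C:/Podcasts/Episode%201.mp3",
            "Release Date": datetime.datetime(2008, 5, 3, 12, 0, 0),
        },
    },
})

library = plistlib.loads(sample_xml)

# Map each track's raw Location URL to its release date (None if absent).
release_dates = {
    track["Location"]: track.get("Release Date")
    for track in library["Tracks"].values()
    if "Location" in track
}
```

The Location values are still percent-encoded URLs at this point, so they need converting before they can be compared with COM filenames.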

Is there _any_ way to match up tracks between iTunes COM and iTunes
XML? I've spent far too much effort on this. I'm not stuck on using
filenames, if that's a bad idea... But I haven't found anything else
that works, and filenames seem like an obvious solution.

-Wm
 
william tanksley

To ask another way: how do I convert from a file:// URL to a local
path in a standard way, so that filepaths from two different sources
will work the same way in a dictionary?

Right now I'm using the following source:

track_id = url2pathname(urlparse(track_id).path)

url2pathname is from urllib; urlparse is from the urlparse module.

The problems occur when the filenames have non-ascii characters in
them -- I suspect that the URLs are having some encoding placed on
them that Python's decoder doesn't know about.

Thank you all in advance, and thank you for Python.

-Wm
 
John Machin

To ask another way: how do I convert from a file:// URL to a local
path in a standard way, so that filepaths from two different sources
will work the same way in a dictionary?

Right now I'm using the following source:

track_id = url2pathname(urlparse(track_id).path)

url2pathname is from urllib; urlparse is from the urlparse module.

The problems occur when the filenames have non-ascii characters in
them -- I suspect that the URLs are having some encoding placed on
them that Python's decoder doesn't know about.

WHAT problems? WHAT non-ASCII characters?? Consider e.g.

# track_id = url2pathname(urlparse(track_id).path)
print repr(track_id)
parse_result = urlparse(track_id).path
print repr(parse_result)
track_id_replacement = url2pathname(parse_result)
print repr(track_id_replacement)

and copy/paste the results into your next posting.
 
pyshib

If you want to convert the file names which use standard URL encoding
(with %20 for space, etc) use:

from urllib import unquote
new_filename = unquote(filename)

I have found this does not convert encoded characters of the form
'' so you may have to do that manually. I think these are just
ascii encodings in hexadecimal.
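
As a rough illustration in Python 3 (where unquote lives in urllib.parse rather than urllib), percent-escapes such as %20 are undone like this. The filename fragment is a shortened, hypothetical example:

```python
from urllib.parse import unquote

encoded = "Annual%20Shareholders%C2%A0L.mp3"

# In Python 3, unquote() also decodes the resulting bytes as UTF-8 by default,
# so the two escaped bytes %C2%A0 come back as one no-break space character.
decoded = unquote(encoded)
print(repr(decoded))
```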
 
william tanksley

Thank you for the response. Here's some more info, including a little
that you didn't ask me for but which might be useful.

# track_id = url2pathname(urlparse(track_id).path)
print repr(track_id)
parse_result = urlparse(track_id).path
print repr(parse_result)
track_id_replacement = url2pathname(parse_result)
print repr(track_id_replacement)

The "important" value here is track_id_replacement; it contains the
data that's throwing me. It appears that some UTF-8 characters are
being read as multiple bytes by ElementTree rather than being decoded
into Unicode. Could this be a bug in ElementTree's Unicode support? If
so, can I work around it?

Here's one example. The others are similar -- they have the same
things that look like problems to me.

"Buffett Time - Annual Shareholders\xc2\xa0L.mp3"

Note some problems here:

1. This isn't Unicode; it's missing the u"" (I printed using repr).
2. It's got the UTF-8 bytes there in the middle.

I tried doing track_id.encode("utf-8"), but it doesn't seem to make
any difference at all.

Of course, my ultimate goal is to compare the track_id to the track_id
I get from iTunes' COM interface, including hashing to the same value
for dict lookups.

and copy/paste the results into your next posting.

In addition to the above results, while trying to get more diagnostic
printouts I got the following warning from Python:

C:\projects\podcasts\podstrand\podcast.py:280: UnicodeWarning: Unicode
equal comparison failed to convert both arguments to Unicode -
interpreting them as being unequal
return track.databaseID == trackLocation

The code that triggered this is as follows:

if trackLocation in self.podcasts:
    track = self.podcasts[trackLocation]
    if trackRelease:
        track.release_date = trackRelease
    elif track.is_podcast:
        print "No release date:", repr(track.name)
else:
    # For the sake of diagnostics, try to find the track.
    def track_has_location(track):
        return track.databaseID == trackLocation
    fillers = filter(track_has_location, self.fillers)
    if len(fillers):
        return
    disabled = filter(track_has_location, self.deferred)
    if len(disabled):
        return
    print "Location not known:", repr(trackLocation)
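
The warning comes from comparing a non-ASCII byte string with a Unicode string. A small sketch of the same mismatch in Python 3, where such a comparison is simply False instead of producing a warning (the names and values here are hypothetical):

```python
# Text, as a COM interface would be expected to report it (assumption).
com_name = "Annual Shareholders\u00a0L.mp3"
# The same name as undecoded UTF-8 bytes.
xml_name = b"Annual Shareholders\xc2\xa0L.mp3"

podcasts = {com_name: "track object"}

found_raw = xml_name in podcasts                       # bytes never match a text key
found_decoded = xml_name.decode("utf-8") in podcasts   # decoded text does match
print(found_raw, found_decoded)
```

Either way, the lookup only succeeds once both sides are the same kind of string.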

-Wm
 
Jerry Hill

Here's one example. The others are similar -- they have the same
things that look like problems to me.

"Buffett Time - Annual Shareholders\xc2\xa0L.mp3"

Note some problems here:

1. This isn't Unicode; it's missing the u"" (I printed using repr).
2. It's got the UTF-8 bytes there in the middle.

I tried doing track_id.encode("utf-8"), but it doesn't seem to make
any difference at all.

I don't have anything to say about your iTunes problems, but encode()
is the wrong method to turn a byte string into a unicode string.
Instead, use decode(), like this:
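
The snippet that followed is not preserved in this copy of the thread. A minimal sketch of the idea, written for Python 3 and reusing the filename quoted above, might look like this:

```python
# UTF-8 encoded bytes, as shown in the repr() output above.
raw = b"Buffett Time - Annual Shareholders\xc2\xa0L.mp3"

# decode(): bytes -> text. The two bytes \xc2\xa0 become one character.
text = raw.decode("utf-8")

# encode(): text -> bytes. This gets back the original byte string.
round_trip = text.encode("utf-8")

print(repr(text))
print(round_trip == raw)
```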
 
william tanksley

I don't have anything to say about your iTunes problems, but encode()
is the wrong method to turn a byte string into a unicode string.
Instead, use decode(), like this:

Awesome... Thank you! I had my mental model of Python turned around
backwards. That's an odd feeling. Okay, so you decode to go from raw
bytes into a given encoding, and you encode to go from a given encoding
to raw bytes. Not what I thought it was, but that's cool, makes sense.

At first I thought this fixed my problem, but I had to tweak the
obvious fix to make it work, and I don't understand why.

Fix #1:

track_id = track_id.decode('utf-8')
track_id = url2pathname(urlparse(track_id).path)

That doesn't work -- it produces no error, but the raw bytes appear in
the unicode string.

Fix #2:

track_id = url2pathname(urlparse(track_id).path)
track_id = track_id.decode('utf-8')

This one appears to work. (Although I can't confirm it for sure,
because although all my debug prints are now correct, the overall
application fails in the same way it did before, back before I put in
debug printfs. I'm going to spend some time assuming that the problem
is elsewhere in my code, since at least I definitely fixed one serious
problem.)

I've got a few questions for Python-XML-Unicode experts...

1. Why does the order of those statements matter?
2. Shouldn't it be more correct to decode BEFORE transforming the
string? Why does that kill the decoding?
3. Why is ElementTree dumping raw bytes on me instead of decoding to
UTF-8? The XML file declares its encoding as UTF-8,
so it seems like it should.

-Wm
 
Stefan Behnel

william said:
Okay, so you decode to go from raw
bytes into a given encoding, and you encode to go from a given encoding
to raw bytes.

No, decoding goes from a byte sequence to a Unicode string and encoding goes
from a Unicode string to a byte sequence.

Unicode is not an encoding. A Unicode string is a character sequence, not a
byte sequence.

Stefan
 
Jerry Hill

Awesome... Thank you! I had my mental model of Python turned around
backwards. That's an odd feeling. Okay, so you decode to go from raw
bytes into a given encoding, and you encode to go from a given encoding
to raw bytes. Not what I thought it was, but that's cool, makes sense.

That's not quite right. Decoding takes a byte string that is already
in a particular encoding and transforms it to unicode. Unicode isn't
an encoding of its own. Encoding takes a unicode string (which
doesn't have any encoding associated with it), and gives you back a
sequence of bytes in a particular encoding.

This article isn't specific to Python, but it provides a good overview
of unicode and character encodings that may be useful:
http://www.joelonsoftware.com/articles/Unicode.html
 
william tanksley

That's not quite right.  Decoding takes a byte string that is already
in a particular encoding and transforms it to unicode.  Unicode isn't
an encoding of its own.  Encoding takes a unicode string (which
doesn't have any encoding associated with it), and gives you back a
sequence of bytes in a particular encoding.

Okay, this is useful. Thank you for straightening out my mental model.
It makes sense to define strings as just naturally Unicode... and
anything else is in some ways not really a string, although it's
something that might have many of the same methods. I guess this
mental model is being implemented more thoroughly in Py3K... Anyhow,
it makes sense.

I'm still puzzled why I'm getting some non-Unicode out of an
ElementTree's text, though.

-Wm
 
william tanksley

william tanksley said:
I'm still puzzled why I'm getting some non-Unicode out of an
ElementTree's text, though.

Now I know.

Okay, my answer is that cElementTree (in Python 2.5) is simply
deranged when it comes to Unicode. It assumes everything's ASCII.

Reference: http://codespeak.net/lxml/compatibility.html

(Note that the lxml version also doesn't handle Unicode correctly; it
errors when XML declares its encoding.)

This is unpleasant, but at least now I know WHY it was driving me
insane.

-Wm
 
Stefan Behnel

william said:
Now I know.

Okay, my answer is that cElementTree (in Python 2.5) is simply
deranged when it comes to Unicode. It assumes everything's ASCII.

It does not "assume" that. It *requires* byte strings to be ASCII. If it
didn't enforce that, how could it possibly know what encoding they were using,
i.e. what they were supposed to mean at all? Read the Python Zen, in the face
of ambiguity, ElementTree refuses the temptation to guess. Python 2.x does
exactly the same thing when it comes to implicit conversion between encoded
strings and Unicode strings.

If you want to pass plain ASCII strings, you can either pass a byte string or
a Unicode string (that's a plain convenience feature). If you want to pass
anything that's not ASCII, you *must* pass a Unicode string.

Reference: http://codespeak.net/lxml/compatibility.html

(Note that the lxml version also doesn't handle Unicode correctly; it
errors when XML declares its encoding.)

It definitely does "handle Unicode correctly". Let me guess, you tried passing
XML as a Unicode string into the parser, and your XML declared itself as
having a byte encoding (<?xml encoding="..."?>). How can that *not* be an error?

This is unpleasant, but at least now I know WHY it was driving me
insane.

You should *really* read a bit about Unicode and byte encodings. Not
understanding a topic is not a good excuse for complaining about it being
broken for you.

Stefan
 
John Machin

Thank you for the response. Here's some more info, including a little
that you didn't ask me for but which might be useful.



The "important" value here is track_id_replacement; it contains the
data that's throwing me. It appears that some UTF-8 characters are
being read as multiple bytes by ElementTree rather than being decoded
into Unicode.

Appearances can be deceptive. You present no evidence.

Could this be a bug in ElementTree's Unicode support?

It could, yes, but the probability is extremely low.

If so, can I work around it?

Here's one example. The others are similar -- they have the same
things that look like problems to me.

"Buffett Time - Annual Shareholders\xc2\xa0L.mp3"

Note some problems here:

Where?


1. This isn't Unicode; it's missing the u"" (I printed using repr).
2. It's got the UTF-8 bytes there in the middle.

I tried doing track_id.encode("utf-8"), but it doesn't seem to make
any difference at all.

Of course, my ultimate goal is to compare the track_id to the track_id
I get from iTunes' COM interface, including hashing to the same value
for dict lookups.


In addition to the above results,

*WHAT* results? I don't see any repr() output, just your
interpretation of what you think you saw!
 
william tanksley

It does not "assume" that. It *requires* byte strings to be ASCII.

You can't encode Unicode into an ASCII string. (Well, except using
UTF-7.) Bad requirement.

If it didn't enforce that, how could it possibly know what encoding they were using,
i.e. what they were supposed to mean at all? Read the Python Zen, in the face
of ambiguity, ElementTree refuses the temptation to guess. Python 2.x does
exactly the same thing when it comes to implicit conversion between encoded
strings and Unicode strings.

An XML file that begins with the string <?xml encoding="utf-8"?> is
NOT ascii. You don't have to guess what encoding it's in. It's UTF-8.
If you error out when you hit an 8-bit character, you're not going to
be able to process that file. I'm completely lost on why you're
claiming otherwise.

Furthermore, when ElementTree returns (from one of its .text elements)
a string-of-bytes instead of a decoded Unicode string, it doesn't
merely "resist the temptation to guess"; instead, it forces ME to
guess. I've now had to hardcode "utf-8" into my program, when IT just
bypassed and ignored an explicit instruction to use UTF-8. I hope and
assume that iTunes will never switch from UTF-8 to UTF-32 -- if it
does, my code breaks, and I'll probably have to switch away from
ElementTree (I guess that since it requires ASCII it won't even
pretend to handle more than 8 bits per character).

If you want to pass plain ASCII strings, you can either pass a byte string or
a Unicode string (that's a plain convenience feature). If you want to pass
anything that's not ASCII, you *must* pass a Unicode string.

I don't care about strings. I've never passed ElementTree a string.
I'm using a file, a file that's correctly encoded as UTF-8, and it
returns some text elements that are raw bytes (undecoded). I have to
manually decode them.

It definitely does "handle Unicode correctly".

Actually, this is my bad -- I misread the webpage. lxml appears to
handle unicode strings with a declared encoding correctly: it errors
out. That's quite reasonable when confronted with a contradiction.
According to that page, however, the standard ElementTree library
doesn't work that way -- it simply assumes that byte strings are
ASCII.

I'm going to back down on this one, though. I realize that this is a
single paragraph on a third-party website, and it's not really trying
to document the official ElementTree (it's trying to document its own
version, lxml). So it might not be correct, or it might be overly
ambiguous. It might also be talking ONLY about strings, to the
exclusion of file input. I don't know, and I don't have the energy to
debug it, especially since I can't "fix" anything about it even if
something was wrong :).

So I revert to my former position: I don't know why those two lines
have to be in that order for my code to work correctly; I don't even
know why the "decode" line has to be there at all. When I was using
the old Python XML library, I didn't have to worry about encoding or
decoding; everything just worked. I really prefer ElementTree, and I'm
glad I upgraded, but it really looks like encoding is a problem.

Let me guess, you tried passing
XML as a Unicode string into the parser, and your XML declared itself as
having a byte encoding (<?xml encoding="..."?>). How can that *not* be an error?

I thought you just said "resist the temptation to guess"? I didn't
pass a string. I passed a file. It didn't error out; instead, it
produced bytestring-encoded output (not Unicode).

-Wm
 
william tanksley

*WHAT* results? I don't see any repr() output, just your
interpretation of what you think you saw!

That *is* the repr. I said it's the repr, and it IS. It's not an
interpretation; it's a screenscrape. Really, truly. If I paste it in
again it'll look the same.

What do you want? Can I post something that will convince you it's a
repr?

Oh well. You guys have been immensely helpful; my mental model of how
Python works was vastly backwards, so it's a relief to get it
corrected. Thanks to that, I was able to hack my code into working. I
wish I could get entirely correct behavior, but at this point the
miscommunication is too strong. I'll settle for the hack I've got now,
and hope iTunes doesn't ever change its XML encoding (hey, I think
I've got cause to be optimistic).

-Wm
 
Stefan Behnel

william said:
I didn't
pass a string. I passed a file. It didn't error out; instead, it
produced bytestring-encoded output (not Unicode).

From my experience (and from the source code I have seen so far), ElementTree
does not return UTF-8 encoded strings at the API level. Can you produce any
evidence for your claims? Some code and an XML file that together produce the
result you are talking about? From what you have written so far, it seems far
more likely to me that your code is messed up than that you found a bug in
ElementTree.
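
A small self-contained check along these lines can be sketched as follows. It is written for Python 3, where .text is always a text string, and the two-element document is a made-up miniature rather than real iTunes output:

```python
import xml.etree.ElementTree as ET

# UTF-8 encoded XML: the first <string> holds a real no-break space,
# the second holds the same character percent-escaped inside a URL.
doc = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<dict>"
    "<string>Annual Shareholders\u00a0L</string>"
    "<string>file://localhost/C:/Annual%20Shareholders%C2%A0L.mp3</string>"
    "</dict>"
).encode("utf-8")

root = ET.fromstring(doc)
name, location = [element.text for element in root.findall("string")]

print(repr(name))      # decoded text, one character for the no-break space
print(repr(location))  # still plain ASCII, escapes untouched
```

The parser decodes the UTF-8 bytes of the name, but it has no reason to touch the percent-escapes in the URL.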

Stefan
 
John Machin

That *is* the repr. I said it's the repr, and it IS. It's not an
interpretation; it's a screenscrape. Really, truly. If I paste it in
again it'll look the same.

What do you want? Can I post something that will convince you it's a
repr?

Let's try again:
The "important" value here is track_id_replacement; it contains the
data that's throwing me. It appears that some UTF-8 characters are
being read as multiple bytes by ElementTree rather than being decoded
into Unicode.
Here's one example. The others are similar -- they have the same
things that look like problems to me.
"Buffett Time - Annual Shareholders\xc2\xa0L.mp3"

ROTFL! I thought the Buffett thing was a Windows filename! What I was
expecting was THREE lots of repr() output, and I'm quite unused to
seeing repr() output with quotes around it instead of apostrophes; how
did you achieve that?

So you're saying that track_id_replacement contains utf8 characters.
It is obtained by track_id_replacement = url2pathname(parse_result).
You don't show us what is in parse_result. url2pathname() is nothing
to do with ElementTree. urlparse() is nothing to do with ElementTree.
You have provided no evidence that ElementTree is doing what you
accuse it of.

Please try again. Backtrack in your code to where you are pulling the
url out of an element. Do print repr(some_element.some_attribute).
Show us.
 
william tanksley

John Machin said:
Let's try again:

Cool. Sorry for the misunderstanding. Thank you for helping again!

Postscript: your request to print the actual data did the trick. I'm
including the rest of my reply just to provide context, but the answer
was that the Unicode was actually embedded in the URL, encoded as
distinct bytes. Thus, it *had* to be url-decoded and then UTF-8
decoded, in that order, in order to recover the original filename.

So the problem was indeed purely in my head -- I should have looked at
the original data (unfortunately, I was fooled by looking at the song
title, which is the same thing but with the raw UTF-8 bytes instead of
the URL escape codes).
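
The ordering issue can be sketched in Python 3 as follows. The Location value is a shortened, hypothetical version of the one in this thread, and unquote() with encoding="latin-1" is used only to mimic the outcome of the wrong order (each escaped byte ending up as its own character):

```python
from urllib.parse import unquote, unquote_to_bytes

location = "Buffett%20Time%20-%20Annual%20Shareholders%C2%A0L.mp3"

# Right order: undo the percent-escapes first (giving raw bytes),
# then decode those bytes as UTF-8.
right = unquote_to_bytes(location).decode("utf-8")

# Wrong order: the URL is pure ASCII, so decoding it first changes nothing,
# and the escaped bytes are then turned into two separate characters.
already_text = location.encode("ascii").decode("utf-8")
wrong = unquote(already_text, encoding="latin-1")

print(repr(right))
print(repr(wrong))
```

Decoding before unquoting is a no-op because there is nothing but ASCII to decode until the escapes have been expanded.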
ROTFL! I thought the Buffett thing was a Windows filename! What I was
expecting was THREE lots of repr() output, and I'm quite unused to
seeing repr() output with quotes around it instead of apostrophes; how
did you achieve that?

I don't know -- but I got it again when I printed out the original
version. My *guess* would be that this is what repr prints when asked
to print a byte string (but I don't know how to confirm that).
Alternately, the fact that I'm running these inside SPE might be
changing some defaults. I'm not sure.

You're right that single quotes are expected -- and I'd expect a
preceding u, since they're supposed to be Unicode. I dunno what's
going on.
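
On the quoting puzzle, one likely explanation (assuming the printed strings were full paths that included "Brian Preston's") is that repr() switches to double quotes whenever a string contains an apostrophe and no double quote. A quick check, shown in Python 3:

```python
with_apostrophe = repr("Brian Preston's Blog")
without_apostrophe = repr("Buffett Time")

print(with_apostrophe)     # wrapped in double quotes
print(without_apostrophe)  # wrapped in single quotes
```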
So you're saying that track_id_replacement contains utf8 characters.
It is obtained by track_id_replacement = url2pathname(parse_result).
You don't show us what is in parse_result. url2pathname() is nothing
to do with ElementTree. urlparse() is nothing to do with ElementTree.
You have provided no evidence that ElementTree is doing what you
accuse it of.

Okay. Here's the evidence... Or something. Looking at this I begin to
see why things work the way they do. It's utterly bizarre, quite
frankly.

Please try again. Backtrack in your code to where you are pulling the
url out of an element. Do print repr(some_element.some_attribute).
Show us.

Okay, the repr of the string that comes out of the .text attribute is:

"file://localhost/C:/Documents%20and%20Settings/TanksleyJrW/My%20Documents/My%20Music/iTunes/iTunes%20Music/Podcasts/Brian%20Preston's%20_Money%20Guy_%20Blog%20and%20Pod/Buffett%20Time%20-%20Annual%20Shareholders%C2%A0L.mp3"

Looking at the XML, and THIS TIME actually looking at the correct
attribute (I was looking at the title before) I see... surprise!
That's the correct data.

So all of the mysteries are solved (except for my Python's
doublequotes, but who cares), and ElementTree is entirely vindicated.
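
Putting the pieces together, a hedged Python 3 sketch of turning such a Location into a dictionary key that matches a Windows path might look like this. The helper names, the sample paths, and the choice to normalize with ntpath (so the result does not depend on the platform running the script) are all assumptions, not something taken from the thread:

```python
import ntpath
from urllib.parse import unquote, urlparse


def windows_key(path):
    """Normalize a Windows path so equivalent spellings compare equal."""
    return ntpath.normcase(ntpath.normpath(path))


def location_to_key(location):
    """Convert a file:// URL into the same normalized form."""
    # unquote() undoes the percent-escapes and decodes the bytes as UTF-8.
    path = unquote(urlparse(location).path)
    # "/C:/Music/x.mp3" -> "C:/Music/x.mp3"
    if len(path) >= 3 and path[0] == "/" and path[2] == ":":
        path = path[1:]
    return windows_key(path)


# Hypothetical values: one as the XML stores it, one as COM might report it.
xml_location = "file://localhost/C:/Podcasts/Buffett%20Time%C2%A0L.mp3"
com_path = "C:\\Podcasts\\Buffett Time\u00a0L.mp3"

by_location = {windows_key(com_path): "track from COM"}
match = by_location.get(location_to_key(xml_location))
print(match)
```

Lower-casing via normcase reflects the case-insensitive Windows filesystem; drop it if exact case matters for the comparison.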

-Wm
 
John Machin

Cool. Sorry for the misunderstanding. Thank you for helping again!

Postscript: your request to print the actual data did the trick.

I'd back inspecting actual data against armchair philosophy any
time :)

I'm including the rest of my reply just to provide context, but the answer
was that the Unicode was actually embedded in the URL, encoded as
distinct bytes. Thus, it *had* to be url-decoded and then UTF-8
decoded, in that order, in order to recover the original filename.

So the problem was indeed purely in my head -- I should have looked at
the original data (unfortunately, I was fooled by looking at the song
title, which is the same thing but with the raw UTF-8 bytes instead of
the URL escape codes).




I don't know -- but I got it again when I printed out the original
version. My *guess* would be that this is what repr prints when asked
to print a byte string (but I don't know how to confirm that).
Alternately, the fact that I'm running these inside SPE might be
changing some defaults. I'm not sure.

You're right that single quotes are expected -- and I'd expect a
preceding u, since they're supposed to be Unicode. I dunno what's
going on.

Why do you suppose that the contents are Unicode? It's a URL-encoded
string i.e. *deliberately* ASCII, in fact sub-ASCII (see all the %20
stuff?). What's going on is that ElementTree presents text as ASCII if
it can be so represented, otherwise as Unicode. This is actually a
*convenience*. Get used to it. Enjoy it.

Okay. Here's the evidence... Or something. Looking at this I begin to
see why things work the way they do. It's utterly bizarre, quite
frankly.


Okay, the repr of the string that comes out of the .text attribute is:

"file://localhost/C:/Documents%20and%20Settings/TanksleyJrW/My%20Documents/My%20Music/iTunes/iTunes%20Music/Podcasts/Brian%20Preston's%20_Money%20Guy_%20Blog%20and%20Pod/Buffett%20Time%20-%20Annual%20Shareholders%C2%A0L.mp3"

Looking at the XML, and THIS TIME actually looking at the correct
attribute (I was looking at the title before) I see... surprise!
That's the correct data.

So all of the mysteries are solved (except for my Python's
doublequotes, but who cares), and ElementTree is entirely vindicated.

Shucks. I can sense that you'd been looking forward to conducting an
auto-da-fe followed by tossing the author on a bonfire ... but you
can't burn a bot anyway :)
 
william tanksley

John Machin said:
I'd back inspecting actual data against armchair philosophy any
time :)

Heh. It's a recurring problem with me, to tell the truth.

Why do you suppose that the contents are Unicode? It's a URL-encoded
string i.e. *deliberately* ASCII, in fact sub-ASCII (see all the %20
stuff?). What's going on is that ElementTree presents text as ASCII if
it can be so represented, otherwise as Unicode. This is actually a
*convenience*. Get used to it. Enjoy it.

This isn't what caused the problem, but how is it convenient to get
Unicode sometimes and ASCII other times? Given that the input file was
Unicode, and in fact some of the values required Unicode, I'd expect
to have gotten Unicode out for everything.

I don't see how it matters; as far as I know, the methods available
for Unicode and ASCII strings are the same, and only the type() is
different. So I'm not saying it's a problem; I'm just not seeing how
it's a _convenience_.

Shucks. I can sense that you'd been looking forward to conducting an
auto-da-fe followed by tossing the author on a bonfire ... but you
can't burn a bot anyway :)

Well, I really _was_ expecting the Spanish Inquisition. Darn.

-Wm
 
