how to use structured markup tools

Sean McIlroy

I'm dealing with XML files in which there are lots of tags of the
following form: <a><b>x</b><c>y</c></a> (all of these letters are being
used as 'metalinguistic variables') Not all of the tags in the file are
of that form, but that's the only type of tag I'm interested in. (For
the insatiably curious, I'm talking about a conversation log from MSN
Messenger.) What I need to do is to pull out all the x's and y's in a
form I can use. In other words, from...

..
..
<a><b>x1</b><c>y1</c></a>
..
..
<a><b>x2</b><c>y2</c></a>
..
..
<a><b>x3</b><c>y3</c></a>
..
..

....I would like to produce, for example,...

[ (x1,y1), (x2,y2), (x3,y3) ]

Now, I'm aware that there are extensive libraries for dealing with
marked-up text, but here's the thing: I think I have a reasonable
understanding of python, but I use it in a lisplike way, and in
particular I only know the rudiments of how classes work. So here's
what I'm asking for:

Can anybody give me a rough idea how to come to grips with the problem
described above? Or even (dare to dream) example code? Any help will be
very much appreciated.

Peace,
STM
 
Fredrik Lundh

Sean said:
I'm dealing with XML files in which there are lots of tags of the
following form: <a><b>x</b><c>y</c></a> (all of these letters are being
used as 'metalinguistic variables') Not all of the tags in the file are
of that form, but that's the only type of tag I'm interested in. (For
the insatiably curious, I'm talking about a conversation log from MSN
Messenger.) What I need to do is to pull out all the x's and y's in a
form I can use. In other words, from...
.
<a><b>x1</b><c>y1</c></a>
.
<a><b>x2</b><c>y2</c></a>
.
<a><b>x3</b><c>y3</c></a>
.
...I would like to produce, for example,...

[ (x1,y1), (x2,y2), (x3,y3) ]

how about:

from elementtree import ElementTree

TEXT = """\
<doc>
<a><b>x1</b><c>y1</c></a>
<a><b>x2</b><c>y2</c></a>
<a><b>x3</b><c>y3</c></a>
</doc>
"""

tree = ElementTree.XML(TEXT)

data = []

for elem in tree.findall(".//a"):
    data.append((elem.findtext("b"), elem.findtext("c")))

print data

=> [('x1', 'y1'), ('x2', 'y2'), ('x3', 'y3')]
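
On Python 3, ElementTree ships in the standard library as xml.etree.ElementTree, so the same idea can be sketched without the separate elementtree package. This is only a rough adaptation: the extract_pairs helper name, its parameters, and the extra <other> element are illustrative assumptions, not part of the original example. Since not every tag in the real file has the <a><b>..</b><c>..</c></a> shape, the sketch skips any <a> that lacks a <b> or <c> child.

```python
import xml.etree.ElementTree as ET

TEXT = """\
<doc>
<a><b>x1</b><c>y1</c></a>
<other>ignored</other>
<a><b>x2</b><c>y2</c></a>
<a><b>x3</b><c>y3</c></a>
</doc>
"""

def extract_pairs(xml_text, row="a", first="b", second="c"):
    """Return a list of (first, second) text pairs, one per matching row element.

    Hypothetical helper: the name and parameters are assumptions for illustration.
    """
    root = ET.fromstring(xml_text)
    pairs = []
    # iter() walks the whole tree, so nested <a> elements are found too
    for elem in root.iter(row):
        x = elem.findtext(first)
        y = elem.findtext(second)
        # findtext returns None when the child element is missing
        if x is not None and y is not None:
            pairs.append((x, y))
    return pairs

print(extract_pairs(TEXT))
# [('x1', 'y1'), ('x2', 'y2'), ('x3', 'y3')]
```

The result is a plain list of tuples, so it can be handled with ordinary list tools; no class definitions are needed on the caller's side.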

more here:

http://effbot.org/zone/element-index.htm

</F>
 
Uche Ogbuji

Sean said:
I'm dealing with XML files in which there are lots of tags of the
following form: <a><b>x</b><c>y</c></a> (all of these letters are being
used as 'metalinguistic variables') Not all of the tags in the file are
of that form, but that's the only type of tag I'm interested in. (For
the insatiably curious, I'm talking about a conversation log from MSN
Messenger.) What I need to do is to pull out all the x's and y's in a
form I can use. In other words, from...

.
.
<a><b>x1</b><c>y1</c></a>
.
.
<a><b>x2</b><c>y2</c></a>
.
.
<a><b>x3</b><c>y3</c></a>
.
.

...I would like to produce, for example,...

[ (x1,y1), (x2,y2), (x3,y3) ]

Now, I'm aware that there are extensive libraries for dealing with
marked-up text, but here's the thing: I think I have a reasonable
understanding of python, but I use it in a lisplike way, and in
particular I only know the rudiments of how classes work. So here's
what I'm asking for:

Can anybody give me a rough idea how to come to grips with the problem
described above? Or even (dare to dream) example code? Any help will be
very much appreciated.

There are many tools you can use to get this done in Python. Here's a
recipe using Amara ( http://www.xml.com/pub/a/2005/01/19/amara.html )

DOC = """\
<matrix>
<a><b>x1</b><c>y1</c></a>
<a><b>x2</b><c>y2</c></a>
<a><b>x3</b><c>y3</c></a>
</matrix>
"""

from amara import binderytools

matrix = []
for row in binderytools.pushbind(u'a', string=DOC):
    matrix.append((unicode(row.b), unicode(row.c)))

print matrix

Which outputs:

[(u'x1', u'y1'), (u'x2', u'y2'), (u'x3', u'y3')]
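
For anyone without Amara installed, a similar row-at-a-time style can be approximated with the standard library's xml.etree.ElementTree.iterparse on Python 3. This is a sketch under assumptions, not a drop-in equivalent of pushbind: the stream_pairs name and its parameters are made up for illustration, and the document is wrapped in io.BytesIO only so the example is self-contained (a file name or open file would normally go there).

```python
import io
import xml.etree.ElementTree as ET

DOC = b"""\
<matrix>
<a><b>x1</b><c>y1</c></a>
<a><b>x2</b><c>y2</c></a>
<a><b>x3</b><c>y3</c></a>
</matrix>
"""

def stream_pairs(source, row="a", first="b", second="c"):
    """Yield (first, second) text pairs as each row element finishes parsing.

    Hypothetical helper: the name and parameters are assumptions for illustration.
    """
    # "end" events fire once an element and all of its children have been read
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == row:
            yield (elem.findtext(first), elem.findtext(second))
            elem.clear()  # discard the row's children once they have been used

matrix = list(stream_pairs(io.BytesIO(DOC)))
print(matrix)
# [('x1', 'y1'), ('x2', 'y2'), ('x3', 'y3')]
```

Because the rows are produced one by one, a long conversation log can be processed without first collecting every row element in a list.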

If your matrix actually has a variable or previously unknown number of
columns (e.g. <a><b>x1</b><c>y1</c><d>z1</d></a> ), the following
version of the for loop is a more general solution:

for row in binderytools.pushbind(u'a', string=DOC):
    matrix.append(tuple([ unicode(e) for e in row.xml_xpath(u'*') ]))

Same output, of course. I even tested it for you in Amara 0.9.4. And
what the heck, while I was there, I added it to the demos.
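
The variable-column case can also be sketched with the Python 3 standard library alone, by taking the text of every child element of each row. The rows_as_tuples name is a hypothetical helper, and the sample document with a three-column first row is an assumption added to show rows of different widths.

```python
import xml.etree.ElementTree as ET

DOC = """\
<matrix>
<a><b>x1</b><c>y1</c><d>z1</d></a>
<a><b>x2</b><c>y2</c></a>
</matrix>
"""

def rows_as_tuples(xml_text, row="a"):
    """Return one tuple per row element, holding the text of each child in order.

    Hypothetical helper: the name and parameter are assumptions for illustration.
    """
    root = ET.fromstring(xml_text)
    # Iterating over an element yields its direct children;
    # an empty child has text None, which is mapped to "" here.
    return [tuple(child.text or "" for child in elem) for elem in root.iter(row)]

print(rows_as_tuples(DOC))
# [('x1', 'y1', 'z1'), ('x2', 'y2')]
```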

You can make things even more obfuscated^H^H^H^H^H^H^H^H^H^Hterse using
further lambda or list comp tricks, but I leave that as an exercise for
the perverse ;-)


--
Uche Ogbuji Fourthought, Inc.
http://uche.ogbuji.net http://4Suite.org http://fourthought.com
Use CSS to display XML, part 2 - http://www-128.ibm.com/developerworks/edu/x-dw-x-xmlcss2-i.html
Writing and Reading XML with XIST - http://www.xml.com/pub/a/2005/03/16/py-xml.html
Introducing the Amara XML Toolkit - http://www.xml.com/pub/a/2005/01/19/amara.ht
Be humble, not imperial (in design) - http://www.adtmag.com/article.asp?id=10286
Querying WordNet as XML - http://www.ibm.com/developerworks/xml/library/x-think29.html
Packaging XSLT lookup tables as EXSLT functions - http://www.ibm.com/developerworks/xml/library/x-tiplook2.html
 
