how to use structured markup tools

Sean McIlroy · Mar 19, 2005

I'm dealing with XML files in which there are lots of tags of the
following form: <a><b>x</b><c>y</c></a> (all of these letters are being
used as 'metalinguistic variables'). Not all of the tags in the file are
of that form, but that's the only type of tag I'm interested in. (For
the insatiably curious, I'm talking about a conversation log from MSN
Messenger.) What I need to do is to pull out all the x's and y's in a
form I can use. In other words, from...

..
..
<a><b>x1</b><c>y1</c></a>
..
..
<a><b>x2</b><c>y2</c></a>
..
..
<a><b>x3</b><c>y3</c></a>
..
..

...I would like to produce, for example,...

[ (x1,y1), (x2,y2), (x3,y3) ]

Now, I'm aware that there are extensive libraries for dealing with
marked-up text, but here's the thing: I think I have a reasonable
understanding of Python, but I use it in a Lisp-like way, and in
particular I only know the rudiments of how classes work. So here's
what I'm asking for:

Can anybody give me a rough idea how to come to grips with the problem
described above? Or even (dare to dream) example code? Any help will be
very much appreciated.

Peace,
STM

Fredrik Lundh · Mar 19, 2005

Sean said:
I'm dealing with XML files in which there are lots of tags of the
following form: <a><b>x</b><c>y</c></a> (all of these letters are being
used as 'metalinguistic variables'). Not all of the tags in the file are
of that form, but that's the only type of tag I'm interested in. (For
the insatiably curious, I'm talking about a conversation log from MSN
Messenger.) What I need to do is to pull out all the x's and y's in a
form I can use. In other words, from...
.
<a><b>x1</b><c>y1</c></a>
.
<a><b>x2</b><c>y2</c></a>
.
<a><b>x3</b><c>y3</c></a>
.
...I would like to produce, for example,...

[ (x1,y1), (x2,y2), (x3,y3) ]

how about:

from elementtree import ElementTree

TEXT = """\
<doc>
<a><b>x1</b><c>y1</c></a>
<a><b>x2</b><c>y2</c></a>
<a><b>x3</b><c>y3</c></a>
</doc>
"""

tree = ElementTree.XML(TEXT)

data = []

# collect the text of the <b> and <c> children of every <a> element
for elem in tree.findall(".//a"):
    data.append((elem.findtext("b"), elem.findtext("c")))

print data

=> [('x1', 'y1'), ('x2', 'y2'), ('x3', 'y3')]
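
For anyone reading this on a current Python: the snippet above targets the standalone elementtree package and Python 2's print statement. The following is only a rough Python 3 sketch of the same idea, using the xml.etree.ElementTree module from the standard library. It assumes each <a> element wraps a <b> and a <c> child, as the findtext("b") and findtext("c") calls imply; the <other> element, the sample text, and the extract_pairs name are made up for illustration and are not taken from a real MSN log.

```python
# Python 3 sketch using the ElementTree module bundled with the standard library.
import xml.etree.ElementTree as ET

# Hypothetical sample data; <other> stands in for tags we do not care about.
TEXT = """\
<doc>
<a><b>x1</b><c>y1</c></a>
<other>ignored</other>
<a><b>x2</b><c>y2</c></a>
<a><b>x3</b><c>y3</c></a>
</doc>
"""

def extract_pairs(xml_text):
    """Return one (b-text, c-text) tuple per <a> element in the document."""
    root = ET.fromstring(xml_text)
    # iter("a") visits every element named "a", at any depth below the root
    return [(a.findtext("b"), a.findtext("c")) for a in root.iter("a")]

pairs = extract_pairs(TEXT)
print(pairs)
```

Elements with other names are simply skipped, which matches the "only type of tag I'm interested in" requirement. If an <a> happens to lack a <b> or <c> child, findtext returns None for that slot rather than raising.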

more here:

http://effbot.org/zone/element-index.htm

</F>

Sean McIlroy · Mar 19, 2005

Exactly what I was looking for. Thanks.

Uche Ogbuji · Mar 23, 2005

There are many tools you can use to get this done in Python. Here's a
recipe using Amara ( http://www.xml.com/pub/a/2005/01/19/amara.html )

DOC = """\
<matrix>
<a><b>x1</b><c>y1</c></a>
<a><b>x2</b><c>y2</c></a>
<a><b>x3</b><c>y3</c></a>
</matrix>
"""

from amara import binderytools

matrix = []
# pushbind yields one bound object per <a> element it encounters
for row in binderytools.pushbind(u'a', string=DOC):
    matrix.append((unicode(row.b), unicode(row.c)))

print matrix

Which outputs:

[(u'x1', u'y1'), (u'x2', u'y2'), (u'x3', u'y3')]

If your matrix actually has a variable or previously unknown number of
columns (e.g. <a><b>x1</b><c>y1</c><d>z1</d></a> ), the following
version of the for loop is a more general solution:

for row in binderytools.pushbind(u'a', string=DOC):
    matrix.append(tuple([ unicode(e) for e in row.xml_xpath(u'*') ]))

Same output, of course. I even tested it for you in Amara 0.9.4. And
what the heck, while I was there, I added it to the demos.
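
For readers who do not have Amara installed, the variable-column idea can also be roughed out with the standard library alone. This is a sketch in Python 3 with xml.etree.ElementTree, not the Amara recipe itself; the rows_as_tuples name, its row_tag parameter, and the sample document are assumptions made up for illustration.

```python
# Python 3 sketch: one tuple per row element, however many children it has.
import xml.etree.ElementTree as ET

# Hypothetical sample with rows of different widths.
DOC = """\
<matrix>
<a><b>x1</b><c>y1</c><d>z1</d></a>
<a><b>x2</b><c>y2</c></a>
</matrix>
"""

def rows_as_tuples(xml_text, row_tag="a"):
    """Collect the text of each direct child of every row_tag element."""
    root = ET.fromstring(xml_text)
    # Iterating over an element yields its direct children in document order.
    return [tuple(child.text for child in row) for row in root.iter(row_tag)]

matrix = rows_as_tuples(DOC)
print(matrix)
```

Because the children are taken in document order without naming them, the same function handles two-column and three-column rows alike.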

You can make things even more obfuscated^H^H^H^H^H^H^H^H^H^Hterse using
further lambda or list comp tricks, but I leave that as an exercise for
the perverse ;-)

--
Uche Ogbuji Fourthought, Inc.
http://uche.ogbuji.net http://4Suite.org http://fourthought.com
Use CSS to display XML, part 2 - http://www-128.ibm.com/developerworks/edu/x-dw-x-xmlcss2-i.html
Writing and Reading XML with XIST - http://www.xml.com/pub/a/2005/03/16/py-xml.html
Introducing the Amara XML Toolkit - http://www.xml.com/pub/a/2005/01/19/amara.ht
Be humble, not imperial (in design) - http://www.adtmag.com/article.asp?id=10286
Querying WordNet as XML - http://www.ibm.com/developerworks/xml/library/x-think29.html
Packaging XSLT lookup tables as EXSLT functions - http://www.ibm.com/developerworks/xml/library/x-tiplook2.html
