Extracting XML from HTML

kyosohma

Hi,

I am attempting to extract some XML from an HTML document that I get
returned from a form based web page. For some reason, I cannot figure
out how to do this. I thought I could use the minidom module to do it,
but all I get is a screwy traceback:

Traceback (most recent call last):
  File "\\mcisnt1\repl$\Scripts\PythonPackages\Development\clippy\xml_parser.py", line 69, in ?
    inst = ApptParser(url)
  File "\\mcisnt1\repl$\Scripts\PythonPackages\Development\clippy\xml_parser.py", line 19, in __init__
    xml = self.getXml(url)
  File "\\mcisnt1\repl$\Scripts\PythonPackages\Development\clippy\xml_parser.py", line 30, in getXml
    doc = xml.dom.minidom.parse(f)
  File "C:\Python24\lib\xml\dom\minidom.py", line 1915, in parse
    return expatbuilder.parse(file)
  File "C:\Python24\lib\xml\dom\expatbuilder.py", line 928, in parse
    result = builder.parseFile(file)
  File "C:\Python24\lib\xml\dom\expatbuilder.py", line 207, in parseFile
    parser.Parse(buffer, 0)
ExpatError: mismatched tag: line 1, column 357

Here's a sample of the html:

<html>
<body>
lots of screwy text including divs and spans
<Row status="o">
<RecordNum>1126264</RecordNum>
<Make>Mitsubishi</Make>
<Model>Mirage DE</Model>
</Row>
</body>
</html>

What's the best way to get at the XML? Do I need to somehow parse it
using the HTMLParser and then parse that with minidom or what?

Thanks a lot!

Mike
 
Paul Boddie

What's the best way to get at the XML? Do I need to somehow parse it
using the HTMLParser and then parse that with minidom or what?

Probably easiest is to use an XML processing toolkit or library which
supports HTML parsing. Since the libxml2 library (written in C) makes
a fairly good job of HTML parsing, I would suggest either libxml2dom
(for a DOM-like API) or lxml (for an ElementTree-like API) as suitable
Python wrappers of libxml2. Of course, HTMLParser or SGMLParser should
work, but the programming style is a bit more convoluted unless you're
used to XML processing using a SAX-like API.

Paul

P.S. I'm biased towards libxml2dom, being the developer, but I use it
routinely and it generally does the job for me.
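
For illustration only (this sketch is not from the original post), the lxml route suggested above might look roughly like this, assuming lxml is installed and the page has been saved locally as page.html; note that the HTML parser lowercases tag names:

from lxml import etree

# Parse with a forgiving HTML parser rather than an XML parser
tree = etree.parse('page.html', etree.HTMLParser())

# The non-HTML tags survive the parse (lowercased) and can be queried directly
for row in tree.findall('.//row'):
    print row.findtext('recordnum'), row.findtext('make'), row.findtext('model')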
 
kyosohma

Probably easiest is to use an XML processing toolkit or library which
supports HTML parsing. Since the libxml2 library (written in C) makes
a fairly good job of HTML parsing, I would suggest either libxml2dom
(for a DOM-like API) or lxml (for an ElementTree-like API) as suitable
Python wrappers of libxml2. Of course, HTMLParser or SGMLParser should
work, but the programming style is a bit more convoluted unless you're
used to XML processing using a SAX-like API.

Paul

P.S. I'm biased towards libxml2dom, being the developer, but I use it
routinely and it generally does the job for me.

I have lxml installed and I appear to also have libxml2dom installed.
I know lxml has decent docs, but I don't see much for yours. Is this
the only place to go: http://www.boddie.org.uk/python/libxml2dom.html ?

Mike
 
Gabriel Genellina

I am attempting to extract some XML from an HTML document that I get
returned from a form based web page. For some reason, I cannot figure
out how to do this. I thought I could use the minidom module to do it,
but all I get is a screwy traceback:

Traceback (most recent call last):
  File "C:\Python24\lib\xml\dom\expatbuilder.py", line 207, in parseFile
    parser.Parse(buffer, 0)
ExpatError: mismatched tag: line 1, column 357

So your HTML is not a well-formed XML document (like many HTML pages), and
you can't use an XML parser on it; even a valid HTML document may not be
valid XML. Let's try with some mismatched tags:

py> text = '''<html>
.... <body>
.... <p>lots of <div>screwy text including divs and <span>spans</p>
.... <Row status="o">
.... <RecordNum>1126264</RecordNum>
.... <Make>Mitsubishi</Make>
.... <Model>Mirage DE</Model>
.... </Row>
.... </body>
.... </html>'''
py>
py> import xml.dom.minidom
py> doc = xml.dom.minidom.parseString(text)
Traceback (most recent call last):
....
xml.parsers.expat.ExpatError: mismatched tag: line 3, column 60

You will need a more robust parser, like BeautifulSoup
<http://www.crummy.com/software/BeautifulSoup/>

py> from BeautifulSoup import BeautifulSoup
py> soup = BeautifulSoup(text)
py> for row in soup.findAll("row"):
....     print row.recordnum, row.make.contents, row.model.string
....
<recordnum>1126264</recordnum> [u'Mitsubishi'] Mirage DE

Depending on your document, you may prefer to extract the XML blocks using
BeautifulSoup, and then parse each one using BeautifulStoneSoup (the XML
parser) or xml.etree.ElementTree.
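
A rough sketch of that two-step approach (not from the original post; it assumes the old BeautifulSoup 3 import style and Python 2.5's xml.etree, and reuses the text variable from the example above):

from BeautifulSoup import BeautifulSoup
from xml.etree import ElementTree

soup = BeautifulSoup(text)
for row in soup.findAll("row"):
    # str(row) gives just this block's markup, which is well-formed XML on its own
    elem = ElementTree.fromstring(str(row))
    print elem.findtext("recordnum"), elem.findtext("make"), elem.findtext("model")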
 
Stefan Behnel

I am attempting to extract some XML from an HTML document that I get
returned from a form based web page. For some reason, I cannot figure
out how to do this.
Here's a sample of the html:

<html>
<body>
lots of screwy text including divs and spans
<Row status="o">
<RecordNum>1126264</RecordNum>
<Make>Mitsubishi</Make>
<Model>Mirage DE</Model>
</Row>
</body>
</html>

What's the best way to get at the XML? Do I need to somehow parse it
using the HTMLParser and then parse that with minidom or what?

lxml makes this pretty easy:
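
The code sample did not survive in this archive. A reconstruction of the idea (not the original snippet; thepage.html stands in for wherever the HTML actually lives) would be to parse the page with lxml's forgiving HTML parser:

from lxml import etree

tree = etree.parse("thepage.html", etree.HTMLParser())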

This is actually a tree that can be treated as XML, e.g. with XPath, XSLT,
tree iteration, ... You will also get plain XML when you serialise it to XML:
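
Again reconstructed rather than original, continuing from the tree above: serialising gives back a plain XML string.

xml_string = etree.tostring(tree)
print xml_string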

Note that this doesn't add any namespaces, so you will not magically get valid
XHTML or something. You could rewrite the tags by hand, though.
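
A hypothetical way to do that rewrite by hand, assuming you want everything in the XHTML namespace (again continuing from the tree above):

# Push every element into the XHTML namespace by hand (illustrative only)
for elem in tree.getiterator():
    if isinstance(elem.tag, basestring):   # skip comments and processing instructions
        elem.tag = '{http://www.w3.org/1999/xhtml}' + elem.tag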

Stefan
 
Paul Boddie

I have lxml installed and I appear to also have libxml2dom installed.
I know lxml has decent docs, but I don't see much for yours. Is this
the only place to go: http://www.boddie.org.uk/python/libxml2dom.html ?

Unfortunately yes, with regard to online documentation, although the
distribution contains API documentation, and the package has
docstrings for most of the public classes, functions and methods. And
the API is a lot like the PyXML and minidom APIs, too.

Paul
 
kyosohma

So your HTML is not a well-formed XML document (like many HTML pages), and
you can't use an XML parser on it; even a valid HTML document may not be
valid XML. Let's try with some mismatched tags:
Depending on your document, you may prefer to extract the XML blocks using
BeautifulSoup, and then parse each one using BeautifulStoneSoup (the XML
parser) or xml.etree.ElementTree.

Thanks for the reply. I already knew about BeautifulSoup but I was
hoping to avoid installing *yet another module* on my PC. I got it to
work with lxml, but it's not very pretty. See my reply to Stefan.

Mike
 
kyosohma

lxml makes this pretty easy:


This is actually a tree that can be treated as XML, e.g. with XPath, XSLT,
tree iteration, ... You will also get plain XML when you serialise it to XML:


Note that this doesn't add any namespaces, so you will not magically get valid
XHTML or something. You could rewrite the tags by hand, though.

Stefan

I got it to work with lxml. See below:

def Parser(filename):
    parser = etree.HTMLParser()
    tree = etree.parse(r'path/to/nextpage.htm', parser)
    xml_string = etree.tostring(tree)
    events = ("recordnum", "primaryowner", "customeraddress")
    context = etree.iterparse(StringIO(xml_string), tag='')
    for action, elem in context:
        tag = elem.tag
        if tag == 'primaryowner':
            owner = elem.text
        elif tag == 'customeraddress':
            address = elem.text
        else:
            pass

    print 'Primary Owner: %s' % owner
    print 'Address: %s' % address

Does this make sense? It works pretty well, but I don't really
understand everything that I'm doing.

Mike
 
George Sakkis

Thanks for the reply. I already knew about BeautifulSoup but I was
hoping to avoid installing *yet another module* on my PC.

That's a poor excuse for a self-contained module in a single file.
"Installing" it can be as simple as dropping it into the same directory as
the module that imports it. Given that you can do in 2 lines what took you
around 15 with lxml, I wouldn't think twice about it.
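
For what it's worth, a guess at the two-line version alluded to here (not from the original post; it borrows the tag names from Mike's snippet above, and nextpage.htm stands in for the saved form output):

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(open("nextpage.htm").read())
print [(row.primaryowner.string, row.customeraddress.string) for row in soup.findAll("row")]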

George
 
Stefan Behnel

George said:
Given that you can do in 2 lines what
took you around 15 with lxml, I wouldn't think twice about it.

Don't judge a tool by beginner's code.

Stefan
 
Laurent Pointal

(e-mail address removed) wrote:
I got it to work with lxml. See below:

def Parser(filename):
    parser = etree.HTMLParser()
    tree = etree.parse(r'path/to/nextpage.htm', parser)
    xml_string = etree.tostring(tree)
    events = ("recordnum", "primaryowner", "customeraddress")
    context = etree.iterparse(StringIO(xml_string), tag='')
    for action, elem in context:
        tag = elem.tag
        if tag == 'primaryowner':
            owner = elem.text
        elif tag == 'customeraddress':
            address = elem.text
        else:
            pass

    print 'Primary Owner: %s' % owner
    print 'Address: %s' % address

Does this make sense? It works pretty well, but I don't really
understand everything that I'm doing.

Mike

Question: once you have your document in an XML tree in memory, why do you
switch to event-based handling to extract your data?

Try to directly manipulate the tree.

parser = etree.HTMLParser()
tree = etree.parse(r'path/to/nextpage.htm', parser)
myrows = tree.findall(".//Row")

# Then work with the sub-elements.
for r in myrows:
    rnumelem = r.find("RecordNum")
    makeelem = r.find("Make")
    modelelem = r.find("Model")

& co.
 
Stefan Behnel

Does this make sense? It works pretty well, but I don't really
understand everything that I'm doing.

def Parser(filename):

It's uncommon to give a function a capitalised name, unless it's a factory
function (which this isn't).

    parser = etree.HTMLParser()
    tree = etree.parse(r'path/to/nextpage.htm', parser)
    xml_string = etree.tostring(tree)

What you do here is parse the HTML page and serialise it back into an XML
string. No need to do that - once it's a tree, you can work with it. lxml is a
highly integrated set of tools, no matter if you use it for XML or HTML.

    events = ("recordnum", "primaryowner", "customeraddress")

You're not using this anywhere below, so I assume this is left-over code.

    context = etree.iterparse(StringIO(xml_string), tag='')
    for action, elem in context:
        tag = elem.tag
        if tag == 'primaryowner':
            owner = elem.text
        elif tag == 'customeraddress':
            address = elem.text
        else:
            pass

    print 'Primary Owner: %s' % owner
    print 'Address: %s' % address

Admittedly, iterparse() doesn't currently support HTML (although this might
become possible in lxml 2.0).

You could do this more easily in a couple of ways. One is to use XPath:

print [el.text for el in tree.xpath("//primaryowner|//customeraddress")]

Note that this works directly on the tree that you retrieved right in the
third line of your code.

Another (and likely simpler) solution is to first find the "Row" element and
then start from that:

row = tree.find("//Row")
print row.findtext("primaryowner")
print row.findtext("customeraddress")

See the lxml tutorial on this, as well as the documentation on XPath support
and tree iteration:

http://codespeak.net/lxml/xpathxslt.html#xpath
http://codespeak.net/lxml/api.html#iteration

Hope this helps,
Stefan
 
kyosohma

It's uncommon to give a function a capitalised name, unless it's a factory
function (which this isn't).

Yeah. I was going to use a class (and I still might), so that's how it
got capitalized.

You're not using this anywhere below, so I assume this is left-over code.

I realized I didn't need that line soon after I posted. Sorry about
that!

You could do this more easily in a couple of ways. One is to use XPath:

print [el.text for el in tree.xpath("//primaryowner|//customeraddress")]

This works quite well. Wish I'd thought of it.

Note that this works directly on the tree that you retrieved right in the
third line of your code.

Another (and likely simpler) solution is to first find the "Row" element and
then start from that:

row = tree.find("//Row")
print row.findtext("primaryowner")
print row.findtext("customeraddress")

I tried this your way and Laurent's way and both give me this error:

AttributeError: 'NoneType' object has no attribute 'findtext'

See the lxml tutorial on this, as well as the documentation on XPath support
and tree iteration:

http://codespeak.net/lxml/xpathxslt.html#xpath
http://codespeak.net/lxml/api.html#iteration

Hope this helps,
Stefan

I'm not sure what George's deal is. I'm not a beginner with Python,
just with lxml. I don't have the hundreds of Python modules
memorized, and I have yet to meet anyone who does. Even if I had used
Beautiful Soup, my code would probably still suck, and I was told
explicitly by my boss to avoid adding new dependencies to my programs
whenever possible.

Thanks for the help. I'll add the list comprehension to my code.

Mike
 
Stefan Behnel

I tried this your way and Laurent's way and both give me this error:

AttributeError: 'NoneType' object has no attribute 'findtext'

Well, error handling is up to you. If find() doesn't find what you are
looking for, it will return None. Note that tag names are case sensitive - or
maybe there are namespaces involved; I can't tell from the example you posted.
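
A small hypothetical illustration of both points, reusing the hypothetical nextpage.htm from the earlier snippets: lxml's HTML parser stores tag names in lower case, and the result of find() should be checked before use.

from lxml import etree

tree = etree.parse("nextpage.htm", etree.HTMLParser())
print tree.find("//Row")         # None: the HTML parser lowercased the tag to 'row'

row = tree.find("//row")
if row is not None:              # guard before calling findtext() on the result
    print row.findtext("recordnum")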

Stefan
 
