Noob trying to parse bad HTML using xml.etree.ElementTree

Morten Guldager

Aloha Friends!

I'm trying to process some HTML using xml.etree.ElementTree.
The problem is that the HTML I'm trying to read has some tags that are not
properly closed, such as the <img> shown in line 8 below.

1 from xml.etree import ElementTree
2
3 tree = ElementTree
4 e = tree.fromstring(
5 """
6 <html>
7 <body>
8 <img src='mogul.jpg'>
9 </body>
10 </html>
11 """)

Python whines: xml.etree.ElementTree.ParseError: mismatched tag: line 5,
column 14

I definitely do want to work DOM style, having the whole shebang loaded
into a nice structure before I start the real work.

The question is whether it's possible to tweak xml.etree.ElementTree to accept
and understand sloppy HTML, or whether you have suggestions for a similarly
easy-to-use framework, preferably among the included batteries?
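
One batteries-only direction that might fit the question above, offered as a rough sketch rather than a definitive answer: the standard library's html.parser.HTMLParser tolerates sloppy markup, and its events can be fed into xml.etree.ElementTree.TreeBuilder to get an ordinary Element tree. The names LenientTreeParser, VOID_TAGS and parse_html are made up for this illustration, and the recovery rules (close void elements such as <img> immediately, ignore stray closing tags, auto-close unclosed children) are simplifying assumptions, not what a browser does.

```python
from html.parser import HTMLParser
from xml.etree import ElementTree

# HTML "void" elements never take a closing tag (assumed list for this sketch).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class LenientTreeParser(HTMLParser):
    """Hypothetical helper: turns HTMLParser events into an Element tree."""

    def __init__(self):
        super().__init__()
        self._builder = ElementTree.TreeBuilder()
        self._open = []  # names of the elements that are currently open

    def handle_starttag(self, tag, attrs):
        # Attributes without a value (e.g. <input disabled>) arrive as None.
        self._builder.start(tag, {k: (v or "") for k, v in attrs})
        if tag in VOID_TAGS:
            self._builder.end(tag)  # close void elements right away
        else:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag not in self._open:
            return  # stray closing tag, or the closing half of a void tag
        # Close any unclosed children first, then the tag itself.
        while self._open:
            name = self._open.pop()
            self._builder.end(name)
            if name == tag:
                break

    def handle_data(self, data):
        if self._open:  # drop text outside the root element
            self._builder.data(data)

    def close(self):
        super().close()  # flush any text still buffered by HTMLParser
        while self._open:  # close whatever was left open at end of input
            self._builder.end(self._open.pop())
        return self._builder.close()


def parse_html(text):
    """Parse sloppy HTML and return the root Element."""
    parser = LenientTreeParser()
    parser.feed(text)
    return parser.close()


root = parse_html("""
<html>
<body>
<img src='mogul.jpg'>
</body>
</html>
""")
print(root.find("body/img").get("src"))
```

The result is a plain ElementTree Element, so find(), iter() and friends work DOM style once the whole document is loaded. For seriously broken pages a dedicated third-party HTML parser is likely more robust than this sketch.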
 
