Mike Driscoll
Hi,
I need to parse a fairly complex HTML page that has XML embedded in
it. I've done parsing before with the xml.dom.minidom module on just
plain XML, but I cannot get it to work with this HTML page.
The XML looks like this:
<Row status="o">
<Relationship>Owner</Relationship>
<Priority>1</Priority>
<StartDate>07/16/2007</StartDate>
<StopsExist>No</StopsExist>
<Name>Doe, John</Name>
<Address>1905 S 3rd Ave , Hicksville IA 99999</Address>
</Row>
<Row status="o">
<Relationship>Owner</Relationship>
<Priority>2</Priority>
<StartDate>07/16/2007</StartDate>
<StopsExist>No</StopsExist>
<Name>Doe, Jane</Name>
<Address>1905 S 3rd Ave , Hicksville IA 99999</Address>
</Row>
The rows appear to be enclosed in the tags <XML
id="grdRegistrationInquiryCustomers"><BoundData>
The rest of the document is HTML, JavaScript, div tags, etc. I need the
information only from the row where the Relationship tag = Owner and
the Priority tag = 1. The rest I can ignore. When I tried parsing the
whole page with minidom, I got "ExpatError: mismatched tag: line 1,
column 357", so I think the HTML is probably malformed.
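To make the goal concrete, here is a minimal sketch of the kind of thing I am after. It assumes the XML island is itself well-formed and is closed by a matching </BoundData> tag (I have only shown the opening tags above, so that is a guess). SAMPLE_PAGE is a made-up stand-in for the real page, and the helper names text_of and find_primary_owner are hypothetical. The idea is to cut the island out with a regular expression so the surrounding HTML never reaches expat, then let minidom handle only the XML part.

```python
import re
from xml.dom import minidom

# Hypothetical stand-in for the real page: sloppy HTML (note the unclosed
# <br>) wrapped around an XML island. The closing tags are an assumption.
SAMPLE_PAGE = """<html><body><div><br>
<XML id="grdRegistrationInquiryCustomers"><BoundData>
<Row status="o">
<Relationship>Owner</Relationship>
<Priority>1</Priority>
<StartDate>07/16/2007</StartDate>
<StopsExist>No</StopsExist>
<Name>Doe, John</Name>
<Address>1905 S 3rd Ave , Hicksville IA 99999</Address>
</Row>
<Row status="o">
<Relationship>Owner</Relationship>
<Priority>2</Priority>
<StartDate>07/16/2007</StartDate>
<StopsExist>No</StopsExist>
<Name>Doe, Jane</Name>
<Address>1905 S 3rd Ave , Hicksville IA 99999</Address>
</Row>
</BoundData></XML>
</div></body></html>"""


def text_of(row, tag):
    """Return the stripped text of the first <tag> element in row, or ''."""
    nodes = row.getElementsByTagName(tag)
    if not nodes or nodes[0].firstChild is None:
        return ""
    return nodes[0].firstChild.data.strip()


def find_primary_owner(page):
    """Return a dict for the Owner row with Priority 1, or None."""
    # Extract only <BoundData>...</BoundData> so minidom sees plain XML.
    match = re.search(r"<BoundData>.*?</BoundData>", page, re.DOTALL)
    if match is None:
        return None
    dom = minidom.parseString(match.group(0))
    for row in dom.getElementsByTagName("Row"):
        if (text_of(row, "Relationship") == "Owner"
                and text_of(row, "Priority") == "1"):
            fields = ("Name", "Address", "StartDate", "StopsExist")
            return dict((tag, text_of(row, tag)) for tag in fields)
    return None


owner = find_primary_owner(SAMPLE_PAGE)
```

If the island turns out not to be well-formed on its own, this would fail with the same ExpatError, and a more forgiving parser would be needed instead.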
I looked at BeautifulSoup, but it seems to separate its HTML
processing from its XML processing. Can someone give me some pointers?
I am currently using Python 2.5 on Windows XP. I will be using
Internet Explorer 6 since the document will not display correctly in
Firefox.
Thank you very much!
Mike