Handling XML embedded within XML

Avowkind

I have a log file that contains a dump of an XML message:

.... rubbish
///asd laksj aslf
<nif_DEBUG time="Fri, 16 May 2008 13:40:17, 330">
<?xml version="1.0" encoding="UTF-8"?>
<ns>
<PDQ Lang="fr-FR" ID="XM;1928">content</PDQ>
</ns>
</nif_DEBUG>
... more junk
.... then more xml
This example is, of course, a simplified summary.

I want to write a streaming filter that throws out all the junk
and returns each complete XML message as a clean string. Ideally
I also want to select only the messages I am interested in.

For example, the output from the above would be:
<?xml version="1.0" encoding="UTF-8"?>
<ns>
<PDQ Lang="fr-FR" ID="XM;1928">content</PDQ>
</ns>

Two problems:
1. Clearing away junk that is nothing like XML.
2. Handling the <?xml declaration that sits inside the other XML
tags.

The first I can handle relatively simply by reading through the string
until I reach what looks like a valid XML tag. I can then pass the rest
on to an XML parser such as xml.sax. However, the parser then raises:
XMLSyntaxError: XML declaration allowed only at the start of the
document

I would like a more forgiving parser that reports bad XML through a
callback, so that I can simply tell it to carry on.
Bear in mind also that I probably will not have the end of the stream
when I start processing.
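
One direction I am considering, sketched under assumptions: if each embedded message is cut out as its own string first, it can be parsed as a separate document, so the declaration is no longer in the middle of anything. The sketch below uses the standard library's xml.etree.ElementTree rather than xml.sax, drops the declaration before parsing, and returns None for a malformed message instead of stopping. The name parse_message is hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from typing import Optional

# Matches a leading XML declaration such as <?xml version="1.0" encoding="UTF-8"?>
_DECLARATION = re.compile(r"^\s*<\?xml[^>]*\?>")


def parse_message(message: str) -> Optional[ET.Element]:
    """Parse one extracted message as a standalone document.

    Returns the root element, or None if the message is not well-formed,
    so the caller can skip it and carry on with the next one.
    """
    # The text is already decoded, so the declaration carries no extra
    # information here and can be removed before parsing.
    body = _DECLARATION.sub("", message, count=1)
    try:
        return ET.fromstring(body)
    except ET.ParseError:
        return None
```

This is not a truly forgiving parser: a bad message is rejected as a whole rather than partially recovered, and it only works once a complete message is available.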

All suggestions and pointers welcome
Andrew
 
