Avowkind
I have a log file that contains a dump of an XML message:
.... rubbish
///asd laksj aslf
<nif_DEBUG time="Fri, 16 May 2008 13:40:17, 330">
<?xml version="1.0" encoding="UTF-8"?>
<ns>
<PDQ Lang="fr-FR" ID="XM;1928">content</PDQ>
</ns>
</nif_DEBUG>
... more junk
.... then more xml
""")
This example is, of course, abbreviated.
I want to write a streaming filter that throws out all the junk
and returns each complete XML message as a string. Ideally I also
want to be able to select which messages I am interested in.
For example, the output from the above would be:
<?xml version="1.0" encoding="UTF-8"?>
<ns>
<PDQ Lang="fr-FR" ID="XM;1928">content</PDQ>
</ns>
There are two problems:
1. Clearing away junk that is nothing like XML.
2. Handling the <?xml declaration that sits inside the other XML
tags.
The first I can handle relatively simply by reading through the string
until I find what looks like a valid XML tag. I can then pass the rest
on to an XML parser like xml.sax. However, the parser then raises:
XMLSyntaxError: XML declaration allowed only at the start of the
document
I would like a more forgiving parser that reports bad XML through a
callback, from which I can simply tell it to carry on.
Bear in mind also that I probably will not have the end of the stream
when I start processing.
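To make the goal concrete, here is a rough sketch of the kind of filter I mean. It is not the forgiving parser I am asking about; it only does plain string matching, and it assumes every message is wrapped in a nif_DEBUG element exactly as in the sample. The names MessageFilter, feed and wanted are made up for illustration.

```python
import re

# Assumption: every message is wrapped in <nif_DEBUG ...> ... </nif_DEBUG>,
# as in the sample log above.
OPEN_PREFIX = "<nif_DEBUG"
OPEN = re.compile(r"<nif_DEBUG[^>]*>")
CLOSE = "</nif_DEBUG>"


class MessageFilter:
    """Incrementally pull wrapped XML messages out of a noisy text stream."""

    def __init__(self, wanted=None):
        self.buf = ""
        # Optional predicate deciding which messages to keep.
        self.wanted = wanted or (lambda msg: True)

    def feed(self, chunk):
        """Add a chunk of log text; return the messages it completed."""
        self.buf += chunk
        out = []
        while True:
            start = OPEN.search(self.buf)
            if start is None:
                # No complete opening tag yet. Keep a partial tag if one
                # has begun, otherwise only a short tail in case the tag
                # name itself is split across two chunks.
                cut = self.buf.find(OPEN_PREFIX)
                if cut == -1:
                    cut = max(len(self.buf) - len(OPEN_PREFIX) + 1, 0)
                self.buf = self.buf[cut:]
                return out
            end = self.buf.find(CLOSE, start.end())
            if end == -1:
                # Message not finished: drop the junk before it and wait.
                self.buf = self.buf[start.start():]
                return out
            # The text between the wrapper tags is the message itself.
            msg = self.buf[start.end():end].strip()
            self.buf = self.buf[end + len(CLOSE):]
            if self.wanted(msg):
                out.append(msg)
```

Each returned string starts with its own <?xml declaration, so it should be acceptable to an ordinary parser on its own. Passing something like wanted=lambda m: 'Lang="fr-FR"' in m would keep only the French messages.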
All suggestions and pointers welcome
Andrew