expat having problems with entities (&)

nnguyen · Dec 11, 2009

I need expat to parse this block of xml:

<datafield tag="991">
<subfield code="b">c-P&amp;P</subfield>
<subfield code="h">LOT 3677</subfield>
<subfield code="m">(F)</subfield>
</datafield>

I need to parse the xml and return a dictionary that follows roughly
the same layout as the xml. Currently the code for the class handling
this is:

from xml.parsers import expat

class XML2Map():

    def __init__(self):
        """ """
        self.parser = expat.ParserCreate()

        self.parser.StartElementHandler = self.start_element
        self.parser.EndElementHandler = self.end_element
        self.parser.CharacterDataHandler = self.char_data

        self.map = []  # not a dictionary

        self.current_tag = ''
        self.current_subfields = []
        self.current_sub = ''
        self.current_data = ''

    def parse_xml(self, xml_text):
        self.parser.Parse(xml_text, 1)

    def start_element(self, name, attrs):
        if name == 'datafield':
            self.current_tag = attrs['tag']

        elif name == 'subfield':
            self.current_sub = attrs['code']

    def char_data(self, data):
        self.current_data = data

    def end_element(self, name):
        if name == 'subfield':
            self.current_subfields.append([self.current_sub,
                                           self.current_data])

        elif name == 'datafield':
            self.map.append({'tag': self.current_tag,
                             'subfields': self.current_subfields})
            # resetting the values for the next subfields
            self.current_subfields = []

Right now my problem is that when it parses the subfield element
with the data "c-P&P", it doesn't deliver the whole string; instead
it breaks it into "c-P", "&", and "P". I'm not an expert with expat,
and I couldn't find much information on how it handles specific
entities.

In the resulting map, instead of:

{'tag': u'991', 'subfields': [[u'b', u'c-P&P'], [u'h', u'LOT 3677'],
[u'm', u'(F)']], 'inds': [u' ', u' ']}

I get this:

{'tag': u'991', 'subfields': [[u'b', u'P'], [u'h', u'LOT 3677'],
[u'm', u'(F)']], 'inds': [u' ', u' ']}

In the debugger, I can see that current_data gets assigned "c-P", then
"&", and then "P".
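
The splitting described above can be observed in isolation with a few lines. This is only a minimal sketch: it is written for Python 3 (the thread itself predates it), the helper name `record_chunk` is made up for illustration, and it assumes the ampersand is written as `&amp;` in the input, since a bare `&` is not well-formed XML. How many pieces expat delivers is not something to rely on, so the sketch only shows that joining them gives the full text.

```python
from xml.parsers import expat

chunks = []  # every piece of text expat hands to the handler, in order


def record_chunk(data):
    # Hypothetical helper: just remember each callback's payload.
    chunks.append(data)


parser = expat.ParserCreate()
parser.CharacterDataHandler = record_chunk
parser.Parse('<subfield code="b">c-P&amp;P</subfield>', True)

print(chunks)           # how the text was split (may differ between expat versions)
print(''.join(chunks))  # the complete text once the pieces are joined
```

A handler that keeps only the last piece it saw ends up with just the tail of the text, which matches the `u'P'` in the output above.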

Any ideas on any expat tricks I'm missing out on? I'm also inclined to
try another parser that can keep the string together when there are
entities, or at least ampersands.

nnguyen · Dec 11, 2009

I forgot: ignore the "'inds':..." in the output above. It's just
another part of the xml I had to parse and isn't important to this
discussion.

Rami Chowdhury · Dec 11, 2009

IIRC expat explicitly does not guarantee that character data will be
handed to the CharacterDataHandler in complete blocks. If you're
certain you want to stay at such a low level, I would just modify your
char_data method to append character data to self.current_data rather
than replacing it. Personally, if I had the option (e.g. Python 2.5+)
I'd use ElementTree...
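
For comparison, here is roughly what the ElementTree route might look like. It is a sketch under assumptions rather than a drop-in replacement: it is written for Python 3, the function name `datafield_to_map` is invented for this example, and it assumes the document's root element is a single `<datafield>` like the one in the question.

```python
import xml.etree.ElementTree as ET


def datafield_to_map(xml_text):
    """Build {'tag': ..., 'subfields': [[code, text], ...]} from one <datafield>."""
    root = ET.fromstring(xml_text)  # assumes <datafield> is the root element
    return {
        'tag': root.get('tag'),
        # .text holds the element's whole text, with entities already resolved
        'subfields': [[sub.get('code'), sub.text or '']
                      for sub in root.findall('subfield')],
    }


sample = """<datafield tag="991">
<subfield code="b">c-P&amp;P</subfield>
<subfield code="h">LOT 3677</subfield>
<subfield code="m">(F)</subfield>
</datafield>"""

result = datafield_to_map(sample)
print(result)
```

Because ElementTree assembles each element's text before handing it over, there is no chunk handling to get wrong here.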

nnguyen · Dec 11, 2009

Well the appending trick worked. From some logging I figured out that
it was reading through those bits of current_data before getting to
the subfield ending element (which is kinda obvious when you think
about it). So I just used a += and made sure to clear out current_data
when it hits a subfield ending element.
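
The change described above might look something like the following. It is a sketch of the idea in Python 3, not the exact code that ended up being used. One detail is an addition of this sketch: the buffer is also cleared when a subfield starts, so that the newlines between elements (which expat reports as character data too) don't end up in front of the first subfield's text.

```python
from xml.parsers import expat


class XML2Map:
    def __init__(self):
        self.parser = expat.ParserCreate()
        self.parser.StartElementHandler = self.start_element
        self.parser.EndElementHandler = self.end_element
        self.parser.CharacterDataHandler = self.char_data

        self.map = []
        self.current_tag = ''
        self.current_subfields = []
        self.current_sub = ''
        self.current_data = ''

    def parse_xml(self, xml_text):
        self.parser.Parse(xml_text, True)

    def start_element(self, name, attrs):
        if name == 'datafield':
            self.current_tag = attrs['tag']
        elif name == 'subfield':
            self.current_sub = attrs['code']
            self.current_data = ''  # drop whitespace seen between elements

    def char_data(self, data):
        # expat may call this several times per element, so accumulate.
        self.current_data += data

    def end_element(self, name):
        if name == 'subfield':
            self.current_subfields.append([self.current_sub, self.current_data])
            self.current_data = ''  # ready for the next subfield
        elif name == 'datafield':
            self.map.append({'tag': self.current_tag,
                             'subfields': self.current_subfields})
            self.current_subfields = []


sample = """<datafield tag="991">
<subfield code="b">c-P&amp;P</subfield>
<subfield code="h">LOT 3677</subfield>
<subfield code="m">(F)</subfield>
</datafield>"""

converter = XML2Map()
converter.parse_xml(sample)
print(converter.map)
```

The Python binding also has a `buffer_text` attribute on the parser that asks expat to coalesce text where it can. As far as I know that reduces the number of callbacks rather than guaranteeing a single one, so accumulating in the handler still seems the safer habit.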

Thanks!
