nnguyen
I need expat to parse this block of XML:
<datafield tag="991">
<subfield code="b">c-P&amp;P</subfield>
<subfield code="h">LOT 3677</subfield>
<subfield code="m">(F)</subfield>
</datafield>
I need to parse the xml and return a dictionary that follows roughly
the same layout as the xml. Currently the code for the class handling
this is:
from xml.parsers import expat

class XML2Map():
    def __init__(self):
        """Set up the expat parser and the parsing state."""
        self.parser = expat.ParserCreate()
        self.parser.StartElementHandler = self.start_element
        self.parser.EndElementHandler = self.end_element
        self.parser.CharacterDataHandler = self.char_data
        self.map = []  # not a dictionary
        self.current_tag = ''
        self.current_subfields = []
        self.current_sub = ''
        self.current_data = ''

    def parse_xml(self, xml_text):
        self.parser.Parse(xml_text, 1)

    def start_element(self, name, attrs):
        if name == 'datafield':
            self.current_tag = attrs['tag']
        elif name == 'subfield':
            self.current_sub = attrs['code']

    def char_data(self, data):
        # overwrites whatever the previous call stored
        self.current_data = data

    def end_element(self, name):
        if name == 'subfield':
            self.current_subfields.append([self.current_sub,
                                           self.current_data])
        elif name == 'datafield':
            self.map.append({'tag': self.current_tag,
                             'subfields': self.current_subfields})
            # resetting the values for next subfields
            self.current_subfields = []
Right now my problem is that when it's parsing the subfield element
with the data "c-P&P", it's not delivering the whole string at once;
instead it breaks it into "c-P", "&", "P". I'm not an expert with
expat, and I couldn't find much information on how it handles
entities.
In the resulting map, instead of:
{'tag': u'991', 'subfields': [[u'b', u'c-P&P'], [u'h', u'LOT 3677'],
[u'm', u'(F)']], 'inds': [u' ', u' ']}
I get this:
{'tag': u'991', 'subfields': [[u'b', u'P'], [u'h', u'LOT 3677'],
[u'm', u'(F)']], 'inds': [u' ', u' ']}
In the debugger, I can see that current_data gets assigned "c-P", then
"&", and then "P".
Any ideas on any expat tricks I'm missing out on? I'm also inclined to
try another parser that can keep the string together when there are
entities, or at least ampersands.
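
For what it's worth, expat does not promise to hand over an element's
text in a single CharacterDataHandler call, so one possible workaround
is to collect the pieces and join them when the end tag arrives. The
following is only a minimal sketch of that idea, not a tested fix for
the real code: the class name XML2MapJoined and the chunks attribute
are made up for illustration, and it assumes the ampersand is written
as &amp; in the actual input (a bare & would not be well-formed XML).

```python
from xml.parsers import expat


class XML2MapJoined(object):
    """Hypothetical variant of XML2Map that joins character data chunks."""

    def __init__(self):
        self.parser = expat.ParserCreate()
        self.parser.StartElementHandler = self.start_element
        self.parser.EndElementHandler = self.end_element
        self.parser.CharacterDataHandler = self.char_data
        self.map = []
        self.current_tag = ''
        self.current_subfields = []
        self.current_sub = ''
        self.chunks = []  # assumption: pieces of text for the open subfield

    def parse_xml(self, xml_text):
        self.parser.Parse(xml_text, True)

    def start_element(self, name, attrs):
        if name == 'datafield':
            self.current_tag = attrs['tag']
            self.current_subfields = []
        elif name == 'subfield':
            self.current_sub = attrs['code']
            self.chunks = []  # start collecting text for this subfield

    def char_data(self, data):
        # may be called several times per element, so append, don't overwrite
        self.chunks.append(data)

    def end_element(self, name):
        if name == 'subfield':
            self.current_subfields.append(
                [self.current_sub, ''.join(self.chunks)])
        elif name == 'datafield':
            self.map.append({'tag': self.current_tag,
                             'subfields': self.current_subfields})
            self.current_subfields = []


xml_text = ('<datafield tag="991">'
            '<subfield code="b">c-P&amp;P</subfield>'
            '<subfield code="h">LOT 3677</subfield>'
            '<subfield code="m">(F)</subfield>'
            '</datafield>')
converter = XML2MapJoined()
converter.parse_xml(xml_text)
print(converter.map)
```

Another option that might be worth trying is setting
parser.buffer_text = True right after ParserCreate(), which asks
pyexpat to buffer contiguous text before calling the handler; joining
the chunks by hand has the advantage of not depending on the buffer
size.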