Tim Arnold
Hi,
I'm using lxml to clean up auto-generated XML so that it validates against a
DTD. I need to remove an element tag but keep its text in order. For example:
s0 = '''
<option>
<optional> first text
<someelement>ladida</someelement>
<emphasis>emphasized text</emphasis>
middle text
<anotherelement/>
last text
</optional>
</option>'''
I want to get rid of the <emphasis> tag but keep everything else as it is;
that is, I need this result:
<option>
<optional> first text
<someelement>ladida</someelement>
emphasized text
middle text
<anotherelement/>
last text
</optional>
</option>
I'm beginning to think this is an impossible task, so I'm asking here to see if
there is some method that will work. Here is what I've done so far
(outer encloses the parent, outside is the parent, inside is the child to
remove):
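from lxml import etree
import copy

def rm_tag(elem, outer, outside, inside):
    # Build a replacement for the parent element.
    newdiv = etree.Element(outside)
    newdiv.text = ''
    for e0 in elem.iter(outside):
        for i, e1 in enumerate(e0.iter()):
            if i == 0:
                # First item is the parent itself: keep its leading text.
                if e1.text: newdiv.text += e1.text
            elif e1.tag != inside:
                newdiv.append(copy.deepcopy(e1))
            elif e1.text:
                # Child to remove: its text is appended to the parent's text.
                newdiv.text += e1.text
    # Swap the rebuilt parent into the first enclosing element.
    for t in elem.iter():
        if t.tag == outer:
            t.clear()
            t.append(newdiv)
            break
    return etree.ElementTree(elem)

# el is assumed to be the parsed form of s0 above
el = etree.fromstring(s0)
print(etree.tostring(rm_tag(el, 'option', 'optional', 'emphasis'), pretty_print=True))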
But the text is messed up using this method. I see why it's wrong, but not
how to make it right.
It returns:
<option>
<optional> first text
emphasized text
<someelement>ladida</someelement>
<anotherelement/>
last text
</optional>
</option>
Maybe I should send the outside element (via tostring) to a regexp for
removing the child and return that string? Regexp? Getting desperate, hey.
Any pointers much appreciated,
--Tim Arnold
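
One possible direction, offered as a sketch rather than a definitive fix: in the ElementTree model (which lxml follows), the text after an element's closing tag is stored on that element as its tail, so dropping <emphasis> also drops "middle text" unless the tail is moved somewhere. The text and tail of the removed element have to be merged into the tail of the previous sibling, or into the parent's text when there is no previous sibling. The example below uses the standard library's xml.etree.ElementTree so it runs without lxml; unwrap_tag and _append_text are hypothetical helper names, not part of any library. lxml itself provides etree.strip_tags(tree, 'emphasis'), which is documented to do this kind of merge in one call and is probably the first thing to try.

```python
import xml.etree.ElementTree as ET


def _append_text(parent, index, text):
    """Attach text just before position `index` among parent's children."""
    if not text:
        return
    if index == 0:
        # No previous sibling: the text belongs to the parent itself.
        parent.text = (parent.text or "") + text
    else:
        # Otherwise it follows the previous sibling, i.e. it is that sibling's tail.
        prev = parent[index - 1]
        prev.tail = (prev.tail or "") + text


def unwrap_tag(parent, tag):
    """Remove every <tag> element below parent, keeping its text, children and tail in place."""
    i = 0
    while i < len(parent):
        child = parent[i]
        if child.tag != tag:
            unwrap_tag(child, tag)  # look for the tag deeper in the tree
            i += 1
            continue
        grandchildren = list(child)
        parent.remove(child)
        _append_text(parent, i, child.text)
        # Hoist the removed element's children into its old position.
        for offset, grandchild in enumerate(grandchildren):
            parent.insert(i + offset, grandchild)
        _append_text(parent, i + len(grandchildren), child.tail)
        # i is not advanced, so hoisted children are examined next.
    return parent


s0 = '''
<option>
<optional> first text
<someelement>ladida</someelement>
<emphasis>emphasized text</emphasis>
middle text
<anotherelement/>
last text
</optional>
</option>'''

root = ET.fromstring(s0)
unwrap_tag(root, "emphasis")
print(ET.tostring(root, encoding="unicode"))
```

With this approach "emphasized text" and "middle text" both end up in the tail of <someelement>, which keeps them between <someelement> and <anotherelement/> when the tree is serialized. A regexp over the serialized string should not be needed.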