lxml removing tag, keeping text order

Tim Arnold · Oct 24, 2008

Hi,
I'm using lxml to clean up auto-generated XML so that it validates against a
DTD; I need to remove an element tag but keep the text in order. For example:
s0 = '''
<option>
<optional> first text
<someelement>ladida</someelement>
<emphasis>emphasized text</emphasis>
middle text
<anotherelement/>
last text
</optional>
</option>'''

I want to get rid of the <emphasis> tag but keep everything else as it is;
that is, I need this result:

<option>
<optional> first text
<someelement>ladida</someelement>
emphasized text
middle text
<anotherelement/>
last text
</optional>
</option>

I'm beginning to think this is an impossible task, so I'm asking here to see if
there is some method that will work. What I've done so far is this:

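(outer encloses the parent, outside is the parent, inside is the child to remove)

from lxml import etree
import copy

def rm_tag(elem, outer, outside, inside):
    # Build a replacement for the parent ("outside") element.
    newdiv = etree.Element(outside)
    newdiv.text = ''
    for e0 in elem.iter(outside):
        for i, e1 in enumerate(e0.iter()):
            if i == 0:
                # The first item is the parent itself: keep its leading text.
                if e1.text:
                    newdiv.text += e1.text
            elif e1.tag != inside:
                # Copy every other descendant (the copy keeps its tail text).
                newdiv.append(copy.deepcopy(e1))
            elif e1.text:
                # Text of the removed child is added to the parent's leading
                # text, so it lands ahead of every copied child; the removed
                # child's tail text is never copied at all.
                newdiv.text += e1.text

    # Swap the rebuilt parent into the first "outer" element.
    for t in elem.iter():
        if t.tag == outer:
            t.clear()
            t.append(newdiv)
            break
    return etree.ElementTree(elem)

el = etree.fromstring(s0)
print(etree.tostring(rm_tag(el, 'option', 'optional', 'emphasis'), pretty_print=True).decode())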

But the text is messed up using this method. I see why it's wrong, but not
how to make it right.
It returns:
<option>
<optional> first text
emphasized text
<someelement>ladida</someelement>
<anotherelement/>
last text
</optional>
</option>

Maybe I should send the outside element (via tostring) to a regexp for
removing the child and return that string? Regexp? Getting desperate, hey.

Any pointers much appreciated,
--Tim Arnold

Stefan Behnel · Oct 25, 2008

There's a drop_tag() method in lxml.html (lxml/html/__init__.py) that does
what you want. Just copy the code over to your code base and adapt it as needed.

Stefan
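
For readers without the lxml source at hand, here is a minimal sketch of the approach Stefan points to, written as a standalone function for plain lxml.etree elements. It is modelled on the idea behind drop_tag() in lxml.html but is not a copy of the library code; the function name and the ValueError for a parentless element are choices made for this illustration. The key point is that lxml stores text that follows an element in that element's .tail, so both the .text and the .tail of the dropped element have to be merged into whatever comes just before it.

```python
from lxml import etree

def drop_tag(elem):
    """Remove elem's tag but keep its text, children and tail in place.

    Illustrative sketch only; assumes elem has a parent.
    """
    parent = elem.getparent()
    if parent is None:
        raise ValueError("cannot drop an element that has no parent")
    previous = elem.getprevious()

    # The element's leading text joins whatever text precedes it:
    # the previous sibling's tail, or the parent's text if it is the first child.
    if elem.text:
        if previous is None:
            parent.text = (parent.text or "") + elem.text
        else:
            previous.tail = (previous.tail or "") + elem.text

    # The element's tail follows its last child if it has children,
    # otherwise it joins the same place the leading text went.
    if elem.tail:
        if len(elem):
            last = elem[-1]
            last.tail = (last.tail or "") + elem.tail
        elif previous is None:
            parent.text = (parent.text or "") + elem.tail
        else:
            previous.tail = (previous.tail or "") + elem.tail

    # Replace the element with its own children, preserving position.
    index = parent.index(elem)
    parent[index:index + 1] = elem[:]


s0 = """<option>
<optional> first text
<someelement>ladida</someelement>
<emphasis>emphasized text</emphasis>
middle text
<anotherelement/>
last text
</optional>
</option>"""

root = etree.fromstring(s0)
# findall() returns a list, so the tree can be modified while looping over it.
for em in root.findall(".//emphasis"):
    drop_tag(em)
result = etree.tostring(root, encoding="unicode")
print(result)
```

Current lxml releases also provide etree.strip_tags(root, 'emphasis'), which does the same job for every matching element in a tree without any custom code.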

Tim Arnold · Oct 27, 2008

Thanks Stefan, I was going crazy with this. That method is going to be quite
useful for my project and it's good to learn from too; I was making it too
hard.

thanks,
--Tim Arnold
