converting text and spans to an ElementTree

Steven Bethard · May 22, 2007

I have some text and a list of Element objects and their offsets, e.g.::

>>> text = 'aaa aaa aaabbb bbbaaa'
>>> spans = [

Click to expand...

Click to expand...

... (etree.Element('a'), 0, 21),
... (etree.Element('b'), 11, 18),
... (etree.Element('c'), 18, 18),
... ]

I'd like to produce the corresponding ElementTree. So I want to write a
get_tree() function that works like::
'<a>aaa aaa aaabbb bbb<c />aaa</a>'

Perhaps I just need some more sleep, but I can't see an obvious way to
do this. Any suggestions?

Thanks,

STeVe

Gabriel Genellina · May 22, 2007

En Tue, 22 May 2007 03:02:34 -0300, Steven Bethard

I have some text and a list of Element objects and their offsets, e.g.::

text = 'aaa aaa aaabbb bbbaaa'
spans = [

Click to expand...

Click to expand...

... (etree.Element('a'), 0, 21),
... (etree.Element('b'), 11, 18),
... (etree.Element('c'), 18, 18),
... ]

I'd like to produce the corresponding ElementTree. So I want to write a
get_tree() function that works like::
'<a>aaa aaa aaabbb bbb<c />aaa</a>'

Perhaps I just need some more sleep, but I can't see an obvious way to
do this. Any suggestions?

I need *some* sleep, but the idea would be as follows:

- For each span generate two tuples: (start_offset, 1, end_offset,
element) and (end_offset, 0, -start_offset, element). If start==end use
(start_offset, -1, start_offset, element).
- Collect all tuples in a list, and sort them. The tuple is made so when
at a given offset there is an element that ends and another that starts,
the ending element comes first (because it has a 0). For all the elements
that end at a given point, the shortest comes first.
- Initialize an empty list (will be used as a stack of containers), and
create a root Element as your "current container" (CC), the variable
"last_used" will keep the last position used on the text.
- For each tuple in the sorted list:
- if the second item is a 1, an element is starting. Insert it into the
CC element, push the CC onto the stack, and set the new element as the new
CC. The element data is text[last_used:start_offset], and move last_used
to start_offset.
- if the second item is a 0, an element is ending. Discard the CC, pop
an element from the stack as the new CC. The element data is
text[last_used:end_offset], move last_used up to end_offset.
- if the second item is a -1, it's an element with no content. Insert it
into the CC element.

You can play with the way the tuples are generated and sorted, to get
'<a>aaa aaa aaabbb bbb<c />aaa</a>' instead.

attn.steven.kuo · May 22, 2007

I have some text and a list of Element objects and their offsets, e.g.::

text = 'aaa aaa aaabbb bbbaaa'
spans = [

Click to expand...

Click to expand...

... (etree.Element('a'), 0, 21),
... (etree.Element('b'), 11, 18),
... (etree.Element('c'), 18, 18),
... ]

I'd like to produce the corresponding ElementTree. So I want to write a
get_tree() function that works like::
'<a>aaa aaa aaabbb bbb<c />aaa</a>'

Perhaps I just need some more sleep, but I can't see an obvious way to
do this. Any suggestions?

It seems you're looking to construct an Interval Tree:

http://en.wikipedia.org/wiki/Interval_tree

Steven Bethard · May 22, 2007

I have some text and a list of Element objects and their offsets, e.g.::

text = 'aaa aaa aaabbb bbbaaa'
spans = [

Click to expand...

... (etree.Element('a'), 0, 21),
... (etree.Element('b'), 11, 18),
... (etree.Element('c'), 18, 18),
... ]

I'd like to produce the corresponding ElementTree. So I want to write a
get_tree() function that works like::

tree = get_tree(text, spans)
etree.tostring(tree)

Click to expand...

'<a>aaa aaa aaabbb bbb<c />aaa</a>'

Perhaps I just need some more sleep, but I can't see an obvious way to
do this. Any suggestions?

Click to expand...

It seems you're looking to construct an Interval Tree:

http://en.wikipedia.org/wiki/Interval_tree

No, I'm looking to construct an ElementTree from intervals. ;-) Could
you elaborate on how an Interval Tree would help me?

STeVe

attn.steven.kuo · May 22, 2007

I have some text and a list of Element objects and their offsets, e.g.::

text = 'aaa aaa aaabbb bbbaaa'
spans = [

Click to expand...

Click to expand...

... (etree.Element('a'), 0, 21),
... (etree.Element('b'), 11, 18),
... (etree.Element('c'), 18, 18),
... ]

I'd like to produce the corresponding ElementTree. So I want to write a
get_tree() function that works like::
'<a>aaa aaa aaabbb bbb<c />aaa</a>'

Perhaps I just need some more sleep, but I can't see an obvious way to
do this. Any suggestions?

It seems you're looking to construct an Interval Tree:

http://en.wikipedia.org/wiki/Interval_tree

Steven Bethard · May 22, 2007

Gabriel said:
the idea would be as follows:

- For each span generate two tuples: (start_offset, 1, end_offset,
element) and (end_offset, 0, -start_offset, element). If start==end use
(start_offset, -1, start_offset, element).
- Collect all tuples in a list, and sort them. The tuple is made so when
at a given offset there is an element that ends and another that starts,
the ending element comes first (because it has a 0). For all the
elements that end at a given point, the shortest comes first.
- Initialize an empty list (will be used as a stack of containers), and
create a root Element as your "current container" (CC), the variable
"last_used" will keep the last position used on the text.
- For each tuple in the sorted list:
- if the second item is a 1, an element is starting. Insert it into
the CC element, push the CC onto the stack, and set the new element as
the new CC. The element data is text[last_used:start_offset], and move
last_used to start_offset.
- if the second item is a 0, an element is ending. Discard the CC, pop
an element from the stack as the new CC. The element data is
text[last_used:end_offset], move last_used up to end_offset.
- if the second item is a -1, it's an element with no content. Insert
it into the CC element.

Thanks a lot! This put me on the right track (though the devil's
definitely in the details). It's working now::

>>> tree = xmltools.text_and_spans_to_etree('aaa aaa aaaccc cccaaa', [

Click to expand...

Click to expand...

.... (etree.Element('a'), 0, 21),
.... (etree.Element('b'), 11, 11),
.... (etree.Element('c'), 11, 18),
.... ])

' said:
>>> etree.tostring(tree)

Click to expand...

' said:

>>> tree = xmltools.text_and_spans_to_etree('bbb\naaaccc\ncccaaa', [

Click to expand...

Click to expand...

.... (etree.Element('a'), 0, 17),
.... (etree.Element('b'), 0, 4),
.... (etree.Element('c'), 7, 14),
.... (etree.Element('b'), 14, 14),
.... ])

' said:
>>> etree.tostring(tree)

Click to expand...

' said:

>>> tree = xmltools.text_and_spans_to_etree('abc', [

Click to expand...

Click to expand...

.... (etree.Element('a'), 0, 3),
.... (etree.Element('b'), 0, 3),
.... (etree.Element('c'), 0, 3),
.... ])'<a><c>abc</c></a>'

And for the sake of any poor soul who runs into a similar problem,
here's the code I wrote using Gabriel's hints above::

def text_and_spans_to_etree(text, spans):
# Create a list of element starts and ends, sorting them in the
# order they would appear in an XML version of the text. So for
# example, with XML text like:
# <a> a b a </a>
# we will see:
# starting <a>
# starting 
# ending 
# ending </a>
empty = -1
starting = 0
ending = 1
tuples = []
root_elem = None
for i, (elem, start, end) in enumerate(spans):
# validate spans
if start < 0 or end > len(text):
msg = 'spans must be in the range 0-%i'
raise ValueError(msg % len(text))

# save the first element that spans the entire text as the root
elif root_elem is None and start == 0 and end == len(text):
root_elem = elem

# insert a single tuple for empty elements
elif start == end:
tuples.append((start, empty, -end, i, elem))

# insert starts and ends for all other elements
else:
tuples.append((start, starting, -end, i, elem))
tuples.append((start, ending, -end, -i, elem))
tuples.sort()

# make sure we found a root element
if root_elem is None:
raise ValueError('one element must span the entire text')

# iterate through the element starts and ends,
# updating element texts, tails and children
last_offset = 0
last_elem = root_elem
last_type = starting
stack = [root_elem]
for start, offset_type, neg_end, _, elem in tuples:

# start of an element:
# add it as a child and add it to the stack
# next text chunk goes up to the start offset
if offset_type is starting:
stack[-1].append(elem)
stack.append(elem)
offset = start

# end of an element:
# pop if off the stack
# next text chunk goes up to the end offset
elif offset_type is ending:
if elem is not stack[-1]:
print elem, stack[-1]
assert False
assert elem is stack.pop()
offset = -neg_end

# empty element:
# add it as a child
# next text chunk goes up to the start offset
elif offset_type is empty:
stack[-1].append(elem)
offset = start

# should never get here
else:
assert False

# calculate the next text chunk, and then determine what to do
# with it based on what we did the last time around:
# * started an element -> set its .text
# * ended an element (or inserted an empty) -> set its .tail
last_text = text[last_offset

ffset]
if last_type is starting:
last_elem.text = last_text
elif last_type is ending:
last_elem.tail = last_text
elif last_type is empty:
last_elem.tail = last_text
else:
assert False

# save what we did this time for inspection next time
last_offset = offset
last_type = offset_type
last_elem = elem

# add any final text before the close of the root element
last_elem.tail = text[last_offset:]

return root_elem

Thanks again,

STeVe

Neil Cerutti · May 23, 2007

' said:
Thanks a lot! This put me on the right track (though the
devil's definitely in the details). It's working now::

tree = xmltools.text_and_spans_to_etree('aaa aaa aaaccc cccaaa', [

Click to expand...

Click to expand...

... (etree.Element('a'), 0, 21),
... (etree.Element('b'), 11, 11),
... (etree.Element('c'), 11, 18),
... ])

etree.tostring(tree)

Click to expand...

' said:

tree = xmltools.text_and_spans_to_etree('bbb\naaaccc\ncccaaa', [

Click to expand...

Click to expand...

... (etree.Element('a'), 0, 17),
... (etree.Element('b'), 0, 4),
... (etree.Element('c'), 7, 14),
... (etree.Element('b'), 14, 14),
... ])

etree.tostring(tree)

Click to expand...

' said:

tree = xmltools.text_and_spans_to_etree('abc', [

Click to expand...

Click to expand...

... (etree.Element('a'), 0, 3),
... (etree.Element('b'), 0, 3),
... (etree.Element('c'), 0, 3),
... ])'<a><c>abc</c></a>'

And for the sake of any poor soul who runs into a similar
problem, here's the code I wrote using Gabriel's hints above::

When I saw you manually keeping a stack, I called Captain
Recursion on my Red-Alert Recursion Phone.

(I'm sorry he left out the Element stuff, which he doesn't know
or use yet. The Captain's get_tree just returns the string)

def get_tree(text, spans):
"""

>>> text = 'aaa aaa aaabbb bbbaaa'
>>> spans = [

Click to expand...

Click to expand...

... ('a', 0, 21),
... ('b', 11, 18),
... ('c', 18, 18),
... ]

I'd like to produce the corresponding ElementTree. So I want to write a
get_tree() function that works like::
'<a>aaa aaa aaabbb bbb<c />aaa</a>'
"""
if not spans:
return ''
else:
head, tail = spans[0], spans[1:]
elem, start, end = head
if tail:
_, follow_start, follow_end = tail[0]
else:
follow_start, follow_end = (end, end)
if end > start:
return ("<%s>%s%s%s</%s>" %
(elem,
text[start:follow_start],
get_tree(text, tail),
text[follow_end:end],
elem))
else:
return "<%s />%s" % (elem, get_tree(text, tail))

Steven Bethard · May 23, 2007

Neil said:
Thanks a lot! This put me on the right track (though the
devil's definitely in the details). It's working now::

tree = xmltools.text_and_spans_to_etree('aaa aaa aaaccc cccaaa', [

Click to expand...

... (etree.Element('a'), 0, 21),
... (etree.Element('b'), 11, 11),
... (etree.Element('c'), 11, 18),
... ])

etree.tostring(tree)

Click to expand...

' said:

tree = xmltools.text_and_spans_to_etree('bbb\naaaccc\ncccaaa', [

Click to expand...

... (etree.Element('a'), 0, 17),
... (etree.Element('b'), 0, 4),
... (etree.Element('c'), 7, 14),
... (etree.Element('b'), 14, 14),
... ])

etree.tostring(tree)

Click to expand...

' said:

tree = xmltools.text_and_spans_to_etree('abc', [

Click to expand...

... (etree.Element('a'), 0, 3),
... (etree.Element('b'), 0, 3),
... (etree.Element('c'), 0, 3),
... ])

etree.tostring(tree)

Click to expand...

'<a><c>abc</c></a>'

And for the sake of any poor soul who runs into a similar
problem, here's the code I wrote using Gabriel's hints above::

Click to expand...

When I saw you manually keeping a stack, I called Captain
Recursion on my Red-Alert Recursion Phone.

(I'm sorry he left out the Element stuff, which he doesn't know
or use yet. The Captain's get_tree just returns the string)

Heh heh.

I actually thought about writing it recursively, but note that you need
both recursive and non-recursive parts of this algorithm to do the
ElementTree part right:

* the recursive (or stack) part assigns children to parents
* the non-recursive part assigns text or tail to the previous element
(note that's previous in a sequential sense, not a recursive sense)

I'm sure I could implement this recursively, passing around annother
appropriate argument, but it wasn't obvious to me that the code would be
any cleaner.

STeVe

Neil Cerutti · May 24, 2007

Neil said:
Neil said:

Thanks a lot! This put me on the right track (though the
devil's definitely in the details). It's working now::

tree = xmltools.text_and_spans_to_etree('aaa aaa aaaccc cccaaa', [
... (etree.Element('a'), 0, 21),
... (etree.Element('b'), 11, 11),
... (etree.Element('c'), 11, 18),
... ])
etree.tostring(tree)
'<a>aaa aaa aaa<c>ccc ccc</c>aaa</a>'
tree = xmltools.text_and_spans_to_etree('bbb\naaaccc\ncccaaa', [
... (etree.Element('a'), 0, 17),
... (etree.Element('b'), 0, 4),
... (etree.Element('c'), 7, 14),
... (etree.Element('b'), 14, 14),
... ])
etree.tostring(tree)
'<a>bbb\naaa<c>ccc\nccc</c>aaa</a>'
tree = xmltools.text_and_spans_to_etree('abc', [
... (etree.Element('a'), 0, 3),
... (etree.Element('b'), 0, 3),
... (etree.Element('c'), 0, 3),
... ])
etree.tostring(tree)
'<a><c>abc</c></a>'

And for the sake of any poor soul who runs into a similar
problem, here's the code I wrote using Gabriel's hints above::

Click to expand...

When I saw you manually keeping a stack, I called Captain
Recursion on my Red-Alert Recursion Phone.

(I'm sorry he left out the Element stuff, which he doesn't know
or use yet. The Captain's get_tree just returns the string)

Click to expand...

Heh heh.

I actually thought about writing it recursively, but note that
you need both recursive and non-recursive parts of this
algorithm to do the ElementTree part right:

You mean... I left out the hard part? Shucks. I had really hoped
it didn't matter.

* the recursive (or stack) part assigns children to parents
* the non-recursive part assigns text or tail to the previous element
(note that's previous in a sequential sense, not a recursive sense)

I'm sure I could implement this recursively, passing around
annother appropriate argument, but it wasn't obvious to me that
the code would be any cleaner.

Moreover, it looks like you have experience in writing that sort
of code. I'd have never even attempted it without recursion, but
that's merely exposing one of my limitations.

Neil Cerutti · May 24, 2007

You mean... I left out the hard part? Shucks. I had really
hoped it didn't matter.

Moreover, it looks like you have experience in writing that
sort of code. I'd have never even attempted it without
recursion, but that's merely exposing one of my limitations.

You'll be happy to know I found a way to salvage my simple
recursive solution, and make it generate an ElementTree!

def get_tree(text, spans):
"""

>>> text = 'aaa aaa aaabbb bbbaaa'
>>> spans = [

Click to expand...

Click to expand...

... (etree.Element('a'), 0, 21),
... (etree.Element('b'), 11, 18),
... (etree.Element('c'), 18, 18),
... ]

I'd like to produce the corresponding ElementTree. So I want to write a
get_tree() function that works like::
'<a>aaa aaa aaabbb bbb<c />aaa</a>'
"""
def helper(text, spans):
if not spans:
return ''
else:
head, tail = spans[0], spans[1:]
elem, start, end = head
if tail:
_, follow_start, follow_end = tail[0]
else:
follow_start, follow_end = (end, end)
if end > start:
return ("<%s>%s%s%s</%s>" %
(elem.tag,
text[start:follow_start],
helper(text, tail),
text[follow_end:end],
elem.tag))
else:
return "<%s />%s" % (elem.tag, helper(text, tail))
return etree.XML(helper(text, spans))

But at least I learned just a *little* about XML and Python
during this arduous process.

simple ElementTree based parser that allows entity definition map	0	Dec 4, 2013
Create xml with elementtree ET and xml escaping	4	Dec 11, 2012
lxml/ElementTree and .tail	30	Nov 15, 2006
elementtree and gbk encoding	12	Mar 14, 2006
Default namespace and attributes with ElementTree 1.3	0	Jun 18, 2009
How to specify an undetermined Object (Generic) as parameter in method() ?	0	Jan 18, 2011
Problem inserting an element where I want it using lxml	2	Jan 5, 2011
using TreeBuilder in an ElementTree like way	0	Jun 28, 2006

converting text and spans to an ElementTree

Steven Bethard

Gabriel Genellina

attn.steven.kuo

Steven Bethard

attn.steven.kuo

Steven Bethard

Neil Cerutti

Steven Bethard

Neil Cerutti

Neil Cerutti

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads