Steven Bethard
I have some plain text data and some SGML markup for that text that I
need to align. (The SGML doesn't maintain the original whitespace, so I
have to do some alignment; I can't just calculate the indices directly.)
For example, some of my text looks like:
TNF binding induces release of AIP1 (DAB2IP) from TNFR1, resulting in
cytoplasmic translocation and concomitant formation of an intracellular
signaling complex comprised of TRADD, RIP1, TRAF2, and AIPl.
And the corresponding SGML looks like:
<PROTEIN> TNF </PROTEIN> binding induces release of <PROTEIN> AIP1
</PROTEIN> ( <PROTEIN> DAB2IP </PROTEIN> ) from <PROTEIN> TNFR1
</PROTEIN> , resulting in cytoplasmic translocation and concomitant
formation of an <PROTEIN> intracellular signaling complex </PROTEIN>
comprised of <PROTEIN> TRADD </PROTEIN> , <PROTEIN> RIP1 </PROTEIN> ,
<PROTEIN> TRAF2 </PROTEIN> , and AIPl .
Note that the SGML inserts spaces not only within the SGML elements, but
also around punctuation.
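To make the whitespace issue concrete, here is a small sketch (not part of my real code; the helper name `strip_tags_and_space` is made up for illustration) showing that a fragment of the text and its SGML agree once tags and all whitespace are thrown away. It assumes tags are simple `<...>` runs, as in the example above:

```python
import re

def strip_tags_and_space(s):
    """Drop anything that looks like a tag, then drop all whitespace."""
    without_tags = re.sub(r'<[^>]+>', '', s)
    return re.sub(r'\s+', '', without_tags)

text = 'release of AIP1 (DAB2IP) from TNFR1, resulting in'
sgml = ('release of <PROTEIN> AIP1 </PROTEIN> ( <PROTEIN> DAB2IP </PROTEIN> ) '
        'from <PROTEIN> TNFR1 </PROTEIN> , resulting in')

# The two differ only in markup and spacing, so this comparison holds.
same = strip_tags_and_space(text) == strip_tags_and_space(sgml)
```

So the characters line up; only the offsets are lost, which is why some alignment step is needed.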
I need to determine the indices in the original text that each SGML
element corresponds to. Here's some working code to do this, based on a
suggestion for a related problem by Fredrik Lundh[1]::
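
    import re
    # Assumed import; any ElementTree-compatible module should do.
    import xml.etree.ElementTree as etree

    def align(text, sgml):
        # Escape bare ampersands so the SGML parses as XML.
        sgml = sgml.replace('&', '&amp;')
        tree = etree.fromstring('<xml>%s</xml>' % sgml)

        # Collect every word in order, remembering which word ranges
        # fall inside an element.
        words = []
        if tree.text is not None:
            words.extend(tree.text.split())
        word_indices = []
        for elem in tree:
            elem_words = elem.text.split()
            start = len(words)
            end = start + len(elem_words)
            word_indices.append((start, end, elem.tag))
            words.extend(elem_words)
            if elem.tail is not None:
                words.extend(elem.tail.split())

        # One group per word, with optional whitespace between words.
        expr = r'\s*'.join('(%s)' % re.escape(word) for word in words)
        match = re.match(expr, text)
        assert match is not None

        # Groups are 1-based: word i is group i + 1.
        for word_start, word_end, label in word_indices:
            start = match.start(word_start + 1)
            end = match.end(word_end)
            yield label, start, end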

    >>> text = '''TNF binding induces release of AIP1 (DAB2IP) from
    ... TNFR1, resulting in cytoplasmic translocation and concomitant
    ... formation of an intracellular signaling complex comprised of TRADD,
    ... RIP1, TRAF2, and AIPl.'''
    >>> sgml = '''<PROTEIN> TNF </PROTEIN> binding induces release of
    ... <PROTEIN> AIP1 </PROTEIN> ( <PROTEIN> DAB2IP </PROTEIN> ) from
    ... <PROTEIN> TNFR1 </PROTEIN> , resulting in cytoplasmic translocation
    ... and concomitant formation of an <PROTEIN> intracellular signaling
    ... complex </PROTEIN> comprised of <PROTEIN> TRADD </PROTEIN> ,
    ... <PROTEIN> RIP1 </PROTEIN> , <PROTEIN> TRAF2 </PROTEIN> , and AIPl .'''
    >>> list(align(text, sgml))
    [('PROTEIN', 0, 3), ('PROTEIN', 31, 35), ('PROTEIN', 37, 43),
    ('PROTEIN', 50, 55), ('PROTEIN', 128, 159), ('PROTEIN', 173, 178),
    ('PROTEIN', 180, 184), ('PROTEIN', 186, 191)]
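For anyone skimming, the core trick is easier to see on a tiny input. This is just an illustrative reduction of the function above (the four-word example is made up): each word becomes one escaped capture group, the groups are joined with `\s*`, and the group spans give the offsets in the original text.

```python
import re

words = ['AIP1', '(', 'DAB2IP', ')']
text = 'AIP1 (DAB2IP)'

# One capture group per word, with optional whitespace between words.
expr = r'\s*'.join('(%s)' % re.escape(word) for word in words)
match = re.match(expr, text)

# Group numbers are 1-based, so word i lives in group i + 1.
spans = [match.span(i + 1) for i in range(len(words))]
```

Here `spans` holds one `(start, end)` pair per word, so the number of groups grows with the number of words in the text.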
The problem is that this doesn't work when my text is long (which it is):
the expression uses one group per word, and regular expressions are
limited to 100 groups. I get an error like::

    Traceback (most recent call last):
      ...
    AssertionError: sorry, but this version only supports 100 named groups

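Since the limit is on groups rather than on text length, one direction that might sidestep it is to drop the single big regex and walk the text word by word from a running offset. The following is only a sketch under assumptions: the function name `align_by_offset` and the `(label_or_None, [words])` input shape are invented here, and the word lists would still have to be built from the tree as in `align` above.

```python
def align_by_offset(text, labeled_words):
    """Yield (label, start, end) for each labeled run of words.

    labeled_words is a list of (label_or_None, [word, ...]) pairs in
    document order; this input shape is an assumption for the sketch.
    """
    pos = 0
    for label, words in labeled_words:
        start = None
        for word in words:
            # Skip whitespace in the text, mirroring the r'\s*' separator.
            while pos < len(text) and text[pos].isspace():
                pos += 1
            if not text.startswith(word, pos):
                raise ValueError('cannot align %r at offset %d' % (word, pos))
            if start is None:
                start = pos
            pos += len(word)
        if label is not None and start is not None:
            yield label, start, pos

text = 'TNF binding induces release of AIP1 (DAB2IP) from\nTNFR1, resulting'
labeled_words = [
    ('PROTEIN', ['TNF']),
    (None, ['binding', 'induces', 'release', 'of']),
    ('PROTEIN', ['AIP1']),
    (None, ['(']),
    ('PROTEIN', ['DAB2IP']),
    (None, [')', 'from']),
    ('PROTEIN', ['TNFR1']),
    (None, [',', 'resulting']),
]
result = list(align_by_offset(text, labeled_words))
```

On this fragment the offsets agree with the first four tuples of the output shown earlier. The sketch only handles text that differs from the SGML in whitespace; any other mismatch raises `ValueError` instead of being aligned.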
I also played around with difflib.SequenceMatcher for a while, but
couldn't get a solution based on that working. Any suggestions?
[1] http://mail.python.org/pipermail/python-list/2005-December/313388.html
Thanks,
STeVe