aligning a set of word substrings to a sentence

Steven Bethard

I've got a list of word substrings (the "tokens") which I need to align
to a string of text (the "sentence"). The sentence is basically the
concatenation of the token list, with spaces sometimes inserted between
tokens. I need to determine the start and end offsets of each token in
the sentence. For example::

py> tokens = ['She', "'s", 'gon', 'na', 'write', 'a', 'book', '?']
py> text = '''\
.... She's gonna write
.... a book?'''
py> list(offsets(tokens, text))
[(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]

Here's my current definition of the offsets function::

py> def offsets(tokens, text):
....     start = 0
....     for token in tokens:
....         while text[start].isspace():
....             start += 1
....         text_token = text[start:start+len(token)]
....         assert text_token == token, (text_token, token)
....         yield start, start + len(token)
....         start += len(token)
....

I feel like there should be a simpler solution (maybe with the re
module?) but I can't figure one out. Any suggestions?

STeVe
 
Fredrik Lundh

Steven said:
I feel like there should be a simpler solution (maybe with the re
module?) but I can't figure one out. Any suggestions?

using the finditer pattern I just posted in another thread:

tokens = ['She', "'s", 'gon', 'na', 'write', 'a', 'book', '?']
text = '''\
She's gonna write
a book?'''

import re

tokens.sort() # lexical order
tokens.reverse() # look for longest match first
pattern = "|".join(map(re.escape, tokens))
pattern = re.compile(pattern)

I get

print [m.span() for m in pattern.finditer(text)]
[(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]

which seems to match your version pretty well.
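
the sort/reverse dance matters when one token is a prefix of another:
alternation tries the alternatives left to right, so a shorter token
listed first would shadow the longer one. a tiny sketch with a
hypothetical token pair (not from Steven's data):

import re

tokens = ["a", "ab"]  # "a" is a prefix of "ab"
pattern = re.compile("|".join(map(re.escape, tokens)))  # tries "a" first
print([m.span() for m in pattern.finditer("ab")])  # [(0, 1)] -- the "b" is never matched

tokens.sort(); tokens.reverse()  # longest match first: ["ab", "a"]
pattern = re.compile("|".join(map(re.escape, tokens)))
print([m.span() for m in pattern.finditer("ab")])  # [(0, 2)]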

hope this helps!

</F>
 
Steven Bethard

Fredrik said:
[snip finditer solution]

That's what I was looking for. Thanks!

STeVe
 
Paul McGuire

Steven Bethard said:
[snip problem statement and example]

Hey, I get the same answer with this:

===================
from pyparsing import oneOf

tokens = ['She', "'s", 'gon', 'na', 'write', 'a', 'book', '?']
text = '''\
She's gonna write
a book?'''

tokenlist = oneOf( " ".join(tokens) )
offsets = [(start,end) for token,start,end in tokenlist.scanString(text) ]

print offsets
===================
[(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]


Of course, pyparsing may be a bit heavyweight to drag into a simple function
like this, and certainly nowhere near as fast as a regexp. But it was such a
nice way to show how scanString works.

Pyparsing's "oneOf" helper function takes care of the same longest-match
issues that Fredrik Lundh handles using sort, reverse, etc., just so long as
none of the tokens has an embedded space character.
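
For instance (a sketch with a hypothetical token pair, not from the thread),
oneOf reorders its alternatives so the longer string is tried first:

from pyparsing import oneOf

expr = oneOf("a ab")  # oneOf tries "ab" before "a"
print([(t[0], s, e) for t, s, e in expr.scanString("ab")])
# prints [('ab', 0, 2)] rather than stopping at ('a', 0, 1)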

-- Paul
 
Steven Bethard

Paul said:
[snip pyparsing solution]

Now that's a pretty solution. Three cheers for pyparsing! :)

STeVe
 
Michael Spencer

Steven said:
[snip problem statement and offsets function]

Hi Steve:

Any reason you can't simply use str.find in your offsets function?

>>> def offsets(tokens, text):
...     ptr = 0
...     for token in tokens:
...         fpos = text.find(token, ptr)
...         if fpos != -1:
...             end = fpos + len(token)
...             yield (fpos, end)
...             ptr = end
...
>>> list(offsets(tokens, text))
[(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]
>>>
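
One thing to note: unlike Steven's assert-based version, find silently skips
any token that isn't where it should be. A stricter variation (my sketch,
not part of Michael's post):

def strict_offsets(tokens, text):
    # like the find-based version, but fail loudly on a missing token
    ptr = 0
    for token in tokens:
        fpos = text.find(token, ptr)
        if fpos == -1:
            raise ValueError("token %r not found after offset %d" % (token, ptr))
        end = fpos + len(token)
        yield (fpos, end)
        ptr = end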

and then, for an entry in the wacky category, a difflib solution:

>>> def offsets(tokens, text):
...     from difflib import SequenceMatcher
...     s = SequenceMatcher(None, text, "\t".join(tokens))
...     for start, _, length in s.get_matching_blocks():
...         if length:
...             yield start, start + length
...
>>> list(offsets(tokens, text))
[(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]
>>>
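
(The tab join works because, assuming the tokens themselves never contain a
tab, the "\t" separators in the joined string can never match the spaces or
empty gaps between tokens in the text, so no matching block can cross a token
boundary; each token surfaces as exactly one block, plus difflib's trailing
zero-length block, which the "if length:" test drops. A quick look at the
second sequence being compared:

print(repr("\t".join(['She', "'s", 'gon', 'na'])))  # "She\t's\tgon\tna"
)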

cheers
Michael
 
Fredrik Lundh

Steven said:
Fredrik said:
[snip finditer solution]

That's what I was looking for. Thanks!

except that I misread your problem statement; the RE solution above allows the
tokens to be specified in arbitrary order. if they're always ordered, you can
replace the code with something like:

# match tokens plus optional whitespace between each token
pattern = r"\s*".join("(" + re.escape(token) + ")" for token in tokens)
m = re.match(pattern, text)
result = (m.span(i+1) for i in range(len(tokens)))

which is 6-7 times faster than the previous solution, on my machine.
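
wrapped up as a drop-in replacement for the original generator (a sketch;
the function name just mirrors Steven's):

import re

def offsets(tokens, text):
    # one capturing group per token; \s* allows optional whitespace between tokens
    pattern = r"\s*".join("(" + re.escape(token) + ")" for token in tokens)
    m = re.match(pattern, text)
    return (m.span(i + 1) for i in range(len(tokens)))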

</F>
 
Steven Bethard

Fredrik said:
[snip faster ordered-token solution]

Ahh yes, that's faster for me too. Thanks again!

STeVe
 
Steven Bethard

Michael said:
Steven said:
[snip problem statement]

and then, for an entry in the wacky category, a difflib solution:

def offsets(tokens, text):
    from difflib import SequenceMatcher
    s = SequenceMatcher(None, text, "\t".join(tokens))
    for start, _, length in s.get_matching_blocks():
        if length:
            yield start, start + length

list(offsets(tokens, text))
[(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]

That's cool, I've never seen that before. If you pass in str.isspace,
you can even drop the "if length:" line::

py> def offsets(tokens, text):
....     s = SequenceMatcher(str.isspace, text, '\t'.join(tokens))
....     for start, _, length in s.get_matching_blocks():
....         yield start, start + length
....
py> list(offsets(tokens, text))
[(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25), (25, 25)]

I think I'm going to have to take a closer look at
difflib.SequenceMatcher; I have to do things similar to this pretty often...

STeVe
 
Steven Bethard

Steven said:
Michael said:
[snip difflib solution]


That's cool, I've never seen that before. If you pass in str.isspace,
you can even drop the "if length:" line::

py> def offsets(tokens, text):
...     s = SequenceMatcher(str.isspace, text, '\t'.join(tokens))
...     for start, _, length in s.get_matching_blocks():
...         yield start, start + length
...
py> list(offsets(tokens, text))
[(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25), (25, 25)]

Sorry, that should have been::

py> list(offsets(tokens, text))[:-1]

since the last item is always the zero-length one, which means you
don't really need str.isspace either.
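
(For reference, a quick sketch of that sentinel behaviour:
get_matching_blocks always appends a zero-length block at the end, so
slicing it off is safe regardless of the isjunk argument.

from difflib import SequenceMatcher

s = SequenceMatcher(None, "ab", "ab")
for block in s.get_matching_blocks():
    print(tuple(block))  # (0, 0, 2), then the sentinel (2, 2, 0)
)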

STeVe
 
