Steven Bethard
I've got a list of word substrings (the "tokens") which I need to align
to a string of text (the "sentence"). The sentence is basically the
concatenation of the token list, with spaces sometimes inserted between
tokens. I need to determine the start and end offsets of each token in
the sentence. For example::
py> tokens = ['She', "'s", 'gon', 'na', 'write', 'a', 'book', '?']
py> text = '''\
.... She's gonna write
.... a book?'''
py> list(offsets(tokens, text))
[(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]
Here's my current definition of the offsets function::
py> def offsets(tokens, text):
....     start = 0
....     for token in tokens:
....         while text[start].isspace():
....             start += 1
....         text_token = text[start:start+len(token)]
....         assert text_token == token, (text_token, token)
....         yield start, start + len(token)
....         start += len(token)
I feel like there should be a simpler solution (maybe with the re
module?) but I can't figure one out. Any suggestions?
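
To make the re idea concrete, here is a rough sketch of one direction it could take. The name ``offsets_re`` is hypothetical, and this is only one possible approach under the assumption that tokens appear in order and are separated by nothing but whitespace. It matches one token at a time, letting ``\s*`` skip the whitespace and ``re.escape`` protect punctuation such as ``?``:

```python
import re

def offsets_re(tokens, text):
    """Yield (start, end) offsets of each token in text.

    Assumes the tokens occur in order and that only whitespace
    may appear between consecutive tokens.
    """
    pos = 0
    for token in tokens:
        # \s* consumes any whitespace before the token; re.escape makes
        # characters like '?' or '.' match literally.
        pattern = re.compile(r'\s*(' + re.escape(token) + r')')
        match = pattern.match(text, pos)
        if match is None:
            raise ValueError('token %r not found at offset %d' % (token, pos))
        # Group 1 is the token itself, without the leading whitespace.
        yield match.span(1)
        pos = match.end(1)
```

Whether this counts as simpler than the loop above is debatable: it replaces the explicit whitespace loop and the assert with a single match per token, but it is about the same length.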