aligning text with space-normalized text

Steven Bethard · Jun 30, 2005

I have a string with a bunch of whitespace in it, and a series of chunks
of that string whose indices I need to find. However, the chunks have
been whitespace-normalized, so that multiple spaces and newlines have
been converted to single spaces as if by ' '.join(chunk.split()). Some
example data to clarify my problem:

py> text = """\
....    aaa  bb ccc
.... dd eee.  fff gggg
.... hh   i.
....    jjj kk.
.... """
py> chunks = ['aaa bb', 'ccc dd eee.', 'fff gggg hh i.', 'jjj', 'kk.']

Note that the original "text" has a variety of whitespace between words,
but the corresponding "chunks" have only single space characters between
"words". I'm looking for the indices of each chunk, so for this
example, I'd like:

py> result = [(3, 10), (11, 22), (24, 40), (44, 47), (48, 51)]

Note that the indices correspond to the *original* text so that the
substrings in the given spans include the irregular whitespace:

py> for s, e in result:
....     print repr(text[s:e])
....
'aaa  bb'
'ccc\ndd eee.'
'fff gggg\nhh   i.'
'jjj'
'kk.'
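A quick way to confirm that a set of spans is right is to normalize each slice and compare it with its chunk. The following is only a sketch in Python 3; the helper name check_spans is made up for illustration, and the whitespace in text is assumed to be the layout implied by the indices above.

```python
# Sample data; the exact whitespace is an assumption consistent with the spans.
text = (
    "   aaa  bb ccc\n"
    "dd eee.  fff gggg\n"
    "hh   i.\n"
    "   jjj kk.\n"
)
chunks = ['aaa bb', 'ccc dd eee.', 'fff gggg hh i.', 'jjj', 'kk.']
result = [(3, 10), (11, 22), (24, 40), (44, 47), (48, 51)]

def check_spans(text, chunks, spans):
    """Return True if each slice is tight and normalizes to its chunk."""
    if len(chunks) != len(spans):
        return False
    for chunk, (s, e) in zip(chunks, spans):
        piece = text[s:e]
        # A span must not start or end on whitespace.
        if piece != piece.strip():
            return False
        # Collapsing whitespace runs must reproduce the chunk exactly.
        if ' '.join(piece.split()) != chunk:
            return False
    return True

print(check_spans(text, chunks, result))  # True
```

The check says nothing about how the spans were found; it only verifies them after the fact.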

I'm trying to write code to produce the indices. Here's what I have:

py> def get_indices(text, chunks):
....     chunks = iter(chunks)
....     chunk = None
....     for text_index, c in enumerate(text):
....         if c.isspace():
....             continue
....         if chunk is None:
....             chunk = chunks.next().replace(' ', '')
....             chunk_start = text_index
....             chunk_index = 0
....         if c != chunk[chunk_index]:
....             raise Exception('unmatched: %r %r' %
....                             (c, chunk[chunk_index]))
....         else:
....             chunk_index += 1
....             if chunk_index == len(chunk):
....                 yield chunk_start, text_index + 1
....                 chunk = None
....

And it appears to work:

py> list(get_indices(text, chunks))
[(3, 10), (11, 22), (24, 40), (44, 47), (48, 51)]
py> list(get_indices(text, chunks)) == result
True
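For anyone trying this on Python 3, where iterators no longer have a .next() method, the same generator might look roughly like the sketch below. The explicit check for running out of chunks and the switch to ValueError are additions, not part of the code above, and the whitespace in text is assumed from the expected indices.

```python
def get_indices(text, chunks):
    """Yield (start, end) spans in text for each whitespace-normalized chunk."""
    chunks = iter(chunks)
    chunk = None
    for text_index, c in enumerate(text):
        if c.isspace():
            continue
        if chunk is None:
            # next() with a default avoids a StopIteration inside the generator.
            nxt = next(chunks, None)
            if nxt is None:
                raise ValueError('text left over at index %d' % text_index)
            chunk = nxt.replace(' ', '')
            chunk_start = text_index
            chunk_index = 0
        if c != chunk[chunk_index]:
            raise ValueError('unmatched: %r %r' % (c, chunk[chunk_index]))
        chunk_index += 1
        if chunk_index == len(chunk):
            yield chunk_start, text_index + 1
            chunk = None

# Sample data; the exact whitespace is an assumption consistent with the spans.
text = "   aaa  bb ccc\ndd eee.  fff gggg\nhh   i.\n   jjj kk.\n"
chunks = ['aaa bb', 'ccc dd eee.', 'fff gggg hh i.', 'jjj', 'kk.']
print(list(get_indices(text, chunks)))
# [(3, 10), (11, 22), (24, 40), (44, 47), (48, 51)]
```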

But it seems somewhat inelegant. Can anyone see an easier/cleaner/more
Pythonic way[1] of writing this code?

Thanks in advance,

STeVe

[1] Yes, I'm aware that these are subjective terms. I'm looking for
subjectively "better" solutions.

John Machin · Jun 30, 2005

Steven Bethard wrote:
[snip]

And it appears to work: [snip]
But it seems somewhat inelegant. Can anyone see an easier/cleaner/more
Pythonic way[1] of writing this code?

Thanks in advance,

STeVe

[1] Yes, I'm aware that these are subjective terms. I'm looking for
subjectively "better" solutions.

Perhaps you should define "work" before you worry about """subjectively
"better" solutions""".

If "work" is meant to detect *all* possibilities of 'chunks' not having
been derived from 'text' in the described manner, then it doesn't work
-- all information about the positions of the whitespace is thrown away
by your code.

For example, text = 'foo bar', chunks = ['foobar']
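If rejecting input like this is the goal, one possible approach is to turn each chunk into a pattern in which every single space must match at least one whitespace character in the text. This is only a sketch in Python 3 of that stricter reading; the name strict_indices is hypothetical and nothing in the thread prescribes this method.

```python
import re

def strict_indices(text, chunks):
    """Yield spans, raising ValueError if a chunk's spacing cannot match."""
    pos = 0
    for chunk in chunks:
        # Each space in the chunk stands for one or more whitespace characters.
        body = r'\s+'.join(re.escape(word) for word in chunk.split())
        pattern = re.compile(r'\s*(' + body + r')')
        m = pattern.match(text, pos)
        if m is None:
            raise ValueError('chunk %r does not match at offset %d' % (chunk, pos))
        yield m.span(1)
        pos = m.end()

print(list(strict_indices('foo  bar', ['foo bar'])))  # [(0, 8)]

try:
    list(strict_indices('foo bar', ['foobar']))
except ValueError as exc:
    print('rejected:', exc)
```

Because the function is a generator, the error only surfaces once the spans are actually consumed.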

Steven Bethard · Jun 30, 2005

John said:
If "work" is meant to detect *all* possibilities of 'chunks' not having
been derived from 'text' in the described manner, then it doesn't work
-- all information about the positions of the whitespace is thrown away
by your code.

For example, text = 'foo bar', chunks = ['foobar']

This doesn't match the (admittedly vague) spec which said that chunks
are created "as if by ' '.join(chunk.split())". For the text:
'foo bar'
the possible chunk lists should be something like:
['foo bar']
['foo', 'bar']
If it helps, you can think of chunks as lists of words, where the words
have been ' '.join()ed.
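To make the "lists of words" reading concrete, here is a small sketch in Python 3 that enumerates every chunk list allowed under it: split the text into words, then decide at each gap between words whether to start a new chunk. The function name possible_chunk_lists is invented for this illustration.

```python
from itertools import product

def possible_chunk_lists(text):
    """Return every way to group the words of text into contiguous chunks."""
    words = text.split()
    if not words:
        return [[]]
    results = []
    # One True/False decision per gap between adjacent words: cut or not.
    for cuts in product([False, True], repeat=len(words) - 1):
        groups = [[words[0]]]
        for word, cut in zip(words[1:], cuts):
            if cut:
                groups.append([word])
            else:
                groups[-1].append(word)
        results.append([' '.join(group) for group in groups])
    return results

print(possible_chunk_lists('foo bar'))
# [['foo bar'], ['foo', 'bar']]
```

Under this reading ['foobar'] never appears in the output, which is the sense in which it falls outside the spec.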

STeVe

Peter Otten · Jun 30, 2005

Steven said:
I have a string with a bunch of whitespace in it, and a series of chunks
of that string whose indices I need to find. However, the chunks have
been whitespace-normalized, so that multiple spaces and newlines have
been converted to single spaces as if by ' '.join(chunk.split()). Some

If you are willing to get your hands dirty with regexps:

import re
_reLump = re.compile(r"\S+")

def indices(text, chunks):
    lumps = _reLump.finditer(text)
    for chunk in chunks:
        lump = [lumps.next() for _ in chunk.split()]
        yield lump[0].start(), lump[-1].end()

def main():
    text = """\
   aaa  bb ccc
dd eee.  fff gggg
hh   i.
   jjj kk.
"""
    chunks = ['aaa bb', 'ccc dd eee.', 'fff gggg hh i.', 'jjj', 'kk.']
    assert list(indices(text, chunks)) == [(3, 10), (11, 22), (24, 40),
                                           (44, 47), (48, 51)]

if __name__ == "__main__":
    main()

Not tested beyond what you see.
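For Python 3 readers, a rough port of the same idea follows. It is a sketch, not Peter's code: it pulls the matches with itertools.islice instead of calling .next() in a list comprehension, because since Python 3.7 a StopIteration escaping inside a generator is turned into a RuntimeError. The ValueError for running out of text is an addition, and the sample whitespace is assumed from the expected indices.

```python
import re
from itertools import islice

_re_lump = re.compile(r"\S+")

def indices(text, chunks):
    """Yield (start, end) for each chunk, one run of non-whitespace per word."""
    lumps = _re_lump.finditer(text)
    for chunk in chunks:
        n_words = len(chunk.split())
        # Take exactly as many runs of non-whitespace as the chunk has words.
        lump = list(islice(lumps, n_words))
        if not lump or len(lump) < n_words:
            raise ValueError('ran out of text at chunk %r' % chunk)
        yield lump[0].start(), lump[-1].end()

# Sample data; the exact whitespace is an assumption consistent with the spans.
text = "   aaa  bb ccc\ndd eee.  fff gggg\nhh   i.\n   jjj kk.\n"
chunks = ['aaa bb', 'ccc dd eee.', 'fff gggg hh i.', 'jjj', 'kk.']
print(list(indices(text, chunks)))
# [(3, 10), (11, 22), (24, 40), (44, 47), (48, 51)]
```

Note that this version counts words and never compares their contents, so it trusts that the chunks really came from the text.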

Peter

John Machin · Jun 30, 2005

Steven said:
John said:

If "work" is meant to detect *all* possibilities of 'chunks' not
having been derived from 'text' in the described manner, then it
doesn't work -- all information about the positions of the whitespace
is thrown away by your code.

For example, text = 'foo bar', chunks = ['foobar']

This doesn't match the (admittedly vague) spec

That is *exactly* my point -- it is not valid input, and you are not
reporting all cases of invalid input; you have an exception where the
non-spaces are impossible, but no exception where whitespaces are
impossible.

which said that chunks
are created "as if by ' '.join(chunk.split())". For the text:
'foo bar'
the possible chunk lists should be something like:
['foo bar']
['foo', 'bar']
If it helps, you can think of chunks as lists of words, where the words
have been ' '.join()ed.

If it helps, you can re-read my message.

Steven Bethard · Jul 1, 2005

John said:
Steven said:

John said:

For example, text = 'foo bar', chunks = ['foobar']

This doesn't match the (admittedly vague) spec

That is *exactly* my point -- it is not valid input, and you are not
reporting all cases of invalid input; you have an exception where the
non-spaces are impossible, but no exception where whitespaces are
impossible.

Well, the input should never look like the above. But if for some
reason it did, I wouldn't want the error; I'd want the indices. So:
text = 'foo bar'
chunks = ['foobar']
should produce:
[(0, 7)]
not an exception.
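This requirement is where a character-by-character matcher and a word-counting one give different answers. The sketch below, in Python 3, puts simplified versions of both side by side; char_spans and word_spans are illustrative names, and neither is a verbatim copy of code posted in this thread.

```python
import re
from itertools import islice

def char_spans(text, chunks):
    """Match non-space characters one by one, skipping whitespace in text."""
    pos = 0
    for chunk in chunks:
        start = None
        for ch in chunk.replace(' ', ''):
            while pos < len(text) and text[pos].isspace():
                pos += 1
            if pos >= len(text) or text[pos] != ch:
                raise ValueError('unmatched character %r' % ch)
            if start is None:
                start = pos
            pos += 1
        if start is None:
            raise ValueError('empty chunk')
        yield start, pos

def word_spans(text, chunks):
    """Consume one run of non-whitespace per word in each chunk."""
    lumps = re.finditer(r'\S+', text)
    for chunk in chunks:
        n_words = len(chunk.split())
        lump = list(islice(lumps, n_words))
        if not lump or len(lump) < n_words:
            raise ValueError('ran out of text at chunk %r' % chunk)
        yield lump[0].start(), lump[-1].end()

print(list(char_spans('foo bar', ['foobar'])))  # [(0, 7)]
print(list(word_spans('foo bar', ['foobar'])))  # [(0, 3)]
```

Only the character-based version produces [(0, 7)] here; the word-based one sees a single word in the chunk and stops after 'foo'. On input that follows the spec the two agree.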

STeVe

Steven Bethard · Jul 1, 2005

Peter said:
import re
_reLump = re.compile(r"\S+")

def indices(text, chunks):
    lumps = _reLump.finditer(text)
    for chunk in chunks:
        lump = [lumps.next() for _ in chunk.split()]
        yield lump[0].start(), lump[-1].end()

Thanks, that's a really nice, clean solution!

STeVe
