Help with splitting

RickMuller

I'm trying to split a string into pieces on whitespace, but I want to
save the whitespace characters rather than discarding them.

For example, I want to split the string '1 2' into ['1',' ','2'].
I was certain that there was a way to do this using the standard string
functions, but I just spent some time poring over the documentation
without finding anything.

There's a chance I was instead thinking of something in the re module,
but I also spent some time there without luck. Could someone point me
to the right function, if it exists?

Thanks in advance.

R.
 
Jeremy Bowers

RickMuller said:
I'm trying to split a string into pieces on whitespace, but I want to
save the whitespace characters rather than discarding them.

For example, I want to split the string '1 2' into ['1',' ','2'].
I was certain that there was a way to do this using the standard string
functions, but I just spent some time poring over the documentation
without finding anything.

Python 2.3.5 (#1, Mar 3 2005, 17:32:12)
[GCC 3.4.3 (Gentoo Linux 3.4.3, ssp-3.4.3-0, pie-8.7.6.6)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> whitespaceSplitter = re.compile("(\w+)")
>>> whitespaceSplitter.split("1 2 3 \t\n5")
['', '1', ' ', '2', ' ', '3', ' \t\n', '5', '']
>>> whitespaceSplitter.split(" 1 2 3 \t\n5 ")
[' ', '1', ' ', '2', ' ', '3', ' \t\n', '5', ' ']

Note the null strings at the beginning and end of the first result: they
show up whenever the string starts or ends with a match of the split RE
(here, a run of word characters). Comparing it with the second invocation
should show why they are there, though darned if I can think of a good way
to put it into words.
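
For anyone who wants to poke at the empty strings, here is a minimal sketch in current Python. It assumes raw-string patterns and the module-level re.split rather than the compiled object above, and it adds a whitespace-based pattern, "(\s+)", that was not part of the original session. The capturing group is what makes split keep the separators; an empty string appears wherever a separator match touches the start or end of the input.

```python
import re

text = "1 2 3 \t\n5"

# Split on runs of whitespace; the capturing group keeps the separators.
by_space = re.split(r"(\s+)", text)

# Split on runs of word characters instead, as in the session above.
by_word = re.split(r"(\w+)", text)

# With the whitespace pattern, padding the input brings the empty strings back.
padded = re.split(r"(\s+)", " 1 2 ")

print(by_space)  # ['1', ' ', '2', ' ', '3', ' \t\n', '5']
print(by_word)   # ['', '1', ' ', '2', ' ', '3', ' \t\n', '5', '']
print(padded)    # ['', ' ', '1', ' ', '2', ' ', '']
```

Either pattern keeps every character, so ''.join(...) of the result gives back the original string.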
 
Brian Beck

RickMuller said:
There's a chance I was instead thinking of something in the re module,
but I also spent some time there without luck. Could someone point me
to the right function, if it exists?

The re solution Jeremy Bowers posted is what you want. Here's another
(probably much slower) way for fun (with no surrounding empty strings):

py> from itertools import groupby
py> [''.join(g) for k, g in groupby(' test ing ', lambda x: x.isspace())]
[' ', 'test', ' ', 'ing', ' ']


I tried replacing the lambda thing with an attrgetter, but apparently my
understanding of that isn't perfect... it groups by the identity of the
bound method instead of calling it...
 
Raymond Hettinger

[Brian Beck]
py> from itertools import groupby
py> [''.join(g) for k, g in groupby(' test ing ', lambda x: x.isspace())]
[' ', 'test', ' ', 'ing', ' ']

Brilliant solution!

That leads to a better understanding of groupby as a tool for identifying
transitions without consuming them.

[Brian Beck]
I tried replacing the lambda thing with an attrgetter, but apparently my
understanding of that isn't perfect... it groups by the identity of the
bound method instead of calling it...

Right.
attrgetter gets but does not call.

If unicode isn't an issue, then the lambda can be removed:
>>> [''.join(g) for k, g in groupby(' test ing ', str.isspace)]
[' ', 'test', ' ', 'ing', ' ']
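
If a "get and call" helper is wanted instead of the lambda, one option is operator.methodcaller. This is only a sketch and assumes a Python newer than the one used in this thread (methodcaller arrived in 2.6). Unlike the unbound str.isspace, it calls the method on whatever object it is handed, so it does not depend on the string type.

```python
from itertools import groupby
from operator import methodcaller

text = ' test ing '

# methodcaller('isspace') returns a callable f such that f(c) == c.isspace(),
# which is the "get and call" behavior that attrgetter lacks.
is_space = methodcaller('isspace')

pieces = [''.join(g) for _, g in groupby(text, is_space)]
print(pieces)  # [' ', 'test', ' ', 'ing', ' ']
```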



Raymond Hettinger
 
Jeremy Bowers

Brian Beck said:
py> from itertools import groupby
py> [''.join(g) for k, g in groupby(' test ing ', lambda x: x.isspace())]
[' ', 'test', ' ', 'ing', ' ']

I tried replacing the lambda thing with an attrgetter, but apparently my
understanding of that isn't perfect... it groups by the identity of the
bound method instead of calling it...

Unfortunately, as you pointed out, it is slower:

python timeit.py -s \
  "import re; x = 'a ab c' * 1000; whitespaceSplitter = re.compile('(\w+)')" \
  "whitespaceSplitter.split(x)"
100 loops, best of 3: 9.47 msec per loop

python timeit.py -s \
  "from itertools import groupby; x = 'a ab c' * 1000;" \
  "[''.join(g) for k, g in groupby(x, lambda y: y.isspace())]"
10 loops, best of 3: 65.8 msec per loop

(tried to break it up to be easier to read)

But I like yours much better theoretically. It's also a pretty good demo
of "groupby".
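
The same comparison can be run from a script with the timeit module instead of the command line. A rough sketch, assuming current Python; the function names with_re and with_groupby are made up for illustration, and the numbers will differ from the ones above depending on machine and interpreter.

```python
import re
import timeit
from itertools import groupby

x = 'a ab c' * 1000
splitter = re.compile(r'(\w+)')

def with_re(s):
    # Regex approach: the capturing group keeps the separators.
    return splitter.split(s)

def with_groupby(s):
    # groupby approach: group consecutive characters by "is whitespace".
    return [''.join(g) for k, g in groupby(s, lambda c: c.isspace())]

# Sanity check: apart from the empty strings at the edges of the regex
# result, both approaches produce the same pieces.
assert [p for p in with_re(x) if p] == with_groupby(x)

loops = 100
for func in (with_re, with_groupby):
    seconds = timeit.timeit(lambda: func(x), number=loops)
    print("%s: %.3f msec per loop" % (func.__name__, seconds / loops * 1000))
```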
 
RickMuller

Thanks to everyone who responded!! I guess I have to study my regular
expressions a little more closely.
 
George Sakkis

Jeremy said:
I'm trying to split a string into pieces on whitespace, but I want to
save the whitespace characters rather than discarding them.

For example, I want to split the string '1 2' into ['1',' ','2'].
I was certain that there was a way to do this using the standard string
functions, but I just spent some time poring over the documentation
without finding anything.

Python 2.3.5 (#1, Mar 3 2005, 17:32:12)
[GCC 3.4.3 (Gentoo Linux 3.4.3, ssp-3.4.3-0, pie-8.7.6.6)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> whitespaceSplitter = re.compile("(\w+)")
>>> whitespaceSplitter.split("1 2 3 \t\n5")
['', '1', ' ', '2', ' ', '3', ' \t\n', '5', '']
>>> whitespaceSplitter.split(" 1 2 3 \t\n5 ")
[' ', '1', ' ', '2', ' ', '3', ' \t\n', '5', ' ']

Note the null strings at the beginning and end of the first result: they
show up whenever the string starts or ends with a match of the split RE
(here, a run of word characters). Comparing it with the second invocation
should show why they are there, though darned if I can think of a good way
to put it into words.

If you don't want any null strings at the beginning or the end, an
equivalent regexp is:
>>> whitespaceSplitter_2 = re.compile("\w+|\s+")
>>> whitespaceSplitter_2.findall("1 2 3 \t\n5")
['1', ' ', '2', ' ', '3', ' \t\n', '5']
>>> whitespaceSplitter_2.findall(" 1 2 3 \t\n5 ")
[' ', '1', ' ', '2', ' ', '3', ' \t\n', '5', ' ']


George
 
Reinhold Birkenfeld

George said:
If you don't want any null strings at the beginning or the end, an
equivalent regexp is:
>>> whitespaceSplitter_2 = re.compile("\w+|\s+")
>>> whitespaceSplitter_2.findall("1 2 3 \t\n5")
['1', ' ', '2', ' ', '3', ' \t\n', '5']
>>> whitespaceSplitter_2.findall(" 1 2 3 \t\n5 ")
[' ', '1', ' ', '2', ' ', '3', ' \t\n', '5', ' ']

Perhaps you may want to use "\s+|\S+" if you have non-alphanumeric
characters in the string.
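
To make the difference concrete, here is a small sketch (current Python, raw-string patterns; the sample string is made up for illustration). With "\w+|\s+", findall silently skips anything that is neither a word character nor whitespace, whereas "\s+|\S+" accounts for every character.

```python
import re

sample = "foo-bar, baz!"  # hypothetical input containing punctuation

word_or_space = re.findall(r"\w+|\s+", sample)
space_or_nonspace = re.findall(r"\s+|\S+", sample)

print(word_or_space)      # ['foo', 'bar', ' ', 'baz']
print(space_or_nonspace)  # ['foo-bar,', ' ', 'baz!']

# Only the second pattern covers every character, so joining its
# pieces rebuilds the input exactly.
print(''.join(space_or_nonspace) == sample)  # True
```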

Reinhold
 
