Simple regex with whitespaces

mathieu.malaterre · Sep 11, 2006

Hello,

I cannot figure out a way to find a regular expression that would
match one and only one of these two strings:

s1 = '  how are you'
s2 = '  hello world  how are you'

All I could come up with was:
patt = re.compile('^[ ]*([A-Za-z]+)[ ]+([A-Za-z]+)$')

Which of course does not work. I cannot express the fact that the words
of a sentence are separated by 0 or 1 whitespace, while groups are
separated by two or more whitespaces.

Any suggestions? Thanks a bunch!
Mathieu

Mark Peters · Sep 11, 2006

How about this:
['', 'hello world', 'how are you']
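
A list like the one above can be produced by splitting on runs of two or more spaces. This is only a sketch, not necessarily the code used for that output: the pattern and the exact spacing of s2 (two leading spaces, two spaces between the groups) are assumptions.

```python
import re

# Assumed spacing: two leading spaces and two spaces between the groups.
s2 = '  hello world  how are you'

# Split wherever two or more consecutive spaces occur; single spaces
# inside a group are left alone.
groups = re.split(r' {2,}', s2)
print(groups)
```

The leading empty string comes from the run of spaces at the start of s2; it can be dropped by calling s2.strip() before splitting.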

James Stroud · Sep 11, 2006

py> import re
py> s1 = ' how are you'
py> s2 = ' hello world how are you'
py> s3 = 'group here now here but not here but now here'
py> patt_2plus = re.compile(r'(?:(?:\S+(?:\s|$))+(?:\s+|$)){2,}')
py> patt_3plus = re.compile(r'(?:(?:\S+(?:\s|$))+(?:\s+|$)){3,}')

positive tests:

py> patt_2plus.search(s2).group(0)
'hello world how are you'
py> patt_2plus.search(s3).group(0)
'group here now here but not here but now here'
py> patt_3plus.search(s3).group(0)
'group here now here but not here but now here'

negative tests:

py> patt_3plus.search(s2).group(0)
Traceback (most recent call last):
File "<stdin>", line 1, in ?
AttributeError: 'NoneType' object has no attribute 'group'
py> patt_3plus.search(s1).group(0)
Traceback (most recent call last):
File "<stdin>", line 1, in ?
AttributeError: 'NoneType' object has no attribute 'group'
py> patt_2plus.search(s1).group(0)
Traceback (most recent call last):
File "<stdin>", line 1, in ?
AttributeError: 'NoneType' object has no attribute 'group'
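
The session above can also be run as a script that checks the result of search() for None instead of raising AttributeError. This is a sketch under assumptions: the spacing in the strings (two spaces between groups) is assumed, s3 here is a hypothetical three-group string, and the names GROUP and first_match are made up for illustration.

```python
import re

# Assumed inputs: groups are separated by two spaces.
s1 = '  how are you'                  # one group
s2 = '  hello world  how are you'     # two groups
s3 = 'one two  three four  five six'  # hypothetical, three groups

# A group is one or more words, each followed by a single whitespace
# character or the end of the string; the group itself ends at extra
# whitespace or at the end of the string.
GROUP = r'(?:\S+(?:\s|$))+(?:\s+|$)'
patt_2plus = re.compile(r'(?:%s){2,}' % GROUP)
patt_3plus = re.compile(r'(?:%s){3,}' % GROUP)

def first_match(patt, text):
    """Return the matched text, or None when the pattern does not match."""
    m = patt.search(text)
    return m.group(0) if m else None

for name, text in (('s1', s1), ('s2', s2), ('s3', s3)):
    print(name, first_match(patt_2plus, text), first_match(patt_3plus, text))
```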

James

--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/

John Machin · Sep 11, 2006

1. A "word" is one or more non-whitespace characters -- subpattern is
\S+
2. A "sentence" is one or more words separated by a single whitespace,
IOW a word followed by zero or more occurrences of whitespace+word --
so a sentence will be matched by \S+(\s\S+)*
3. Leading and trailing runs of whitespace should be ignored -- use \s*
4. You will need to detect the case of 0 sentences (all whitespace)
separately -- I trust you don't need to be told how to do that
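
The four steps above might be sketched as follows. The names SENTENCE, SINGLE, find_sentences and is_single_sentence are hypothetical, and the spacing of the test strings (two spaces between groups) is an assumption.

```python
import re

# Step 2: a sentence is a word followed by zero or more (whitespace, word) pairs.
SENTENCE = re.compile(r'\S+(?:\s\S+)*')

# Step 3: the same pattern with leading and trailing whitespace ignored.
SINGLE = re.compile(r'\s*\S+(?:\s\S+)*\s*')

def find_sentences(text):
    """Return every sentence in text; all-whitespace input gives [] (step 4)."""
    return SENTENCE.findall(text)

def is_single_sentence(text):
    """True when text holds exactly one sentence, ignoring outer whitespace."""
    return SINGLE.fullmatch(text) is not None

s1 = '  how are you'               # assumed spacing
s2 = '  hello world  how are you'  # assumed spacing

print(find_sentences(s1), is_single_sentence(s1))
print(find_sentences(s2), is_single_sentence(s2))
```

Because \S+ can never cross a space, a run of two or more whitespace characters always ends a sentence, so the full-string match accepts s1 and rejects s2.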

Paul McGuire · Sep 11, 2006

A pyparsing approach is not as terse as regexp's, but it's not terribly long
either. Following John Machin's submission as a pattern:

s1 = '  how are you'
s2 = '  hello world  how are you'

from pyparsing import *

wd = Word(printables)
# this is necessary to suppress pyparsing's built-in whitespace skipping
wd.leaveWhitespace()
sentence = delimitedList(wd, delim=White(' ',exact=1))

for test in (s1,s2):
    print(sentence.searchString(test))

Pyparsing returns data as ParseResults objects, which can be accessed as
lists or dicts. From this first cut, we get:
[['how', 'are', 'you']]
[['hello', 'world'], ['how', 'are', 'you']]

These aren't really sentences any more, but we can have pyparsing put them
back into sentences, by adding a parse action to sentence.

sentence.setParseAction(lambda toks: " ".join(toks))

Now our results are:
[['how are you']]
[['hello world'], ['how are you']]

If you really want to get fancy, and clean up some of that capitalization
and lack of punctuation, you can add a more elaborate parse action instead:
ispunc = lambda s: s in ".!?;:,"
sixLoyalServingMen = ('What','Why','When','How','Where','Who')
def cleanup(t):
    # capitalize the first word of the sentence
    t[0] = t[0].title()
    # add closing punctuation only if the last word does not already end with some
    if not ispunc( t[-1][-1] ):
        if t[0] in sixLoyalServingMen:
            punc = "?"
        else:
            punc = "."
    else:
        punc = ""
    return " ".join(t) + punc
sentence.setParseAction(cleanup)

This time we get:
[['How are you?']]
[['Hello world.'], ['How are you?']]

The pyparsing home page is at pyparsing.wikispaces.com.

-- Paul

James Stroud · Sep 11, 2006

Paul said:
A pyparsing approach is not as terse as regexp's, but it's not terribly long
either.

To second Paul's suggestion, usually, if the regex is not immediately
obvious, it's probably better to look into parsing modules, pyparsing
being one of the most accessible modules (to me, anyways)--and well
worth learning.

However, in complicated applications, regex is usually still fun and
valuable as an intellectual exercise.

James


John Machin · Sep 11, 2006

James said:
To second Paul's suggestion, usually, if the regex is not immediately
obvious, it's probably better to look into parsing modules, pyparsing
being one of the most accessible modules (to me, anyways)--and well
worth learning.

I would say that it's better, before leaping to the implementation, to
understand the problem. IOW, rough out the grammar first --- in the
current case, not very complicated at all; the way I looked at it there
are only two possible outcomes from the tokeniser (space and not-space)
and only three non-terminal symbols (word, sentence, paragraph) -- then
choose the implementation.
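
As an illustration of that grammar, here is a minimal hand-rolled sketch. The function name parse_paragraph and the rule that a run of two or more whitespace characters separates sentences are assumptions taken from the original question, not a definitive implementation.

```python
import re

# Grammar (assumed from the discussion):
#   paragraph := sentence (gap sentence)*   gap   = two or more whitespace chars
#   sentence  := word (space word)*         space = exactly one whitespace char
#   word      := one or more non-whitespace chars

def parse_paragraph(text):
    """Split text into sentences using the two-token grammar above."""
    # The tokeniser yields only two kinds of token: space runs and not-space runs.
    tokens = re.findall(r'\s+|\S+', text)
    sentences, current = [], []
    for tok in tokens:
        if tok.isspace():
            # A run longer than one character closes the current sentence.
            if len(tok) > 1 and current:
                sentences.append(' '.join(current))
                current = []
        else:
            current.append(tok)
    if current:
        sentences.append(' '.join(current))
    return sentences

print(parse_paragraph('  how are you'))
print(parse_paragraph('  hello world  how are you'))
```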

James said:
However, in complicated applications, regex is usually still fun and
valuable as an intellectual exercise.

Indeed. Better use of the mind and the CPU than distractions du jour
like pseudorubiku or whatever it's called

Cheers,
John
