Simple regex with whitespaces

mathieu.malaterre

Hello,

I cannot figure out a way to find a regular expression that would
match one and only one of these two strings:

s1 = ' how are you'
s2 = '  hello world  how are you'

All I could come up with was:
patt = re.compile('^[ ]*([A-Za-z]+)[ ]+([A-Za-z]+)$')

Which of course does not work. I cannot express this rule: words within a
sentence are separated by 0 or 1 whitespace characters, while groups are
separated by two or more whitespace characters.

Any suggestion ? Thanks a bunch !
Mathieu
 
Mark Peters

How about this:
['', 'hello world', 'how are you']
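
The code that produced this list did not survive in the post. Here is a minimal sketch of one way to get the same result. It assumes the groups are separated by two or more whitespace characters and that s2 starts with at least two spaces. Both points are assumptions, and so is the choice of re.split:

```python
import re

# Assumed input: two leading spaces and two spaces between the groups.
s2 = '  hello world  how are you'

# Split wherever two or more whitespace characters occur in a row;
# single spaces inside a group are left alone.
groups = re.split(r'\s{2,}', s2)
print(groups)
```

The empty first item comes from the leading run of spaces, which the split treats as a separator with nothing in front of it.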
 
James Stroud

py> import re
py> s1 = ' how are you'
py> s2 = '  hello world  how are you'
py> s3 = 'group here  now here  but not here  but now here'
py> patt_2plus = re.compile(r'(?:(?:\S+(?:\s|$))+(?:\s+|$)){2,}')
py> patt_3plus = re.compile(r'(?:(?:\S+(?:\s|$))+(?:\s+|$)){3,}')

positive tests:

py> patt_2plus.search(s2).group(0)
'hello world  how are you'
py> patt_2plus.search(s3).group(0)
'group here  now here  but not here  but now here'
py> patt_3plus.search(s3).group(0)
'group here  now here  but not here  but now here'



negative tests:

py> patt_3plus.search(s2).group(0)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
AttributeError: 'NoneType' object has no attribute 'group'
py> patt_3plus.search(s1).group(0)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
AttributeError: 'NoneType' object has no attribute 'group'
py> patt_2plus.search(s1).group(0)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
AttributeError: 'NoneType' object has no attribute 'group'
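
The negative tests fail with AttributeError because search() returns None when nothing matches. A small sketch of the same pattern wrapped so that a miss becomes False instead of a traceback. The helper names at_least_n_groups and has_groups are made up for this illustration, and the sample strings assume two spaces between groups:

```python
import re

def at_least_n_groups(n):
    # Same pattern as above, with the minimum group count filled in.
    return re.compile(r'(?:(?:\S+(?:\s|$))+(?:\s+|$)){%d,}' % n)

def has_groups(text, n):
    # search() returns None on failure, so compare against None
    # rather than calling .group() on the result directly.
    return at_least_n_groups(n).search(text) is not None

s1 = ' how are you'
s2 = '  hello world  how are you'

print(has_groups(s2, 2))
print(has_groups(s1, 2))
print(has_groups(s2, 3))
```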



James

--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/
 
John Machin

1. A "word" is one or more non-whitespace characters -- subpattern is
\S+
2. A "sentence" is one or more words separated by a single whitespace
character, IOW a word followed by zero or more occurrences of
whitespace+word -- so a sentence will be matched by \S+(\s\S+)*
3. Leading and trailing runs of whitespace should be ignored -- use \s*
4. You will need to detect the case of 0 sentences (all whitespace)
separately -- I trust you don't need to be told how to do that :)
5. Don't try to match two or more sentences; match one sentence, and
anything that fails must be 0 or 2+ sentences.

So :

|>>> s1 = ' how are you'
|>>> s2 = '  hello world  how are you'
|>>> pat = r"^\s*\S+(\s\S+)*\s*$"
|>>> import re
|>>> re.match(pat, s1)
|<_sre.SRE_Match object at 0x00AED9E0>
|>>> re.match(pat, s2)
|>>> re.match(pat, '  ')
|>>> re.match(pat, ' a  b ')
|>>> re.match(pat, ' a b ')
|<_sre.SRE_Match object at 0x00AED8E0>
|>>> re.match(pat, ' ab ')
|<_sre.SRE_Match object at 0x00AED920>
|>>> re.match(pat, ' a ')
|<_sre.SRE_Match object at 0x00AED9E0>
|>>> re.match(pat, 'a')
|<_sre.SRE_Match object at 0x00AED8E0>
|>>>
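
Points 4 and 5 above can be combined into one small function. This is only a sketch: the name classify and the three labels it returns are invented here, and the sample strings assume two spaces between sentences.

```python
import re

# One sentence: words separated by single whitespace characters,
# with any amount of leading or trailing whitespace.
ONE_SENTENCE = re.compile(r"^\s*\S+(\s\S+)*\s*$")

def classify(text):
    # Point 4: handle the all-whitespace (or empty) case separately.
    if not text.strip():
        return "empty"
    # Point 5: match exactly one sentence; whatever fails has 2+.
    if ONE_SENTENCE.match(text):
        return "one"
    return "several"

print(classify(' how are you'))
print(classify('  hello world  how are you'))
print(classify('   '))
```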

HTH,
John
 
Paul McGuire

A pyparsing approach is not as terse as regexp's, but it's not terribly long
either. Following John Machin's submission as a pattern:

s1 = ' how are you'
s2 = '  hello world  how are you'

from pyparsing import Word, White, delimitedList, printables

wd = Word(printables)
# this is necessary to suppress pyparsing's built-in whitespace skipping
wd.leaveWhitespace()
# a sentence is a list of words delimited by exactly one space
sentence = delimitedList(wd, delim=White(' ',exact=1))

for test in (s1,s2):
    print(sentence.searchString(test))


Pyparsing returns data as ParseResults objects, which can be accessed as
lists or dicts. From this first cut, we get:
[['how', 'are', 'you']]
[['hello', 'world'], ['how', 'are', 'you']]

These aren't really sentences any more, but we can have pyparsing put them
back into sentences, by adding a parse action to sentence.

sentence.setParseAction(lambda toks: " ".join(toks))

Now our results are:
[['how are you']]
[['hello world'], ['how are you']]


If you really want to get fancy, and clean up some of that capitalization
and lack of punctuation, you can add a more elaborate parse action instead:
ispunc = lambda s: s in ".!?;:,"
sixLoyalServingMen = ('What','Why','When','How','Where','Who')
def cleanup(t):
    # capitalize the first word of the sentence
    t[0] = t[0].title()
    # add closing punctuation only if the last word does not already end with some
    if not ispunc( t[-1][-1] ):
        if t[0] in sixLoyalServingMen:
            punc = "?"
        else:
            punc = "."
    else:
        punc = ""
    return " ".join(t) + punc
sentence.setParseAction(cleanup)

This time we get:
[['How are you?']]
[['Hello world.'], ['How are you?']]
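
For comparison, the same clean-up can be tried with only the standard library. This sketch is not pyparsing code: it swaps in re.split for the grammar, the names tidy and tidy_all are invented here, and it assumes sentences are separated by two or more whitespace characters.

```python
import re

QUESTION_WORDS = ('What', 'Why', 'When', 'How', 'Where', 'Who')

def tidy(sentence):
    words = sentence.split()
    # Capitalize the first word.
    words[0] = words[0].title()
    # Add punctuation only if the sentence does not already end with some.
    if words[-1][-1] in ".!?;:,":
        punc = ""
    elif words[0] in QUESTION_WORDS:
        punc = "?"
    else:
        punc = "."
    return " ".join(words) + punc

def tidy_all(text):
    # Split on runs of 2+ whitespace characters and skip empty pieces.
    pieces = re.split(r"\s{2,}", text.strip())
    return [tidy(piece) for piece in pieces if piece]

print(tidy_all(' how are you'))
print(tidy_all('  hello world  how are you'))
```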


The pyparsing home page is at pyparsing.wikispaces.com.

-- Paul
 
James Stroud

Paul said:
A pyparsing approach is not as terse as regexp's, but it's not terribly long
either.

To second Paul's suggestion, usually, if the regex is not immediately
obvious, it's probably better to look into parsing modules, pyparsing
being one of the most accessible modules (to me, anyways)--and well
worth learning.

However, in complicated applications, regex is usually still fun and
valuable as an intellectual exercise.

James

 
John Machin

James said:
To second Paul's suggestion, usually, if the regex is not immediately
obvious, it's probably better to look into parsing modules, pyparsing
being one of the most accessible modules (to me, anyways)--and well
worth learning.

I would say that it's better, before leaping to the implementation, to
understand the problem. IOW, rough out the grammar first --- in the
current case, not very complicated at all; the way I looked at it there
are only two possible outcomes from the tokeniser (space and not-space)
and only three non-terminal symbols (word, sentence, paragraph) -- then
choose the implementation.
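
As a rough illustration of that outline, here is a sketch with a two-token tokeniser (space and not-space) feeding a tiny hand-written parser. The name parse_paragraph is invented here, and the rule that a run of two or more whitespace characters ends a sentence is taken from the original question.

```python
import re

def parse_paragraph(text):
    """Return a paragraph as a list of sentences, each a list of words."""
    sentences = []
    current = []
    # Tokeniser: every token is either a run of whitespace or a run of
    # non-whitespace characters.
    for token in re.findall(r"\s+|\S+", text):
        if token.isspace():
            # A run of 2+ whitespace characters closes the current sentence;
            # a single one just separates words.
            if len(token) >= 2 and current:
                sentences.append(current)
                current = []
        else:
            current.append(token)
    if current:
        sentences.append(current)
    return sentences

print(parse_paragraph(' how are you'))
print(parse_paragraph('  hello world  how are you'))
```

Counting the sentences in the returned list then answers the original question without a single large pattern.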

James said:
However, in complicated applications, regex is usually still fun and
valuable as an intellectual exercise.

Indeed. Better use of the mind and the CPU than distractions du jour
like pseudorubiku or whatever it's called :)

Cheers,
John
 
