Hello,
I cannot figure out a way to find a regular expression that would
match one and only one of these two strings:
s1 = ' how are you'
s2 = ' hello world  how are you'
All I could come up with was:
patt = re.compile('^[ ]*([A-Za-z]+)[ ]+([A-Za-z]+)$')
Which of course does not work. I cannot express the rule that, within a
sentence, words are separated by at most one space, while groups are
separated by two or more spaces.
Any suggestions? Thanks a bunch!
Mathieu
A pyparsing approach is not as terse as a regexp, but it's not terribly long
either. Following John Machin's submission as a pattern:
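s1 = ' how are you'
# the two groups in s2 are separated by two or more spaces
s2 = ' hello world  how are you'
from pyparsing import Word, White, delimitedList, printables
wd = Word(printables)
# this is necessary to suppress pyparsing's built-in whitespace skipping
wd.leaveWhitespace()
# words in a sentence are delimited by exactly one space
sentence = delimitedList(wd, delim=White(' ',exact=1))
for test in (s1,s2):
    print(sentence.searchString(test))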
Pyparsing returns data as ParseResults objects, which can be accessed as
lists or dicts. From this first cut, we get:
[['how', 'are', 'you']]
[['hello', 'world'], ['how', 'are', 'you']]
These aren't really sentences any more, but we can have pyparsing put them
back into sentences, by adding a parse action to sentence.
sentence.setParseAction(lambda toks: " ".join(toks))
Now our results are:
[['how are you']]
[['hello world'], ['how are you']]
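For comparison, the same grouping can be sketched with the standard re module alone. This is only a rough equivalent, assuming (as the question describes) that words within a group are separated by a single space and groups by two or more; split_groups is a hypothetical helper name, and the test strings are written out with an explicit double space.

```python
import re

# Assumed inputs: one space between words, two or more between groups.
s1 = " how are you"
s2 = " hello world  how are you"

def split_groups(text):
    """Split text into groups on runs of two or more spaces."""
    # strip() drops leading/trailing whitespace so no empty group appears
    return re.split(r" {2,}", text.strip())

for test in (s1, s2):
    print(split_groups(test))
```

Unlike the pyparsing version, this returns plain lists of strings and does not tokenize the individual words.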
If you really want to get fancy, and clean up some of that capitalization
and lack of punctuation, you can add a more elaborate parse action instead:
ispunc = lambda s: s in ".!?;:,"
sixLoyalServingMen = ('What','Why','When','How','Where','Who')
def cleanup(t):
    t[0] = t[0].title()
    if not ispunc( t[-1][-1] ):
        if t[0] in sixLoyalServingMen:
            punc = "?"
        else:
            punc = "."
    else:
        punc = ""
    return " ".join(t) + punc
sentence.setParseAction(cleanup)
This time we get:
[['How are you?']]
[['Hello world.'], ['How are you?']]
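The cleanup rule can also be tried on plain lists of words, without pyparsing installed. The following is a minimal sketch of the same logic, assuming the tokens arrive as a non-empty list of strings; cleanup_words, is_punc and QUESTION_WORDS are illustrative names, not part of pyparsing.

```python
def is_punc(ch):
    """True if ch is one of the punctuation marks we leave alone."""
    return ch in ".!?;:,"

QUESTION_WORDS = ("What", "Why", "When", "How", "Where", "Who")

def cleanup_words(words):
    """Capitalize the first word and add closing punctuation if missing."""
    words = list(words)  # work on a copy, leave the caller's list untouched
    words[0] = words[0].title()
    if is_punc(words[-1][-1]):
        punc = ""        # sentence already ends with punctuation
    elif words[0] in QUESTION_WORDS:
        punc = "?"       # starts with a question word
    else:
        punc = "."
    return " ".join(words) + punc

print(cleanup_words(["how", "are", "you"]))      # How are you?
print(cleanup_words(["hello", "world"]))         # Hello world.
print(cleanup_words(["thanks", "a", "bunch!"]))  # Thanks a bunch!
```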
The pyparsing home page is at pyparsing.wikispaces.com.
-- Paul