Python regular expression question!

unexpected

I'm trying to do a whole word pattern match for the term 'MULTX-'

Currently, my regular expression syntax is:

re.search(('^')+(keyword+'\\b')

where keyword comes from a list of terms. ('MULTX-' is in this list,
and hence a keyword).

My regular expression works for a variety of different keywords except
for 'MULTX-'. It does work for MULTX, however, so I suspect that the
'-' sign is being treated as a word boundary. Is there any way to get
Python to override this word boundary?

I've tried using raw strings, but the syntax is painful. My attempts
were:

re.search(('^')+("r"+keyword+'\b')
re.search(('^')+("r'"+keyword+'\b')

and then tried the even simpler:

re.search(('^')+("r'"+keyword)
re.search(('^')+("r''"+keyword)


and all of those failed for everything. Any suggestions?
 
Hallvard B Furuseth

unexpected said:
I'm trying to do a whole word pattern match for the term 'MULTX-'

Currently, my regular expression syntax is:

re.search(('^')+(keyword+'\\b')

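\b matches the beginning/end of a word (characters a-zA-Z_0-9).
So that regex will match e.g. MULTX-FOO but not MULTX-: since '-' is
not a word character, there is a word boundary after it only when a
word character follows.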

Incidentally, in case the keyword contains regex special characters
(like '*') you may wish to escape it: re.escape(keyword).
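
To make that concrete, here is a small sketch (the test strings are made up for illustration, they are not from the original post). It also shows why the raw-string attempts went wrong: the r prefix is part of the string literal syntax, so concatenating the letter "r" only adds an 'r' to the pattern, and a plain '\b' is a backspace character rather than the regex word boundary.

```python
import re

keyword = "MULTX-"

# '\\b' and r'\b' are the same two-character regex token;
# a plain '\b' is a single backspace character.
assert '\\b' == r'\b'
assert '\b' == '\x08'

# Concatenating the letter "r" does not create a raw string.
assert "r" + keyword == "rMULTX-"

# Escape the keyword so any regex special characters are taken literally.
pattern = '^' + re.escape(keyword) + r'\b'

# After '-', \b only matches if a word character follows.
followed_by_word = re.search(pattern, "MULTX-FOO") is not None    # True
followed_by_space = re.search(pattern, "MULTX- FOO") is not None  # False
at_end = re.search(pattern, "MULTX-") is not None                 # False

# Without the trailing '-', the keyword ends in a word character,
# so \b matches before a space.
plain = re.search('^' + re.escape("MULTX") + r'\b', "MULTX FOO") is not None  # True

print(followed_by_word, followed_by_space, at_end, plain)
```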
 
unexpected

\b matches the beginning/end of a word (characters a-zA-Z_0-9).
So that regex will match e.g. MULTX-FOO but not MULTX-.

So is there a way to get \b to include - ?
 
Ant

unexpected said:
So is there a way to get \b to include - ?

No, but you can get the behaviour you want using a negative lookahead.
The following regex is effectively an end-of-word \b where - is treated
as a word character:

pattern = r"(?![a-zA-Z0-9_-])"

This succeeds at any position where the next character is not in the
class [a-zA-Z0-9_-] (including the end of the string), without
consuming that character. For example:
import re
p = re.compile(r".*?(?![a-zA-Z0-9_-])(.*)")
s = "aabbcc_d-f-.XXX YYY"
m = p.search(s)
print(m.group(1))
.XXX YYY

Note that the regex recognises the '.' as the end of the word, but
doesn't use it up in the match, so it is present in the final capturing
group. Contrast it with:
p = re.compile(r".*?[^a-zA-Z0-9_-](.*)")
s = "aabbcc_d-f-.XXX YYY"
m = p.search(s)
print(m.group(1))
XXX YYY

Note here that "[^a-zA-Z0-9_-]" still denotes the end of the word, but
this time consumes it, so it doesn't appear in the final captured group.
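
Applied to the original keyword problem, the lookahead could be used roughly like this. This is only a sketch: the helper name starts_with_keyword, the WORD_CHARS constant and the sample strings are hypothetical, and it assumes '-' should count as a word character for every keyword in the list.

```python
import re

# Characters treated as part of a word here: the usual \w set plus '-'
# (an assumption made for this sketch).
WORD_CHARS = r"[a-zA-Z0-9_-]"

def starts_with_keyword(keyword, text):
    """Return True if text starts with keyword as a whole word."""
    # re.escape keeps characters such as '-' or '*' in the keyword literal.
    pattern = "^" + re.escape(keyword) + "(?!" + WORD_CHARS + ")"
    return re.search(pattern, text) is not None

print(starts_with_keyword("MULTX-", "MULTX- FOO"))  # True
print(starts_with_keyword("MULTX-", "MULTX-"))      # True
print(starts_with_keyword("MULTX-", "MULTX-FOO"))   # False
print(starts_with_keyword("MULTX", "MULTX FOO"))    # True
print(starts_with_keyword("MULTX", "MULTX-FOO"))    # False
```

Note the last case: with '-' counted as a word character, MULTX no longer matches at the start of MULTX-FOO, which differs from what plain \b does. If that is not wanted, the character class can be adjusted per keyword.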
 
