Confused by slash/escape in regexp

andrew cooke

Is the third case here surprising to anyone else? It doesn't make
sense to me...

Python 2.6.2 (r262:71600, Oct 24 2009, 03:15:21)
[GCC 4.4.1 [gcc-4_4-branch revision 150839]] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Curious/confused,
Andrew
 
andrew cooke

In the first case, *python* will unescape the string literal '\x62' into
the letter 'b'. In the second case, python will unescape the double
backslash '\\' into a single backslash '\' and *regex* will unescape the
backslash-x62 into 'b'. In the third case, *python* will unescape the
double backslash '\\' into a single backslash '\' and the hex escape
'\x62' into the letter 'b', so regex receives 'a\bc', where '\b' is
interpreted as a special sequence by regex:
"""
\b       Matches the empty string, but only at the start or end of a word.
"""

ah, brilliant! yes. thank-you very much!

andrew
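
The three cases explained in this post can be checked directly. What follows is a minimal sketch, not the original session: the helper name `describe` is made up for illustration, and the three literals are the ones quoted elsewhere in this thread.

```python
import re

# The three pattern literals discussed in this thread, as Python source.
patterns = ['a\x62c', 'a\\x62c', 'a\\\x62c']

def describe(pattern, text='abc'):
    """Return (characters the re module receives, whether it matches text)."""
    return list(pattern), re.match(pattern, text) is not None

for p in patterns:
    chars, matched = describe(p)
    # chars shows what is left after Python has processed the literal
    print(chars, matched)
```

The first two patterns match 'abc' and the third does not: by the time re sees it, the third pattern is the four characters a, backslash, b, c.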
 
Paul McGuire

Is the third case here surprising to anyone else?  It doesn't make
sense to me...

Python 2.6.2 (r262:71600, Oct 24 2009, 03:15:21)
[GCC 4.4.1 [gcc-4_4-branch revision 150839]] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from re import compile
Curious/confused,
Andrew

Here is your same session, but using raw string literals:

Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit
(Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
So I would say the surprise isn't that case 3 didn't match, but that
case 2 matched.

Unless I just don't get what you were testing, not being an RE wiz.

-- Paul
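
As a rough sketch of what raw string literals change (this is not Paul's session, just an illustration using the patterns from this thread): a raw literal passes its backslashes through untouched, so what you type is what the re module receives. The variable names are made up for the example.

```python
import re

# A raw literal keeps every backslash, so Python does no unescaping here.
same_as_case2 = (r'a\x62c' == 'a\\x62c')   # case 2 written as a raw literal
same_as_case3 = (r'a\bc' == 'a\\\x62c')    # case 3 written as a raw literal

# re itself decodes \x62 as 'b', so this pattern matches 'abc'.
case2 = re.match(r'a\x62c', 'abc')

# re reads \b as a word boundary, so this pattern does not match 'abc'.
case3 = re.match(r'a\bc', 'abc')

print(same_as_case2, same_as_case3, case2 is not None, case3 is not None)
```

Written raw, case 3 is visibly a\bc, which makes the word-boundary reading easier to spot than in 'a\\\x62c'.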
 
andrew cooke

So I would say the surprise isn't that case 3 didn't match, but that
case 2 matched.

Unless I just don't get what you were testing, not being an RE wiz.

Case 2 is the regexp engine interpreting escapes that reach it as
literal characters (a backslash followed by x62). It's weird, because
what's the point if Python would do it for you anyway, but it seems to
be the correct behaviour.

Andrew
 
MRAB

andrew said:
Is the third case here surprising to anyone else? It doesn't make
sense to me...

Python 2.6.2 (r262:71600, Oct 24 2009, 03:15:21)
[GCC 4.4.1 [gcc-4_4-branch revision 150839]] on linux2
Type "help", "copyright", "credits" or "license" for more information.

'a\x62c' is a string literal which is the same as 'abc', so re.compile
receives the characters:

abc

as the regex, which matches the string:

abc

'a\\x62c' is a string literal which represents the characters:

a\x62c

so re.compile receives these characters as the regex.

The re module has its own set of escape sequences, most of which are
the same as Python's string escape sequences. The re module treats \x62
like the string escape, i.e. it represents the character 'b', so this
regex is the same as:

abc

'a\\\x62c' is a string literal which is the same as 'a\\bc', so
re.compile receives the characters:

a\bc

as the regex.

The re module treats the \b in a regex as representing a word boundary,
unless it's in a character set, e.g. [\b], where it represents a
backspace character.

The regex will try to match a word boundary sandwiched between 2
letters, which can never happen.
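
A small sketch of the \b behaviour described above, assuming nothing beyond the standard re module; the variable names are only for illustration.

```python
import re

# \b between 'a' and 'c' can never succeed: after 'a' the next character
# must be 'c', and two adjacent letters have no word boundary between them.
never_1 = re.search(r'a\bc', 'abc')
never_2 = re.search(r'a\bc', 'a c')

# At the edge of a word, \b does match (it consumes no characters).
edge = re.search(r'\bbc\b', 'a bc d')

# Inside a character set, \b stands for the backspace character instead.
backspace = re.search(r'[\b]', 'a\x08c')

# To match a literal backslash followed by 'b', the regex needs \\ itself.
literal = re.search(r'a\\bc', 'a\\bc')

print(never_1, never_2, edge.group(), backspace is not None, literal.group())
```

The last line is the pattern to use if the intent really was to match the four characters a, backslash, b, c.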
 
