Confused by slash/escape in regexp

andrew cooke · Apr 11, 2010

Is the third case here surprising to anyone else? It doesn't make
sense to me...

Python 2.6.2 (r262:71600, Oct 24 2009, 03:15:21)
[GCC 4.4.1 [gcc-4_4-branch revision 150839]] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Curious/confused,
Andrew

andrew cooke · Apr 12, 2010

In the first case, *python* will unescape the string literal '\x62' into
the letter 'b'. In the second case, python will unescape the double
backslash '\\' into a single backslash '\', and *regex* will unescape the
resulting '\x62' into 'b'. In the third case, *python* will unescape the
double backslash '\\' into a single backslash '\' and the hex escape
'\x62' into the letter 'b', so regex receives 'a\bc', where \b is a
special sequence to regex:
"""
\b Matches the empty string, but only at the start or end of a word.
"""

ah, brilliant! yes. thank-you very much!

andrew

Paul McGuire · Apr 12, 2010

Is the third case here surprising to anyone else? It doesn't make
sense to me...

Python 2.6.2 (r262:71600, Oct 24 2009, 03:15:21)
[GCC 4.4.1 [gcc-4_4-branch revision 150839]] on linux2
Type "help", "copyright", "credits" or "license" for more information.>>> from re import compile
Curious/confused,
Andrew

Here is your same session, but using raw string literals:

Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit
(Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
So I would say the surprise isn't that case 3 didn't match, but that
case 2 matched.
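
To illustrate the raw-string view, here is a small sketch. It is not the session from this post, and the subject string 'abc' is an assumption. With a raw literal Python leaves the backslashes alone, so the regex that the re module sees is exactly what is typed.

```python
import re

# Raw string literals: Python does no escape processing on backslashes.
assert r'a\x62c' == 'a\\x62c'   # case 2 written as a raw string
assert r'a\bc' == 'a\\\x62c'    # case 3 written as a raw string

# The re module itself decodes \x62 to 'b', so this matches 'abc'.
case2 = re.compile(r'a\x62c').match('abc')
# The re module reads \b as a word boundary, so this does not match.
case3 = re.compile(r'a\bc').match('abc')

print(case2 is not None)
print(case3 is not None)
```

Written this way, case 3 is visibly the regex a\bc, which is why it may be the less surprising of the two.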

Unless I just don't get what you were testing, not being an RE wiz.

-- Paul

andrew cooke · Apr 12, 2010

On Apr 11 said:
So I would say the surprise isn't that case 3 didn't match, but that
case 2 matched.

Unless I just don't get what you were testing, not being an RE wiz.

Case 2 is the regexp engine interpreting escapes that appear literally
in the pattern string. It's weird, because what's the point when Python
would do it for you anyway, but it seems to be the correct behaviour.

Andrew

MRAB · Apr 12, 2010

andrew said:
Is the third case here surprising to anyone else? It doesn't make
sense to me...

Python 2.6.2 (r262:71600, Oct 24 2009, 03:15:21)
[GCC 4.4.1 [gcc-4_4-branch revision 150839]] on linux2
Type "help", "copyright", "credits" or "license" for more information.

'a\x62c' is a string literal which is the same as 'abc', so re.compile
receives the characters:

abc

as the regex, which matches the string:

abc

'a\\x62c' is a string literal which represents the characters:

a\x62c

so re.compile receives these characters as the regex.

The re module has its own set of escape sequences, most of
which are the same as Python's string escape sequences. The re module
treats \x62 like the string escape, i.e. it represents the character 'b',
so this regex is the same as:

abc
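
A short sketch of the consequence, assuming you actually wanted to match the six literal characters a\x62c instead of abc. Because the re module decodes \x62 itself, the backslash has to be escaped for the regex as well; re.escape is one way to do that. The variable names here are only for illustration.

```python
import re

literal_text = r'a\x62c'   # six characters: a, backslash, x, 6, 2, c

# Used directly as a pattern, re decodes \x62 to 'b', so it matches
# 'abc' and not the literal text.
as_regex = re.compile(literal_text)
matches_abc = as_regex.match('abc') is not None
matches_literal = as_regex.match(literal_text) is not None

# re.escape produces a pattern that matches the text character for
# character, backslash included.
escaped = re.compile(re.escape(literal_text))
escaped_matches_literal = escaped.match(literal_text) is not None
escaped_matches_abc = escaped.match('abc') is not None

print(matches_abc, matches_literal)
print(escaped_matches_literal, escaped_matches_abc)
```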

'a\\\x62c' is a string literal which is the same as 'a\\bc', so
re.compile receives the characters:

a\bc

as the regex.

The re module treats the \b in a regex as representing a word boundary,
unless it's in a character set, e.g. [\b], where it represents a
backspace.

The regex will try to match a word boundary sandwiched between 2
letters, which can never happen.
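
A minimal sketch of the two meanings of \b. The subject strings are made up for illustration.

```python
import re

# \b between two letters: both neighbours are word characters, so there
# is no boundary at that position and the pattern cannot match anywhere.
never = re.compile(r'a\bc').search('abc a c ac')

# \b does match between a letter and a non-word character such as a space.
boundary = re.compile(r'a\b c').match('a c')

# Inside a character set, \b stands for the backspace character (0x08).
backspace = re.compile(r'a[\b]c').match('a\x08c')

print(never is None)
print(boundary is not None)
print(backspace is not None)
```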
