Regular expression bug?

Ron Garret · Feb 19, 2009

I'm trying to split a CamelCase string into its constituent components.
This kind of works:

re.split('[a-z][A-Z]', 'fooBarBaz')

['fo', 'a', 'az']

but it consumes the boundary characters. To fix this I tried using
lookahead and lookbehind patterns instead, but it doesn't work:

re.split('((?<=[a-z])(?=[A-Z]))', 'fooBarBaz')

['fooBarBaz']

However, it does seem to work with findall:

re.findall('(?<=[a-z])(?=[A-Z])', 'fooBarBaz')

['', '']

So the regular expression seems to be doing the Right Thing. Is this a
bug in re.split, or am I missing something?
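
One way to confirm that the lookaround pattern finds the right positions is to look at the match offsets rather than the (empty) matched text. The sketch below does that with re.finditer and then cuts the string by hand; split_at_boundaries is a hypothetical helper written for illustration, not part of the re module.

```python
import re

BOUNDARY = r'(?<=[a-z])(?=[A-Z])'

def split_at_boundaries(pattern, text):
    # Hypothetical helper: cut text at every match of pattern.
    # With a zero-width pattern, start() == end(), so nothing is dropped.
    pieces = []
    last = 0
    for match in re.finditer(pattern, text):
        pieces.append(text[last:match.start()])
        last = match.end()
    pieces.append(text[last:])
    return pieces

print([m.start() for m in re.finditer(BOUNDARY, 'fooBarBaz')])  # [3, 6]
print(split_at_boundaries(BOUNDARY, 'fooBarBaz'))  # ['foo', 'Bar', 'Baz']
```

This does not depend on how re.split treats empty matches, because the slicing is done explicitly.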

(BTW, I tried looking at the source code for the re module, but I could
not find the relevant code. re.split calls sre_compile.compile().split,
but the string 'split' does not appear in sre_compile.py. So where does
this method come from?)

I'm using Python2.5.

Thanks,
rg

Albert Hopkins · Feb 19, 2009

I'm trying to split a CamelCase string into its constituent components.
This kind of works:

re.split('[a-z][A-Z]', 'fooBarBaz')

['fo', 'a', 'az']

but it consumes the boundary characters. To fix this I tried using
lookahead and lookbehind patterns instead, but it doesn't work:

That's how re.split works, same as str.split...

re.split('((?<=[a-z])(?=[A-Z]))', 'fooBarBaz')

['fooBarBaz']

However, it does seem to work with findall:

re.findall('(?<=[a-z])(?=[A-Z])', 'fooBarBaz')

['', '']

Wow!

To tell you the truth, I can't even read that... but one wonders why
don't you just do

import string

def ccsplit(s):
    cclist = []
    current_word = ''
    for char in s:
        # An uppercase letter starts a new word.
        if char in string.ascii_uppercase:
            if current_word:
                cclist.append(current_word)
            current_word = char
        else:
            current_word += char
    # Flush the final word.
    if current_word:
        cclist.append(current_word)
    return cclist

ccsplit('fooBarBaz')
--> ['foo', 'Bar', 'Baz']

This is arguably *much* easier to read than the re example, and it
doesn't require one to look ahead in the string.

-a

Kurt Smith · Feb 19, 2009

I'm trying to split a CamelCase string into its constituent components.
This kind of works:

re.split('[a-z][A-Z]', 'fooBarBaz')

['fo', 'a', 'az']

but it consumes the boundary characters. To fix this I tried using
lookahead and lookbehind patterns instead, but it doesn't work:

re.split('((?<=[a-z])(?=[A-Z]))', 'fooBarBaz')

['fooBarBaz']

However, it does seem to work with findall:

re.findall('(?<=[a-z])(?=[A-Z])', 'fooBarBaz')

['', '']

So the regular expression seems to be doing the Right Thing. Is this a
bug in re.split, or am I missing something?

From what I can tell, re.split can't split on zero-length boundaries.

It needs something to split on, like str.split. Is this a bug?
Possibly. The docs for re.split say:

Split the source string by the occurrences of the pattern,
returning a list containing the resulting substrings.

Note that it does not say that zero-length matches won't work.

I can work around the problem thusly:

re.sub(r'(?<=[a-z])(?=[A-Z])', '_', 'fooBarBaz').split('_')

Which is ugly. I reckon you can use re.findall with a pattern that
matches the components and not the boundaries, but you have to take
care of the beginning and end as special cases.
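
One caveat with the re.sub(...).split(...) workaround is that it misbehaves when the chosen delimiter already occurs in the input. A minimal sketch of a more defensive variant follows; split_via_sub and its sentinel parameter are hypothetical names, and the NUL character is just an assumed "unlikely" delimiter.

```python
import re

def split_via_sub(text, sentinel='\x00'):
    # Hypothetical wrapper around the sub-then-split workaround.
    # Refuse inputs that already contain the sentinel, since they would
    # be split in the wrong places.
    if sentinel in text:
        raise ValueError('sentinel occurs in input')
    marked = re.sub(r'(?<=[a-z])(?=[A-Z])', sentinel, text)
    return marked.split(sentinel)

print(split_via_sub('fooBarBaz'))   # ['foo', 'Bar', 'Baz']
print(split_via_sub('foo_barBaz'))  # ['foo_bar', 'Baz']
```

With '_' as the delimiter, the second input would instead come back as three pieces.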

Kurt

Peter Otten · Feb 19, 2009

Ron said:
I'm trying to split a CamelCase string into its constituent components.

How about

re.compile("[A-Za-z][a-z]*").findall("fooBarBaz")

['foo', 'Bar', 'Baz']
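
Note that this pattern treats every uppercase letter as the start of a new component, so runs of capitals come apart letter by letter. If acronyms and digits matter, a variation along the following lines may help; WORD is an assumed pattern written for illustration, not something proposed in this thread.

```python
import re

# Assumed pattern: an uppercase run that is followed by a capitalized word,
# or an optionally capitalized lowercase word, or a trailing uppercase run,
# or a run of digits.
WORD = re.compile(r'[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+')

print(re.findall('[A-Za-z][a-z]*', 'parseHTTPResponse'))
# ['parse', 'H', 'T', 'T', 'P', 'Response']
print(WORD.findall('parseHTTPResponse2'))
# ['parse', 'HTTP', 'Response', '2']
print(WORD.findall('fooBarBaz'))
# ['foo', 'Bar', 'Baz']
```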

This kind of works:

re.split('[a-z][A-Z]', 'fooBarBaz')

['fo', 'a', 'az']

but it consumes the boundary characters. To fix this I tried using
lookahead and lookbehind patterns instead, but it doesn't work:

re.split('((?<=[a-z])(?=[A-Z]))', 'fooBarBaz')

['fooBarBaz']

However, it does seem to work with findall:

re.findall('(?<=[a-z])(?=[A-Z])', 'fooBarBaz')

['', '']

So the regular expression seems to be doing the Right Thing. Is this a
bug in re.split, or am I missing something?

IIRC the split pattern must consume at least one character, but I can't find
the reference.

(BTW, I tried looking at the source code for the re module, but I could
not find the relevant code. re.split calls sre_compile.compile().split,
but the string 'split' does not appear in sre_compile.py. So where does
this method come from?)

It's coded in C. The source is Modules/_sre.c.
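
A quick way to see this from the interpreter, assuming CPython: the split method of a compiled pattern reports as a built-in, which is why searching the Python-level source for it finds nothing.

```python
import inspect
import re

pattern = re.compile('[a-z][A-Z]')

# On CPython the bound method is implemented in C, so inspect treats it
# as a built-in rather than a Python function.
print(inspect.isbuiltin(pattern.split))
print(inspect.isfunction(pattern.split))
```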

Peter

MRAB · Feb 19, 2009

Ron said:
I'm trying to split a CamelCase string into its constituent components.
This kind of works:

re.split('[a-z][A-Z]', 'fooBarBaz')

['fo', 'a', 'az']

but it consumes the boundary characters. To fix this I tried using
lookahead and lookbehind patterns instead, but it doesn't work:

re.split('((?<=[a-z])(?=[A-Z]))', 'fooBarBaz')

['fooBarBaz']

However, it does seem to work with findall:

re.findall('(?<=[a-z])(?=[A-Z])', 'fooBarBaz')

['', '']

So the regular expression seems to be doing the Right Thing. Is this a
bug in re.split, or am I missing something?

(BTW, I tried looking at the source code for the re module, but I could
not find the relevant code. re.split calls sre_compile.compile().split,
but the string 'split' does not appear in sre_compile.py. So where does
this method come from?)

I'm using Python2.5.

I, amongst others, think it's a bug (or 'misfeature'); Guido thinks it
might be intentional, but changing it could break some existing code.
You could do this instead:

>>> re.sub('(?<=[a-z])(?=[A-Z])', '@', 'fooBarBaz').split('@')

['foo', 'Bar', 'Baz']
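
For readers on a newer interpreter: since Python 3.7, re.split does split on empty matches, so the lookaround pattern works directly. A minimal sketch, where camel_split is a hypothetical helper and the version guard reflects the assumption that older interpreters either ignore or reject such patterns:

```python
import re
import sys

BOUNDARY = r'(?<=[a-z])(?=[A-Z])'

def camel_split(text):
    # Hypothetical helper; relies on Python 3.7+ splitting on empty matches.
    if sys.version_info < (3, 7):
        raise RuntimeError('re.split does not split on empty matches here')
    return re.split(BOUNDARY, text)

print(camel_split('fooBarBaz'))
# ['foo', 'Bar', 'Baz']
print(re.split('(' + BOUNDARY + ')', 'fooBarBaz'))
# ['foo', '', 'Bar', '', 'Baz']  (the capturing group adds the empty separators)
```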

Ron Garret · Feb 19, 2009

MRAB said:
Ron said:

I'm trying to split a CamelCase string into its constituent components.
This kind of works:

re.split('[a-z][A-Z]', 'fooBarBaz')

['fo', 'a', 'az']

but it consumes the boundary characters. To fix this I tried using
lookahead and lookbehind patterns instead, but it doesn't work:

re.split('((?<=[a-z])(?=[A-Z]))', 'fooBarBaz')

['fooBarBaz']

However, it does seem to work with findall:

re.findall('(?<=[a-z])(?=[A-Z])', 'fooBarBaz')

['', '']

So the regular expression seems to be doing the Right Thing. Is this a
bug in re.split, or am I missing something?

(BTW, I tried looking at the source code for the re module, but I could
not find the relevant code. re.split calls sre_compile.compile().split,
but the string 'split' does not appear in sre_compile.py. So where does
this method come from?)

I'm using Python2.5.

I, amongst others, think it's a bug (or 'misfeature'); Guido thinks it
might be intentional, but changing it could break some existing code.

That seems unlikely. It would only break where people had code invoking
re.split on empty matches, which at the moment is essentially a no-op.
It's hard to imagine there's a lot of code like that around. What would
be the point?

You could do this instead:

re.sub('(?<=[a-z])(?=[A-Z])', '@', 'fooBarBaz').split('@')

['foo', 'Bar', 'Baz']

Blech! ;-) But thanks for the suggestion.

rg

Ron Garret · Feb 19, 2009

Peter Otten said:
Ron said:

I'm trying to split a CamelCase string into its constituent components.

How about

re.compile("[A-Za-z][a-z]*").findall("fooBarBaz")

['foo', 'Bar', 'Baz']

That's very clever. Thanks!

It's coded in C. The source is Modules/_sre.c.

Ah. Thanks!

rg

Ron Garret · Feb 19, 2009

andrew cooke said:
i wonder what fraction of people posting with "bug?" in their titles here
actually find bugs?

IMHO it ought to be an invariant that len(r.split(s)) should always be
one more than len(r.findall(s)).
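
The proposed invariant can be checked directly. This is a sketch only: split_findall_invariant is a hypothetical name, and it assumes a pattern without capturing groups, since groups change what both functions return.

```python
import re

def split_findall_invariant(pattern, text):
    # True when split yields exactly one more piece than findall has matches.
    # Assumes no capturing groups in the pattern.
    return len(re.split(pattern, text)) == len(re.findall(pattern, text)) + 1

print(split_findall_invariant(',', 'a,b,c'))               # True
print(split_findall_invariant('[a-z][A-Z]', 'fooBarBaz'))  # True
# The zero-width case depends on the interpreter: False where re.split
# ignores empty matches (1 piece vs 2 matches), True on Python 3.7+.
print(split_findall_invariant('(?<=[a-z])(?=[A-Z])', 'fooBarBaz'))
```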

anyway, how about:

re.findall('[A-Z]?[a-z]*', 'fooBarBaz')

or

re.findall('([A-Z][a-z]*|[a-z]+)', 'fooBarBaz')

That will do it. Thanks!
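
A small difference between the two suggestions: the first pattern can match the empty string, so findall also reports an empty match at the end of the input, while the second cannot. A quick comparison:

```python
import re

first = re.findall('[A-Z]?[a-z]*', 'fooBarBaz')
second = re.findall('([A-Z][a-z]*|[a-z]+)', 'fooBarBaz')

print(first)   # ['foo', 'Bar', 'Baz', '']
print(second)  # ['foo', 'Bar', 'Baz']
```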

rg

Ron Garret · Feb 19, 2009

Albert Hopkins said:
I'm trying to split a CamelCase string into its constituent components.
This kind of works:

re.split('[a-z][A-Z]', 'fooBarBaz')

['fo', 'a', 'az']

but it consumes the boundary characters. To fix this I tried using
lookahead and lookbehind patterns instead, but it doesn't work:

That's how re.split works, same as str.split...

I think one could make the argument that 'foo'.split('') ought to return
['f','o','o']
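
For reference, str.split rejects an empty separator outright rather than returning the string unsplit, and list() already gives the per-character result. A short illustration, written for a current Python 3 interpreter:

```python
try:
    'foo'.split('')
except ValueError as exc:
    message = str(exc)

print(message)      # empty separator
print(list('foo'))  # ['f', 'o', 'o']
```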

re.split('((?<=[a-z])(?=[A-Z]))', 'fooBarBaz')

['fooBarBaz']

However, it does seem to work with findall:

re.findall('(?<=[a-z])(?=[A-Z])', 'fooBarBaz')

['', '']

Wow!

To tell you the truth, I can't even read that...

It's a regexp. Of course you can't read it. ;-)

rg

Steven D'Aprano · Feb 20, 2009

andrew said:
i wonder what fraction of people posting with "bug?" in their titles here
actually find bugs?

About 99.99%.

Unfortunately, 99.98% have found bugs in their code, not in Python.

Lie Ryan · Feb 20, 2009

Ron Garret said:
Peter Otten said:

Ron said:

I'm trying to split a CamelCase string into its constituent
components.

How about

re.compile("[A-Za-z][a-z]*").findall("fooBarBaz")

['foo', 'Bar', 'Baz']

That's very clever. Thanks!

It's coded in C. The source is Modules/_sre.c.

Ah. Thanks!

rg

This re.split() doesn't consume any characters:

re.split('([A-Z][a-z]*)', 'fooBarBaz')

['foo', 'Bar', '', 'Baz', '']

it does what the OP wants, albeit with extra blank strings.
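
The blank strings are not noise: with a capturing group, re.split returns the captured separators interleaved with the text between them, so even positions hold the in-between text (often empty here) and odd positions hold the captured words. A brief sketch of that layout:

```python
import re

parts = re.split('([A-Z][a-z]*)', 'fooBarBaz')

print(parts)        # ['foo', 'Bar', '', 'Baz', '']
print(parts[0::2])  # text between matches: ['foo', '', '']
print(parts[1::2])  # captured matches: ['Bar', 'Baz']
```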

umarpy · Feb 20, 2009

A more elegant way (with a = 'fooBarBaz'):

[x for x in re.split('([A-Z]+[a-z]+)', a) if x]

['foo', 'Bar', 'Baz']

R.

