re.sub(): replace longest match instead of leftmost match?

John Gordon · Dec 16, 2011

According to the documentation on re.sub(), it replaces the leftmost
matching pattern.

However, I want to replace the *longest* matching pattern, which is
not necessarily the leftmost match. Any suggestions?

I'm working with IPv6 CIDR strings, and I want to replace the longest
match of "(0000:|0000$)+" with ":". But when I use re.sub() it replaces
the leftmost match, even if there is a longer match later in the string.

I'm also looking for a regexp that will remove leading zeroes in each
four-digit group, but will leave a single zero if the group was all
zeroes.

Thanks!

Devin Jeanpierre · Dec 16, 2011

You could use re.finditer to find the longest match, and then replace
it manually by hand (via string slicing).

(a match is the longest if (m.end() - m.start()) is the largest --
so, max(re.finditer(...), key=lambda m: (m.end() = m.start()))

-- Devin

P.S. does anyone else get bothered by how it's slice.start and
slice.stop, but match.start() and match.end() ?

MRAB · Dec 16, 2011

According to the documentation on re.sub(), it replaces the leftmost
matching pattern.

However, I want to replace the *longest* matching pattern, which is
not necessarily the leftmost match. Any suggestions?

I'm working with IPv6 CIDR strings, and I want to replace the longest
match of "(0000:|0000$)+" with ":". But when I use re.sub() it replaces
the leftmost match, even if there is a longer match later in the string.

I'm also looking for a regexp that will remove leading zeroes in each
four-digit group, but will leave a single zero if the group was all
zeroes.

How about this:

result = re.sub(r"\b0+(\d)\b", r"\1", string)

Ian Kelly · Dec 16, 2011

How about this:

result = re.sub(r"\b0+(\d)\b", r"\1", string)

Close.

pattern = r'\b0+([1-9a-f]+|0)\b'
re.sub(pattern, r'\1', string, flags=re.IGNORECASE)

Cheers,
Ian

Ian Kelly · Dec 16, 2011

How about this:

result = re.sub(r"\b0+(\d)\b", r"\1", string)

Click to expand...

Close.

pattern = r'\b0+([1-9a-f]+|0)\b'
re.sub(pattern, r'\1', string, flags=re.IGNORECASE)

Doh, that's still not quite right.

pattern = r'\b0{1,3}([1-9a-f][0-9a-f]*|0)\b'
re.sub(pattern, r'\1', string, flags=re.IGNORECASE)

MRAB · Dec 16, 2011

How about this:

result = re.sub(r"\b0+(\d)\b", r"\1", string)

Click to expand...

Close.

pattern = r'\b0+([1-9a-f]+|0)\b'
re.sub(pattern, r'\1', string, flags=re.IGNORECASE)

Ah, OK.

The OP said "digit" instead of "hex digit". That's my excuse.

Roy Smith · Dec 16, 2011

John Gordon said:
I'm working with IPv6 CIDR strings, and I want to replace the longest
match of "(0000:|0000$)+" with ":". But when I use re.sub() it replaces
the leftmost match, even if there is a longer match later in the string.

I'm also looking for a regexp that will remove leading zeroes in each
four-digit group, but will leave a single zero if the group was all
zeroes.

Having done quite a bit of IPv6 work, my opinion here is that you're
trying to do The Wrong Thing.

What you want is an IPv6 class which represents an address in some
canonical form. It would have constructors which accept any of the
RFC-2373 defined formats. It would also have string formatting methods
to convert the internal form into any of these formats.

Then, instead of attempting to regex your way directly from one string
representation to another, you would do something like:

addr_string = "FEDC:BA98:7654:3210:FEDC:BA98:7654:321"
print IPv6(addr_string).to_short_form()

John Gordon · Dec 16, 2011

In said:
You could use re.finditer to find the longest match, and then replace
it manually by hand (via string slicing).

(a match is the longest if (m.end() - m.start()) is the largest --
so, max(re.finditer(...), key=3Dlambda m: (m.end() =3D m.start()))

I ended up doing something similar:

# find the longest match
longest_match = ''
for word in re.findall('((0000:?)+)', ip6):
if len(word[0]) > len(longest_match):
longest_match = word[0]

# if we found a match, replace it with a colon
if longest_match:
ip6 = re.sub(longest_match, ':', ip6, 1)

Thanks!

John Gordon · Dec 16, 2011

pattern = r'\b0{1,3}([1-9a-f][0-9a-f]*|0)\b'
re.sub(pattern, r'\1', string, flags=re.IGNORECASE)

Perfect. Thanks Ian!

John Gordon · Dec 16, 2011

In said:
Having done quite a bit of IPv6 work, my opinion here is that you're
trying to do The Wrong Thing.

What you want is an IPv6 class which represents an address in some
canonical form. It would have constructors which accept any of the
RFC-2373 defined formats. It would also have string formatting methods
to convert the internal form into any of these formats.

Then, instead of attempting to regex your way directly from one string
representation to another, you would do something like:

addr_string = "FEDC:BA98:7654:3210:FEDC:BA98:7654:321"
print IPv6(addr_string).to_short_form()

This does sound like a more robust solution. I'll give it some thought.
Thanks Roy!

MRAB · Dec 16, 2011

In said:
In said:

You could use re.finditer to find the longest match, and then replace
it manually by hand (via string slicing).

Click to expand...

(a match is the longest if (m.end() - m.start()) is the largest --
so, max(re.finditer(...), key=3Dlambda m: (m.end() =3D m.start()))

Click to expand...

I ended up doing something similar:

# find the longest match
longest_match = ''
for word in re.findall('((0000:?)+)', ip6):
if len(word[0])> len(longest_match):
longest_match = word[0]

# if we found a match, replace it with a colon
if longest_match:
ip6 = re.sub(longest_match, ':', ip6, 1)

For a simple replace, using re is probably overkill. The .replace
method is a better solution:

ip6 = longest_match.replace(ip6, ':', 1)

Terry Reedy · Dec 16, 2011

What you want is an IPv6 class which represents an address in some
canonical form. It would have constructors which accept any of the
RFC-2373 defined formats. It would also have string formatting methods
to convert the internal form into any of these formats.

Then, instead of attempting to regex your way directly from one string
representation to another, you would do something like:

addr_string = "FEDC:BA98:7654:3210:FEDC:BA98:7654:321"
print IPv6(addr_string).to_short_form()

There are at least 2 third-party IP classes in use. I would not be
surprised if at least one of them does this.

ting · Dec 19, 2011

I'm working with IPv6 CIDR strings, and I want to replace the longest
match of "(0000:|0000$)+" with ":". But when I use re.sub() it replaces
the leftmost match, even if there is a longer match later in the string.

Typically this means that your regular expression is not specific
enough.

That is, if you get multiple matches, and you need to sort through
those matches before performing a replace, it usually means that you
should rewrite your expression to get a single match.

Invariably this happens when you try to take short cuts. I can't blame
you for using a short cut, as sometimes short cuts just work, but once
you find that your short cut fails, you need to step back and rethink
the problem, rather than try to hack your short cut.

I don't know what you are doing, but off the top of my head, I'd check
to see if the CIDR string is wrapped in a header message and include
the header as part of the search pattern, or if you know the IPv6
strings are interspersed with IPv4 strings, I would rewrite the regex
to exclude IPv4 strings.

Ian Kelly · Dec 20, 2011

Typically this means that your regular expression is not specific
enough.

That is, if you get multiple matches, and you need to sort through
those matches before performing a replace, it usually means that you
should rewrite your expression to get a single match.

Invariably this happens when you try to take short cuts. I can't blame
you for using a short cut, as sometimes short cuts just work, but once
you find that your short cut fails, you need to step back and rethink
the problem, rather than try to hack your short cut.

The problem isn't short cuts. To narrow down multiple matches to a
single longest match here, two additional criteria need to be met:
there must be no other match anywhere in the search string that is
longer than the match being considered, and there must be no match of
equal length preceding the match being considered.

Note that in the general case, the language is not regular and it
would not even be possible to get a single longest match using a
regular expression. For IPv6, it is possible only because IPv6
addresses are bounded in length. In English, that regular expression
would be constructed like this:

Any block of four or more groups of zeroes OR
Zero or more blocks containing (zero to two groups of zero followed by
a non-zero group) followed by a block of three or more groups of
zeroes followed by zero or more blocks containing (a non-zero group
followed by zero to three groups of zeroes) OR
Zero or more blocks containing (an optional zero group followed by a
non-zero group) followed by a block of two or more groups of zeroes
followed by zero or more blocks containing (a non-zero group followed
by zero to to two groups of zeroes) OR
Zero or more non-zero groups followed by a group of zeroes followed by
zero or more blocks containing (a non-zero group followed by an
optional zero group).

If anyone wants to give a crack at translating that to an actual
regular expression, I'd be interested in seeing it. The added
complexity is so great, though, that I for one would vastly prefer the
simple solution already proposed of getting all the matches and
iterating to find the longest.

Cheers,
Ian

re.sub do not replace portion of match	1	Oct 2, 2009
[SUMMARY] Longest Repeated Substring (#153)	0	Jan 24, 2008
match with string instead of regexp fails	2	Oct 11, 2006
FAQ 6.12 Can I use Perl regular expressions to match balanced text?	0	Jan 9, 2011
FAQ 6.23 How can I match strings with multibyte characters?	0	Jan 11, 2011
questions of idiom	3	Jun 7, 2010
anybody help me	1	Feb 10, 2006
FAQ 4.32 How do I strip blank space from the beginning/end of a string?	0	Feb 25, 2011

re.sub(): replace longest match instead of leftmost match?

John Gordon

Devin Jeanpierre

MRAB

Ian Kelly

Ian Kelly

MRAB

Roy Smith

John Gordon

John Gordon

John Gordon

MRAB

Terry Reedy

ting

Ian Kelly

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads