regex walkthrough

rh

Looking through some code, I found this and wondered what it does:
^(?P<salsipuedes>[0-9A-Za-z-_.//]+)$

Here's my walk through:

1) ^ match at start of string
2) ?P<salsipuedes> if a match is found it will be accessible in a variable
salsipuedes
3) [0-9A-Za-z-_.//] this is the one that looks wrong to me, see below
4) + one or more from the preceding char class
5) () the grouping we want returned (see #2)
6) $ end of the string to match against but before any newline
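Step 6 is easy to check in the interpreter. The following is only a sketch; the sample strings are made up for illustration and do not come from the code the pattern was found in:

```python
import re

# The pattern from the post, written as a raw string.
PATTERN = re.compile(r"^(?P<salsipuedes>[0-9A-Za-z-_.//]+)$")

# "$" matches at the very end of the string, and also just before
# a single trailing newline.
plain = PATTERN.match("some/path-1.txt")
with_newline = PATTERN.match("some/path-1.txt\n")
two_newlines = PATTERN.match("some/path-1.txt\n\n")

print(plain is not None)         # True
print(with_newline is not None)  # True, the newline itself is not captured
print(two_newlines is not None)  # False
print(repr(with_newline.group("salsipuedes")))  # 'some/path-1.txt'
```

If a trailing newline should be rejected as well, ending the pattern with \Z instead of $, or calling re.fullmatch, would be stricter.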


more on #3
the z-_ part looks wrong; it seems the - should be at the start
of the char set, otherwise we get another range z-_. Or does the a-z
preceding the z-_ prevent the z-_ from becoming a range? The "."
might be ok inside a char set. The two slashes look wrong, but maybe
they have some special meaning in some case? I think only one slash is
needed.

I've looked at pydoc re, but it's cursory.
 
Hans Mulder

Looking through some code, I found this and wondered what it does:
^(?P<salsipuedes>[0-9A-Za-z-_.//]+)$

Here's my walk through:

1) ^ match at start of string
2) ?P<salsipuedes> if a match is found it will be accessible in a
variable salsipuedes

I wouldn't call it a variable. If m is a match-object produced
by this regex, then m.group('salsipuedes') will return the part
that was captured.

I'm not sure, though, why you'd want to define a group that
effectively spans the whole regex. If there's a match, then
m.group(0) will return the matching substring, and
m.group('salsipuedes') will return the substring that matched
the parenthesized part of the pattern and these two substrings
will be equal, since the only bits of the pattern outside the
parentheses are zero-width assertions.

3) [0-9A-Za-z-_.//] this is the one that looks wrong to me, see below
4) + one or more from the preceding char class
5) () the grouping we want returned (see #2)
6) $ end of the string to match against but before any newline
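To illustrate the point about m.group(0) and m.group('salsipuedes') being equal, here is a small sketch. The input string is an arbitrary example chosen only because every character in it belongs to the set:

```python
import re

m = re.match(r"^(?P<salsipuedes>[0-9A-Za-z-_.//]+)$", "docs/read_me-v2.txt")

whole = m.group(0)              # everything the pattern matched
named = m.group("salsipuedes")  # what the named group captured

print(whole == named)  # True
print(m.groupdict())   # {'salsipuedes': 'docs/read_me-v2.txt'}
```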

more on #3
the z-_ part looks wrong; it seems the - should be at the start
of the char set, otherwise we get another range z-_. Or does the a-z
preceding the z-_ prevent the z-_ from becoming a range?

The latter: a-z is a range and blocks the z-_ from being a range.
Consequently, the -_ bit matches only - and _.
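One way to convince yourself of this, sketched with the char set on its own: the - and the _ match as literals, a neighbouring character such as ^ does not, and a real z-_ range would not even compile, since z sorts after _.

```python
import re

char_class = re.compile(r"[0-9A-Za-z-_.//]")

# "-" and "_" are matched as plain literals.
print(bool(char_class.fullmatch("-")))  # True
print(bool(char_class.fullmatch("_")))  # True

# "^" (code point 94) sits right next to "_" (95) and is not in the set.
print(bool(char_class.fullmatch("^")))  # False

# For comparison, a genuine z-_ range is rejected outright, because
# "z" (122) sorts after "_" (95).
try:
    re.compile(r"[z-_]")
    range_error = None
except re.error as exc:
    range_error = str(exc)
print(range_error)
```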

The "." might be ok inside a char set.

It is. Most special characters lose their special meaning
inside a char set.
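A minimal check of the "." case (the variable names are just for illustration):

```python
import re

# Outside a set, "." matches any character except a newline.
dot_outside = re.fullmatch(r".", "x")

# Inside a set, "." only matches a literal dot.
dot_inside_letter = re.fullmatch(r"[.]", "x")
dot_inside_dot = re.fullmatch(r"[.]", ".")

print(dot_outside is not None)        # True
print(dot_inside_letter is not None)  # False
print(dot_inside_dot is not None)     # True
```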

The two slashes look wrong but maybe it has some special meaning
in some case? I think only one slash is needed.

You're correct: there's no special meaning and only one slash
is needed. But then, a char set is a set and duplicates are
simply ignored, so it does no harm.
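As a rough sanity check, the single-slash and double-slash versions can be compared on a few inputs. The sample list below is made up and is not exhaustive:

```python
import re

single = re.compile(r"^[0-9A-Za-z-_./]+$")
double = re.compile(r"^[0-9A-Za-z-_.//]+$")

# Arbitrary samples: slashes, a backslash, an empty string, plain text.
samples = ["a/b", "a//b", "/", "a\\b", "", "x.y-z_1"]

results = [(bool(single.match(s)), bool(double.match(s))) for s in samples]
same_everywhere = all(a == b for a, b in results)

print(results)
print(same_everywhere)  # True
```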

Perhaps the person who wrote this was confusing slashes and
backslashes.

I've looked at pydoc re, but it's cursory.

That's one way of putting it.


Hope this helps,

-- HansM
 
rh

Looking through some code, I found this and wondered what it
does: ^(?P<salsipuedes>[0-9A-Za-z-_.//]+)$

Here's my walk through:

1) ^ match at start of string
2) ?P<salsipuedes> if a match is found it will be accessible in a
variable salsipuedes

I wouldn't call it a variable. If m is a match-object produced
by this regex, then m.group('salsipuedes') will return the part
that was captured.

I'm not sure, though, why you'd want to define a group that
effectively spans the whole regex. If there's a match, then
m.group(0) will return the matching substring, and
m.group('salsipuedes') will return the substring that matched
the parenthesized part of the pattern and these two substrings
will be equal, since the only bits of the pattern outside the
parentheses are zero-width assertions.

Good point, it's making the re engine do extra work.
It's not my code and that's another gap in the author's proficiency.
(I don't know who the author is, FWIW.)

3) [0-9A-Za-z-_.//] this is the one that looks wrong to me, see
below
4) + one or more from the preceding char class
5) () the grouping we want returned (see #2)
6) $ end of the string to match against but before any newline

more on #3
the z-_ part looks wrong; it seems the - should be at the start
of the char set, otherwise we get another range z-_. Or does the a-z
preceding the z-_ prevent the z-_ from becoming a range?

The latter: a-z is a range and blocks the z-_ from being a range.
Consequently, the -_ bit matches only - and _.

The "." might be ok inside a char set.

It is. Most special characters lose their special meaning
inside a char set.

The two slashes look wrong but maybe it has some special meaning
in some case? I think only one slash is needed.

You're correct: there's no special meaning and only one slash
is needed. But then, a char set is a set and duplicates are
simply ignored, so it does no harm.

I wonder if there's harm in the performance. Probably not,
but regex is tricky code and can be expensive even when written
well. For example, does this perform better than the original?
^(?P<salsipuedes>[-\w./]+)$

I'm not sure whether the \w sequence includes the - or the . or the /;
I think it does not.
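The \w question can be settled with a quick test. This sketch also shows one difference worth knowing about: for str patterns in Python 3, \w is Unicode-aware by default, so the shorter pattern accepts more than the original unless re.ASCII is passed. The accented sample string is only an illustration:

```python
import re

original = re.compile(r"^(?P<salsipuedes>[0-9A-Za-z-_.//]+)$")
proposed = re.compile(r"^(?P<salsipuedes>[-\w./]+)$")
proposed_ascii = re.compile(r"^(?P<salsipuedes>[-\w./]+)$", re.ASCII)

# \w on its own does not cover "-", "." or "/".
for ch in "-./":
    print(ch, bool(re.fullmatch(r"\w", ch)))  # False for each

# "caf" followed by e-acute (U+00E9), a Unicode word character.
accented = "caf\u00e9"

print(bool(original.match(accented)))        # False
print(bool(proposed.match(accented)))        # True
print(bool(proposed_ascii.match(accented)))  # False
```

Whether either form is measurably faster is a separate question; timeit on representative inputs would be the way to find out.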

Perhaps the person who wrote this was confusing slashes and
backslashes.

Possibly.


That's one way of putting it.


Hope this helps,

Does help, thanks.


--
 
