Regular expressions

Zeba

Hi guys,

I need some help regarding regular expressions. Consider the following
statement:

System.Text.RegularExpressions.Match match =
    System.Text.RegularExpressions.Regex.Match(requestPath, "([^/]*?\\.ashx)");

(where requestPath is a string)

What does the regex [^/]*?\\.ashx actually do? How come * and ?
occur consecutively?
Doesn't '?' require some text/block of text before it?
Also, is the expression read left to right or right to left?
i.e. is the backslash sequence grouped as '\\' followed by '.', or as
'\' followed by '\.'? If it is the former, why is it not written as
\\\. and if the latter, what does the orphaned backslash do?

Hope that's not too many questions - I'm too confused!

Thanks !
 
Zeba

Oops, I guess that should go to the CSharp forum, but do let me know
if you can help me.

Thanks !
 
Kevin Spencer

Regular Expressions are a powerful way to match patterns of characters in
strings.

The Regular Expression engine is basically procedural in nature, examining a
string one character at a time, but although it moves from left to right
through the string, it has the capability to move (jump) backwards as well,
and to keep track of multiple matches, groups, and so on.

What it does is to use a syntax that identifies sequences of characters in a
string. In your example,

[^/]*?

starts with what is called a "character class," [^/], followed by a
quantifier. A character class is a set of characters, and it matches exactly
one character from that set at the current position. The characters in the
set are identified by the [square brackets] surrounding them. A '^' as the
first character inside the brackets negates the class, which means it matches
any single character that is NOT in the set. The '/' character is the only
character in this particular set, so [^/] matches any one character except a
forward slash.

The character following the character class is a quantifier. It indicates
how many times in a row the preceding element (here, the character class) may
match. The '*' character signifies "zero or more." Some other quantifiers are
'+' (one or more), '?' (zero or one), and sets of numbers in curly brackets,
for example: {2} (exactly 2), {1,5} (between 1 and 5 inclusive).
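
As a rough illustration of these quantifiers applied to the [^/] class, here is a small C# sketch. The class name QuantifierDemo, the helper FirstMatch and the sample strings are made up for this example; only Regex.Match and the Match properties come from the .NET library.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // Returns the text of the first match, or "(no match)" if the pattern fails.
    public static string FirstMatch(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : "(no match)";
    }

    public static void Main()
    {
        // Brackets are printed around each result so an empty match is visible.
        Console.WriteLine("[" + FirstMatch("abc/def", "[^/]+") + "]");     // [abc]
        Console.WriteLine("[" + FirstMatch("abc/def", "[^/]{2}") + "]");   // [ab]
        Console.WriteLine("[" + FirstMatch("abc/def", "[^/]{1,5}") + "]"); // [abc]
        // '*' allows zero repetitions, so it succeeds with an empty match
        // at position 0, in front of the slash.
        Console.WriteLine("[" + FirstMatch("/def", "[^/]*") + "]");        // []
        // '+' needs at least one character, so the engine moves past the slash.
        Console.WriteLine("[" + FirstMatch("/def", "[^/]+") + "]");        // [def]
    }
}
```

Note that none of the matches ever contains a '/', because the class excludes it.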

The '?' following the '*' in this case is NOT a quantifier. Its meaning is
determined by its context in the pattern. If it immediately followed the
character class it would be a quantifier, but because it follows the
quantifier, it modifies the quantifier. It indicates that the quantifier is
"lazy" as opposed to "greedy." This is a little harder to explain.
Quantifiers are "greedy" by default. That is, a greedy quantifier first takes
as many repetitions as it can, and gives characters back (backtracks) only if
the rest of the pattern then fails to match. A lazy quantifier does the
opposite: it starts with as few repetitions as it is allowed, and takes more
only when the rest of the pattern cannot match otherwise.

For example, if you are looking for an HTML tag in a document, you might
think the following would work:

<.+> (a left angle bracket, followed by any non-line-break character one or
more times, followed by a right angle bracket)

If you were looking at the following HTML:

<a href="blah">Click Here</a>

You might think that it would capture the opening tag. However, it would
capture the entire string. Why? Because '.' matches any non-line-break
character, and the right angle bracket in the opening tag is one of those.
Yes, the match MUST end in a right angle bracket. However, since the '+' is
greedy, '.+' first runs to the end of the line and then backs up only as far
as the last right angle bracket it can find, which here is the one that
closes </a>.

If you were to use the following instead:

<.+?>

It would stop at the first right-angle bracket. This is because the '?'
means that as few non-line-break characters as possible should match before
the right angle bracket.
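
A minimal C# sketch of the difference, using the HTML fragment above. The class and member names (GreedyLazyDemo, Greedy, Lazy) are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyLazyDemo
{
    public const string Html = "<a href=\"blah\">Click Here</a>";

    // Greedy: '.+' grabs as much as it can, then backtracks to the last '>'.
    public static string Greedy()
    {
        return Regex.Match(Html, "<.+>").Value;
    }

    // Lazy: '.+?' takes as little as it can, so it stops at the first '>'.
    public static string Lazy()
    {
        return Regex.Match(Html, "<.+?>").Value;
    }

    public static void Main()
    {
        Console.WriteLine(Greedy()); // <a href="blah">Click Here</a>
        Console.WriteLine(Lazy());   // <a href="blah">
    }
}
```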

You could also do the following:

<[^>]+>

This means that no right angle bracket can be part of the match before the
one at the end, so the match always ends at the first right angle bracket.
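
Here is a short C# sketch of the negated-class version. It also shows one case where it behaves differently from the lazy pattern: by default '.' does not match a line break, while [^>] does. The names NegatedClassDemo and AllTags, and the sample strings, are assumptions for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NegatedClassDemo
{
    // Collects every substring that looks like a tag according to <[^>]+>.
    public static List<string> AllTags(string html)
    {
        List<string> tags = new List<string>();
        foreach (Match m in Regex.Matches(html, "<[^>]+>"))
        {
            tags.Add(m.Value);
        }
        return tags;
    }

    public static void Main()
    {
        foreach (string tag in AllTags("<a href=\"blah\">Click Here</a>"))
        {
            Console.WriteLine(tag); // <a href="blah"> and then </a>
        }

        // A tag split across two lines.
        string multiLine = "<a\nhref=\"blah\">";
        Console.WriteLine(Regex.IsMatch(multiLine, "<.+?>"));   // False
        Console.WriteLine(Regex.IsMatch(multiLine, "<[^>]+>")); // True
    }
}
```

Neither pattern is a real HTML parser, of course; they only find text between angle brackets.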

Here's a good reference on using Regular Expressions with the .Net platform:

http://msdn2.microsoft.com/en-us/library/hs600312.aspx
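
To tie this back to the statement in your question, here is a sketch of the whole pattern at work. In a regular C# string literal the compiler turns "\\" into a single backslash before the Regex class ever sees it, so the engine receives ([^/]*?\.ashx), where \. means a literal dot rather than "any character". There is no orphaned backslash. The class name AshxDemo, the helper FindHandler and the sample paths are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AshxDemo
{
    // Same pattern as in the question, written as a regular string literal.
    public const string Pattern = "([^/]*?\\.ashx)";

    // Returns the captured handler name, or an empty string if nothing matches.
    public static string FindHandler(string requestPath)
    {
        Match match = Regex.Match(requestPath, Pattern);
        return match.Success ? match.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        // A verbatim literal (@"...") needs only one backslash for the same pattern.
        Console.WriteLine(Pattern == @"([^/]*?\.ashx)");              // True

        // [^/] cannot cross a slash, so only the last path segment is captured.
        Console.WriteLine(FindHandler("/handlers/image.ashx"));       // image.ashx

        // Without the backslash, '.' would accept any character before "ashx".
        Console.WriteLine(Regex.IsMatch("/imageXashx", "[^/]*?.ashx"));   // True
        Console.WriteLine(Regex.IsMatch("/imageXashx", "[^/]*?\\.ashx")); // False
    }
}
```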

--
HTH,

Kevin Spencer
Microsoft MVP

Help test our new betas,
DSI PrintManager, Miradyne Component Libraries:
http://www.miradyne.net

Zeba

Thanks! That was very helpful.
