RegEx engine returning empty matches between valid tokens.

  • Thread starter John otac0n Gietzen

John otac0n Gietzen

Dear RegEx Gurus,

I am writing an application to evaluate mathematical functions. The
first step in the process of creating the expressions is tokenizing the
input. I decided to use one large regular expression to perform this
tokenization:

~\G([a-zA-Z]\w*\(|[a-zA-Z]\w*|(<=|>=|!=|<>|==|=)|0x[\da-fA-F.]*|0b[\d.]*|[\d.]*|\s*|.)~

Now, according to my intuition, this should work. However, any time a
single character that is not explicitly recognized as a token comes by,
the regex engine returns two matches: one empty and one of the correct
character.

To simplify this odd behavior, I have prepared the following example:

Match the string
abcdefghijklmnop
to the expression
~\G(a|b|c*|\w)~

This "anomaly" is seen in the Perl, PHP, and C# regex engines (which
makes me think that it is expected behavior). The final destination
for this regex is C#, so I cannot just ignore the empty entries. (The C#
regex engine stops after the first empty match.) Any help or advice
would be much appreciated.
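
For anyone who wants to reproduce this, here is a minimal sketch in Python. It rests on one assumption: Python's re module has no \G, so pattern.match(text, pos) is used as a stand-in for anchoring at the end of the previous match. The name first_tokens is made up for illustration.

```python
import re

# Same alternatives as ~\G(a|b|c*|\w)~.
# Assumption: match(text, pos) anchors at pos, standing in for \G.
PATTERN = re.compile(r"a|b|c*|\w")


def first_tokens(text, limit):
    """Return (position, matched_text) pairs, stopping at the first empty match."""
    out = []
    pos = 0
    while pos < len(text) and len(out) < limit:
        m = PATTERN.match(text, pos)  # never None here: c* can match nothing
        out.append((pos, m.group()))
        if m.end() == pos:
            # Empty match: the position cannot advance, so stop.
            break
        pos = m.end()
    return out


print(first_tokens("abcdefghijklmnop", 10))
# → [(0, 'a'), (1, 'b'), (2, 'c'), (3, '')]
```

The empty entry shows up at position 3, in front of "d", which is the first character that none of a, b, or c can consume.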

Sincerely,
John "Otac0n" Gietzen
 
Xicheng

John said:
Match the string
abcdefghijklmnop
to the expression
~\G(a|b|c*|\w)~
When you use "c*" as one of the alternatives, the regex effectively
behaves like this:

~\G(a|b|c+||\w)~

So you have five alternatives (instead of four), one of which is the
empty string. Alternatives are tried from left to right, so the empty
one is tried before "\w" and succeeds at any position where "a", "b",
and "c+" all fail. If you want one or more "c" to show up in your
matched text, use "c+" instead of "c*".
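
The same change can be applied to the full tokenizer expression. Below is a rough sketch in Python, not a definitive implementation. It assumes that replacing the two alternatives that can match nothing ("[\d.]*" and "\s*") with their "+" forms is acceptable for the grammar. It also assumes pattern.match(text, pos) as a stand-in for \G, which Python's re module does not support. The names TOKEN and tokenize are hypothetical, and the inner capturing group around the comparison operators is dropped for brevity.

```python
import re

# The original alternatives, with [\d.]* and \s* changed to [\d.]+ and \s+
# so that no alternative can match the empty string.
TOKEN = re.compile(
    r"[a-zA-Z]\w*\(|[a-zA-Z]\w*|<=|>=|!=|<>|==|=|"
    r"0x[\da-fA-F.]*|0b[\d.]*|[\d.]+|\s+|."
)


def tokenize(text):
    """Split text into tokens, each match starting where the previous one ended."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)  # anchored at pos, like \G
        if m is None:
            # Defensive only: every character is covered by \s+ or the final "."
            raise ValueError(f"no token at position {pos}")
        tokens.append(m.group())
        pos = m.end()  # always advances, since no alternative is empty
    return tokens


print(tokenize("sin(x) <= 0x1F + 2.5"))
# → ['sin(', 'x', ')', ' ', '<=', ' ', '0x1F', ' ', '+', ' ', '2.5']
```

Every match now consumes at least one character, so unrecognized single characters such as "+" or ")" fall through to the final "." without an empty match in front of them.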

Xicheng
 
