Help needed with tough regular expression matching

Ramon F Herrera · Oct 12, 2009

Hello all,

My initial task was to match the well-known notation used in math and
other fields to denote integer sequences. Such notation is nice,
unambiguous and -some times- succinct:

1,2,3,4,8-10,11-14

which is equivalent to:

1-4,8-14

(and to may other possibilities, the one immediately above being the
most succinct possible)

I have been able to come up with a regular expression which
unequivocally matches any valid "statement" of such "language". See
below.

My problem is that now I would like to write a more general form of
such sequence matching. The "base unit" is not an integer anymore. I
will post the actual question next.

-Ramon

ps: You will notice that this is not written in Perl, but in C++ and
will be used in some sort of "Perl emulator" (actually a Regex package
for C++). For obvious reason, a Perl NG is the best place to find the
appropriate experts. I am sure some of you can devise regexes in your
sleep.

---------------------------

const char hyphen = '-';
const char left_paren = '(';
const char right_paren = ')';
const char bar = '|';
const char comma = ',';
const char star = '*';

const string number = "[0-9]+";
const string range = number + hyphen + number;
const string term = left_paren + number + bar + range +
right_paren;
const string sequence = term + bar + left_paren + term + comma +
right_paren + star + term;

Ramon F Herrera · Oct 12, 2009

Hello all,

My initial task was to match the well-known notation used in math and
other fields to denote integer sequences. Such notation is nice,
unambiguous and -some times- succinct:

1,2,3,4,8-10,11-14

which is equivalent to:

1-4,8-14

(and to may other possibilities, the one immediately above being the
most succinct possible)

I have been able to come up with a regular expression which
unequivocally matches any valid "statement" of such "language". See
below.

My problem is that now I would like to write a more general form of
such sequence matching. The "base unit" is not an integer anymore. I
will post the actual question next.

-Ramon

ps: You will notice that this is not written in Perl, but in C++ and
will be used in some sort of "Perl emulator" (actually a Regex package
for C++). For obvious reason, a Perl NG is the best place to find the
appropriate experts. I am sure some of you can devise regexes in your
sleep.

---------------------------

const char hyphen = '-';
const char left_paren = '(';
const char right_paren = ')';
const char bar = '|';
const char comma = ',';
const char star = '*';

const string number = "[0-9]+";
const string range = number + hyphen + number;
const string term = left_paren + number + bar + range +
right_paren;
const string sequence = term + bar + left_paren + term + comma +
right_paren + star + term;

I guess the first question is this: is my initial (based on plain old
integers) version already posted correct? Are there any comments about
it? The actual program is working properly.

Once again, I got into trouble when I tried to develop a more general
version, not based on integers but in "base units" slightly more
complicated. I keep on thinking that my problem is the parentheses
(excess or deficit of them). I can never figure those out.

I still owe you the actual question.

-Ramon

sln · Oct 12, 2009

Hello all,

My initial task was to match the well-known notation used in math and
other fields to denote integer sequences. Such notation is nice,
unambiguous and -some times- succinct:

1,2,3,4,8-10,11-14

which is equivalent to:

1-4,8-14

(and to may other possibilities, the one immediately above being the
most succinct possible)

I have been able to come up with a regular expression which
unequivocally matches any valid "statement" of such "language". See
below.

My problem is that now I would like to write a more general form of
such sequence matching. The "base unit" is not an integer anymore. I
will post the actual question next.

-Ramon

ps: You will notice that this is not written in Perl, but in C++ and
will be used in some sort of "Perl emulator" (actually a Regex package
for C++). For obvious reason, a Perl NG is the best place to find the
appropriate experts. I am sure some of you can devise regexes in your
sleep.

---------------------------

const char hyphen = '-';
const char left_paren = '(';
const char right_paren = ')';
const char bar = '|';
const char comma = ',';
const char star = '*';

const string number = "[0-9]+";
const string range = number + hyphen + number;
const string term = left_paren + number + bar + range +
right_paren;
const string sequence = term + bar + left_paren + term + comma +
right_paren + star + term;

Click to expand...

I guess the first question is this: is my initial (based on plain old ^^^^^^^^^^^^^^
integers) version already posted correct? Are there any comments about
it? The actual program is working properly.

Once again, I got into trouble when I tried to develop a more general
version, not based on integers but in "base units" slightly more
complicated. I keep on thinking that my problem is the parentheses
(excess or deficit of them). I can never figure those out.

I still owe you the actual question.

^^^^^^^^^^^^^^^
Is this the first question or the actual question?
You will have to translate this string into something readable
with an explanation of how it works.

-sln

Ramon F Herrera · Oct 12, 2009

Hello all,
My initial task was to match the well-known notation used in math and
other fields to denote integer sequences. Such notation is nice,
unambiguous and -some times- succinct:
1,2,3,4,8-10,11-14
which is equivalent to:
1-4,8-14
(and to may other possibilities, the one immediately above being the
most succinct possible)
I have been able to come up with a regular expression which
unequivocally matches any valid "statement" of such "language". See
below.
My problem is that now I would like to write a more general form of
such sequence matching. The "base unit" is not an integer anymore. I
will post the actual question next.
-Ramon
ps: You will notice that this is not written in Perl, but in C++ and
will be used in some sort of "Perl emulator" (actually a Regex package
for C++). For obvious reason, a Perl NG is the best place to find the
appropriate experts. I am sure some of you can devise regexes in your
sleep.
---------------------------
const char hyphen = '-';
const char left_paren = '(';
const char right_paren = ')';
const char bar = '|';
const char comma = ',';
const char star = '*';
const string number = "[0-9]+";
const string range = number + hyphen + number;
const string term = left_paren + number + bar + range +
right_paren;
const string sequence = term + bar + left_paren + term + comma +
right_paren + star + term;

Click to expand...

Click to expand...

I guess the first question is this: is my initial (based on plain old

Click to expand...

^^^^^^^^^^^^^^>integers) version already posted correct? Are there any comments about

it? The actual program is working properly.

Click to expand...

Once again, I got into trouble when I tried to develop a more general
version, not based on integers but in "base units" slightly more
complicated. I keep on thinking that my problem is the parentheses
(excess or deficit of them). I can never figure those out.

Click to expand...

I still owe you the actual question.

Click to expand...

^^^^^^^^^^^^^^^
Is this the first question or the actual question?
You will have to translate this string into something readable
with an explanation of how it works.

-sln

Okay, sorry about the pre-questioning and post-questioning caches.

Too many lookahead buffers, but at least now I know that somebody is
reading this with interest...

My definite question is about the notation that I am trying to
implement.

(1) Older version is based on a sequence of integers, at it works as
expected with the regexes already posted:

1,5,7-11,13

(2) My future feature will be based on "quarters" instead of integers.
This is an instance of the most basic unit:

4/2008

(meaning the 4th. quarter of year 2008, of course).

Therefore the notation (I don't allow spaces, btw) will look like
this:

2/2000,3/2000,4/2000,1/2005,2/2005-1/2006

Which could be minimally expressed like this:

2/2000-4/2000,1/2005-1/2006

My failed attempt at dealing with calendar quarters is lifted entirely
from the integer version, but for some reasons it doesn't seem to
scale well.

-Ramon

Jürgen Exner · Oct 12, 2009

Ramon F Herrera said:
(2) My future feature will be based on "quarters" instead of integers.
This is an instance of the most basic unit:

4/2008

(meaning the 4th. quarter of year 2008, of course).

Therefore the notation (I don't allow spaces, btw) will look like
this:

2/2000,3/2000,4/2000,1/2005,2/2005-1/2006

Which could be minimally expressed like this:

2/2000-4/2000,1/2005-1/2006

My failed attempt at dealing with calendar quarters is lifted entirely
from the integer version, but for some reasons it doesn't seem to
scale well.

I will never understand the obsession involved with regular expressions.
Of course you CAN use them to parse the above syntax, but why would you
insist on doing so? Your syntax are so simple, a trivial
split/,/
will give you a sequence of ranges already, some of them representing
single quarters. And that is all you need.

And then just walk through them one by one and check if the beginning
quarter of the next entry is one larger than the ending quarter of the
current entry(*). If so then merge the two.

*: this one is the only function that requires a tiny bit of
intelligence. But as your data seems to be in chonological sequence you
don't even have to deal with overlaps or included subsets or any of
those things, that hit you when you least expect it.

jue

Ramon F Herrera · Oct 12, 2009

Quoth Ramon F Herrera <[email protected]>:

(2) My future feature will be based on "quarters" instead of integers.
This is an instance of the most basic unit:

(meaning the 4th. quarter of year 2008, of course).

Click to expand...

Therefore the notation (I don't allow spaces, btw) will look like
this:

Which could be minimally expressed like this:

Click to expand...

2/2000-4/2000,1/2005-1/2006

Click to expand...

Right... so you want

my $qtr = qr! [1234] / \d{4} !x;
my $range = qr! $qtr - $qtr !x;
my $item = qr! $qtr | $range !x;
my $expr = qr! $item (?: , $item )* !x;

Translating that into <whatever it is you're using> is left as an
exercise.

Why couldn't you just ask the straightforward question first?

Frankly, I wasn't even sure anybody would be reading -let alone
answering- this. I was doing the equivalent of "knock, knock: anyone
there?". As an Usenet pioneer, it saddens me to see how much its
quality has decreased over the years.

One reason for such "disprovement" is positive (NGs replaced by Google
search) and the other is negative (SPAM).

Having said that, the Perl NG seems to remain a useful place!

-Ramon

Ramon F Herrera · Oct 12, 2009

I will never understand the obsession involved with regular expressions.
Of course you CAN use them to parse the above syntax, but why would you
insist on doing so? Your syntax are so simple, a trivial
split/,/
will give you a sequence of ranges already, some of them representing
single quarters. And that is all you need.

And then just walk through them one by one and check if the beginning
quarter of the next entry is one larger than the ending quarter of the
current entry(*). If so then merge the two.

*: this one is the only function that requires a tiny bit of
intelligence. But as your data seems to be in chonological sequence you
don't even have to deal with overlaps or included subsets or any of
those things, that hit you when you least expect it.

jue

Hi Jurgen,

The first thing they told me in CS classes was: "write general
functions, you never know when you are going to need a better
version". The input data (week sequences) is indeed going to be
provided by a program of mine (in the Windows client side, the current
problem is for the Linux serve side) BUT I will refuse to rely on it
being nicely ordered. I may hire another programmer at some point.
These expressions are going to become more complex as my software
deals with real-life data and I figured that starting with simple
integers would be a good start.

After the quarterly problem I intend to attack harder ones, such as
the dates in reverse order, its syntax and semantics.

My objective/goal is to sent the data from the client to the server in
the most succinct manner and to leave all heavy number crunching on
the server side (thin client).

Plus, this way I prevent Alzheimer or whatever similar stuff and keep
my programming brain in some sort of shape. I also study foreign
languages for absolutely no real reason.

-Ramon

Martijn Lievaart · Oct 12, 2009

The first thing they told me in CS classes was: "write general
functions, you never know when you are going to need a better version".

Bad advice. Write what you need, with extendability in the back of your
mind.

The input data (week sequences) is indeed going to be provided by a
program of mine (in the Windows client side, the current problem is for
the Linux serve side) BUT I will refuse to rely on it being nicely
ordered.

Good. Always, ALWAYS, check your inputs.

I may hire another programmer at some point. These expressions
are going to become more complex as my software deals with real-life
data and I figured that starting with simple integers would be a good
start.

The algorithm by Jurgen is a very good aproach and gives you the
flexibility needed.

First read in your data to an intermediate format you can handle better,
then handle the data.

After the quarterly problem I intend to attack harder ones, such as the
dates in reverse order, its syntax and semantics.

Jurgens algorithm can handle all these, it becpomes a simple matter of
programming.

My objective/goal is to sent the data from the client to the server in
the most succinct manner and to leave all heavy number crunching on the
server side (thin client).

With today's computing power even on most thin clients (I do know
exceptions) this reason is not the best of reasons. Having all validation
being done in one place, is a good reason, if possible. It may not be
possible, in fact it may be a good strategy to parse client side to give
friendly error messages to the user and parse server side to be
absolutely sure that you don't get fed rubbish in your database.

M4

Ramon F Herrera · Oct 12, 2009

Ramon F Herrera said:
Ramon F Herrera said:

[...]
My failed attempt at dealing with calendar quarters is lifted entirely
from the integer version, but for some reasons it doesn't seem to
scale well.

Click to expand...

If you don't post it, we can't take a look at it (at least I
do not believe in telepathy). If you post it, please do not
use yet another grammar, but a valid Perl regular expres-
sion, i. e. the output of "printf ("%s\n", sequence);" or
something like it.

Tim

Tim & Ben,

I believe the problem has been solved. Thanks for your kind
assistance. This is the 2nd. recent time I get stuck with a regex and
this is the 2nd time that the weird-looking "(?:" grouping operator
comes to my rescue.

I still have to run more comprehensive tests, but this is what I have
so far. In both "high level language" and "assembly language".

It seems that the critical part was adding an "xterm" which is a
regular "term" surrounded by parentheses. For some reason, the Perl
expression provided by Ben does not require that set of parentheses.

-Ramon

--------------------------------

const string left_group = "(?:";
const string year_period = "[1-4]";
const string quarter = year_period + slash + year;
const string range = quarter + hyphen + quarter;
const string term = quarter + bar + range;
const string xterm = left_paren + term + right_paren;
const string sequence = xterm + left_group + comma + xterm +
right_paren + star;

sequence: ([1-4]/[12][0-9]{3}|[1-4]/[12][0-9]{3}-[1-4]/[12][0-9]{3})
(?:,([1-4]/[12][0-9]{3}|[1-4]/[12][0-9]{3}-[1-
4]/[12][0-9]{3}))*

Ramon F Herrera · Oct 12, 2009

Quoth Ramon F Herrera <[email protected]>:

I believe the problem has been solved. Thanks for your kind
assistance. This is the 2nd. recent time I get stuck with a regex and
this is the 2nd time that the weird-looking "(?:" grouping operator
comes to my rescue.

Click to expand...

(?: ) is equivalent to ( ) except it doesn't capture what it matches. If
you are only concerned about matching/non-matching, they are equivalent.

It seems that the critical part was adding an "xterm" which is a
regular "term" surrounded by parentheses. For some reason, the Perl
expression provided by Ben does not require that set of parentheses.

Click to expand...

qr// creates a compiled regex object. When one of these is interpolated
into another regex, it is surrounded by (?: ) so that it remains
self-contained. (Strictly, (?-xism: ), to preserve the flags that were
on the original qr//.)

sequence: ([1-4]/[12][0-9]{3}|[1-4]/[12][0-9]{3}-[1-4]/[12][0-9]{3})

Click to expand...

Hmm, I see you are already setting yourself up for y3k bugs...

Ben

Not to mention the fact that my program will fail to keep track of
invoices from the pre-Camelot era...

-Ramon

Ramon F Herrera · Oct 12, 2009

Quoth Ramon F Herrera <[email protected]>:

I believe the problem has been solved. Thanks for your kind
assistance. This is the 2nd. recent time I get stuck with a regex and
this is the 2nd time that the weird-looking "(?:" grouping operator
comes to my rescue.

Click to expand...

(?: ) is equivalent to ( ) except it doesn't capture what it matches. If
you are only concerned about matching/non-matching, they are equivalent.

It seems that the critical part was adding an "xterm" which is a
regular "term" surrounded by parentheses. For some reason, the Perl
expression provided by Ben does not require that set of parentheses.

Click to expand...

qr// creates a compiled regex object. When one of these is interpolated
into another regex, it is surrounded by (?: ) so that it remains
self-contained. (Strictly, (?-xism: ), to preserve the flags that were
on the original qr//.)

sequence: ([1-4]/[12][0-9]{3}|[1-4]/[12][0-9]{3}-[1-4]/[12][0-9]{3})

Click to expand...

Hmm, I see you are already setting yourself up for y3k bugs...

I have always said that the "Millennium Bug" is a complete misnomer.

- It is not a bug, but a choice
- It is not related to millennia, but to centuries.

It so happens that the very first change of centuries for computers
coincided with a millennium shift, but that was just a coincidence.

Next time humanity will face the problem will be in 2100.

-Ramon

sln · Oct 14, 2009

(2) My future feature will be based on "quarters" instead of integers.
This is an instance of the most basic unit:

4/2008

(meaning the 4th. quarter of year 2008, of course).

Therefore the notation (I don't allow spaces, btw) will look like
this:

2/2000,3/2000,4/2000,1/2005,2/2005-1/2006

Which could be minimally expressed like this:

2/2000-4/2000,1/2005-1/2006

My failed attempt at dealing with calendar quarters is lifted entirely
from the integer version, but for some reasons it doesn't seem to
scale well.

-Ramon

Below is my Validation Special version ..
-sln

--------------------
use strict;
use warnings;

my @qtrs = ();

while (<DATA>)
{
@qtrs = m{ (?:^|,) ([1-4]) / ([12]\d{3}) (?=[,-]|$) .? ((?<=-)[1-4]|) .? ((?<=/)[12]\d{3}|) (?=$|,) }xg;
if (@qtrs) {
print "-------\n";
while ( my ($q1,$y1,$q2,$y2) = map {defined() && length() ? $_ : 0} splice (@qtrs, 0, 4) )
{
printf "q:%2d y:%5d To q:%2d y:%5d\n", $q1,$y1, $q2,$y2;
}
}
# OR ...

@qtrs = m{ (?:^|,) ([1-4]) / ([12]\d{3}) (?=[,-]|$) .? (?:-([1-4]) / ([12]\d{3}))? (?=$|,) }xg;
if (@qtrs) {
print " ==\n";
while ( my ($q1,$y1,$q2,$y2) = map {defined() && length() ? $_ : 0} splice (@qtrs, 0, 4) )
{
printf "q:%2d y:%5d To q:%2d y:%5d\n", $q1,$y1, $q2,$y2;
}
}
}

__DATA__

1/1900

2/2000,3/2000,4/20073/2008,1/2005,2/2005-1/2006 asdf 2/2000-4/2000,1/2005-1/2006 wsefrgsd 2/2000-4/2000,1/2005-1/2006asdf,1/2008-2/2009

1/2000-2/2000, ,,3/2005-4/2006,2/2000

2/2000-4/2000, ,,6/2005-1/2006

1/2100-4/2200
2/2100-
,3/2100-
4/2100aadsfasdf

( [1-4]/[12][0-9]{3} | [1-4]/[12][0-9]{3} - [1-4]/[12][0-9]{3} )
(?: ,
( [ 1-4]/[12][0-9]{3} | [1-4]/[12][0-9]{3} - [1-4]/[12][0-9]{3} )
)*

Problem with regular expression	4	Mar 7, 2010
FAQ 6.20 What good is "\G" in a regular expression?	0	Mar 3, 2011
Match a regular expression	2	Mar 2, 2007
help for regular expression	7	Jan 24, 2008
Regular Expression help	5	Mar 29, 2008
FAQ 6.11 How do I use a regular expression to strip C style comments from a file?	0	Feb 10, 2011
LF: Help with Regular Expression	2	Oct 30, 2007
Regular expression help	4	May 29, 2006

Help needed with tough regular expression matching

Ramon F Herrera

Ramon F Herrera

sln

Ramon F Herrera

Jürgen Exner

Ramon F Herrera

Ramon F Herrera

Martijn Lievaart

Ramon F Herrera

Ramon F Herrera

Ramon F Herrera

sln

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads