Help needed with tough regular expression matching

R

Ramon F Herrera

Hello all,

My initial task was to match the well-known notation used in math and
other fields to denote integer sequences. Such notation is nice,
unambiguous and -some times- succinct:

1,2,3,4,8-10,11-14

which is equivalent to:

1-4,8-14

(and to may other possibilities, the one immediately above being the
most succinct possible)

I have been able to come up with a regular expression which
unequivocally matches any valid "statement" of such "language". See
below.

My problem is that now I would like to write a more general form of
such sequence matching. The "base unit" is not an integer anymore. I
will post the actual question next.

-Ramon

ps: You will notice that this is not written in Perl, but in C++ and
will be used in some sort of "Perl emulator" (actually a Regex package
for C++). For obvious reason, a Perl NG is the best place to find the
appropriate experts. I am sure some of you can devise regexes in your
sleep. :)

---------------------------

const char hyphen = '-';
const char left_paren = '(';
const char right_paren = ')';
const char bar = '|';
const char comma = ',';
const char star = '*';

const string number = "[0-9]+";
const string range = number + hyphen + number;
const string term = left_paren + number + bar + range +
right_paren;
const string sequence = term + bar + left_paren + term + comma +
right_paren + star + term;
 
R

Ramon F Herrera

Hello all,

My initial task was to match the well-known notation used in math and
other fields to denote integer sequences. Such notation is nice,
unambiguous  and -some times- succinct:

1,2,3,4,8-10,11-14

which is equivalent to:

1-4,8-14

(and to may other possibilities, the one immediately above being the
most succinct possible)

I have been able to come up with a regular expression which
unequivocally matches any valid "statement" of such "language". See
below.

My problem is that now I would like to write a more general form of
such sequence matching. The "base unit" is not an integer anymore. I
will post the actual question next.

-Ramon

ps: You will notice that this is not written in Perl, but in C++ and
will be used in some sort of "Perl emulator" (actually a Regex package
for C++). For obvious reason, a Perl NG is the best place to find the
appropriate  experts. I am sure some of you can devise regexes in your
sleep. :)

---------------------------

const char hyphen         = '-';
const char left_paren     = '(';
const char right_paren    = ')';
const char bar            = '|';
const char comma          = ',';
const char star           = '*';

const string number       = "[0-9]+";
const string range        = number + hyphen + number;
const string term         = left_paren + number + bar + range +
right_paren;
const string sequence     = term + bar + left_paren + term + comma +
right_paren + star + term;

I guess the first question is this: is my initial (based on plain old
integers) version already posted correct? Are there any comments about
it? The actual program is working properly.

Once again, I got into trouble when I tried to develop a more general
version, not based on integers but in "base units" slightly more
complicated. I keep on thinking that my problem is the parentheses
(excess or deficit of them). I can never figure those out.

I still owe you the actual question.

-Ramon
 
S

sln

Hello all,

My initial task was to match the well-known notation used in math and
other fields to denote integer sequences. Such notation is nice,
unambiguous  and -some times- succinct:

1,2,3,4,8-10,11-14

which is equivalent to:

1-4,8-14

(and to may other possibilities, the one immediately above being the
most succinct possible)

I have been able to come up with a regular expression which
unequivocally matches any valid "statement" of such "language". See
below.

My problem is that now I would like to write a more general form of
such sequence matching. The "base unit" is not an integer anymore. I
will post the actual question next.

-Ramon

ps: You will notice that this is not written in Perl, but in C++ and
will be used in some sort of "Perl emulator" (actually a Regex package
for C++). For obvious reason, a Perl NG is the best place to find the
appropriate  experts. I am sure some of you can devise regexes in your
sleep. :)

---------------------------

const char hyphen         = '-';
const char left_paren     = '(';
const char right_paren    = ')';
const char bar            = '|';
const char comma          = ',';
const char star           = '*';

const string number       = "[0-9]+";
const string range        = number + hyphen + number;
const string term         = left_paren + number + bar + range +
right_paren;
const string sequence     = term + bar + left_paren + term + comma +
right_paren + star + term;

I guess the first question is this: is my initial (based on plain old ^^^^^^^^^^^^^^
integers) version already posted correct? Are there any comments about
it? The actual program is working properly.

Once again, I got into trouble when I tried to develop a more general
version, not based on integers but in "base units" slightly more
complicated. I keep on thinking that my problem is the parentheses
(excess or deficit of them). I can never figure those out.

I still owe you the actual question.
^^^^^^^^^^^^^^^
Is this the first question or the actual question?
You will have to translate this string into something readable
with an explanation of how it works.

-sln
 
R

Ramon F Herrera

Hello all,
My initial task was to match the well-known notation used in math and
other fields to denote integer sequences. Such notation is nice,
unambiguous  and -some times- succinct:
1,2,3,4,8-10,11-14
which is equivalent to:
1-4,8-14
(and to may other possibilities, the one immediately above being the
most succinct possible)
I have been able to come up with a regular expression which
unequivocally matches any valid "statement" of such "language". See
below.
My problem is that now I would like to write a more general form of
such sequence matching. The "base unit" is not an integer anymore. I
will post the actual question next.
-Ramon
ps: You will notice that this is not written in Perl, but in C++ and
will be used in some sort of "Perl emulator" (actually a Regex package
for C++). For obvious reason, a Perl NG is the best place to find the
appropriate  experts. I am sure some of you can devise regexes in your
sleep. :)
---------------------------
const char hyphen         = '-';
const char left_paren     = '(';
const char right_paren    = ')';
const char bar            = '|';
const char comma          = ',';
const char star           = '*';
const string number       = "[0-9]+";
const string range        = number + hyphen + number;
const string term         = left_paren + number + bar + range +
right_paren;
const string sequence     = term + bar + left_paren + term + comma +
right_paren + star + term;
I guess the first question is this: is my initial (based on plain old

             ^^^^^^^^^^^^^^>integers) version already posted correct? Are there any comments about
it? The actual program is working properly.
Once again, I got into trouble when I tried to develop a more general
version, not based on integers but in "base units" slightly more
complicated. I keep on thinking that my problem is the parentheses
(excess or deficit of them). I can never figure those out.
I still owe you the actual question.

                     ^^^^^^^^^^^^^^^
Is this the first question or the actual question?
You will have to translate this string into something readable
with an explanation of how it works.

-sln


Okay, sorry about the pre-questioning and post-questioning caches. :)

Too many lookahead buffers, but at least now I know that somebody is
reading this with interest...

My definite question is about the notation that I am trying to
implement.

(1) Older version is based on a sequence of integers, at it works as
expected with the regexes already posted:

1,5,7-11,13

(2) My future feature will be based on "quarters" instead of integers.
This is an instance of the most basic unit:

4/2008

(meaning the 4th. quarter of year 2008, of course).

Therefore the notation (I don't allow spaces, btw) will look like
this:

2/2000,3/2000,4/2000,1/2005,2/2005-1/2006

Which could be minimally expressed like this:

2/2000-4/2000,1/2005-1/2006

My failed attempt at dealing with calendar quarters is lifted entirely
from the integer version, but for some reasons it doesn't seem to
scale well.

-Ramon
 
J

Jürgen Exner

Ramon F Herrera said:
(2) My future feature will be based on "quarters" instead of integers.
This is an instance of the most basic unit:

4/2008

(meaning the 4th. quarter of year 2008, of course).

Therefore the notation (I don't allow spaces, btw) will look like
this:

2/2000,3/2000,4/2000,1/2005,2/2005-1/2006

Which could be minimally expressed like this:

2/2000-4/2000,1/2005-1/2006

My failed attempt at dealing with calendar quarters is lifted entirely
from the integer version, but for some reasons it doesn't seem to
scale well.

I will never understand the obsession involved with regular expressions.
Of course you CAN use them to parse the above syntax, but why would you
insist on doing so? Your syntax are so simple, a trivial
split/,/
will give you a sequence of ranges already, some of them representing
single quarters. And that is all you need.

And then just walk through them one by one and check if the beginning
quarter of the next entry is one larger than the ending quarter of the
current entry(*). If so then merge the two.

*: this one is the only function that requires a tiny bit of
intelligence. But as your data seems to be in chonological sequence you
don't even have to deal with overlaps or included subsets or any of
those things, that hit you when you least expect it.

jue
 
R

Ramon F Herrera

Quoth Ramon F Herrera <[email protected]>:




(2) My future feature will be based on "quarters" instead of integers.
This is an instance of the most basic unit:

(meaning the 4th. quarter of year 2008, of course).
Therefore the notation (I don't allow spaces, btw) will look like
this:

Which could be minimally expressed like this:
2/2000-4/2000,1/2005-1/2006

Right... so you want

    my $qtr     = qr! [1234] / \d{4}        !x;
    my $range   = qr! $qtr - $qtr           !x;
    my $item    = qr! $qtr | $range         !x;
    my $expr    = qr! $item (?: , $item )*  !x;

Translating that into <whatever it is you're using> is left as an
exercise.
Why couldn't you just ask the straightforward question first?

Frankly, I wasn't even sure anybody would be reading -let alone
answering- this. I was doing the equivalent of "knock, knock: anyone
there?". As an Usenet pioneer, it saddens me to see how much its
quality has decreased over the years.

One reason for such "disprovement" is positive (NGs replaced by Google
search) and the other is negative (SPAM).

Having said that, the Perl NG seems to remain a useful place!

-Ramon
 
R

Ramon F Herrera

I will never understand the obsession involved with regular expressions.
Of course you CAN use them to parse the above syntax, but why would you
insist on doing so? Your syntax are so simple, a trivial
        split/,/
will give you a sequence of ranges already, some of them representing
single quarters. And that is all you need.

And then just walk through them one by one and check if the beginning
quarter of the next entry is one larger than the ending quarter of the
current entry(*). If so then merge the two.

*: this one is the only function that requires a tiny bit of
intelligence. But as your data seems to be in chonological sequence you
don't even have to deal with overlaps or included subsets or any of
those things, that hit you when you least expect it.

jue

Hi Jurgen,

The first thing they told me in CS classes was: "write general
functions, you never know when you are going to need a better
version". The input data (week sequences) is indeed going to be
provided by a program of mine (in the Windows client side, the current
problem is for the Linux serve side) BUT I will refuse to rely on it
being nicely ordered. I may hire another programmer at some point.
These expressions are going to become more complex as my software
deals with real-life data and I figured that starting with simple
integers would be a good start.

After the quarterly problem I intend to attack harder ones, such as
the dates in reverse order, its syntax and semantics.

My objective/goal is to sent the data from the client to the server in
the most succinct manner and to leave all heavy number crunching on
the server side (thin client).

Plus, this way I prevent Alzheimer or whatever similar stuff and keep
my programming brain in some sort of shape. I also study foreign
languages for absolutely no real reason.

-Ramon
 
M

Martijn Lievaart

The first thing they told me in CS classes was: "write general
functions, you never know when you are going to need a better version".

Bad advice. Write what you need, with extendability in the back of your
mind.
The input data (week sequences) is indeed going to be provided by a
program of mine (in the Windows client side, the current problem is for
the Linux serve side) BUT I will refuse to rely on it being nicely
ordered.

Good. Always, ALWAYS, check your inputs.
I may hire another programmer at some point. These expressions
are going to become more complex as my software deals with real-life
data and I figured that starting with simple integers would be a good
start.

The algorithm by Jurgen is a very good aproach and gives you the
flexibility needed.

First read in your data to an intermediate format you can handle better,
then handle the data.
After the quarterly problem I intend to attack harder ones, such as the
dates in reverse order, its syntax and semantics.

Jurgens algorithm can handle all these, it becpomes a simple matter of
programming.
My objective/goal is to sent the data from the client to the server in
the most succinct manner and to leave all heavy number crunching on the
server side (thin client).

With today's computing power even on most thin clients (I do know
exceptions) this reason is not the best of reasons. Having all validation
being done in one place, is a good reason, if possible. It may not be
possible, in fact it may be a good strategy to parse client side to give
friendly error messages to the user and parse server side to be
absolutely sure that you don't get fed rubbish in your database.

M4
 
R

Ramon F Herrera

Ramon F Herrera said:
[...]
My failed attempt at dealing with calendar quarters is lifted entirely
from the integer version, but for some reasons it doesn't seem to
scale well.

If you don't post it, we can't take a look at it (at least I
do not believe in telepathy). If you post it, please do not
use yet another grammar, but a valid Perl regular expres-
sion, i. e. the output of "printf ("%s\n", sequence);" or
something like it.

Tim

Tim & Ben,

I believe the problem has been solved. Thanks for your kind
assistance. This is the 2nd. recent time I get stuck with a regex and
this is the 2nd time that the weird-looking "(?:" grouping operator
comes to my rescue.

I still have to run more comprehensive tests, but this is what I have
so far. In both "high level language" and "assembly language". :)

It seems that the critical part was adding an "xterm" which is a
regular "term" surrounded by parentheses. For some reason, the Perl
expression provided by Ben does not require that set of parentheses.

-Ramon

--------------------------------

const string left_group = "(?:";
const string year_period = "[1-4]";
const string quarter = year_period + slash + year;
const string range = quarter + hyphen + quarter;
const string term = quarter + bar + range;
const string xterm = left_paren + term + right_paren;
const string sequence = xterm + left_group + comma + xterm +
right_paren + star;

sequence: ([1-4]/[12][0-9]{3}|[1-4]/[12][0-9]{3}-[1-4]/[12][0-9]{3})
(?:,([1-4]/[12][0-9]{3}|[1-4]/[12][0-9]{3}-[1-
4]/[12][0-9]{3}))*
 
R

Ramon F Herrera

Quoth Ramon F Herrera <[email protected]>:


I believe the problem has been solved. Thanks for your kind
assistance. This is the 2nd. recent time I get stuck with a regex and
this is the 2nd time that the weird-looking "(?:" grouping operator
comes to my rescue.

(?: ) is equivalent to ( ) except it doesn't capture what it matches. If
you are only concerned about matching/non-matching, they are equivalent.
It seems that the critical part was adding an "xterm" which is a
regular "term" surrounded by parentheses. For some reason, the Perl
expression provided by Ben does not require that set of parentheses.

qr// creates a compiled regex object. When one of these is interpolated
into another regex, it is surrounded by (?: ) so that it remains
self-contained. (Strictly, (?-xism: ), to preserve the flags that were
on the original qr//.)
sequence: ([1-4]/[12][0-9]{3}|[1-4]/[12][0-9]{3}-[1-4]/[12][0-9]{3})


Hmm, I see you are already setting yourself up for y3k bugs...

Ben

Not to mention the fact that my program will fail to keep track of
invoices from the pre-Camelot era...

-Ramon
 
R

Ramon F Herrera

Quoth Ramon F Herrera <[email protected]>:


I believe the problem has been solved. Thanks for your kind
assistance. This is the 2nd. recent time I get stuck with a regex and
this is the 2nd time that the weird-looking "(?:" grouping operator
comes to my rescue.

(?: ) is equivalent to ( ) except it doesn't capture what it matches. If
you are only concerned about matching/non-matching, they are equivalent.
It seems that the critical part was adding an "xterm" which is a
regular "term" surrounded by parentheses. For some reason, the Perl
expression provided by Ben does not require that set of parentheses.

qr// creates a compiled regex object. When one of these is interpolated
into another regex, it is surrounded by (?: ) so that it remains
self-contained. (Strictly, (?-xism: ), to preserve the flags that were
on the original qr//.)
sequence: ([1-4]/[12][0-9]{3}|[1-4]/[12][0-9]{3}-[1-4]/[12][0-9]{3})


Hmm, I see you are already setting yourself up for y3k bugs...

I have always said that the "Millennium Bug" is a complete misnomer.

- It is not a bug, but a choice
- It is not related to millennia, but to centuries.

It so happens that the very first change of centuries for computers
coincided with a millennium shift, but that was just a coincidence.

Next time humanity will face the problem will be in 2100.

-Ramon
 
S

sln

(2) My future feature will be based on "quarters" instead of integers.
This is an instance of the most basic unit:

4/2008

(meaning the 4th. quarter of year 2008, of course).

Therefore the notation (I don't allow spaces, btw) will look like
this:

2/2000,3/2000,4/2000,1/2005,2/2005-1/2006

Which could be minimally expressed like this:

2/2000-4/2000,1/2005-1/2006

My failed attempt at dealing with calendar quarters is lifted entirely
from the integer version, but for some reasons it doesn't seem to
scale well.

-Ramon

Below is my Validation Special version ..
-sln

--------------------
use strict;
use warnings;

my @qtrs = ();

while (<DATA>)
{
@qtrs = m{ (?:^|,) ([1-4]) / ([12]\d{3}) (?=[,-]|$) .? ((?<=-)[1-4]|) .? ((?<=/)[12]\d{3}|) (?=$|,) }xg;
if (@qtrs) {
print "-------\n";
while ( my ($q1,$y1,$q2,$y2) = map {defined() && length() ? $_ : 0} splice (@qtrs, 0, 4) )
{
printf "q:%2d y:%5d To q:%2d y:%5d\n", $q1,$y1, $q2,$y2;
}
}
# OR ...

@qtrs = m{ (?:^|,) ([1-4]) / ([12]\d{3}) (?=[,-]|$) .? (?:-([1-4]) / ([12]\d{3}))? (?=$|,) }xg;
if (@qtrs) {
print " ==\n";
while ( my ($q1,$y1,$q2,$y2) = map {defined() && length() ? $_ : 0} splice (@qtrs, 0, 4) )
{
printf "q:%2d y:%5d To q:%2d y:%5d\n", $q1,$y1, $q2,$y2;
}
}
}

__DATA__

1/1900

2/2000,3/2000,4/20073/2008,1/2005,2/2005-1/2006 asdf 2/2000-4/2000,1/2005-1/2006 wsefrgsd 2/2000-4/2000,1/2005-1/2006asdf,1/2008-2/2009

1/2000-2/2000, ,,3/2005-4/2006,2/2000

2/2000-4/2000, ,,6/2005-1/2006

1/2100-4/2200
2/2100-
,3/2100-
4/2100aadsfasdf

( [1-4]/[12][0-9]{3} | [1-4]/[12][0-9]{3} - [1-4]/[12][0-9]{3} )
(?: ,
( [ 1-4]/[12][0-9]{3} | [1-4]/[12][0-9]{3} - [1-4]/[12][0-9]{3} )
)*
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,969
Messages
2,570,161
Members
46,705
Latest member
Stefkari24

Latest Threads

Top