regexp(ing) Backus-Naurish expressions ...


qwertmonkey

I need to set up some code's running context via properties files, and I want
to make sure that users don't get too playful messing with them, because that
could alter results greatly and in unexpected ways (which they most probably
won't be able to make sense of, and then they would bother the hell out of you)
~
So, I must do some sanity checking of the running parameters, whether they are
entered via the command prompt or the defaults from the properties files are used
~
I am telling you all of that because you may know of libraries to do such a
thing
~
I think one possible way to do that is via a regexp, which should match all
the options included in the test array aISAr
~
One of the problems I am having is that if you enter as options, say, [true|t],
the matcher would match just the "t" of "true" and I want "true" to be
actually matched. Another one is that, say, " true " should be matched, as well
as "false [ nix |mac| windows ] line.separator" ...
~
Any ideas you would share?
~
thanks,
lbrtchx
~
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ TEST CODE ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// __
public class RegexMatches02Test{
  // __
  public static void main( String args[] ){
    String aRegEx;
    String aIS;
    Pattern Ptrn;
    Matcher Mtchr;
    int iCnt, iMtxStart, iMtxEnd;
    // __ successive attempts; only the last assignment is actually used
    aRegEx = "^\\s*[true|false|t|f]{1}\\s*\\[";
    aRegEx = "^\\s*[true|false|t|f]{1}";
    aRegEx = "^\\s*[true|false|t|f]{1}\\s*";
    aRegEx = "^\\s*[true|false t|f]{1}\\s*";

    // __ test inputs that the pattern should accept
    String[] aISAr = new String[]{
      " true[a|b |c ] q"
    , " true [a|b |c ] q"
    , "true [a|b |c ] q"
    , "true[a|b|c] b"
    , "true[a|b|c]q"
    , "False[ y | n | q ] q"
    , "false[nix|windows|mac]line.separator"
    , "false [ nix |mac| windows ] line.separator"
    , "T[y|n]q"
    , "T[y]"
    , "false"
    , "faLse"
    , "true"
    , "TrUe"
    , "F"
    , "T"
    };
    int iISArL = aISAr.length, i = 0;
    // __
    boolean IsLoop;
    Ptrn = Pattern.compile(aRegEx, Pattern.CASE_INSENSITIVE);

    System.err.println("// __ matching pattern: |" + aRegEx + "|");

    Mtchr = Ptrn.matcher(aISAr[i]); // get a matcher object for the first input
    IsLoop = (i < iISArL);
    while(IsLoop){
      System.err.println("// __ |" + i + "|" + aISAr[i] + "|");
      iCnt = 0;
      // __ report every match found in the current input
      while(Mtchr.find()){
        iMtxStart = Mtchr.start();
        iMtxEnd = Mtchr.end();
        System.err.println("|" + iCnt + "|" + iMtxStart + "|" + iMtxEnd + "|" +
          aISAr[i].substring(iMtxStart, iMtxEnd) + "|");
        ++iCnt;
      }// (Mtchr.find())
      System.err.println("~");
      // __ move on to the next input, reusing the same matcher
      ++i;
      IsLoop = (i < iISArL);
      if(IsLoop){ Mtchr.reset(aISAr[i]); }
    }// while(IsLoop)
  }
}
 

Arne Vajhøj

I need to set up some code's running context via properties files, and I want
to make sure that users don't get too playful messing with them, because that
could alter results greatly and in unexpected ways (which they most probably
won't be able to make sense of, and then they would bother the hell out of you)
~
So, I must do some sanity checking of the running parameters, whether they are
entered via the command prompt or the defaults from the properties files are used
~
I am telling you all of that because you may know of libraries to do such a
thing
~
I think one possible way to do that is via a regexp, which should match all
the options included in the test array aISAr
~
One of the problems I am having is that if you enter as options, say, [true|t],
the matcher would match just the "t" of "true" and I want "true" to be
actually matched. Another one is that, say, " true " should be matched, as well
as "false [ nix |mac| windows ] line.separator" ...
~
Any ideas you would share?

I would do it as:
- switch from properties to XML
- define a schema for the XML with strict restrictions on data
- let the application parse that with a validating parser and
read it into some config object, this will ensure that required
information is there and that the data types are correct
- let the application apply business validation rules in Java code
on the config objects - this will ensure that the various
information is consistent
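
The schema and validating-parser steps above might look roughly like this
with the JDK's javax.xml.validation API. This is only a sketch: the schema,
the element names (config, verbose, platform) and the class name are
assumptions invented for illustration, and a real application would load the
schema and the configuration from files rather than from strings.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

// Hypothetical example class; nothing here comes from the original thread.
class ConfigSchemaCheck {

    // Assumed schema: a boolean flag plus a platform restricted to three values.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='config'>"
      + "<xs:complexType><xs:sequence>"
      + "<xs:element name='verbose' type='xs:boolean'/>"
      + "<xs:element name='platform'>"
      + "<xs:simpleType><xs:restriction base='xs:string'>"
      + "<xs:enumeration value='nix'/>"
      + "<xs:enumeration value='mac'/>"
      + "<xs:enumeration value='windows'/>"
      + "</xs:restriction></xs:simpleType>"
      + "</xs:element>"
      + "</xs:sequence></xs:complexType>"
      + "</xs:element>"
      + "</xs:schema>";

    // Returns true if the XML text conforms to the schema above.
    static boolean isValid(String xml) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();
            // With no ErrorHandler set, validation errors are thrown as SAXException.
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException | IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String good = "<config><verbose>true</verbose><platform>mac</platform></config>";
        String bad  = "<config><verbose>true</verbose><platform>amiga</platform></config>";
        System.out.println(isValid(good)); // prints "true"
        System.out.println(isValid(bad));  // prints "false"
    }
}
```

The cross-field business rules from the last step would still have to be
written by hand against whatever config object the application builds.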

Arne
 

Joshua Cranmer 🐧

One of the problems I am having is that if you enter as options, say, [true|t],
the matcher would match just the "t" of "true" and I want "true" to be
actually matched. Another one is that, say, " true " should be matched, as well
as "false [ nix |mac| windows ] line.separator" ...

Do you know the syntax of Java's regular expressions? See
<http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html>.

In short, anything contained within square brackets is considered to be
a set of characters to match on, so [true|t] succeeds if the character
it's matching against is a t, r, u, e, or |. The syntax you probably
wanted was (true|t), which would either match the string "true" or the
string "t".
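
A small sketch of that difference, run against one of the inputs from the
test array. The class name, the helper method and the trailing \b (added
here so that a bare "t" is not accepted as a prefix of a longer word such as
"tru") are assumptions for illustration, not something taken from the
original code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class contrasting a character class with a group.
class BracketVersusGroupDemo {

    // Character class: matches exactly ONE character out of t, r, u, e and |.
    static final Pattern CHAR_CLASS =
        Pattern.compile("^\\s*[true|t]", Pattern.CASE_INSENSITIVE);

    // Group with alternation: matches one of the listed words.
    // \b requires the word to end there, so "tru" is rejected.
    static final Pattern GROUP =
        Pattern.compile("^\\s*(true|false|t|f)\\b", Pattern.CASE_INSENSITIVE);

    // Returns the trimmed text matched at the start of the input, or null.
    static String leadingMatch(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group().trim() : null;
    }

    public static void main(String[] args) {
        String input = " true[a|b |c ] q";
        System.out.println(leadingMatch(CHAR_CLASS, input)); // prints "t"
        System.out.println(leadingMatch(GROUP, input));      // prints "true"
    }
}
```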
 

Stefan Ram

I am telling you all of that because you may know of libraries to do such a
thing

The config class can be seen as a bean, and then bean
validation can be applied, possibly (I never used that).

http://docs.oracle.com/javaee/6/tutorial/doc/gircz.html

One of the problems I am having is that if you enter as options, say, [true|t],
the matcher would match just the "t" of "true" and I want "true" to be

(?:true|t(?=[^r][^u][^e]))

(sketch, untested)
 

markspace

One of the problems I am having is that if you enter as options, say, [true|t],
the matcher would match just the "t" of "true" and I want "true" to be
actually matched. Another one is that, say, " true " should be matched, as well
as "false [ nix |mac| windows ] line.separator" ...
~
Any ideas you would share?
~


Based on your syntax example and your title, why bother with
"Backus-Naurish"? Java has full parser generators.

http://www.antlr.org/
 

Robert Klemme

Regexes are quite limited.

I beg to differ: it's amazing what you can do with them. Especially
modern RX engines are usually much more powerful than those needed for
the class of regular languages.

When you bang into their limits you can
write a finite state machine or use a parser.

What limitations would make me want to write a FSM instead by hand?

Cheers

robert
 

Stefan Ram

Robert Klemme said:
What limitations would make me want to write a FSM instead by hand?

It is a natural idea that the user may input simple
arithmetic expressions with numeric literals, basic
arithmetics, parentheses and algebraic signs when the
program asks for a numeric value.
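
Presumably the point is that arbitrarily nested parentheses are beyond what
a classical regular expression can describe. A minimal sketch of the kind of
hand-written recursive-descent evaluator such input calls for follows; the
class name and the grammar in the comments are assumptions, not something
given in the thread.

```java
// Hypothetical evaluator for +, -, *, /, unary signs and parentheses.
class TinyExpressionParser {
    private final String src;
    private int pos;

    private TinyExpressionParser(String src) { this.src = src; }

    // Evaluates the whole string or throws IllegalArgumentException.
    static double evaluate(String text) {
        TinyExpressionParser p = new TinyExpressionParser(text);
        double value = p.expression();
        p.skipSpaces();
        if (p.pos != p.src.length()) {
            throw new IllegalArgumentException("Unexpected input at index " + p.pos);
        }
        return value;
    }

    // expression := term (('+' | '-') term)*
    private double expression() {
        double v = term();
        while (true) {
            if (accept('+')) v += term();
            else if (accept('-')) v -= term();
            else return v;
        }
    }

    // term := factor (('*' | '/') factor)*
    private double term() {
        double v = factor();
        while (true) {
            if (accept('*')) v *= factor();
            else if (accept('/')) v /= factor();
            else return v;
        }
    }

    // factor := ('+' | '-') factor | '(' expression ')' | number
    private double factor() {
        if (accept('+')) return factor();
        if (accept('-')) return -factor();
        if (accept('(')) {
            double v = expression(); // recursion handles any nesting depth
            if (!accept(')')) {
                throw new IllegalArgumentException("Expected ')' at index " + pos);
            }
            return v;
        }
        return number();
    }

    // number := run of digits and dots, checked by Double.parseDouble
    private double number() {
        skipSpaces();
        int start = pos;
        while (pos < src.length()
                && (Character.isDigit(src.charAt(pos)) || src.charAt(pos) == '.')) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("Expected a number at index " + pos);
        }
        return Double.parseDouble(src.substring(start, pos));
    }

    // Consumes the character c (after optional spaces) if it is next.
    private boolean accept(char c) {
        skipSpaces();
        if (pos < src.length() && src.charAt(pos) == c) { pos++; return true; }
        return false;
    }

    private void skipSpaces() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++;
    }

    public static void main(String[] args) {
        System.out.println(evaluate("2 * (3 + 4)")); // prints "14.0"
    }
}
```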
 

Roedy Green

Examples where regexes run out of steam:
parsing Java, HTML, BAT language ... to do syntax colouring.
screen scraping, where what you want can appear in arbitrary orders, be
missing, or enclosed in a variety of delimiters.

creating code to simulate the output of forms. You have to do it in
stages. You pick out a string then you pick out strings of that
 

Roedy Green

What limitations would make me want to write a FSM instead by hand?

Compacting out nugatory space in HTML would be another example.

Though they are quite complicated, I find FSMs very easy to write, and
they almost always work first time. You can narrow your thinking to a
tiny case and ignore the big picture quite safely.
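
For a sense of what such a hand-written FSM can look like, here is a minimal
sketch that collapses runs of whitespace in HTML text while copying tags
through untouched. The three states and the class name are assumptions made
for illustration; it deliberately ignores pre blocks, comments, scripts and
attribute values that contain '>'.

```java
// Hypothetical three-state machine for squeezing whitespace between tags.
class WhitespaceCompactor {

    private enum State { TEXT, SPACE_RUN, TAG }

    static String compact(String html) {
        StringBuilder out = new StringBuilder(html.length());
        State state = State.TEXT;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            switch (state) {
                case TEXT:
                    if (c == '<') {
                        out.append(c);
                        state = State.TAG;
                    } else if (Character.isWhitespace(c)) {
                        out.append(' ');        // first blank of a run becomes one space
                        state = State.SPACE_RUN;
                    } else {
                        out.append(c);
                    }
                    break;
                case SPACE_RUN:
                    if (c == '<') {
                        out.append(c);
                        state = State.TAG;
                    } else if (!Character.isWhitespace(c)) {
                        out.append(c);
                        state = State.TEXT;
                    }
                    // further whitespace in the run is dropped
                    break;
                case TAG:
                    out.append(c);              // tags are copied verbatim
                    if (c == '>') {
                        state = State.TEXT;
                    }
                    break;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(compact("<p>Hello   \n  world</p>")); // prints "<p>Hello world</p>"
    }
}
```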

In contrast, I find my regexes (of any complexity) nearly always have
some unexpected behaviour, often one that does not show up immediately.

The other complicating factor is I use three different regex schemes
in a day: Java, Funduc and SlickEdit. I keep borrowing syntax from
one of the other schemes than the one I am using. Some day I will
have to write replacements that use Java syntax.
 

Robert Klemme

It is a natural idea that the user may input simple
arithmetic expressions with numeric literals, basic
arithmetics, parentheses and algebraic signs when the
program asks for a numeric value.

I am sorry but you are not answering the question.

Cheers

robert
 

Robert Klemme

Examples where regexes run out of steam:

I never said you can do anything with regexps. You said they are "quite
limited" to which I responded "I beg to differ: it's amazing what you
can do with them." I think you are talking completely past me.

parsing Java, HTML, BAT language ... to do syntax colouring.

For that you need a context free parser anyway and would not create a
FSM by hand.

screen scraping, where what you want can appear in arbitrary orders, be
missing, or enclosed in a variety of delimiters.

Still, I haven't seen a single reason to create a FSM by hand.

creating code to simulate the output of forms. You have to do it in
stages. You pick out a string then you pick out strings of that

Regexps are for _parsing_ and not for _generating_.

Cheers

robert
 

Robert Klemme

Compacting out nugatory space in HTML would be another example.

There are tools for processing tag based languages. Why would I want to
create a FSM by hand for that?

Though they are quite complicated, I find FSMs very easy to write, and
they almost always work first time. You can narrow your thinking to a
tiny case and ignore the big picture quite safely.

Certainly you can write FSMs for a lot of things. But you were claiming
that a manual FSM should be used instead of a regexp engine; so the
question remains unanswered: why would anyone create a FSM by hand for
parsing?

In contrast, I find my regexes (of any complexity) nearly always have
some unexpected behaviour, often one that does not show up immediately.

Well, that certainly depends on your familiarity with the tool. To me
this sounds suspiciously like NIH syndrome. I am so familiar with using
regular expressions of various kinds that it would not occur to me to
start writing a FSM for parsing by hand. That is such a waste of time.

The other complicating factor is I use three different regex schemes
in a day: Java, Funduc and SlickEdit. I keep borrowing syntax from
one of the other schemes than the one I am using.

And how exactly do you implement a FSM in SlickEdit?

Some day I will
have to write replacements that use Java syntax.

Not sure what you mean by that.

Cheers

robert
 

Arne Vajhøj

There are tools for processing tag based languages. Why would I want to
create a FSM by hand for that?


Certainly you can write FSMs for a lot of things. But you were claiming
that a manual FSM should be used instead of a regexp engine; so the
question remains unanswered: why would anyone create a FSM by hand for
parsing?

It sounds cool to claim to do so in a usenet thread!

:)

And how exactly do you implement a FSM in SlickEdit?


Not sure what you mean by that.

I think he is talking about writing a plugin with a 100%
Java compatible regex syntax.

Arne
 

Joshua Cranmer 🐧

Examples where regexes run out of steam:
parsing Java, HTML, BAT language ... to do syntax colouring.

Actually, all of those examples fall under the category of lexing, which
is very easy to do with regular expressions; the Python equivalent of
flex uses regular expressions internally to do the lexing. Basically,
what you'd have to do is this:

1. For each token, compute the regex that matches the token and enclose
it in a named capturing group
2. Combine the token regexes into a single regex using disjunctions
3. Run the large regex on the input string by continually finding
matches until it runs out of them.
4. For each match, use the named capturing group to do actions for that
part of the input string.
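
The four steps above can be sketched in Java with named capturing groups
(available since Java 7). The token names, the token patterns and the class
name are assumptions chosen to fit the option strings from the original
question; this is an illustration, not the production tokenizer mentioned
in this post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical lexer for strings such as "true[a|b|c] q".
class NamedGroupLexer {

    // Step 1: one named group per token kind (names are made up here).
    static final String[] TOKEN_NAMES = {"WORD", "LBRACKET", "RBRACKET", "BAR", "SPACE"};

    // Step 2: all token regexes joined into one big alternation.
    static final Pattern TOKENS = Pattern.compile(
          "(?<WORD>[A-Za-z.]+)"
        + "|(?<LBRACKET>\\[)"
        + "|(?<RBRACKET>\\])"
        + "|(?<BAR>\\|)"
        + "|(?<SPACE>\\s+)");

    // Returns tokens as "KIND:text"; whitespace tokens are skipped.
    static List<String> lex(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKENS.matcher(input);
        int pos = 0;
        // Step 3: keep finding matches until there are none left.
        while (m.find()) {
            if (m.start() != pos) {
                throw new IllegalArgumentException("Unexpected character at index " + pos);
            }
            // Step 4: the group that is non-null tells us which token matched.
            for (String name : TOKEN_NAMES) {
                String text = m.group(name);
                if (text != null) {
                    if (!name.equals("SPACE")) {
                        tokens.add(name + ":" + text);
                    }
                    break;
                }
            }
            pos = m.end();
        }
        if (pos != input.length()) {
            throw new IllegalArgumentException("Unexpected character at index " + pos);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(lex("false [ nix |mac| windows ] line.separator"));
    }
}
```

The resulting token list would then be handed to a parser, which checks
that the tokens appear in an acceptable order.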

screen scraping, where what you want can appear in arbitrary orders, be
missing, or enclosed in a variety of delimiters.

([()<>,:;@])|(?:[^\\"]|\\.)*|\[(?:[^\\\]]|\\.)*\]|(?:\\.|[^ \t\r\n()<>,:;@["])+

That is an example of a production regular expression I use specifically
for tokenizing. Note in particular that I am matching two separate kinds
of string literals ("foo" and [foo]). The hard part here is that I'm
dealing with an idiot language that made comment-parsing context-free,
but I decided to say "to hell with this" and ignore that fact, banking
that it's a rare edge case I never have to deal with.

Granted, such large regular expressions can become extremely unwieldy
(said regex is actually composed out of about five lines of code plus
detailed comments above each part explaining what it does), but it's
still very simple to do in a regex.
 

Stefan Ram

Joshua Cranmer 🐧 said:
Actually, all of those examples fall under the category of lexing,

Parsing is not lexing, usually parsing comes after lexing.
 

Eric Sosman

[...]
([()<>,:;@])|(?:[^\\"]|\\.)*|\[(?:[^\\\]]|\\.)*\]|(?:\\.|[^
\t\r\n()<>,:;@["])+

That is an example of a production regular expression I use specifically
for tokenizing. [...]

As Ed Post noted nearly thirty years ago:

It has been observed that a TECO command sequence
more closely resembles transmission line noise
than readable text.
-- "Real Programmers Don't Use PASCAL"

Nobody I know of uses TECO any more, but regexes satisfy
people's craving for gibberish.
 

Arne Vajhøj

[...]
([()<>,:;@])|(?:[^\\"]|\\.)*|\[(?:[^\\\]]|\\.)*\]|(?:\\.|[^
\t\r\n()<>,:;@["])+

That is an example of a production regular expression I use specifically
for tokenizing. [...]

As Ed Post noted nearly thirty years ago:

It has been observed that a TECO command sequence
more closely resembles transmission line noise
than readable text.
-- "Real Programmers Don't Use PASCAL"

Nobody I know of uses TECO any more, but regexes satisfy
people's craving for gibberish.

$ edit/teco z.z
%Can't find file "Z.Z"
%Creating new file
*ex$$

:)

(sorry - the only thing I know about TECO is how to exit)

Arne
 

Eric Sosman

[...]
Nobody I know of uses TECO any more, but regexes satisfy
people's craving for gibberish.

$ edit/teco z.z
%Can't find file "Z.Z"
%Creating new file
*ex$$

:)

(sorry - the only thing I know about TECO is how to exit)

Perhaps the most important lesson of all! ;-)
 
