Very simple parser... not for me

BartC · Jan 29, 2014

Richard Damon said:
Since in C, 'A' is totally identical to the value it represents,

case 'A' ... 'Z' meaning something other than

case (int)'A' ... (int)'Z'

I'm thinking about a special case where character constants are used to
specify each end of the range. If that's going to cause too many problems,
then alternate range operators will do, even if something like
letterrange('A','Z'); anything is better than having to write out 26 cases.
Or perhaps allowing case "ABCDEF":

would break that property.
On an EBCDIC machine the former would appear to omit the non-letters in
the range, while the latter would include them.

Sure, but how many EBCDIC machines are around? I don't think I've ever come
across any EBCDIC data (not since around 1977 or so). And in the 30+ years
I've been programming, I've used ASCII. I would have been put out, if I'd be
using C all that time, to have been denied the use of ranges as I've
described, even when no character codes were involved.

And then what do you do with

case 'A' ... 0x7F: ???

In this non-special case (where the author most probably has ASCII, ANSI,
UTF-8 or the bottom end of Unicode in mind), the range would be 65 to 127.
But how would this be expressed now in C?

James Kuyper · Jan 29, 2014

(I mentioned this was code-generated, the original was '0' .. '9', which is
expanded for the C version. Unfortunately the char constants are just mapped
to integers so are difficult to restore. And it was a lot of fiddly typing
to fix them all for this simple demo.

Perhaps if C allowed such a range in switch cases, then I would have had
less of an excuse,

if(nextchar >= '0' && nextchar <= '9')
{
}
else switch(nextchar)
{
}

It's a little more clumsy than specifying a range inside the switch, but
it has the advantage of being fully supported by the current version of
C, and it's a bit simpler than your current approach.

pozzugno · Jan 29, 2014

Is there something tricky I've not noticed? It seems to me

that a recognizer for these inputs would have rather few states,
if we permit the recognition of <c> and <g> as "primitives." For
reference, here's your grammar again:

#
*#
*<g>#
<c>*<g>#
<c>#
<c>*#
where
<c> is a 4- to 6-digits code
<g> is the exact digit '1', '2' or '3'.

... and a recognizer might go something like this:

Yes, you're right, it's simple. I was thinking to change the state
for each incoming character.

Anyway this is a ad-hoc implementation. If I change the sequence of
characters/fields admitted, the finite state machine implementation
could change dramatically.

I think it's better a "token approach", a function that parses the
string and generates a list of tokens #, *, <c> and <g> with the
associated values (for <c> and <g>). Depending on the state, I can
process different sequences of tokens and discard others as invalid.

Eric Sosman · Jan 29, 2014

Yes, you're right, it's simple. I was thinking to change the state
for each incoming character.

You could do that, too, if it seems easier. There'd be
more states, but they'd be simpler -- for example, "each state
consumes exactly one character" might be a helpful assertion.

Anyway this is a ad-hoc implementation. If I change the sequence of
characters/fields admitted, the finite state machine implementation
could change dramatically.

Well, yeah. If you switch to accepting JSON strings the
parser will have to change, too. What are the chances?

I think it's better a "token approach", a function that parses the
string and generates a list of tokens #, *, <c> and <g> with the
associated values (for <c> and <g>). Depending on the state, I can
process different sequences of tokens and discard others as invalid.

One thing that makes your parser particularly easy is the
fact that the grammar is so simple that it never requires any
look-ahead or backtracking: Given any (valid) prefix, you need
only examine the immediately following token to determine what
to do next. That eases the <c> and <g> recognition, because you
always know which one you're looking for. That is, you never
need to look at a '2' and wonder whether it's a <g> or just the
start of a <c>; you always know whether a <c> or <g> is due,
and the question becomes simply "Is this the Right Stuff?" In
parsing lingo your grammar can be recognized by an LR(0) parser
rather than a more complicated LR(1) or LR(k). (For completeness:
LR(k) is not the only family of parsing algorithms, just a very
popular type.)

Keith Thompson · Jan 29, 2014

BartC said:
Perhaps if C allowed such a range in switch cases, then I would have
had less of an excuse, but it's been discussed here a few times and
there are always arguments against it, mainly that it encourages
people to type 'A' .. 'Z', even though 99.99% of the time (or 100% of
the time for 99.99% of programs), that would work as expected.)

But it's wrong 100% of the time if your program runs on a system where
'A'..'Z' are not contiguous.

Correctness is not entirely about probabilities.

Keith Thompson · Jan 29, 2014

James Kuyper said:
if(nextchar >= '0' && nextchar <= '9')
{
}
else switch(nextchar)
{
}

Or

if (isdigit((unsigned char)nextchar)
{
/* ... */
}
else {
switch (nextchar) {
/* ... */
}
}

(Yes, the need for the cast is annoying.)

BartC · Jan 29, 2014

[reposted; may be duplicate]
From: "Keith Thompson" <[email protected]>
Sent: Wednesday, January 29, 2014 4:52 PM
Newsgroups: comp.lang.c
Subject: Re: Very simple parser... not for me

But it's wrong 100% of the time if your program runs on a system where
'A'..'Z' are not contiguous.

But it is worth worrying about to the extent that hundreds of thousands of
programmers are forced to write things like:

case 1: case 2: case 3: .... case 77:

for decade after decade, just because someone might one day compile the
source code on some IBM machine or whatever obscure hardware still uses
EBCDIC? And this is an example that is unrelated to character set encoding
so it wouldn't have made any difference.

Besides I suggested a few ways of making 'A'...'Z' work even with EBCDIC.

pozz · Jan 29, 2014

Il 29/01/2014 16:20, Eric Sosman ha scritto:

One thing that makes your parser particularly easy is the
fact that the grammar is so simple that it never requires any
look-ahead or backtracking: Given any (valid) prefix, you need
only examine the immediately following token to determine what
to do next. That eases the <c> and <g> recognition, because you
always know which one you're looking for. That is, you never
need to look at a '2' and wonder whether it's a <g> or just the
start of a <c>; you always know whether a <c> or <g> is due,
and the question becomes simply "Is this the Right Stuff?"

You are right for the combinations I listed in my original post. What I
was trying to do is to implement a more general parser that could be
modified in a simple way if I (or someone) decides to change the valid
combinations of tokens (I'm in a developing stage and I don't know
exactly what will be the valid combinations, they may change during
development of the product and this could depend on the customer too...
of course the keypad has only 0-9, # and * keys, so I will never change
the parser to understand JSON strings or similar complex grammars).

For example, what you wrote is right for the following combinations:
#
*#
*<g>#
<c>*<g>#
<c>#
<c>*#
What happens if the combinations would be the following:
#
*#
<g>*#
<c>*<g>#
<c>#
<c>*#
Now there are two combinations where <g> and <c> could appear at the
beginning of the string, so I have to look ahead to understand if I'm
reading a <g> field (just one char '1', '2' or '3') or a <c> field (a
numerical value of 4- to 6-digits).

BartC · Jan 29, 2014

What happens if the combinations would be the following:
#
*#
<g>*#
<c>*<g>#
<c>#
<c>*#
Now there are two combinations where <g> and <c> could appear at the
beginning of the string, so I have to look ahead to understand if I'm
reading a <g> field (just one char '1', '2' or '3') or a <c> field (a
numerical value of 4- to 6-digits).

You can try a token-based parser, so, you'd have the symbols: c, g, star,
hash and perhaps end-of-sequence and error. Or, more generally, digits, star
and hash, where you have to analyze digits to find if they are a g or c
field.

Introducing a one-character look-ahead might be simpler if the format isn't
going to get much more complicated than this.

However we don't know that; could g become long enough to get confused with
c? Could there be more different kinds of fields? Could repeated instances
of * and # be meaningful? Could you have one group of digits followed
immediately by another (delimited by a certain length perhaps), or could
there be consecutive g's (in which case "1331" might be one c, or four g's)?

glen herrmannsfeldt · Jan 30, 2014

Not that it helps in this case (where we're using a switch),
but the C standard does guarantee that '0', '1', etc, up to '9',
are consecutive numbers. Whereas there is no guarantee that, say,
'B' is one more than 'A'.

Well, that isn't so much of a problem, but 'J' might be more than
one more than 'I', and, even more, '}' might be between 'I' and 'J'.

-- glen

Keith Thompson · Jan 30, 2014

glen herrmannsfeldt said:
Well, that isn't so much of a problem, but 'J' might be more than
one more than 'I', and, even more, '}' might be between 'I' and 'J'.

In practice, yes. In principle, though, the C standard provides no
guarantees about the values of the 26 uppercase letters (beyond the
guarantees it provides for all printable characters in the basic
character set).

In the real world, the vast majority of computer systems (of those that
use character sets at all) use character sets based on ASCII, in which
each of the ranges '0'..'9', 'A'..'Z', 'a'..'z' is contiguous.

The vast majority of non-ASCII systems use some form of EBCDCIC, which
has gaps between 'i' and 'j', between 'r' and 's', between 'I' and 'J',
and between 'R' and 'S'.

I've never heard of a C implementation using any other character set.

The C standard probably could have gotten away with, for example,
mandating that 'a'..'f' and 'A'..'F' are contiguous, or that the letters
are in order even if they're not contiguous ('a' < 'z', etc.). At the
time the standard was being written, it was less clear than it is now
that ASCII-based systems were going to take over the world, with EBCDIC
as the only significant competitor. (And if I had a time machine, I
might go back and encourage IBM to adopt ASCII, which was developed at
about the same time as EBCDIC.)

But there's another point to consider. If you write something like

if (c >= 'a' && c <= 'z')

even if it never runs on an EBCDIC-based system, you're still ignoring
everything other than the 26 letters of the Latin alphabet. (I count
7689 occurrences of "LETTER" in a list of Unicode characters.)

glen herrmannsfeldt · Jan 30, 2014

(snip)

In the real world, the vast majority of computer systems (of those that
use character sets at all) use character sets based on ASCII, in which
each of the ranges '0'..'9', 'A'..'Z', 'a'..'z' is contiguous.

The vast majority of non-ASCII systems use some form of EBCDCIC, which
has gaps between 'i' and 'j', between 'r' and 's', between 'I' and 'J',
and between 'R' and 'S'.

I've never heard of a C implementation using any other character set.

The C standard probably could have gotten away with, for example,
mandating that 'a'..'f' and 'A'..'F' are contiguous, or that the letters
are in order even if they're not contiguous ('a' < 'z', etc.). At the
time the standard was being written, it was less clear than it is now
that ASCII-based systems were going to take over the world, with EBCDIC
as the only significant competitor. (And if I had a time machine, I
might go back and encourage IBM to adopt ASCII, which was developed at
about the same time as EBCDIC.)

For S/360, IBM tried to get ASCII-8, which isn't just regular
ASCII (I think sometimes called ASCII-7) with an extra bit, but
with half the table moved over. But it never happened.

But there's another point to consider. If you write something like

if (c >= 'a' && c <= 'z')

even if it never runs on an EBCDIC-based system, you're still ignoring
everything other than the 26 letters of the Latin alphabet. (I count
7689 occurrences of "LETTER" in a list of Unicode characters.)

Well, in Java you can actually use those letters, for example, as
variable names. You need a UTF8 or Unicode editor, though.

I did try it once just to see it. There is a free, downloadable
for Windows Unicode editor that will write files in a form that
javac can read.

And, as with C, there are methods to indicate which characters
are letter, digit, etc. (There are more digit, too.)

-- glen

Eric Sosman · Jan 30, 2014

Il 29/01/2014 16:20, Eric Sosman ha scritto:

You are right for the combinations I listed in my original post. What I
was trying to do is to implement a more general parser that could be
modified in a simple way if I (or someone) decides to change the valid
combinations of tokens (I'm in a developing stage and I don't know
exactly what will be the valid combinations, they may change during
development of the product and this could depend on the customer too...
of course the keypad has only 0-9, # and * keys, so I will never change
the parser to understand JSON strings or similar complex grammars).

Only you can guess how complex the grammar might someday
become, and whether it might grow gnarly enough to justify pulling
out all the machinery of lex and yacc or their brethren. But for
the teeny-tiny grammar at hand, and for the teeny-tiny repertoire
of symbols and roles, my counsel would be "YAGNI." Or "KISS."

For example, what you wrote is right for the following combinations:
#
*#
*<g>#
<c>*<g>#
<c>#
<c>*#
What happens if the combinations would be the following:
#
*#
<g>*#
<c>*<g>#
<c>#
<c>*#
Now there are two combinations where <g> and <c> could appear at the
beginning of the string, so I have to look ahead to understand if I'm
reading a <g> field (just one char '1', '2' or '3') or a <c> field (a
numerical value of 4- to 6-digits).

Yeah. Note also that if you're getting the input one
button-press at a time, the revised grammar makes it impossible
to act upon a leading <g> until the next button is pressed,[*]
no matter what kind of parser you use. So: If "button-by-button
response" is an important feature (you haven't described the
application) you simply *will not* make this modification. And
if you *do* make it, you still have a simple grammar for which
plugging in a new parser will be equally simple. (Look a bit at
the parser I showed you; observe that it's not "random logic" but
a systematic enumeration of prefixes. You can learn to do this
yourself with little effort.)

[*] There's an old story about a fellow who wanted to attend
a lecture by an eminent speaker. The lecture, though, was to be
delivered in German, which the listener didn't know. But he was
so keen to attend that he arranged for a simultaneous translation:
He hired a translator, bought him a second lecture ticket, and
anticipated hearing the Great Man's German while the English
equivalent was whispered in his ear. So the guy and his translator
take their seats, the speaker takes the stage, and the lecture
begins -- and the translator is silent. The speaker goes on and
on and on, and the translator says nothing. "Psst, hey!" says the
increasingly agitated attendee. "What's he saying?" "Patience,
Sir," says the translator, "I am waiting for the verb."

glen herrmannsfeldt · Jan 30, 2014

(snip on ASCII and IBM)

IBM actually planned to use ASCII* on the S/360, the machines even had
a mode bit to alter the operation of the handful of instructions that
had character set dependencies.

Several events conspired against that plan. The ASCII standard was
late, the unit record peripherals (printers, card readers, etc.) with
ASCII support were late, so the OS guys got started using EBCDIC, then
the whole OS/360 development fiasco happened, and there was no time to
do anything but desperately try and get OS/360 into any sort of
condition that you could ship it to customers, and so EBCDIC was cast
in stone for OS/360.

*Actually a bit of a reordered variant, but probably close enough.

Yes, the seem to call it ASCII-8.

But it isn't close enough if you are copying files back and forth.
There would have been fewer problems with translation tables, though.

Among others, EBCDIC has a logical NOT sign, used in PL/I for
the logical and relational operators. (I think some other languages,
such as REXX, also use it.) That complicates the translation
tables for ordinary text files.

-- glen

Kaz Kylheku · Jan 31, 2014

But there's another point to consider. If you write something like

if (c >= 'a' && c <= 'z')

even if it never runs on an EBCDIC-based system, you're still ignoring
everything other than the 26 letters of the Latin alphabet. (I count
7689 occurrences of "LETTER" in a list of Unicode characters.)

If you use something like

isalpha(ch);

you're also ignoring everything other than the 26 letters of the Latin
alphabet, unless you called setlocale somewhere.

Malcolm McLean · Feb 1, 2014

If you use something like
isalpha(ch);

you're also ignoring everything other than the 26 letters of the Latin
alphabet, unless you called setlocale somewhere.

Not necessarily ignoring. Treating as non-alphabetical.
Sometimes that's right, sometimes it's wrong, quite often it's not
well-defined, people often specify "alphanumerical characters" without
being clear about which alphabets are allowed, or whether decorated
Latin characters are the same as or different to their base letters.
If you work in an English-only environment, it's easy to shunt that sort
of issue into a siding, because you don't know or understand what non-English
speakers might want, and you're not sure what language these putative
non-Anglophones will speak.

Tim Rentsch · Feb 7, 2014

pozz said:
I have a string that is generated from a numerical keypad with
decimal numbers 0-9 and the two special characters * and #,
similar to a phone keypad.

I want to parse the string and make something or something else
depending on the current state and the string itself. Generally
the string can be:
#
*#
*<g>#
<c>*<g>#
<c>#
<c>*#
where
<c> is a 4- to 6-digits code
<g> is the exact digit '1', '2' or '3'.

Depending on the current state, some combinations are valid,
others not. In some cases the code should be valid (already saved
in the system), in other cases it could be new (when a new code
must be entered to change the current one).

I think I have solved the problem trying to split the string in to
three fields: <c>, '*' and <g>. If the first char is a digit, it
must be a 4- to 6- code, otherwise it must be a '*' or a '#', and
so on...

Even if my code works, I don't think this is an elegant solution.
What happens if in the feature I decide to change one combination?
For example, from "#" to "###"? Or from "<c>*#" to "*<c>#"? I
have to recode everything, because my assumption the string can be
split in three fields in that order is not true anymore.

What is the best approach to implement a simple parser like this?

One approach:

1. Devise a data structure that can represent all the
different patterns you have to recognize;

2. Write a function that can scan a string and see if
it matches a particular pattern;

3. Make an instance of the pattern structure for each
of your allowed patterns;

4. Call the scanning function repeatedly, for each pattern,
until a match is found.

For example, we might represent a pattern as a sequence (that is,
an array) of "fields". A field is a set of allowed characters
and a minimum and maximum number of characters allowed in the
field:

typedef struct {
const char *allowed_set;
unsigned char m, n;
} Field;

The final Field of a pattern has 'allowed_set' == 0.

Now we can make an array to hold pointers to the first
element of each legal pattern:

static Field *patterns[] = {
(Field[]){ {"*", 1,1 }, {"123", 1,1 }, {"#", 1,1 }, 0},
(Field[]){ {"*", 1,1 }, {"#", 1,1 }, 0},
(Field[]){ {"#", 1,1 }, 0},
(Field[]){ {"0123456789", 4,6}, {"*", 1,1}, {"123", 1,1}, {"#", 1,1}, 0},
(Field[]){ {"0123456789", 4,6}, {"*", 1,1}, {"#", 1,1}, 0},
(Field[]){ {"0123456789", 4,6}, {"#", 1,1}, 0},
};

A function to scan a string and see if it matches a given pattern
can be written very concisely by using tail recursion:

int
matches_string( Field f[], const char *s ){
if( ! f->allowed_set ) return *s == 0;
size_t k = strspn( s, f->allowed_set );
return k < f->m || k > f->n ? 0 : matches_string( f+1, s+k );
}

(Incidentally I am using the standard library function strspn()
here, but if it isn't available it should be easy enough to write
a replacement - certainly no more than 10 lines of code.)

Putting all this together, here is a program that looks at
each of its arguments and reports which pattern is matched:

#include <string.h>
#include <stdio.h>

typedef struct {
const char *allowed_set;
unsigned char m, n;
} Field;

static Field *patterns[] = {
(Field[]){ {"*", 1,1 }, {"123", 1,1 }, {"#", 1,1 }, 0},
(Field[]){ {"*", 1,1 }, {"#", 1,1 }, 0},
(Field[]){ {"#", 1,1 }, 0},
(Field[]){ {"0123456789", 4,6}, {"*", 1,1}, {"123", 1,1}, {"#", 1,1}, 0},
(Field[]){ {"0123456789", 4,6}, {"*", 1,1}, {"#", 1,1}, 0},
(Field[]){ {"0123456789", 4,6}, {"#", 1,1}, 0},
};

int which_pattern_matches( const char * );
int matches_string( Field f[], const char *s );

int
main( int argc, char *argv[] ){
int i;
for( i = 1; i < argc; i++ ){
int k = which_pattern_matches( argv );
printf( "Argument %15s matches pattern %3d\n", argv, k );
}
return 0;
}

int
which_pattern_matches( const char *string ){
for( unsigned n = 0; n < sizeof patterns / sizeof patterns[0]; n++ ){
if( matches_string( patterns[n], string ) ) return n;
}
return -1;
}

int
matches_string( Field f[], const char *s ){
if( ! f->allowed_set ) return *s == 0;
size_t k = strspn( s, f->allowed_set );
return k < f->m || k > f->n ? 0 : matches_string( f+1, s+k );
}

A sample compile/run command:

gcc -std=c99 -pedantic example.c && a.out '#' '*#' '*1#' '01234*2#' 'foo'

By the way, it might be useful for the scanning routine to
remember the start of each field as it goes. That can be
done by revising matches_string to take an extra parameter,
the address of a buffer into which to record the field start
locations:

int
matches_string( Field f[], const char *s, const char *r[] ){
... similar to above, with one extra line ...
}

I need some help on a format issue that should be simple for someone here (but not me!)	0	Jul 6, 2023
C++ Student, need help with very simple concept	2	Mar 13, 2023
Pls someone help me	3	Oct 18, 2024
How to implement a html parser in java?	1	Dec 28, 2023
A very simple parser with scanf & C	49	Sep 23, 2011
Can anyone help me code a simple python code?	1	Mar 13, 2022
Please, help me.	1	Jun 30, 2021
Why is Python telling me variable is local not global?	3	Sep 2, 2023

Very simple parser... not for me

BartC

James Kuyper

pozzugno

Eric Sosman

Keith Thompson

Keith Thompson

BartC

pozz

BartC

glen herrmannsfeldt

Keith Thompson

glen herrmannsfeldt

Eric Sosman

glen herrmannsfeldt

Kaz Kylheku

Malcolm McLean

Tim Rentsch

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads