Is C an unsuitable choice as a string parser?

Malcolm McLean · Dec 16, 2013

On 12/16/2013 11:18 AM, Phil Carmody wrote:

...

Characters that have a special meaning for an RE engines are necessarily
more complicated to search for, but I don't know of any RE engine that
is completely incapable of doing so. Could you give an example? The ones

I know best can match any such character by preceding it with a '\'.

What he mean is that whilst we can use escapes to match an expression of
the form ( text ), we can't define a regular expression to match

expression = term + or - term
term = number or ( expression )
number = list of digits starting with non-zero

John Gordon · Dec 16, 2013

Characters that have a special meaning for an RE engines are necessarily
more complicated to search for, but I don't know of any RE engine that
is completely incapable of doing so. Could you give an example? The ones
I know best can match any such character by preceding it with a '\'.

I assume he was referring to RE's general inability to grok nesting, not
that RE's literally cannot match a ']' character.

Ken Brody · Dec 16, 2013

On 12/16/2013 12:19 PM, Malcolm McLean wrote:
[...]

What he mean is that whilst we can use escapes to match an expression of
the form ( text ), we can't define a regular expression to match

expression = term + or - term
term = number or ( expression )
number = list of digits starting with non-zero

So zero isn't a number? ;-)

(Yes, I know it was a quick-and-dirty example, and not a "real life"
description.)

James Kuyper · Dec 16, 2013

Then chose 'p' and 'q' as the bracketting characters, and try again.
No RE can parse the language of well-nested brackets, as that language
*isn't regular*.

That's a more complicated statement than your original; I wouldn't argue
with the more complicated version - but the original version overstated
your case.

Ben Bacarisse · Dec 16, 2013

James Kuyper said:
That's a more complicated statement than your original; I wouldn't argue
with the more complicated version - but the original version overstated
your case.

I'm a bit puzzled. Are you saying that this (the original statement):

| And RE's can't even match brackets.
| Isn't the first non-trivial grammer you learn to parse the well-nested
| brackets one?

is significantly different to the latter one? I can't see much in it
myself.

James Kuyper · Dec 16, 2013

I'm a bit puzzled. Are you saying that this (the original statement):

| And RE's can't even match brackets.
| Isn't the first non-trivial grammer you learn to parse the well-nested
| brackets one?

is significantly different to the latter one? I can't see much in it
myself.

Yes, the original says simply, without qualification, that RE's can't
match brackets. I'm quite familiar with ways of using RE's to match
brackets - I use vi REs quite frequently for precisely that purpose
while editing C source code - so I know that's wrong.

The more complicated version specifies parsing a language that uses
well-nested brackets. That's a much more complicated issue than simply
matching brackets. I'm sure that REs could be used as part of such a
parsing process: the system I'm most familiar with for parsing strings
are lex/yacc and flex/bison. I'm not sure of the significance of Phil's
usage of the phrase "well-nested", but I'm quite sure that either of
those systems can deal quite easily with nested brackets. lex and flex
are both built upon REs. However, REs play only a small roll in the
entire process, and I wouldn't dare to claim they could do the entire job.

Siri Cruz · Dec 16, 2013

Gallagher Polyn said:
Hey all,

(My recent post on this question on stackoverflow put on hold as being
'opinion-based':
stackoverflow.com/questions/20556729/is-c-an-unsuitable-choice-as-a-string-par
ser)

I am considering C as a candidate for implementing a string parser.

YACC and Lex are string parsers, written C and generating C code. I have also
written a GHR parser (an enhancement of CKY) in Algol 60 and C, and that can
parse with any type 2 grammar. So, yes, you can do it. I don't know offhand of a
type 1 grammar parser, and I don't know if there would a polynomial bounds
guarantee; a type 0 grammar parser would be problematic since there's no halting
guarantee.

My task feels simple and limited -- not bad for a C program -- but I keep
running across comments about C like 'prefer a language with first class
string support' (e.g., stackoverflow.com/a/8465083/3097472)

With a garbage collector such as the Boehm GC, you can easily create a string
library to your heart's delight. Boehm GC even supplies one for you. Apple's
LLVM Objective-C is supposed to automate reference counts for effective garbage
collection; I don't know if it handles circular lists or can be used for
straight C.

Any programming language can implement a Turing Machine has the same power as
any other. It's not what you can do, but what is the easiest notation.

Siri Cruz · Dec 16, 2013

James Kuyper said:
That's a more complicated statement than your original; I wouldn't argue
with the more complicated version - but the original version overstated
your case.

REs can't recognise Dyck sets.

Simple enough?

Ben Bacarisse · Dec 16, 2013

James Kuyper said:
Yes, the original says simply, without qualification, that RE's can't
match brackets.

You didn't take the second sentence as qualifying the first? You
thought Phil Carmody (with whose posts you can't be entirely unfamiliar)
was stating that (a) REs can't match the bracket /characters/ and, quite
independently with no connection or expository purpose, (b) speculated
on what simple grammars one learns? That seems so very odd that I feel
I must have misunderstood you.

Siri Cruz · Dec 16, 2013

James Kuyper <[email protected]> said:
matching brackets. I'm sure that REs could be used as part of such a
parsing process: the system I'm most familiar with for parsing strings

Well, yeah, that's already known. Any context free (type 2) language can be
described as the appropriate mapping of the union of a regular (type 3) language
and a Dyck set.

Ben Bacarisse · Dec 16, 2013

Phil Carmody said:
And RE's can't even match brackets.
Isn't the first non-trivial grammer you learn to parse the well-nested
brackets one?

At the risk of complicating matters, some RE systems and/or libraries
*can* match languages with well-nested brackets. Famously, recent
versions of Perl's "regular expressions" can do this, and therefore so
can systems that include libpcre like PHP. Of course, it's the name
that's wrong -- such patterns no longer match regular languages -- but
there is far too much history to change the name.

Siri Cruz · Dec 16, 2013

Ben Bacarisse said:
At the risk of complicating matters, some RE systems and/or libraries
*can* match languages with well-nested brackets. Famously, recent
versions of Perl's "regular expressions" can do this, and therefore so
can systems that include libpcre like PHP. Of course, it's the name
that's wrong -- such patterns no longer match regular languages -- but
there is far too much history to change the name.

Depending on what's allowed you can get away with a DPDA + one integer counter
without needing a full stack.

James Kuyper · Dec 16, 2013

You didn't take the second sentence as qualifying the first?

No, I took the second sentence as giving, as an example, a context where
the claim made by the first sentence causes a problem. That claim, if
true, would certainly cause problems in that context.

I know of several non-trivial grammars supporting nesting brackets, but
I have no idea which particular grammar he's thinking of. My formal
education was in theoretical physics; computer programming was
originally simply a supporting skill that somehow ended up becoming my
career. Therefore, I never took the kind of course he was referring to.

If the second sentence was meant to qualify the previous one, it should
have contained phrases identifying the way in which it qualified it.
However, I can't figure out how to modify that pair of sentences by
inserting such phrases, to mean what he meant by those sentences. It's
far easier to do a complete re-write:

"Grammars supporting well-nested brackets (such as the very first
non-trivial one you learned to parse) cannot be parsed entirely by the
use of RE's."

If that's not an accurate re-wording of his statement, that just proves
I still don't understand it.

Charlton Wilbur · Dec 17, 2013

JK> Yes, the original says simply, without qualification, that RE's
JK> can't match brackets. I'm quite familiar with ways of using RE's
JK> to match brackets - I use vi REs quite frequently for precisely
JK> that purpose while editing C source code - so I know that's
JK> wrong.

vi's "regular expressions" are more powerful than the mathematical
formalism "regular expressions."

Charlton

Keith Thompson · Dec 17, 2013

James Kuyper said:
No, I took the second sentence as giving, as an example, a context where
the claim made by the first sentence causes a problem. That claim, if
true, would certainly cause problems in that context.

I know of several non-trivial grammars supporting nesting brackets, but
I have no idea which particular grammar he's thinking of. My formal
education was in theoretical physics; computer programming was
originally simply a supporting skill that somehow ended up becoming my
career. Therefore, I never took the kind of course he was referring to.

If the second sentence was meant to qualify the previous one, it should
have contained phrases identifying the way in which it qualified it.
However, I can't figure out how to modify that pair of sentences by
inserting such phrases, to mean what he meant by those sentences. It's
far easier to do a complete re-write:

"Grammars supporting well-nested brackets (such as the very first
non-trivial one you learned to parse) cannot be parsed entirely by the
use of RE's."

If that's not an accurate re-wording of his statement, that just proves
I still don't understand it.

I believe it is, though of course I can't speak for him.

One possible interpretation is that "can't match brackets" means
that, given something like "(foo (bar) baz)", a regular expression
can't *match* the '(' at the beginning of the string to the ')'
at the end of it.

In any case, I hope we can all agree that the horse we're beating
is now well and truly dead.

glen herrmannsfeldt · Dec 17, 2013

Keith Thompson said:
(snip)

I believe it is, though of course I can't speak for him.

One possible interpretation is that "can't match brackets" means
that, given something like "(foo (bar) baz)", a regular expression
can't *match* the '(' at the beginning of the string to the ')'
at the end of it.

In any case, I hope we can all agree that the horse we're beating
is now well and truly dead.

OK, to get back to C, why are the arguments to preprocessor macros
required to be balanced, but not the replacements?

#define left (
#define right )
#define lb {
#define rb }
#define semi ;

#include <stdio.h>

int main left right lb semi
printf left "hi there!" right semi rb

Too much fun with the preprocessor.

(I didn't try writing macros with unbalanced () in the arguments.)

-- glen

Keith Thompson · Dec 17, 2013

glen herrmannsfeldt said:
OK, to get back to C, why are the arguments to preprocessor macros
required to be balanced, but not the replacements?

Arguments are required to be balanced only for macros that take
arguments (i.e., function-like macros). Without that requirement,
it would be impossible to detect the end of an invocation.

There are probably ways to work around that limitation, but I don't
think it would be worth the effort. You could pass a ( or ) token as a
macro argument, which is currently not possible, but personally I've
never felt the need to do that.

Phil Carmody · Dec 17, 2013

James Kuyper said:
That's a more complicated statement than your original; I wouldn't argue
with the more complicated version - but the original version overstated
your case.

I sometimes rely on people being able to read my mind, and perhaps filling
in gaps, or, worse, having to extrapolate. Apologies if there were ambiguities.

Phil

James Kuyper · Dec 17, 2013

I sometimes rely on people being able to read my mind, and perhaps filling
in gaps, or, worse, having to extrapolate. Apologies if there were ambiguities.

I now realize that you were probably talking about matching
corresponding brackets to each other, rather than writing an RE that
matches a single bracket. "match brackets" could mean either of those
things.

Malcolm McLean · Dec 17, 2013

On 12/17/2013 06:32 AM, Phil Carmody wrote:

I now realize that you were probably talking about matching
corresponding brackets to each other, rather than writing an RE that
matches a single bracket. "match brackets" could mean either of those
things.

Yes, it was a simple misunderstanding, and obvious to anyone who knows
that regular expressions aren't the same as formal grammars. Which you
wouldn't know unless you'd either been told or tried to build an RE
expression parser. So all it needed was a correction to clear it up.
I'm amazed at some of the posts.

newbee help : parser in C	21	Jun 19, 2008
Evaluating a string as C++ Code	17	Jul 30, 2011
PEP/GSoC idea: built-in parser generator module for Python?	0	Mar 14, 2014
C as a Platonic pathology	88	Aug 16, 2009
New C operator -- would it be a good idea?	28	Sep 10, 2012
Specifying an API for a straeming parser	1	Dec 4, 2009
Somone's SO question: "Is there an existing library for dynamically-determineddimensional array in c	1	Dec 9, 2013
Programming "only" in an environment of C and it's friends - Being a toolsmith	51	May 27, 2012

Is C an unsuitable choice as a string parser?

Malcolm McLean

John Gordon

Ken Brody

James Kuyper

Ben Bacarisse

James Kuyper

Siri Cruz

Siri Cruz

Ben Bacarisse

Siri Cruz

Ben Bacarisse

Siri Cruz

James Kuyper

Charlton Wilbur

Keith Thompson

glen herrmannsfeldt

Keith Thompson

Phil Carmody

James Kuyper

Malcolm McLean

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads