Is C an unsuitable choice as a string parser?

Les Cargill · Dec 14, 2013

Malcolm said:
You frequently know that the program is going to run into severe time
constraints, for instance if you're displaying 3D graphics for a video game,

That's a pretty specialized sort of constraint. So other than the
three* people who have to actually deal with that.

*hyperbolically chosen figure used to illustrate the narrow
bandwidth of such a filter...

you'll want to add more and more geometry until you hit the frame rate and
the programs starts skipping frames. The more triangles on screen, the
better it looks. So in cases like that, using C right from the word go, at
least in the processor-intensive parts of the program, is a sensible choice.

See "fits the runtime environment better".

Gallagher Polyn · Dec 14, 2013

As the OP, I had been concerned not to digress too much from C into more detail on my purpose for the parser. Of course, I have been reading the ensuing responses with as much profit as my understanding allows. In hopes that it may resolve the basis for any irresolvable claims by the participants, let me say a little more...

I "23 tabby cats 2pm tomorrow Tarrytown"
II "23 teal teacups 2pm tomorrow Tarrytown"

Descriptions of cats and teacups matching the expressions above are encodedin a data store. For example, there could be descriptions of teacups and cats that distinguish them by how many are available or by different sets ofcolors or locations or available times. The parser's job is to to interpret strings like those above as they are entered and suggest matches as the user types. So if the user has typed "23 t", she may then be prompted to consider selecting between I and II with "23 tabby cats" and "23 teacups". In a sense, I seek to elicit form data from users with strings.

In one version of handling strings, which I would call the 'tokenizer' version, splitting the strings on word boundaries is a reasonable way to searchthe data store for matches.

In another version of handling strings, which I'll call the 'lexical' version, more grammatical versions of I and II could be handled, for example, "23 tabby cats tomorrow at 2pm in Tarrytown".

I anticipate the tokenizer version will be the best initial version, but I would like to retain the option of the lexical version. This is why I thought C, besides other possible benefits, could be reasonable: because it seems the easiest way to incorporate other resources like Python's NLTK or any other developed resources for lexical analysis not developed in C. (Of course, I have been warned against this in the preceding discussion.)

I'd like to retain the option of interpreting strings enterered in non-english, too. That is, I should presume internationalization to as many spoken languages as possible.

G

Seebs · Dec 14, 2013

There are no *three thousand* character, single strings in the wild.
There are vanishingly few *one thousand* character strings in the wild.

What on earth do you mean?

Now, if you put an entire webpage or document into one string, I'd
consider that more complex object *than* a string and build
scaffolding to deal with it. For one, you'll need to unmarshal
several read() calls against a socket or file.

How is it not "a string"? Sure, it might be assembled from other things,
but there are reasons to, at least sometimes, want to treat a huge hunk
of data as "a string".

-s

Les Cargill · Dec 14, 2013

Seebs said:
What on earth do you mean?

Most things in real life are constrained by things like Ethernet
MTU limitations...

How is it not "a string"?

Q: When is a string not a string?
A: When it's a buffer.

James Harris · Dec 14, 2013

Les Cargill said:
Most things in real life are constrained by things like Ethernet
MTU limitations...

Not quite. Ethernet is a data link technology. IP runs on top of data link
systems such as Ethernet and adds unreliable datagram communication. This
permits sizes up to 64k. More importantly, TCP runs on top of IP and adds
both reliability and a stream model. The stream model makes apps communicate
octet streams without regard to packet or datagram sizes. So any app which
uses TCP can pass long strings to other processes. It is not limited by the
1500 or 64k limits.

Note the Total Length field in the IP header

https://en.wikipedia.org/wiki/IPv4_header#Header

and the 32-bit sizes of the Syn and Ack fields in the TCP header

https://en.wikipedia.org/wiki/Transmission_Control_Protocol#TCP_segment_structure

The point is that Ethernet MTUs do not limit the sizes of strings that
programs can send to one another. UDP will not but TCP will split a very
long string into datagrams as needed.

James

Les Cargill · Dec 15, 2013

James said:
Not quite.

*Like* Ethernet MTU constraints... or UDP, or whatever. I could
have worded that better.

Ethernet is a data link technology. IP runs on top of data link
systems such as Ethernet and adds unreliable datagram communication. This
permits sizes up to 64k.

Badly. I would avoid this if I were you. Google for "Fragmentation
considered harmful"

I am sure there are PHY layers such that 64 k byte as an MTU
isn't a problem. At that point, go for it.

More importantly, TCP runs on top of IP and adds
both reliability and a stream model.

But this is *also* problematic. If you use blocking sockets, then
you are subject to the whims of whatever it is that is
unreliable between you and the far end.

If you use nonblocking sockets, you get a piece at a time
and get to do your own reassembly.

And when you use *blocking* sockets, you may well *hang*
on a read() ioctl() at embarrassing times. This may well
require a full-on reboot.

Hence "when it's a buffer"....

The stream model makes apps communicate
octet streams without regard to packet or datagram sizes.

Not in actuality.

So any app which
uses TCP can pass long strings to other processes. It is not limited by the
1500 or 64k limits.

SO you have two choices - either treat "unlimited stream size" as a
natural right, and then have to go fix it when this assumption fails,
or understand the lower layers and plan accordingly.

I know which one I do...

Note the Total Length field in the IP header

https://en.wikipedia.org/wiki/IPv4_header#Header

and the 32-bit sizes of the Syn and Ack fields in the TCP header

https://en.wikipedia.org/wiki/Transmission_Control_Protocol#TCP_segment_structure

Absolutely.

The point is that Ethernet MTUs do not limit the sizes of strings that
programs can send to one another. UDP will not but TCP will split a very
long string into datagrams as needed.

This entire line of argument *ignores* that I still have to wonder
why anybody needs 1 megabyte strings - that subsequently require
some sort of string manipulation - in the first place.

BartC · Dec 15, 2013

This entire line of argument *ignores* that I still have to wonder
why anybody needs 1 megabyte strings - that subsequently require
some sort of string manipulation - in the first place.

Why not? Once you have a comfortable way of dealing efficiently with strings
of any length (counted strings for example) then you will find all sorts of
uses for them.

And without the need for a zero-byte, they can also be used to store any
binary data.

And 1-million-character strings might be used to represent file contents, as
you mentioned, or to serialise large date structures, whatever, before
writing to a file, or perhaps adding them to something else, or just doing
whatever needs doing!

In fact the most difficult aspect is when you need to send the string to a
function that expects a zero-terminated one!

Les Cargill · Dec 15, 2013

BartC said:
Why not? Once you have a comfortable way of dealing efficiently with
strings of any length (counted strings for example) then you will
find all sorts of uses for them.

And without the need for a zero-byte, they can also be used to store
any binary data.

And 1-million-character strings might be used to represent file
contents, as you mentioned, or to serialise large date structures,
whatever, before writing to a file, or perhaps adding them to
something else, or just doing whatever needs doing!

In fact the most difficult aspect is when you need to send the string
to a function that expects a zero-terminated one!

"You're looney" - Graham Chapman.

BartC · Dec 15, 2013

"You're looney" - Graham Chapman.

By coincidence, someone posts a message about representing million-digit
numbers ("How to approach large numbers in programming puzzles"). My first
attempts at such a library, used strings, with a million-character string
representing a million-digit number.

Malcolm McLean · Dec 15, 2013

That's a pretty specialized sort of constraint. So other than the
three* people who have to actually deal with that.

The two common situations is that the computer performs calculations which
are their core are relatively trivial (e.g. performs income tax calculations
for a few thousand or even a few million people), and wraps that in nice
interfaces. Or the other common situation is that we have a problem which
can't really be solved by computer, e.g. computing shadows, but we can have a
nice stab at it (trace say five reflections from up to three light sources).
There are lots of problems in the second class, though they tend to require
a higher level of ability to program. It's not a job for that being a
Microsoft Certified Professional will qualify you for.

If the problem is in the second class, you almost certainly will run into
severe processing time constraints. Sometimes nowadays you can solve it
by throwing parallel processors at it, but that tends to be an expensive
and awkward solution. C is the obvious choice of programming language
whether you go for a parallel solution or not.

Edward A. Falk · Dec 15, 2013

I agree with everything you say until here. If the OP is
truly intent upon writing language parsers in C, writing
a RE parser would seem to be an appropriate first step.

I disagree. I've written dozens of basic parsers in C, but
I've never written an RE parser.

Edward A. Falk · Dec 15, 2013

IMHO, C is a *superb* language for a string parser -- if
you care about efficiency. It has all the right pieces for
easy scanning through strings and doing comparisons.

You pay a higher price during the intial coding, as you have to
build the data structures to hold your input buffers, readers,
tokens, and parser state. But once you have the basic infrastructure
for your parser down, the rest really flies.

If you want quick-n-dirty, or built-in string handling, then go with
Python, or maybe even Java. But for real production work, go with C.

Les Cargill · Dec 15, 2013

BartC said:
By coincidence, someone posts a message about representing million-digit
numbers ("How to approach large numbers in programming puzzles"). My
first attempts at such a library, used strings, with a million-character
string representing a million-digit number.

That's weird. I mean no snark, I was just having a failure of
imagination as to how that would be used.

Les Cargill · Dec 15, 2013

Malcolm said:
The two common situations is that the computer performs calculations which
are their core are relatively trivial (e.g. performs income tax calculations
for a few thousand or even a few million people), and wraps that in nice
interfaces. Or the other common situation is that we have a problem which
can't really be solved by computer, e.g. computing shadows, but we can have a
nice stab at it (trace say five reflections from up to three light sources).
There are lots of problems in the second class, though they tend to require
a higher level of ability to program.

It's most certainly more specialized.

Malcolm McLean · Dec 15, 2013

It's most certainly more specialized.

Not really. How many times have you seen people sat at a computerised device
which shows a 3D scene? Granted, most of them will have faked up shadows
with the current state of technology, but that's very rapidly changing.

Malcolm McLean · Dec 15, 2013

I would agree, but I am aware that many do not see it that way.

C allows you to read, assign to and test a letter with very straightforwards
syntax. A lot of more sophisticated languages have lost sight of those
elementary requirements. Though to be fair, if you move to UFT8 encoding,
it's no longer so easy in C.

J. Clarke · Dec 16, 2013

*Like* Ethernet MTU constraints... or UDP, or whatever. I could
have worded that better.

Badly. I would avoid this if I were you. Google for "Fragmentation
considered harmful"

I am sure there are PHY layers such that 64 k byte as an MTU
isn't a problem. At that point, go for it.

But this is *also* problematic. If you use blocking sockets, then
you are subject to the whims of whatever it is that is
unreliable between you and the far end.

If you use nonblocking sockets, you get a piece at a time
and get to do your own reassembly.

And when you use *blocking* sockets, you may well *hang*
on a read() ioctl() at embarrassing times. This may well
require a full-on reboot.

Hence "when it's a buffer"....

Not in actuality.

SO you have two choices - either treat "unlimited stream size" as a
natural right, and then have to go fix it when this assumption fails,
or understand the lower layers and plan accordingly.

I know which one I do...

This entire line of argument *ignores* that I still have to wonder
why anybody needs 1 megabyte strings - that subsequently require
some sort of string manipulation - in the first place.

"What is the longest palindromic sequence in the first million digits of
pi?"

I admit that it's a contrived problem.

Phil Carmody · Dec 16, 2013

Robert Wessel said:
Quite. I've never understood the fascination with RegEx engines for
basic tokenization. Usually like using a sledgehammer to kill flies.

And RE's can't even match brackets.
Isn't the first non-trivial grammer you learn to parse the well-nested brackets one?

Phil

James Kuyper · Dec 16, 2013

On 12/16/2013 11:18 AM, Phil Carmody wrote:
....

And RE's can't even match brackets.

Characters that have a special meaning for an RE engines are necessarily
more complicated to search for, but I don't know of any RE engine that
is completely incapable of doing so. Could you give an example? The ones
I know best can match any such character by preceding it with a '\'.

Phil Carmody · Dec 16, 2013

James Kuyper said:
On 12/16/2013 11:18 AM, Phil Carmody wrote:
...

Characters that have a special meaning for an RE engines are necessarily
more complicated to search for, but I don't know of any RE engine that
is completely incapable of doing so. Could you give an example? The ones
I know best can match any such character by preceding it with a '\'.

Then chose 'p' and 'q' as the bracketting characters, and try again.
No RE can parse the language of well-nested brackets, as that language
*isn't regular*.
http://en.wikipedia.org/wiki/Pumping_lemma_for_regular_languages

Phil

newbee help : parser in C	21	Jun 19, 2008
Evaluating a string as C++ Code	17	Jul 30, 2011
PEP/GSoC idea: built-in parser generator module for Python?	0	Mar 14, 2014
C as a Platonic pathology	88	Aug 16, 2009
New C operator -- would it be a good idea?	28	Sep 10, 2012
Specifying an API for a straeming parser	1	Dec 4, 2009
Somone's SO question: "Is there an existing library for dynamically-determineddimensional array in c	1	Dec 9, 2013
Programming "only" in an environment of C and it's friends - Being a toolsmith	51	May 27, 2012

Is C an unsuitable choice as a string parser?

Les Cargill

Gallagher Polyn

Seebs

Les Cargill

James Harris

Les Cargill

BartC

Les Cargill

BartC

Malcolm McLean

Edward A. Falk

Edward A. Falk

Les Cargill

Les Cargill

Malcolm McLean

Malcolm McLean

J. Clarke

Phil Carmody

James Kuyper

Phil Carmody

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads