Is C an unsuitable choice as a string parser?

Jorgen Grahn

*Like* Ethernet MTU constraints... or UDP, or whatever. I could
have worded that better.

I would have said "in most programs there is a limit to the string
lengths you have to tolerate". But I suspect that's C's fault more
than anything else.

People want to say something like 'char buf[5000]' and get away with
it. That includes me -- I don't want to optimize for rare and silly
scenarios every time I read a string.
Badly. I would avoid this if I were you. Google for "Fragmentation
considered harmful".

Indeed, and that's why UDP-based protocols tend to try to keep the
datagram sizes down. (And TCP does it for you.)
I am sure there are PHY layers such that 64 k byte as an MTU
isn't a problem. At that point, go for it.


But this is *also* problematic. If you use blocking sockets, then
you are subject to the whims of whatever it is that is
unreliable between you and the far end.

If you use nonblocking sockets, you get a piece at a time
and get to do your own reassembly.

Reassembly in a /different/ sense though. You get whatever happens to
have queued up at that point in time, and that may be less /or/ more
than what you need to act.
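To make that concrete, here is a rough sketch of that kind of
reassembly (the newline framing, buffer size and names are just
assumptions for illustration): keep calling recv() on the nonblocking
socket, and only report success once a complete record has queued up.

#include <errno.h>
#include <string.h>
#include <sys/socket.h>

#define BUF_MAX 4096

struct assembly {
    char   buf[BUF_MAX];
    size_t len;                 /* bytes queued so far */
};

/* Returns 1 when a full newline-terminated record is buffered,
 * 0 if more data is needed, -1 on error, close, or overflow. */
static int pump(int fd, struct assembly *a)
{
    ssize_t n = recv(fd, a->buf + a->len, sizeof a->buf - a->len, 0);
    if (n < 0)
        return (errno == EAGAIN || errno == EWOULDBLOCK) ? 0 : -1;
    if (n == 0)
        return -1;                       /* peer closed */
    a->len += (size_t)n;
    if (memchr(a->buf, '\n', a->len))
        return 1;                        /* at least one complete record */
    if (a->len == sizeof a->buf)
        return -1;                       /* record too long for the buffer */
    return 0;                            /* less than we need: keep reading */
}

The caller still has to peel complete records off the front of the
buffer and keep any trailing partial record; the point is just that
"how much recv() hands you" and "how much you need to act" are
different things.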
And when you use *blocking* sockets, you may well *hang*
on a read() or ioctl() at embarrassing times. This may well
require a full-on reboot.

Hence "when it's a buffer"....


Not in actuality.

Yes, in actuality. It's just that vaguely similar problems will pop
up in the application layer.
So you have two choices - either treat "unlimited stream size" as a
natural right, and then have to go fix it when this assumption fails,
or understand the lower layers and plan accordingly.

I know which one I do...

Well, take a HTTP server for example. It sits waiting for something
like this on a TCP socket:

GET / HTTP/1.1
Host: example.org
Lots more things: ...
_

and only when the final empty line arrives may it act on it.

I don't think the HTTP RFC puts a limit to the line lengths, or the
total size of the request -- but in reality it would be foolish to
allow a client to sit for hours feeding in more and more data; the
only valid reason to do so is a DoS attack.
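For illustration, a rough sketch of what "don't let the client feed
headers forever" can look like (the 8 KB / 64 KB limits and the names
are made up, not from any RFC): read header lines until the blank
line, and give up if a line or the total header block gets too big.

#include <stdio.h>
#include <string.h>

enum { MAX_LINE = 8192, MAX_HEADER = 65536 };

/* Returns 0 when the blank line ending the headers has been seen,
 * -1 if a line or the header block exceeds our limits (or on EOF). */
static int read_headers(FILE *in)
{
    char line[MAX_LINE];
    size_t total = 0;

    while (fgets(line, sizeof line, in)) {
        size_t n = strlen(line);
        if (n && line[n - 1] != '\n')
            return -1;                   /* line longer than we tolerate */
        total += n;
        if (total > MAX_HEADER)
            return -1;                   /* request headers too large */
        if (strcmp(line, "\r\n") == 0 || strcmp(line, "\n") == 0)
            return 0;                    /* blank line: headers complete */
        /* ... parse the header line here ... */
    }
    return -1;                           /* EOF before the blank line */
}

(A real server would also put a timer on the whole exchange, which is
what actually stops the sit-for-hours client.)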

So yes, I agree that it's usually silly to handle a multi-megabyte
string. But the lower layers are not the reason.

/Jorgen
 
Malcolm McLean

On Sun, 2013-12-15, Les Cargill wrote:

People want to say something like 'char buf[5000]' and get away with
it. That includes me -- I don't want to optimize for rare and silly
scenarios every time I read a string.
You have to be sure you're not opening a security hole for an exploit.
In a lot of programming environments, it's not an issue. But where it is,
the consequences can be serious.
 
Seebs

I don't think the HTTP RFC puts a limit to the line lengths, or the
total size of the request -- but in reality it would be foolish to
allow a client to sit for hours feeding in more and more data; the
only valid reason to do so is a DoS attack.

Uh.

Uploads. I have used more than one page which allows file uploads,
and those are implemented as HTTP requests. Pretty sure that can in
at least some cases imply an HTTP request which is in fact going
to be feeding in data for a long time, and if there's a slow link,
that could be minutes, certainly.
So yes, I agree that it's usually silly to handle a multi-megabyte
string. But the lower layers are not the reason.

The key word is "usually".

-s
 
Michael Angelo Ravera

Hey all,
(My recent post on this question on stackoverflow put on hold as being 'opinion-based': stackoverflow.com/questions/20556729/is-c-an-unsuitable-choice-as-a-string-parser)
I am considering C as a candidate for implementing a string parser.
+ first specialized on English, but can be extended to parse arbitrary character encodings
+ tokenizes strings which may then be used to search a data store
+ allows optional embedding of tools like Natural Language Tool Kit (python) for more lexical analytic power
My task feels simple and limited -- not bad for a C program -- but I keep running across comments about C like 'prefer a language with first class string support' (e.g., stackoverflow.com/a/8465083/3097472)
I could prefer something like Go Lang, which seems to have good string processing, but C appeals because of the performance offered for the relatively reduced complexity of the task. Moreover, it seems that a library like ICU may help...
Can readers suggest any prima facie reason(s) not to use C as a string parser?

Other than local integration issues, such as the ability to build a library written in C into programs where the top-level language is something else, C serves as a fine language with which to build a string parser. strtok(), strpbrk(), and strspn() and their various updated functions are all designed to help make string parsing easy. And you can write your own adaptations with a bunch of boolean tables, and they can be made to perform very fast.
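A minimal sketch of the strspn()/strcspn() flavour of that (the input text and delimiter set are invented); unlike strtok(), it doesn't write into the source string:

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *text  = "parse this, and this; then stop";
    const char *delim = " ,;";
    const char *p = text;

    while (*p) {
        p += strspn(p, delim);             /* skip leading delimiters  */
        size_t n = strcspn(p, delim);      /* length of the next token */
        if (n)
            printf("token: %.*s\n", (int)n, p);
        p += n;
    }
    return 0;
}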

But, everything depends upon your goals of the eventual solution and the implementation.

If, for instance, you are going to write your tokenized strings into a database file or put every token on a separate line of a file or write a CSV file and then process that file, C won't much help the speed or clarity of your overall solution. You might as well use some language like perl or awk or even PHP to do your parsing, if parsing is just the first step and having the result in memory when you are done won't be a big advantage.

Basically, the speed you gain from C will be FAR overshadowed by I/O considerations, unless you can work with the result of the parsing in memory.
 
Edward A. Falk

But I have little doubt that there will be cases in which
either will outperform the other. "I use 'C' because it's
faster" is pretty weak tea and is possibly a signal
of premature optimization. ...

It depends. Sometimes, you really do need to optimize, often
right from the start. Other posters in this thread have given
some real examples.

And frankly, string parser is a pretty good example. A typical
use for a string parser is in processing inputs to databases.
Possibly large databases. Possibly *very* large databases.

In fact, take out the word "possibly" here. If your parser is
going to receive any kind of broad distribution, you can pretty
much guarantee that it's eventually going to be used on a big
dataset.

I once worked at a small web startup that had written everything
in Ruby. When the database passed ten million records, things started
to get bogged down pretty badly, and throwing more servers at the
problem was getting expensive.

We did a rough cost-benefit analysis, and the rule of thumb we
came up with was that once you passed 100 servers, it was better
to re-code in C than to keep adding more and more servers.
 
Jorgen Grahn

On Sun, 2013-12-15, Les Cargill wrote:

People want to say something like 'char buf[5000]' and get away with
it. That includes me -- I don't want to optimize for rare and silly
scenarios every time I read a string.
You have to be sure you're not opening a security hole for an exploit.
In a lot of programming environments, it's not an issue. But where it is,
the consequences can be serious.

Yes, of course. I'm assuming an interface where you (a) are explicit
about the length of your buffer and (b) can detect if it wasn't really
long enough. And that (c) you have an explicit plan for what to do in
that rare case.
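Something along these lines, say (the 4096 limit and the name
read_line are just for illustration): the length is explicit,
truncation is detectable, and the plan for the rare over-long line is
to discard it and report failure.

#include <stdio.h>
#include <string.h>

/* Returns 0 for a complete line (newline stripped), 1 if the line was
 * longer than the buffer (rest of it discarded), -1 on EOF or error. */
static int read_line(FILE *in, char *buf, size_t size)
{
    if (!fgets(buf, (int)size, in))
        return -1;
    size_t n = strlen(buf);
    if (n && buf[n - 1] == '\n') {
        buf[n - 1] = '\0';
        return 0;
    }
    /* the rare case: didn't fit -- skip to end of line and say so */
    int c;
    while ((c = getc(in)) != EOF && c != '\n')
        ;
    return 1;
}

Used as e.g. char buf[4096]; followed by read_line(stdin, buf, sizeof
buf). (A final line at EOF with no trailing newline is also reported
as truncated here, which a real version might want to treat more
kindly.)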

/Jorgen
 
Jorgen Grahn

Uh.

Uploads. I have used more than one page which allows file uploads,
and those are implemented as HTTP requests. Pretty sure that can in
at least some cases imply an HTTP request which is in fact going
to be feeding in data for a long time, and if there's a slow link,
that could be minutes, certainly.

Of course. I was oversimplifying. There's a difference between the
payload of the request (which doesn't really have record boundaries,
and can be pipelined) and the contents of the HTTP headers (which have
to be stored until they are complete, more or less, before the actual
data transfer may begin).

It's easy to write a HTTP client which establishes a TCP connection,
sends part of a request, then disappears without a trace. Multiply
that by 1000 or more, and you have a nice low-cost denial of service
attack. (Admittedly, you don't need long strings for that.)

/Jorgen
 
Edward A. Falk

On Sun, 2013-12-15, Les Cargill wrote:

People want to say something like 'char buf[5000]' and get away with
it. That includes me -- I don't want to optimize for rare and silly
scenarios every time I read a string.
You have to be sure you're not opening a security hole for an exploit.
In a lot of programming environments, it's not an issue. But where it is,
the consequences can be serious.

Yes; I like that the GNU compiler will warn you about some unsafe
practices. Buffer overflow is insidious.

Case in point: I was once the subject of a CERT advisory when the
San Diego Supercomputer Center discovered an exploit in a simple
configuration utility I had written. (In my defense, the vulnerability
was in some code I had copy-and-pasted from someone else's configuration
utility.) After that, I started to take security seriously, and even
attended DefCon once to see what I could learn.

Case in point: I was once tasked with hardening security on a friend's
web site after it had, once again, been broken into by script kiddies.
The vulnerabilities I found in the ftp daemon made me blanch.

If I'm reviewing someone else's code, and I see something like
"char buf[5000]", alarm bells go off.

Buffer overflows. Not even once.
 
wpihughes


In my experience embarrassingly parallel problems are quite common.
I find the application of a simple process pool neither expensive nor
awkward.
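For what it's worth, a bare-bones sketch of such a pool with fork()
(NWORKERS and do_slice() are placeholders): split the work into
independent slices, run one child per slice, and wait for them all.

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NWORKERS 4

static void do_slice(int which)
{
    (void)which;                   /* this worker's share of the job */
}

int main(void)
{
    for (int i = 0; i < NWORKERS; i++) {
        pid_t pid = fork();
        if (pid == 0) {            /* child: handle one slice and exit */
            do_slice(i);
            _exit(0);
        }
        if (pid < 0) {
            perror("fork");
            return 1;
        }
    }
    while (wait(NULL) > 0)         /* parent: reap all the workers */
        ;
    return 0;
}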

William Hughes
 
Jorgen Grahn

On Sun, 2013-12-15, Les Cargill wrote:

People want to say something like 'char buf[5000]' and get away with
it. That includes me -- I don't want to optimize for rare and silly
scenarios every time I read a string.
You have to be sure you're not opening a security hole for an exploit.
In a lot of programming environments, it's not an issue. But where it is,
the consequences can be serious.

Yes; I like that the GNU compiler will warn you about some unsafe
practices. Buffer overflow is insidious.

Case in point: ....

If I'm reviewing someone else's code, and I see something like
"char buf[5000]", alarm bells go off.

That's certainly a place in the code you need to examine, but what I'm
arguing is it doesn't have to be a bug. If e.g. you document "input
lines may not be larger than 4999 characters or the program will abort
with an error message" it's fair and sane and noone will complain.
(Assuming of course that you don't introduce an overflow.)
Buffer overflows. Not even once.

Yes, but not accepting infinite inputs and buffer overflows are
separate issues.

/Jorgen
 
Malcolm McLean

If I'm reviewing someone else's code, and I see something like
"char buf[5000]", alarm bells go off.

That's certainly a place in the code you need to examine, but what I'm
arguing is it doesn't have to be a bug. If e.g. you document "input
lines may not be larger than 4999 characters or the program will abort
with an error message" it's fair and sane and noone will complain.
(Assuming of course that you don't introduce an overflow.)

Buffer overflows. Not even once.

Yes, but not accepting infinite inputs and buffer overflows are
separate issues.
It's better to get into the way of thinking that the program will perform the
calculation unless it runs out of memory. However, sometimes you have to worry
about resource denial - less of an issue with C programs, because if the user
can run an arbitrary C program he can also easily hog every resource the OS
allocates to him, but it may still be a problem if programs are being run from
automated processes. Then sometimes legitimate over-sized input is so unlikely
that it's better to throw it out as obviously either malicious or corrupt.

But generally I'd use a "getline" function rather than a really big buffer.
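For instance, with the POSIX getline() (not standard C), the buffer grows to
fit whatever line arrives, so there is no fixed 5000-byte limit to overflow;
the example input handling is of course invented:

#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main(void)
{
    char *line = NULL;             /* getline() allocates and grows this */
    size_t cap = 0;
    ssize_t n;

    while ((n = getline(&line, &cap, stdin)) != -1)
        printf("%ld bytes: %s", (long)n, line);

    free(line);
    return 0;
}

(You can still cap the length yourself and reject absurd lines, which covers
the resource-denial case above.)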
 
Malcolm McLean

C allows you to manipulate chars easily. But once you look at it closer
(Unicode standard is a good start), a "letter" is quite a different matter.
UTF8 is just the start (where code points consist of 1 to 4 chars). But then
you have letters that are made up from multiple code points.
Sure. In some situations letters with accents are the same letter as letters
without, in other situations they are considered to be different letters.
It's a kind of inherent difficulty. English just happens to be quite computer
friendly, also it's the language the standards were originally designed for,
so conventions (e.g. how to represent capitals, are 0 and O the same or
different, are open and close quotes the same or different, are double
quotes characters in their own right or concatenated single quotes etc)
are quite well established.
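Just to make the "1 to 4 chars per code point" part concrete, a small
sketch of classifying a UTF-8 lead byte (it only reports the expected
sequence length; it doesn't validate continuation bytes or overlong
forms):

/* How many chars the UTF-8 sequence starting with this byte should
 * occupy: 1-4, or 0 for a continuation byte / invalid lead byte. */
static int utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;     /* 0xxxxxxx: plain ASCII    */
    if (lead < 0xC0) return 0;     /* 10xxxxxx: continuation   */
    if (lead < 0xE0) return 2;     /* 110xxxxx                 */
    if (lead < 0xF0) return 3;     /* 1110xxxx                 */
    if (lead < 0xF8) return 4;     /* 11110xxx                 */
    return 0;                      /* invalid lead byte        */
}

And even after decoding code points, letters built from several code
points (combining accents and so on) still need handling above this
level.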
 
