On 21 Oct 2011 08:58:55 GMT, Jorgen Grahn <[email protected]> wrote:
> I think it's a bit more complicated than that. Assuming some
> line-oriented input format:
> - It's usually wrong to set a limit (say, 8192 bytes) and pretend
> anything longer is two or more lines.

I'd say always wrong, or very nearly always.

> - It's usually wrong to accept *any* length by using malloc()/realloc(),
> and then start swapping and crashing when someone feeds you a
> 10GB line.
> I assume a decent awk sets some limit (far higher than any reasonable
> input) and exits gracefully with an error message when it gets
> anything larger.

I recently had occasion to do some testing on this.
gawk on Linux died around 1G with an error message clearly indicating
that reallocation failed (as expected for a 2G-ish address space).
gawk on mingw failed somewhere in the millions, which experiment
confirmed to be very nearly the size at which a program that just
realloc'ed one buffer upward also failed; presumably this has to do
with the Windows address space as seen by mingw, and I didn't
investigate in detail.
Other awks on several Solarises*, HP-UX, and an elderly AIX all died
in the thousands, except that Solaris's most ancient 'oawk' did the very
bad thing of *breaking* lines over some length, IIRC about 500.
(* I think the plural should actually be Solares, if I recall my
high-school Latin, but I doubt anybody cares.)
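
FWIW, the policy described above (grow the buffer with realloc(), but
give up cleanly past a generous cap instead of either splitting the
line or swapping to death) could be sketched in C roughly as follows.
This is only an illustration, not what gawk or any other awk actually
does; read_line_capped, the status names and the initial size of 128
are all made up for the example.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

enum line_status { LINE_OK, LINE_EOF, LINE_TOO_LONG, LINE_NO_MEMORY };

/*
 * Read one line into *buf, growing it with realloc() as needed, but
 * refuse lines longer than max_len bytes (newline not counted).
 * The caller starts with *buf == NULL and *cap == 0; both are reused
 * across calls and *buf is freed by the caller at the end.
 */
enum line_status read_line_capped(FILE *in, char **buf, size_t *cap,
                                  size_t *len, size_t max_len)
{
    size_t n = 0;
    int c;

    if (*buf == NULL) {
        char *p = malloc(128);
        if (p == NULL)
            return LINE_NO_MEMORY;
        *buf = p;
        *cap = 128;
    }
    while ((c = getc(in)) != EOF && c != '\n') {
        if (n == max_len)
            return LINE_TOO_LONG;       /* rest of the line is left unread */
        if (n + 1 == *cap) {            /* keep one byte free for the '\0' */
            char *p;
            if (*cap > SIZE_MAX / 2)
                return LINE_NO_MEMORY;  /* doubling would overflow size_t */
            p = realloc(*buf, *cap * 2);
            if (p == NULL)
                return LINE_NO_MEMORY;  /* the old buffer is still valid */
            *buf = p;
            *cap *= 2;
        }
        (*buf)[n++] = (char)c;
    }
    if (c == EOF && n == 0)
        return LINE_EOF;                /* a read error also ends up here */
    (*buf)[n] = '\0';
    *len = n;
    return LINE_OK;                     /* includes a last line without '\n' */
}
```

On LINE_TOO_LONG or LINE_NO_MEMORY the caller would print a message
and exit; the point is that it never hands back half a line as if it
were a whole one.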

> - Many file/data formats put a limit to the line length anyway.
> NNTP, SMTP. IIRC also the C language says a compiler is allowed
> to bail out on lines longer than N characters.

But NNTP and SMTP, at least, limit line lengths 'on the wire',
after encoding. From early on there were lots of attempts at encodings
allowing 'real' lines to be longer than the transmitted ones, although
none were really successful until MIME's QP and B64.
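
To make the QP case concrete: quoted-printable (RFC 2045) keeps every
transmitted line to at most 76 characters by ending it with a lone '='
(a "soft line break") that the decoder removes, so the 'real' line can
be any length. A minimal sketch of an encoder for one logical line
follows; qp_encode_line is a name invented here, and the code takes
the simplest legal choices (for instance it never tries to avoid
breaking in an awkward place).

```c
#include <stdio.h>

#define QP_MAX_LINE 76  /* RFC 2045 limit per encoded line, CRLF excluded */

/* Write one logical line (no line terminator in s) to out as
 * quoted-printable, inserting soft line breaks as needed. */
void qp_encode_line(const unsigned char *s, size_t len, FILE *out)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t col = 0;
    size_t i;

    for (i = 0; i < len; i++) {
        unsigned char c = s[i];
        int last = (i + 1 == len);
        /* Printable ASCII except '=' goes out as-is; so do space and
         * tab, unless they would end the logical line. */
        int literal = (c >= 33 && c <= 126 && c != '=') ||
                      ((c == ' ' || c == '\t') && !last);
        size_t need = literal ? 1 : 3;

        /* Keep one column free for the '=' of a soft break. */
        if (col + need > QP_MAX_LINE - 1) {
            fputs("=\r\n", out);
            col = 0;
        }
        if (literal) {
            putc(c, out);
        } else {
            putc('=', out);
            putc(hex[c >> 4], out);
            putc(hex[c & 15], out);
        }
        col += need;
    }
    fputs("\r\n", out);  /* the real ("hard") end of the line */
}
```

A 100-character line of plain letters thus goes out as 75 letters
plus '=', then the remaining 25, and the receiver glues the two
transmitted lines back into one.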
A C implementation is not required to support more than 4095 chars
(509 in C89) in a 'logical' source line (after backslash splicing) nor
more than 254 per line in *text* files at runtime. The basic source
charset doesn't include newline, but some end-of-line 'indicator' is
"treat[ed] ... as if ... newline", so it's not clear if the 4095
includes that. The basic execution charset includes newline, which
terminates lines in text files, and is included in the 254.
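
As an aside, a program that does use a fixed buffer with fgets() can
at least tell "the line didn't fit" apart from "a whole line", rather
than treating the overflow as a second line. A small sketch, assuming
the input has no embedded NUL bytes; read_fixed_line and its return
codes are invented for this example.

```c
#include <stdio.h>
#include <string.h>

/*
 * Read one line into buf (size bytes) with the newline stripped.
 * Returns 1 for a complete line, 0 at end of input, and -1 if the
 * line did not fit; in that case buf holds only its first part and
 * the rest is still unread.
 */
int read_fixed_line(FILE *in, char *buf, size_t size)
{
    size_t n;
    int c;

    if (fgets(buf, (int)size, in) == NULL)
        return 0;
    n = strlen(buf);
    if (n > 0 && buf[n - 1] == '\n') {
        buf[n - 1] = '\0';
        return 1;
    }
    if (n + 1 < size)
        return 1;       /* buffer not full: last line, no newline */
    /* Buffer is full; see whether the line really continues. */
    c = getc(in);
    if (c == EOF || c == '\n')
        return 1;       /* the line exactly filled the buffer */
    ungetc(c, in);
    return -1;
}
```

With char buf[255] this accepts exactly the 254-character lines
(newline included) that an implementation is obliged to support, and
reports anything longer instead of quietly splitting it.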
In the 70s and into the 80s when C was developed (and most of the
major Internet protocols also) there were lots of important systems
and especially filesystems that had definite limits on the sizes of
records, which conventionally mapped to lines of text (although a C
implementor could break that mapping if they had to).
The Tanpaqard NonStop defined a rather odd format for text files
(officially EDIT format, commonly called code-101 because that's how
it is identified in the directory) which compresses runs of spaces and
can handle actual line lengths from about 239 with no spaces to over
3000 with many spaces. This was created before C existed, and when
Tandem implemented C (belatedly, right around '88) they added
'C-format' files (bag-o-bytes with NL) as code-180, and explicitly
noted in their manual that reading EDIT files, although supported as
an extension (and extremely useful to sites with lots of those files)
was not quite exactly conforming.