David said:
Do you consider a unicode file a text file? The C standard probably
doesn't but unfortunately a lot of people do and they mail them to
me.
Yeah, it's called "globalization". Anyway, the best the C library
can do is read the file as binary; after that you're on your own for
decoding the Unicode. Ironically, the wchar_t stuff is not a portable
solution.
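To put a number on the wchar_t problem, here's a one-line sketch (mine,
not from the original message). The point is that wchar_t is 16 bits on
Windows but usually 32 bits on Unix-like systems, so code that assumes
one encoding unit per wchar_t doesn't travel well:

    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        /* Prints 2 on Windows (UTF-16 units), usually 4 on Unix-like
         * systems (UTF-32 units) -- one reason wchar_t-based code
         * tends not to be portable. */
        printf("sizeof(wchar_t) = %zu bytes\n", sizeof(wchar_t));
        return 0;
    }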
The first indication that you've got one is when "cat" works
but "grep" won't find any of the words that "cat" shows. On a
Windows system these act just like a text file: Notepad,
WordPad, and the DOS-level TYPE and FIND all show the same thing,
with no overt indication that the file contains Unicode.
Sounds like you need a better grep?
Run "od -c" on a unicode file and you'll find that it starts with
a Byte Order Mark (FE FF). After that every other byte is null.
Well, technically a UTF-16 file may start with either FE FF or FF FE,
and any of the octets that follow it may be NUL -- the encoding really
is a mapping to 16-bit values.
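If you want a program to tell you which flavour you've got, something
like this quick sketch does it (my code, not from the thread; the file
name is just a placeholder):

    #include <stdio.h>

    int main(void)
    {
        /* Binary mode, so the bytes come through untranslated. */
        FILE *fp = fopen("message.txt", "rb");   /* placeholder name */
        if (fp == NULL) { perror("fopen"); return 1; }

        int b0 = fgetc(fp), b1 = fgetc(fp);
        if (b0 == 0xFE && b1 == 0xFF)
            puts("UTF-16, big-endian (FE FF)");
        else if (b0 == 0xFF && b1 == 0xFE)
            puts("UTF-16, little-endian (FF FE)");
        else
            puts("no UTF-16 BOM found");

        fclose(fp);
        return 0;
    }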
Take out the BOM and the null characters and you've got an ASCII file,
assuming it was originally written in English.
Better yet, convert it to UTF-8: the ASCII parts stay plain ASCII,
and the non-English characters are not lost. If you are
consistently seeing every other byte as NUL, then the author (or the
program the author is using) has almost certainly chosen a very
sub-optimal encoding (they should have chosen UTF-8 instead).
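If anyone wants to do that conversion without an external tool, here is
a rough sketch of a UTF-16-to-UTF-8 filter (my own code, not anything
from the thread). It assumes the file really does start with a BOM, the
input file name is a placeholder, and broken surrogate pairs are not
validated:

    #include <stdio.h>
    #include <stdint.h>

    /* Read one 16-bit unit in the file's byte order; return 0 at EOF. */
    static int get_u16(FILE *fp, int big_endian, uint16_t *out)
    {
        int b0 = fgetc(fp), b1 = fgetc(fp);
        if (b0 == EOF || b1 == EOF)
            return 0;
        *out = big_endian ? (uint16_t)((b0 << 8) | b1)
                          : (uint16_t)((b1 << 8) | b0);
        return 1;
    }

    /* Write one code point as 1 to 4 UTF-8 bytes. */
    static void put_utf8(uint32_t cp, FILE *out)
    {
        if (cp < 0x80) {
            fputc((int)cp, out);
        } else if (cp < 0x800) {
            fputc((int)(0xC0 | (cp >> 6)), out);
            fputc((int)(0x80 | (cp & 0x3F)), out);
        } else if (cp < 0x10000) {
            fputc((int)(0xE0 | (cp >> 12)), out);
            fputc((int)(0x80 | ((cp >> 6) & 0x3F)), out);
            fputc((int)(0x80 | (cp & 0x3F)), out);
        } else {
            fputc((int)(0xF0 | (cp >> 18)), out);
            fputc((int)(0x80 | ((cp >> 12) & 0x3F)), out);
            fputc((int)(0x80 | ((cp >> 6) & 0x3F)), out);
            fputc((int)(0x80 | (cp & 0x3F)), out);
        }
    }

    int main(void)
    {
        FILE *in = fopen("mail.txt", "rb");           /* placeholder name */
        if (in == NULL) { perror("fopen"); return 1; }

        /* Consume the BOM and note the byte order (FE FF = big-endian). */
        int b0 = fgetc(in), b1 = fgetc(in);
        int big_endian = (b0 == 0xFE && b1 == 0xFF);

        uint16_t u;
        while (get_u16(in, big_endian, &u)) {
            uint32_t cp = u;
            if (u >= 0xD800 && u <= 0xDBFF) {         /* high surrogate: pair it */
                uint16_t lo;
                if (!get_u16(in, big_endian, &lo))
                    break;
                cp = 0x10000 + (((uint32_t)(u - 0xD800) << 10)
                                | (uint32_t)(lo - 0xDC00));
            }
            put_utf8(cp, stdout);
        }
        fclose(in);
        return 0;
    }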
[...] Somewhat ironically
all of the ones I've seen so far have only LF EOLs after being
processed like this. This is for UTF-16. There's also
UTF-32 but thankfully nobody has sent me one of those yet.
UTF-32 is mostly a "theoretical" transfer format. It's commonly used
internally within a program to simplify text manipulation (and can
sometimes be mapped to "wchar_t"); however, nobody would ever use it as
a format for storing or sending a file. The reason is that UTF-16 is
never longer than UTF-32, and UTF-8 is often shorter than both (though
it can be longer than UTF-16 for some scripts).
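To make the size comparison concrete, here's a small sketch (mine) that
prints how many bytes each encoding needs for a few sample code points --
ASCII favours UTF-8, CJK favours UTF-16, and UTF-32 always pays four
bytes:

    #include <stdio.h>
    #include <stdint.h>

    /* Bytes needed to encode one code point in UTF-8. */
    static int utf8_len(uint32_t cp)
    {
        return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    }

    /* Bytes needed in UTF-16 (surrogate pair above the BMP). */
    static int utf16_len(uint32_t cp)
    {
        return cp < 0x10000 ? 2 : 4;
    }

    int main(void)
    {
        /* 'A', 'e' with acute accent, a CJK character, an emoji. */
        uint32_t samples[] = { 0x41, 0xE9, 0x4E2D, 0x1F600 };
        for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++) {
            unsigned cp = (unsigned)samples[i];
            printf("U+%05X  UTF-8: %d  UTF-16: %d  UTF-32: 4\n",
                   cp, utf8_len(cp), utf16_len(cp));
        }
        return 0;
    }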