Binary or Ascii Text?

CBFalconer

P.J. Plauger said:
.... snip ...

You mean, before Unix developed a uniform notation for text streams,
both inside and outside the program, and C built it into its runtime
library.

Pascal is pretty well contemporaneous with C and Unix, and had/has
a well defined concept of files and streams. It doesn't make any
assumptions about line termination characters etc. The world is
not a Unix machine.

--
"If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson
More details at: <http://cfaj.freeshell.org/google/>
Also see <http://www.safalra.com/special/googlegroupsreply/>
 
Ben Pfaff

P.J. Plauger said:
Dunno how CR would be any better off than BS if that was the case.

I had a dot-matrix printer once (Okidata ML520) that would
overheat and stop (until it cooled down) if you sent it too much
text that contained lots of backspaces to do
character-by-character bold or underline. That kind of thing
made the printhead go back and forth incredibly rapidly, and it
just wasn't designed for that.

On the other hand, using CR didn't cause a problem because it
didn't make the printhead reverse direction any more often than
normal.
 
P.J. Plauger

> Pascal is pretty well contemporaneous with C and Unix, and had/has
> a well defined concept of files and streams. It doesn't make any
> assumptions about line termination characters etc.

Right, and it's a damn poor model, with terrible lookahead properties.
Kernighan and I had to really work at imposing decent primitives atop
it. It is no accident that the model hasn't survived.

> The world is
> not a Unix machine.

Actually, it is. Compare the operating systems of today with those
of 35 years ago and you'll see how ubiquitous the basic design
decisions of Unix have become. Line terminators are at least now
always embedded characters in a stream -- gone are padding blanks
and structured files -- if not always the same terminators. And
C is certainly ubiquitous, with its simple rules for mapping
C-style text streams to and from text files.
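
As a rough illustration of that mapping, here is a minimal sketch of what a C runtime on a CR LF system might do when reading a text file: collapse each CR LF pair into the single '\n' the program sees. The name crlf_to_lf is hypothetical; real libraries do this inside the stdio layer for streams opened without "b".

```c
#include <stddef.h>

/* Hypothetical helper: collapse every CR LF pair in buf[0..len) into a
   single '\n', in place. Returns the new length. A lone CR is kept. */
size_t crlf_to_lf(char *buf, size_t len)
{
    size_t in, out = 0;

    for (in = 0; in < len; in++) {
        if (buf[in] == '\r' && in + 1 < len && buf[in + 1] == '\n')
            continue;            /* drop the CR; the LF is copied next */
        buf[out++] = buf[in];
    }
    return out;
}
```

On a system whose native terminator is already LF the mapping is the identity, which is why text and binary streams behave alike there.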

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
 
P.J. Plauger

> I had a dot-matrix printer once (Okidata ML520) that would
> overheat and stop (until it cooled down) if you sent it too much
> text that contained lots of backspaces to do
> character-by-character bold or underline. That kind of thing
> made the printhead go back and forth incredibly rapidly, and it
> just wasn't designed for that.
>
> On the other hand, using CR didn't cause a problem because it
> didn't make the printhead reverse direction any more often than
> normal.

Okay, you've made a case for why a good printer *driver* might
rewrite the stream you send it (as practically every smart device
did in Unix and does in today's systems). The issue we've been
discussing is the *linguistics* of text streams. And the point
was that either CR or BS is sufficient to describe overstrikes.
ASCII doesn't have any thermal attributes.
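
To make the two notations concrete, here is a small sketch that builds the same underlined text both ways. The function names underline_bs and underline_cr are made up for illustration, and the output only overstrikes on a device that honours BS and CR without erasing.

```c
#include <string.h>

/* Underline by backing up one position after every character:
   "ab" becomes a BS _ b BS _. Returns 0, or -1 if out is too small. */
int underline_bs(char *out, size_t outsz, const char *text)
{
    size_t i, n = strlen(text), pos = 0;

    if (outsz < 3 * n + 1)
        return -1;
    for (i = 0; i < n; i++) {
        out[pos++] = text[i];
        out[pos++] = '\b';
        out[pos++] = '_';
    }
    out[pos] = '\0';
    return 0;
}

/* Underline by printing the whole line, returning the carriage once,
   then printing the underscores: "ab" becomes a b CR _ _. */
int underline_cr(char *out, size_t outsz, const char *text)
{
    size_t n = strlen(text);

    if (outsz < 2 * n + 2)
        return -1;
    memcpy(out, text, n);
    out[n] = '\r';
    memset(out + n + 1, '_', n);
    out[2 * n + 1] = '\0';
    return 0;
}
```

Both strings describe the same marks on paper. The BS form reverses the head once per character, the CR form once per line, and that difference is a matter for the driver.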

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
 
CBFalconer

P.J. Plauger said:
Right, and it's a damn poor model, with terrible lookahead properties.
Kernighan and I had to really work at imposing decent primitives atop
it. It is no accident that the model hasn't survived.

I probably shouldn't get into this :) but people have been
misunderstanding Pascal i/o for generations now. With the use of
lazy i/o there is no problem with interactive operation, and
prompting can be handled with a prompt function (equivalent to
writeln, but without the line advance) or by detection of
interactive pairs to force buffer flushing.

Meanwhile there are none of the problems associated with
interactive scanf and other routines, because the C stream is never
sure whether the field terminating char has been used or is still
in the stream. With Pascal, it is in the stream. With Pascal, we
always have one char. lookahead.

Granted, we can build the equivalent set in C, but that requires
the discipline to not use many existing functions, or to follow
them with an almost universal ungetc. What we can't get is the
convenience of the shorthand usage of read(ln) and write(ln),
although the C++ mechanisms make an ugly attempt at it.

> Actually, it is. Compare the operating systems of today with those
> of 35 years ago and you'll see how ubiquitous the basic design
> decisions of Unix have become. Line terminators are at least now
> always embedded characters in a stream -- gone are padding blanks
> and structured files -- if not always the same terminators. And
> C is certainly ubiquitous, with its simple rules for mapping
> C-style text streams to and from text files.

Granted the Unix philosophy has simplified file systems. This is
not necessarily good, since the old systems all had reasons for
existing. Many of those reasons have been subsumed into much
higher performance levels at the storage level, but that is
something like approving of gui bloat because cpus are faster.

 
P.J. Plauger

> I probably shouldn't get into this :) but

You're probably right.

> people have been
> misunderstanding Pascal i/o for generations now.

That may be, but I don't. I've written tens of thousands of lines
of Pascal and hundreds of thousands of lines of C over the past
few decades. I've written essays on the various design principles
of parsing with various degrees of lookahead. I've written or
coauthored textbooks on the subject. In short, I've *thought*
about this topic for longer than the average reader of this
newsgroup has been alive. I think I understand it.

> With the use of
> lazy i/o there is no problem with interactive operation, and
> prompting can be handled with a prompt function (equivalent to
> writeln, but without the line advance) or by detection of
> interactive pairs to force buffer flushing.

Yes, you can get around the problems. The only problem is that you
*have* to get around the problems.

> Meanwhile there are none of the problems associated with
> interactive scanf and other routines, because the C stream is never
> sure whether the field terminating char has been used or is still
> in the stream.

Not true. It's precisely, and usefully, defined.

> With Pascal, it is in the stream.

Not always true.

> With Pascal, we
> always have one char. lookahead.

And with C. You never need more than one char lookahead, by design.

> Granted, we can build the equivalent set in C, but that requires
> the discipline to not use many existing functions, or to follow
> them with an almost universal ungetc. What we can't get is the
> convenience of the shorthand usage of read(ln) and write(ln),
> although the C++ mechanisms make an ugly attempt at it.

I agree that, beyond a point, this becomes a matter of aesthetics.
I won't argue that. What I will observe is natural selection at
work. The C I/O model has survived and thrives. The Pascal model
is marginalized if not dead.

> Granted the Unix philosophy has simplified file systems. This is
> not necessarily good, since the old systems all had reasons for
> existing.

Yes, they did. Lots of them. In all sorts of directions. And they
haven't survived. Coincidence? I don't think so.

> Many of those reasons have been subsumed into much
> higher performance levels at the storage level, but that is
> something like approving of gui bloat because cpus are faster.

No, it's something like adapting the total software package to the
needs of current hardware. I see no overall bloat in how buffering
is distributed today vs. 30 years ago. But I do see a significant
simplification of I/O as seen by the user over that same period.

Item: One of the seven looseleaf binders that came with RSX-11M was
titled "Preparing for I/O." There is no Unix equivalent. (Or DOS,
or Linux, or ...) You don't set up file control blocks and I/O
control blocks; you just call open, close, read, and write.
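
For comparison, a complete file copy using nothing but those four calls might look like the sketch below. It assumes a POSIX system; copy_fd and copy_file are hypothetical names, and the 0644 mode and 4096-byte buffer are arbitrary choices.

```c
#include <fcntl.h>
#include <unistd.h>

/* Copy everything readable from descriptor in to descriptor out.
   Returns 0 on success, -1 on a read or write error. */
int copy_fd(int in, int out)
{
    char buf[4096];
    ssize_t n;

    while ((n = read(in, buf, sizeof buf)) > 0) {
        ssize_t done = 0;
        while (done < n) {               /* write may be partial */
            ssize_t w = write(out, buf + done, (size_t)(n - done));
            if (w < 0)
                return -1;
            done += w;
        }
    }
    return n < 0 ? -1 : 0;
}

/* Copy the file at src to dst. No control blocks are set up first. */
int copy_file(const char *src, const char *dst)
{
    int in, out, rc;

    in = open(src, O_RDONLY);
    if (in < 0)
        return -1;
    out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out < 0) {
        close(in);
        return -1;
    }
    rc = copy_fd(in, out);
    close(in);
    if (close(out) < 0)
        rc = -1;
    return rc;
}
```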

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
 
Dik T. Winter

> Now that we know ascii text only use 7 bits of a byte and the first bit
> is always set as 0. So I wonder if I could write a program to get a
> fixed length of a given file(for example, the first 1024 bytes) , to
> store them in a unsigned char array and to check if there is any
> elements greater than 0x7F. If any, the file can be judged as a binary
> file.
>
> However, the disadvantage of the above method is that it cannot handle
> the multi-byte character. Take the UTF-8's japanese character for
> example, a japanese character may be encoded as three bytes and some of
> them may be greater than 0x7F? In that case, my method will make no
> sense.

No, it cannot handle other encodings, but that was not what you asked for.
Note that also files that consist of pure ASCII codes can be binary.
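
A minimal sketch of the check described in the quoted question, assuming a 1024-byte sample as suggested there. The names is_seven_bit and file_looks_seven_bit are hypothetical. As noted above, a file that passes is not thereby proven to be text (NUL bytes pass, for example), and a UTF-8 file with non-ASCII characters fails.

```c
#include <stddef.h>
#include <stdio.h>

#define SAMPLE_SIZE 1024   /* how much of the file to inspect; arbitrary */

/* Returns 1 if every byte in buf[0..len) is in the range 0x00-0x7F. */
int is_seven_bit(const unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++)
        if (buf[i] > 0x7F)
            return 0;
    return 1;
}

/* Returns 1 if the first SAMPLE_SIZE bytes are all 7-bit, 0 if a byte
   above 0x7F was seen, -1 if the file cannot be opened or read. */
int file_looks_seven_bit(const char *path)
{
    unsigned char buf[SAMPLE_SIZE];
    size_t n;
    int bad;
    FILE *fp = fopen(path, "rb");   /* binary mode: no translation */

    if (fp == NULL)
        return -1;
    n = fread(buf, 1, sizeof buf, fp);
    bad = ferror(fp);
    fclose(fp);
    if (bad)
        return -1;
    return is_seven_bit(buf, n);
}
```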
 
CBFalconer

P.J. Plauger said:
.... snip ...


Not true. It's precisely, and usefully, defined.


Not always true.


And with C. You never need more than one char lookahead, by design.

Now we can get this off a language war and onto pure C. My problem
with C, and the usual library, is the absence of sane and clear
methods for interactive i/o. To illustrate, the user wants to
output a prompt and receive an integer. How to do it?

The new user's first choice is probably scanf. He forgets to check
the error return. And, even worse, what gets entered is:

<programmed prompt>: 1234 x<cr>

and this is being handled by:

printf("<programmed prompt>:"); fflush(stdout);
scanf("%d", &i);

which gets the first entry, but falls all over itself when the
sequence is called again. The usual advice is to read full lines,
i.e.:

printf("<programmed prompt>:"); fflush(stdout);
fgets(buffer, BUFSZ, stdin);
i = strtol(buffer, &errptr, 10);

which brings in the extraneous buffer, a magical BUFSZ derived by
gazing at the ceiling, prayer, and incense sticks, not to mention
errptr. So I consider that solution unclean by my standards. (Of
course they can use my ggets for consistent whole line treatment).
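
For reference, the line-based approach with its checks filled in might look like the sketch below. The helper name read_int_line and the 64-byte buffer are assumptions made for illustration; overlong lines are discarded and rejected rather than handled.

```c
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: read one line from fp and convert it to int.
   Returns 1 on success, 0 if the line is not a lone integer (or is
   too long for the buffer), EOF on end of file or read error. */
int read_int_line(FILE *fp, int *out)
{
    char buf[64];     /* assumption: longer lines are rejected */
    char *end;
    long v;

    if (fgets(buf, sizeof buf, fp) == NULL)
        return EOF;
    if (strchr(buf, '\n') == NULL && !feof(fp)) {
        int ch;       /* line did not fit: discard the rest of it */
        while ((ch = getc(fp)) != EOF && ch != '\n')
            continue;
        return 0;
    }
    errno = 0;
    v = strtol(buf, &end, 10);
    if (end == buf || errno == ERANGE || v < INT_MIN || v > INT_MAX)
        return 0;     /* no digits, or out of range for int */
    while (*end == ' ' || *end == '\t')
        end++;
    if (*end != '\n' && *end != '\0')
        return 0;     /* trailing junk such as the x in "1234 x" */
    *out = (int)v;
    return 1;
}
```

The buffer, its size, and the end pointer are all still there; the sketch only shows what the checks cost.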

So instead we write a baby routine that inputs from a stream with
getc, skips leading blanks (and possibly blank lines), and ungets
the field termination char. We combine that with my favorite
flushln:

while ((EOF != (ch = getc(f))) && ('\n' != ch)) continue;

and the birds twitter, the sun shines, etc. UNTIL somebody calls
some other input routine and doesn't have the discipline to define
a quiescent i/o state and ensure that that state is achieved at
each input. That in turn leads to calling the flushln twice, and
discarding perfectly usable (and possibly needed) input.
Alternatively it leads to newbies calling fflush(stdin) and similar
curses.
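
A minimal sketch of the baby routine and flushln described above. The name read_uint is hypothetical, it handles only unsigned decimal input, and it makes no overflow check.

```c
#include <ctype.h>
#include <stdio.h>

/* Discard the rest of the current line, including its '\n'.
   Returns the last character read ('\n' or EOF). */
int flushln(FILE *f)
{
    int ch;

    while ((EOF != (ch = getc(f))) && ('\n' != ch))
        continue;
    return ch;
}

/* Skip leading whitespace (including blank lines), read an unsigned
   decimal field, and push the terminating character back.
   Returns 1 on success, 0 if no digit was found, EOF at end of input. */
int read_uint(FILE *f, unsigned *out)
{
    int ch;
    unsigned v = 0;

    do {
        ch = getc(f);
    } while (ch != EOF && isspace(ch));
    if (ch == EOF)
        return EOF;
    if (!isdigit(ch)) {
        ungetc(ch, f);           /* leave the offending char in the stream */
        return 0;
    }
    do {
        v = 10 * v + (unsigned)(ch - '0');
        ch = getc(f);
    } while (ch != EOF && isdigit(ch));
    if (ch != EOF)
        ungetc(ch, f);           /* the field terminator stays unread */
    *out = v;
    return 1;
}
```

With the input "1234 x" followed by a line "56", read_uint then flushln leaves the stream at the start of "56". A second flushln at that point would silently throw "56" away, which is the hazard described above.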

This is what I mean by saying that C doesn't provide the one char
lookahead in the right place, i.e. the i/o, where it can't be lost.

It would help if C provided the ability to detect "the last
character used was a '\n'", which would enable the above flushln to
avoid discarding extra lines. However that won't happen. It would
probably also suffice to define fflush usage on input streams.
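
The detection can be layered on top of stdio, at the price of routing every read through the wrapper. The sketch below is one hypothetical way to do it; struct tracked and the t_ functions are invented names, and only one level of pushback is tracked, matching what ungetc guarantees.

```c
#include <stdio.h>

/* Hypothetical wrapper: a stream plus the last two characters consumed.
   Initialise last and prev to '\n' so a fresh stream counts as being
   at the start of a line. */
struct tracked {
    FILE *fp;
    int last;   /* most recent character consumed */
    int prev;   /* the one before it, restored on pushback */
};

int t_getc(struct tracked *t)
{
    int ch = getc(t->fp);

    if (ch != EOF) {
        t->prev = t->last;
        t->last = ch;
    }
    return ch;
}

/* Push back the character just read; only one level is supported. */
int t_ungetc(struct tracked *t, int ch)
{
    if (ungetc(ch, t->fp) == EOF)
        return EOF;
    t->last = t->prev;
    return ch;
}

/* Discard the rest of the line, unless the last character consumed
   was already a '\n'. Calling it twice in a row is harmless. */
void t_flushln(struct tracked *t)
{
    int ch;

    if (t->last == '\n')
        return;
    while ((ch = t_getc(t)) != EOF && ch != '\n')
        continue;
}
```

Any call that bypasses the wrapper (scanf, fgets, plain getc) leaves last stale, so this moves the discipline problem rather than removing it.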

 
Herbert Rosenau

I realized that back space might enter into that too, but thought there
might have been problems with that considering the physical nature of actual
drum printers, chain printers and so on.

So are you saying that the initial release of the ASCII standard said that
LF was to do line feed AND carriage return? What was the point then, of
having them as separate codes? Unfortunately I don't have the text that
goes with my pre-historic ASCII chart, only a single page showing the
glyphs.

In the late 60s and early 70s the device known today as a screen was
not available. There were line printers, punch card and paper tape
readers and writers, and TTY devices combining a keyboard, a punched
paper tape reader and writer, and a character printer. That printer
was able to use single control chars like
- cr - carriage return - move the print unit back to column 1
- lf - linefeed - feed paper to the next line
- ff - formfeed - feed paper to the next page stop on
the control ribbon
- backspace - one fixed character position back on the same
line
- backline - move the page one line back

Some of these devices were dumb enough to print the next character
before the print unit had reached character position 1. So to get a
clean printout you had to send cr before lf, so that the device was
held until the lf was done.

Anyway, to get a new line you had to send lf, or the print head would
put the char at the position it was in when it got the order to
print it.

On mainframes the TTY was mostly configured to perform a cr whenever
it got an lf, to optimise the programs and save one character in
text (memory was scarce and expensive, even though the machines were
able to multitask). The upcoming microprocessors (mostly homebrewed by
widely different manufacturers) were limited in multitasking to the
different hardware levels (mostly 16) the CPU was able to control,
and were designed more primitively. They required even dumber TTYs or
more intelligent custom-built I/O devices.

At the time C was created, the typical computer was a mainframe with
- a lot of punch card readers as program input
- a lot of magnetic tape devices as data store
- 1 or more punch card writer(s)
- some paper tape readers and writers
- one or more line printers (the first music devices :)
for developers)
- later on a high number of removable hard disks
- 1 TTY as operator console

No wonder that the C runtime was not created to handle user input well
but is ideal for handling computer-designed input like punch cards.

The upcoming microprocessors were designed to control machines, having
only
- special devices to control machines
- paper tape punches and readers
- magnetic tape writers
- seldom line printers
- TTY as operator console.

Ages later they were moved into offices and got other kinds of special
devices and TTY-like devices as user input/output devices.

Modern GUIs are proprietary anyway and do not use the C runtime for
user-oriented I/O.

--
Tschau/Bye
Herbert

Visit http://www.ecomstation.de the home of german eComStation
eComStation 1.2 in German is here!
 
Dave Thompson

> I hate Microsoft too. But that is not the case.
>
> The ASCII code was designed to allow a second pass at printing to produce
> some of the accents used with the latin alphabet. Early copies of ASCII
> show both the circumflex and tilde in the superior position to make this
Are you sure? The far-dominant early ASCII (64-graphic = uppercase
only) devices, Teletype 33 and 35, had uparrow and backarrow. The
earliest revision of the standard document I looked at, IIRC 1968 or
so, added tilde along with lowercase and described circumflex and
underscore as changed precisely so they could be used as modifiers. It
also gave NL as an acceptable alternate meaning of x0A but not the
primary one. (There was the same ambiguity over whether VT and FF
included CR or not, but those were already less important then, and
now have nearly vanished.) And of course ASCII was originally intended
and used only as an "American" (meaning US) standard.

> work. And also, to make it work, a line feed had to have no side effects,
> such as advancing the medium. I believe the ASCII code has been jiggered
> with to redefine CR and LF since the original specification, but I have no
> actual proof.
>
> So it was a concession to ASCII.

I would say to ASCII as commonly used, _and_ to other non-Unix and
record-oriented filesystems still pretty important in the 1980s.

- David.Thompson1 at worldnet.att.net
 
osmium

Dave Thompson said:
Are you sure? The far-dominant early ASCII (64-graphic = uppercase
only) devices, Teletype 33 and 35, had uparrow and backarrow. The
earliest revision of the standard document I looked at, IIRC 1968 or
so, added tilde along with lowercase and described circumflex and
underscore as changed precisely so they could be used as modifiers. It
also gave NL as an acceptable alternate meaning of x0A but not the
primary one. (There was the same ambiguity over whether VT and FF
included CR or not, but those were already less important then, and
now have nearly vanished.) And of course ASCII was originally intended
and used only as an "American" (meaning US) standard.

I would say to ASCII as commonly used, _and_ to other non-Unix and
record-oriented filesystems still pretty important in the 1980s.

I work for a main-frame manufacturer and what I said was partly based on
discussions with the guy who had represented us when the standard was
written. I detested ASCII (still do) and he was a big proponent and
defender. I hated the fact that about 25 codes were wasted to suit the
Teletype people where one could almost fit the Greek alphabet in, maybe a
blend of lower and upper case depending on common usage in the sciences. I
also thought (and think) a seven bit code was ridiculous. I thought the
sheet of glyphs that I had was from the first release of ASCII, but I can't
attest to that. I know there is no hint of a NL in it, or a soft meaning
for CR or LF. Yes, it is an American code and I don't know if there is
*any* language that can be correctly transcribed in ASCII, but the point is
they *tried*. The grave, circumflex, tilde and virgule (old? Norwegian) are
a pretty modest start. BTW the sheet I have shows the vertical bar as a
broken vertical bar - looks kind of IBMey. And I vaguely recall seeing the
"hooked bar" of EBCDIC in lieu of the tilde in some version or other of
ASCII.

I think there was a post in this thread that spoke of using BS instead of
the LF CR business that I didn't respond to. That would only work for
*character* printers. It doesn't work on line printers. BTW, I suspect that
tons of computer fluent people would be amazed to see a line printer at
work, cranking out beaucoup *pages* per minute.

There is a long thread of 282 messages on the NL situation but I don't have
the time or interest to read it all. I did note message #135 from Dennis
Ritchie supports my claims, as far as I can see. I need an intern to read
and digest this kind of thing for me. :)

http://groups.google.com/group/alt....klore.computers&rnum=4&hl=en#023045858df1b784
 
Dik T. Winter

Indeed. The distinction predates Microsoft.

The first copy of ASCII shows neither circumflex, nor tilde. The first
version of ASCII that shows both dates from 1965, and indeed has tilde and
circumflex in upper case positions. However, although that version was
ratified, it was never published, nor used.
> Are you sure? The far-dominant early ASCII (64-graphic = uppercase
> only) devices, Teletype 33 and 35, had uparrow and backarrow.

Yup, ASCII 1963. The arrows and the backslash.
> The
> earliest revision of the standard document I looked at, IIRC 1968 or
> so,

The earliest revision was 1965. Adding lower case letters, tilde,
circumflex, underscore, braces, not-symbol and vertical bar. The
arrows and the backslash were removed. But (as I said) although
ratified, never published, nor used.
> added tilde along with lowercase and described circumflex and
> underscore as changed precisely so they could be used as modifiers.

ASCII 1967 removed the not-symbol and re-added the backslash when
comparing to 1965, but also a few positions were moved: the tilde
moved from uppercase to lowercase. But the use as modifiers was
complex. All of apostrophe, grave accent, umlaut, tilde and
circumflex could be modifiers, depending on context.
> It
> also gave NL as an acceptable alternate meaning of x0A but not the
> primary one.

Indeed. The meaning 'NL' was only to be used if both sender and receiver
had agreed on the meaning.
> And of course ASCII was originally intended
> and used only as an "American" (meaning US) standard.

The reason ASCII 1965 was never published was because the committee had
become aware that there was an international effort for standardisation.
This ultimately led to ASCII 1967 which is equal to the ISO version of
that time.
 
