Difference between binary file and ASCII file

Malcolm

Keith Thompson said:
Then an 8-bit byte is addressible (with a little extra effort by the
compiler).
That's where the whole language begins to break down. If pointers no longer
represent machine addresses then we've lost the primary design goal of C,
which is to allow the programmer direct access to the computer's memory.
Then you've got the problem that every void pointer has to be an extended
pointer, so conversions cause code to be executed. Then most registers hold
exactly an address, so do void pointers fit in registers? So perfectly
unexceptional code with nothing to do with octets but passing things around
in void *s hits the treacle.
 
Dik T. Winter

> osmium said: ....
>
> Knuth says that the 8-bit "standardisation" happened in around 1975 or so.

Indeed. And the word byte was indeed introduced by IBM, but originally
was intended for 6 bit quantities. With the 360 the new bytes became 8 bits.
 
Herbert Rosenau

I'll be damned! In Note 2, they defined byte very precisely as a word that
simply means a collection of contiguous bits. They took a widely used
word, that meant something to hundreds of thousands of people and redefined
it to mean something entirely different.

There are about 30 definitions of byte that make the cut on google, and the
*vast* majority say a byte is eight bits.

http://tinyurl.com/j79j4

That's simply appalling! Now the world needs a committee to define a word
for eight contiguous bits. How about naming it in honor of the clown who
got that inserted into the standard?

Historically, the smallest addressable unit of storage was a character.

Noways. The smallest addressable unit of storage was a word holding 12
decimal digits. That word was interpreted by the CPU as
- command:
  - 4 decimal digit command word
  - 8 decimal digit memory address (target/source of the data
    to read/add/sub/store to/from accu from/to memory)
  - 8 decimal digit decimal constant to add/sub to/from accu
- 12 decimal digit numerical value
- up to 6 ASCII bytes, high to low order, when accessed with a
  command specified as transferring text.

At that time there was no standard.
They seem to have gotten tangled up and ignored the distinction between a
character and a character code, and ignored the fact that they were different
things. I think this made up example from history is right: The IBM 7094
has a six-bit character. The character code is BCD.

Note that addressable does not imply that a single character can be read
from memory, it only means there are hardware instructions to do something
useful at this level.


--
Tschau/Bye
Herbert

Visit http://www.ecomstation.de the home of german eComStation
eComStation 1.2 German is here!
 
Chris Hills

CBFalconer said:
If you are going to make a proposal then I suggest you start from
"the tight spec". Remember that even now things involving POSIX
are adequately housed on comp.unix.programming (or such) and
windows (cursed be its name) on microsoft.*.

One loosening that might be worthwhile is advice on how to make
various almost-C compilers hew to the various standards,

Why would they want to? The job of standards is to standardise industry
practice not go off on a whim and expect the industry to follow. That
is the problem with C at the moment. It has gone in a direction the
industry does not want to follow.
and their failings.

Others have commented on the failings of the standards including people
on the standards panels.
 
Robert Latest

On Sun, 14 May 2006 07:32:11 -0700,
in Msg. said:
That's simply appalling! Now the world needs a committee to define a word
for eight contiguous bits.

That word already exists: octet.

From wikipedia:

In computer technology and networking, an octet is a
group of 8 bits.

[...]

However, the size of a byte is determined by the architecture of a
particular computer system: some old computers had 9, 10, or 12-bit
bytes, while others had bytes as small as 5 or 6 bits. An octet is
always exactly 8 bits. As a result, computer networking standards
almost exclusively use "octet" to refer to the 8-bit quantity.
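
For illustration (not part of the quoted article): in C the width of a byte is exposed as CHAR_BIT in <limits.h>, and the standard only requires it to be at least 8. A minimal sketch, with hypothetical helper names, that reports whether a byte is an octet on the implementation at hand:

```c
#include <limits.h>
#include <stdio.h>

/* Number of bits in one C byte (that is, in a char) on this implementation. */
int bits_per_byte(void)
{
    return CHAR_BIT;
}

/* Nonzero if a C byte on this implementation happens to be an octet. */
int byte_is_octet(void)
{
    return CHAR_BIT == 8;
}

/* Print a one-line report of the byte width. */
void report_byte_width(void)
{
    printf("CHAR_BIT = %d (%s)\n", bits_per_byte(),
           byte_is_octet() ? "a byte is an octet here"
                           : "a byte is wider than an octet here");
}
```

Calling report_byte_width() from main() prints the value for the current platform; on most current hosted systems it would be expected to report 8, but nothing in the language guarantees that.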

robert
 
S.Tobias

Richard Heathfield said:
vim said:


Well, that's really the wrong question.

The right question is: "what is the difference between a stream opened in
binary mode, and a stream opened in text mode?"
[snipped explanations of line translation]
If you don't /want/ the system to do this, open the file in binary mode. But
then managing the newline stuff all falls to you instead.
Why is there the text mode in the first place? All operations valid
for text streams seem to be valid for binary ones, too. Text streams
are more difficult to handle (eg. you can't calculate offsets, there's
some extra undefinededness). Apart from system compatibility, is there
any advantage to opening files in text mode?
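
To make the remark about offsets concrete, here is a minimal sketch (the function name and the 128-byte line limit are assumptions for illustration). On a text stream the only portable fseek() offsets with SEEK_SET are zero and a value previously returned by ftell() on the same stream, so the position is treated as an opaque bookmark rather than a computed byte count:

```c
#include <stdio.h>
#include <string.h>

/*
 * Read the second line of a text file twice, using a position obtained
 * from ftell() to get back to it. The line is stored in 'out'.
 * Returns 0 if both reads agree, -1 on any failure.
 * Assumes lines are shorter than 128 bytes.
 */
int reread_second_line(const char *path, char *out, size_t outsize)
{
    char first[128];
    char again[128];
    long pos;
    FILE *fp = fopen(path, "r");                 /* text mode */

    if (fp == NULL)
        return -1;

    if (fgets(first, sizeof first, fp) == NULL)  /* skip line 1 */
        goto fail;

    pos = ftell(fp);              /* opaque bookmark, not a byte count */
    if (pos == -1L)
        goto fail;

    if (fgets(out, (int)outsize, fp) == NULL)    /* line 2, first time */
        goto fail;

    /* Portable: the offset came from ftell() on this same stream. */
    if (fseek(fp, pos, SEEK_SET) != 0)
        goto fail;

    if (fgets(again, sizeof again, fp) == NULL)  /* line 2, second time */
        goto fail;

    fclose(fp);
    return strcmp(out, again) == 0 ? 0 : -1;

fail:
    fclose(fp);
    return -1;
}
```

Computing the offset by hand instead (say, adding up line lengths) may happen to work where text and binary files look the same, but it is not something the text-stream rules promise.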
 
Joe Wright

Mike said:
Sorry, I should have explained myself more clearly. At the moment I am
running on Windows with a Windows port of gcc. But, before I get
off-topic with environment specs, my real question is simply: does the
Standard require that stdin, stdout, and stderr be opened in a known
mode or is this detail left to the compiler? I ask because if I compile
the following (say as test.exe):

#include <stdio.h>

/* count occurrences of '\r' in the input stream */
int main(void)
{
    int c, count;

    count = 0;
    while ((c = getchar()) != EOF)
        if (c == '\r')
            ++count;
    printf("counted %d \\r's in input.\n", count);
    return 0;
}

and I run the program with itself as input ("test < test.c"), the
result (on my machine) is

counted 12 \r's in input.

However, as I understand it, any '\r\n' sequences in the input stream
should have been mapped to '\n'. The only literature I can find
relating to this is what Microsoft has to say about how their C
compilers open stdin (they say it is opened in text mode in their
compilers). But what does the ANSI/ISO Standard say about what mode
stdin, stdout, and stderr are opened in? Maybe my compiler is just
misbehaving...

Mike S
At my house your program says..

counted 0 \r's in input.
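
The two different results presumably come down to how each implementation ends up treating stdin. One way to take stdin out of the picture is to open the file explicitly in binary mode, so that no newline translation is applied. A minimal sketch along those lines (the function names are made up for this example):

```c
#include <stdio.h>

/* Count '\r' characters on an already-open stream. */
long count_cr(FILE *fp)
{
    long count = 0;
    int c;

    while ((c = getc(fp)) != EOF)
        if (c == '\r')
            ++count;
    return count;
}

/*
 * Open 'path' in binary mode, so no newline translation is applied,
 * and count its '\r' bytes. Returns -1 if the file cannot be opened.
 */
long count_cr_in_file(const char *path)
{
    FILE *fp = fopen(path, "rb");
    long n;

    if (fp == NULL)
        return -1;
    n = count_cr(fp);
    fclose(fp);
    return n;
}
```

Passing a stream opened with "r" to count_cr() instead shows what the text-mode translation leaves behind on a given system.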
 
P.J. Plauger

Richard Heathfield said:
vim said:


Well, that's really the wrong question.

The right question is: "what is the difference between a stream opened in
binary mode, and a stream opened in text mode?"
[snipped explanations of line translation]
If you don't /want/ the system to do this, open the file in binary mode. But
then managing the newline stuff all falls to you instead.
Why is there the text mode in the first place? All operations valid
for text streams seem to be valid for binary ones, too. Text streams
are more difficult to handle (eg. you can't calculate offsets, there's
some extra undefinededness). Apart from system compatibility, is there
any advantage to opening files in text mode?

System compatibility is a damned important reason. Every system has its
own convention for representing text files, as created by text editors
and consumed by other text-processing programs. If that convention doesn't
match the C convention -- zero or more lines of arbitrary length, each
terminated by a newline -- somebody has to do some mapping. Whitesmiths,
Ltd. introduced the text/binary dichotomy in 1978 when porting C to
dozens of non-Unix systems, and other companies did much the same thing
in the coming years. It was a slam dunk to put it in the draft C Standard
begun in 1983.

If you try to live with just binary mode, then every program either has
to map text files for itself or tolerate a broad assortment of rules for
delimiting text lines. There's precedent for the latter approach too
(see, for example, Java), but Unix gives a powerful precedent for
having a uniform internal convention for representing text streams.
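
As a rough illustration of the mapping described above, the following sketch writes the same two lines through a text stream or a binary stream and then counts the bytes that actually landed in the file. The function name is hypothetical, and the result depends on the host's text-file convention:

```c
#include <stdio.h>

/*
 * Write the two lines "one" and "two" to 'path' using the given fopen
 * mode ("w" for text, "wb" for binary), then reopen the file in binary
 * mode and return the number of bytes actually stored, or -1 on error.
 */
long stored_size(const char *path, const char *mode)
{
    FILE *fp = fopen(path, mode);
    long n = 0;

    if (fp == NULL)
        return -1;
    if (fputs("one\ntwo\n", fp) == EOF) {
        fclose(fp);
        return -1;
    }
    if (fclose(fp) != 0)
        return -1;

    fp = fopen(path, "rb");          /* look at the raw stored bytes */
    if (fp == NULL)
        return -1;
    while (getc(fp) != EOF)
        ++n;
    fclose(fp);
    return n;
}
```

On a Unix-like system both modes would be expected to store 8 bytes; on a system whose text convention is CR LF, the "w" case would typically store 10, while the program itself only ever wrote '\n'.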

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
 
Flash Gordon

S.Tobias said:
Richard Heathfield said:
vim said:

Well, that's really the wrong question.

The right question is: "what is the difference between a stream opened in
binary mode, and a stream opened in text mode?"
[snipped explanations of line translation]
If you don't /want/ the system to do this, open the file in binary mode. But
then managing the newline stuff all falls to you instead.
Why is there the text mode in the first place? All operations valid
for text streams seem to be valid for binary ones, too. Text streams
are more difficult to handle (eg. you can't calculate offsets, there's
some extra undefinededness). Apart from system compatibility, is there
any advantage to opening files in text mode?

Without text streams how can you produce a C source file that is
guaranteed to produce a valid text file on whatever system you run the
program on? Historically systems have used rather more schemes than just
terminating lines with CR, CRLF or LF, some have used some form of
record format, e.g. the first couple of bytes on a line saying how long
the line is.

Who is to say that in the future a system might not choose to encode the
file type as a mime header? Then the system might not even let you open
a text file as a binary file, or open a binary file as a text file, and
such restrictions could be useful. Also, on such a system, if you
created a file as a binary file the normal text editor of the system
might refuse to open it!

So the compatibility aspect is pretty major.
 
Kenneth Brody

Richard Heathfield wrote:
[...]
The newline marker defined by C is '\n'.
[... How various systems "really" store EOL in a file ...]
On the mainframe - well, you /really/ don't want to know.
[...]

I recall using a "mainframe" (okay, really a micro version of a mainframe)
that didn't have any "end of line marker" in text files. Instead, each
line started with two bytes containing the length of the line, and the
contents would be padded to an even number of bytes. So, a text file with
two lines -- "hi" and "there" -- would actually contain this sequence of
bytes:

0x02, 0x00, 'h', 'i', 0x05, 0x00, 't', 'h', 'e', 'r', 'e', 0x00

This is also a perfect example why you can't pass arbitrary values to
fseek().
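
For illustration, here is roughly what a C runtime on such a system might do when reading a text stream: turn each length-prefixed record into a '\n'-terminated line. This is only a sketch under assumptions read off the example bytes above (2-byte length with the low byte first, one pad byte after odd-length text); the function name is made up:

```c
#include <stddef.h>
#include <string.h>

/*
 * Decode records of the form: 2-byte length (low byte first), the text,
 * and one pad byte when the length is odd. Writes '\n'-terminated lines
 * to 'out' as a C string. Returns the number of lines decoded, or -1 if
 * the input is truncated or 'out' is too small.
 */
int decode_records(const unsigned char *in, size_t inlen,
                   char *out, size_t outsize)
{
    size_t i = 0, o = 0;
    int lines = 0;

    while (i < inlen) {
        size_t len, padded;

        if (inlen - i < 2)
            return -1;                   /* truncated length field */
        len = in[i] | ((size_t)in[i + 1] << 8);
        i += 2;
        padded = len + (len & 1);        /* round up to an even count */
        if (inlen - i < padded)
            return -1;                   /* truncated record body */
        if (outsize - o < len + 2)       /* text + '\n' + '\0' */
            return -1;
        memcpy(out + o, in + i, len);
        o += len;
        out[o++] = '\n';
        i += padded;
        ++lines;
    }
    if (o >= outsize)
        return -1;
    out[o] = '\0';
    return lines;
}
```

Fed the twelve bytes shown above, the program-visible text is "hi\nthere\n": nine characters, none of which sit at the same offset as in the stored file, which is why a hand-computed fseek() offset means nothing here.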

--
+-------------------------+--------------------+-----------------------------+
| Kenneth J. Brody | www.hvcomputer.com | |
| kenbrody/at\spamcop.net | www.fptech.com | #include <std_disclaimer.h> |
+-------------------------+--------------------+-----------------------------+
Don't e-mail me at: <mailto:[email protected]>
 
CBFalconer

P.J. Plauger said:
.... snip ...


System compatibility is a damned important reason. Every system has its
own convention for representing text files, as created by text editors
and consumed by other text-processing programs. If that convention doesn't
match the C convention -- zero or more lines of arbitrary length, each
terminated by a newline -- somebody has to do some mapping. Whitesmiths,
Ltd. introduced the text/binary dichotomy in 1978 when porting C to
dozens of non-Unix systems, and other companies did much the same thing
in the coming years. It was a slam dunk to put it in the draft C Standard
begun in 1983.

If you try to live with just binary mode, then every program either has
to map text files for itself or tolerate a broad assortment of rules for
delimiting text lines. There's precedent for the latter approach too
(see, for example, Java), but Unix gives a powerful precedent for
having a uniform internal convention for representing text streams.

However the user should be aware that everything breaks down if the
input system tries to handle a file as text when that file doesn't
adhere to the conventions for text on the system. Thus a
windows/dos file bodily transferred to a linux system will have
those extraneous '\r's. A linux file transferred to windows may
appear to be one long line with no '\n's.

Some programs deliberately treat all files as binary, and try to
make their own decisions about the format. I believe gcc is one of
these.
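
A program that wants to tolerate such transferred files can strip the terminator itself after reading each line. A minimal sketch, with a hypothetical helper name, meant to be applied to a buffer filled by fgets():

```c
#include <string.h>

/*
 * Remove one trailing line terminator from 'line': "\r\n", "\n", or a
 * lone "\r". Returns the new length of the string.
 */
size_t chomp_line(char *line)
{
    size_t n = strlen(line);

    if (n > 0 && line[n - 1] == '\n')
        line[--n] = '\0';
    if (n > 0 && line[n - 1] == '\r')
        line[--n] = '\0';
    return n;
}
```

This is the "make their own decisions about the format" approach in miniature: it handles DOS-style and Unix-style lines alike, at the cost of every such program carrying its own copy of the logic.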

--
"If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson
More details at: <http://cfaj.freeshell.org/google/>
Also see <http://www.safalra.com/special/googlegroupsreply/>
 
CBFalconer

Chris said:
.... snip ....

Why would they want to? The job of standards is to standardise industry
practice not go off on a whim and expect the industry to follow. That
is the problem with C at the moment. It has gone in a direction the
industry does not want to follow.

Others have commented on the failings of the standards including people
on the standards panels.

You misconstrued. For example, to bring gcc into conformity one
can use "-W -Wall -ansi -pedantic". The 'they' is the user.

 
P.J. Plauger

However the user should be aware that everything breaks down if the
input system tries to handle a file as text when that file doesn't
adhere to the conventions for text on the system.

Well, yes.
Thus a
windows/dos file bodily transferred to a linux system will have
those extraneous '\r's. A linux file transferred to windows may
appear to be one long line with no '\n's.

You mean, no '\r's I assume.
Some programs deliberately treat all files as binary, and try to
make their own decisions about the format. I believe gcc is one of
these.

Could be, but IIRC it gets bent out of shape if a backslash macro
continuation is not immediately followed by a newline. So its effort
is only half hearted.

Put simply, there's no substitute for making all text files on a
system obey the conventions for text files on that system.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
 
CBFalconer

P.J. Plauger said:
.... snip ...

You mean, no '\r's I assume.

No, I meant what I said. I am assuming the text input system looks
for the sequence \r\n before transmitting a \n. The other
(probably more likely) interpretation will just work, and nobody
need worry unduly about that. And I did say 'may appear'.
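
To spell out the two interpretations, here is a small sketch of both kinds of reader (hypothetical functions, not any real library's behaviour). The strict one only accepts the pair "\r\n" as a line end; the lenient one accepts any '\n':

```c
#include <stddef.h>

/*
 * Count line terminators the way a strict CRLF-only reader would:
 * only the two-byte sequence "\r\n" ends a line; a bare '\n' is
 * treated as ordinary data.
 */
size_t count_lines_crlf_only(const char *buf, size_t len)
{
    size_t lines = 0;

    for (size_t i = 0; i + 1 < len; ++i) {
        if (buf[i] == '\r' && buf[i + 1] == '\n') {
            ++lines;
            ++i;            /* skip the '\n' just consumed */
        }
    }
    return lines;
}

/* The lenient alternative: every '\n' ends a line, with or without '\r'. */
size_t count_lines_any_lf(const char *buf, size_t len)
{
    size_t lines = 0;

    for (size_t i = 0; i < len; ++i)
        if (buf[i] == '\n')
            ++lines;
    return lines;
}
```

Given the Unix-style text "a\nb\nc\n", the strict reader finds no line ends at all, which is the "one long line" case; the lenient reader finds three.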

 
Mike S

CBFalconer said:
However the user should be aware that everything breaks down if the
input system tries to handle a file as text when that file doesn't
adhere to the conventions for text on the system. Thus a
windows/dos file bodily transferred to a linux system will have
those extraneous '\r's. A linux file transferred to windows may
appear to be one long line with no '\n's.

Some programs deliberately treat all files as binary, and try to
make their own decisions about the format. I believe gcc is one of
these.

I'm fairly new to C and especially to ANSI/ISO C, but it seems somewhat
strange to me that the Standard (AFAIK) doesn't attempt to regulate
this aspect of the language as far as the standard file descriptors go.
For example, to me it would seem logical to *always* open stdin in text
mode, so that redirected input would work correctly regardless of the
platform, since I think it's fair to say that most redirected input to
real-world apps and command-line utilities is in the form of text files
and not binary data files. Some (most?) compilers will link code into
compiled programs to choose the "best" mode for stdin based on how you
give your C programs input. Cygwin gcc is one example: it puts stdin in
text mode if no redirection occurs; however, if you are redirecting
input from a file (even if it's a text file), it will switch stdin into
binary mode, unless you a) explicitly force stdin into text mode in the
source code or b) override this behavior with an external environment
variable.... Wouldn't it make more sense to clearly define this
behavior instead of leaving it to the whim of the specific compiler you
happen to be using at the moment? For example, why not have something in
the Standard to the effect of: "Upon entering main(), the standard
streams stdin, stdout, and stderr shall be in text mode"? Then the
programmer need not worry about compiler quirks like the one I
mentioned above when parsing text files from redirected input, since
newline translation would be guaranteed to occur unless or until the
programmer explicitly switches a stream from text mode into binary mode
at the source-code level.

Being a C newcomer and a complete novice in all things Standard, I
wouldn't be at all surprised if my argument here is overly simplistic
or even unfounded. What are your (everyone's) thoughts on this idea, or
has it already been discussed and discarded, or is it beyond the scope of what
the Standard is meant to define? I'm interested to see your thoughts
and comments on this.

Mike S
 
pete

Mike said:
For example, why not have something in
the Standard to the effect of: "Upon entering main(), the standard
streams stdin, stdout, and stderr shall be in text mode"?
Being a C newcomer and a complete novice in all things Standard, I
wouldn't be at all surprised if my argument here is overly simplistic
or even unfounded.

N869
7.19.3 Files
[#7] At program startup, three text streams are predefined
and need not be opened explicitly -- standard input (for
reading conventional input), standard output (for writing
conventional output), and standard error (for writing
diagnostic output).
 
Herbert Rosenau

Without text streams how can you produce a C source file that is
guaranteed to produce a valid text file on whatever system you run the
program on? Historically systems have used rather more schemes than just
terminating lines with CR, CRLF or LF,

It is simple. A stream is an abstract form of data I/O. Nothing
requires that a C program see CR, LF, or CR/LF at the source or
destination of an input or output stream. When the underlying OS says
that a line is a record of fixed length, or of variable length with
some special hint giving the record's length, or uses some 17-bit
special code on 42-bit chars, it is simply the C runtime that will
produce a line with '\n' at its real end (not necessarily identical to
the physical end of the record) and will expand an empty line "\n" to
the given format.

That is why a stream is defined as an abstraction that has nothing to
do with the physical layout of the physical data unit it is stored on.
That is why the standard says not a single word about screen, keyboard,
printer, disk... but says "stream" for everything.

There really are systems around that use fixed-length records,
variable-length records and so on. Take conforming C source and a
compiler that is able to compile it for that target, and each
conforming program will work.
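
As a sketch of the kind of work the runtime would do on a fixed-record system, the following turns space-padded fixed-length records into '\n'-terminated lines. The record layout (right-padding with spaces) and the function name are assumptions made up for the example, not a description of any particular OS:

```c
#include <stddef.h>
#include <string.h>

/*
 * Turn 'nrec' fixed-length records of 'reclen' bytes each, padded on the
 * right with spaces, into '\n'-terminated lines in 'out'. Trailing pad
 * spaces are dropped. Returns 0 on success, -1 if 'out' is too small.
 */
int records_to_lines(const char *rec, size_t reclen, size_t nrec,
                     char *out, size_t outsize)
{
    size_t o = 0;

    for (size_t r = 0; r < nrec; ++r) {
        const char *p = rec + r * reclen;
        size_t len = reclen;

        while (len > 0 && p[len - 1] == ' ')   /* strip padding */
            --len;
        if (outsize - o < len + 2)             /* text + '\n' + '\0' */
            return -1;
        memcpy(out + o, p, len);
        o += len;
        out[o++] = '\n';
    }
    if (o >= outsize)
        return -1;
    out[o] = '\0';
    return 0;
}
```

Note that a scheme like this cannot tell padding from trailing spaces the program really wrote, which fits the standard leaving it implementation-defined whether spaces written immediately before a newline show up when a text stream is read back.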

some have used some form of
record format, e.g. the first couple of bytes on a line saying how long
the line is.

Yeah, but when you use the I/O defined in the standard you have no
need to know how data is written.
Who is to say that in the future a system might not choose to encode the
file type as a mime header? Then the system might not even let you open
a text file as a binary file, or open a binary file as a text file, and
such restrictions could be useful. Also, on such a system, if you
created a file as a binary file the normal text editor of the system
might refuse to open it!

So the compatibility aspect is pretty major.

That has been a given since BCPL, the predecessor of C. It is also why C
knows nothing about directories.

When your program needs to handle directories, or needs exact knowledge
of how data gets written, you have to use system-specific functions and
leave the area of conforming C programs.

The standard defines the minimal requirements a C implementation has
to fulfil, but leaves enough room for the implementation to do it in the
way that best fits its underlying OS in a hosted environment, or the
bare implementation in a freestanding environment.

--
Tschau/Bye
Herbert

 
Herbert Rosenau

Richard Heathfield wrote:
[...]
The newline marker defined by C is '\n'.
[... How various systems "really" store EOL in a file ...]
On the mainframe - well, you /really/ don't want to know.
[...]

I recall using a "mainframe" (okay, really a micro version of a mainframe)
that didn't have any "end of line marker" in text files. Instead, each
line started with two bytes containing the length of the line, and the
contents would be padded to an even number of bytes. So, a text file with
two lines -- "hi" and "there" -- would actually contain this sequence of
bytes:

0x02, 0x00, 'h', 'i', 0x05, 0x00, 't', 'h', 'e', 'r', 'e', 0x00

This is also a perfect example why you can't pass arbitrary values to
fseek().
Yeah, but you would run the classic "hello world\n" program
successfully without knowing the details of that implementation. This
is one of the points that makes C so highly portable.

--
Tschau/Bye
Herbert

 
Flash Gordon

Herbert said:
It is simple. A stream is an abstract form of data I/O. There is

<snip>

I think you are in violent agreement with me. I was responding to a
question about why C has text streams as well as binary streams with an
explanation of the problems if it did not. You are explaining why C
programs see an abstraction (e.g. text and binary streams), with the
system specifics handled at a lower level.
--
Flash Gordon, living in interesting times.
Web site - http://home.flash-gordon.me.uk/
comp.lang.c posting guidelines and intro:
http://clc-wiki.net/wiki/Intro_to_clc

 
Ben Pfaff

Mike S said:
I'm fairly new to C and especially to ANSI/ISO C, but it seems somewhat
strange to me that the Standard (AFAIK) doesn't attempt to regulate
this aspect of the language as far as the standard file descriptors go.
For example, to me it would seem logical to *always* open stdin in text
mode, so that redirected input would work correctly regardless of the
platform, since I think it's fair to say that most redirected input to
real-world apps and command-line utilities is in the form of text files
and not binary data files.

The Standard refers to the standard streams as "standard text
streams", so presumably they are supposed to be in text mode.
 
