binary vs text mode for files


Les Cargill

kerravon said:
Hello.

In order for an MSDOS CRLF sequence to be converted
into a single NL, a file needs to be opened in text
mode. If it was opened in binary mode there would
not be anything special about the sequence, and that
sequence just happened by random, when we're perhaps
dealing with a zip file.

My question is - how do other languages like BASIC,
Pascal, Fortran, Cobol, PL/1 deal with this
fundamental difference between binary and text file
processing?

There is a "raw" and a "cooked" mode for all versions of DOS
(including Win8).

AFAIK, when you do an fopen("x.bin","rb") it uses raw mode.
People suggested that C was odd for having this
differentiation.

AFAIK, it's only 'C' on DOS/Windows machines, but this went down to the
DOS call level (INT 21h), so who knows?

You can configure a file in Tcl with "fconfigure $f -translation binary"
and it works the same cross-platform.

There is a similar thing in 'C' on Linux - I can never remember what
it's called.
 

Stefan Ram

Keith Thompson said:
This is one of my "If I had a time machine" projects: going back
to the 1960s and persuading everyone to standardize on a single

Building a time machine to travel into the past is easy
compared to persuading people to agree on a single standard.
 

BartC

Keith Thompson said:
Quite well, when reading and writing native files.


A better way to deal with that is to translate non-native files before
feeding them to native programs. Let a translation utility do the work,
and let everything else just deal with text.

This is the problem: how do you know the provenance of every single text
file littering your system? How does a piece of software know? Unless you
have some sort of firewall that examines every possible file coming in and
does the necessary conversions. But that's a bit draconian (and who's going
to be first to do that? The assumption has to be that such a thing doesn't
exist).

And apart from line-endings, there are many details to worry about with text
files (character encoding for example). Unless you are actually running on
one of those rare systems that use none of those three line-ending schemes
I've mentioned. But since I've never used one in my whole life, nor am
likely to, I'm not going to take such a possibility seriously. (And if it
ever does come up, it can just be treated as a special file format to be
dealt with, like a hundred others.)
I haven't used clang on Windows. Assuming it behaves as you
describe, I'd say that's a bug in clang. That doesn't make your
approach reasonable.

Let's call it a bug in my programs then. (Which, if anyone else other than
me is ever going to use, and they want to run them on an LF system and they
want it to read and generate LF on that system, then I might fix it.)
It's probably the fault of whoever copied the file without thinking
about the format.

Then it's going to be the fault of the millions of people who download
text files from the internet every day.
 

Malcolm McLean

This is one of my "If I had a time machine" projects: going back
to the 1960s and persuading everyone to standardize on a single
text file format, with a single character (neither "linefeed" nor
"carriage return") representing the end of a line. And consistent
byte order (flip a coin, I don't care which). And no EBCDIC.
And no UTF-16. And plain char is unsigned.
You need a soft and a hard line break.
Soft line breaks are hints to the output device about rendering; hard
line breaks carry information. So this post should have a hard line
break after the first sentence, and soft line breaks so you can view
the last paragraph comfortably. Let's see what Google and newsreaders
do with it.
 

Eric Sosman

[...]
This is one of my "If I had a time machine" projects: going back
to the 1960s and persuading everyone to standardize on a single
text file format, with a single character (neither "linefeed" nor
"carriage return") representing the end of a line. [...]

Sorry, Keith, but you're completely in the wrong here.
A "line" is one hundred thirty-three characters, the first
being metadata that isn't actually printed but governs the
vertical spacing, and the rest being payload.[*] There are
no end-of-line characters, no control characters of any kind
for that matter.

[*] I once testified to this under oath as an expert
witness -- not in a full-scale trial, but in a hearing before
an administrative judge. When the other side's attorney was
cross-examining me about this and other silliness, I imagine he
must have been thinking "Why, oh why did I go to law school
instead of getting a nice job flipping hamburgers?"
 

Rosario193

Hello.

In order for an MSDOS CRLF sequence to be converted
into a single NL, a file needs to be opened in text

I'm not speaking of standard C, which has its text mode for opening
files and so on.

I'm saying everything would be easier if each char were 8 bits and
text were just one array of chars... so everything binary.
 

glen herrmannsfeldt

(snip)
Convince the young Bill Gates that '/' separates directory elements
not '\' and that '-' introduces a command-line option, not '/'.
Please. ;-)

I believe that DEC was using / for command-line options earlier than Unix.

They were, at least, when I started working with DEC systems in 1976,
and I believe that they weren't so new at the time.

Early MS software was done using cross assemblers running on DEC
systems, so MS naturally started using them.

I don't know that it precludes using / for a path separator, but it
does make it more complicated. You might look at how VMS does
its path indications.

-- glen
 

glen herrmannsfeldt

(snip)
"Command Line Interface Wars" predate the more famous later
Apple-Microsoft mouse squabbles. Lots of things were 'reversed' by
various vendors just to avoid an argument.
For example,
Copy <target> <items> <...>
Copy <items> <...> <target>

Not to mention that different assemblers (and often the underlying
machine code) do move operations in one of those two ways.

-- glen
 

glen herrmannsfeldt

(snip)
Sorry, Keith, but you're completely in the wrong here.
A "line" is one hundred thirty-three characters, the first
being metadata that isn't actually printed but governs the
vertical spacing, and the rest being payload.[*] There are
no end-of-line characters, no control characters of any kind
for that matter.

Do the metadata characters indicate what to do before or
after printing the specified data? (FBA or FBM, to be specific.)

I once had someone send me some files from an IBM system with
variable length records and no line termination. After I complained
about that, they resent them with fixed length and no termination.

Seems that the default for MVS's ftp in binary mode is not to add
any line terminators.

-- glen
 

Kaz Kylheku

Convince the young Bill Gates that '/' separates directory elements
not '\' and that '-' introduces a command-line option, not '/'.
Please. ;-)

Every version of DOS and Windows has supported / separators.

Any problems using / are due to the "user space" "shell" around the OS.

Early versions of COMMAND.COM had a configurable variable to set your
preference to forward slash or backslash. This was eventually removed.

So, use your time machine to convince the developer who decided to remove this
switchability, setting a precedent for applications and shells to ignore the
existence of the other separator.

As for command line options, I would prefer that the time machine be used more
productively: please convince Microsoft that the command line passed to child
programs should be broken into arguments *by the OS*, so that applications
receive some kind of array of strings, and not one big string, which
every program parses in an ad-hoc way according to its own conventions. If
that could be done, I will ask no more.
 

kerravon

I once had someone send me some files from an IBM system with
variable length records and no line termination. After I complained
about that, they resent them with fixed length and no termination.

Seems that the default for MVS's ftp in binary mode is not to add
any line terminators.

Can you elaborate on this? What was the data?
If you were expecting a text file with line
separators, why were you transferring in
binary mode?

If it was a binary file, with variable length
records, what did you expect to see to break
up the records? And what software would be used
to process those records?

Also see this:

http://pdos.cvs.sourceforge.net/viewvc/pdos/pdos/pdpclib/folks.c?view=markup

BFN. Paul.
 

Kaz Kylheku

Not sure what you mean here. Microsoft C (since 5.1) has always
provided the command line as an "array of strings". Both from the

But since when is Microsoft C (since 5.1) part of the operating system?

There is a Windows API call GetCommandLine() that appears to be a
'single' string, but can be converted to a 'char* argv[]' with a bit
of reassignment.

Now you're warmer. It more than appears to be a string; it is one.

Breaking a command line into individual arguments is non-trivial, and
"everyone" ends up doing it in slightly incompatible ways.

You cannot just break it on whitespace, because the one big command line
string may contain quotes, so that it can express an argument that
contains a space. Then
it's a matter of what kind of quotes. Are single quotes within double quotes
literal or just nested? And how are quotes escaped? Is the \" sequence an
escape or taken literally? What escapes the backslash then, \\ or not?
Is \" consistently processed outside of a quote too?

The end result is that on Microsoft platforms, you have to know exactly
how the specific program you are invoking parses its arguments in order
to prepare the command line string so that it is correctly parsed by
that program.
 

glen herrmannsfeldt

Kaz Kylheku said:
(snip)
Every version of DOS and Windows has supported / separators.

Well, except 1.x, since, unless I remember wrong, subdirectories didn't
come until 2.0.
Any problems using / are due to the "user space" "shell" around the OS.
Yes.

Early versions of COMMAND.COM had a configurable variable to set your
preference to forward slash or backslash. This was eventually removed.

So, use your time machine to convince the developer who decided to remove this
switchability, setting a precedent for applications and shells to ignore the
existence of the other separator.

I suppose, but that doesn't really help if you are using someone
else's system. I suppose it does for your own, but then you can
run a shell other than CMD or COMMAND.
As for command line options, I would prefer that the time machine
be used more productively: please convince Microsoft that the
command line passed to child programs should be broken into
arguments *by the OS*, so that applications receive some kind
of array of strings, and not one big string, which every
program parses in an ad-hoc way according to its own conventions.
If that could be done, I will ask no more.

Well, VM/CMS does parse the command line. First it converts
the whole line to upper case. Then it passes the first eight
characters of each argument string to the program. That might
not be quite what you wanted.

The MS C compiler I used to use came with a routine that you
could link that would parse the command line. So, not quite as
general as you ask, but at the time MSC was pretty much the C
compiler for DOS (and OS/2).

The one I can never figure out is the quoting convention for CMD.


-- glen
 

glen herrmannsfeldt

(snip, I wrote)
Can you elaborate on this? What was the data?
If you were expecting a text file with line
separators, why were you transferring in
binary mode?

That was some time ago, and I don't remember the details.

For one, you might want to transfer a file of EBCDIC text data
without translating it.

As I remember it, I found out later that there is an option to
transfer the data along with the RDW (length fields) that come
before each line, but the default is not to do that.

I also have a transfer of an IEBCOPY unloaded load module, again
with the block lengths (RECFM=U) removed. Sometime I will get
back to that, as I believe that load modules have length information,
but to undo the unload, it has to be converted back to the form
that IEBCOPY expects first.
If it was a binary file, with variable length
records, what did you expect to see to break
up the records? And what software would be used
to process those records?

Length fields on each record.

-- glen
 

Kaz Kylheku

The OS provides a function that the application can use to do that
pretty simply (CommandLineToArgvW).

Unfortunately, Microsoft's own applications, like CMD.EXE, do not use it.

The command processor processes quotes on arguments, so that

dir "file"
dir file

produce consistent results. But then it gets confused by

"dir" file.
The inability to read the unmangled, and unglobbed string (and those
are two separate requirements, although they do sometimes overlap), is
a significant PITA on *nix.

But what you have is the ability to read the un-mangled-in-any-way
null-terminated argument string which was prepared by the parent process!

At the shell level, you can quote in reliable ways to pass any string
you want to the process:

$ my-find-utility '*.foo' # receives *.foo argument

This is done. There are programs that take patterns which they
interpret themselves:

# GNU diff: recursively compare two directories, but not .xml files.
$ diff -r --exclude='*.xml' dir-a dir-b
Getting the input tokenized and globbed
is often what applications want, but there are enough exceptions to
make the *nix approach as wrong as the MS approach of leaving it far
too much up to the whim of the application.

But the Unix approach lets us pass a vector of arguments from one process to
another in a completely robust way. The Microsoft approach doesn't.

We can replace the Unix shell with some scripting language in which a fork/exec
can be done using a list of strings, and count on the list of strings being
accurately converted to arguments. We don't have that assurance in the Windows
environment.
The MS approach at least
has the advantage of being able to support any desired input syntax
(ignoring piping and command stacking), which the *nix approach does
not.

This is only a theoretical advantage in Windows, whereas it is actually
*done* in Unix.

A primary example of this is the standard awk language, which is used for
writing one-liners whereby the entire script is presented as a command line
argument, often wrapped using single quotes:

$ awk 'BEGIN { foo = 42 } /xyzzy/ { s[$1]++; } ...'

What makes this sort of thing possible is that there is a solid foundation
underneath. A program can call awk using fork/exec, and not have to worry
about quoting and escaping issues:

execlp("awk", "awk", "BEGIN ....", (char *) 0);

argv[0] is "awk" and argv[1] is a null-terminated string which awk takes to be
script source code, not messed up in any way between the exec call and awk.

In the end, it is the Unix environment which exhibits languages that can take
an entire script with arbitrary syntax as a command line argument, thanks
in part to clear quoting rules and robust argument passing between programs.
 

BartC

As for command line options, I would prefer that the time machine be used
more
productively: please convince Microsoft that the command line passed to
child
programs should be broken into arguments *by the OS*, so that applications
receive some kind of array of strings, and not one big string, which
every program parses in an ad-hoc way according to its own conventions.
If
that could be done, I will ask no more.

How many arguments? Because instead of a small string, you could have a
*huge* array when it expands an argument such as "*.c" into thousands of
individual filenames, if it works like certain non-Windows operating
systems.
 

glen herrmannsfeldt

(snip, I wrote)
The binder (which replaced the linker a couple of decades ago) no
longer produces an 80-byte punch-card-compatible format, but the system
still supports old-style load modules.

The LRECL=80 is supposed to be input to the linkage editor, loader,
or binder, and output of compilers.

The OS/360 Linkage Editor is unusual, though, in its ability to
read its own output. System libraries were traditionally load
module (linker output) and not object module (linker input).
And you can sometimes unbury
the old linker and use that if you really want, although it may be
that these days you need to copy it from an older system. The linker
has significant limitations, including the inability to deal with long
names.

Yes, PL/I allows longer names. In the olden days the generated external
names were formed from, I believe, the first four and last three
characters. The compilers have to be able to generate more than one
CSECT, so they need one extra character to distinguish them.

I believe the binder accepts both old and new style object modules,
and also old and new load modules.

-- glen
 

James Kuyper

On Mon, 24 Mar 2014 00:52:21 +0000 (UTC), Kaz Kylheku


Not sure what you mean here. Microsoft C (since 5.1) has always
provided the command line as an "array of strings". ...

It's been a long time since I used Microsoft C, so long that my
experience includes versions earlier than 5.1, and I'm pretty sure they
did the same. The argc/argv interface for main() was already well
established when "The C Programming Language" was first published.
However, that's irrelevant - you're talking about the compiler; he's
talking about the OS. A conforming implementation of C must support the
argc/argv interface, though not necessarily in any meaningful sense. For
example, I remember that under VMS, special actions needed to be taken
to set up a program so that it could actually take command line
arguments, though I thankfully no longer remember the details. IIRC, it
was necessary to tell VMS something about how many command line
arguments could be passed, and some details about the syntax for
each one.

....
There is a Windows API call GetCommandLine() that appears to be a
'single' string, ...

That, on the other hand, is very relevant, though providing the command
line arguments as a single string goes all the way at least back to DOS.
 

James Kuyper

On 03/24/2014 02:50 AM, Robert Wessel wrote:
....
The inability to read the unmangled, and unglobbed string (and those
are two separate requirements, although they do sometimes overlap), is
a significant PITA on *nix.

I've been using Unix-like systems for 30 years now. I've seldom needed
to pass a command line argument to a program that would be subject to
modification by the shell, when I didn't want it to be, and it's quite
trivial to quote the string to prevent those modifications on the rare
occasions when I've needed it to prevent them. I find file name globbing
to be extremely convenient in many contexts, and I found the MS
work-arounds for the lack of globbing to be quite inconvenient.
 

James Kuyper

How many arguments? Because instead of a small string, you could have a
*huge* array when it expands an argument such as "*.c" into thousands of
individual filenames, if it works like certain non-Windows operating
systems.

An array of thousands of pointers to strings containing file names is
not going to be much larger than a single string containing all of those
file names. If you used C, rather than one of your own languages, it's
an array that would have to be constructed anyway, just by the C startup
code, rather than by the OS. I'm agnostic on the question of whether the
OS should create such an array - but I don't see the size of that array
as being a serious problem except in embedded systems with tight memory
constraints.
 
