How exactly do binary files work in Python?

John Salerno

In C#, writing to a binary file wrote the actual data types into the
file (integers, etc.). Is this not how Python binary files work? I tried
to write integers into a file, but the write method only takes a string
argument anyway.

Is there a way to actually store integers in a file, so that they can be
read and used (added, compared, etc.) as integers?
 
Erik Max Francis

John said:
> In C#, writing to a binary file wrote the actual data types into the
> file (integers, etc.).

This was inherently nonportable.

> Is this not how Python binary files work? I tried
> to write integers into a file, but the write method only takes a string
> argument anyway.
>
> Is there a way to actually store integers in a file, so that they can be
> read and used (added, compared, etc.) as integers?

You can use the struct module for converting fundamental types to a
portable string representation for writing to binary files. Since
you're dealing with a high-level language, you can also just use the
pickle module for a more general form of serialization and persistence.
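
A minimal sketch of both suggestions, assuming current Python 3, where struct.pack returns a bytes object rather than a str. The file names, the "<3i" format, and the sample values are made up for illustration:

```python
import os
import pickle
import struct
import tempfile

values = [1, 2, 70000]
workdir = tempfile.mkdtemp()

# struct: pack three little-endian 32-bit signed integers into 12 bytes.
struct_path = os.path.join(workdir, "ints.bin")
with open(struct_path, "wb") as f:
    f.write(struct.pack("<3i", *values))

with open(struct_path, "rb") as f:
    raw = f.read()
restored = list(struct.unpack("<3i", raw))

# pickle: serialize the whole list object in one call.
pickle_path = os.path.join(workdir, "ints.pickle")
with open(pickle_path, "wb") as f:
    pickle.dump(values, f)

with open(pickle_path, "rb") as f:
    unpickled = pickle.load(f)

# The restored values are real integers, so arithmetic works on them.
total = sum(restored)
print(len(raw), restored, total, unpickled == values)
```

The explicit "<" in the format string fixes the byte order, which is what makes the struct output portable between machines.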
 
John Salerno

Erik said:
> You can use the struct module for converting fundamental types to a
> portable string representation for writing to binary files.

But if it's a string, why not just use a text file? What does a binary
file do that a text file doesn't, aside from not converting the end of
line characters?
 
Grant Edwards

> But if it's a string, why not just use a text file?

Because string != text.

In Python a "string" is just an arbitrary-length chunk of bytes.

> What does a binary file do that a text file doesn't, aside
> from not converting the end of line characters?

Nothing. It's the end-of-line conversion that can break binary
data.
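
A small sketch of how that breakage can look, assuming Python 3. Here newline="\r\n" is used to request the Windows-style translation on any platform, and latin-1 is chosen only because it maps each byte value to exactly one character; the file name and the packed value are arbitrary:

```python
import os
import struct
import tempfile

# The integer 10 packs to four bytes whose first byte is 0x0A, i.e. "\n".
packed = struct.pack("<I", 10)

path = os.path.join(tempfile.mkdtemp(), "broken.bin")

# Write the binary data through a text-mode file that translates "\n".
with open(path, "w", encoding="latin-1", newline="\r\n") as f:
    f.write(packed.decode("latin-1"))

# Read back the raw bytes to see what actually reached the disk.
with open(path, "rb") as f:
    on_disk = f.read()

print(packed, on_disk)
```

The file ends up one byte longer than the data that was written, so the packed integer can no longer be unpacked from it as is.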
 
Alex Martelli

Grant Edwards said:
> Nothing. It's the end-of-line conversion that can break binary
> data.

I believe that a "control-Z" (chr(26)) in a file that's being read as
text, on Windows, is also taken as an end-of-file indication.


Alex
 
Steven D'Aprano

> But if it's a string, why not just use a text file? What does a binary
> file do that a text file doesn't, aside from not converting the end of
> line characters?

Nothing. It is all bytes under the hood.

People generally consider a file to be "text" if it only includes bytes 32
through 126, plus a few control characters like 9 (tab) and 10 (newline).
Other applications don't care what bytes are included. Python is (mostly)
like that: you can deal with any bytes you collect from any file.

Other than this informal difference between text and binary, the major
difference comes about when you read lines from a text file. Each
operating system has a line separator: Unix/BSD/Linux systems use newline
(char 10), classic Macintosh used carriage return (char 13), and
DOS/Windows uses a two-byte carriage return + newline.

When writing lines to a file, Python does not automatically append the
line marker, so you need to do so yourself. But some other languages do --
I believe C++ is one of those languages. So C++ needs to know whether you
are writing in text mode so it can append that end-of-line marker, or
binary mode so it doesn't. Since Python doesn't modify the line you write
to the file, it doesn't care whether you are writing in text or binary
mode, it is all the same.

Operating systems such as Unix and Linux don't distinguish between binary
and text mode, the results are the same. I'm told that Windows does
distinguish between the two, although I couldn't tell you how they
differ.
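
To illustrate the point about write not appending a line marker, here is a minimal sketch assuming Python 3 and an in-memory io.StringIO buffer standing in for a real file; the sample strings are arbitrary:

```python
import io

buf = io.StringIO()

buf.write("spam")              # write() adds nothing after the text
buf.write("eggs\n")            # the caller supplies the line marker
buf.writelines(["a", "b\n"])   # writelines() does not add markers either

contents = buf.getvalue()
print(repr(contents))
```

Only the two "\n" characters that were written explicitly appear in the result.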
 
Grant Edwards

> I believe that a "control-Z" (chr(26)) in a file that's being read as
> text, on Windows, is also taken as an end-of-file indication.

Ah yes. IIRC, that's left over from CP/M, where the filesystem
didn't keep a file length for files other than a block count.
It was up to the application(s) to keep track of where in that
last block the "real" data ended.
 
Scott David Daniels

Steven D'Aprano wrote:
[Generally fine stuff; I am elaborating rather than disagreeing.]

> Nothing. It is all bytes under the hood.

Modeling a file as "a continuous undifferentiated string of bytes under
the hood" is a Unix-ism. There were (and are) other models.

> When writing lines to a file, Python does not automatically append the
> line marker, so you need to do so yourself.

This is indeed the behavior with "write", but not with "print".
A "print" statement ending w/o a comma will tack an end-of-line onto its
output.

> But some other languages do -- I believe C++ is one of those languages.
> So C++ needs to know whether you are writing in text mode so it can
> append that end-of-line marker, or binary mode so it doesn't.

Actually C++ (and C) convert any ('\12' == '\n' == LF) character to
the local file system's "line terminator" sequence on output to a
text-mode file.

> Since Python doesn't modify the line you write to the file, it doesn't
> care whether you are writing in text or binary mode, it is all the same.

Well, actually CPython uses C I/O, so it does convert the '\n' chars
just as C does.

> Operating systems such as Unix and Linux don't distinguish between binary
> and text mode, the results are the same. I'm told that Windows does
> distinguish between the two, although I couldn't tell you how they
> differ.

The way Windows differs from Unix:
If the actual file data is built as:
    f = open('dead_parrot', 'wb')
    f.write('dead\r\nparrot')
    f.close()
    g = open('ex_parrot', 'w')
    g.write('Dead\nParrot')
    g.close()
then reading in text mode gives:
    ft = open('dead_parrot', 'r')
    ft.read(6)    # returns 'dead\np'
    gt = open('ex_parrot', 'r')
    gt.read(6)    # returns 'Dead\nP'

while reading in binary mode gives:
    fb = open('dead_parrot', 'rb')
    fb.read(6)    # returns 'dead\r\n'
    gb = open('ex_parrot', 'rb')
    gb.read(6)    # returns 'Dead\r\n'

In case you didn't follow the above too precisely, both files
(dead_parrot and ex_parrot) have, apart from capitalization, exactly the
same bytes as contents on Windows.
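
The same experiment can be approximated on any platform. This is a sketch assuming Python 3, where the newline argument to open controls the translation: newline="\r\n" stands in for Windows text mode on output, and the default text mode converts "\r\n" to "\n" on input. The temporary directory is only there to keep the example self-contained:

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
dead = os.path.join(workdir, "dead_parrot")
ex = os.path.join(workdir, "ex_parrot")

# Binary mode: the bytes are written exactly as given.
with open(dead, "wb") as f:
    f.write(b"dead\r\nparrot")

# Text mode with Windows-style translation: "\n" becomes "\r\n" on disk.
with open(ex, "w", newline="\r\n") as g:
    g.write("Dead\nParrot")

# Text-mode reads see a single "\n" in both files.
with open(dead, "r") as ft, open(ex, "r") as gt:
    text_reads = (ft.read(6), gt.read(6))

# Binary-mode reads see the two-byte "\r\n" in both files.
with open(dead, "rb") as fb, open(ex, "rb") as gb:
    binary_reads = (fb.read(6), gb.read(6))

print(text_reads, binary_reads)
```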

This, by the way, is one of the few places Windows did it "by the
standard" and Unix "made up their own standard." The Unix decision
was, essentially: "there are too many ways to get in trouble with
both CR and LF determining line ending: what do you do for LF-CR pairs,
what does a LF by itself mean w/o a CR, .... Let's just treat LF
as a single-character line separator." Note how funny this is given how
you type: you type <a> <b> <c> <Enter> for a line, but <Enter> sends
a CR ('\r' == '\15' == ASCII 13), which the I/O system somewhere
magically transforms into a LF ('\n' == '\12' == ASCII 10).

The C standard (which evolved with Unix) does this translation
"for you" (or "to you" depending on your mood) because it was meant
to be compatible with _many_ file systems, including those which did
not explicitly represent ends-of-lines (text files on such systems
are sequences of lines, and there is a maximum length to each line).
By the way, before you think such systems are foolish, think about
how nice it might sometimes be to get to line 20972 of a file without
reading through the entire front of the file.
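
A rough sketch of why that can be nice, assuming Python 3 and a made-up layout in which every line is padded with spaces to a fixed 16 bytes. RECORD_LEN, read_record, and the file name are hypothetical names used only for this illustration, not a description of any particular file system:

```python
import os
import tempfile

RECORD_LEN = 16  # hypothetical fixed line length, padded with spaces

path = os.path.join(tempfile.mkdtemp(), "records.dat")

# Build a file of 1000 fixed-length lines with no end-of-line bytes at all.
with open(path, "wb") as f:
    for n in range(1000):
        f.write(("line %d" % n).encode("ascii").ljust(RECORD_LEN))


def read_record(path, index):
    """Return line number `index` (0-based) without reading earlier lines."""
    with open(path, "rb") as f:
        f.seek(index * RECORD_LEN)  # the offset is simple arithmetic
        return f.read(RECORD_LEN).rstrip().decode("ascii")


print(read_record(path, 972))
```

With variable-length lines terminated by LF, finding the same line would mean scanning every byte before it.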

--Scott David Daniels
 
