Chris Angelico
But a *text file* is a concatenation of lines. The "text file" model is
important enough that nearly all programming languages offer a line-based
interface to files, and some (Python at least, possibly others) make it
the default interface so that iterating over the file gives you lines
rather than bytes -- even in "binary" mode.
And lines are delimited entities. A text file is a sequence of lines,
separated by certain characters.
There is: call strip('\n') on the line after reading it. Perl and Ruby
spell it chomp(). Other languages may spell it differently. I don't know
of any language that automatically strips newlines, probably because you
can easily strip the newline from the line yourself, but if the language
did it for you, you could not reliably reverse it.
That's not a tidy way to iterate, that's a way to iterate and then do
stuff. Compare:
    for line in f:
        # process line with newline

    for line in f:
        line = line.strip("\n")
        # process line without newline, as long as it doesn't have \r\n or something

    for line in f:
        line = line.split("$")
        # process line as a series of dollar-delimited fields
The second one is more like the third than the first. Python does not
offer a tidy way to do the common thing, which is reading the content
of the line without its terminator.
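For illustration, here is a rough sketch of what such an iteration could look like. The name stripped_lines is hypothetical (nothing like it exists in the standard library), and the sketch assumes only "\n", "\r\n" and "\r" count as terminators:

```python
import io

def stripped_lines(f):
    """Yield each line of f without its trailing line terminator.

    Hypothetical helper, not part of the standard library.
    """
    for line in f:
        # Remove exactly one trailing terminator: "\r\n", "\n" or "\r".
        if line.endswith("\r\n"):
            yield line[:-2]
        elif line.endswith(("\n", "\r")):
            yield line[:-1]
        else:
            # Last line of a file that has no final newline.
            yield line

# In-memory stand-in for a file with mixed line endings.
demo = io.StringIO("first\nsecond\r\nthird")
result = list(stripped_lines(demo))
```

With a wrapper like this the loop body sees only the content of each line, at the cost of no longer knowing which terminator (if any) was there.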
I have no problem with that: when interpreting text as a record with
delimiters, e.g. from a CSV file, you normally exclude the delimiter.
Sometimes the line terminator does double-duty as a record delimiter as
well.
So why is the delimiter excluded when you treat the file as CSV, but
included when you treat the file as lines of text?
Reading from a file is considered a low-level operation. Reading
individual bytes in binary mode is the lowest level; reading lines in
text mode is the next level, built on top of the lower binary mode. You
build higher protocols on top of one or the other of those modes, e.g.
"read a zip file" would be built on top of binary mode, "read a csv file"
would be built on top of text mode.
I agree that reading a binary file is the lowest level. Reading a text
file is higher level, but to me "reading a text file" means "reading a
binary file and decoding it into Unicode text", and not "... and
dividing it into lines". Bear in mind that reading a CSV file can be
built on top of a Unicode decode, but not on a line-based iteration
(in case there are newlines inside quotes).
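A small sketch of the newlines-inside-quotes problem, using made-up sample data. Splitting into lines first and then into fields breaks the quoted field apart, whereas the csv module tracks quoting itself and keeps the record whole:

```python
import csv
import io

# Made-up sample: the second record has a newline inside a quoted field.
text = 'name,note\n"a","line1\nline2"\n'

# Naive approach: split into lines first, then split each line on commas.
naive = [line.split(",") for line in text.splitlines()]

# csv.reader handles the quoting, so the embedded newline stays in the field.
records = list(csv.reader(io.StringIO(text, newline="")))
```

Here naive ends up with three broken "records", while records has the expected two, the second of which still contains the newline.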
As a low-level protocol, you ought to be able to copy a file without
changing it by reading it in then writing it out:
    for blob in infile:
        outfile.write(blob)
ought to work whether you are in text mode or binary mode, so long as the
infile and outfile are opened in the same mode. If Python were to strip
newlines, that would no longer be the case.
All you need is a "writeln" method that re-adds the newline, and then
it's correctly round-tripping, based on what you've already stated
about the file: that it's a series of lines of text. It might not be a
byte-for-byte round-trip if you're changing newline style, but then it
already might not be for other reasons (file encoding, for
instance). By reading the file as a series of Unicode lines, you're
declaring that it contains lines of Unicode text, not arbitrary bytes,
and so a valid representation of those lines of Unicode text is a
faithful reproduction of the file. If you want a byte-for-byte
identical file, open it in binary mode to do the copy; that's what we
learn from FTPing files between Linux and Windows.
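As a sketch of that round trip, assuming a hypothetical writeln helper (Python's file objects have no such method) and "\n" as the output newline style:

```python
import io

def writeln(f, line):
    """Hypothetical helper: write line followed by a newline."""
    f.write(line + "\n")

def copy_lines(infile, outfile):
    """Copy a text file line by line, stripping and re-adding newlines."""
    for line in infile:
        # Strip the terminator on read, re-add it on write.
        writeln(outfile, line.rstrip("\n"))

# In-memory stand-ins for two files opened in text mode.
src = io.StringIO("alpha\nbeta\n")
dst = io.StringIO()
copy_lines(src, dst)
```

For this input dst holds the same text as src. A source whose last line lacks a newline would gain one, which is the kind of non-byte-identical result described above.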
(Even high-level protocols should avoid unnecessary modifications to
files. One of the more annoying, if not crippling, limitations to the
configparser module is that reading an INI file in, then writing it out
again destroys the high-level structure of the file: comments and blank
lines are stripped, and records may be re-ordered.)
Precisely. If you read it as an INI file and then rewrite it as an INI
file, you risk damaging that sort of thing. If you parse a file as a
Python script, and then reconstitute it from the AST (with one of the
unparsers available), you have a guarantee that the result will
execute the exact same code. But it won't be the same file (although
Python's AST does guarantee order, unlike your INI file example).
Actually, this might be a useful transformation to do, sometimes -
part of a diff suite, maybe - if the old and new versions are
identical after an AST parse/unparse transformation, you don't need to
re-run tests, because there's no way a code bug can have been
introduced.
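A minimal sketch of that check, assuming Python 3.9 or later for ast.unparse; the function name same_code is made up for this example:

```python
import ast

def same_code(old_source, new_source):
    """Return True if both sources are identical after parse/unparse.

    Comments and formatting are discarded by the parse, so they are ignored.
    """
    return ast.unparse(ast.parse(old_source)) == ast.unparse(ast.parse(new_source))

# Differs only in spacing and a comment.
cosmetic_only = same_code("x = 1  # note\n", "x=1\n")

# Differs in an actual value.
real_change = same_code("x = 1\n", "x = 2\n")
```

Note that docstrings are part of the AST, so a docstring edit would count as a change under this comparison.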
ChrisA