Marko Rauhamaa
Steven D'Aprano said:
Nevertheless, there are important abstractions that are written on top
of the bytes layer, and in the Unix and Linux world, the most
important abstraction is *text*. In the Unix world, text formats and
text processing is much more common in user-space apps than binary
processing.
That Linux text is not the same thing as Python's text. Conceptually,
Python text is a sequence of 32-bit integers (Unicode code points). Linux
text is a sequence of 8-bit integers (bytes).
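To make the distinction concrete, here is a minimal Python 3 sketch. The variable names are only illustrative, and the string is written with escapes so that it is unambiguously the precomposed form of "Hyvää yötä":

```python
# "Hyvää yötä" with precomposed ä (U+00E4) and ö (U+00F6)
text = "Hyv\u00e4\u00e4 y\u00f6t\u00e4"   # Python str: a sequence of code points
data = text.encode("utf-8")               # what a pipe or file actually carries

code_points = [ord(ch) for ch in text]    # one integer per character
byte_values = list(data)                  # one integer (0..255) per byte

print(len(text), len(data))               # 10 14
print(code_points)
print(byte_values)
```

The same ten characters occupy fourteen bytes, because each of the four non-ASCII letters takes two bytes in UTF-8.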
It is great that lots of computer-to-computer formats are encoded in
ASCII (~ UTF-8). However, nowhere in Linux is there a real abstraction
layer that processes Python-esque text.
Case in point:
$ env | grep UTF
LANG=en_US.UTF-8
$ od -c <<<"Hyvää yötä" # "Good night" in Finnish
0000000   H   y   v 303 244 303 244       y 303 266   t 303 244  \n
0000017
The "od" utility is asked to display its input as characters. The locale
info gives a hint that all text data is in UTF-8. Yet what comes out is
bytes.
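The od output can be approximated from Python by walking the encoded bytes one at a time. This is a simplified sketch, not od's actual algorithm: the helper name od_c_fields is made up, and it only handles printable ASCII, newline, and octal fallback (real od -c knows more escapes). The trailing "\n" stands in for the newline that the here-string appends.

```python
def od_c_fields(data):
    """Render each byte roughly the way `od -c` does (simplified)."""
    fields = []
    for b in data:
        if b == 0x0A:
            fields.append("\\n")          # newline shown as an escape
        elif 0x20 <= b < 0x7F:
            fields.append(chr(b))         # printable ASCII shown as itself
        else:
            fields.append(f"{b:03o}")     # everything else as 3-digit octal
    return fields

# "Hyvää yötä" plus the newline added by <<<
line = "Hyv\u00e4\u00e4 y\u00f6t\u00e4\n".encode("utf-8")
print(" ".join(od_c_fields(line)))
print(f"{len(line):07o}")                 # final offset in octal: 0000017
```

Each ä comes out as the pair 303 244 and ö as 303 266, which is exactly the byte-level view shown above.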
How about:
$ wc -c <<<"Hyvää yötä"
15
$ tr 'ä' 'a' <<<"Hyvää yötä"
Hyvaaaa ya�taa
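Both results follow from treating the input as bytes. The sketch below mimics that in Python, on the assumption that tr works byte by byte and pads the shorter second set by repeating its last character (GNU tr's documented behaviour); the variable names are illustrative.

```python
# "Hyvää yötä" plus the newline added by <<<
line = "Hyv\u00e4\u00e4 y\u00f6t\u00e4\n".encode("utf-8")

byte_count = len(line)                    # what `wc -c` reports: 15

set1 = "\u00e4".encode("utf-8")           # 'ä' is two bytes: b'\xc3\xa4'
set2 = b"a"
# Pad set2 to the length of set1 by repeating its last byte.
padded = set2 + set2[-1:] * (len(set1) - len(set2))

table = bytes.maketrans(set1, padded)     # 0xC3 -> 'a', 0xA4 -> 'a'
translated = line.translate(table)        # b'Hyvaaaa ya\xb6taa\n'

# The stray 0xB6 left over from 'ö' is no longer valid UTF-8.
shown = translated.decode("utf-8", errors="replace")
print(byte_count)
print(shown, end="")
```

The ö is damaged even though it was never named in the command: its first byte, 0xC3, is shared with ä and gets rewritten, leaving an orphaned 0xB6 that the terminal shows as a replacement character.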
Grep is smarter:
$ grep v...y <<<"Hyvää yötä"
Hyvää yötä
which is why you should always prefix "grep" with LC_ALL=C in your
scripts: the match then no longer depends on the caller's locale (and it
makes grep far faster, too).
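The difference between the two grep modes can be sketched with Python's re module, which matches characters on str patterns and bytes on bytes patterns. This is only an analogy for a UTF-8 locale versus LC_ALL=C, not a description of grep's internals:

```python
import re

text = "Hyv\u00e4\u00e4 y\u00f6t\u00e4"   # "Hyvää yötä"
data = text.encode("utf-8")

# Character view: 'ä', 'ä', ' ' are three characters between v and y.
char_match = re.search(r"v...y", text)

# Byte view: the same stretch is five bytes, so three dots do not fit.
byte_match = re.search(rb"v...y", data)
wide_match = re.search(rb"v.....y", data)

print(char_match is not None)             # True
print(byte_match is not None)             # False
print(wide_match is not None)             # True
```

So the same pattern matches or fails depending on whether the input is read as characters or as bytes.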
Marko