Peter J. Holzer
with said:*SKIP*
I've read these postings but I don't know what you are referring to.
If you are referring to other postings (especially long ones), please
cite the relevant part.
[quoting <[email protected]> on]
$ echo 'a' | perl -Mutf8 -wne 's/a/å/;print' | od -xc
0000000 0ae5
345 \n
0000002
Then I don't understand what you meant by "that" in the quoted
paragraph, since that seemed to refer to something else.
If "you" above refers to me
Yes, of course. You used the term "utf8", so I was wondering what you
meant by it.
then you're wrong.
Then I don't know what you meant by "utf8". Care to explain?
Try to read it again. Slowly.
Read *what* again? The paragraph you quoted is correct and explains the
behaviour you are seeing.
Indeed, only FLAGS and PV are relevant. Sadly that Devel::Peek::Dump
doesn't provide means to filter arbitrary parts of output off (however,
that's not the purpose of D::P). And I consider editing copypastes a
bad taste.
That's not the problem. The problem is that you gave the output of
Devel::Peek::Dump which clearly showed a latin-1 character occupying
*two* bytes and then claimed that it was only one byte long. Which it
clearly wasn't. What you probably meant was that the latin1 character
would be only 1 byte long if written to an output stream without an
encoding layer. But you didn't write that. You just made an assertion
which clearly contradicted the example you had just given and didn't
even give any indication that you had even noticed the contradiction.
It's not about understanding. I'm trying to make a point that latin1 is
special.
It is only special in the sense that all its codepoints have a value <=
255. So if you are writing to a byte stream, it can be directly
interpreted as a string of bytes and written to the stream without
modification.
The point that *I* am trying to make is that an I/O stream without an
:encoding() layer isn't for I/O of *characters*, it is for I/O of
*bytes*.
Thus, when you write the string "Käse" to such a stream, you aren't
writing Upper Case K, lower case umlaut a, etc. You are writing 4 bytes
with the values 0x4B, 0xE4, 0x73, 0x65. The I/O-code doesn't care about
whether the string is a character string (with the UTF8 bit set) or a byte
string, it just interprets every element of the string as a byte. Those
four bytes could be pixels in an image, for all the Perl I/O code knows.
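To make that concrete, here is a minimal sketch (mine, not from the earlier postings) that writes the same four elements to an in-memory file with no :encoding() layer, once as a plain string and once after utf8::upgrade. The in-memory file and the variable names are only there for illustration; a real file opened with ">" should behave the same way.

```perl
use strict;
use warnings;

# "K\x{E4}se": four elements, every one of them <= 255.
my $str = "K\x{E4}se";

# Same contents, but stored internally with the UTF8 bit set.
my $upgraded = $str;
utf8::upgrade($upgraded);

# Write the plain string to an in-memory file without any :encoding() layer.
my $buf = '';
open(my $out, '>', \$buf) or die "open failed: $!";
print {$out} $str;
close($out) or die "close failed: $!";

# Write the upgraded string the same way.
my $buf2 = '';
open(my $out2, '>', \$buf2) or die "open failed: $!";
print {$out2} $upgraded;
close($out2) or die "close failed: $!";

# Show what actually landed in each "file".
for my $bytes ($buf, $buf2) {
    print join(' ', map { sprintf '%02X', ord($_) } split //, $bytes), "\n";
}
# prints "4B E4 73 65" twice
```

Both buffers end up with the same four bytes, regardless of the internal representation of the string.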
OTOH, if there is an :encoding() layer, the string is taken to be
composed of (unicode) characters. If there is an element with the
codepoint \x{E4} in the string, it is interpreted as a lower case
umlaut a, and converted to the proper encoding (e.g. one byte 0x84 for
CP850, two bytes 0xC3 0xA4 for UTF-8 and one byte 0xE4 for latin-1). But
again, this happens *always*. The Perl I/O layer doesn't care whether
the string is a character string (with the UTF8 bit set) or not.
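A small sketch of the three conversions mentioned above, again using in-memory files so it can be run anywhere. The helper name bytes_via_layer is made up for this example; the encoding names are the ones Encode ships with, as far as I know.

```perl
use strict;
use warnings;

# Print $str through an :encoding($enc) layer into an in-memory file
# and return the resulting bytes as a hex string.
sub bytes_via_layer {
    my ($enc, $str) = @_;
    my $buf = '';
    open(my $fh, ">:encoding($enc)", \$buf) or die "open failed: $!";
    print {$fh} $str;
    close($fh) or die "close failed: $!";
    return join ' ', map { sprintf '%02X', ord($_) } split //, $buf;
}

my $plain    = "\x{E4}";       # no UTF8 bit set
my $upgraded = "\x{E4}";
utf8::upgrade($upgraded);      # same character, UTF8 bit set

for my $enc ('cp850', 'UTF-8', 'iso-8859-1') {
    printf "%-10s plain=%-5s upgraded=%s\n",
        $enc, bytes_via_layer($enc, $plain), bytes_via_layer($enc, $upgraded);
}
# cp850      -> 84
# UTF-8      -> C3 A4
# iso-8859-1 -> E4
```

The plain and the upgraded string give identical bytes for each encoding, which is the "this happens *always*" part.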
Many years ago to get operations to work on characters instead of bytes
some strings must have been pulled. encoding.pm pulled right strings.
utf8.pm pulled irrelevant strings. Those days text related operations
worked for you because they fitted in latin1 script or you didn't hit
edge cases. However I did (more years ago, in 5.6.0, B<lcfirst()>
worked *only* on bytes, no matter what).
Perl acquired unicode support in its current form only in 5.8.0. 5.6.0
did have some experimental support for UTF-8-encoded strings, but it was
different and widely regarded as broken (that's why it was changed for
5.8.0). So what Perl 5.6.0 did or didn't do is irrelevant for this
discussion.
With some luck I managed to skip the 5.6 days and went directly from the
<=5.005 "bytestrings only" era to the modern >=5.8.0 "character
strings" era. However, in the early days of 5.8.x, the documentation was
quite bad and it took a lot of reading, experimenting and thinking to
arrive at a consistent understanding of the Perl string model.
But once you have this understanding, it is really quite simple and
consistent.
Guess what? I've just figured out I don't need either any more:
{40710:255} [0:0]% xxd foo.koi8-u
0000000: c6d9 d7c1 0a .....
{40731:262} [0:0]% perl -wle '
open $fh, "<:encoding(koi8-u)", "foo.koi8-u";
read $fh, $fh, -s $fh;
$fh =~ m{(\w\w)};
print $1
'
Wide character in print at -e line 5.
фы
This example doesn't have any non-ascii characters in the source code,
so of course it doesn't need 'use utf8'. The only effect of use utf8 is
to tell the perl compiler that the source code is encoded in UTF-8.
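If you want to see that effect in isolation, here is a rough sketch. It writes two tiny scripts containing the UTF-8 bytes of å to temporary files and runs them with do, so the example itself stays pure ASCII. The helper run_source and the use of File::Temp are my own scaffolding, nothing more.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write $src (raw bytes) to a temporary file, run it with do(),
# and return the value of its last statement.
sub run_source {
    my ($src) = @_;
    my ($fh, $name) = tempfile(SUFFIX => '.pl', UNLINK => 1);
    binmode($fh);
    print {$fh} $src;
    close($fh) or die "close failed: $!";
    my $result = do $name;
    die "could not run $name: ", ($@ || $!) unless defined $result;
    return $result;
}

# The UTF-8 encoding of U+00E5 (a with ring), as it would sit in a source file.
my $literal = "\xC3\xA5";

# Without "use utf8" the compiler sees two separate elements.
my $without = run_source(qq{length("$literal");\n});

# With "use utf8" the same two bytes are decoded into one character.
my $with = run_source(qq{use utf8;\nlength("$literal");\n});

print "without use utf8: $without\n";   # 2
print "with use utf8:    $with\n";      # 1
```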
But you *do* need some indication of the encoding of STDOUT (did you
notice the warning "Wide character in print at -e line 5."? As long as
you get this warning, your code is wrong).
You could use "use encoding 'utf-8'":
% perl -wle '
use encoding "UTF-8";
open $fh, "<:encoding(koi8-u)", "foo.koi8-u";
read $fh, $fh, -s $fh;
$fh =~ m{(\w\w)};
print $1
'
фы
Or you could use -C on the command line:
% perl -CS -wle '
open $fh, "<:encoding(koi8-u)", "foo.koi8-u";
read $fh, $fh, -s $fh;
$fh =~ m{(\w\w)};
print $1
'
фы
Or you could use "use open":
% perl -wle '
use open ":locale";
open $fh, "<:encoding(koi8-u)", "foo.koi8-u";
read $fh, $fh, -s $fh;
$fh =~ m{(\w\w)};
print $1
'
фы
Note: No warning in any of the three cases. The last one takes the encoding from
the environment, which hopefully matches your terminal settings. So it
works on a UTF-8 or ISO-8859-5 or KOI-8 terminal. But of course it
doesn't work on a latin-1 terminal and you get an appropriate warning:
"\x{0444}" does not map to iso-8859-1 at -e line 6.
"\x{044b}" does not map to iso-8859-1 at -e line 6.
\x{0444}\x{044b}
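The difference between the warning case and the fixed variants can also be reproduced without a terminal. The following is only a sketch: it prints the same two characters to in-memory files opened with and without an :encoding() layer and counts the warnings. The helper try_print is hypothetical, invented for this example.

```perl
use strict;
use warnings;

# Print $str to an in-memory file opened with $mode. Returns the bytes
# written (as hex) and the number of warnings raised along the way.
sub try_print {
    my ($mode, $str) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };
    my $buf = '';
    open(my $fh, $mode, \$buf) or die "open failed: $!";
    print {$fh} $str;
    close($fh) or die "close failed: $!";
    my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $buf;
    return ($hex, scalar @warnings);
}

my $str = "\x{0444}\x{044B}";   # the two characters matched by (\w\w) above

for my $mode ('>', '>:encoding(UTF-8)', '>:encoding(koi8-u)') {
    my ($hex, $count) = try_print($mode, $str);
    printf "%-20s bytes=%-12s warnings=%d\n", $mode, $hex, $count;
}
# '>' raises "Wide character in print"; the two :encoding() modes do not.
```

Without a layer the bytes happen to look like UTF-8, but only because Perl falls back to its internal representation, and it complains about it. With a layer you get the encoding you asked for and no warning.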
It comes clear to me now what made you both (you and Ben) believe in
bugginess of F<encoding.pm>. I'm fine with that.
I don't know whether encoding.pm is broken in the sense that it doesn't
do what it is documented to do (it was, but it is possible that all of
those bugs have been fixed). I do think that it is "broken as designed",
because it conflates two different things:
* The encoding of the source code of the script
* The default encoding of some I/O streams
and it does so even in an inconsistent manner (e.g. the encoding is
applied to STDOUT, but not to STDERR) and finally, because it is too
complex and that will lead to surprising results.
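For contrast, this is roughly how the two things can be declared separately. It is a sketch of one possible arrangement, not the only one: it assumes the script file itself is saved as UTF-8 and that the terminal expects UTF-8 (with "use open" and :locale, as shown earlier, the second assumption could be dropped).

```perl
use strict;
use warnings;

# 1. Encoding of the source code: this file is assumed to be saved as UTF-8.
use utf8;

# 2. Encoding of the I/O streams, declared explicitly and per stream,
#    so STDOUT and STDERR are treated the same way.
binmode(STDOUT, ':encoding(UTF-8)') or die "binmode STDOUT failed: $!";
binmode(STDERR, ':encoding(UTF-8)') or die "binmode STDERR failed: $!";

my $word = "Käse";              # four characters, thanks to "use utf8"
print length($word), "\n";      # 4
print "$word\n";                # encoded to UTF-8 by the layer on STDOUT
```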
hp