I think I get it. String literals and variables just contain strings
of bytes,
No. Perl strings do not consist of bytes. Since there is no official
name for the thingies a perl string is made of, I'll just call them
"thingies".
On the most abstract level, about the only thing we know about these
thingies is that they are numbered: You get the number of the first
thingy in a string with ord() and you can create a string containing
only a single thingy with a specific number with chr(). The numbers
range from 0 to at least 2**32-1 (a perl built with 64-bit integers
accepts larger values).
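
A minimal sketch of that numbering, using nothing but the ord(), chr()
and length() builtins (the variable names are mine, chosen only for
illustration):

```perl
use strict;
use warnings;

# ord() returns the number of the first thingy in a string.
my $first = ord("ABC");

# chr() builds a string containing exactly one thingy with that number.
my $small = chr(65);       # the thingy numbered 65
my $big   = chr(0x411);    # the thingy numbered 0x411, well above 255

# length() counts thingies, whatever their numbers are.
my $len = length($small . $big);

printf "%d %d %d\n", $first, ord($big), $len;    # prints "65 1041 2"
```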
What these thingies *mean* depends on your program. They might be
characters, they might be bytes of a graphics file, they might be
indexes, ... Perl mostly doesn't care.
Perl has two ways of storing strings: If all the thingies have numbers
below 256, the string can be stored as one thingy per byte. If this is
not the case, the thingies are encoded in UTF-8. Theoretically you
shouldn't know or care how perl stores a string.
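
If you want to peek anyway, here is a sketch using utf8::is_utf8() and
utf8::upgrade(), which are built into perl. Treat it as a debugging
aid under the assumption that you are only curious: the flag it shows
is an implementation detail, and the string's contents are the same in
either representation.

```perl
use strict;
use warnings;

my $s = "\x{FC}";          # one thingy, number 252, fits in one byte
my $before = utf8::is_utf8($s) ? 1 : 0;    # stored one thingy per byte

my $t = $s;
utf8::upgrade($t);         # switch the internal storage to UTF-8
my $after = utf8::is_utf8($t) ? 1 : 0;

# The storage changed, the string did not.
my $same = ($s eq $t) ? 1 : 0;
my $len  = length($t);

print "before=$before after=$after same=$same length=$len\n";
```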
In reality, Perl does assign some meaning to the type of the string. If
a string is utf8-encoded, Perl assumes that the thingies are really
Unicode code points, so "\x{FC}" matches /\w/ if it happens to be a
utf8-encoded string, but doesn't if it's a byte-encoded string (I'm
ignoring locales for now). For this reason the utf8-encoded strings are
often called "character strings" and the byte-encoded strings are called
"byte strings".
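
A small sketch of that difference. It assumes a plain script with no
locale, no "use feature 'unicode_strings'" and no /u or /a modifier on
the regex, since any of those changes the rules:

```perl
use strict;
use warnings;

my $bytes = "\x{FC}";      # stored one thingy per byte
my $chars = "\x{FC}";
utf8::upgrade($chars);     # same thingy, now stored as UTF-8

# Same string contents, different answer from \w.
my $m_bytes = ($bytes =~ /\w/) ? 1 : 0;
my $m_chars = ($chars =~ /\w/) ? 1 : 0;

print "byte string matches: $m_bytes, character string matches: $m_chars\n";
```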
Since files consist of bytes, you can only ever read bytes from a file
and write bytes to it. So when you read a file and want to treat it as a
series of characters instead of bytes, you have to "decode" it, and when
you have a character string which you want to write to a file, you have
to "encode" it. You can do that with the subs from the "Encode" module
or with I/O layers, and modules written to deal with specific file
formats (like XML) do that automatically.
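
Both routes, sketched with the two bytes that encode U+0411 in UTF-8.
To keep the example self-contained I read from an in-memory filehandle
instead of a real file; the variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $raw = "\xD0\x91";                  # two bytes, as read from a file

# Route 1: the Encode subs.
my $text = decode('UTF-8', $raw);      # bytes -> one character
my $out  = encode('UTF-8', $text);     # character -> bytes again

# Route 2: an I/O layer decodes while reading.
open(my $fh, '<:encoding(UTF-8)', \$raw) or die "open: $!";
my $line = <$fh>;
close($fh);

printf "raw=%d decoded=%d via layer=%d\n",
    length($raw), length($text), length($line);
```

For output, the matching layer is '>:encoding(UTF-8)' on open(), or
binmode() on a handle that is already open.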
Now I'm surprised that the following dippy little tag-stripping
program, which is XML-unaware and has no settings whatever relating to
encoding, works.
#!/usr/bin/perl
use strict;
use warnings;
my ($file, $line, $i);
while (@ARGV) {
    $file = shift(@ARGV);
    open(F, "<", $file);
    $i = 0;
    while ($line = <F>) {
        $i++;
        chomp($line);
        $line =~ s!<[^>]+>!*!g;
        print($file . " > " . $line . "\n");
        last if ($i > 11);
    }
    close(F);
}
When I run this over my UTF-8 XML files, I get correct-looking, mixed
Cyrillic and Roman output, with no warnings --- why?
Because UTF-8 is designed in such a way that this should work.
Your program reads and writes the files as a series of bytes. If your
file contains a Cyrillic character, for example "Б", it will read and
write two bytes (0xD0 0x91) instead. Since that happens both on input
and on output, it doesn't matter. If, however, you treat the individual
bytes of a multibyte character as characters, then your program will
break. For example, if you want to insert a blank before each character
and put a
    $line =~ s!(.)! $1!g;
in your program it won't work because it converts the byte sequence
0xD0 0x91 into the byte sequence 0x20 0xD0 0x20 0x91, which is not a
proper UTF-8 sequence. You must properly decode your input and encode
your output if you want to do this (or deal with the encoding in your
code).
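
Here is a sketch of both variants side by side, using Encode on an
in-memory string rather than your files. The names $broken and $fixed
are mine.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $raw = "\xD0\x91";                  # the two bytes of one UTF-8 character

# Byte-wise: the substitution sees two thingies and splits the pair.
(my $broken = $raw) =~ s!(.)! $1!g;

# Character-wise: decode first, substitute, encode again for output.
my $text = decode('UTF-8', $raw);
$text =~ s!(.)! $1!g;
my $fixed = encode('UTF-8', $text);

# Show the resulting bytes in hex.
printf "broken: %s\n", join(' ', map { sprintf '%02X', ord } split //, $broken);
printf "fixed:  %s\n", join(' ', map { sprintf '%02X', ord } split //, $fixed);
```

The first line of output is 20 D0 20 91, the invalid sequence described
above; the second is 20 D0 91, a blank followed by the intact
character.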
hp