Assigning another filehandle to STDOUT, using binmode.

Adam Funk

No. By default it assumes (on Unix) binary input. You are reading and
writing a stream of bytes, not a stream of characters.

I think I get it. String literals and variables just contain strings
of bytes, and encoding is a consideration only for input and output
--- or is that only for output?

Now I'm surprised that the following dippy little tag-stripping
program, which is XML-unaware and has no settings whatever relating to
encoding, works.


#!/usr/bin/perl

use strict;
use warnings;

my ($file, $line, $i);

while (@ARGV) {
    $file = shift(@ARGV);
    open(F, "<", $file) or die "Cannot open $file: $!";
    $i = 0;
    while ($line = <F>) {
        $i++;
        chomp($line);
        $line =~ s!<[^>]+>!*!g;
        print($file . " > " . $line . "\n");
        last if ($i > 11);
    }
    close(F);
}


When I run this over my UTF-8 XML files, I get correct-looking, mixed
Cyrillic and Roman output, with no warnings --- why?
 
Peter J. Holzer

I think I get it. String literals and variables just contain strings
of bytes,

No. Perl strings do not consist of bytes. Since there is no official
name for the thingies a perl string is made of, I'll just call them
"thingies".

On the most abstract level, about the only thing we know about these
thingies is that they are numbered: You get the number of the first
thingy in a string with ord() and you can create a string containing
only a single thingy with a specific number with chr(). The numbers
range from 0 to at least 2**32-1 (higher on perls built with 64-bit
integers).
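
A minimal sketch of the two functions named above, ord() and chr(). The variable names are made up for illustration only.

```perl
use strict;
use warnings;

# chr() builds a one-thingy string from a number; ord() returns the
# number of the first thingy in a string.
my $small = chr(0x41);     # a number below 256
my $large = chr(0x411);    # a number above 255

printf "%d %d\n", ord($small), ord($large);          # prints "65 1041"
printf "%d %d\n", length($small), length($large);    # prints "1 1"
```

Both strings have length 1, whatever the size of the number.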

What these thingies *mean* depends on your program. They might be
characters, they might be bytes of a graphics file, they might be
indexes, ... Perl mostly doesn't care.

Perl has two ways of storing strings: If all the thingies have numbers
below 256, the string can be stored as one thingy per byte. If this is
not the case, the thingies are encoded in UTF-8. Theoretically you
shouldn't know or care how perl stores a string.
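
The two storage formats can be made visible with utf8::is_utf8() and utf8::upgrade(), which are part of core perl. This is only a sketch for peeking at the internals; the variable names are illustrative, and ordinary programs should not need these calls.

```perl
use strict;
use warnings;

my $bytes = "\x{FC}";    # all thingies below 256: stored one per byte
my $wide  = $bytes;
utf8::upgrade($wide);    # same thingies, now stored internally as UTF-8

# utf8::is_utf8() reports which internal storage format is in use.
print utf8::is_utf8($bytes) ? "utf8" : "bytes", "\n";    # prints "bytes"
print utf8::is_utf8($wide)  ? "utf8" : "bytes", "\n";    # prints "utf8"

# The two strings still compare equal and have the same length.
print "equal\n" if $bytes eq $wide;
print length($wide), "\n";                               # prints "1"
```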

In reality, Perl does assign some meaning to the type of the string. If
a string is utf8-encoded, Perl assumes that the thingies are really
Unicode code points, so "\x{FC}" matches /\w/ if it happens to be a
utf8-encoded string, but doesn't if it's a byte-encoded string (I'm
ignoring locales for now). For this reason the utf8-encoded strings are
often called "character strings" and the byte-encoded strings are called
"byte strings".
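
The /\w/ difference can be sketched as follows. This assumes no locale is in effect, no /u, /a or /l regex modifier, and no `use feature 'unicode_strings'` (which newer perls offer to make both cases behave alike). The variable names are illustrative.

```perl
use strict;
use warnings;

my $byte_str = "\x{FC}";       # byte-encoded internally
my $char_str = "\x{FC}";
utf8::upgrade($char_str);      # same thingy, utf8-encoded internally

# The same pattern is applied to two strings that compare equal.
my $byte_match = $byte_str =~ /\w/ ? 1 : 0;
my $char_match = $char_str =~ /\w/ ? 1 : 0;

print "byte string: $byte_match, character string: $char_match\n";
```

Under those assumptions only the utf8-encoded string matches, even though `$byte_str eq $char_str` is true.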

Since files consist of bytes, you can always only read bytes from a file
and write bytes to it. So when you read a file and want to treat it as a
series of characters instead of bytes, you have to "decode" it, and when
you have a character string which you want to write to a file, you have
to "encode" it. You can do that with the subs from the "Encode" module
or with I/O layers, and modules written to deal with specific file
formats (like XML) do that automatically.
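
Both approaches can be sketched like this, assuming the data is UTF-8. The in-memory filehandle is only there to keep the example self-contained; a real program would open a file name instead.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $bytes = "\xD0\x91";                 # two bytes, as found in a UTF-8 file
my $chars = decode('UTF-8', $bytes);    # one character
my $again = encode('UTF-8', $chars);    # two bytes again

printf "%d byte(s) in, %d character(s), %d byte(s) out\n",
    length($bytes), length($chars), length($again);

# The same decoding done by an I/O layer while reading.
open(my $in, '<:encoding(UTF-8)', \$bytes) or die "open failed: $!";
my $line = <$in>;
close($in);
print length($line), "\n";              # one character was read
```

For output, the matching layer can be set with open() on a '>:encoding(UTF-8)' mode or with binmode() on an already open handle.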

Now I'm surprised that the following dippy little tag-stripping
program, which is XML-unaware and has no settings whatever relating to
encoding, works.


#!/usr/bin/perl

use strict;
use warnings;

my ($file, $line, $i);

while (@ARGV) {
    $file = shift(@ARGV);
    open(F, "<", $file) or die "Cannot open $file: $!";
    $i = 0;
    while ($line = <F>) {
        $i++;
        chomp($line);
        $line =~ s!<[^>]+>!*!g;
        print($file . " > " . $line . "\n");
        last if ($i > 11);
    }
    close(F);
}


When I run this over my UTF-8 XML files, I get correct-looking, mixed
Cyrillic and Roman output, with no warnings --- why?

Because UTF-8 is designed in such a way that this should work :).

Your program reads and writes the files as a series of bytes. If your
file contains a Cyrillic character, for example "Б", it will read and
write two bytes (0xD0 0x91) instead. Since that happens both on input
and on output, it doesn't matter. If you treat the individual bytes of a
multibyte character as characters, then your program will break. For
example, if you want to insert a blank before each character and put a

$line =~ s!(.)! $1!g;

in your program it won't work because it converts the byte sequence
0xD0 0x91 into the byte sequence 0x20 0xD0 0x20 0x91, which is not a
proper UTF-8 sequence. You must properly decode your input and encode
your output if you want to do this (or deal with the encoding in your
code).
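
A small sketch of the difference, using the two bytes from the example above and assuming the Encode module for the decode and encode steps. The variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $raw = "\xD0\x91";    # bytes as read from the file without decoding

# Byte-wise: a blank lands before each byte, splitting the sequence.
(my $broken = $raw) =~ s|(.)| $1|g;

# Character-wise: decode first, substitute, then encode for output.
my $text = decode('UTF-8', $raw);
$text =~ s|(.)| $1|g;
my $fixed = encode('UTF-8', $text);

# Show both results as hex bytes.
print join(' ', map { sprintf('0x%02X', ord($_)) } split(//, $broken)), "\n";
print join(' ', map { sprintf('0x%02X', ord($_)) } split(//, $fixed)), "\n";
```

The first line printed is 0x20 0xD0 0x20 0x91 (broken), the second is 0x20 0xD0 0x91 (a blank followed by an intact two-byte character).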

hp
 
Adam Funk

I think I'm getting this. Thanks!
 
Dr.Ruud

Peter J. Holzer wrote:
Dr.Ruud:

True, but not the answer to Adam's question. Not every perl string is
a perl text string. Strings can be used to store non-textual
information.

You should read more carefully, I wrote "A Perl *text* string". The
concept is further defined in perlunitut.
Together it is a complete answer to Adam's question.
 
Peter J. Holzer

Peter J. Holzer wrote:

You should read more carefully, I wrote "A Perl *text* string".

I did read this. That's why I wrote "Not every perl string is a perl
*text* string" (emphasis added). Adam asked about "String literals and
variables". While some point can be made that string literals are
supposed to always contain text strings, that certainly isn't true about
variables.

The concept is further defined in perlunitut.

Perlunitut is good reading. If you had just recommended that Adam should
read this, I wouldn't have objected. But your first sentence was IMHO
missing the point and possibly misleading.

hp
 
Dr.Ruud

Peter J. Holzer wrote:
Dr.Ruud:

I did read this. That's why I wrote "Not every perl string is a perl
*text* string" (emphasis added).

This is getting ridiculous. I wrote "Perl text string", and you reacted
to something you call "every perl string", which I didn't write. (Adam
is dealing with Perl text strings, or he should be.)

I was not talking about "every perl string", I was specifically
isolating the "Perl text string"-type-of-Perl-string, by explicitly
referring to it as "Perl text string", in an introduction to (so related
to) perlunitut. There was, contrary to what you read into it, nothing
incomplete about it.
Yes, it assumes that you actually read perlunitut, which is easy to read
and understand, but why would I have written "See perlunitut" otherwise?
Should I maybe have written "Read and follow perlunitut" instead of
"See perlunitut" for you to get the picture?

See also `perldoc Encode`, it defines all strings in Perl as sequences
of characters (and binary strings as just a subset of Perl strings),
which is different from how perlunitut projects it.
 
Adam Funk

Peter J. Holzer wrote:

This is getting ridiculous. I wrote "Perl text string", and you reacted
to something you call "every perl string", which I didn't write. (Adam
is dealing with Perl text strings, or he should be.)

I hope you don't mind if I butt in here ;-) to say that you've *both*
given me very helpful and informative replies!

To be fair, perlunitut does deal with both kinds of strings but it
clarifies that there are two different kinds, and I figured out that I
was interested in text strings.

See also `perldoc Encode`, it defines all strings in Perl as sequences
of characters (and binary strings as just a subset of Perl strings),
which is different from how perlunitut projects it.

That sounds interesting; I'll take a look at that too.

Thanks (to both of you),
Adam
 
