Wes Groleau
I have a file containing thousands of Spanish words, encoded (AFAIK)
in UTF-8. I also have a perl script in UTF-8, which says (hope
pasting works):
#!/usr/bin/perl -w -CSD
#
# NOTE: The extra space in the bang line is mandatory (bug in perl 5.8)
use warnings;
use strict;
use utf8;
while (<>)
{
print if ( /ñ/ )
}
What is in the regexp is supposed to be "small n with tilde"
and I verified with od -xc that it is hex C3 B1 as is every
place in the file where that letter appears.
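As a cross-check on those byte values, here is a minimal sketch (not part of my script; has_enye is a made-up helper name) that decodes the raw bytes C3 B1 explicitly with the Encode module and then matches on the code point U+00F1. Written this way, the test should not depend on how the script itself, or this pasted text, is encoded. It assumes the input really is UTF-8:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (an assumption for illustration, not from the
# original script): decode raw UTF-8 bytes into Perl characters, then
# look for U+00F1 (small n with tilde) by code point.
sub has_enye {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);
    return $chars =~ /\x{F1}/ ? 1 : 0;
}

# "ni\xC3\xB1o" spells out the UTF-8 bytes for the word with the enye;
# "nino" is plain ASCII and contains no such character.
for my $word ("ni\xC3\xB1o", "nino") {
    print has_enye($word) ? "match\n" : "no match\n";
}
```

The point of escaping both the bytes and the code point is that nothing in this sketch is non-ASCII, so a mangled paste cannot change what it tests.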
The script is intended to find all words containing that
letter. But it finds nothing. After wading through gallons
of text (man encoding, man utf8, man perlunicode, etc.),
I still had no reason to think it was wrong. But I added
use encoding "utf8";
and ran it again, getting only:
Ze-Admins-Computer:~/Desktop wgroleau$ char-find palabras.utf8
Malformed UTF-8 character (unexpected non-continuation byte 0x00,
immediately after start byte 0xde) at
/Volumes/Parents/wgroleau/bin/char-find line 12.
?!? According to 'od -xc' the script does NOT contain any
byte that is 0xde. In fact, the ONLY bytes in the script that
are not ASCII are the bytes for the "enye" which are on line
twelve, but neither of them is a DE and NO bytes are 00.
Have I found a bug in perl or is my ignorance just getting
the best of me?
Oh, yeah, I also tried a few things with 'binmode' that didn't
work either.
WWG