bill_mckinnon
I've spent some time trying to understand Perl's Unicode support and
its nuances, and I think I actually understand some amount of it. But
the behavior of this snippet of code is puzzling me at the moment:
--
#!/usr/local/bin/perl -w
use Encode qw(decode);
$s = decode('utf8', "Version"); # String w/utf8 flag set
$s =~ s/v\xc3\x83//i;
--
Running this with Perl 5.8.6 on Linux (and Windows) produces this
warning:
$ ./test.pl
Malformed UTF-8 character (unexpected non-continuation byte 0x00,
immediately after start byte 0xc3) in substitution (s///) at ./test.pl
line 7.
$
Granted, what I'm trying to do is to match the literal utf8 bytes
for a Unicode character against a Unicode string, which may not be a
reasonable thing to do. But the way this fails doesn't make any sense
to me; I don't have a null byte after (or before) the \xc3 byte in my
regex. Also, if the regex string were being upgraded to Unicode
(presumably from iso-latin-1), I could see it not doing what I intended,
but that shouldn't cause this warning; it should just fail to match the
way I want. And if the \x sequences were taken to be code points instead
of literal bytes, that's fine too...it may not do what I want, but it
still shouldn't cause this warning.
Does anyone know why this warning is coming up? It makes me think
there's more going on under the surface than just an extra iso-latin-1
-> utf8 conversion. Thanks in advance for any insight.
- Bill
P.S. - I can do the match I want by using the results of
encode('utf8', $s) to do the match; since it's a byte
string everything works fine. But I want to understand
what the issue was with the warning.
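For reference, here is a minimal sketch of the workaround I mean. The
variable name $bytes is just for illustration, and I'm assuming that
encode() hands back a plain byte string without the utf8 flag, which is
my understanding of the Encode docs:

```perl
#!/usr/local/bin/perl -w
use strict;
use Encode qw(decode encode);

# Same character string as in the snippet above.
my $s = decode('utf8', "Version");

# Workaround: convert back to UTF-8 octets and match against those.
# $bytes is an illustrative name, not anything from a library.
my $bytes = encode('utf8', $s);

# Same pattern as before, now applied to a byte string.
$bytes =~ s/v\xc3\x83//i;

# Report whether the target string carries the utf8 flag.
print utf8::is_utf8($bytes) ? "utf8 flag set\n" : "byte string\n";
```

The pattern doesn't match "Version" in either case, so nothing gets
substituted; the only difference I'm after is matching against bytes.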