UTF-8 in regexp with 5.8.1

Wes Groleau

I have a file containing thousands of Spanish words, encoded (AFAIK)
in UTF-8. I also have a Perl script in UTF-8, which says (hope
pasting works):
#!/usr/bin/perl -w -CSD
#
# NOTE: The extra space in the bang line is mandatory (bug in perl 5.8)
use warnings;
use strict;
use utf8;

while (<>)
{
    print if ( /ñ/ );
}

What is in the regexp is supposed to be "small n with tilde"
and I verified with od -xc that it is hex C3 B1 as is every
place in the file where that letter appears.

The script is intended to find all words containing that
letter. But it finds nothing. After wading through gallons
of text (man encoding, man utf8, man perlunicode, etc.),
I still had no reason to think it was wrong. But I added

use encoding "utf8";

and ran it again, getting only:
Ze-Admins-Computer:~/Desktop wgroleau$ char-find palabras.utf8
Malformed UTF-8 character (unexpected non-continuation byte 0x00,
immediately after start byte 0xde) at
/Volumes/Parents/wgroleau/bin/char-find line 12.

?!? According to 'od -xc' the script does NOT contain any
byte that is 0xde. In fact, the ONLY bytes in the script that
are not ASCII are the bytes for the "enye", which are on line
twelve, but neither of them is a DE and NO bytes are 00.

Have I found a bug in perl or is my ignorance just getting
the best of me?

Oh, yeah, I also tried a few things with 'binmode' that didn't
work either.

WWG
 
Alan J. Flavell

I have a file containing thousands of Spanish words, encoded (AFAIK)
in UTF-8.

Well, your whole report stands or falls by that "AFAIK", so it might
be useful to have a test case, including data, which we could run for
ourselves (preferably on a web page, to exclude any possibility of
lossage in usenet postings) to help pin down your problem.
I also have a perl script in UTF-8,

Noted, although I don't see any compelling reason to code the script
itself in utf-8. Sure, you /can/ do, but it seems to me to be a
potential additional complication that one could do well to avoid
when feasible.
#!/usr/bin/perl -w -CSD
#
# NOTE: The extra space in the bang line is mandatory (bug in perl 5.8)

Do you have a cite on that? My knowledge of this area is admittedly
somewhat limited, but I hadn't met this before.
What is in the regexp is supposed to be "small n with tilde"
and I verified with od -xc that it is hex C3 B1 as is every
place in the file where that letter appears.

Sounds good. That even seems to have worked in your usenet posting,
as far as I can see.
use encoding "utf8";

and ran it again, getting only:
Ze-Admins-Computer:~/Desktop wgroleau$ char-find palabras.utf8
Malformed UTF-8 character (unexpected non-continuation byte 0x00,
immediately after start byte 0xde) at
/Volumes/Parents/wgroleau/bin/char-find line 12.

?!?
Bizarre.

According to 'od -xc' the script does NOT contain any
byte that is 0xde. In fact, the ONLY bytes in the script that
are not ASCII are the bytes for the "enye", which are on line
twelve, but neither of them is a DE and NO bytes are 00.

I've successfully processed utf-8 and utf-16 data without the use of
the -C flag(s), by using explicit binmode() on the relevant files.
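For illustration only, here is a minimal sketch of the explicit approach, not the exact code used above. It assumes the data file really is UTF-8; the sub name matching_lines is made up for this example. The file is opened with an explicit :encoding(UTF-8) layer rather than relying on -C, so the regexp sees decoded characters.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;    # assumption: this script file is itself saved as UTF-8

# Hypothetical helper: return the lines of $path that contain U+00F1.
# The input layer decodes UTF-8 bytes into characters before matching.
sub matching_lines {
    my ($path) = @_;
    open(my $in, '<:encoding(UTF-8)', $path)
        or die "Cannot open $path: $!";
    my @hits = grep { /ñ/ } <$in>;
    close $in;
    return @hits;
}

# Encode characters back to UTF-8 on the way out.
binmode(STDOUT, ':encoding(UTF-8)');

print matching_lines($_) for @ARGV;
```

It would be run as, say, char-find palabras.utf8, with no -C on the bang line at all.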

If you could at least get one working variant of your script, you
could then at least move forward from there.

Sorry, this is a bit inconclusive, as yet.
 
Wes Groleau

Alan said:
Well, your whole report stands or falls by that "AFAIK", so it might

Well, I told my editor to save it as UTF-8, and I think it works.
(When I save web pages that way, and specify UTF-8 in a META tag,
Spanish, French, Polish, and Japanese characters are correctly
rendered by most browsers.)
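A quick way to take the "AFAIK" out of it would be to have perl itself check the bytes. The following is only a sketch: the sub name is_valid_utf8 is invented here, and it assumes the Encode module that ships with perl 5.8. It reads each file raw and reports whether it decodes as strict UTF-8.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: 1 if $bytes is well-formed UTF-8, 0 otherwise.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # decode() with a CHECK argument may modify its input, so use a copy.
    my $copy = $bytes;
    return eval { decode('UTF-8', $copy, FB_CROAK); 1 } ? 1 : 0;
}

for my $path (@ARGV) {
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    my $data = do { local $/; <$fh> };    # slurp the whole file as bytes
    close $fh;
    printf "%s: %s\n", $path,
        is_valid_utf8($data) ? 'valid UTF-8' : 'NOT valid UTF-8';
}
```

Note that a pure-ASCII file also passes, so this only rules out other encodings such as Latin-1; it does not prove the enye is present.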
Noted, although I don't see any compelling reason to code the script
itself in utf-8. Sure, you /can/ do, but it seems to me to be a
potential additional complication that one could do well to avoid

Well, in this case, I am trying to regexp a non-ASCII character.
Since I am an easily-distracted (A.D.D.) type, and I work with
several different character sets, I am attempting to standardize
on UTF-8 rather than constantly be debugging places where I forgot
to make a switch. :)
Do you have a cite on that? My knowledge of this area is admittedly
somewhat limited, but I hadn't met this before.

Oh, I reported that a while back. If I take the space out
on Mac OS X, I get frequent segment violations. If I remove
the space on NetBSD/Alpha, I get consistent nasty-grams about
the wrong method of invoking the debugger.
I've successfully processed utf-8 and utf-16 data without the use of
the -C flag(s), by using explicit binmode() on the relevant files.

I tried a couple of things with binmode that also didn't work,
but I don't remember exactly what happened.
If you could at least get one working variant of your script, you
could then at least move forward from there.

Sorry, this is a bit inconclusive, as yet.

Well, a post in another thread made me try removing the
"use utf8" and it worked. So, I really think this is
a bug:

- A regexp containing a non-ASCII character in
correct UTF-8 encoding works.

- Add "use utf8" and it silently stops working.

- Add 'use encoding "utf8"' and you get chewed out
for having invalid UTF-8, in a message that bitches
about the presence of bytes that don't exist.
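One possible reading of the first two observations, not confirmed anywhere in this thread, is that the input was reaching the regexp as undecoded bytes. Under "use utf8" the literal in the pattern is the single character U+00F1, while an undecoded line holds the two bytes C3 B1, so the two cannot match. A small sketch of that mismatch (matches_enye is a made-up name, and the script is assumed to be saved as UTF-8):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                  # pattern literal below is one character, U+00F1
use Encode qw(decode);

# Hypothetical helper: 1 if the string contains U+00F1, else 0.
sub matches_enye {
    my ($s) = @_;
    return $s =~ /ñ/ ? 1 : 0;
}

my $raw     = "ma\xC3\xB1ana";          # bytes as read from an undecoded handle
my $decoded = decode('UTF-8', $raw);    # the same text as characters

printf "raw:     length %d, match %d\n", length($raw),     matches_enye($raw);
printf "decoded: length %d, match %d\n", length($decoded), matches_enye($decoded);
# prints:
# raw:     length 7, match 0
# decoded: length 6, match 1
```

Without "use utf8" the pattern itself is the two bytes C3 B1, which would match undecoded input byte for byte; that would be consistent with the script working once the pragma was removed.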

I'll send it in .....
 
