Apparent bug in Perl 5.10 regexes w. UTF-8 expression

Ben Bullock

I've found a place where Perl seems to behave differently depending on
whether something is marked as UTF-8 or not, regardless of the fact that
it is just ASCII.

In the following code snippet,

#!/usr/local/bin/perl -lw
use strict;
use Encode 'decode';
use Lingua::JA::FindDates 'subsjdate';
binmode STDERR,"utf8";
binmode STDOUT,"utf8";
print STDERR "first try\n";
my $test = "ABCDEFG";
print subsjdate($test);
print STDERR "now try again\n";
$test = decode ('utf8', $test);
print subsjdate($test);

the output is like this:

ben ~ 541 $ ./test2.pl
first try

Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
ABCDEFG
now try again

ABCDEFG
ben ~ 542 $

But, if I

use utf8;

and call the routine with a non-ASCII string, like 平成, I don't get the
error messages.

What's more, after about one hour of exhaustive checking, I'm fairly sure
that there is no uninitialized value in the pattern match in question. In
fact I can remove the error message by removing a variable which is
initialized, called $kanjidigits, from the pattern match, but that seems
even more weird.

I think the above-described behaviour, regardless of any errors in the
module, indicates an error in Perl. Also, I think there is nothing wrong
with the module. Does anybody have any other opinions?
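
For anyone who wants to see the "marked as UTF-8" state in isolation, here is a minimal sketch. It is a stand-in under stated assumptions, not the code from Lingua::JA::FindDates: it uses utf8::upgrade rather than decode to switch the internal UTF-8 flag on, and the variable names $plain and $flagged are made up for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two copies of the same ASCII-only string.
my $plain   = "ABCDEFG";
my $flagged = "ABCDEFG";

# Assumption: utf8::upgrade is used here as a stand-in for decode().
# It switches the internal UTF-8 flag on without changing the characters.
utf8::upgrade($flagged);

printf "plain:   is_utf8=%d\n", utf8::is_utf8($plain)   ? 1 : 0;
printf "flagged: is_utf8=%d\n", utf8::is_utf8($flagged) ? 1 : 0;
printf "equal under eq: %s\n",  $plain eq $flagged ? "yes" : "no";
```

The two strings compare equal with eq, so any difference in how a pattern match treats them comes from the internal flag and not from the characters themselves.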
 
Peter J. Holzer

I've found a place where Perl seems to behave differently depending on
whether something is marked as UTF-8 or not, regardless of the fact that
it is just ASCII.

[...]

What's more, after about one hour of exhaustive checking, I'm fairly sure
that there is no uninitialized value in the pattern match in question.

Right. Your problem can be reproduced with this script:

#!/usr/bin/perl
use warnings;
use strict;

my $regex =
"([\x{ff10}-\x{ff19}0-9]{4}|[\x{5341}\x{516d}\x{4e03}\x{4e5d}\x{4e94}\x{56db}\x{5343}\x{767e}\x{4e8c}\x{4e00}\x{516b}\x{4e09}]?\x{5343}[\x{5341}\x{516d}\x{4e03}\x{4e5d}\x{4e94}\x{56db}\x{5343}\x{767e}\x{4e8c}\x{4e00}\x{516b}\x{4e09}]*)\\s*\x{5e74}";
my $test = "ABCDEFG";
if ($test =~ /($regex)/) {
    print "m:<$1>\n";
}
__END__

If the last character ("\x{5e74}") is removed from the regexp, the
warning vanishes. But if the capturing () is removed (leaving just
"\\s*\x{5e74}"), the warning vanishes, too - so it's not just \x{5e74}
which triggers the warning, only that combined with something else.
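
The variations described above can be checked mechanically. The following is a rough sketch, not a diagnosis: it splits the regex into pieces (the names $digits, $group, $space, $nen, %variant and %report are mine, not from the module), builds the three variants, and counts the warnings each match emits via a $SIG{__WARN__} handler. What it reports will depend on the perl version it is run under.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pieces of the regex from the script above (names are hypothetical).
my $digits = "\x{5341}\x{516d}\x{4e03}\x{4e5d}\x{4e94}\x{56db}"
           . "\x{5343}\x{767e}\x{4e8c}\x{4e00}\x{516b}\x{4e09}";
my $group  = "([\x{ff10}-\x{ff19}0-9]{4}|[${digits}]?\x{5343}[${digits}]*)";
my $space  = "\\s*";
my $nen    = "\x{5e74}";

# The three variants discussed in the post.
my %variant = (
    'full regex'              => $group . $space . $nen,
    'without final character' => $group . $space,
    'without capturing group' => $space . $nen,
);

my $test = "ABCDEFG";
my %report;

for my $name (sort keys %variant) {
    my @warnings;
    # Collect warnings instead of letting them go to STDERR.
    local $SIG{__WARN__} = sub { push @warnings, @_ };

    my $matched = ($test =~ /($variant{$name})/) ? 1 : 0;

    $report{$name} = { matched => $matched, warnings => scalar @warnings };
    printf "%-26s matched=%d warnings=%d\n",
        $name, $matched, scalar @warnings;
}
```

None of the variants should match "ABCDEFG"; the interesting column is the warning count, which is where different perl versions may disagree.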

hp
 
Ben Morrow

Quoth "Peter J. Holzer":
I've found a place where Perl seems to behave differently depending on
whether something is marked as UTF-8 or not, regardless of the fact that
it is just ASCII.

Right. Your problem can be reproduced with this script:

#!/usr/bin/perl
use warnings;
use strict;

my $regex =
"([\x{ff10}-\x{ff19}0-9]{4}|[\x{5341}\x{516d}\x{4e03}\x{4e5d}\x{4e94}\x{56db}\x{5343}\x{767e}\x{4e8c}\x{4e00}\x{516b}\x{4e09}]?\x{5343}[\x{5341}\x{516d}\x{4e03}\x{4e5d}\x{4e94}\x{56db}\x{5343}\x{767e}\x{4e8c}\x{4e00}\x{516b}\x{4e09}]*)\\s*\x{5e74}";

Using utf8 in regexen is not well-supported in 5.8; in particular, the
regex engine is not consistent about when to apply utf8 semantics and
when to apply byte semantics. Some of the bugs have been fixed in 5.10;
I don't know if they all have.
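
One well-documented case of the engine switching between byte and character semantics can be shown in a few lines. This is only a sketch of the general issue, and it is not claimed to be the cause of the warning in this thread. It assumes a perl without the unicode_strings feature or the /u modifier in effect, and the variable names are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The same single character, U+00E9, stored two ways.
my $bytes = "\xe9";        # native 8-bit string, UTF-8 flag off
my $chars = "\xe9";
utf8::upgrade($chars);     # same character, UTF-8 flag on

# With default regex semantics, whether \w matches a character in the
# range 128-255 depends on the flag, not on the character.
my $bytes_word = ($bytes =~ /\w/) ? 1 : 0;
my $chars_word = ($chars =~ /\w/) ? 1 : 0;

print "strings equal: ", ($bytes eq $chars ? "yes" : "no"), "\n";
print "flag off, matches \\w: $bytes_word\n";
print "flag on,  matches \\w: $chars_word\n";
```

Two strings that are equal under eq giving different match results is the kind of inconsistency meant here.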

Ben
 
Ben Bullock

Quoth "Peter J. Holzer":
I've found a place where Perl seems to behave differently depending on
whether something is marked as UTF-8 or not, regardless of the fact that
it is just ASCII.

Right. Your problem can be reproduced with this script:

#!/usr/bin/perl
use warnings;
use strict;

my $regex =
"([\x{ff10}-\x{ff19}0-9]{4}|[\x{5341}\x{516d}\x{4e03}\x{4e5d}\x{4e94}\x{56db}\x{5343}\x{767e}\x{4e8c}\x{4e00}\x{516b}\x{4e09}]?\x{5343}[\x{5341}\x{516d}\x{4e03}\x{4e5d}\x{4e94}\x{56db}\x{5343}\x{767e}\x{4e8c}\x{4e00}\x{516b}\x{4e09}]*)\\s*\x{5e74}";

Using utf8 in regexen is not well-supported in 5.8; in particular, the
regex engine is not consistent about when to apply utf8 semantics and
when to apply byte semantics. Some of the bugs have been fixed in 5.10;
I don't know if they all have.

The problem I described is the behaviour of Perl 5.10:

ben ~ 501 $ perl --version

This is perl, v5.10.0 built for i686-linux

Copyright 1987-2007, Larry Wall

Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on
this system using "man perl" or "perldoc perl". If you have access to the
Internet, point your browser at http://www.perl.org/, the Perl Home Page.

ben ~ 502 $ ./test2.pl
first try

Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.

etc.

Should I report this as a bug?
 
