Problems with use locale and regexp

felix.ostmann

It is really strange!

First, the code:
#####################################
#!/usr/bin/perl

use strict;
use warnings;
use locale; ## WONT WORK
use Encode;

my $content = Encode::decode("iso-8859-15","[[lmo:Met\xE0jalcal\xEDtt]]\n");

$content =~ s!^\[\[[a-z]{2}:.*\]\]$!!gm; ## WONT WORK
# $content =~ s!^\[\[[a-z]{2}:.*\]$!!gm; ## WORK

print $content;
#####################################

This bug(?) shocked me while I was parsing Wikipedia data.
After 69 articles my import process stopped making progress, but it kept
using a lot of CPU time ... strange.

After some hours I found out that this small script reproduces the
problem: Perl never finishes executing the pattern.

Without "use locale;" it works. With the second regexp it works too
(it looks for only one \] at the end of the line).

I can't believe it ... I would expect Perl to find out right after
"[[lmo:" that the string doesn't match, so why does the \]\] at the end
matter so much?

Why does this only happen with "use locale;"? I set the locale to
POSIX, C, en_GB and de_DE, and none of them helped :(

What can I do?
 
felix.ostmann

Sorry for any newline inside the string in my posted code :( The call
should read:

my $content = Encode::decode("iso-8859-15","[[lmo:Met\xE0jalcal\xEDtt]]\n");
 
anno4000

> This bug? shocked me when i was parsing wikipedia-data.

You don't say what "WONT WORK" actually means. Apparently the regex
match /^\[\[[a-z]{2}:.*\]\]/m never returns with your string and
locale in effect. That certainly is a bug and if it's still in the
newest bleadperl (it still is in v5.9.4 DEVEL28658) it ought to be
reported.

BTW, I assume you are aware that your regex doesn't match (it shouldn't
loop endlessly, though). The matching regex /^\[\[[a-z]{3}:.*\]\]/m
doesn't show the problem.
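
As a quick check of that point, here is a minimal sketch that leaves out "use locale" so it stays clear of the hang. The ASCII sample string and the variable names are stand-ins chosen for illustration, not the original data:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample line shaped like the one in the report, with plain ASCII
# in place of the accented letters (an assumption for illustration).
my $line = "[[lmo:Metajalcalitt]]\n";

# "lmo" is three letters, so [a-z]{2} stops before the "o" and the
# following ":" cannot match; [a-z]{3} consumes all of "lmo".
my $two_letters   = $line =~ /^\[\[[a-z]{2}:.*\]\]$/m ? 1 : 0;
my $three_letters = $line =~ /^\[\[[a-z]{3}:.*\]\]$/m ? 1 : 0;

print "{2}: ", ($two_letters   ? "matches" : "does not match"), "\n";
print "{3}: ", ($three_letters ? "matches" : "does not match"), "\n";
```

So the substitution in the original script would leave this line untouched even once the hang is out of the way.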

Anno
 
Mumia W. (on aioe)

It looks like a bug in the interpreter to me. Perl locks up when it sees
the substitution operator. I wanted to create a better set of test
cases, but I got distracted by more pressing things:

#!/usr/local/bin/perl5.9.4

use strict;
use warnings;
use locale; ## WONT WORK
use Encode;

printf "Perl %vd\n", $^V;


local $\ = "\n";
my $car = "[[taurus:vehicle\xE0^^]]\n";
print enhex($car);
$car = Encode::decode('iso-8859-15',$car);
print enhex($car);
# $car =~ s/^\[\[[a-z]{2}:.*\]\]$/substituted/gm; # LOCKS UP
# $car =~ s/^\[\[[a-z]{2}:.*\][]]$/substituted/gm; # DOES NOT LOCK UP
# $car =~ s/^\[\[[a-z]{2}:.*\]{2}$/substituted/gm; # LOCKS UP
# $car =~ s/^\[\[[a-z]{6}:.*\]\]$/substituted/gm; # DOES NOT LOCK UP (MATCH)
# $car =~ s/^\[\[[a-z]{5}:.*\]\]$/substituted/gm; # LOCKS UP
# $car =~ s/^\[\[[a-z]{5}:.*\]\Q]\E$/substituted/gm; # LOCKS UP
# $car =~ s/^\[\[[a-z]{5}:[^]]+\]\]$/substituted/gm; # LOCKS UP
# $car =~ s/^..[a-z]{2}:.*\]\]$/substituted/gm; # DOES NOT LOCK UP
$car =~ s/^\[{2}[a-z]{2}:.*\]{2}$/substituted/gm; # LOCKS UP
print $car;

sub enhex {
    my $data = $_[0];
    $data = unpack 'H*', $data;
    $data =~ s/(..)/$1 /g;
    $data;
}

__END__

These things seem to be required to evoke the bug:

The locale pragma must be in effect. The Encode module must be used to
decode the string (tested only with iso-8859-15). Two instances of \] or
something similar must appear in the regex. The regex must fail to match.
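
Given the first condition, one possible stopgap is to switch the pragma off just around the substitution. This is only a sketch: it assumes the pattern does not need locale-aware matching, it relies on the earlier observation that the script works without "use locale;", and it has not been confirmed against the affected 5.9.4 build:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use locale;   # stays in effect for the rest of the file
use Encode;

my $content = Encode::decode("iso-8859-15", "[[lmo:Met\xE0jalcal\xEDtt]]\n");

{
    # "no locale" is lexically scoped, so only the statements in this
    # block are compiled without locale rules.
    no locale;
    $content =~ s!^\[\[[a-z]{2}:.*\]\]$!!gm;
}

print $content;
```

Code after the closing brace is compiled under "use locale" again, so anything else in the program that depends on the locale is unaffected.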

It's too bad I don't have any more time for this.
 
felix.ostmann

Is there any good solution at this time?

$content =~ s!^\[\[[a-z]{2}:.*.\]$!!gm; ## "\]\]" => ".\]"

Thanks for your help. What is the correct procedure when I find
something like this? Where is the Perl bug tracker (or whatever it is
called)? I have never used such a system before :(

 
anno4000

> is there any good solution at this time?

Solution to what? Please don't top-post.

> $content =~ s!^\[\[[a-z]{2}:.*.\]$!!gm; ## "\]\]" => ".\]"
>
> Thanks for your help, what is the correct handling when i found
> something like this? where is the perl-bug-tracking (or so). I never
> use such a system bevor :(

[tofu snipped]

perldoc perlbug

Anno
 
