Problems with use locale and regexp

felix.ostmann · Dec 29, 2006

It is really strange!

First the code:
#####################################
#!/usr/bin/perl

use strict;
use warnings;
use locale; ## WONT WORK
use Encode;

my $content = Encode::decode("iso-8859-15","[[lmo:Met\xE0j
alcal\xEDtt]]\n");

$content =~ s!^\[\[[a-z]{2}:.*\]\]$!!gm; ## WONT WORK
# $content =~ s!^\[\[[a-z]{2}:.*\]$!!gm; ## WORK

print $content;
#####################################

This bug(?) shocked me while I was parsing Wikipedia data.
After 69 articles my import process stopped making progress, yet it
kept using a lot of CPU time. Strange.

After some hours I found that this small script reproduces the
error: Perl never finishes executing the pattern.

Without "use locale;" it works. With the second regexp it also works
(it searches for only one \] at the end of the line).

I can't believe it. I would expect Perl to find out right after "[[lmo:"
that the string doesn't match, so why does the \]\] at the end matter so much?

Why does this happen only with "use locale;"? I have set the locale to
POSIX, C, en_GB and de_DE; none of them works.

What can I do?

felix.ostmann · Dec 29, 2006

my $content = Encode::decode("iso-8859-15","[[lmo:Met\xE0jalcal\xEDtt]]\n");

Sorry for the newline in the code.

anno4000 · Jan 2, 2007

You don't say what "WONT WORK" actually means. Apparently the regex
match /^\[\[[a-z]{2}:.*\]\]/m never returns with your string and
locale in effect. That certainly is a bug and if it's still in the
newest bleadperl (it still is in v5.9.4 DEVEL28658) it ought to be
reported.

BTW, I assume you are aware that your regex doesn't match: it requires
exactly two letters before the colon, and "lmo" has three (it shouldn't
loop endlessly, though). The matching regex /^\[\[[a-z]{3}:.*\]\]/m
doesn't show the problem.

Anno
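
To make the point about the non-matching quantifier concrete, here is a minimal sketch. It deliberately runs without "use locale;" so it does not touch the hang discussed in this thread. The helper name matches, the ASCII-only test line, and the {2,3} variant are assumptions added for illustration; they are not from the original posts.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $line matches $re, else 0.
sub matches {
    my ($re, $line) = @_;
    return $line =~ $re ? 1 : 0;
}

# ASCII-only stand-in for the article line from the original post.
my $line = "[[lmo:Metajalcalitt]]\n";

my @patterns = (
    [ 'two letters'   => qr/^\[\[[a-z]{2}:.*\]\]$/m ],    # the original pattern
    [ 'three letters' => qr/^\[\[[a-z]{3}:.*\]\]$/m ],    # the variant that matches
    [ 'two or three'  => qr/^\[\[[a-z]{2,3}:.*\]\]$/m ],  # assumption: accept both lengths
);

for my $p (@patterns) {
    my ($label, $re) = @$p;
    printf "%-14s %s\n", $label,
        matches($re, $line) ? 'matches' : 'does not match';
}
```

Only the first pattern fails on this line, because "lmo" is three letters long. Whether a range such as {2,3} is the right choice depends on which language prefixes the import should strip.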

Mumia W. (on aioe) · Jan 2, 2007

It looks like a bug in the interpreter to me. Perl locks up when it sees
the substitution operator. I wanted to create a better set of test
cases, but I got distracted by more pressing things:

#!/usr/local/bin/perl5.9.4

use strict;
use warnings;
use locale; ## WONT WORK
use Encode;

printf "Perl %vd\n", $^V;

local $\ = "\n";
my $car = "[[taurus:vehicle\xE0^^]]\n";
print enhex($car);
$car = Encode::decode('iso-8859-15',$car);
print enhex($car);
# $car =~ s/^\[\[[a-z]{2}:.*\]\]$/substituted/gm; # LOCKS UP
# $car =~ s/^\[\[[a-z]{2}:.*\][]]$/substituted/gm; # DOES NOT LOCK UP
# $car =~ s/^\[\[[a-z]{2}:.*\]{2}$/substituted/gm; # LOCKS UP
# $car =~ s/^\[\[[a-z]{6}:.*\]\]$/substituted/gm; # DOES NOT LOCK UP (MATCH)
# $car =~ s/^\[\[[a-z]{5}:.*\]\]$/substituted/gm; # LOCKS UP
# $car =~ s/^\[\[[a-z]{5}:.*\]\Q]\E$/substituted/gm; # LOCKS UP
# $car =~ s/^\[\[[a-z]{5}:[^]]+\]\]$/substituted/gm; # LOCKS UP
# $car =~ s/^..[a-z]{2}:.*\]\]$/substituted/gm; # DOES NOT LOCK UP
$car =~ s/^\[{2}[a-z]{2}:.*\]{2}$/substituted/gm; # LOCKS UP
print $car;

# Return the bytes/characters of the argument as space-separated hex pairs.
sub enhex {
    my $data = $_[0];
    $data = unpack 'H*', $data;
    $data =~ s/(..)/$1 /g;
    $data;
}

__END__

These things seem to be required to evoke the bug:

- The locale module must be used.
- The Encode module must be used to decode (tested only with iso-8859-15).
- Two instances of \] or something similar must appear in the regex.
- The regex must fail to match.

It's too bad I don't have any more time for this.
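
Since the reports above tie the hang to "use locale;" being in effect, one possible way to sidestep it is to compile the affected regex inside a "no locale;" block while leaving the pragma on for the rest of the file. This is only a sketch under that assumption, not a confirmed fix for the underlying bug. The helper name strip_interwiki_links and the extra sample lines are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use locale;    # still in effect for the rest of the file
use Encode;

# Hypothetical helper: blanks out lines of the form [[xx:...]].
sub strip_interwiki_links {
    my ($content) = @_;
    {
        no locale;    # the regex below is compiled without locale semantics
        $content =~ s!^\[\[[a-z]{2}:.*\]\]$!!gm;
    }
    return $content;
}

# First line is the string from the original post; the other two lines
# are made-up sample data.
my $content = Encode::decode("iso-8859-15",
    "[[lmo:Met\xE0jalcal\xEDtt]]\n[[de:Alkalimetalle]]\nkeep me\n");

print strip_interwiki_links($content);
```

The locale pragma is lexically scoped, so the block only changes how this one substitution is compiled. Code that relies on locale-aware sorting or case mapping elsewhere in the file is unaffected.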

felix.ostmann · Jan 3, 2007

Is there any good solution at this time?

$content =~ s!^\[\[[a-z]{2}:.*.\]$!!gm; ## "\]\]" => ".\]"

Thanks for your help. What is the correct procedure when I find
something like this? Where is the Perl bug tracker (or whatever it is
called)? I have never used such a system before.

anno4000 · Jan 4, 2007

> Is there any good solution at this time?

Solution to what? Please don't top-post.

> $content =~ s!^\[\[[a-z]{2}:.*.\]$!!gm; ## "\]\]" => ".\]"
>
> Thanks for your help. What is the correct procedure when I find
> something like this? Where is the Perl bug tracker (or whatever it is
> called)? I have never used such a system before.

[tofu snipped]

perldoc perlbug

Anno
