Dale
Consider the following:
    my $html_string = get "http://stock.narod.ru/fibo.htm";
    my $russian_page = decode("cp1251", $html_string);
    while ($russian_page =~ m/(Фибоначчи)\s+\b(\w+)/g) {
        print "$1 $2\n";
    }
I get a CP1251-encoded page from a Russian site and search for words
that might follow the word Фибоначчи (Fibonacci). But isn't this bit
of code inefficient? I start right off by decoding the whole page,
when I really only need to decode those portions of the page
that match. So wouldn't it be better to encode the regex in CP1251,
do the matching on the raw bytes, and then convert any matched strings
to the encoding I want before printing them out? Something like the following:
    $russian_page = get "http://stock.narod.ru/fibo.htm";
    my $search_word = encode("cp1251", "Фибоначчи");
    while ($russian_page =~ m/($search_word)\s+(\w+)/g) {
        print decode("cp1251", "$1 $2\n");
    }
This doesn't obviously fail, but it doesn't give the expected result
either. Presumably, the problem is that I've only encoded part of my
regex in CP1251. So the question is: Is there a way to change the
encoding of a regular expression?
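To make the "only part of my regex is encoded" suspicion concrete, here is a minimal, self-contained sketch that needs no network access. The sample sentence stands in for the downloaded page and is purely hypothetical. It also rests on an assumption I have not verified on 5.8.8: that on a byte string, without "use locale", \w only recognises ASCII word characters, so the Cyrillic bytes have to be spelled out as a CP1251 byte range.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode decode);

binmode(STDOUT, ":utf8");

# Hypothetical sample text standing in for the raw CP1251 page.
my $page_bytes  = encode("cp1251", "числа Фибоначчи образуют ряд");
my $search_word = encode("cp1251", "Фибоначчи");

# Assumption: \w will not match Cyrillic bytes in a byte string, so list
# them explicitly. In CP1251, 0xC0-0xFF are А..я; 0xA8 and 0xB8 are Ё and ё.
my $cyrillic_byte = qr/[\xC0-\xFF\xA8\xB8]/;

my @found;
while ($page_bytes =~ m/(\Q$search_word\E)\s+($cyrillic_byte+)/g) {
    # Decode only the matched bytes, not the whole page.
    push @found, decode("cp1251", "$1 $2");
}

print "$_\n" for @found;
```

If that assumption holds, the literal word was never the problem; it is the \w (and possibly \s) in the unencoded remainder of the pattern that does not line up with CP1251 bytes.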
A couple of details:
Perl version:
5.8.8
Pragmas and modules used:
    use LWP::Simple;
    use utf8;
    use Encode;
    binmode(STDOUT, ":utf8");