polymorphic regex -- encoding issue

Dale · Oct 18, 2007

Consider the following:

my $html_string = get "http://stock.narod.ru/fibo.htm";
my $russian_page = decode("cp1251", $html_string);
while ($russian_page =~ m/(Фибоначчи)\s+\b(\w+)/g) {
print "$1 $2\n";
}

I get a CP1251-encoded page from a Russian site and search for words
that might follow the word Фибоначчи (Fibonacci). But isn't this bit
of code inefficient? I start right off by decoding the whole page,
where I really only need to have decoded those portions of the page
that match. So wouldn't it be better to encode the regex in CP1251 to
do the matching, and then convert any matched strings to the encoding
I want before printing out? Something like the following:

$russian_page = get "http://stock.narod.ru/fibo.htm";
my $search_word = encode("cp1251", "Фибоначчи");
while ($russian_page =~ m/($search_word)\s+(\w+)/g) {
print decode("cp1251", "$1 $2\n");
}

This doesn't obviously fail, but it doesn't give the expected result
either. Presumably, the problem is that I've only encoded part of my
regex in CP1251. So the question is: Is there a way to change the
encoding of a regular expression?
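
A minimal sketch of one likely reason the second version misbehaves, assuming the fetched page really arrives as undecoded CP1251 bytes: on a byte string, '\w' does not treat bytes above 0x7F as word characters, so it never matches the Cyrillic word that follows. The page content here is a made-up stand-in built locally, so nothing is fetched.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode decode);

# Hypothetical stand-in for the fetched page: CP1251 octets built locally.
my $page_bytes  = encode('cp1251', 'Фибоначчи числа');
my $search_word = encode('cp1251', 'Фибоначчи');

# Byte-level attempt: the encoded word matches, but \w+ finds no word
# character among the CP1251 bytes of the following Cyrillic word.
my $byte_match = ($page_bytes =~ m/($search_word)\s+(\w+)/) ? 1 : 0;

# Character-level attempt: decode first, then match with Unicode semantics.
my $text       = decode('cp1251', $page_bytes);
my $char_match = ($text =~ m/(Фибоначчи)\s+(\w+)/) ? 1 : 0;
my $following  = $char_match ? $2 : '';

binmode(STDOUT, ':encoding(UTF-8)');
print "byte-level match: $byte_match\n";
print "character-level match: $char_match ($following)\n";
```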

A couple details:

Perl version:
5.8.8

Pragmas and modules used:
LWP::Simple
utf8;
Encode;
binmode(STDOUT, ":utf8");

Ben Morrow · Oct 18, 2007

Dale wrote:
Consider the following:

my $html_string = get "http://stock.narod.ru/fibo.htm";
my $russian_page = decode("cp1251", $html_string);
while ($russian_page =~ m/(Фибоначчи)\s+\b(\w+)/g) {
print "$1 $2\n";
}

I get a CP1251-encoded page from a Russian site and search for words
that might follow the word Фибоначчи (Fibonacci). But isn't this bit
of code inefficient? I start right off by decoding the whole page,
where I really only need to have decoded those portions of the page
that match. So wouldn't it be better to encode the regex in CP1251 to
do the matching, and then convert any matched strings to the encoding
I want before printing out? Something like the following:

$russian_page = get "http://stock.narod.ru/fibo.htm";
my $search_word = encode("cp1251", "Фибоначчи");
while ($russian_page =~ m/($search_word)\s+(\w+)/g) {
print decode("cp1251", "$1 $2\n");
}

This doesn't obviously fail, but it doesn't give the expected result
either. Presumably, the problem is that I've only encoded part of my
regex in CP1251. So the question is: Is there a way to change the
encoding of a regular expression?

Nope, there isn't. All you can do is encode all the separate parts into
bytes, and then ask for a regex that matches by bytes.

At the very least you want a 'use bytes' around that regex and match.
You also need to be aware that perl will be doing a byte-by-byte match,
so if part of a character can match you will get false positives.
Whether that can happen depends on the encoding: it is possible with
UTF-16 but not with UTF-8, for instance. I'm afraid I don't know about
cp1251. You also need to be sure that LWP is returning you the page as
bytes, and not trying to be clever and decoding it to UTF-8 already. I
presume you already know that.
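
To illustrate the false-positive risk, here is a small sketch with made-up code points chosen so that a UTF-16BE byte search matches across a character boundary. It also checks that CP1251 uses one byte per character for a short Cyrillic sample, which suggests this particular problem should not arise there.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Two characters whose UTF-16BE bytes are 01 02 03 04 (illustrative values).
my $text   = "\x{0102}\x{0304}";
# A different character whose UTF-16BE bytes are 02 03.
my $needle = "\x{0203}";

# Character-level search: the needle does not occur in the text.
my $char_pos = index($text, $needle);

# Byte-level search: the needle's bytes straddle the two characters.
my $byte_pos = index(encode('UTF-16BE', $text), encode('UTF-16BE', $needle));

# CP1251 is a single-byte encoding: byte count equals character count.
my $cyrillic  = "\x{424}\x{438}\x{431}";
my $one_to_one = (length(encode('cp1251', $cyrillic)) == length($cyrillic)) ? 1 : 0;

print "character search position: $char_pos\n";
print "byte search position: $byte_pos\n";
print "cp1251 one byte per character: $one_to_one\n";
```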

Unless you have an awful lot of these matches to do (and you know this
is what's slowing you down), it's not worth the bother.

Ben

Dale · Oct 19, 2007

Thanks Ben. The problem is, of course, consistency. I want to make
sure that I also convert '\w' and '\s' so that they match the same
things that they would have matched in the original regex. The perldoc
says one can influence what '\w' matches by using locales. But I
managed to find a consistent translation without using locales (now
I'm answering my own question):

# As before, I search for the word Fibonacci, in CP1251-encoded Cyrillic
my $search_word = encode("cp1251", "Фибоначчи");

# CP1251 is an extended ASCII charset in the range 00-FF. Here we
# get this set of characters and decode them into Unicode.
my @cp1251_charset =
split(//, decode("CP1251", join("", map { chr } 0x00..0xFF)));

# Find out which of these characters are matched by '\w' (in Unicode).
my @cp1251_wordchars =
grep(/\w/, @cp1251_charset);

# The matched word characters are put back into CP1251
my $w = encode("CP1251", join("", @cp1251_wordchars));

# We follow the same idea as above for the space characters.
my @cp1251_spacechars =
grep(/\s/, @cp1251_charset);
my $s = encode("CP1251", join("", @cp1251_spacechars));

# Now we just put the pieces together
my $russian_page = get "http://stock.narod.ru/fibo.htm";
while ($russian_page =~ m/($search_word)[$s]([$w]+)/g) {
print decode("cp1251", "$1 $2\n");
}

Details (same as in previous version):

Perl version
5.8.8

modules used
Encode;
LWP::Simple qw(get);
utf8;
binmode(STDOUT, ":utf8");

Note: Why didn't I use setlocale, as the Perldoc suggests? First
reason: Our computers are somehow set up with a very limited range of
possible locales. Second reason: locales are confusing for me. I
prefer to avoid them. I set my environment to en_US.utf8 and I don't
want to think about locales any more after that.

Ilya Zakharevich · Oct 19, 2007

Dale wrote:

# CP1251 is an extended ASCII charset in the range 00-FF. Here we
# get this set of characters and decode them into Unicode.
my @cp1251_charset =
split(//, decode("CP1251", join("", map { chr } 0x00..0xFF)));

# Find out which of these characters are matched by '\w' (in Unicode).
my @cp1251_wordchars =
grep(/\w/, @cp1251_charset);

# The matched word characters are put back into CP1251
my $w = encode("CP1251", join("", @cp1251_wordchars));

Too baroque, IMO. I would use something like

my $w = join '', grep +(decode 'cp1251', $_) =~ /\w/, map chr, 0x00..0xFF;

Your approach has a chance to be quicker, though, but since this
should only run once... [I did not benchmark them.]

Ilya

Dr.Ruud · Oct 20, 2007

Ilya Zakharevich wrote:

my $w = join '', grep +(decode 'cp1251', $_) =~ /\w/, map chr,
0x00..0xFF;

Alternative:

my $w = pack "C*", grep decode('cp1251', chr) =~ /\w/, 0..255;

Dale Gerdemann · Oct 21, 2007

Thanks Ilya and Affijn for your "improvements" but I still like my own
code better, because at least I break it down into commented steps. I
know my comments are minimal, but at least I tried. The reader of my
code is bound to find several things confusing:

my @cp1251_charset =
split(//, decode("CP1251", join("", map { chr } 0x00..0xFF)));

Here are some questions that are bound to arise:

Why "decode CP1251"? How can you see that the input was ever encoded
as CP1251 to begin with? We must be assuming that 'chr' returns
something that can at least be thought of as CP1251 encoded. But
consider the small test program:

print chr(0xFF);

This may print out ÿ (LATIN SMALL LETTER Y WITH DIAERESIS), a
character that doesn't even exist in CP1251. Of course, it only prints
out this character if you're using 'binmode(STDOUT, ":utf8");' or 'use
encoding "utf8";', but you can see that there is plenty of room for
confusion.

Then there is the issue of what is stored in "@cp1251_charset". Since
it's the output of 'decode', then it must be decoded, right? Whatever
"decoded" means. You see my point. A comment would be helpful, and
this won't be possible if you pack everything into one line.
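
A small sketch of the distinction, for what it is worth: 'chr(0xFF)' just yields a one-character string with ordinal 255, and it only becomes a Cyrillic letter once it is explicitly decoded as CP1251. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# chr(0xFF) on its own is just "the character with ordinal 255".
my $raw = chr(0xFF);

# Treating that value as a CP1251 byte and decoding it gives the
# Unicode character that CP1251 assigns to byte 0xFF.
my $as_cp1251 = decode('cp1251', $raw);

printf "raw ordinal:       U+%04X\n", ord($raw);
printf "decoded as CP1251: U+%04X\n", ord($as_cp1251);
```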

But what the "improvers" of my code also missed is that I had a second
reason for the intermediate step. I wanted the complete CP1251 charset
stored in a variable so that I could make several passes through it.
As you see in the small example I made two passes. Once for '\w' and
once for '\s'.

I'm sure there are legitimate improvements that could be made to my
code, but it baffles me that people should see packing into a oneliner
as something virtuous.

Dale Gerdemann

Dr.Ruud · Oct 21, 2007

Dale Gerdemann wrote:

Thanks Ilya and Affijn for your "improvements" but I still like my own
code better, because at least I break it down into commented steps.

Ahem, you are replying to the wrong message. I reply to the part that I
quote. So the relation to your code was broken by me on purpose.

But what the "improvers" of my code also missed is that I had a second
reason for the intermediate step. I wanted the complete CP1251 charset
stored in a variable so that I could make several passes through it.
As you see in the small example I made two passes. Once for '\w' and
once for '\s'.

What you are missing is that the $w in

my $w = pack "C*", grep decode('cp1251', chr) =~ /\w/, 0..255;

contains exactly what is in your $w.
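
One way to check that claim is to build the class all three ways and compare. This is only a sanity-check sketch; the names '$w_dale', '$w_ilya' and '$w_ruud' are made up for the comparison, and the one-liners are written with explicit blocks and '$_' for readability.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Dale's version: decode the whole 256-entry table, filter, encode back.
my @cp1251_charset =
    split(//, decode('cp1251', join('', map { chr($_) } 0x00..0xFF)));
my $w_dale = encode('cp1251', join('', grep { /\w/ } @cp1251_charset));

# Ilya's version: test each byte's decoded character, keep the byte itself.
my $w_ilya = join '',
    grep { decode('cp1251', $_) =~ /\w/ } map { chr($_) } 0x00..0xFF;

# Dr.Ruud's version: filter the byte values, then pack them into a string.
my $w_ruud = pack 'C*', grep { decode('cp1251', chr($_)) =~ /\w/ } 0..255;

my $all_equal = ($w_dale eq $w_ilya && $w_ilya eq $w_ruud) ? 1 : 0;
print "all three classes identical: $all_equal\n";
```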

So for $s you can just do:

my $s = pack "C*", grep decode('cp1251', chr) =~ /\s/, 0..255;

Perhaps you like it more like this:

$cp1251_word_chars =
pack("C*", grep decode('cp1251', chr) =~ /\w/, 0..255);
$cp1251_whitespace_chars =
pack("C*", grep decode('cp1251', chr) =~ /\s/, 0..255);

so that your

m/($search_word)[$s]([$w]+)/g

becomes

m/($search_word)[$cp1251_whitespace_chars]([$cp1251_word_chars]+)/g

And maybe you should allow more than 1 whitespace character there:

m/($search_word)[$cp1251_whitespace_chars]+([$cp1251_word_chars]+)/g

And if your $search_word can ever contain regex metacharacters, look
into quotemeta.
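
Putting those suggestions together, a rough sketch might look like the following. It assumes the page is available as raw CP1251 bytes; a locally built string stands in for the fetched page here, and the '@pairs' array is only there to collect results for printing.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode decode);

# CP1251 byte values whose decoded character is a word / whitespace character.
my $cp1251_word_chars =
    pack("C*", grep { decode('cp1251', chr($_)) =~ /\w/ } 0..255);
my $cp1251_whitespace_chars =
    pack("C*", grep { decode('cp1251', chr($_)) =~ /\s/ } 0..255);

# quotemeta protects against regex metacharacters in the search term.
my $search_word = quotemeta(encode('cp1251', 'Фибоначчи'));

# Hypothetical stand-in for the fetched page (two spaces on purpose).
my $page_bytes = encode('cp1251', 'Фибоначчи  числа');

my @pairs;
while ($page_bytes =~
       m/($search_word)[$cp1251_whitespace_chars]+([$cp1251_word_chars]+)/g) {
    # Only the matched portions are decoded.
    push @pairs, decode('cp1251', "$1 $2");
}

binmode(STDOUT, ':encoding(UTF-8)');
print "$_\n" for @pairs;
```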

Ilya Zakharevich · Oct 24, 2007

Dale Gerdemann wrote:

But what the "improvers" of my code also missed is that I had a second
reason for the intermediate step. I wanted the complete CP1251 charset
stored in a variable so that I could make several passes through it.
As you see in the small example I made two passes. Once for '\w' and
once for '\s'.

What makes you think that "improvers of your code" missed this? At
least, I explicitly said that your solution might be quicker.

I'm sure there are legitimate improvements that could be made to my
code, but it baffles me that people should see packing into a oneliner
as something virtuous.

It was not "your code packed into a oneliner". It was absolutely
different code; and if you do not like oneliners, just unpack it using
dummy variables.

Your code used an encode/decode cycle, while your intent was,
obviously, to do only a decode. I corrected your code to match
your intent.

Hope this helps,
Ilya
