polymorphic regex -- encoding issue

Dale

Consider the following:

my $html_string = get "http://stock.narod.ru/fibo.htm";
my $russian_page = decode("cp1251", $html_string);
while ($russian_page =~ m/(Фибоначчи)\s+\b(\w+)/g) {
print "$1 $2\n";
}

I get a CP1251-encoded page from a Russian site and search for words
that might follow the word Фибоначчи (Fibonacci). But isn't this bit
of code inefficient? I start right off by decoding the whole page,
where I really only need to have decoded those portions of the page
that match. So wouldn't it be better to encode the regex in CP1251 to
do the matching, and then convert any matched strings to the encoding
I want before printing out? Something like the following:

$russian_page = get "http://stock.narod.ru/fibo.htm";
my $search_word = encode("cp1251", "Фибоначчи");
while ($russian_page =~ m/($search_word)\s+(\w+)/g) {
print decode("cp1251", "$1 $2\n");
}

This doesn't obviously fail, but it doesn't give the expected result
either. Presumably, the problem is that I've only encoded part of my
regex in CP1251. So the question is: Is there a way to change the
encoding of a regular expression?
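
A likely reason the second version misbehaves is that `\w` applied to a raw byte string only recognizes ASCII word characters by default, so the CP1251 bytes of Cyrillic letters are not matched. The following is a minimal sketch of that effect, assuming no locale pragma and no `unicode_strings` feature are in force; the variable names are illustrative only.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode decode);

# One Cyrillic letter as a single CP1251 byte (0xD4), and the same
# letter decoded back into a Perl character string (U+0424).
my $bytes = encode("cp1251", "Ф");
my $chars = decode("cp1251", $bytes);

# \w is applied to the raw byte and to the decoded character in turn.
print "bytes: ", ($bytes =~ /\w/ ? "match" : "no match"), "\n";
print "chars: ", ($chars =~ /\w/ ? "match" : "no match"), "\n";
```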

A couple details:

Perl version:
5.8.8

Pragmas and modules used:
LWP::Simple
utf8;
Encode;
binmode(STDOUT, ":utf8");
 
Ben Morrow

Quoth Dale:
Consider the following:

my $html_string = get "http://stock.narod.ru/fibo.htm";
my $russian_page = decode("cp1251", $html_string);
while ($russian_page =~ m/(Фибоначчи)\s+\b(\w+)/g) {
print "$1 $2\n";
}

I get a CP1251-encoded page from a Russian site and search for words
that might follow the word Фибоначчи (Fibonacci). But isn't this bit
of code inefficient? I start right off by decoding the whole page,
where I really only need to have decoded those portions of the page
that match. So wouldn't it be better to encode the regex in CP1251 to
do the matching, and then convert any matched strings to the encoding
I want before printing out. Something like the following:

$russian_page = get "http://stock.narod.ru/fibo.htm";
my $search_word = encode("cp1251", "Фибоначчи");
while ($russian_page =~ m/($search_word)\s+(\w+)/g) {
print decode("cp1251", "$1 $2\n");
}

This doesn't obviously fail, but it doesn't give the expected result
either. Presumably, the problem is that I've only encoded part of my
regex in CP1251. So the question is: Is there a way to change the
encoding of a regular expression?

Nope, there isn't. All you can do is encode all the separate parts into
bytes, and then ask for a regex that matches by bytes.

At the very least you want a 'use bytes' around that regex and match.
You also need to be aware that perl will be doing a byte-by-byte match,
so if it's possible for part of a character to match, you will get
false positives. Whether that can happen depends on the encoding: it is
possible with UTF-16, but not with UTF-8, for instance (I'm afraid I
don't know about cp1251). You also need to be sure that LWP is returning
you the page as bytes, and not trying to be clever and decoding it to
UTF-8 already. I presume you already know that.
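
The false-positive risk described above can be sketched with a small example. It assumes nothing beyond the core Encode module; the sample characters are chosen only for illustration. In UTF-16BE a needle can match across the boundary between two characters, whereas in a single-byte encoding every character occupies exactly one byte, so a byte match cannot start in the middle of a character.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Two Cyrillic letters, U+0441 and U+0442, as UTF-16BE bytes: 04 41 04 42.
my $haystack = encode("UTF-16BE", "\x{0441}\x{0442}");

# An unrelated character, U+4104, as UTF-16BE bytes: 41 04.
my $needle = encode("UTF-16BE", "\x{4104}");

# A byte-level search finds the needle straddling the two characters.
my $offset = index($haystack, $needle);
print "false positive at byte offset $offset\n" if $offset >= 0;

# Three Cyrillic letters encoded as CP1251 take three bytes,
# one byte per character.
my $cp1251 = encode("cp1251", "\x{0424}\x{0438}\x{0431}");
printf "3 characters -> %d bytes in CP1251\n", length $cp1251;
```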

Unless you have an awful lot of these matches to do (and you know this
is what's slowing you down), it's not worth the bother.

Ben
 
Dale

Thanks Ben. The problem is, of course, consistency. I want to make
sure that I also convert '\w' and '\s' so that they match the same
things that they would have matched in the original regex. The perldoc
says one can influence what '\w' matches by using locales. But I
managed to find a consistent translation without using locales (now
I'm answering my own question):


# As before, I search for the word Fibonacci, in CP1251-encoded
# Cyrillic
my $search_word = encode("cp1251", "Фибоначчи");

# CP1251 is an extended ASCII charset in the range 00-FF. Here we
# get this set of characters and decode them into Unicode.
my @cp1251_charset =
split(//, decode("CP1251", join("", map { chr } 0x00..0xFF)));

# Find out which of these characters are matched by '\w' (in Unicode).
my @cp1251_wordchars =
grep(/\w/, @cp1251_charset);

# The matched word characters are put back into CP1251
my $w = encode("CP1251", join("", @cp1251_wordchars));

# We follow the same idea as above for the space characters.
my @cp1251_spacechars =
grep(/\s/, @cp1251_charset);
my $s = encode("CP1251", join("", @cp1251_spacechars));

# Now we just put the pieces together
my $russian_page = get "http://stock.narod.ru/fibo.htm";
while ($russian_page =~ m/($search_word)[$s]([$w]+)/g) {
print decode("cp1251", "$1 $2\n");
}
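
For readers who want to try the idea without a network connection, here is a self-contained sketch under stated assumptions: the `$page` string is a made-up stand-in for the downloaded page, `@found` is a hypothetical helper array, and `quotemeta` plus the `+` after the whitespace class are additions that are not in the code above.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode decode);

binmode(STDOUT, ":utf8");

# Hypothetical sample text standing in for the fetched page (CP1251 bytes).
my $page = encode("cp1251", "числа Фибоначчи образуют ряд");

# The search word as CP1251 bytes, escaped so it is matched literally.
my $search_word = quotemeta encode("cp1251", "Фибоначчи");

# Keep each byte whose CP1251 decoding is a word or whitespace character.
my $w = join '', grep { decode("cp1251", $_) =~ /\w/ } map { chr } 0x00 .. 0xFF;
my $s = join '', grep { decode("cp1251", $_) =~ /\s/ } map { chr } 0x00 .. 0xFF;

# Match on the raw bytes and decode only the captured pieces.
my @found;
while ($page =~ m/($search_word)[$s]+([$w]+)/g) {
    push @found, decode("cp1251", "$1 $2");
}
print "$_\n" for @found;
```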


Details (same as in previous version):

Perl version
5.8.8

modules used
Encode;
LWP::Simple qw(get);
utf8;
binmode(STDOUT, ":utf8");

Note: Why didn't I use setlocale, as the Perldoc suggests? First
reason: Our computers are somehow set up with a very limited range of
possible locales. Second reason: locales are confusing for me. I
prefer to avoid them. I set my environment to en_US.utf8 and I don't
want to think about locales any more after that.
 
Ilya Zakharevich

[A complimentary Cc of this posting was sent to Dale]

# CP1251 is an extended ASCII charset in the range 00-FF. Here we
# get this set of characters and decode them into Unicode.
my @cp1251_charset =
split(//, decode("CP1251", join("", map { chr } 0x00..0xFF)));

# Find out which of these characters are matched by '\w' (in Unicode).
my @cp1251_wordchars =
grep(/\w/, @cp1251_charset);

# The matched word characters are put back into CP1251
my $w = encode("CP1251", join("", @cp1251_wordchars));

Too baroque, IMO. I would use something like

my $w = join '', grep +(decode 'cp1251', $_) =~ /\w/, map chr, 0x00..0xFF;

Your approach has a chance to be quicker, though, but since this
should only run once... [I did not benchmark them.]

Ilya
 
Dr.Ruud

Ilya Zakharevich wrote:
my $w = join '', grep +(decode 'cp1251', $_) =~ /\w/, map chr,
0x00..0xFF;

Alternative:

my $w = pack "C*", grep decode('cp1251', chr) =~ /\w/, 0..255;
 
Dale Gerdemann

Thanks Ilya and Affijn for your "improvements" but I still like my own
code better, because at least I break it down into commented steps. I
know my comments are minimal, but at least I tried. The reader of my
code is bound to find several things confusing:
my @cp1251_charset =
split(//, decode("CP1251", join("", map { chr } 0x00..0xFF)));

Here are some questions that are bound to arise:

Why "decode CP1251"? How can you see that the input was ever encoded
as CP1251 to begin with? We must be assuming that 'chr' returns
something that can at least be thought of as CP1251 encoded. But
consider the small test program:

print chr(0xFF);

This may print out ÿ (LATIN SMALL LETTER Y WITH DIAERESIS), a
character that doesn't even exist in CP1251. Of course, it only prints
out this character if you're using "binmode(STDOUT, ":utf8");" or "use
encoding 'utf8';", but you can see that there is plenty of room for
confusion.
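
The ambiguity can be made concrete with a short sketch, assuming only the core Encode module; the variable names are illustrative. `chr(0xFF)` is just the value 0xFF: read as a character it is U+00FF, but read as a CP1251 byte and decoded it becomes U+044F (CYRILLIC SMALL LETTER YA).

```perl
use strict;
use warnings;
use Encode qw(decode);

my $byte = chr(0xFF);

# Read directly, the value is code point 0xFF.
my $as_character = ord $byte;

# Read as a CP1251 byte and decoded, it is a different code point.
my $as_cp1251 = ord decode("cp1251", $byte);

printf "read as a character:   U+%04X\n", $as_character;
printf "decoded from CP1251:   U+%04X\n", $as_cp1251;
```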

Then there is the issue of what is stored in "@cp1251_charset". Since
it's the output of 'decode', then it must be decoded, right? Whatever
"decoded" means. You see my point. A comment would be helpful, and
this won't be possible if you pack everything into one line.

But what the "improvers" of my code also missed is that I had a second
reason for the intermediate step. I wanted the complete CP1251 charset
stored in a variable so that I could make several passes through it.
As you see in the small example I made two passes. Once for '\w' and
once for '\s'.

I'm sure there are legitimate improvements that could be made to my
code, but it baffles me that people should see packing into a oneliner
as something virtuous.

Dale Gerdemann
 
Dr.Ruud

Dale Gerdemann wrote:
Thanks Ilya and Affijn for your "improvements" but I still like my own
code better, because at least I break it down into commented steps.

Ahem, you are replying to the wrong message. I reply to the part that I
quote. So the relation to your code was broken by me on purpose.

But what the "improvers" of my code also missed is that I had a second
reason for the intermediate step. I wanted the complete CP1251 charset
stored in a variable so that I could make several passes through it.
As you see in the small example I made two passes. Once for '\w' and
once for '\s'.

What you are missing is that the $w in

my $w = pack "C*", grep decode('cp1251', chr) =~ /\w/, 0..255;

contains exactly what is in your $w.

So for $s you can just do:

my $s = pack "C*", grep decode('cp1251', chr) =~ /\s/, 0..255;


Perhaps you like it more like this:

$cp1251_word_chars =
pack("C*", grep decode('cp1251', chr) =~ /\w/, 0..255);
$cp1251_whitespace_chars =
pack("C*", grep decode('cp1251', chr) =~ /\s/, 0..255);

so that your

m/($search_word)[$s]([$w]+)/g

becomes

m/($search_word)[$cp1251_whitespace_chars]([$cp1251_word_chars]+)/g


And maybe you should allow more than 1 whitespace character there:

m/($search_word)[$cp1251_whitespace_chars]+([$cp1251_word_chars]+)/g


And if your $search_word can ever contain regex metacharacters, look
into quotemeta.
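
The claim that all three ways of building `$w` yield the same bytes can be checked with a sketch along these lines. The variable names `$w_dale`, `$w_ilya` and `$w_ruud` are made up for the comparison, and the grep expressions are written with braces for readability.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Dale's approach: decode all 256 bytes at once, filter, re-encode.
my @charset = split //, decode("cp1251", join "", map { chr } 0x00 .. 0xFF);
my $w_dale  = encode("cp1251", join "", grep { /\w/ } @charset);

# Ilya's approach: decode each byte only to test it, keep the byte itself.
my $w_ilya = join '', grep { decode('cp1251', $_) =~ /\w/ } map { chr } 0x00 .. 0xFF;

# Dr.Ruud's approach: filter the byte values, then pack them into a string.
my $w_ruud = pack "C*", grep { decode('cp1251', chr) =~ /\w/ } 0 .. 255;

my $same = ($w_dale eq $w_ilya && $w_ilya eq $w_ruud);
print $same ? "identical\n" : "different\n";
```

The same comparison can be repeated with `/\s/` in place of `/\w/` for the whitespace class.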
 
Ilya Zakharevich

[A complimentary Cc of this posting was sent to Dale Gerdemann]

But what the "improvers" of my code also missed is that I had a second
reason for the intermediate step. I wanted the complete CP1251 charset
stored in a variable so that I could make several passes through it.
As you see in the small example I made two passes. Once for '\w' and
once for '\s'.

What makes you think that "improvers of your code" missed this? At
least, I explicitly said that your solution might be quicker.

I'm sure there are legitimate improvements that could be made to my
code, but it baffles me that people should see packing into a oneliner
as something virtuous.

It was not "your code packed into a oneliner". It was absolutely
different code; and if you do not like oneliners, just unpack it using
dummy variables.

What your code had was using encode/decode cycle, while your intent
was, obviously, to do only a decode. I corrected your code to match
your intent.

Hope this helps,
Ilya
 
