Regex failed to replace utf8 character


Frank

I'm trying to replace a character from an HTML file I got from Yahoo's
search engine cache. I'm reading in the file line by line and applying
the following regex:

if (utf8::is_utf8($html)) {
print "Is UTF8\n";
}

# Char that I copied and pasted directly from the HTML file
my $special = utf8::encode("Â");

if ($html =~ /$special /) {
print "TRUE:\n";
}

if ($html =~ s|$special | |g) {
print "REPLACED\n";
}

if ($html =~ /$special /) {
print "STILL TRUE\n";
}

What I'm seeing is:
Is UTF8
TRUE
REPLACED
STILL TRUE

I don't understand why the character is not being replaced by a space.
I've been working on this for hours. Any help would be much
appreciated.

Frank
 

Ben Morrow

Frank said:
I'm trying to replace a character from an HTML file I got from Yahoo's
search engine cache. I'm reading in the file line by line and applying
the following regex:

if (utf8::is_utf8($html)) {
print "Is UTF8\n";
}

# Char that I copied and pasted directly from the HTML file
my $special = utf8::encode("Â");

if ($html =~ /$special /) {
print "TRUE:\n";
}

Just an idea: try adding

if ($html =~ /$special$special /) {
print "Whoops! There are two of them!\n";
}

here. You may also find the idiom

1 while $html =~ s/$special / /g;

useful, which keeps trying to do the replacement until the result
doesn't match.
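
A minimal sketch of when that idiom matters, using a made-up string rather than the HTML in question: a single s///g pass does not rescan text it has already passed, so matches that only appear after an earlier replacement need another pass.

```perl
use strict;
use warnings;

# Hypothetical input: nested empty parentheses.
my $text = "a((()))b";

# Each s///g pass removes only the innermost "()", which exposes a new
# pair for the next pass. The loop ends when a pass replaces nothing.
1 while $text =~ s/\(\)//g;

print "$text\n";    # prints "ab"
```

The loop only terminates if each pass makes the string "smaller" in some sense; a replacement that still matches its own pattern will spin forever.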

Ben
 

Brian McCauley

$html =~ s|$special | |g
I don't understand why the character is not being replaced by a space.

Perhaps because the thing in the LHS of the replacement operator is not
just the character, it's the character followed by a space.
 

Frank

Ben,
1 while $html =~ s/$special / /g;

This causes an infinite loop. The substitution is simply not being
made. I also tried replacing the copyright character © like so:

my $copy = "\x{00a9}";
if ($html =~ s|$copy|©|g) {
print "REPLACED COPYRIGHT\n";
}

but no replacement was made.

I also tried this:

$copy = utf8::encode("©");
if ($html =~ s|$copy|©|g) {
print "REPLACED COPYRIGHT\n";
}

and it changed the text

id="copy1"> © 2005 Farmer

to

id="copy1">©©©2005©Farmer

Notice the © char is still present, but all the spaces were replaced.
Very odd. I have a *lot* of experience with regex in Perl, and I've
never seen this before. Unfortunately, I have very little experience
with utf8 which I believe is at the core of this problem.

If anyone wants to take a look at the file I'm parsing, I've posted it
here:
http://www.cs.odu.edu/~fmccown/buy-online.html

I'm running Perl v5.8.3 on Redhat Linux.

Thanks,
Frank
 

Frank

When I posted my previous response using Google Groups, they apparently
changed all my "ampersand copy semicolon" parts into the copyright
symbol. The conversion of the string should look like this:

# Assume X is "ampersand copy semicolon"
id="copy1">X©X2005XFarmer

Thanks,
Frank
 

Ben Morrow

Frank said:
Ben,


This causes an infinite loop. The substitution is simply not being
made.

Are you *sure*? Either it should match, or not; even if it fails to
match, it shouldn't loop. Try replacing it with something like

print "replaced '$1'\n" while $html =~ s/($special) / /g;

I also tried replacing the copyright character © like so:

my $copy = "\x{00a9}";
if ($html =~ s|$copy|©|g) {

Don't use | as a delimiter for regexes: it's a regex metacharacter
(alternation), so you will confuse people (though not Perl :) ).

print "REPLACED COPYRIGHT\n";
}

but no replacement was made.

I also tried this:

$copy = utf8::encode("©");

What did you hope to achieve by this? As a general rule, the only
functions you should use to deal with character sets are Encode::encode
and Encode::decode, and the corresponding :encoding PerlIO layer. Use
encode to convert characters into raw octets (suitable to be written to a
filehandle that has had binmode ':raw' applied to it), and decode to
convert back again. In particular, IMHO *all* the functions in the utf8
namespace should be regarded as perl-internal, and not for normal use.
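
A minimal sketch of that decode / work / encode round trip. The sample string is hypothetical, and ISO-8859-1 is an assumption about the input (the real file's encoding would need checking first):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical raw input: byte 0xA9 is the copyright sign in ISO-8859-1.
my $octets = "> \xa9 2005";

# Octets -> characters. The encoding name is an assumption about the data.
my $chars = decode('ISO-8859-1', $octets);

# Work on characters, not bytes.
$chars =~ s/\x{a9}/&copy;/g;

# Characters -> octets again before writing to a raw filehandle.
my $out = encode('UTF-8', $chars);

print "$out\n";    # prints "> &copy; 2005"
```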

if ($html =~ s|$copy|©|g) {
print "REPLACED COPYRIGHT\n";
}

and it changed the text

id="copy1"> © 2005 Farmer

Is the A-circumflex I can see above actually present in your data, or is
it an artefact of Usenet, or of your terminal (what I see has
right-angle space A-circumflex copyright space two zero zero five
in the middle)?

to

id="copy1">©©©2005©Farmer

Notice the © char is still present, but all the spaces were replaced.
Very odd. I have a *lot* of experience with regex in Perl, and I've
never seen this before. Unfortunately, I have very little experience
with utf8 which I believe is at the core of this problem.

This is indeed odd. With both 5.8.8 and 5.8.0 (and a *lot* of utf8 bugs
were fixed in 5.8.1), and this script:

#!/usr/bin/perl -l

use warnings;
use strict;

use LWP::Simple qw/get/;

# This is because my terminal expects UTF8 output.
# The only difference it makes to the output is that without it the
# copyright symbol on the third line of output is displayed as an
# invalid character.

binmode \*STDOUT, ':encoding(utf8)';

$_ = get 'http://www.cs.odu.edu/~fmccown/buy-online.html';
my $copy = "\xa9";

/(> .* 2005)/x and print join ' ', map ord, split //, $1;

print "matches before s///" if /$copy/;
/> (.*? $copy .*?) </x and print $1;

print "replaced a copyright" while s/$copy/&copy;/;

print "matches after s///" if /$copy/;
/> (.*? &copy; .*?) </x and print $1;

__END__

I get

62 32 169 32 50 48 48 53
matches before s///
© 2005 Farmer India. All rights reserved.
replaced a copyright
&copy; 2005 Farmer India. All rights reserved.

which is what I would have expected. How are you getting the data into
Perl? Perhaps you don't have what you think you have for some reason. If
you are using LWP, which version (I am using 5.803)? Can you try the
script above, and see what you get?

Ben
 

Frank

Ben,

Thanks for your help. I was able to locate 3 problems with what I was
doing:

1) Reading the file into Perl using

open(F, "<:utf8", $fn);

was the first problem. Reading it in normally was what I should have
been doing and was doing originally, but I had added this later when
trying to figure out what was going on.

2) I should have replaced the copyright like you did:

my $copy = "\xa9";
$html =~ s/$copy/C/g

3) This still ran in an infinite loop:

my $special = utf8::encode("Â");
print "sub\n" while ($html =~ s/$special / /g);

but when I corrected it to the following, the substitution worked fine:

my $special = "\xa0";
print "sub\n" while ($html =~ s/$special/ /g);


Thanks again,
Frank
 

Dr.Ruud

Frank wrote:
3) This still ran in an infinite loop:

my $special = utf8::encode("Â");
print "sub\n" while ($html =~ s/$special / /g);

but when I corrected it to the following, the substitution worked
fine:

my $special = "\xa0";
print "sub\n" while ($html =~ s/$special/ /g);

Do you have anything special like "use utf8;" in your source?
 

Frank

Dr.Ruud said:
Frank wrote:


Do you have anything special like "use utf8;" in your source?

No, that's not in there. I think the use of utf8::encode("Â") is
doing something flaky, and since I've gotten it to work, I'm not that
concerned about the problem.

Thanks,
Frank
 

anno4000

Frank said:
Ben,

Thanks for your help. I was able to locate 3 problems with what I was
doing:

1) Reading the file into Perl using

open(F, "<:utf8", $fn);

was the first problem. Reading it in normally was what I should have
been doing and was doing originally, but I had added this later when
trying to figure out what was going on.

2) I should have replaced the copyright like you did:

my $copy = "\xa9";
$html =~ s/$copy/C/g

3) This still ran in an infinite loop:

my $special = utf8::encode("Â");
print "sub\n" while ($html =~ s/$special / /g);

Ha! You're not running under warnings and you didn't read the
pertinent documentation. $special is undefined after this statement.
See perldoc utf8:

* utf8::encode($string)
Converts in-place the character sequence to the corresponding octet
sequence in UTF-X. The UTF-8 flag is turned off. Returns nothing.

That also explains the infinite loop.
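
A short sketch of both halves of this, with a made-up sample string: the in-place calling convention, and why an undefined $special never stops matching. An undefined value interpolates as the empty string, so the pattern shrinks to a single space, which is replaced by a space and therefore still matches on the next pass. The loop here is capped at 5 passes purely so the example terminates.

```perl
use strict;
use warnings;

# utf8::encode works in place and returns nothing, so call it as a
# statement on a variable instead of assigning its result.
my $special = "\x{c2}";                 # U+00C2, A with circumflex
utf8::encode($special);                 # $special now holds the octets C3 82
print unpack("H*", $special), "\n";     # prints c382

# What the original assignment effectively left behind: an undefined value.
my $undef;
my $html   = "a b c";                   # hypothetical sample input
my $passes = 0;
{
    no warnings 'uninitialized';
    # undef interpolates as "", so the pattern is just a space, which is
    # replaced by a space and therefore matches again on every pass.
    $passes++ while $passes < 5 && $html =~ s/$undef / /g;
}
print "still matching after $passes passes\n";   # prints 5 passes
```

With warnings enabled and no cap, the same loop would print "Use of uninitialized value" messages and never finish.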

Anno
 

Frank

my $special = utf8::encode("Â");
Ha! You're not running under warnings and you didn't read the
pertinent documentation. $special is undefined after this statement.
See perldoc utf8:

* utf8::encode($string)
Converts in-place the character sequence to the corresponding octet
sequence in UTF-X. The UTF-8 flag is turned off. Returns nothing.

That also explains the infinite loop.

Anno

Nice. It's usually the most obvious explanations that get overlooked.

Frank
 
