Regex failed to replace utf8 character

Frank · Nov 29, 2006

I'm trying to replace a character from an HTML file I got from Yahoo's
search engine cache. I'm reading in the file line by line and applying
the following regex:

if (utf8::is_utf8($html)) {
print "Is UTF8\n";
}

# Char that I copied and pasted directly from the HTML file
my $special = utf8::encode("Â");

if ($html =~ /$special /) {
print "TRUE:\n";
}

if ($html =~ s|$special | |g) {
print "REPLACED\n";
}

if ($html =~ /$special /) {
print "STILL TRUE\n";
}

What I'm seeing is:
Is UTF8
TRUE
REPLACED
STILL TRUE

I don't understand why the character is not being replaced by a space.
I've been working on this for hours. Any help would be much
appreciated.

Frank

Ben Morrow · Nov 29, 2006

Quoth Frank:
I'm trying to replace a character from an HTML file I got from Yahoo's
search engine cache. I'm reading in the file line by line and applying
the following regex:

if (utf8::is_utf8($html)) {
print "Is UTF8\n";
}

# Char that I copied and pasted directly from the HTML file
my $special = utf8::encode("Â");

if ($html =~ /$special /) {
print "TRUE:\n";
}

Just an idea: try adding

if ($html =~ /$special$special /) {
print "Whoops! There are two of them!\n";
}

here. You may also find the idiom

1 while $html =~ s/$special / /g;

useful, which keeps trying to do the replacement until the result
doesn't match.

Ben
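A minimal sketch of the "1 while s///g" idiom mentioned above. The sample string and variable names are made up for illustration. One global pass can leave matches behind when removing one match exposes another; repeating the pass until it fails cleans them all up.

```perl
use strict;
use warnings;

# Hypothetical sample: nested parentheses, where removing one "()"
# exposes a new "()" that the same pass has already moved beyond.
my $text = "a((()))b";

# A single global pass removes only the innermost pair.
my $once = $text;
$once =~ s/\(\)//g;

# Repeating the pass until nothing matches removes every pair.
my $all = $text;
1 while $all =~ s/\(\)//g;

print "$once\n";    # a(())b
print "$all\n";     # ab
```

The loop only terminates if the replacement text cannot itself match the pattern. If it can, every pass succeeds and the loop never ends.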

Brian McCauley · Nov 29, 2006

$html =~ s|$special | |g

I don't understand why the character is not being replaced by a space.

Perhaps because the thing in the LHS of the replacement operator is not
just the character, it's the character followed by a space.

Frank · Nov 29, 2006

Ben,

1 while $html =~ s/$special / /g;

This causes an infinite loop. The substitution is simply not being
made. I also tried replacing the copyright character © like so:

my $copy = "\x{00a9}";
if ($html =~ s|$copy|©|g) {
print "REPLACED COPYRIGHT\n";
}

but no replacement was made.

I also tried this:

$copy = utf8::encode("©");
if ($html =~ s|$copy|©|g) {
print "REPLACED COPYRIGHT\n";
}

and it changed the text

id="copy1"> Â© 2005 Farmer

to

id="copy1">©©©2005©Farmer

Notice the © char is still present, but all the spaces were replaced.
Very odd. I have a *lot* of experience with regex in Perl, and I've
never seen this before. Unfortunately, I have very little experience
with utf8 which I believe is at the core of this problem.

If anyone wants to take a look at the file I'm parsing, I've posted it
here:
http://www.cs.odu.edu/~fmccown/buy-online.html

I'm running Perl v5.8.3 on Redhat Linux.

Thanks,
Frank

Frank · Nov 29, 2006

When I posted my previous response using Google Groups, they apparently
changed all my "ampersand copy semicolon" parts into the copyright
symbol. The conversion of the string should look like this:

# Assume X is "ampersand copy semicolon"
id="copy1">X©X2005XFarmer

Thanks,
Frank

Ben Morrow · Nov 30, 2006

Quoth Frank:
Ben,

This causes an infinite loop. The substitution is simply not being
made.

Are you *sure*? Either it should match, or not; even if it fails to
match, it shouldn't loop. Try replacing it with something like

print "replaced '$1'" while $html =~ s/$special / /g;

I also tried replacing the copyright character © like so:

my $copy = "\x{00a9}";
if ($html =~ s|$copy|©|g) {

Don't use | as a delimiter for regexes: it's magic, so you will confuse
people (though not Perl).

print "REPLACED COPYRIGHT\n";
}

but no replacement was made.

I also tried this:

$copy = utf8::encode("©");

What did you hope to achieve by this? As a general rule, the only
functions you should use to deal with character sets are Encode::encode
and Encode::decode, and the corresponding :encoding PerlIO layer; encode
to convert characters into raw octets (suitable to be written to a
filehandle that has had binmode ':raw' applied to it), and decode to
convert back again. In particular, IMHO *all* the functions in the utf8
namespace should be regarded as perl-internal, and not for normal use.
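A rough sketch of the decode, work on characters, encode pattern described above. It assumes the input is UTF-8; the sample octets and variable names are made up for illustration and are not taken from the file in question.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: the UTF-8 octets for NO-BREAK SPACE and
# COPYRIGHT SIGN, followed by " 2005".
my $octets = "\xC2\xA0\xC2\xA9 2005";

# decode: raw octets -> Perl character string.
my $chars = decode('UTF-8', $octets);

# The regexes now see one character per code point, not two bytes.
$chars =~ s/\x{A0}/ /g;
$chars =~ s/\x{A9}/&copy;/g;

# encode: character string -> raw octets, ready for a raw filehandle.
my $out = encode('UTF-8', $chars);
print "$out\n";
```

If the data turns out to be Latin-1 rather than UTF-8, only the encoding name passed to decode would need to change.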

if ($html =~ s|$copy|©|g) {
print "REPLACED COPYRIGHT\n";
}

and it changed the text

id="copy1"> Â© 2005 Farmer

Is the A-circumflex I can see above actually present in your data, or is
it an artefact of Usenet, or of your terminal (what I see has
right-angle space A-circumflex copyright space two zero zero five
in the middle)?

to

id="copy1">©©©2005©Farmer

Notice the © char is still present, but all the spaces were replaced.
Very odd. I have a *lot* of experience with regex in Perl, and I've
never seen this before. Unfortunately, I have very little experience
with utf8 which I believe is at the core of this problem.

This is indeed odd. With both 5.8.8 and 5.8.0 (and a *lot* of utf8 bugs
were fixed in 5.8.1), and this script:

#!/usr/bin/perl -l

use warnings;
use strict;

use LWP::Simple qw/get/;

# This is because my terminal expects UTF8 output.
# The only difference it makes to the output is that without it the
# copyright symbol on the third line of output is displayed as an
# invalid character.

binmode \*STDOUT, ':encoding(utf8)';

$_ = get 'http://www.cs.odu.edu/~fmccown/buy-online.html';
my $copy = "\xa9";

/(> .* 2005)/x and print join ' ', map ord, split //, $1;

print "matches before s///" if /$copy/;
/> (.*? $copy .*?) </x and print $1;

print "replaced a copyright" while s/$copy/&copy;/;

print "matches after s///" if /$copy/;
/> (.*? &copy; .*?) </x and print $1;

__END__

I get

62 32 169 32 50 48 48 53
matches before s///
© 2005 Farmer India. All rights reserved.
replaced a copyright
&copy; 2005 Farmer India. All rights reserved.

which is what I would have expected. How are you getting the data into
Perl? Perhaps you don't have what you think you have for some reason. If
you are using LWP, which version (I am using 5.803)? Can you try the
script above, and see what you get?

Ben
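The "map ord, split //" trick in the script above is a quick way to see what a string really contains. A small sketch, with a hypothetical helper name and made-up sample strings, showing how the same visible text looks as undecoded UTF-8 octets and as characters:

```perl
use strict;
use warnings;

# Hypothetical helper: list the ordinal value of every element of a string.
sub codepoints {
    my ($string) = @_;
    return join ' ', map { ord } split //, $string;
}

# "> (c) 2005" held as undecoded UTF-8 octets: the copyright sign is two bytes.
my $bytes = "> \xC2\xA9 2005";

# The same text held as characters: the copyright sign is one code point.
my $chars = "> \xA9 2005";

print codepoints($bytes), "\n";    # 62 32 194 169 32 50 48 48 53
print codepoints($chars), "\n";    # 62 32 169 32 50 48 48 53
```

A stray 194 in front of 169 or 160 is a hint that the string still holds UTF-8 octets rather than decoded characters.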

Frank · Nov 30, 2006

Ben,

Thanks for your help. I was able to locate 3 problems with what I was
doing:

1) Reading the file into Perl using

open(F, "<:utf8", $fn);

was the first problem. Reading it in normally was what I should have
been doing and was doing originally, but I had added this later when
trying to figure out what was going on.

2) I should have replaced the copyright like you did:

my $copy = "\xa9";
$html =~ s/$copy/C/g

3) This still ran in an infinite loop:

my $special = utf8::encode("Â");
print "sub\n" while ($html =~ s/$special / /g);

but when I corrected it to the following, the substitution worked fine:

my $special = "\xa0";
print "sub\n" while ($html =~ s/$special/ /g);

Thanks again,
Frank
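One plausible reason an A-circumflex shows up next to a non-breaking space or a copyright sign: in UTF-8, U+00A0 is the two octets C2 A0, and octet C2 read as Latin-1 is "Â". The sketch below only demonstrates that relationship. Whether it is exactly what happened with this file is an assumption.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# NO-BREAK SPACE as a single character.
my $nbsp = "\x{A0}";

# Its UTF-8 encoding is two octets.
my $octets = encode('UTF-8', $nbsp);
my $hex = join ' ', map { sprintf '%02X', ord } split //, $octets;
print "$hex\n";    # C2 A0

# Misreading those octets as Latin-1 yields two characters:
# U+00C2 (A-circumflex) followed by U+00A0.
my $misread = decode('ISO-8859-1', $octets);
my $seen = join ' ', map { sprintf 'U+%04X', ord } split //, $misread;
print "$seen\n";   # U+00C2 U+00A0
```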

Dr.Ruud · Nov 30, 2006

Frank wrote:

3) This still ran in an infinite loop:

my $special = utf8::encode("Â");
print "sub\n" while ($html =~ s/$special / /g);

but when I corrected it to the following, the substitution worked
fine:

my $special = "\xa0";
print "sub\n" while ($html =~ s/$special/ /g);

Do you have anything special like "use utf8;" in your source?

Frank · Dec 1, 2006

Dr.Ruud said:
Frank wrote:

Do you have anything special like "use utf8;" in your source?

No, that's not in there. I think the use of utf8::encode("Â") is
doing something flaky, and since I've gotten it to work, I'm not that
concerned about the problem.

Thanks,
Frank

anno4000 · Dec 1, 2006

Frank said:
Ben,

Thanks for your help. I was able to locate 3 problems with what I was
doing:

1) Reading the file into Perl using

open(F, "<:utf8", $fn);

was the first problem. Reading it in normally was what I should have
been doing and was doing originally, but I had added this later when
trying to figure out what was going on.

2) I should have replaced the copyright like you did:

my $copy = "\xa9";
$html =~ s/$copy/C/g

3) This still ran in an infinite loop:

my $special = utf8::encode("Â");
print "sub\n" while ($html =~ s/$special / /g);

Ha! You're not running under warnings and you didn't read the
pertinent documentation. $special is undefined after this statement.
See perldoc utf8:

* utf8::encode($string)
Converts in-place the character sequence to the corresponding octet
sequence in UTF-X. The UTF-8 flag is turned off. Returns nothing.

That also explains the infinite loop.

Anno
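A short sketch of the behaviour quoted from perldoc utf8 above. The variable names are made up. utf8::encode changes its argument in place and returns nothing, so assigning its result leaves the target undefined. An undefined value interpolates into a regex as an empty string (with a warning under "use warnings"), which turns the pattern into a plain space being replaced by a space.

```perl
use strict;
use warnings;

# utf8::encode works in place: $tmp becomes the two octets C2 A9.
my $tmp = "\x{A9}";
my $special = utf8::encode($tmp);

print length($tmp), "\n";                             # 2
print defined $special ? "defined" : "undef", "\n";   # undef

# Stand-in for the undefined $special, using '' to avoid the warning.
my $prefix = '';
my $html   = "a b";

# Replacing " " with " " succeeds on every pass, so an unbounded
# "1 while" loop would never stop. Three passes are enough to show it.
my $still_matching = 0;
for (1 .. 3) {
    $still_matching++ if $html =~ s/$prefix / /g;
}
print "$still_matching\n";                            # 3
```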

Frank · Dec 1, 2006

my $special = utf8::encode("Â");

Ha! You're not running under warnings and you didn't read the
pertinent documentation. $special is undefined after this statement.
See perldoc utf8:

* utf8::encode($string)
Converts in-place the character sequence to the corresponding octet
sequence in UTF-X. The UTF-8 flag is turned off. Returns nothing.

That also explains the infinite loop.

Anno

Nice. It's usually the most obvious explanations that get overlooked.

Frank
