Quoth "Frank":
Ben,
This causes an infinite loop. The substitution is simply not being
made.
Are you *sure*? Either it should match, or not; even if it fails to
match, it shouldn't loop. Try replacing it with something like
print "replaced '$1'" while $html =~ s/($special)/ /g;
I also tried replacing the copyright character © like so:
my $copy = "\x{00a9}";
if ($html =~ s|$copy|&copy;|g) {
Don't use | as a delimiter for regexes: it's magic, so you will confuse
people (though not Perl).
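For instance, the same substitution could be written with brace delimiters. This is only a sketch: the sample string in $html is made up for illustration, and \Q...\E is added here as a precaution so the variable is matched literally.

```perl
use strict;
use warnings;

# Hypothetical sample data; in the real script $html holds the fetched page.
my $html = "id=\"copy1\"> \x{a9} 2005 Farmer";
my $copy = "\x{a9}";

# Braces as delimiters leave | free to mean alternation inside the pattern.
# The copy-and-substitute idiom keeps the original string untouched.
(my $out = $html) =~ s{\Q$copy\E}{&copy;}g;

print "$out\n";
```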
print "REPLACED COPYRIGHT\n";
}
but no replacement was made.
I also tried this:
$copy = utf8::encode("©");
What did you hope to achieve by this? As a general rule, the only
functions you should use to deal with character sets are Encode::encode
and Encode::decode, and the corresponding :encoding PerlIO layer; encode
to convert characters into raw octets (suitable to be written to a
filehandle that has had binmode ':raw' applied to it), and decode to
convert back again. In particular, IMHO *all* the functions in the utf8
namespace should be regarded as perl-internal, and not for normal use.
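A minimal sketch of that round trip, assuming the data is UTF-8 (the variable names are only illustrative):

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Character string: a single character, U+00A9 COPYRIGHT SIGN.
my $chars = "\x{a9}";

# encode: characters -> raw octets. In UTF-8 this character is two bytes, C2 A9.
my $octets = encode('UTF-8', $chars);
printf "%d character(s), %d octet(s)\n", length $chars, length $octets;

# decode: raw octets -> characters again.
my $back = decode('UTF-8', $octets);
print "round trip ok\n" if $back eq $chars;
```

The octets are what belongs on a ':raw' filehandle; regexes should normally be run against the decoded characters.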
if ($html =~ s|$copy|&copy;|g) {
print "REPLACED COPYRIGHT\n";
}
and it changed the text
id="copy1"> Â© 2005 Farmer
Is the A-circumflex I can see above actually present in your data, or is
it an artefact of Usenet, or of your terminal (what I see has
right-angle space A-circumflex copyright space two zero zero five
in the middle)?
to
id="copy1">&copy;©&copy;2005&copy;Farmer
Notice the © char is still present, but all the spaces were replaced.
Very odd. I have a *lot* of experience with regex in Perl, and I've
never seen this before. Unfortunately, I have very little experience
with utf8 which I believe is at the core of this problem.
This is indeed odd. With both 5.8.8 and 5.8.0 (and a *lot* of utf8 bugs
were fixed in 5.8.1), and this script:
#!/usr/bin/perl -l
use warnings;
use strict;
use LWP::Simple qw/get/;
# This is because my terminal expects UTF8 output.
# The only difference it makes to the output is that without it the
# copyright symbol on the third line of output is displayed as an
# invalid character.
binmode \*STDOUT, ':encoding(utf8)';
$_ = get 'http://www.cs.odu.edu/~fmccown/buy-online.html';
my $copy = "\xa9";
/(> .* 2005)/x and print join ' ', map ord, split //, $1;
print "matches before s///" if /$copy/;
/> (.*? $copy .*?) </x and print $1;
print "replaced a copyright" while s/$copy/&copy;/;
print "matches after s///" if /$copy/;
/> (.*? &copy; .*?) </x and print $1;
__END__
I get
62 32 169 32 50 48 48 53
matches before s///
© 2005 Farmer India. All rights reserved.
replaced a copyright
&copy; 2005 Farmer India. All rights reserved.
which is what I would have expected. How are you getting the data into
Perl? Perhaps you don't have what you think you have for some reason. If
you are using LWP, which version (I am using 5.803)? Can you try the
script above, and see what you get?
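One way to check what you actually have is to dump the ordinals, as the script does with map ord. The sketch below uses a hypothetical helper, ords(), and two made-up sample strings. If 194 (an A-circumflex in Latin-1) turns up in front of 169, the string probably still holds raw UTF-8 octets rather than decoded characters.

```perl
use strict;
use warnings;

# Hypothetical helper: list the ordinal of every character in a string.
sub ords { join ' ', map { ord($_) } split //, $_[0] }

# Made-up samples: the same text as decoded characters and as raw UTF-8 octets.
my $as_chars  = "> \x{a9} 2005";
my $as_octets = "> \xc2\xa9 2005";

print ords($as_chars),  "\n";   # 62 32 169 32 50 48 48 53
print ords($as_octets), "\n";   # 62 32 194 169 32 50 48 48 53
```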
Ben