David Murray-Rust
Hi all,
Please excuse the long post, but this seems to be a subtle bug, which
I have been attacking for a while.
I'm having a problem with character encodings in perl 5.8. The overall
effect is that certain characters, in particular the UK pound symbol,
are turned into two characters, generally a capital A circumflex
followed by the correct character. This would appear to be a simple
character encoding issue, but there are a few caveats:
- It only happens on one machine. Taking a disk image of the OS and
running it on different hardware results in a system without the
problem.
- It can be intermittent. Two separate instances of (apparently) the
same problem have been found. The first happened about 1% of the
time the code was run. The second happened every time the code was
run.
More Detail:
The application is a web-based content management system, running
under Apache/mod_perl with a MySQL back end. The machine in question
is running Slackware 9, Perl 5.8.1 and kernel 2.4.20.
The precise nature of the bug is that a character represented by \243
(163 decimal) in the iso-8859-1 character set is replaced by two
octets, \302\243, in some places. It appears that perl is converting
the data to a unicode representation and forgetting that it has done
this.
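A minimal sketch of the byte sequences involved, assuming the Encode
module that ships with perl 5.8 (the variable names are only for
illustration): encoding the pound sign as UTF-8 gives exactly the two
octets seen in the bug.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The pound sign as a single ISO-8859-1 octet, \243 (0xA3).
my $latin1_octets = "\243";

# Decode the octet to a character, then encode that character as UTF-8.
my $chars       = decode("iso-8859-1", $latin1_octets);
my $utf8_octets = encode("utf8", $chars);

# Render each octet in octal to compare with the values in the report.
my $octal = join '', map { sprintf "\\%03o", ord($_) } split //, $utf8_octets;
print "$octal\n";   # prints \302\243
```

Viewed as iso-8859-1, \302 is a capital A circumflex, which matches the
two characters that show up on the page.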
The first version of the bug was that after the line:
    $contentList = [ join '', @$contentList ] unless $separate;
certain characters in the entries in @$contentList would be changed to
two-byte versions. This only happened about 1% of the time this code
was run. Changing the above line to be:
    unless( $separate )
    {
        my $tmp = "";
        foreach my $contentBit ( @$contentList )
        {
            $tmp .= $contentBit;
        }
        $contentList = [ $tmp ];
    }
made the problem go away. In this case, the data comes directly from
the mysql database. It has been verified that the string is encoded
correctly up until that line, and wrongly afterwards.
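For anyone wanting to repeat that check, a sketch of the kind of helper
that can be used is below. describe_string is a hypothetical name, not
part of the application; it reports whether perl has the internal UTF-8
flag set on a string and which octets are actually stored.

```perl
use strict;
use warnings;

# Hypothetical debugging helper: report the internal UTF-8 flag and the
# octets perl stores for a string, in octal.
sub describe_string {
    my ($str) = @_;
    my $flag = utf8::is_utf8($str) ? "utf8" : "bytes";
    # Work on a copy; for a flagged string, utf8::encode exposes the
    # UTF-8 octets of the internal representation.
    my $copy = $str;
    utf8::encode($copy) if utf8::is_utf8($copy);
    my $octets = join ' ', map { sprintf "%03o", ord($_) } split //, $copy;
    return "$flag: $octets";
}

my $plain    = "\243";      # one ISO-8859-1 octet
my $upgraded = $plain;
utf8::upgrade($upgraded);   # same character, stored internally as UTF-8

print describe_string($plain), "\n";      # bytes: 243
print describe_string($upgraded), "\n";   # utf8: 302 243
```

Note that $plain eq $upgraded is still true in perl, so a comparison at
the perl level will not show the difference; only the flag and the
stored octets differ.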
In the second version of the bug, the line:
    return $return . $parent;
resulted in a string being returned where all the pound signs in
$return had been altered. If a different string to $parent is
appended, there is no problem. The current solution is:
    my $tmpParent = encode( "iso-8859-1", $parent );
    return $return . $tmpParent;
NOTE: the characters which are altered are those in $return, while the
string whose encoding I am playing with is in $parent.
In this case, there is data in $parent which comes via CGI, so I would
be able to believe an explanation along the lines of "$parent is
magically recognised as utf8, so when it is added to $return, $return
is converted to utf8 octets before they are joined", but I would find
this quite counterintuitive, since as I understand things perl uses
its own internal representation for strings, and should only need to
convert on the way in or out.
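A small sketch of that hypothesis, under the assumption that $parent
arrives with perl's internal UTF-8 flag set (simulated here with
decode; the sample strings are made up). It is not taken from the
application, and whether the web or database layer really reads the
internal octets is a guess.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# $return: ISO-8859-1 octets, as if fetched from the database.
my $return = "Cost: \243";

# $parent: a character string with the UTF-8 flag on (assumption:
# this is what the CGI data looks like after some layer decoded it).
my $parent = decode("utf8", "caf\303\251");

# Joining a byte string with a flagged string upgrades the result.
my $joined = $return . $parent;
print "joined flagged: ", (utf8::is_utf8($joined) ? "yes" : "no"), "\n";

# The characters are unchanged, but the stored octets now hold
# \302\243 for the pound sign.
my $raw = $joined;
utf8::encode($raw);
print "internal octets contain \\302\\243\n" if index($raw, "\302\243") >= 0;

# The workaround from the post: turn $parent back into octets first,
# so the concatenation stays a plain byte string.
my $fixed = $return . encode("iso-8859-1", $parent);
print "fixed flagged: ", (utf8::is_utf8($fixed) ? "yes" : "no"), "\n";
```

If this is what happens, it would explain why the damage appears in
$return even though only $parent carries the flag.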
With respect to machine dependence, it happens on only one machine
which is running our software. To create a test platform, we took a
disk image of the system partition, loaded it onto a new machine and
compiled a new kernel which differed only in network card support.
This new machine did not exhibit the problem. As we were originally
running perl 5.8.0, we tried upgrading to 5.8.1, but this had no
effect.
So, to sum up:
- Can anyone explain what is going on here: the intermittent
occurrences, the machine dependence and the general behaviour?
- Can anyone suggest a way to avoid these problems?
(For the record, I've read the perldoc on perlunicode and utf8, lurked
for a while, read the Google archives and read a fair amount on
character encodings.)
Thanks to anyone who's made it this far for your time,
Dave Murray-Rust