Intermittent Character Encoding Issues

David Murray-Rust

Hi all,

Please excuse the long post, but this seems to be a subtle bug, which
I have been attacking for a while.

I'm having a problem with character encodings in perl 5.8. The overall
effect is that certain characters, in particular the UK pound symbol,
are turned into two characters, generally a capital A circumflex
followed by the correct character. This would appear to be a simple
character encoding issue, but there are a few caveats:

- It only happens on one machine. Taking a disk image of the OS and
running it on different hardware results in a system without the
problem.

- It can be intermittent. Two separate instances of (apparently) the
same problem have been found. The first happened about 1% of the
time the code was run. The second happened every time the code was
run.


More Detail:

The application is a web-based content management system, running
under Apache/mod_perl with a MySQL back end. The machine in question
is running Slackware 9, Perl 5.8.1 and kernel 2.4.20.

The precise nature of the bug is that a character represented by \243
(163 decimal) in the iso-8859-1 character set is replaced by two
octets, \302\243, in some places. It appears that perl is converting
the data to a unicode representation and forgetting that it has done
this.
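
(For reference, \302\243 is exactly what the pound sign looks like once
its Latin-1 code point has been written out as UTF-8; a quick sketch
using the core Encode module produces the same two octets:)

use strict;
use warnings;
use Encode qw(encode);

my $pound  = "\xA3";                    # \243 -- the pound sign in iso-8859-1
my $octets = encode("UTF-8", $pound);   # the same character as UTF-8

printf "%vo\n", $octets;                # prints 302.243, i.e. \302\243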

The first version of the bug was that after the line:

$contentList = [ join '', @$contentList ] unless $separate;

certain characters in the entries in @$contentList would be changed to
two-byte versions. This only happened about 1% of the time this code
was run. Changing the above line to:

unless( $separate )
{
    my $tmp = "";
    foreach my $contentBit ( @$contentList )
    {
        $tmp .= $contentBit;
    }
    $contentList = [ $tmp ];
}

made the problem go away. In this case, the data comes directly from
the mysql database. It has been verified that the string is encoded
correctly up until that line, and wrongly afterwards.
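
(A diagnostic dropped in just before that line, along these lines, should
show whether any entry -- or the joined result -- carries Perl's internal
UTF8 flag. This is only a sketch, not part of the actual CMS code:)

use Encode qw(is_utf8);
use Devel::Peek qw(Dump);

for my $bit (@$contentList) {
    warn is_utf8($bit) ? "entry is flagged as characters\n"
                       : "entry is plain bytes\n";
}
my $joined = join '', @$contentList;
warn is_utf8($joined) ? "joined result is flagged as characters\n"
                      : "joined result is plain bytes\n";
Dump($joined);    # shows the SV flags and the raw internal buffer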


In the second version of the bug, the line:

return $return . $parent;

resulted in a string being returned where all the pound signs in
$return had been altered. If a string other than $parent is appended
instead, there is no problem. The current solution is:

my $tmpParent = encode( "iso-8859-1", $parent );
return $return . $tmpParent;

NOTE: the characters which are altered are those in $return, while the
string whose encoding I am playing with is $parent.

In this case, there is data in $parent which comes via CGI, so I would
be able to believe an explanation along the lines of "$parent is
magically recognised as utf8, so when it is added to $return, $return
is converted to utf8 octets before they are joined", but I would find
this quite counter-intuitive, since as I understand things perl uses
its own internal representation for strings, and should only need to
convert on the way in or out.
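
(If that is what is happening, the mechanism is at least easy to
reproduce in isolation. A rough sketch, with made-up values rather than
our real data, using utf8::upgrade to stand in for whatever marks the
real $parent as characters:)

use strict;
use warnings;
use Devel::Peek qw(Dump);

my $return = "\xA3100";    # byte string; \243 is the pound sign
my $parent = "<p/>";
utf8::upgrade($parent);    # force the UTF8-flagged (character) form

my $joined = $return . $parent;   # bytes . characters => characters

# $joined still *means* the same five characters, but its internal
# buffer now holds \302\243 for the pound sign; anything that later
# reads that buffer as raw octets will see the two-byte form.
Dump($joined);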

With respect to machine dependence, it happens on only one machine
which is running our software. To create a test platform, we took a
disk image of the system partition, loaded it onto a new machine and
compiled a new kernel which differed only in network card support.
This new machine did not exhibit the problem. As we were originally
running perl 5.8.0, we tried upgrading to 5.8.1, but this had no
effect.

So, to sum up,

- Can anyone explain what is going on here: the intermittent
occurrences, the machine dependence and the general behaviour?

- Can anyone suggest a way to avoid these problems?

(For the record, I've read the perldocs for perlunicode and utf8, lurked
for a while, read the Google archives and read a fair amount about
character encodings.)

Thanks to anyone who's made it this far for your time,
Dave Murray-Rust
 

Ben Morrow

David Murray-Rust said:
The first version of the bug was that after the line:

$contentList = [ join '', @$contentList ] unless $separate;

certain characters in the entries in @$contentList would be changed to
two-byte versions. This only happened about 1% of the time this code
was run. Changing the above line to:

unless( $separate )
{
    my $tmp = "";
    foreach my $contentBit ( @$contentList )
    {
        $tmp .= $contentBit;
    }
    $contentList = [ $tmp ];
}

made the problem go away. In this case, the data comes directly from
the mysql database. It has been verified that the string is encoded
correctly up until that line, and wrongly afterwards.

How perl stores the data internally should be considered none of your
business. (It is in fact either iso8859-1 or utf8 on ASCII machines,
with a flag set on each scalar to say which. It is easier, however, to
regard a text string as being a set of Unicode characters, and not
worry about how they are represented.) However, it may be that how it
is stored in your mysql database is confusing perl, if the code you
are using to interface to the database doesn't correctly decode the
data into perl's own encoding. In particular, if you use iso8859-1 you
may get bitten far more irregularly than if you use other encodings.
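
(To illustrate the "none of your business" point: the same characters
can be stored either way internally and still compare equal. A tiny
sketch:)

my $bytes_form = "caf\xE9";      # stored as iso8859-1 octets, flag off
my $chars_form = "caf\xE9";
utf8::upgrade($chars_form);      # now stored as utf8 internally, flag on

print $bytes_form eq $chars_form ? "equal\n" : "different\n";   # equal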

Decide on how you are going to encode text in the database: I
shall assume you wish to use iso8859-1. Now, every piece of textual
(as opposed to binary) data you write into the database should first
be converted from a sequence of characters into a sequence of octets,
using Encode::encode; and every piece of textual data should be
converted from octets back into character data using
Encode::decode. So, in the example above, you would write:

use Encode qw(decode);

my $tmp = "";
foreach my $contentBit (@$contentList) {
    $tmp .= decode "iso8859-1", $contentBit;
}
$contentList = [ $tmp ];

(assuming you didn't decode it closer to where it was read from the
database).
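
For the database side, do the decode as close to the DBI call as
possible; something like this (an untested sketch -- the connection
details, table and column names are all made up):

use strict;
use warnings;
use DBI;
use Encode qw(decode);

my $dbh = DBI->connect("dbi:mysql:cms", "user", "password",
                       { RaiseError => 1 });

my $page_id = 42;    # whatever page is being built
my $rows    = $dbh->selectall_arrayref(
    "SELECT body FROM content WHERE page_id = ?", undef, $page_id);

# Turn the octets coming out of MySQL into Perl characters immediately,
# so the rest of the code only ever handles character strings.
my $contentList = [ map { decode("iso8859-1", $_->[0]) } @$rows ];
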
In the second version of the bug, the line:

return $return . $parent;

resulted in a string being returned where all the pound signs in
$return had been altered. If a different string to $parent is
appended, there is no problem.

So what does $parent contain, which causes this problem? And what is
the result of
use Encode qw/is_utf8/;
warn is_utf8($parent) ?
    "\$parent is chars internally" :
    "\$parent is bytes internally";

?

The current solution is:

my $tmpParent = encode( "iso-8859-1", $parent );
return $return . $tmpParent;

This is almost certainly Wrong, as $tmpParent will here be considered
to be a string of octets rather than a sequence of characters. The
Right Answer is to make sure $return is considered to be a sequence of
characters as well.
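
That is, rather than encoding $parent down to octets, decode $return up
to characters at the point where it is built; something like this
(assuming $return is assembled from iso8859-1 octets that came from the
database):

use Encode qw(decode);

# Make $return a character string, so both operands of the
# concatenation are characters and no implicit upgrade is needed.
$return = decode("iso8859-1", $return);
return $return . $parent;
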
In this case, there is data in $parent which comes via CGI, so I would
be able to believe an explanation along the lines of "$parent is
magically recognised as utf8, so when it is added to $return, $return
is converted to utf8 octets before they are joined", but I would find
this quite counter-intuitive, since as I understand things perl uses
its own internal representation for strings, and should only need to
convert on the way in or out.

Yup. However, if the module you are using to talk to the database
and/or Apache hasn't been upgraded to 5.8 yet you will have to do
those conversions 'at the borders' by hand. Pushing an :encoding layer
onto your filehandles, perhaps with the 'open' pragma, may help
automate this; although you are using mod_perl, which relies on tied
filehandles: I don't know how well these play with PerlIO layers as
yet. You may want to write a custom 'print', 'readline' &c. that runs
all input through 'decode' and all output through 'encode'.
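
The filehandle side would look something like this (untested under
mod_perl, for the reason above):

# Per handle: push an encoding layer so the conversion happens at the border.
binmode STDOUT, ":encoding(iso8859-1)";
binmode STDIN,  ":encoding(iso8859-1)";

# Or, for every handle opened in the current lexical scope:
use open IO => ":encoding(iso8859-1)";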

Another thing to watch out for is that if any of your locale variables
(LANG, LC_ALL, etc.) match /utf-?8/i then perl will assume all IO will
be in UTF8 until you disillusion it. This feature has been removed in
5.8.1, though, so it shouldn't be affecting your problem.

An alternative solution, if you can afford to treat all data as
'binary' rather than 'textual', is simply to put

use bytes;

at the top of every file :).

Ben
 

David Murray-Rust

[ this is a repost of a response which I accidentally emailed to Ben.
Sorry Ben! ]

How perl stores the data internally should be considered none of your
business.

Amen to that. I would far rather not need to know ;)

However, it may be that how it
is stored in your mysql database is confusing perl, if the code you
are using to interface to the database doesn't correctly decode the
data into perl's own encoding. In particular, if you use iso8859-1 you
may get bitten far more irregularly than if you use other encodings.

I agree with this, except that the data from the database has been
concatenated, regexed etc. with no problem, before a seemingly
innocent line causes problems.

[ snip good advice about dealing with encodings on the way into and
out of the database ]

So what does $parent contain, which causes this problem? And what is
the result of
use Encode qw/is_utf8/;
warn is_utf8($parent) ?
    "\$parent is chars internally" :
    "\$parent is bytes internally";

Ah. Here I find that $parent is characters, while $return is bytes.
This sort of explains things, except that it would mean that:

- perl sees a sequence of characters and a sequence of bytes being
concatenated
- it converts the bytes to characters
- it then concatenates two character sequences
- it then forgets that this is now a character sequence, and treats
the result as bytes.

This does not seem like good behaviour - I'd be tempted to suggest
it's a bug.
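
(Just to convince myself that the first three steps really do happen, a
little test script -- not our code -- shows the upgrade, and shows what
the result looks like to anything that works with byte semantics:)

use strict;
use warnings;
use Encode qw(is_utf8);

my $bytes = "\xA3";      # the pound sign, as a byte string
my $chars = "x";
utf8::upgrade($chars);   # a string flagged as characters

my $result = $bytes . $chars;

print is_utf8($result) ? "result is characters\n" : "result is bytes\n";
print length($result), "\n";      # 2 -- two characters, as expected

{
    use bytes;                    # anything with byte semantics sees...
    print length($result), "\n";  # 3 -- ...\302\243 plus the "x"
}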

Further, what would lead perl to treat one set of data as characters
and one set as bytes? Both strings are valid XML fragments (built up
using data from CGI). The $parent string which is treated as
characters contains only [A-Za-z0-9<>-='"/? ], so I can't see any
reason for perl to suddenly decide it needs to be character data.
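
(In the meantime I can at least track down where the flag gets switched
on by checking it at each stage $parent passes through -- a rough
sketch, with a made-up helper name:)

use Carp qw(cluck);
use Encode qw(is_utf8);

# Call this after each step that touches $parent; the first call that
# warns is where the string was upgraded to characters.
sub check_flag {
    my ($label, $string) = @_;
    cluck "$label is flagged as characters" if is_utf8($string);
}

check_flag('fresh from CGI', $parent);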

(As a side note, your snippet implies that the is_utf8 flag indicates
whether the data is to be treated as characters or as bytes, rather
than indicating whether or not it is characters in the utf8 character
set - could you clarify?)

This is almost certainly Wrong, as $tmpParent will here be considered
to be a string of octets rather than a sequence of characters. The
Right Answer is to make sure $return is considered to be a sequence of
characters as well.

Yes, that makes sense ;)

Yup. However, if the module you are using to talk to the database
and/or Apache hasn't been upgraded to 5.8 yet you will have to do
those conversions 'at the borders' by hand.

I will have a look into the modules we're using.

Another thing to watch out for is that if any of your locale variables
(LANG, LC_ALL, etc.) match /utf-?8/i then perl will assume all IO will
be in UTF8 until you disillusion it. This feature has been removed in
5.8.1, though, so it shouldn't be affecting your problem.

Yup, that's why I upgraded :)

An alternative solution, if you can afford to treat all data as
'binary' rather than 'textual', is simply to put

use bytes;

at the top of every file :).

Except that as soon as I did that, someone would decide we needed
unicode support ;)

Thanks for your help,
dave
 
