Intermittent Character Encoding Issues

David Murray-Rust · Nov 4, 2003

Hi all,

Please excuse the long post, but this seems to be a subtle bug, which
I have been attacking for a while.

I'm having a problem with character encodings in Perl 5.8. The overall
effect is that certain characters, in particular the UK pound symbol,
are turned into two characters, generally a capital A circumflex
followed by the correct character. This would appear to be a simple
character encoding issue, but there are a few caveats:

- It only happens on one machine. Taking a disk image of the OS and
running it on different hardware results in a system without the
problem.

- It can be intermittent. Two separate instances of (apparently) the
same problem have been found. The first happened about 1% of the
time the code was run. The second happened every time the code was
run.

More Detail:

The application is a web-based content management system, running
under Apache/mod_perl with a MySQL back end. The machine in question
is running Slackware 9, Perl 5.8.1 and kernel 2.4.20.

The precise nature of the bug is that a character represented by \243
(163 decimal) in the iso-8859-1 character set is replaced by two
octets, \302\243, in some places. It appears that perl is converting
the data to a unicode representation and forgetting that it has done
this.
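
As an illustration of the byte pattern described above, here is a
minimal sketch using the core Encode module. The helper name
octal_dump is made up for this example. It shows that the pound sign
is one octet in iso-8859-1 and two octets in UTF-8:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render each octet of a byte string as \NNN octal.
sub octal_dump {
    my ($octets) = @_;
    return join ' ', map { sprintf('\\%03o', ord($_)) } split //, $octets;
}

# U+00A3 POUND SIGN as a one-character Perl string.
my $pound = "\x{A3}";

my $latin1 = encode("iso-8859-1", $pound);   # characters -> iso-8859-1 octets
my $utf8   = encode("UTF-8",      $pound);   # characters -> UTF-8 octets

print "iso-8859-1: ", octal_dump($latin1), "\n";   # \243
print "UTF-8:      ", octal_dump($utf8),   "\n";   # \302 \243
```

The \302 octet, read as iso-8859-1, is the capital A circumflex that
shows up in front of the pound sign.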

The first version of the bug was that after the line:

$contentList = [ join '', @$contentList ] unless $separate;

certain characters in the entries in @$contentList would be changed to
two-byte versions. This only happened about 1% of the time this code
was run. Changing the above line to be:

unless( $separate )
{
    my $tmp = "";
    foreach my $contentBit ( @$contentList )
    {
        $tmp .= $contentBit;
    }
    $contentList = [ $tmp ];
}

made the problem go away. In this case, the data comes directly from
the mysql database. It has been verified that the string is encoded
correctly up until that line, and wrongly afterwards.

In the second version of the bug, the line:

return $return . $parent;

resulted in a string being returned where all the pound signs in
$return had been altered. If a different string to $parent is
appended, there is no problem. The current solution is:

my $tmpParent = encode( "iso-8859-1", $parent );
return $return . $tmpParent;

NOTE: the characters which are altered are those in $return, while the
string whose encoding I am playing with is in $parent.

In this case, there is data in $parent which comes via CGI, so I would
be able to believe an explanation along the lines of "$parent is
magically recognised as utf8, so when it is added to $return, $return
is converted to utf8 octets before they are joined", but I would find
this quite counter intuitive, since as I understand things perl uses
its own internal representation for strings, and should only need to
convert on the way in or out.
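
The hypothesis above can be probed with a small sketch. It assumes
that $parent carries perl's internal UTF-8 flag, which is simulated
here with utf8::upgrade; the sample strings are invented. The
concatenation changes the internal storage of the result but not its
logical characters:

```perl
use strict;
use warnings;
use bytes ();   # loads bytes::length without enabling the pragma

# Invented sample data: $return is a plain byte string holding \xA3.
my $return = "Cost: \xA3" . "5";
my $parent = "<parent/>";
utf8::upgrade($parent);   # assumption: stand-in for a flagged CGI string

my $joined = $return . $parent;

printf "return flagged: %s\n", utf8::is_utf8($return) ? "yes" : "no";
printf "parent flagged: %s\n", utf8::is_utf8($parent) ? "yes" : "no";
printf "joined flagged: %s\n", utf8::is_utf8($joined) ? "yes" : "no";

# Same number of characters, but one more octet of internal storage,
# because \xA3 is now held as two octets inside perl.
printf "characters: %d, internal octets: %d\n",
    length($joined), bytes::length($joined);
```

So the pound sign in $return is stored differently after the join,
even though perl still regards it as a single character.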

With respect to machine dependence, it happens on only one machine
which is running our software. To create a test platform, we took a
disk image of the system partition, loaded it onto a new machine and
compiled a new kernel which differed only in network card support.
This new machine did not show the problem. As we were originally
running perl 5.8.0, we tried upgrading to 5.8.1, but this had no
effect.

So, to sum up,

- Can anyone explain what is going on here: the intermittent
occurrences, the machine dependence and the general behaviour?

- Can anyone suggest a way to avoid these problems?

(For the record, I've read the perldoc on perlunicode and utf8, lurked
for a while, read google archives and read a fair amount on character
encodings)

Thanks to anyone who's made it this far for your time,
Dave Murray-Rust

Ben Morrow · Nov 4, 2003

David Murray-Rust said:
The first version of the bug was that after the line:

$contentList = [ join '', @$contentList ] unless $separate;

certain characters in the entries in @$contentList would be changed to
two-byte versions. This only happened about 1% of the time this code
was run. Changing the above line to be:

unless( $separate )
{
    my $tmp = "";
    foreach my $contentBit ( @$contentList )
    {
        $tmp .= $contentBit;
    }
    $contentList = [ $tmp ];
}

made the problem go away. In this case, the data comes directly from
the mysql database. It has been verified that the string is encoded
correctly up until that line, and wrongly afterwards.

How perl stores the data internally should be considered none of your
business. (It is in fact either iso8859-1 or utf8 on ASCII machines,
with a flag set on each scalar to say which. It is easier, however, to
regard a text string as being a set of Unicode characters, and not
worry about how they are represented.) However, it may be that how it
is stored in your mysql database is confusing perl, if the code you
are using to interface to the database doesn't correctly decode the
data into perl's own encoding. In particular, if you use iso8859-1 you
may get bitten far more irregularly than if you use other encodings.

Decide on how you are going to encode text in the database: I
shall assume you wish to use iso8859-1. Now, every piece of textual
(as opposed to binary) data you write into the database should first
be converted from a sequence of characters into a sequence of octets,
using Encode::encode; and every piece of textual data you read back
out should be converted from octets into character data using
Encode::decode. So, in the example above, you would write:

use Encode qw/decode/;

my $tmp = "";
foreach my $contentBit (@$contentList) {
    # octets from the database -> perl characters
    $tmp .= decode "iso8859-1", $contentBit;
}
$contentList = [ $tmp ];

(assuming you didn't decode it closer to where it was read from the
database).
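
A minimal sketch of the full round trip described above, with an
invented value standing in for a column fetched from the database (no
real database call is made). It decodes on the way in, works on
characters, and encodes on the way out:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Invented stand-in for iso-8859-1 octets read from the database.
my $from_db = "\xA3" . "5.00";

# Border in: octets -> characters.
my $text = decode("iso-8859-1", $from_db);

# Inside the program, work only on characters.
$text .= " each";

# Border out: characters -> octets, ready to be written back.
my $to_db = encode("iso-8859-1", $text);

printf "%d characters in, %d octets out\n", length($text), length($to_db);
```

Because iso-8859-1 maps one character to one octet, the two lengths
agree here; with a multi-byte encoding they need not.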

In the second version of the bug, the line:

return $return . $parent;

resulted in a string being returned where all the pound signs in
$return had been altered. If a different string to $parent is
appended, there is no problem.

So what does $parent contain, which causes this problem? And what is
the result of
use Encode qw/is_utf8/;
warn is_utf8($parent) ?
"\$parent is chars internally" :
"\$parent is bytes internally";

?

The current solution is:

my $tmpParent = encode( "iso-8859-1", $parent );
return $return . $tmpParent;

This is almost certainly Wrong, as $tmpParent will here be considered
to be a string of octets rather than a sequence of characters. The
Right Answer is to make sure $return is considered to be a sequence of
characters as well.

In this case, there is data in $parent which comes via CGI, so I would
be able to believe an explanation along the lines of "$parent is
magically recognised as utf8, so when it is added to $return, $return
is converted to utf8 octets before they are joined", but I would find
this quite counter intuitive, since as I understand things perl uses
its own internal representation for strings, and should only need to
convert on the way in or out.

Yup. However, if the module you are using to talk to the database
and/or Apache hasn't been upgraded to 5.8 yet you will have to do
those conversions 'at the borders' by hand. Pushing an :encoding layer
onto your filehandles, perhaps with the 'open' pragma, may help
automate this; although you are using mod_perl, which relies on tied
filehandles: I don't know how well these play with PerlIO layers as
yet. You may want to write a custom 'print', 'readline' &c. that runs
all input through 'decode' and all output through 'encode'.
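
For ordinary filehandles, an :encoding layer might look like the
sketch below. It uses an in-memory filehandle so that it is
self-contained; whether the same approach works with mod_perl's tied
handles is, as noted above, not something this example settles:

```perl
use strict;
use warnings;

# Character string containing U+00A3 POUND SIGN (invented sample text).
my $text = "Total: \x{A3}" . "10\n";

# Write through an :encoding layer; perl converts characters to
# iso-8859-1 octets as they leave the program.
my $buffer = '';
open(my $out, '>:encoding(iso-8859-1)', \$buffer) or die "open for write: $!";
print {$out} $text;
close($out) or die "close: $!";

# Read back through the same layer; octets become characters again.
open(my $in, '<:encoding(iso-8859-1)', \$buffer) or die "open for read: $!";
my $line = <$in>;
close($in) or die "close: $!";

print "round trip ", ($line eq $text ? "ok" : "FAILED"), "\n";
```

The point of the layer is that the conversion happens in one place, at
the handle, rather than at every print or readline call.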

Another thing to watch out for is that if any of your locale variables
(LANG, LC_ALL, etc.) match /utf-?8/i then perl will assume all IO will
be in UTF8 until you disillusion it. This feature has been removed in
5.8.1, though, so it shouldn't be affecting your problem.
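
A quick way to check for this on a given machine is sketched below.
The helper name utf8_locale_vars is invented for this example, and
LC_CTYPE is an assumed addition to the variables named above:

```perl
use strict;
use warnings;

# Hypothetical helper: return the names of locale variables whose
# values mention UTF-8. Takes a hash reference such as \%ENV.
sub utf8_locale_vars {
    my ($env) = @_;
    return grep { defined $env->{$_} && $env->{$_} =~ /utf-?8/i }
           qw(LANG LC_ALL LC_CTYPE);
}

my @hits = utf8_locale_vars(\%ENV);
print @hits
    ? "UTF-8 locale variables: @hits\n"
    : "No UTF-8 locale variables set\n";
```

Under mod_perl the check would need to run inside the server process,
since its environment can differ from that of a login shell.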

An alternative solution, if you can afford to treat all data as
'binary' rather than 'textual', is simply to put

use bytes;

at the top of every file.

Ben

David Murray-Rust · Nov 7, 2003

[ this is a repost of a response which I accidentally emailed to Ben.
Sorry Ben! ]

How perl stores the data internally should be considered none of your
business.

Amen to that. I would far rather not need to know

However, it may be that how it
is stored in your mysql database is confusing perl, if the code you
are using to interface to the database doesn't correctly decode the
data into perl's own encoding. In particular, if you use iso8859-1 you
may get bitten far more irregularly than if you use other encodings.

I agree with this, except that the data from the database has been
concatenated, regexed etc. with no problem, before a seemingly
innocent line causes problems.

[ snip good advice about dealing with encodings on the way into and
out of the database ]

So what does $parent contain, which causes this problem? And what is
the result of
use Encode qw/is_utf8/;
warn is_utf8($parent) ?
"\$parent is chars internally" :
"\$parent is bytes internally";

Ah. Here I find that $parent is characters, while $return is bytes.
This sort of explains things, except that it would mean that:

- perl sees a sequence of characters and a sequence of bytes being
concatenated
- it converts the bytes to characters
- it then concatenates two character sequences
- it then forgets that this is now a character sequence, and treats
the result as bytes.

This does not seem like good behaviour - I'd be tempted to suggest
it's a bug.
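
One possible reading of the steps above, not confirmed anywhere in
this thread, is that perl itself keeps track of the conversion and
some other layer then reads the string's internal buffer while
ignoring the flag. The sketch below simulates such a consumer with
Encode::_utf8_off, which Encode documents as an internal function;
the sample strings are invented:

```perl
use strict;
use warnings;
use Encode ();

# Invented sample data: $return is bytes, $parent is flagged.
my $return = "Cost: \xA3" . "5";
my $parent = "<parent/>";
utf8::upgrade($parent);   # assumption: stand-in for the flagged string

my $joined = $return . $parent;

# Careful consumer: asks for iso-8859-1 octets explicitly.
my $good = Encode::encode("iso-8859-1", $joined);

# Careless consumer (simulated): takes the internal buffer as it is
# and drops the flag, so the UTF-8 octets are read as iso-8859-1.
my $bad = $joined;
Encode::_utf8_off($bad);

printf "good: %d octets, bad: %d octets\n", length($good), length($bad);
print "bad contains \\302\\243\n" if index($bad, "\xC2\xA3") >= 0;
```

If something like this is happening, the fix belongs at the point
where the string leaves perl, not at the concatenation.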

Further, what would lead perl to treat one set of data as characters
and one set as bytes? both strings are valid XML fragments (built up
using data from CGI). The $parent string which is treated as
characters contains only [A-Za-z0-9<>-='"/? ], so I can't see any
reason for perl to suddenly decide it needs to be character data.

(As a side note, your snippet implies that the is_utf8 flag indicates
whether the data is to be treated as characters or as bytes, rather
than indicating whether or not it is characters in the utf8 character
set - could you clarify? )

This is almost certainly Wrong, as $tmpParent will here be considered
to be a string of octets rather than a sequence of characters. The
Right Answer is to make sure $return is considered to be a sequence of
characters as well.

Yes, that makes sense

Yup. However, if the module you are using to talk to the database
and/or Apache hasn't been upgraded to 5.8 yet you will have to do
those conversions 'at the borders' by hand.

I will have a look into the modules we're using

Another thing to watch out for is that if any of your locale variables
(LANG, LC_ALL, etc.) match /utf-?8/i then perl will assume all IO will
be in UTF8 until you disillusion it. This feature has been removed in
5.8.1, though, so it shouldn't be affecting your problem.

Yup, that's why I upgraded

An alternative solution, if you can afford to treat all data as
'binary' rather than 'textual', is simply to put

use bytes;

at the top of every file.

Except that as soon as I did that, someone would decide we needed
unicode support

Thanks for your help,
dave
