Perl 5.8.x, Unicode and In-memory Filehandles

Bernard Chan · Mar 1, 2006

Hello all,

I have just started experimenting with the Unicode capabilities of Perl.
I am currently working on a Web development project involving both
output buffering with in-memory filehandles created by Perl's open(), and
Unicode handling. Separately they work fine, but I have spent a lot of
time trying to make them work together. Hopefully the experts around here
can give me some insight into what I have missed.

I have written a module IO::OutputBuffer which is expected to be used as
follows:

$buf_ctx = IO::OutputBuffer::start(\*STDOUT); # start in-memory buffer
# now STDOUT points to the in-memory buffer
print "blablabla"; # Everything goes to in-memory buffer
# Content verified; commit to real STDOUT
IO::OutputBuffer::flush($buf_ctx);
# Stop buffering
IO::OutputBuffer::end($buf_ctx);
# STDOUT reverted to original

Because stray output is likely to make Apache-CGI complain, I would like
to capture all the output, validate it and then eventually commit to the
actual output stream before the script exits (there is also a similar
facility for capturing STDERR to log file, but not shown).

As a next step, I would like to use PerlIO layers to implement
encoding conversion for clients that do not support UTF-8. Otherwise I
may need to use Text::Iconv, but if PerlIO can do the job I would rather
keep using that. For instance, if the user profile (or HTTP request
header) indicates a preference for Big5, I will do a UTF-8->Big5
conversion.
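
A minimal sketch of the per-client conversion idea, assuming a recent Perl where an in-memory handle accepts an :encoding(...) layer. The helper name encode_for_client and the fallback to UTF-8 are hypothetical, introduced only for illustration. The charset name would come from the user profile or request header.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns $text (a character string) as octets in the
# client's preferred charset, falling back to UTF-8 for unknown names.
sub encode_for_client {
    my ($text, $charset) = @_;
    my $enc  = Encode::find_encoding($charset);
    my $name = $enc ? $enc->name : 'UTF-8';

    # The :encoding(...) layer does the conversion as the data is written.
    my $octets = '';
    open my $out, ">:encoding($name)", \$octets
        or die "Cannot open in-memory handle: $!";
    print {$out} $text;
    close $out or die "close failed: $!";
    return $octets;
}

# Two Chinese characters, U+4E2D and U+6587, converted to Big5 octets.
my $big5 = encode_for_client("\x{4e2d}\x{6587}", 'big5');
printf "%d octets: %vX\n", length($big5), $big5;
```

The input here is a character string, not UTF-8 octets. The layer encodes characters, so anything passed in as raw bytes would need decoding first.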

As a test, I added some code inside the buffered section that reads a
Chinese file encoded in UTF-8. I would like to output its content to
the client, performing a simulated conversion to Big5 before returning.

I have minimized the process to a script as short as below:

================================================

#!/usr/bin/perl -w

binmode(STDOUT, ":encoding(big5)") or die "$!"; # Output encoding

BEGIN {
require "require.pl";
}

#use IO::OutputBuffer;
#$b_out = IO::OutputBuffer::start(\*STDOUT);
my ($io_sys, $BUF);
open $io_sys, ">&", \*STDOUT; close STDOUT;
open STDOUT, ">", \$BUF;

open FILE, "<:encoding(utf8)", "utf8_1.txt";
@lines = <FILE>;
close FILE;

print (join(" \n", @lines));

#IO::OutputBuffer::flush($b_out);
my $buffered_content = $BUF;
$BUF = '';
seek STDOUT, 0, 0;

print $io_sys $buffered_content;

====================================

However, I cannot get the file content to display in proper Big5.
Instead, I get what appear to be Unicode code points, as follows:

Wide character in print at output_minimal.pl line 20.
"\x{00e7}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00b9}" does not map to big5-eten at output_minimal.pl line 28.
"\x{0081}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00e9}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00ab}" does not map to big5-eten at output_minimal.pl line 28.
"\x{0094}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00e4}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00b8}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00ad}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00e6}" does not map to big5-eten at output_minimal.pl line 28.
"\x{0096}" does not map to big5-eten at output_minimal.pl line 28.
"\x{0087}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00e6}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00b8}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00ac}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00e8}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00a9}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00a6}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00e4}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00bb}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00a5}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00e8}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00bc}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00b8}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00e5}" does not map to big5-eten at output_minimal.pl line 28.
"\x{0085}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00a5}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00e6}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00bc}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00a2}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00e5}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00ad}" does not map to big5-eten at output_minimal.pl line 28.
"\x{0097}" does not map to big5-eten at output_minimal.pl line 28.
UTF-8:
\x{00e7}\x{00b9}\x{0081}\x{00e9}\x{00ab}\x{0094}\x{00e4}\x{00b8}\x{00ad}\x{00e6}\x{0096}\x{0087}
 

 
\x{00e6}\x{00b8}\x{00ac}\x{00e8}\x{00a9}\x{00a6}\x{00e4}\x{00bb}\x{00a5}
UTF-8
\x{00e8}\x{00bc}\x{00b8}\x{00e5}\x{0085}\x{00a5}\x{00e6}\x{00bc}\x{00a2}\x{00e5}\x{00ad}\x{0097}

I guess that Perl has erroneously treated the content as non-Unicode and
is thus trying to convert the individual bytes from ISO-8859-1 to Big5. I
have tried inserting utf8::upgrade($buffered_content) and then checking
with utf8::is_utf8() to ensure the input sequence is indeed valid UTF-8.
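
As a hedged illustration of why utf8::upgrade() may not help here: it changes only how a string is stored internally, not which characters it contains, whereas decoding interprets octets as UTF-8. The sketch below uses the three octets e4 b8 ad, which also appear in the output above and are the UTF-8 encoding of U+4E2D. The variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode ();

# UTF-8 octets for U+4E2D, as they would sit in a byte buffer.
my $octets = "\xE4\xB8\xAD";

# utf8::upgrade() only changes the internal storage; the string still
# consists of the same three characters U+00E4, U+00B8, U+00AD.
my $upgraded = $octets;
utf8::upgrade($upgraded);
my $upgraded_len = length($upgraded);
my $flag_on      = utf8::is_utf8($upgraded) ? 1 : 0;

# Decoding interprets the octets as UTF-8 and yields a single character.
my $decoded = Encode::decode('UTF-8', $octets);

printf "upgraded: %d chars (flag %d), decoded: %d char (U+%04X)\n",
    $upgraded_len, $flag_on, length($decoded), ord($decoded);
```

So a true result from utf8::is_utf8() says nothing about whether the content was ever decoded. It reports only the internal storage format.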

Can anyone help me? Thank you.

Regards,
Bernard Chan.

*** Free account sponsored by SecureIX.com ***
*** Encrypt your Internet usage with a free VPN account from http://www.SecureIX.com ***

MSG · Mar 1, 2006

It seems suspicious that you set your STDOUT to "big5" at the very
beginning and then open and close STDOUT many times afterwards.
By the time you print, your STDOUT has already reverted to
"standard".
Anyway, the "Wide character" warning indicates that you are outputting
Unicode to a non-Unicode filehandle.
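
The warning can be reproduced in isolation. This is a small sketch, not taken from the original script: it prints one character above U+00FF to two in-memory handles, one opened without a layer and one opened with :utf8, and collects any warnings through a __WARN__ handler.

```perl
use strict;
use warnings;

my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

my $text = "\x{4e2d}";    # one character above U+00FF

# No layer: the handle expects bytes, so a wide character triggers the warning.
open my $plain, '>', \my $plain_buf or die "open failed: $!";
print {$plain} $text;
close $plain;

# With :utf8 the handle accepts characters and encodes them itself.
open my $layered, '>:utf8', \my $layered_buf or die "open failed: $!";
print {$layered} $text;
close $layered;

print "warnings seen: ", scalar(@warnings), "\n";
print "first warning: $warnings[0]" if @warnings;
```

Only the handle without a layer should produce a "Wide character in print" warning.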

Bernard Chan · Mar 1, 2006

I am inclined to think this may be related to the in-memory nature of
the filehandle. In the latest revision of the test script I have tried this:

================================================
#!/usr/bin/perl -w

BEGIN {
require "require.pl";
}

my ($io_sys, $BUF);
open $io_sys, ">&", \*STDOUT; close STDOUT;
open STDOUT, ">:utf8", \$BUF;

open FILE, "<:encoding(utf8)", "utf8_1.txt";
@lines = <FILE>;
close FILE;

my $buffered_content2 = (join(" \n", @lines)); # (1)
print (join(" \n", @lines));

my $buffered_content = $BUF;
$BUF = '';
seek STDOUT, 0, 0;

binmode($io_sys, ":encoding(big5)");
print $io_sys $buffered_content2; # (2)
================================================

The modifications are labelled (1) and (2). Line (1) is the only line
actually added. When I print $buffered_content on line (2) as before, I
see the same output as quoted previously. However, when I change line (2)
to print $buffered_content2, the output is exactly what I wanted (Big5).
So the two variables evidently differ, even though the join() expression
that produced them was identical in both cases. The only difference is
that one was read back from the variable backing the in-memory buffer,
while the other came directly from join().

I checked that the two strings are byte-for-byte identical, and that
after utf8::upgrade($buffered_content) both strings are valid UTF-8 with
the UTF-8 flag set, but comparing them with "eq" still returns false. I
suspect something subtle is going on here.
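
One explanation consistent with these observations, offered as a sketch rather than a confirmed diagnosis: printing through a :utf8 in-memory handle leaves UTF-8 octets in the buffer variable, so it holds twelve one-byte characters where the join() result holds four wide ones. The example assumes a recent Perl and uses the first four characters from the output quoted earlier. Variable names are illustrative.

```perl
use strict;
use warnings;
use Encode ();

# Four characters: U+7E41 U+9AD4 U+4E2D U+6587.
my $original = "\x{7e41}\x{9ad4}\x{4e2d}\x{6587}";

# Print through a :utf8 in-memory handle, as in the revised test script.
my $buf = '';
open my $mem, '>:utf8', \$buf or die "open failed: $!";
print {$mem} $original;
close $mem or die "close failed: $!";

# The buffer now holds the UTF-8 encoding (octets), not the characters.
my $same_before = ($buf eq $original) ? 1 : 0;

# Decoding turns the octets back into characters, which is what a handle
# carrying an :encoding(...) layer expects to be given.
my $chars      = Encode::decode('UTF-8', $buf);
my $same_after = ($chars eq $original) ? 1 : 0;

printf "buffer length %d, eq before decode: %d, eq after decode: %d\n",
    length($buf), $same_before, $same_after;
```

If this is what happens in the test script, decoding the buffer before printing it to the Big5 handle would be one thing to try.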

Can anyone explain why this is so? Thank you in advance.

Regards,
Bernard Chan.


Bernard Chan · Mar 1, 2006

MSG said:
It seems suspicious that you set your STDOUT to "big5" at the very
beginning and then open and close STDOUT many times afterwards. By
the time you print, your STDOUT has already reverted to "standard".

That is because I would like to simulate the output buffering trickery I
would normally do with the module described in my previous post, since I
would like to hide from later scripts the fact that they are printing to
an in-memory filehandle. If there is a more elegant way to do this without
all this trouble, please tell me. Thank you.

I have removed the initial binmode() from my latest test script (see my
other post that I am posting in a few minutes). The original intent was
to set the PerlIO layer on the real STDOUT (not the in-memory one). I
may be able to avoid this.

I would also like to ask: if I binmode(STDOUT, "....."), will the PerlIO
layers installed be lost when I dup it (>&)? You see, I am just duping
filehandles around to make other routines unaware of the extra buffering
layer. If the layers are lost in the duped filehandle, then you are
right, but I couldn't find anything in the docs about this behaviour.
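
One way to check this empirically, rather than rely on the docs, is to compare the layer lists reported by PerlIO::get_layers() for a handle and its dup. The sketch below makes no claim about what the result is on any given Perl version. It uses a scratch handle on the null device instead of STDOUT, which is an assumption made to keep the probe free of side effects.

```perl
use strict;
use warnings;
use File::Spec ();

# A scratch handle stands in for STDOUT so the probe has no side effects.
open my $orig, '>', File::Spec->devnull or die "open failed: $!";
binmode($orig, ':encoding(UTF-8)') or die "binmode failed: $!";

# Dup it the same way the test script dups STDOUT.
open my $dup, '>&', $orig or die "dup failed: $!";

# List the layers on each handle so they can be compared by eye.
my @orig_layers = PerlIO::get_layers($orig);
my @dup_layers  = PerlIO::get_layers($dup);

print "original: @orig_layers\n";
print "dup:      @dup_layers\n";
```

If the encoding layer is missing from the second line, applying binmode() to the duped handle itself, as the revised test script does with $io_sys, avoids depending on the answer.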

Anyway, the "Wide character" warning indicates that you are outputting
Unicode to a non-Unicode filehandle.

I have eliminated the wide character warning in the later test, after I
added ":utf8" to the open() that creates the in-memory filehandle. But
the problem remains.

Regards,
Bernard Chan.

