Bernard Chan
Hello all,
I have just started experimenting with the Unicode capabilities of Perl.
I am currently working on a Web development project involving both
output buffering with Perl's open() in-memory filehandles and Unicode
handling. Separately they work fine, but I have spent a lot of time
trying to make them work together. Hopefully the experts here can
give me some insight into what I have missed.
I have written a module IO::OutputBuffer which is expected to be used as
follows:
$buf_ctx = IO::OutputBuffer::start(\*STDOUT); # start in-memory buffer
# now STDOUT points to the in-memory buffer
print "blablabla"; # Everything goes to in-memory buffer
# Content verified; commit to real STDOUT
IO::OutputBuffer::flush($buf_ctx);
# Stop buffering
IO::OutputBuffer::end($buf_ctx);
# STDOUT reverted to original
Because stray output is likely to make Apache-CGI complain, I would like
to capture all the output, validate it and then eventually commit to the
actual output stream before the script exits (there is also a similar
facility for capturing STDERR to log file, but not shown).
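For readers unfamiliar with the pattern, the buffering described above could be sketched roughly as follows. This is not the actual IO::OutputBuffer code: the helper names buffer_start, buffer_flush and buffer_end and the layout of the context hash are assumptions made for illustration only.

```perl
use strict;
use warnings;
use IO::Handle;

# Hypothetical helpers; the real IO::OutputBuffer may work differently.

# Save the real stream and point the handle at an in-memory scalar.
sub buffer_start {
    my ($fh) = @_;
    open my $saved, '>&', $fh or die "cannot dup handle: $!";
    close $fh;
    my $data = '';
    open $fh, '>', \$data or die "cannot open in-memory handle: $!";
    return { fh => $fh, saved => $saved, data => \$data };
}

# Write the captured text to the real stream, then empty the buffer.
sub buffer_flush {
    my ($ctx) = @_;
    $ctx->{fh}->flush;                          # push pending output into the scalar
    print { $ctx->{saved} } ${ $ctx->{data} };
    ${ $ctx->{data} } = '';
    seek $ctx->{fh}, 0, 0;                      # next write starts at offset 0 again
    return;
}

# Stop buffering and point the handle back at the real stream.
sub buffer_end {
    my ($ctx) = @_;
    my $fh = $ctx->{fh};
    close $fh;
    open $fh, '>&', $ctx->{saved} or die "cannot restore handle: $!";
    close $ctx->{saved};
    return;
}

my $ctx = buffer_start(\*STDOUT);
print "hello\n";          # goes to the in-memory buffer
buffer_flush($ctx);       # committed to the real STDOUT
buffer_end($ctx);
print "done\n";           # written directly again
```

Note that the in-memory handle in this sketch has no encoding layer of its own, which matters once wide characters are printed into it.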
As a next step, I would like to use PerlIO layers to implement
encoding conversion for clients that do not support UTF-8. Otherwise
I may need to use Text::Iconv, but if PerlIO can do the job I would
rather stay with it. For instance, if the user profile (or HTTP
request header) indicates a preference for Big5, I will do a
UTF-8->Big5 conversion.
As a test, I added some code inside the buffered region that reads a
Chinese file encoded in UTF-8. I would like to output its content to
the client side, converting it to Big5 before it is returned.
I have reduced the process to the short script below:
================================================
#!/usr/bin/perl -w
binmode(STDOUT, ":encoding(big5)") or die "$!"; # Output encoding
BEGIN {
require "require.pl";
}
#use IO::OutputBuffer;
#$b_out = IO::OutputBuffer::start(\*STDOUT);
my ($io_sys, $BUF);
open $io_sys, ">&", \*STDOUT; close STDOUT;
open STDOUT, ">", \$BUF;
open FILE, "<:encoding(utf8)", "utf8_1.txt";
@lines = <FILE>;
close FILE;
print (join("<br>\n", @lines));
#IO::OutputBuffer::flush($b_out);
my $buffered_content = $BUF;
$BUF = '';
seek STDOUT, 0, 0;
print $io_sys $buffered_content;
====================================
However, I cannot get the file content to display in proper Big5.
Instead, I get warnings and what appear to be escaped code points, as follows:
Wide character in print at output_minimal.pl line 20.
"\x{00e7}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00b9}" does not map to big5-eten at output_minimal.pl line 28.
"\x{0081}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00e9}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00ab}" does not map to big5-eten at output_minimal.pl line 28.
"\x{0094}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00e4}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00b8}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00ad}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00e6}" does not map to big5-eten at output_minimal.pl line 28.
"\x{0096}" does not map to big5-eten at output_minimal.pl line 28.
"\x{0087}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00e6}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00b8}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00ac}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00e8}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00a9}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00a6}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00e4}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00bb}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00a5}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00e8}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00bc}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00b8}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00e5}" does not map to big5-eten at output_minimal.pl line 28.
"\x{0085}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00a5}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00e6}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00bc}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00a2}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00e5}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00ad}" does not map to big5-eten at output_minimal.pl line 28.
"\x{0097}" does not map to big5-eten at output_minimal.pl line 28.
UTF-8:
\x{00e7}\x{00b9}\x{0081}\x{00e9}\x{00ab}\x{0094}\x{00e4}\x{00b8}\x{00ad}\x{00e6}\x{0096}\x{0087}
<br>
<br>
\x{00e6}\x{00b8}\x{00ac}\x{00e8}\x{00a9}\x{00a6}\x{00e4}\x{00bb}\x{00a5}
UTF-8
\x{00e8}\x{00bc}\x{00b8}\x{00e5}\x{0085}\x{00a5}\x{00e6}\x{00bc}\x{00a2}\x{00e5}\x{00ad}\x{0097}
My guess is that Perl has treated the buffered content as non-Unicode
and is therefore converting each individual byte, as ISO-8859-1, to
Big5. I tried inserting utf8::upgrade($buffered_content) and checked
with utf8::is_utf8() that the string was flagged as UTF-8, but that
did not help.
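A small experiment can test this guess. The sketch below is only an illustration under stated assumptions, not a confirmed fix for the script above: it assumes the in-memory handle is opened with an explicit :encoding(UTF-8) layer and that the captured bytes are decoded with Encode::decode() before they are printed to a handle carrying the Big5 layer. The two-character test string and the variable names are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Two Chinese characters as a Perl character string (hypothetical test data).
my $text = "\x{4E2D}\x{6587}";

# Give the in-memory handle its own UTF-8 layer, so print() encodes
# explicitly instead of warning "Wide character in print".
my $buf = '';
open my $mem, '>:encoding(UTF-8)', \$buf or die "cannot open buffer: $!";
print {$mem} $text;
close $mem or die "cannot close buffer: $!";

# The buffer now holds UTF-8 bytes, so length() counts bytes here.
printf "buffer length: %d\n", length $buf;

# Decode the bytes back into characters before handing them to a
# handle that carries an :encoding(big5) layer.
my $chars = decode('UTF-8', $buf);
printf "decoded length: %d\n", length $chars;

# Stand-in for what the :encoding(big5) layer does on output.
my $big5 = encode('big5', $chars);
printf "big5 length: %d\n", length $big5;
```

utf8::upgrade() only changes the internal representation of a string, so each captured byte remains a separate character afterwards. decode() is what reinterprets the byte sequence as characters, which would explain why the upgrade made no difference.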
Can anyone help me? Thank you.
Regards,
Bernard Chan.