Writing UTF-8 file under Windows

t_lawetta · Jan 5, 2007

Happy New Year,

Whatever I try in order to write a UTF-8 file, I always end up with
UTF-16LE, with the "FF FE" BOM at the beginning and 2 bytes per character.

I am reading strings from an external resource and trying to write them
to files.

my $string_with_special_chars = "Château Müller\nGarçon";
open F, ">:utf8", "test.txt";
print F $string_with_special_chars;
close F;

Tried it both on Linux (Perl 5.8.6) and Windows (Perl 5.8.7).
(In case you cannot see it: the string contains the chars with
the corresponding HTML entities acirc, uuml and ccedil.)

Opening test.txt with my editor (Ultra-Edit) shows me the correct
string, but in hex view I see the "FF FE" BOM and it shows
2 bytes per character, e.g. 0x43 0x00 for the 'C' and
0xE7 0x00 for the ccedil.
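
An editor may convert a file when loading it, so its hex view is not necessarily what is on disk. One way to rule that out is to read the raw bytes back from Perl. The following is a minimal sketch, not a definitive diagnosis: the helper name detect_bom and the scratch file name bom_check.txt are hypothetical, and the string is written with code-point escapes so the result does not depend on how the script itself is saved.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: name the BOM found at the start of a byte string.
sub detect_bom {
    my ($head) = @_;
    return 'UTF-8'    if $head =~ /^\xEF\xBB\xBF/;
    return 'UTF-16LE' if $head =~ /^\xFF\xFE/;
    return 'UTF-16BE' if $head =~ /^\xFE\xFF/;
    return 'none';
}

my $file = 'bom_check.txt';    # hypothetical scratch file

# Write the test string through the :utf8 layer, as in the script above.
open my $out, '>:utf8', $file or die "Cannot write $file: $!";
print {$out} "Ch\x{00E2}teau M\x{00FC}ller\nGar\x{00E7}on";
close $out or die "Cannot close $file: $!";

# Read the file back as raw bytes: no decoding, no newline translation.
open my $in, '<:raw', $file or die "Cannot read $file: $!";
my $data = do { local $/; <$in> };
close $in;

# Hex-format the first four bytes for comparison with a hex viewer.
my $first = join ' ', map { sprintf '%02x', ord($_) } split //, substr($data, 0, 4);

printf "BOM: %s\n", detect_bom($data);
printf "first bytes: %s\n", $first;
```

If this reports no BOM and the first bytes are 43 68 c3 a2, Perl wrote plain UTF-8 and the "FF FE" seen in the editor was introduced after the file was written.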

Normally I am reading data via LDAP, so 'use utf8' is not required.
If I add it here, I get:
Malformed UTF-8 character (unexpected non-continuation byte 0x74,
immediately after start byte 0xe2) at ./test.pl line 4.

I tried to make sure my input strings are correctly decoded etc., but
to no avail.
As long as my strings stay within 7-bit ASCII it is fine, but beyond
that Perl always thinks it has to write a BOM and encode in a 2-byte
format.
Using Encode to write utf-8 results in a double encoding or at least
some unreadable chars.

Where does the BOM come from?
Why does Perl add it?
Doesn't Perl write UTF-8 by default?

Thank you for any hints. The issue has cost me days already and yes, I
have read a lot about Perl and Unicode.

Tony

Ian Wilson · Jan 5, 2007

Ben said:
Just a data point. I can't reproduce your problem using Perl 5.8.7 on
Linux, although I have to either:

(a) tell Perl the source is UTF-8 with "use utf8;", or
(b) re-write the string using the encoding my Perl expects.

You can, of course, just use Unicode code points in strings:
"Ch\x{E2}teau..." and then you don't need to worry...

That's what I tried. It's easy to use the wrong encoding
in the source text that contains the string literals: by the time I'd
cut and pasted the OP's program, I'm pretty sure I had ended up with my
Perl source file encoded in an obscure codepage.
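
The effect of a source-encoding mismatch can be reproduced without depending on how the script is saved, by building the string from explicit bytes. This is only a sketch, assuming the Encode module that ships with Perl 5.8; the variable names are illustrative. It shows how UTF-8 bytes that Perl takes for Latin-1 characters get encoded a second time, which is one plausible source of the double encoding mentioned in the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hex-format a byte string, hexdump style.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
}

# The UTF-8 bytes of "Chateau" with a-circumflex, as they would sit in a
# UTF-8 encoded source file read without "use utf8".
my $raw = "Ch\xC3\xA2teau";

# Perl sees one Latin-1 character per byte here (8 characters), so
# encoding to UTF-8 again expands c3 and a2 into two bytes each.
my $double = encode('UTF-8', $raw);

# Decoding first gives the intended 7 characters; encoding then
# reproduces the original bytes.
my $chars  = decode('UTF-8', $raw);
my $single = encode('UTF-8', $chars);

printf "raw:    %d characters\n", length $raw;      # 8
printf "chars:  %d characters\n", length $chars;    # 7
printf "double: %s\n", hex_bytes($double);  # 43 68 c3 83 c2 a2 74 65 61 75
printf "single: %s\n", hex_bytes($single);  # 43 68 c3 a2 74 65 61 75
```

Using code-point escapes such as \x{00E2} in the literal, as in the script below, sidesteps the question of source encoding entirely.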

$ cat utf8.pl
#!/usr/bin/perl
use strict;
use warnings;
#my $string_with_special_chars = "Château Müller\nGarçon";
my $string_with_special_chars =
"Ch\x{00E2}teau M\x{00FC}ller\nGar\x{00E7}on";
open F, ">:utf8", "test.txt";
print F $string_with_special_chars;
close F;

$ perl -v
This is perl, v5.8.0 built for i386-linux-thread-multi

$ perl utf8.pl

$ hexdump -C test.txt
00000000  43 68 c3 a2 74 65 61 75  20 4d c3 bc 6c 6c 65 72  |Ch├óteau M├╝ller|
00000010  0a 47 61 72 c3 a7 6f 6e                           |.Gar├ºon|
00000018

$ file test.txt
test.txt: UTF-8 Unicode text

Looks good to me, although I can't be bothered to re-read the UTF-8
encoding algorithm to work out manually whether the hexdump values agree
with the `file` command's verdict.
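
The manual check can be left to Perl: encode the same string with the Encode module and compare the result with the bytes in the hexdump. A minimal sketch, assuming the 24 byte values are copied from the hexdump output above; the variable names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# The string from utf8.pl, written with code-point escapes.
my $text = "Ch\x{00E2}teau M\x{00FC}ller\nGar\x{00E7}on";

# Encode the characters to UTF-8 bytes.
my $bytes = encode('UTF-8', $text);

# Format the bytes the way hexdump prints them.
my $hex = join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;

# Byte values copied from the hexdump output.
my $expected = '43 68 c3 a2 74 65 61 75 20 4d c3 bc 6c 6c 65 72 '
             . '0a 47 61 72 c3 a7 6f 6e';

print "$hex\n";
print $hex eq $expected ? "match\n" : "MISMATCH\n";    # prints "match"
printf "%d bytes\n", length $bytes;                    # prints "24 bytes"
```

Each accented character occupies two bytes (c3 a2, c3 bc, c3 a7) and each ASCII character one, with no BOM in front, which is what the hexdump shows.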
