Writing UTF-8 file under Windows

t_lawetta

Happy New Year,

Whatever I try when writing a UTF-8 file, I always end up with UTF-16LE,
with the "FF FE" BOM at the beginning and 2 bytes per character.

I am reading strings from an external resource and try to write to
files.

my $string_with_special_chars = "Château Müller\nGarçon";
open F, ">:utf8", "test.txt";
print F $string_with_special_chars;
close F;

Tried it both on Linux (Perl 5.8.6) and Windows (Perl 5.8.7).
(In case you cannot see it: the string contains the characters that
correspond to the HTML entities acirc, uuml and ccedil.)

Opening test.txt with my editor (Ultra-Edit) shows me the correct
string, but in hex view I see the "FF FE" BOM and it shows
2 bytes per character, e.g. 0x43 0x00 for the 'C' and
0xE7 0x00 for the ccedil.
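
Editor hex views can reflect an automatic conversion done by the editor, so it may help to check the bytes on disk from Perl itself. The following is a minimal sketch, assuming the file was written through the ":utf8" layer as above; hex_head is a hypothetical helper name, not part of any module. If the file really is UTF-8, the first bytes should be 43 68 c3 a2 rather than ff fe.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the first $count bytes of $path as a hex string.
# hex_head is a hypothetical helper name used only for this sketch.
sub hex_head {
    my ($path, $count) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $buf = '';
    my $n = read $fh, $buf, $count;
    defined $n or die "Cannot read $path: $!";
    close $fh;
    return join ' ', map { sprintf '%02x', ord } split //, $buf;
}

# Write a short test string through the :utf8 layer, then inspect the raw bytes.
my $file = 'test.txt';
open my $out, '>:utf8', $file or die "Cannot open $file: $!";
print {$out} "Ch\x{E2}teau";
close $out or die "Cannot close $file: $!";

print hex_head($file, 4), "\n";
```

Reading with the ":raw" layer avoids any decoding on input, so the output reflects exactly what was stored.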

Normally I am reading data via LDAP, so 'use utf8' is not required.
If I add it here, I get:
Malformed UTF-8 character (unexpected non-continuation byte 0x74,
immediately after start byte 0xe2) at ./test.pl line 4.
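
The byte values in that message are a useful clue: 0xE2 is the ISO-8859-1 code for the acirc character and 0x74 is 't', which suggests the script file itself is saved in a single-byte encoding rather than UTF-8. A small sketch of the same failure, assuming the source bytes are Latin-1 (the variable names are illustrative only):

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# The raw bytes of the first word as a Latin-1 editor would save them:
# 0xE2 (a with circumflex in ISO-8859-1) directly followed by 0x74 ('t').
my $latin1_bytes = "Ch\xE2teau";

# Strict UTF-8 decoding fails: 0xE2 starts a multi-byte sequence,
# and 0x74 is not a continuation byte.
my $copy = $latin1_bytes;    # decode with a CHECK value may modify its input
my $as_utf8 = eval { decode('UTF-8', $copy, FB_CROAK) };
my $utf8_ok = defined $as_utf8 ? 1 : 0;

# Decoding the same bytes as ISO-8859-1 yields the intended characters.
my $as_latin1 = decode('ISO-8859-1', $latin1_bytes);

print "valid UTF-8: ", ($utf8_ok ? "yes" : "no"), "\n";
printf "third character: U+%04X\n", ord substr($as_latin1, 2, 1);
```

If this matches the situation, either saving the script as UTF-8 (and keeping 'use utf8') or writing the literals with \x{...} escapes should avoid the warning.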

I tried to make sure my input strings are correctly decoded etc., but
no way.
As long as my strings stay within 7-bit ASCII it is fine, but beyond
that Perl always thinks it has to write a BOM and encode in a 2-byte
format.
Using Encode to write utf-8 results in a double encoding or at least
some unreadable chars.
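
One common way to get double encoding is to call Encode::encode and then print the result to a handle that already has a ":utf8" layer, so the data is encoded twice. A minimal sketch, assuming that is what happened here; bytes_written is a hypothetical helper that writes to an in-memory file so the bytes can be inspected:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Write $text through a :utf8 layer into an in-memory buffer and
# return the resulting bytes as a hex string.
# bytes_written is a hypothetical helper for this illustration.
sub bytes_written {
    my ($text) = @_;
    my $buffer = '';
    open my $fh, '>:utf8', \$buffer or die "Cannot open in-memory file: $!";
    print {$fh} $text;
    close $fh;
    return join ' ', map { sprintf '%02x', ord } split //, $buffer;
}

my $chars = "\x{E7}";    # one character: c with cedilla (U+00E7)

# Character string + :utf8 layer: encoded once.
print bytes_written($chars), "\n";

# Already-encoded bytes + :utf8 layer: each byte is encoded again.
print bytes_written(encode('UTF-8', $chars)), "\n";
```

The usual choice is one or the other: either print character strings to a handle with an encoding layer, or encode explicitly and print the bytes to a handle without one.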

Where does the BOM come from?
Why does Perl add it?
Doesn't Perl write UTF-8 by default?

Thank you for any hints. The issue cost me days already and yes, I have
read a lot about Perl and Unicode.

Tony
 
Ian Wilson

Ben said:
Just a data point. I can't reproduce your problem using Perl 5.8.7 on
Linux, although I have to either:

(a) tell Perl the source is UTF-8 with "use utf8;", or
(b) re-write the string using the encoding my Perl expects.

You can, of course, just use Unicode code points in strings:
"Ch\x{E2}teau..." and then you don't need to worry...

That's what I tried. It's easy to end up with the wrong encoding
in the source text that contains the string literals. By the time I'd
cut and pasted the OP's program, I'm pretty sure my Perl
source file was encoded in some obscure codepage.

$ cat utf8.pl
#!/usr/bin/perl
use strict;
use warnings;
#my $string_with_special_chars = "Château Müller\nGarçon";
my $string_with_special_chars =
"Ch\x{00E2}teau M\x{00FC}ller\nGar\x{00E7}on";
open F, ">:utf8", "test.txt";
print F $string_with_special_chars;
close F;

$ perl -v
This is perl, v5.8.0 built for i386-linux-thread-multi

$ perl utf8.pl

$ hexdump -C test.txt
00000000  43 68 c3 a2 74 65 61 75  20 4d c3 bc 6c 6c 65 72  |Château Müller|
00000010  0a 47 61 72 c3 a7 6f 6e                           |.Garçon|
00000018

$ file test.txt
test.txt: UTF-8 Unicode text

Looks good to me, although I can't be bothered to re-read the UTF8
encoding algorithms to work out manually if the hexdump values agree
with the `file` command's verdict.
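
The manual check can be left to Perl. The sketch below copies the 24 bytes from the hexdump above by hand, decodes them strictly as UTF-8, and compares the result with the string from utf8.pl; it assumes the Encode module that ships with Perl 5.8.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# The 24 bytes shown in the hexdump, copied by hand.
my $hex = '43 68 c3 a2 74 65 61 75 20 4d c3 bc 6c 6c 65 72 '
        . '0a 47 61 72 c3 a7 6f 6e';
my $bytes = pack 'H*', join '', split ' ', $hex;

# Strict decoding dies on malformed UTF-8 instead of substituting characters.
my $decoded = decode('UTF-8', $bytes, FB_CROAK);

# The string literal used in utf8.pl.
my $expected = "Ch\x{00E2}teau M\x{00FC}ller\nGar\x{00E7}on";
print $decoded eq $expected ? "match\n" : "mismatch\n";
```

Each accented character occupies two bytes (c3 a2, c3 bc, c3 a7), which is what UTF-8 prescribes for code points in the range U+0080 to U+07FF.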
 
