How do I create a new text file with UTF-8 encoding?

bk

I use ActivePerl version 5.8.8.817 on Windows XP.

I am trying to create a new text file and add some content, but when I open it
in Notepad, it says it's an ANSI-encoded file. Why?

Here is my code snippet:

open my $fh, '>:encoding(UTF-8)', "testfile.txt";
print $fh "Welcome to Muppet Show\n";
close $fh;

What am I doing wrong?
 
Jürgen Exner

bk said:
I use ActivePerl version 5.8.8.817 on Windows XP.

I am trying to create a new text file and add some content, but when I open it
in Notepad, it says it's an ANSI-encoded file. Why?

open my $fh, '>:encoding(UTF-8)', "testfile.txt";
print $fh "Welcome to Muppet Show\n";
close $fh;

What am I doing wrong?

Your sample text has the identical byte sequence in ASCII, Windows-1252 (aka
ANSI), UTF-8, ISO-8859-1 (Latin-1), ISO-8859-15 (Latin-9), and probably a dozen
other encodings, because it contains only ASCII characters. Therefore your
sample is useless for testing for the correct encoding.
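
To make that concrete, here is a minimal sketch using the core Encode module. The helper name hex_bytes and the second sample string are my own choices for illustration, not anything from the original post. It dumps the encoded bytes of the original sample and of a string containing u-umlaut and the euro sign:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode $text in $encoding and return the bytes as hex.
sub hex_bytes {
    my ($encoding, $text) = @_;
    my $bytes = encode($encoding, $text);
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
}

my @samples = (
    [ 'ASCII only' => "Muppet" ],
    [ 'non-ASCII'  => "\x{FC}\x{20AC}" ],   # u-umlaut, euro sign
);

for my $sample (@samples) {
    my ($label, $text) = @$sample;
    print "$label:\n";
    # The euro sign exists in cp1252 and iso-8859-15 but at different bytes.
    for my $encoding (qw(UTF-8 cp1252 iso-8859-15)) {
        printf "  %-12s %s\n", $encoding, hex_bytes($encoding, $text);
    }
}
```

The ASCII-only sample should give the same bytes on every line, while the second sample differs per encoding (for example, the euro sign is E2 82 AC in UTF-8, 80 in Windows-1252 and A4 in ISO-8859-15).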

Notepad relies on the byte order mark (BOM) to identify Unicode files,
including UTF-8, where the BOM of course is meaningless and not used except
by Notepad itself. In not so many words: Notepad has no clue what it is
talking about. But for your sample text, neither would any other tool.

Step 1: use some sample text that contains characters that are encoded as
different byte sequences in each encoding.
Step 2: don't use Notepad. Write to a (trivial) HTML file and then use a web
browser to view that file. There you can change the encoding and determine
whether those characters are displayed correctly for the desired encoding.
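
The two steps above could be sketched roughly like this. The file name encoding-test.html and the sample characters are assumptions picked for illustration. No charset declaration is written, so the encoding can be switched in the browser by hand:

```perl
use strict;
use warnings;

# Hypothetical file name for the test page.
my $file = 'encoding-test.html';

open(my $fh, '>:encoding(UTF-8)', $file)
    or die "Cannot open $file for writing: $!";

# Sample text with non-ASCII characters: u-umlaut, e-acute, euro sign.
print $fh "<html><body>\n";
print $fh "<p>u-umlaut: \x{FC}, e-acute: \x{E9}, euro: \x{20AC}</p>\n";
print $fh "</body></html>\n";

close($fh) or die "Cannot close $file: $!";
```

If the file really is UTF-8, the three characters should display correctly when the browser is set to UTF-8 and turn into garbage when it is switched to a Western (Windows-1252 or ISO-8859-1) encoding.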

In over 8 years as a software localization engineer and international program
manager, this has proven to be the only practical and reliable way to
identify the actual encoding of a file.

jue
 
Brian McCauley

Jürgen said:
Your sample text has the identical byte sequence in ASCII, Windows-1252 (aka
ANSI), UTF-8, ISO-8859-1 (Latin-1), ISO-8859-15 (Latin-9), and probably a dozen
other encodings. Therefore your sample is useless for testing for the correct
encoding.

Notepad relies on the byte order mark (BOM) to identify Unicode files,
including UTF-8, where the BOM of course is meaningless and not used except
by Notepad itself.

You mean Windows, not Notepad. Most Windows programs will recognise a
file with a UTF-8 BOM at the start as UTF-8.

In a situation where you've got a mixture of Windows-1252 and UTF-8
files knocking about, it's not a bad way to distinguish them. I'm
not saying I particularly liked Microsoft's unilateral adoption of the BOM
in UTF-8, but I have to admit it makes the best of a bad job.

In Perl I'd like to be able to say something like

open my $fh, '>:encoding(UTF-8 BOM)', "testfile.txt";

But AFAIK I can't, and I just have to

print $fh "\x{FEFF}"; # BOM
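
Put together with the original snippet, that might look like the sketch below. The read-back at the end is only there to verify the result and is my addition, not part of the original question:

```perl
use strict;
use warnings;

my $file = 'testfile.txt';

open(my $fh, '>:encoding(UTF-8)', $file)
    or die "Cannot open $file for writing: $!";
print $fh "\x{FEFF}";                    # BOM, must be the first thing written
print $fh "Welcome to Muppet Show\n";
close($fh) or die "Cannot close $file: $!";

# Read the first three bytes back without any decoding to check the BOM.
open(my $in, '<:raw', $file) or die "Cannot open $file for reading: $!";
read($in, my $head, 3);
close($in);

# U+FEFF encoded as UTF-8 is the byte sequence EF BB BF.
print join(' ', map { sprintf '%02X', $_ } unpack 'C*', $head), "\n";
```

With the BOM in place, Notepad should report the file as UTF-8 even though the rest of the text is pure ASCII.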
 
Jürgen Exner

Brian said:
In a situation where you've got a mixture of Windows-1252 and UTF-8
files knocking about, it's not a bad way to distinguish them. I'm
not saying I particularly liked Microsoft's unilateral adoption of the BOM
in UTF-8, but I have to admit it makes the best of a bad job.

Fair enough, you've got a point.
However, calling it a _Byte_Order_ Mark in the context of UTF-8 is a misnomer if
there ever has been one ;-)

jue
 
