Opening Unicode files?

Ilya Zakharevich · Dec 25, 2011

Does Perl ship with a simple method of opening Unicode files? E.g., I
would like to have something like

open my $fh, '< :BOM0or(utf8)', $filename

where BOM0or does what Perl itself does for Perl files: it looks at the
first 4 bytes; given that a Perl file starts in ASCII, one can detect
BOMs, can detect UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE, or see that it
is none of the above (then the argument in parens explains what to do;
e.g., Perl itself does BOM0or(latin1)).

Likewise, if one does not know that the file starts in ASCII, one can
still detect a BOM (which does not appear often in the encodings I know),
so one could do :BOMor(utf8). I do not recollect seeing such support
for files open()ed by Perl programs; is there any?

Thanks,
Ilya
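
For readers who want to see the :BOMor(utf8) idea in concrete terms, here is a minimal sketch using only core Perl. The helper name open_bom_or and its return convention are hypothetical (nothing like it is named in the thread). It reads up to 4 raw bytes, matches them against the standard BOM signatures, skips the BOM if one is found, and pushes the matching :encoding layer; otherwise it falls back to the caller's default. One assumption to note: FF FE 00 00 is treated as a UTF-32LE BOM, although it could also be a UTF-16LE BOM followed by U+0000.

```perl
use strict;
use warnings;

# BOM signatures, longest first, so that UTF-32LE (FF FE 00 00) is
# tested before UTF-16LE (FF FE).
my @BOMS = (
    [ "\x00\x00\xFE\xFF", 'UTF-32BE' ],
    [ "\xFF\xFE\x00\x00", 'UTF-32LE' ],
    [ "\xEF\xBB\xBF",     'UTF-8'    ],
    [ "\xFE\xFF",         'UTF-16BE' ],
    [ "\xFF\xFE",         'UTF-16LE' ],
);

# Hypothetical helper: open $filename for reading, using the encoding
# announced by a BOM if one is present, else $default (e.g. 'UTF-8').
# Returns the filehandle and the name of the encoding that was chosen.
sub open_bom_or {
    my ($filename, $default) = @_;

    open my $fh, '<:raw', $filename
        or die "Cannot open $filename: $!";
    my $n = read $fh, my $head, 4;
    defined $n or die "Cannot read $filename: $!";

    my ($encoding, $skip) = ($default, 0);
    for my $bom (@BOMS) {
        my ($sig, $name) = @$bom;
        if (substr($head, 0, length $sig) eq $sig) {
            ($encoding, $skip) = ($name, length $sig);
            last;
        }
    }

    # Rewind to just past the BOM (or to the start), then decode from there.
    seek $fh, $skip, 0 or die "Cannot seek in $filename: $!";
    binmode $fh, ":encoding($encoding)";
    return ($fh, $encoding);
}
```

Usage would look like: my ($fh, $enc) = open_bom_or($filename, 'UTF-8'); after which <$fh> returns decoded text.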

r.mariotti · Dec 26, 2011

Here's what I use and it seems to do what's needed:

use File::BOM qw( :all );

# Open specified input file
open(IFH, '<:via(File::BOM)', $IF)
    or die "Could NOT open specified file ($IF): $!\n";

Ilya Zakharevich · Dec 27, 2011

Thinking about it more, there are 4 situations:

a) we know that the first 2 characters in the file are 7-bit, and
are not 0. Then read the first 2 bytes; if both 0, it is 32BE
(possibly with [hardly legal] BOM); if BOM-BE, it is 16BE+BOM; if
high bits are set, it is UTF-8+BOM; if the first byte is 0, it is
16BE.

One needs to read the other 2 bytes only if 32BE is detected (and
only if one wants to guard against BOM) and if the second byte is
0 - then it may be 16LE or 32LE.

The only possible confusion is whether the file is actually in
Unicode encoding, or in an 8-bit encoding (or between UTF-7 and
UTF-8-no-BOMs).

b) The only thing known is that the first 2 chars are not 0. Again,
one reads 2 bytes - but now there is no way to detect UTF-8-BOM.

c) The only thing known is that the first 2 chars are 7-bit. Then
there is no way to detect BOMless UTF-16.

d) General case: 8-bit chars may be present.

It looks like the decision algorithms are DIFFERENT in these 4 cases;
hence one needs 4 different "filters": One can call them BOM07, BOM08,
BOM7, and BOM8.
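
The decision procedure for case (a) might be sketched roughly as follows. This is one reading of the description above, not a tested "BOM07" filter. The function name sniff_ascii_start and its return convention (encoding name, number of BOM bytes to skip) are hypothetical. Two details are additions not spelled out above: the FF FE little-endian BOM branch, and the check of the third byte of the UTF-8 BOM.

```perl
use strict;
use warnings;

# Hypothetical helper for case (a): the first two characters of the file
# are known to be 7-bit and not NUL. $head holds the first (up to) 4 bytes.
# Returns (encoding name, number of BOM bytes to skip).
sub sniff_ascii_start {
    my ($head, $default) = @_;
    return ($default, 0) if length $head < 2;

    my ($b0, $b1) = unpack 'C2', $head;
    my $rest = substr $head, 2, 2;    # bytes 3 and 4, possibly empty

    if ($b0 == 0 && $b1 == 0) {
        # 00 00 .. ..: UTF-32BE, possibly preceded by a BOM (00 00 FE FF)
        return $rest eq "\xFE\xFF" ? ('UTF-32BE', 4) : ('UTF-32BE', 0);
    }
    return ('UTF-16BE', 2) if $b0 == 0xFE && $b1 == 0xFF;
    if ($b0 == 0xFF && $b1 == 0xFE) {
        # Little-endian BOM (an addition to the description above). Since
        # the first character is not NUL, FF FE 00 00 can only be UTF-32LE.
        return $rest eq "\x00\x00" ? ('UTF-32LE', 4) : ('UTF-16LE', 2);
    }
    return ('UTF-8', 3)
        if $b0 == 0xEF && $b1 == 0xBB && $rest =~ /^\xBF/;
    return ('UTF-16BE', 0) if $b0 == 0;    # 00 xx
    if ($b1 == 0) {
        # xx 00: UTF-16LE or UTF-32LE; the next two bytes decide
        return $rest eq "\x00\x00" ? ('UTF-32LE', 0) : ('UTF-16LE', 0);
    }
    # Two non-NUL 7-bit bytes: an 8-bit encoding or BOMless UTF-8
    return ($default, 0);
}
```

Only the 00 00, FF FE, and xx 00 branches look at bytes 3 and 4, which matches the observation that the other 2 bytes are rarely needed.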

> Here's what I use and it seems to do what's needed:
>
> use File::BOM qw( :all );

And do you know from which version it is shipped with Perl?

> # Open specified input file
> open(IFH, '<:via(File::BOM)', $IF) or die "Could NOT open specified file ($IF)!\n";

I do not see how this may be related: I see no way to inform the filter
about what is known in advance...

Thanks,
Ilya

Ilya Zakharevich · Dec 27, 2011

> Encode::Guess, which can be invoked as
>
> open my $fh, '< :encoding(Guess)', $filename
>
> Somewhat annoyingly, you have to explicitly use Encode::Guess or it
> won't recognise the encoding name, and you have to use
> Encode::Guess->set_suspects to set the list of encodings to try.

Same question as to the other answer: does it ship with Perl? And I
do not want any guessing; I want a very deterministic procedure...

Thanks,
Ilya
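
On the "does it ship with Perl" question: as far as I know, Encode::Guess is part of the Encode distribution that has been bundled with Perl since 5.8, while File::BOM is a CPAN-only module. The following minimal sketch shows the function-style interface, guess_encoding, applied to an in-memory string rather than through an :encoding(Guess) layer. The sample data is an assumption made up for illustration. With no suspects listed, the result for BOM-prefixed input comes from the BOM check alone, though the module does apply further heuristics to input that lacks a BOM.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use Encode::Guess;    # exports guess_encoding() by default

# Sample input (made up for illustration): "hi\n" as UTF-16LE with a BOM.
my $octets = "\xFF\xFE" . encode('UTF-16LE', "hi\n");

# guess_encoding() returns an encoding object on success and a plain
# error string on failure, hence the ref() check.
my $enc = guess_encoding($octets);
ref $enc or die "Cannot determine encoding: $enc";

# Decode with the detected encoding; the UTF-16 decoder consumes the BOM.
my $text = decode($enc->name, $octets);
print "detected ", $enc->name, "\n";
```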

r.mariotti · Dec 28, 2011

> Same question as to the other answer: does it ship with Perl? And I
> do not want any guessing; I want a very deterministic procedure...
>
> Thanks,
> Ilya

Do as all perl mongers do - use CPAN to locate, download and install
the needed function.

$>perl -MCPAN -e shell

A similar source is available from ActiveState for Windows.

Ilya Zakharevich · Dec 31, 2011

> Do as all perl mongers do - use CPAN to locate, download and install
> the needed function.
>
> $>perl -MCPAN -e shell

I never do "as all perl mongers do". Neither, I expect, do users of
my code.

Hope this helps,
Ilya

Tim McDaniel · Jan 2, 2012

> Do as all perl mongers do - use CPAN to locate, download and install
> the needed function.
>
> $>perl -MCPAN -e shell

I was a maintainer of servers at previous jobs and could do that for
the system. But not at my current job, and if I wanted to do it for a
shared script, I don't know yet how receptive they would be to a
request. It's why I "use constant" instead of a more modern and
convenient module.

tchrist · Feb 15, 2012

> Same question as to the other answer: does it ship with Perl? And I
> do not want any guessing; I want a very deterministic procedure...

Ilya,

I understand completely. I find that Encode::Guess is too unreliable for
my purposes. I have a replacement version that is built on a statistical
model derived from very large English-language corpora, and it gets the
encoding right 99.79% of the time, including on conflicting 8-bit
encodings. For example, it knows CP1252 from MacRoman from ISO-8859-1
from ISO-8859-15, etc. I have a working alpha version of the code, so if
you are interested in this technique or wish to know more, please send me
mail. You can fetch the alpha version from

http://training.perl.com/scripts/Encode-Guess-Educated-0.03.tar.gz

I'm having trouble with my PAUSE id, so it isn't on CPAN yet.

Hope this helps, and do feel free to write. I never look here for anything,
so am likely to miss a reply.

--tom
