Opening Unicode files?

Ilya Zakharevich

Does Perl ship with a simple method of opening Unicode files? E.g., I
would like to have something like

open my $fh, '< :BOM0or(utf8)', $filename

where BOM0or does what Perl itself does for Perl files: it looks for the
first 4 bytes; given that a Perl file starts in ASCII, one can detect
BOMs, can detect UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE, or see that it
is none of the above (then the argument in parens explains what to do;
e.g., Perl itself does BOM0or(latin1)).

Likewise, if one does not know that the file starts in ASCII, one can
still detect BOM (which does not appear often in the encodings I know)
so one could do :BOMor(utf8). I do not recollect seeing such support
for files open()ed by Perl programs; is there any?

Thanks,
Ilya
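
For illustration, here is a minimal sketch of what a :BOMor(utf8)-style open could look like using only core Perl. The helper name open_bom_or is hypothetical, not an existing layer or function. It only looks for a BOM and otherwise falls back to the caller's default; it makes no attempt to detect BOM-less UTF-16 or UTF-32.

```perl
use strict;
use warnings;

# Hypothetical helper (not a core function): open $path, choose the decoding
# layer from a BOM if one is present, otherwise use $default (e.g. 'UTF-8').
# Returns the file handle and the encoding name that was chosen.
sub open_bom_or {
    my ($path, $default) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $n = read($fh, my $head, 4);
    die "Cannot read $path: $!" unless defined $n;

    my ($enc, $skip) = ($default, 0);
    # Test the 4-byte BOMs first: FF FE 00 00 also starts with the
    # UTF-16LE BOM. (Assumption: a UTF-16LE file does not begin with NUL.)
    if    ($head =~ /\A\x00\x00\xFE\xFF/) { ($enc, $skip) = ('UTF-32BE', 4) }
    elsif ($head =~ /\A\xFF\xFE\x00\x00/) { ($enc, $skip) = ('UTF-32LE', 4) }
    elsif ($head =~ /\A\xEF\xBB\xBF/)     { ($enc, $skip) = ('UTF-8',    3) }
    elsif ($head =~ /\A\xFE\xFF/)         { ($enc, $skip) = ('UTF-16BE', 2) }
    elsif ($head =~ /\A\xFF\xFE/)         { ($enc, $skip) = ('UTF-16LE', 2) }

    # Rewind to just after the BOM (or to the start), then decode from there.
    seek($fh, $skip, 0) or die "Cannot seek in $path: $!";
    binmode($fh, ":encoding($enc)") or die "Cannot set layer on $path: $!";
    return ($fh, $enc);
}
```

It would be called as my ($fh, $enc) = open_bom_or($filename, 'UTF-8'); after which reads from $fh return decoded characters.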
 
r.mariotti

Here's what I use and it seems to do what's needed:

use File::BOM qw( :all );

# Open specified input file
open(IFH, '<:via(File::BOM)', $IF)
    or die "Could NOT open specified file ($IF)!\n";
 
Ilya Zakharevich

Thinking about it more, there are 4 situations:

a) we know that the first 2 characters in the file are 7-bit, and
are not 0. Then read the first 2 bytes; if both 0, it is 32BE
(possibly with [hardly legal] BOM); if BOM-BE, it is 16BE+BOM; if
high bits are set, it is UTF-8+BOM; if the first byte is 0, it is
16BE.

One needs to read the other 2 bytes only if 32BE is detected (and
only if one wants to guard against BOM) and if the second byte is
0 - then it may be 16LE or 32LE.

The only possible confusion is whether the file is actually in
Unicode encoding, or in an 8-bit encoding (or between UTF-7 and
UTF-8-no-BOMs).

b) The only thing known is that the first 2 chars are not 0. Again,
one reads 2 bytes - but now there is no way to detect UTF-8-BOM.

c) The only thing known is that the first 2 chars are 7-bit. Then
there is no way to detect BOMless UTF-16.

d) General case: 8-bit chars may be present.

It looks like the decision algorithms are DIFFERENT in these 4 cases;
hence one needs 4 different "filters": One can call them BOM07, BOM08,
BOM7, and BOM8.
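
The decision procedure for case (a) can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name detect_case_a is hypothetical, the function is handed the first 4 bytes up front rather than reading 2 and then 2 more, and the handling of the little-endian BOM (FF FE), which the description above does not spell out, is an added assumption.

```perl
use strict;
use warnings;

# Hypothetical helper for case (a): the first two characters are known to be
# 7-bit and non-NUL. $head holds the first 4 bytes of the file, read raw.
# Returns (encoding name, BOM length in bytes), or an empty list when the
# start of the file does not look like UTF-16/32 or BOM-prefixed UTF-8.
sub detect_case_a {
    my ($head) = @_;
    return if length($head) < 4;    # assumption: too short to decide
    my ($b0, $b1, $b2, $b3) = unpack 'C4', $head;

    if ($b0 == 0 && $b1 == 0) {     # 00 00 ..: UTF-32BE, maybe with a BOM
        return ('UTF-32BE', ($b2 == 0xFE && $b3 == 0xFF) ? 4 : 0);
    }
    return ('UTF-16BE', 2) if $b0 == 0xFE && $b1 == 0xFF;
    if ($b0 == 0xFF && $b1 == 0xFE) {
        # Added assumption: little-endian BOM. In case (a) a UTF-16LE BOM
        # cannot be followed by NUL, so FF FE 00 00 must be UTF-32LE.
        return $b2 == 0 && $b3 == 0 ? ('UTF-32LE', 4) : ('UTF-16LE', 2);
    }
    return ('UTF-8', 3) if $b0 == 0xEF && $b1 == 0xBB && $b2 == 0xBF;
    return ('UTF-16BE', 0) if $b0 == 0;    # 00 xx: big-endian, no BOM
    if ($b1 == 0) {                        # xx 00: little-endian, no BOM
        return $b2 == 0 && $b3 == 0 ? ('UTF-32LE', 0) : ('UTF-16LE', 0);
    }
    return;    # an 8-bit encoding, UTF-7, or BOM-less UTF-8
}
```

An empty return corresponds to the "none of the above" outcome, where the caller would apply whatever fallback encoding was given in parens.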

> Here's what I use and it seems to do what's needed:
>
> use File::BOM qw( :all );

And do you know from which version it is shipped with Perl?

> # Open specified input file
> open(IFH, '<:via(File::BOM)', $IF) or die "Could NOT open specified
> file ($IF)!\n";

I do not see how this is related: I see no way to inform the filter
about what is known in advance...

Thanks,
Ilya
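
One way to check whether a module ships with Perl, and since which release, is the Module::CoreList module. The sketch below assumes Module::CoreList is installed (it is bundled with reasonably recent Perls); the helper name core_since is hypothetical.

```perl
use strict;
use warnings;
use Module::CoreList;

# Hypothetical helper: return the first Perl release that bundled $module,
# or undef if the module was never part of the core distribution.
sub core_since {
    my ($module) = @_;
    return scalar Module::CoreList->first_release($module);
}

for my $module (qw(Encode::Guess File::BOM)) {
    my $since = core_since($module);
    print "$module: ",
        defined $since ? "in core since $since" : "not in core", "\n";
}
```

The answer reflects the data shipped with the installed copy of Module::CoreList, so an old installation will not know about newer releases.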
 
Ilya Zakharevich

> Encode::Guess, which can be invoked as
>
> open my $fh, '< :encoding(Guess)', $filename
>
> Somewhat annoyingly, you have to explicitly use Encode::Guess or it
> won't recognise the encoding name, and you have to use
> Encode::Guess->set_suspects to set the list of encodings to try.

Same question as to the other answer: does it ship with Perl? And I
do not want any guessing; I want a very deterministic procedure...

Thanks,
Ilya
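
For reference, the Encode::Guess usage described in the quoted text can also be written with the module's guess_encoding function instead of the :encoding(Guess) layer. This is a minimal sketch, not a recommendation; the helper name decode_guessed is hypothetical, and the result is a guess in exactly the sense objected to above.

```perl
use strict;
use warnings;
use Encode;
use Encode::Guess;    # must be loaded explicitly, as the quoted text notes

# Hypothetical helper: decode $octets with whatever Encode::Guess settles on.
# @suspects is an optional list of extra encodings to try.
# Returns the decoded text and the name of the chosen encoding.
sub decode_guessed {
    my ($octets, @suspects) = @_;
    my $decoder = guess_encoding($octets, @suspects);
    # On failure guess_encoding returns an error string, not an object.
    die "Cannot guess encoding: $decoder\n" unless ref $decoder;
    return ($decoder->decode($octets), $decoder->name);
}

# Usage: octets that start with a UTF-16 BOM.
my ($text, $name) = decode_guessed(encode('UTF-16', "hello"));
print "$name: $text\n";
```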
 
r.mariotti

> Same question as to the other answer: does it ship with Perl? And I
> do not want any guessing; I want a very deterministic procedure...
>
> Thanks,
> Ilya


Do as all perl mongers do - use CPAN to locate, download and install
the needed function.

$>perl -MCPAN -e shell

A similar source is available from ActiveState for Windows.
 
Ilya Zakharevich

> Do as all perl mongers do - use CPAN to locate, download and install
> the needed function.
>
> $>perl -MCPAN -e shell

I never do "as all perl mongers do". Neither, I expect, do users of
my code.

Hope this helps,
Ilya
 
Tim McDaniel

> Do as all perl mongers do - use CPAN to locate, download and install
> the needed function.
>
> $>perl -MCPAN -e shell

I was a maintainer of servers at previous jobs and could do that for
the system. But not at my current job, and if I wanted to do it for a
shared script, I don't know yet how receptive they would be to a
request. It's why I "use constant" instead of a more modern and
convenient module.
 
tchrist

> Same question as to the other answer: does it ship with Perl? And I
> do not want any guessing; I want a very deterministic procedure...

Ilya,

I understand completely. I find that Encode::Guess is too unreliable for
my purposes. I have a replacement version that is built on a statistical
model derived from very large English-language corpora, which it gets
right 99.79% of the time, including on conflicting 8-bit encodings. For
example, it knows CP1252 from MacRoman from ISO-8859-1 from ISO-8859-15,
etc. I have a working alpha version of the code, so if you are interested in this
technique or wish to know more, please send me mail. You can fetch the
alpha version from

http://training.perl.com/scripts/Encode-Guess-Educated-0.03.tar.gz

I'm having trouble with my PAUSE id, so it isn't on CPAN yet.

Hope this helps, and do feel free to write. I never look here for anything,
so am likely to miss a reply.

--tom
 
