Regular expression for BOM required

Peter Gordon

#!/cygdrive/c/cygwin/bin/perl
use strict;
use warnings;
use 5.14.0;
open my $fh, '<:encoding(utf16le)', "00Tst.zpl" or die "File opening error\n";
while( <$fh> ) {
say "Found regular expression" if /\xFE\xFF/;
# say "Found it!" if s/\A.*nm=//;
print;
}

# I'm trying to match a byte order mark (BOM) in a file. Below is
# the start of an octal dump of the file.
# 0000000 177377 000156 000155 000075 000142 000157 000164 000164
# The line:
# say "Found it!" if s/\A.*nm=//;
# works correctly, but I can't write a regular expression which matches
# octal 0000000 177377 at the start of a line. Help with the
# regular expression would be appreciated.
# If it matters, I'm working on Windows 7.
 
Peter J. Holzer

#!/cygdrive/c/cygwin/bin/perl
use strict;
use warnings;
use 5.14.0;
open my $fh, '<:encoding(utf16le)', "00Tst.zpl" or die "File opening error\n";
while( <$fh> ) {
say "Found regular expression" if /\xFE\xFF/;

You want to match the single character U+FEFF BOM here, not a sequence
of two characters U+00FE LATIN SMALL LETTER THORN U+00FF LATIN SMALL
LETTER Y WITH DIAERESIS.

So you have to write

say "Found regular expression" if /\x{FEFF}/;
print;
}
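
The difference can be checked without the original file. The sketch below is only an illustration: the byte string is a made-up stand-in for the first bytes of 00Tst.zpl (BOM followed by "nm=" in UTF-16LE, as the od dump suggests), and it uses Encode's decode rather than an I/O layer.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: the first bytes of a UTF-16LE file, BOM + "nm=".
my $bytes = "\xFF\xFE" . "n\0m\0=\0";

# Decoding as UTF-16LE keeps the BOM as the single character U+FEFF.
my $text = decode('UTF-16LE', $bytes);

# /\xFE\xFF/ looks for two characters, U+00FE followed by U+00FF.
my $two_chars = $text =~ /\xFE\xFF/ ? 1 : 0;

# /\x{FEFF}/ looks for the single character U+FEFF.
my $one_char = $text =~ /\x{FEFF}/ ? 1 : 0;

printf "length=%d two_chars=%d one_char=%d\n",
    length($text), $two_chars, $one_char;
```

Only the \x{FEFF} pattern should report a match, because the decoded string contains one BOM character and not the two bytes that made it up.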

# I'm trying to match a byte order mark (BOM) in a file. Below is
# the start of an octal dump of the file.
# 0000000 177377 000156 000155 000075 000142 000157 000164 000164
          ^^^^^^
The default output format of od (little endian 16 bit values in octal)
is confusing. Yes, 0xFEFF is 0177377 in octal, but 177377 looks too much
like 7FFF for me to do the bitshift intuitively in my head.

Better to use "od -tx1" or "od -tx2".
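
If od is not at hand, a byte-wise hex view is easy to produce in Perl as well. This is a rough sketch, not a replacement for od: hex_bytes is a hypothetical helper name, and the sample string stands in for the start of the file.

```perl
use strict;
use warnings;

# Return the first $count bytes of a byte string as space-separated hex,
# similar in spirit to the data columns of "od -tx1".
sub hex_bytes {
    my ($bytes, $count) = @_;
    return join ' ',
        map { sprintf '%02x', ord($_) }
        split //, substr($bytes, 0, $count);
}

# Hypothetical sample standing in for the start of the file.
my $sample = "\xFF\xFE" . "n\0m\0=\0";
print hex_bytes($sample, 8), "\n";
```

For a real file, the bytes would have to be read from a handle opened with '<:raw' so that no decoding layer changes them first.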

hp
 
Peter Gordon

You want to match the single character U+FEFF BOM here, not a sequence
of two characters U+00FE LATIN SMALL LETTER THORN U+00FF LATIN SMALL
LETTER Y WITH DIAERESIS.

So you have to write

say "Found regular expression" if /\x{FEFF}/;

print;
}
Thanks Peter,
It was the curly braces which I was missing.
 
Peter J. Holzer

Presumably you also have to check for the "other order" ?

No. After decoding there is no byte order any more, just characters, and
the character you want to match is \x{FEFF}.

If you try to open a big-endian file with :encoding(utf16le), the script
will die trying to read the first line.

(If you open it with :encoding(utf16), the BOM will be used to determine
endianness and *not* passed through - this seems a little inconsistent
to me)
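
The different treatment of the BOM can be seen with Encode directly. A small sketch, using invented sample data ("nm" with a BOM in each byte order) instead of a real file:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample data: "nm" preceded by a BOM, in each byte order.
my $le_bytes = "\xFF\xFE" . "n\0m\0";
my $be_bytes = "\xFE\xFF" . "\0n\0m";

# Explicit byte order: the BOM survives as an ordinary character.
my $explicit = decode('UTF-16LE', $le_bytes);

# Generic UTF-16: the BOM selects the byte order and is dropped.
my $auto_le = decode('UTF-16', $le_bytes);
my $auto_be = decode('UTF-16', $be_bytes);

printf "UTF-16LE: %d chars, first is U+%04X\n",
    length($explicit), ord($explicit);
printf "UTF-16 (LE input): '%s'\n", $auto_le;
printf "UTF-16 (BE input): '%s'\n", $auto_be;
```

With the explicit UTF-16LE name the result starts with U+FEFF; with plain UTF-16 both inputs decode to the same two-character string and the BOM is gone.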

hp
 
Peter Gordon

Presumably you also have to check for the "other order" ?

BugBear
The files I'm editing are the playlists of Zoomplayer which is
an Israeli media player, thus they are consistent in their Unicode
and format. Is there a method for getting Unicode to work with
the combination of the diamond operator and In-place editing?
The code below runs fine when run as a program, e.g.:
$ insertTT.pl aa.zpl
but crashes when I try to run it with the -i command-line option, e.g.:
$ perl -i insertTT.pl aa.zpl

#!/cygdrive/c/cygwin/bin/perl
# Used to insert a "tt=NUMBER: " line into new .df files.
use strict;
use warnings;
use 5.14.0;
use Encode qw(encode decode);
use open qw(:std IN :encoding(utf16-le));

# $^I = ".bak";
my $first = 1;
while( <> ) {
my $line = $_;
if ( $first == 1 ) {
$line =~ s/\x{FEFF}nm=(.*)/nm=$1/;
$first = 0;
}
$line = decode("utf8", $line);
print $line;
if ( $line =~ /nm=/ ) {
my $num = $line;
chomp($num);
$num =~ s/nm=.*?(\d+).*/$1/;
print "tt=$num: \n";
}
}
 
Peter J. Holzer

The code below runs fine when run as a program eg: $insertTT.pl aa.zpl
but crashes when I try to run it with the -i command line option. eg:

If perl crashes you should file a bug report.

hp
 
Peter J. Holzer

Peter said:
Peter Gordon wrote:
(e-mail address removed): [$_ was read from a file opened with ":encoding(utf16le)"]
say "Found regular expression" if /\x{FEFF}/; [...]
Presumably you also have to check for the "other order" ?

No. After decoding there is no byte order any more, just characters, and
the character you want to match is \x{FEFF}.

If you try to open a big-endian file with :encoding(utf16le), the script
will die trying to read the first line.

(If you open it with :encoding(utf16), the BOM will be used to determine
endianness and *not* passed through - this seems a little inconsistent
to me)

I had (perhaps wrongly) assumed that the OP's true intent (or need)
was to read the BOM and use it to decide *which* byte order
was being used, and hence to use the correct decoder.

If that was the intent of the OP, opening the file in one byte order and
checking for a reversed BOM wouldn't work: The diamond operator dies
when it encounters the wrong BOM (of course you could catch the
exception and then try the other endianness).

I think there are two good ways to open UTF-16 files with unknown byte
order:

1) The carefree method: Just use :encoding(utf16), and it will
automatically determine the endianness from the BOM, and you don't
have to care whether the file is little or big endian. Plus, the BOM
is automatically filtered out so you don't have to. On the flipside,
you lose the information about the endianness and the BOM, so if you
need that, this isn't for you.

2) Open the file in binary mode and read the first few bytes. Determine
the correct encoding from those, rewind and set the encoding layer.
This is more work, but a lot more flexible: You can detect any
encoding you want.
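
Method 2 could look roughly like the sketch below. The helper names sniff_bom and open_by_bom are invented for illustration, only three BOMs are recognised (a UTF-32LE BOM would be mistaken for UTF-16LE), and the demo at the end writes its own temporary file rather than touching a real playlist.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Map the first bytes of a file to an encoding name and the BOM length.
# Returns an empty list if no known BOM is present. (Hypothetical helper.)
sub sniff_bom {
    my ($head) = @_;
    return ('UTF-8',    3) if $head =~ /\A\xEF\xBB\xBF/;
    return ('UTF-16LE', 2) if $head =~ /\A\xFF\xFE/;
    return ('UTF-16BE', 2) if $head =~ /\A\xFE\xFF/;
    return;
}

# Open $path in binary mode, detect the encoding from the BOM, then
# reposition just after the BOM and push the matching decoding layer.
sub open_by_bom {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!\n";
    read $fh, my $head, 4;
    my ($encoding, $bom_len) = sniff_bom($head // '');
    die "No recognised BOM in $path\n" unless defined $encoding;
    seek $fh, $bom_len, 0 or die "Cannot seek in $path: $!\n";
    binmode $fh, ":encoding($encoding)";
    return ($fh, $encoding);
}

# Demo with a throwaway UTF-16LE file containing BOM + "nm=".
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "\xFF\xFE" . "n\0m\0=\0";
close $tmp_fh or die "Cannot close $tmp_path: $!\n";

my ($fh, $encoding) = open_by_bom($tmp_path);
my $line = <$fh>;
close $fh;
print "$encoding: $line\n";
```

The sketch seeks to just past the BOM instead of rewinding to the start, so the decoded text does not begin with U+FEFF while the detected encoding name is still available to the caller. Because the handle starts out as :raw, no CRLF translation is applied; on Windows that would have to be handled separately.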

As always, there are probably more ways to do it.

hp
 
