Read/write with UCS-2* encodings - Possible???

Ilya Zakharevich

What should one fix to make UCS-2* encodings work on the file handles
in Perl? E.g., should not

perl -wlpe "BEGIN{binmode STDOUT, q(:encoding(UCS-2));}" < xyz > xyz1

just work? Currently, it requires an additional `binmode STDOUT' in
advance, which changes the semantics. And doing

binmode STDOUT;
binmode STDOUT, q(:encoding(UCS-2));
binmode STDOUT, q(:crlf);

does not work, since :crlf layer is put AFTER :encoding, not before it
as expected...
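
The effect of the layer order can be seen at the byte level. What follows is only a minimal sketch, assuming Perl 5.10 or later; it uses UTF-16BE instead of UCS-2 purely so that no BOM clutters the output, and the helper name `bytes_written` is made up for this illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $text through the given layers to a temp file
# and return the bytes that ended up on disk as a hex string.
sub bytes_written {
    my ($layers, $text) = @_;
    my ($tmp, $path) = tempfile(UNLINK => 1);
    close $tmp;
    open my $out, ">$layers", $path or die "open: $!";
    print {$out} $text;
    close $out or die "close: $!";
    open my $in, '<:raw', $path or die "open: $!";
    my $bytes = do { local $/; <$in> };
    close $in;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
}

# :crlf on top of :encoding: LF becomes CR LF while still characters.
my $crlf_on_top = bytes_written(':raw:encoding(UTF-16BE):crlf', "A\n");
# :crlf below :encoding: a lone 0x0d byte is injected into the encoded stream.
my $crlf_below  = bytes_written(':raw:crlf:encoding(UTF-16BE)', "A\n");

print "crlf on top: $crlf_on_top\n";   # 00 41 00 0d 00 0a
print "crlf below:  $crlf_below\n";    # 00 41 00 0d 0a
```

The second result has an odd number of bytes, which is why wide-character readers choke on it.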

Another indication is that

piconv -t UCS-2

gives wrong results on DOSISH platforms (which is not surprising,
since the version I have uses q(:encoding(UCS-2))).

For best results, I would prefer a solution which allows doing

binmode STDOUT, q(:encoding(UCS-2));

and

binmode STDOUT, q(:crlf);

in arbitrary order, so that the result does not depend on the order
(as it does now), but simply works ;-/.

Thanks,
Ilya
 
Marc Lucksch

Ilya said:
What should one fix to make UCS-2* encodings work on the file handles
in Perl? E.g., should not

I have no idea about UCS-2, but I had the same problems with UTF-16.

binmode STDOUT;
binmode STDOUT, q(:encoding(UCS-2));
binmode STDOUT, q(:crlf);

does not work, since :crlf layer is put AFTER :encoding, not before it
as expected...

Another indication is that

piconv -t UCS-2

gives wrong results on DOSISH platforms (which is not surprising,
since the version I have uses q(:encoding(UCS-2))).

To make perl generate UTF-16 or UTF-32 files that can be used by the
Windows Editor (CRLF), I used a small trick, which I also put in my
Sofu module:
http://search.cpan.org/~maluku/Sofu-0.3/lib/Data/Sofu.pm#NOTE_on_Unicode

#Write Windows CRLF UTF-16 Files
open my $fh,">:raw:encoding(UTF-16):crlf:utf8","out.sofu";

#Write Unix UTF-16 Files
open my $fh,">:raw:encoding(UTF-16)","out.sofu";
#Same goes for UTF-32

# Print UTF-8 Byte Order Mark (some programs want it, some programs die on it...)
print $fh chr(65279);

When I tested it, it worked both on Windows and Linux.

Maybe this helps you

Marc "Maluku" Lucksch
 
Peter J. Holzer

To make perl generate UTF-16 or UTF-32 files that can be used by the
Windows Editor (CRLF), I used a small trick, which I also put in my
Sofu module:
http://search.cpan.org/~maluku/Sofu-0.3/lib/Data/Sofu.pm#NOTE_on_Unicode

#Write Windows CRLF UTF-16 Files
open my $fh,">:raw:encoding(UTF-16):crlf:utf8","out.sofu";

Why the ":utf8"? It doesn't make sense to me (you want UTF-16, not
UTF-8, and you most definitely don't want to double-encode), and it
doesn't seem to make any difference anyway.

# Print UTF-8 Byte Order Mark (some programs want it, some programs die on it...)
print $fh chr(65279);

:encoding(UTF-16) already causes a BOM to be written, so this writes a
second BOM.
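
The double BOM is easy to reproduce. The sketch below is only an illustration and calls Encode directly instead of going through the PerlIO layer, on the assumption that the layer emits the same bytes at the start of a stream; `hex_bytes` is a made-up helper name.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: hex dump of a byte string.
sub hex_bytes { join ' ', map { sprintf '%02x', ord($_) } split //, $_[0] }

# Plain "UTF-16" prepends a big-endian BOM by itself.
my $with_bom    = hex_bytes(encode('UTF-16',   "A"));          # fe ff 00 41
# The endian-specific variants never add one.
my $without_bom = hex_bytes(encode('UTF-16BE', "A"));          # 00 41
# Printing chr(65279) (U+FEFF) on top of that yields two BOMs.
my $double_bom  = hex_bytes(encode('UTF-16',   "\x{FEFF}A"));  # fe ff fe ff 00 41

print "$_\n" for $with_bom, $without_bom, $double_bom;
```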

hp

PS: Only tested with 5.10.0.
 
Marc Lucksch

Peter said:
Why the ":utf8"? It doesn't make sense to me (you want UTF-16, not
UTF-8, and you most definitely don't want to double-encode), and it
doesn't seem to make any difference anyway.
As far as I remember, when I wrote that I was getting a warning about
wide characters from the :crlf layer. That doesn't happen anymore in
perl5.10.0 (see below), so :utf8 is no longer needed (it still works, though).

The :utf8 layer somehow makes the next layer accept wide characters even
if it shouldn't.

I tested it with perl5.8.1 when I did that. And the test routine I
wrote for this also works for perl5.10.0 (in Data::Sofu 0.3).

:encoding(UTF-16) already causes a BOM to be written, so this writes a
second BOM.
Yeah, that was my error, I didn't delete that line. There were two
lines before it describing how to make UTF-8, and that line belonged
to them.

Take my old test script:

#!/usr/bin/perl
use strict;
use warnings;

open my $fh,">:raw:encoding(UTF-16):crlf:utf8","windows.txt";
print $fh "Hello\nWorld";
close $fh;

open $fh,">:raw:encoding(UTF-16)","unix.txt";
print $fh "Hello\nWorld";
close $fh;

# this is logical, since PerlIO layers are applied from left to right
open $fh,">:raw:encoding(UTF-16):crlf","logical.txt";
print $fh "Hello\nWorld";
close $fh;

# this is illogical, since PerlIO layers are applied from left to right, but test it
# anyway
open $fh,">:raw:crlf:encoding(UTF-16)","unlogical.txt";
print $fh "Hello\nWorld";
close $fh;

# In Windows :crlf is the default
open $fh,">:encoding(UTF-16)","justutf-16.txt";
print $fh "Hello\nWorld";
close $fh;


Test on Windows with perl5.10.0

windows.txt:
Editor:
Hello
World
Vim:
Hello
World
[converted][noeol]

unix.txt:
Editor:
HelloWorld
Vim:
Hello
World
[unix][converted][noeol]


And now the others:

logical.txt:
Works as the windows.txt one (didn't for me before)
And there is no more warning. *happy*


unlogical.txt:
Vim:
þÿ\0H\0e\0l\0l\0o\0
\0W\0o\0r\0l\0d
[noeol]
Editor:
Hello਀圀漀爀氀

justutf-16.txt:
Same as unlogical.txt, which is strange

So to conclude:
open my $fh,">:raw:encoding(UTF-16):crlf:utf8","windows.txt";# working
open $fh,">:raw:encoding(UTF-16)","unix.txt"; #working for unix

open $fh,">:raw:encoding(UTF-16):crlf","logical.txt"; #Working in 5.10.0

When I add "\x{343f}" to the string it still works well in the Editor,
but my VIM7.2 won't read the files anymore. :(

Marc "Maluku" Lucksch
 
Ilya Zakharevich

(You presumably know you can use

binmode STDOUT, ":raw:encoding(UCS-2):crlf";

rather than three separate statements?)

No, I did not. And adding :utf8 at the end fixes a warning as well.

So now the question boils down to: how to make

binmode STDOUT, ':encoding(UCS-2)';

done on a filehandle which is in :crlf mode do the moral equivalent of

binmode STDOUT, ":raw:encoding(UCS-2):crlf";

And how one would easily switch :crlf layer off on such a handle?
Doing just `binmode' switches off encoding as well; and my perl does
not support :lf...
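
Until something like that exists in the core, the "moral equivalent" can be approximated by hand. This is only a sketch under stated assumptions: `set_encoding_keep_crlf` is a hypothetical helper, not anything Perl provides, and it detects :crlf by name via PerlIO::get_layers, which on DOSISH builds may still list the bottom crlf layer after it has been switched off.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper, not part of Perl: put :encoding(...) on a handle and,
# if a :crlf layer was present, re-add it on top of the encoding.
sub set_encoding_keep_crlf {
    my ($fh, $encoding) = @_;
    my $had_crlf = grep { $_ eq 'crlf' } PerlIO::get_layers($fh, output => 1);
    my $layers   = ":raw:encoding($encoding)" . ($had_crlf ? ':crlf' : '');
    binmode($fh, $layers) or die "binmode failed: $!";
    return $had_crlf;
}

my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;
open my $out, '>:crlf', $path or die "open: $!";   # mimic a DOSISH text handle
set_encoding_keep_crlf($out, 'UTF-16BE');          # BE variant: no BOM in the dump
print {$out} "A\n";
close $out or die "close: $!";

open my $in, '<:raw', $path or die "open: $!";
my $bytes = do { local $/; <$in> };
close $in;
my $hex = join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
print "$hex\n";   # 00 41 00 0d 00 0a
```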

(When this works, a lot of programs would magically start to work as expected.)
That's just a bug in piconv, then. It should binmode its filehandles if
it's writing potentially binary data.

How would it know this? What are the semantics of binmode()? As usual,
the documentation is close to useless:

The directives alter the behaviour of the file handle.

Thanks a lot!!! HOW do they alter the behaviour??? Is the intent to
be incremental (:crlf does not change the encoding layers), or are the
semantics to remove all the layers and add the specified ones?
That would be... weird. It matters whether the LF->CRLF conversion is
done before or after the characters->UCS-2 conversion, since you get
different results. There's already more than enough weirdness in PerlIO
without adding more.

No, I think this is not adding weirdness, but using it. IIRC, the
'binmode' directive is passed through the layers, and they have a
possibility to handle it.

The wide-char encoding layer should notice that somebody wants to add
:crlf, and should not let it pass through itself: the newly created
layer should be anchored `before' the encoding layer.

Yours,
Ilya
 
Marc Lucksch

Ilya said:
binmode STDOUT, ":raw:encoding(UCS-2):crlf";

And how one would easily switch :crlf layer off on such a handle?
Doing just `binmode' switches off encoding as well; and my perl does
not support :lf...

binmode STDOUT, ":pop"; #Removes the topmost layer
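
A small sketch of what that does to the layer stack, assuming :crlf really is the topmost layer (here it is, because the handle was opened that way). It uses a temporary file rather than STDOUT so the result can be inspected; the exact layer names printed depend on the platform and perl build.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;
open my $out, '>:raw:encoding(UTF-16BE):crlf', $path or die "open: $!";

my @before = PerlIO::get_layers($out);
binmode($out, ':pop') or die "binmode failed: $!";   # drops only the topmost layer
my @after  = PerlIO::get_layers($out);

print "before: @before\n";
print "after:  @after\n";

print {$out} "A\n";            # still encoded as UTF-16BE, but no CR is added now
close $out or die "close: $!";
```

Note that :pop is blind: if some other layer sits on top of :crlf, that one is removed instead.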

See perldoc perlio:
 
Ilya Zakharevich

Yes, but I don't think this will change either.

Sorry, can't parse your sentence... (Removing :crlf has no bad side
effects, and a lot of good ones [like some programs suddenly starting
to produce non-junk ;-].)
Yes it does.

If you think so, your mental picture of text/vs/binary is IMO
seriously screwed up... Not surprising if you claim little experience
with DOSISH systems.
'Text' as used by perl means 'some 8bit extension of ASCII,
with reasonably short lines delimited with the OS newline sequence, no
non-spacing control characters and no NULs'.

Nope, `text' means non-`binary'. And `binary' means that preservation
of the exact layout of bits is paramount. So changing encoding on
binary data is a nonono.
By this definition UCS-2 is
(and most wide encodings are) 'binary'. -B will return true on a UCS-2
file, for instance.

-B makes no sense at all in today's world. Anyway, even when it did,
it had no relation to :crlf etc.
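
For reference, the -B claim is easy to check. A minimal sketch, using UTF-16BE as a stand-in for UCS-2 and a temporary file; the variable names are made up for the illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;
open my $out, '>:raw:encoding(UTF-16BE)', $path or die "open: $!";
print {$out} "Hello\nWorld\n";
close $out or die "close: $!";

# -B and -T look at the start of the file; a NUL byte there marks it as binary,
# and every ASCII character in UTF-16 carries one.
my $looks_binary = -B $path ? 1 : 0;
my $looks_text   = -T $path ? 1 : 0;
print "-B: $looks_binary, -T: $looks_text\n";
```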
Because perl 5.8 was released a good while ago, so people have been
working with and around the current behaviour for some time. Changing it
now would almost certainly break working code,

I claim that removing :crlf can't break any code. I suspect that
reinserting it on top of :encoding(UCS2) would also have no
detrimental side effects, but to check this, one needs more knowledge
about (IMO, completely botched) behaviour of PerlIO.

In short, if "remove :crlf" worked AFTER "insert :encoding(UCS2)", one
must make sure that it still does. (Well, given the current pitiful
state of Perl, it might be that just `applying voodoo programming' may be
a sufficient justification - just do several experiments on how Perl
behaves without any inspection of source code...)
whereas a new layer on CPAN would not,

.... and would not fix thousands of programs which do not work now...
and would allow those with old perls to get the new behaviour if
they want it.

This does not make any sense to me. The problem is not with "those
who want it" (there are too few of them), but with "those who need it"
(read: anyone working on DOSISH, or writing code which can potentially
be used on DOSISH platforms).
It's possible that some sort of flag to :encoding (or to 'use
PerlIO::encoding') would be OK.

No. What is OK is to have correct behaviour. If having correct
behaviour has a non-0 (but still negligible) chance of breaking old
stuff, one should be able to request bug-for-bug compatibility by
environment variable.

Yours,
Ilya
 
