Problem with join and unicode

derek / nul · Aug 13, 2003

I have a problem with the join command inserting an extra \n on the end of
every line.
I don't seem to be able to get rid of them.

@linesengwag1 is a list of lines of parameters for M$ Train Sim.
I am 'join'ing the list so that I can encode it back to unicode.

I am trying to use s/// to get rid of the extra \n's. The program gives no
warnings but stops after the first line (i.e. where the first \n is).

Am I doing anything wrong?

==================================================
# win32 Activestate 5.8.0
use strict;
use warnings;

$linesengwagu = join("\n",@linesengwag1);
$linesengwagun = encode("UTF16LE", $linesengwagu);

#
$linesengwagun =~ s/"\x0D","\x00","\x0A","\x00"/"\x0D","\x00"/
#

open ENGWAG, ">$currentlongengfile", or die "Cannot open
binmode ENGWAG;
print ENGWAG "\xFF","\xFE",$linesengwagun;
close ENGWAG;

Simon Oliver · Aug 13, 2003

derek said:
I have a problem with the join command inserting an extra \n on the end of
every line.
I don't seem to be able to get rid of them.

@linesengwag1 is a list of lines of parameters for M$ Train Sim.
I am 'join'ing the list so that I can encode it back to unicode.

I am trying to use s/// to get rid of the extra \n's. The program gives no
warnings but stops after the first line (i.e. where the first \n is).

Quick guess (I don't have time to test) - try using s///s or enclose the
routine in a block and redefine the input record separator:

{
local $/;
# do stuff
}

derek / nul · Aug 13, 2003

"extra" to what?

extra to the \r\n's that are already there

Presumably it's you who's putting them there.

it would appear so

But where did you get them from? Your snippet doesn't show that.
Maybe those lines already end with a newline, and you're adding
another one with join() ? You aren't really showing us enough of your
working to be sure.

The list does indeed have \r\n's in it before the join

(Since you're working with 5.8.0, it seems to me that you're liable to
cause yourself additional problems by trying to work in binary (i.e.
bytes mode). I'd recommend working in Perl's native unicode
representation (utf8, i.e. character mode) and using I/O layers to
convert to/from the utf-16LE that you apparently need externally.)

pointers please

I'd look to not causing the damage in the first place, rather than
trying to repair the damage afterwards.

I have tried that but only cause more errors

Isn't that normal for s/// without the appropriate option flag? See
perlop or the usual Perl tutorial materials.

looking now, thanks

<http://www.perldoc.com/perl5.8.0/pod/perlfaq6.html#I'm-having-trouble-matching-over-more-than-one-line.--What's-wrong->

But you're concentrating on the wrong problem at that point, IMHO.

I am quite willing to accept that!!

Alan J. Flavell · Aug 13, 2003

extra to the \r\n's that are already there

Yup, I _said_ you already had some (and it appears you read them in
binmode(), otherwise there wouldn't have been \r to cope with as
well).

it would appear so

Well, you _did_ write this:

|| $linesengwagu = join("\n",@linesengwag1);

What more can I say than "don't do that!" ;-)

When you say "back" to unicode - are you implying that it was unicode
to start with? (presumably utf-16LE also??). If so, then some of the
operations you're carrying out on it in binmode() seem wrong to me,
and unnecessarily confusing. That's why I recommended you go for
Perl's native unicode text coding internally.

There was a discussion here fairly recently in which utf-16LE was
involved, maybe it'll help if you hunt it down.

pointers please

perluniintro and perlunicode, in your own documentation set or at
http://www.perldoc.com/perl5.8.0/pod/perluniintro.html
http://www.perldoc.com/perl5.8.0/pod/perlunicode.html
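
For orientation, the I/O-layer approach those documents describe might look roughly like the sketch below. The file name and sample lines are made up for illustration. The sketch assumes a Perl with PerlIO (5.8 or later) and uses :raw so that no newline translation takes place: the \r\n pairs are simply kept as part of the data. How the :crlf layer behaves on Windows is a separate question that this does not address.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical file and contents, for illustration only.
my ($tmp, $path) = tempfile(SUFFIX => '.eng', UNLINK => 1);
close $tmp;
my @lines = ("Wagon (\r\n", "  Name ( \x{00C9}toile )\r\n", ")\r\n");

# Write: characters in, UTF-16LE bytes out. The BOM is written as an
# ordinary character (U+FEFF), since UTF-16LE itself does not add one.
open my $out, '>:raw:encoding(UTF-16LE)', $path
    or die "Cannot open $path: $!";
print {$out} "\x{FEFF}", @lines;
close $out or die "Cannot close $path: $!";

# Read: UTF-16LE bytes in, characters out.
open my $in, '<:raw:encoding(UTF-16LE)', $path
    or die "Cannot open $path: $!";
my @read = <$in>;
close $in;
$read[0] =~ s/^\x{FEFF}//;    # drop the BOM from the first line

print scalar(@read), " lines read\n";
```

Inside the program the strings are ordinary Perl character strings, so join(), s/// and friends work on characters rather than on byte pairs.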

I have tried that

How, exactly?

but only cause more errors

Speak out, man! The PSI::ESP module is on the blink...

Are you perhaps looking for join('', @linesengwag1); ?
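
A small sketch of the difference, using made-up sample lines that already end in \r\n (as the list in question is said to):

```perl
use strict;
use warnings;

# Hypothetical sample data: every element already ends in "\r\n".
my @lines = ("Wagon (\r\n", "  Type ( Engine )\r\n", ")\r\n");

my $joined_nl    = join("\n", @lines);   # inserts "\n" between elements
my $joined_empty = join('',   @lines);   # plain concatenation

# Count the line feeds in each result.
my $lf_nl    = () = $joined_nl    =~ /\n/g;
my $lf_empty = () = $joined_empty =~ /\n/g;

print "join(\"\\n\", ...): $lf_nl line feeds\n";
print "join('', ...):   $lf_empty line feeds\n";
```

With three lines, the first form ends up with five line feeds (the three that were already there plus two separators) and the second with three, so there is nothing left to repair with s/// afterwards.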

Good luck

Tassilo v. Parseval · Aug 14, 2003

Also sprach derek / nul:

Tad, I removed the quotes from this line and the program fails to run past the
first line in the eng file??

You mean, when you write:

open ENGWAG, $File::Find::name or die $!;

you only get the first line from that file? I promise you that this is
not the case. There must be something else that you changed.

Tassilo

derek / nul · Aug 14, 2003

Also sprach derek / nul:

You mean, when you write:

open ENGWAG, $File::Find::name or die $!;

you only get the first line from that file? I promise you that this is
not the case. There must be something else that you changed.

About 4 levels down the program, I am reading another file, and that one fails
at the end of the first line, ie where the \n is.

derek / nul · Aug 18, 2003

One thing that I tried was putting a :crlf layer after the
:encoding(utf16le) on the open statement. Well, this then resulted
in the newlines being handled as expected, but it somehow screwed-up
the recognition of the BOM. If the text contained any non-ASCII
characters I'm concerned that it would upset those too?

Yes, sure, I _could_ do what the original poster was aiming at,
reading the stuff in binary, decoding it explicitly, and fooling with
the details of newlines for myself. But if the wheel has already been
invented, I wanna use it, right?

At this point I decided that I didn't really understand what the
documentation was telling me to do, so I decided to ask. Help?

Alan, thanks for that.
At least I am not going mad.
I had got to the point where I knew that x0d0a was part of the problem, I just
didn't know why.

Anyone else who would like a copy of an M$ UTF16LE file can get one here:
http://www.sgrail.org/files/gp9.zip

Derek

Alan J. Flavell · Aug 18, 2003

Alan, thanks for that.
At least I am not going mad.

Yes, at this stage at least I think I owe you an apology.

I've worked sufficiently with unicode data in Perl 5.8 in unix-ish
situations to feel that I have a confident grasp of the character
handling features; I'm afraid I had taken it for granted that it was
going to work reasonably in Windows too. But unless someone can show
a vital bit of magic that I'm missing, then at the moment I'm afraid I
can't see it working in quite the way I had intended.

The data characters work just great, but the newlines are shot to
hell, and in fact I found that it's even worse with output than I had
already described for input. I've ended up with an output file, meant
to be utf16le format, which "od -x"[1] says contains the following
piece of nonsense:

.... 000d 0a0d 0d00 0d00 000a ...

which is clearly useless.

[1] "od" is a Cygwin command.
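
One possible reading of that dump, offered as an assumption rather than a diagnosis: it is what you get if a byte-level LF to CR LF translation is applied after the text has already been encoded as UTF-16LE. The sketch below imitates that with a plain s/// on the encoded bytes and prints the result as little-endian 16-bit words, the way "od -x" would on an x86 machine.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Two CR LF line endings as UTF-16LE bytes: 0D 00 0A 00 0D 00 0A 00
my $bytes = encode('UTF-16LE', "\r\n\r\n");

# Imitate a byte-oriented newline translation: every 0A byte becomes 0D 0A.
(my $mangled = $bytes) =~ s/\x0A/\x0D\x0A/g;

# Show the bytes as little-endian 16-bit words.
my $dump = join ' ', map { sprintf '%04x', $_ } unpack 'v*', $mangled;
print "$dump\n";    # 000d 0a0d 0d00 0d00 000a
```

The extra 0D is a single byte, so everything after it is shifted by half a UTF-16 code unit, which would explain why the output is unusable.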

I played with the WIDE_SYSTEM_CALLS setting but it doesn't seem to
make any difference.

I had got to the point where I knew that x0d0a was part of the
problem, I just didn't know why.

I think at this stage (and unless/until someone produces a working
answer) your previous approach - which I was at first trying to avoid
- is indeed going to offer a more practical way right now: read/write
in binmode(), and do the encode/decode of the data as an explicit
move, separate from newline handling. (But that can all be packaged
together as reading and writing routines, with a clean programming
interface to the rest of the code.)
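
A minimal sketch of what such a pair of routines might look like. The names read_utf16le_lines and write_utf16le_lines are hypothetical, as is the sample data; the sketch assumes the file is UTF-16LE with an optional BOM and CR LF line endings, and that it is small enough to slurp.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use File::Temp qw(tempfile);

# Hypothetical helper: read a UTF-16LE file in binary mode and return
# its lines as character strings, without BOM or line endings.
sub read_utf16le_lines {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;                            # no newline translation
    my $bytes = do { local $/; <$fh> };     # slurp the whole file
    close $fh;
    my $text = decode('UTF-16LE', $bytes);  # decoding is a separate step
    $text =~ s/^\x{FEFF}//;                 # strip the BOM if present
    return split /\r?\n/, $text;            # newline handling, on characters
}

# Hypothetical helper: write lines back as UTF-16LE with BOM and CR LF.
sub write_utf16le_lines {
    my ($path, @lines) = @_;
    my $text = "\x{FEFF}" . join('', map { "$_\r\n" } @lines);
    open my $fh, '>', $path or die "Cannot open $path: $!";
    binmode $fh;
    print {$fh} encode('UTF-16LE', $text);
    close $fh or die "Cannot close $path: $!";
}

# Round trip through a temporary file.
my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;
write_utf16le_lines($path, 'Wagon (', '  Type ( Engine )', ')');
my @lines = read_utf16le_lines($path);
print scalar(@lines), " lines round-tripped\n";
```

Note that split drops trailing empty fields, so blank lines at the very end of a file are not preserved by this version. The rest of the program only ever sees plain lines of characters.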

I'd urge you to start by playing with a cut-down program which does
little more than read the input, print whatever diagnostics you think
fit to the "console" (STDOUT), and print the data to an output file.

When you can accurately reproduce the input file in the output file,
and feel you understand how and why it's working, then it's time
enough to build up to the specific program that you wanted to get.
That's my advice, for what it's worth, anyway (which may not be very
much, after this incident with the newlines...)

all the best

