Perl: Win-32 vs. linux

Maqo · Jun 2, 2005

Is there any reason the following would work on a linux installation of
Perl, but not using ActivePerl-5.8 on a Win-32 system? The tr///
operation successfully removes UTF-8 encoded &nbsp; characters from the
string in linux, but not Win-32, even after verifying that all required
modules are installed. Any thoughts would be greatly appreciated!

-----------------------------------------------------------------------

use LWP::Simple;
use Encode;

my $URL =
"http://www.pimco.com/LeftNav/Late+Breaking+Commentary/IO/2004/IO_07_04.htm";

$content = get($URL);
$decoded = decode("utf-8"=>$content);
$decoded =~ tr/\x{00a0}/ /;

print $decoded;

John Bokma · Jun 2, 2005

Maqo said:
Is there any reason the following would work on a linux installation
of Perl, but not using ActivePerl-5.8 on a Win-32 system? The tr///
operation successfully removes UTF-8 encoded &nbsp; characters from
the string in linux, but not Win-32, even after verifying that all
required modules are installed. Any thoughts would be greatly
appreciated!

-----------------------------------------------------------------------

use LWP::Simple;
use Encode;

my $URL =
"http://www.pimco.com/LeftNav/Late+Breaking+Commentary/IO/2004/IO_07_04.htm";

$content = get($URL);
$decoded = decode("utf-8"=>$content);
$decoded =~ tr/\x{00a0}/ /;

print $decoded;

My best guess: decode maps to the internal encoding used by Perl, and I
guess on Windows this is Win-something, and on Linux ISO-something.

Why not remove the utf-8 encoded non-breakable spaces before the decoding
step?
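
A minimal sketch of that idea, assuming the fetched content really is a UTF-8 byte string. The sub name strip_nbsp_octets and the sample bytes are made up for illustration and stand in for the result of get($URL):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Replace the two-octet UTF-8 encoding of U+00A0 (0xC2 0xA0) with a plain
# space while the data is still a byte string, then decode the result.
sub strip_nbsp_octets {
    my ($octets) = @_;
    $octets =~ s/\xC2\xA0/ /g;
    return decode('UTF-8', $octets);
}

# Hypothetical stand-in for get($URL): "a", NBSP, "b", e-acute, all as UTF-8.
my $sample = "a\xC2\xA0b\xC3\xA9";
my $text   = strip_nbsp_octets($sample);

# Show the resulting characters as code points.
printf "U+%04X ", ord($_) for split //, $text;
print "\n";
```

In valid UTF-8 the octet pair 0xC2 0xA0 can only be U+00A0, so the byte-level substitution should not damage other characters. It does nothing for &nbsp; entity references, which are plain ASCII text.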

Dave · Jun 2, 2005

John Bokma said:
My best guess: decode maps to the internal encoding used by Perl, and I
guess on Windows this is Win-something, and on Linux ISO-something.

Why not remove the utf-8 encoded non-breakable spaces before the decoding
step?

Two thoughts and a question:

1) Does adding:

use utf8;

make any difference?

2) Print $decoded to a file before the tr// step to see whether it contains
the string &nbsp;, a U+00a0 character, or just a mess. This will help in
finding where in the code the problem is.
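
One way to run the check in 2), as a rough sketch only. The sub name nbsp_report, the file name decoded_dump.txt, and the sample string are invented for the example; in the real script $decoded would come from decode():

```perl
use strict;
use warnings;

# Count the two forms a no-break space can take in a decoded string:
# the six-character entity reference and the single character U+00A0.
sub nbsp_report {
    my ($decoded) = @_;
    my $entities = () = $decoded =~ /&nbsp;/g;
    my $chars    = () = $decoded =~ /\x{00A0}/g;
    return ($entities, $chars);
}

# Hypothetical stand-in for $decoded from the original script.
my $decoded = "one&nbsp;two\x{00A0}three&nbsp;four";

# Write the string out as UTF-8 so it can be inspected in an editor.
# 'decoded_dump.txt' is an arbitrary file name chosen for this example.
open my $fh, '>:encoding(UTF-8)', 'decoded_dump.txt' or die "open: $!";
print {$fh} $decoded;
close $fh or die "close: $!";

my ($entities, $chars) = nbsp_report($decoded);
print "entity references: $entities, U+00A0 characters: $chars\n";
```

If the entity count is non-zero and the character count is zero, the tr/// has nothing to act on, whatever the platform.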

Question.

Where in the script is the &nbsp; string in the original html being parsed
into a U+00a0 ? Does get() do this automatically? I can't see anything in
the documentation that mentions this, but I can't see why it would happen on
a Linux system and not on Windows.

Dave

Alan J. Flavell · Jun 2, 2005

John Bokma said:
My best guess: decode maps to the internal encoding used by Perl,
and I guess on Windows this is Win-something, and on Linux
ISO-something.

I don't know why you think it's appropriate to "guess" this. The Perl
documentation is pretty clear about how characters are stored
internally (perldoc perlunicode), and if there *was* a difference, one
would expect to find it in the appropriate platform-specific perl
documentation.

The only thing that comes to mind is if the code calls Win32 *system*
functions, it may be necessary to run it with "wide system calls"
enabled. But that doesn't appear to be happening here.

If this was my problem, I'd be inclined to prepare a small test
document which I /knew/ contained these actual characters (as opposed
to containing &nbsp; character entity references, I mean), rather than
relying on some massive web document from elsewhere; and print out in
detail what's going on internally. But that's only for diagnosis
purposes: Perl's unicode implementation works best when you just use
it, not mess around with internals.

(I really can't be bothered to wade through the whole mess of HTML and
javascript contained at the cited URL to get further with this,
sorry.)
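
Such a test might look roughly like this. It is a sketch for diagnosis only; the three-character "document" and the helper name codepoints are made up for the example:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Render a character string as a list of code points for inspection.
sub codepoints {
    my ($string) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $string;
}

# A tiny test "document": the octets 0xC2 0xA0 are the UTF-8 encoding of U+00A0.
my $octets  = "x\xC2\xA0y";
my $decoded = decode('UTF-8', $octets);
print "before tr: ", codepoints($decoded), "\n";

# The same transliteration as in the original script, applied to a copy.
(my $cleaned = $decoded) =~ tr/\x{00a0}/ /;
print "after tr:  ", codepoints($cleaned), "\n";
```

If U+00A0 shows up before the tr/// and U+0020 after it on both systems, the decode and tr/// steps are behaving the same way and the difference lies elsewhere, for example in what the fetched document actually contains.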

John Bokma said:
Why not remove the utf-8 encoded non-breakable spaces before the
decoding step?

It worries me that the questioner writes:

| The tr/// operation successfully removes UTF-8 encoded &nbsp;
| characters from the string in linux, but not Win-32

I see that the source contains quite a number of &nbsp; character
entity references. So the question is, are we really talking about
no-break space *characters*, or are we talking about their character
entity references?
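
The distinction matters for the tr///, as this small sketch tries to show. The sample string and the sub name replace_nbsp are invented for illustration:

```perl
use strict;
use warnings;

# Replace both forms of the no-break space with an ordinary space.
sub replace_nbsp {
    my ($text) = @_;
    $text =~ s/&nbsp;/ /g;      # character entity reference (six ASCII characters)
    $text =~ tr/\x{00a0}/ /;    # the actual U+00A0 character
    return $text;
}

# Hypothetical fragment containing one entity reference and one real character.
my $html = "price:&nbsp;10\x{00a0}USD";

# tr/// alone only touches the real character and leaves the entity in place.
(my $tr_only = $html) =~ tr/\x{00a0}/ /;
print "tr only: $tr_only\n";
print "both:    ", replace_nbsp($html), "\n";
```

The substitution here only covers the named reference. Numeric references such as &#160; are not handled, and a general solution would more likely use a module such as HTML::Entities to decode all entities first.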

If we're really talking about *characters*, then note the "Caveat" in
the documentation for decode() in Encode:

| When you run $string = decode("utf8", $octets), then $string may not
| be equal to $octets. Though they both contain the same data, the utf8
| flag for $string is on unless $octets entirely consists of ASCII data
| (or EBCDIC on EBCDIC machines). See The UTF-8 flag below.
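
For diagnosis only, the effect described in that caveat can be observed with something like the following sketch. The helper name describe is made up; utf8::is_utf8 is the built-in that reports the flag:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Summarise a string: its length as Perl sees it, and the state of the
# internal utf8 flag.
sub describe {
    my ($s) = @_;
    return sprintf 'length %d, flag %s',
        length($s), (utf8::is_utf8($s) ? 'on' : 'off');
}

my $octets = "\xC2\xA0";                  # two bytes: UTF-8 for U+00A0
my $string = decode('UTF-8', $octets);    # one character: U+00A0

print 'octets: ', describe($octets), "\n";
print 'string: ', describe($string), "\n";
print 'equal?  ', ($string eq $octets ? 'yes' : 'no'), "\n";
```

The byte string has length 2 and the decoded string has length 1, so a tr/\x{00a0}/ / applied to the undecoded octets would only replace the 0xA0 byte and leave a stray 0xC2 behind.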

There's too much fiddling with internals going on here, IMHO. Perl's
unicode implementation usually works best when you just use it. The
web document in question is sent as utf-8 from its server, by the way.
