Perl: Win-32 vs. linux

Maqo

Is there any reason the following would work on a linux installation of
Perl, but not using ActivePerl-5.8 on a Win-32 system? The tr///
operation successfully removes UTF-8 encoded &nbsp; characters from the
string in linux, but not Win-32, even after verifying that all required
modules are installed. Any thoughts would be greatly appreciated!

-----------------------------------------------------------------------

use LWP::Simple;
use Encode;

my $URL =
"http://www.pimco.com/LeftNav/Late+Breaking+Commentary/IO/2004/IO_07_04.htm";

$content = get($URL);
$decoded = decode("utf-8"=>$content);
$decoded =~ tr/\x{00a0}/ /;

print $decoded;
 
John Bokma

Maqo said:
Is there any reason the following would work on a linux installation
of Perl, but not using ActivePerl-5.8 on a Win-32 system? The tr///
operation successfully removes UTF-8 encoded &nbsp; characters from
the string in linux, but not Win-32, even after verifying that all
required modules are installed. Any thoughts would be greatly
appreciated!

-----------------------------------------------------------------------

use LWP::Simple;
use Encode;

my $URL =
"http://www.pimco.com/LeftNav/Late+Breaking+Commentary/IO/2004/IO_07_04.htm";

$content = get($URL);
$decoded = decode("utf-8"=>$content);
$decoded =~ tr/\x{00a0}/ /;

print $decoded;

My best guess: decode maps to the internal encoding used by Perl, and I
guess on Windows this is Win-something, and on Linux ISO-something.

Why not remove the utf-8 encoded non-breakable spaces before the decoding
step?
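
A minimal sketch of that suggestion, assuming the page really contains no-break space characters encoded as the two octets 0xC2 0xA0 (the sub name strip_nbsp_octets and the sample string are made up for illustration). The substitution runs on the raw bytes, and only then is the result decoded:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Replace UTF-8 encoded no-break spaces (octets 0xC2 0xA0) with plain
# spaces while the data is still a byte string, then decode the result.
sub strip_nbsp_octets {
    my ($octets) = @_;
    $octets =~ s/\xC2\xA0/ /g;
    return decode('UTF-8', $octets);
}

# Hypothetical sample: "a", NO-BREAK SPACE, "b" as UTF-8 octets.
my $text = strip_nbsp_octets("a\xC2\xA0b");
print $text eq 'a b' ? "replaced\n" : "not replaced\n";
```

In well-formed UTF-8 the octet pair C2 A0 can only be U+00A0, so the byte-level substitution should not damage other characters. It does nothing for literal &nbsp; entity references in the HTML.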
 
Dave

John Bokma said:
My best guess: decode maps to the internal encoding used by Perl, and I
guess on Windows this is Win-something, and on Linux ISO-something.

Why not remove the utf-8 encoded non-breakable spaces before the decoding
step?


Two thoughts and a question:

1) Does adding:

use utf8;

make any difference?


2) Print $decoded to a file before the tr// step to see whether it contains
an &nbsp; string, a U+00A0 character, or just a mess. This will help in
finding where in the code the problem is.
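
One way to make that check less dependent on how an editor or console displays the file is to count both forms directly. This is only a sketch: the sub name nbsp_report and the sample string are invented for illustration, and in the real script the argument would be $decoded.

```perl
use strict;
use warnings;

# Report what a string actually contains: literal "&nbsp;" entity
# references, real U+00A0 characters, or neither.
sub nbsp_report {
    my ($decoded) = @_;
    my $entities   = () = $decoded =~ /&nbsp;/g;     # count of matches
    my $characters = () = $decoded =~ /\x{00A0}/g;   # count of matches
    return { entities => $entities, characters => $characters };
}

# Hypothetical test string with one entity and one real character.
my $sample = "one&nbsp;two\x{00A0}three";
my $r = nbsp_report($sample);
printf "entities=%d characters=%d\n", $r->{entities}, $r->{characters};
```

If the entity count is non-zero and the character count is zero, the tr/// has nothing to act on, whatever the platform.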

Question.

Where in the script is the &nbsp; string in the original HTML being parsed
into a U+00A0? Does get() do this automatically? I can't see anything in
the documentation that mentions this, and I can't see why it would happen on
a Linux system and not on Windows.

Dave
 
Alan J. Flavell

John Bokma said:
My best guess: decode maps to the internal encoding used by Perl,
and I guess on Windows this is Win-something, and on Linux
ISO-something.

I don't know why you think it's appropriate to "guess" this. The Perl
documentation is pretty clear about how characters are stored
internally (perldoc perlunicode), and if there *was* a difference, one
would expect to find it in the appropriate platform-specific perl
documentation.

The only thing that comes to mind is if the code calls Win32 *system*
functions, it may be necessary to run it with "wide system calls"
enabled. But that doesn't appear to be happening here.

If this was my problem, I'd be inclined to prepare a small test
document which I /knew/ contained these actual characters (as opposed
to containing &nbsp; character entity references, I mean), rather than
relying on some massive web document from elsewhere; and print out in
detail what's going on internally. But that's only for diagnosis
purposes: Perl's unicode implementation works best when you just use
it, not mess around with internals.
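
A small diagnostic along those lines might look like the following. It is a sketch, not a fix: the three-character input stands in for a real test document, and the helper name code_points is invented here. Printing code points as U+XXXX avoids depending on what the console does with non-ASCII output.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Render each character of a string as U+XXXX so the result does not
# depend on how the terminal displays non-ASCII output.
sub code_points {
    my ($string) = @_;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split(//, $string);
}

# Known input: "A", NO-BREAK SPACE, "B" as UTF-8 octets (a hypothetical
# stand-in for the real page).
my $octets  = "A\xC2\xA0B";
my $decoded = decode('UTF-8', $octets);
print 'before tr: ', code_points($decoded), "\n";

# Same tr/// as in the original script, applied to a copy.
(my $cleaned = $decoded) =~ tr/\x{00A0}/ /;
print 'after tr:  ', code_points($cleaned), "\n";
```

Running the same file on both systems and comparing the two lines of output should show whether decode() or tr/// behaves differently, or whether the difference lies in the input.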

(I really can't be bothered to wade through the whole mess of HTML and
javascript contained at the cited URL to get further with this,
sorry.)

John Bokma said:
Why not remove the utf-8 encoded non-breakable spaces before the
decoding step?

It worries me that the questioner writes:

| The tr/// operation successfully removes UTF-8 encoded &nbsp;
| characters from the string in linux, but not Win-32

I see that the source contains quite a number of &nbsp; character
entity references. So the question is, are we really talking about
no-break space *characters*, or are we talking about their character
entity references?
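
The distinction matters because tr/\x{00a0}/ / only touches the character, never the six-character string &nbsp;. A rough sketch that handles both forms, assuming only the &nbsp; and &#160; spellings need covering (the sub name normalize_nbsp is invented for illustration):

```perl
use strict;
use warnings;

# Replace both forms of a no-break space with an ordinary space: the
# entity references "&nbsp;" / "&#160;" and the character U+00A0.
sub normalize_nbsp {
    my ($html) = @_;
    $html =~ s/&(?:nbsp|#160);/ /g;   # entity references
    $html =~ tr/\x{00A0}/ /;          # actual characters
    return $html;
}

# Hypothetical sample with one entity reference and one real character.
my $sample = "a&nbsp;b\x{00A0}c";
(my $tr_only = $sample) =~ tr/\x{00A0}/ /;
print "tr only:    $tr_only\n";
print "normalized: ", normalize_nbsp($sample), "\n";
```

With the tr/// alone the entity reference survives untouched. Other spellings such as hexadecimal character references are not handled here; a proper entity decoder would be the more complete route.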

If we're really talking about *characters*, then note the "Caveat" in
the documentation for decode() in Encode:

| When you run $string = decode("utf8", $octets), then $string may not
| be equal to $octets. Though they both contain the same data, the utf8
| flag for $string is on unless $octets entirely consists of ASCII data
| (or EBCDIC on EBCDIC machines). See The UTF-8 flag below.
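
For diagnosis only, the effect described in that caveat can be observed with utf8::is_utf8. This is a sketch with made-up sample inputs and an invented helper name (describe_decoded). What gets reported for the pure-ASCII case may depend on the installed Encode version, so no particular result is claimed for it.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Describe a decoded string: lengths before and after decoding, and
# whether Perl's internal UTF-8 flag is set on the result.
sub describe_decoded {
    my ($octets) = @_;
    my $string = decode('utf8', $octets);
    return {
        octet_length => length($octets),
        char_length  => length($string),
        utf8_flag    => utf8::is_utf8($string) ? 'on' : 'off',
    };
}

# Hypothetical inputs: pure ASCII, and "a", NO-BREAK SPACE, "b" as octets.
for my $octets ('plain', "a\xC2\xA0b") {
    my $d = describe_decoded($octets);
    printf "octets=%d chars=%d flag=%s\n",
        @{$d}{qw(octet_length char_length utf8_flag)};
}
```

For the second input the four octets become three characters and the flag is on, which is why $string and $octets no longer compare equal.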

There's too much fiddling with internals going on here, IMHO. Perl's
unicode implementation usually works best when you just use it. The
web document in question is sent as utf-8 from its server, by the way.
 
