Erik Wasser
Hello Usenet.
I'm somewhat confused about XML and UTF-8. I'm working with
XML-Simple and trying to decode some XML containing German umlauts
(ISO-8859-1). The first line of the XML declares the encoding correctly (see code
below), but I'm getting different results from XML-Simple depending on whether it uses
the default XML parser, XML::SAX, or a second parser, XML::Parser.
The following code decodes the mini XML file and prints the UTF8
flags of the resulting strings.
Can someone run this code on their machine and post the results? Thanks.
The results on my machine are:
ÃÃÃäöüà (0) cmp ÄÖÜäöüß (0) = -1
ÄÖÜäöüß (1) cmp ÄÖÜäöüß (0) = 0
The first line was parsed by XML::SAX and the second line was parsed by
XML::Parser. My conclusions:
1) Line 1 is wrong, line 2 is correct.
2) The output should be line 2 both times.
3) There is a bug in XML::SAX.
Your opinion?
The code (saved as ISO-8859-1 on disk):
#!/usr/bin/perl -w
use strict;
use warnings;
use XML::Simple;
use Encode;
foreach (1..2)
{
    my $q1 = XMLin("<?xml version='1.0' encoding='iso-8859-1'?>\n<a>ÄÖÜäöüß</a>");
    my $q2 = "ÄÖÜäöüß";
    printf "%s (%d) cmp %s (%d) = %d\n"
        , $q1, Encode::is_utf8($q1)
        , $q2, Encode::is_utf8($q2)
        , $q1 cmp $q2;
    # and again with the non-default parser
    $XML::Simple::PREFERRED_PARSER = 'XML::Parser';
}
PS: I'm using perl v5.8.7, XML-SAX-0.13, XML-Parser-2.34 and
expat-1.95.8.
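To make the difference between the two result lines easier to see, a small sketch like the following could dump the code points of a string instead of printing it raw. The helper name dump_chars is made up for illustration, and the sample bytes are only my assumption about what a UTF-8-encoded string with the flag off would look like; it does not prove what XML::SAX actually returns.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode;

# Hypothetical helper: list every character of a string as U+XXXX
sub dump_chars {
    my ($s) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $s;
}

# Three umlauts as ISO-8859-1 bytes (assumed sample data)
my $latin1_bytes = "\xC4\xD6\xDC";

# Decoded into Perl characters: one character per umlaut, UTF8 flag on
my $chars = Encode::decode('iso-8859-1', $latin1_bytes);

# Encoded as UTF-8 octets: two bytes per umlaut, UTF8 flag off
my $utf8_bytes = Encode::encode('UTF-8', $chars);

for my $s ($chars, $utf8_bytes) {
    printf "%s (flag %d)\n", dump_chars($s), Encode::is_utf8($s) ? 1 : 0;
}
```

If $q1 from the first loop iteration dumps like the second line here (pairs starting with U+00C3), that would suggest it holds undecoded UTF-8 octets rather than characters.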