Different results parsing a XML file with XML::Simple (XML::Sax vs. XML::Parser)

Erik Wasser · Mar 2, 2006

Hello Usenet.

I'm somewhat confused about XML and UTF-8. I'm working with
XML-Simple and I'm trying to decode some XML containing German umlauts
(ISO-8859-1). The first XML line declares the encoding correctly (see code
below). But I'm getting different results using XML-Simple with the
default XML parser, XML::SAX, and with a second parser, XML::Parser.
The following code tries to decode the mini XML file and prints the UTF8
flags of the resulting strings.

Can someone run this code on their machine and post the results? Thanks.
The results on my machine are:

ÃÃÃÃ¤Ã¶Ã¼Ã (0) cmp ÄÖÜäöüß (0) = -1
ÄÖÜäöüß (1) cmp ÄÖÜäöüß (0) = 0

The first line was parsed by XML::SAX and the second line was parsed by
XML::Parser. My conclusions:

1) Line 1 is wrong, line 2 is correct
2) The output should be line 2 two times.
3) There is a bug in XML::SAX

Your opinion?

The code (written in ISO-8859-1 on disc):

#!/usr/bin/perl -w

use strict;
use warnings;

use XML::Simple;
use Encode;

foreach (1..2)
{
    my $q1 = XMLin("<?xml version='1.0' encoding='iso-8859-1'?>\n<a>ÄÖÜäöüß</a>");
    my $q2 = "ÄÖÜäöüß";

    printf "%s (%d) cmp %s (%d) = %d\n"
        , $q1, Encode::is_utf8($q1)
        , $q2, Encode::is_utf8($q2)
        , $q1 cmp $q2;

    # and again with the non default parser
    $XML::Simple::PREFERRED_PARSER = 'XML::Parser';
}

PS: I'm using perl v5.8.7, XML-SAX-0.13, XML-Parser-2.34 and
expat-1.95.8.
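
One possible reading of the first output line, offered as a sketch rather than a diagnosis: it looks like the UTF-8 encoding of the umlauts held in a byte string with the UTF8 flag off. The following uses only the core Encode module and does not call XML::Simple at all; the variable names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode is_utf8);

# The umlaut string as ISO-8859-1 bytes, written with escapes so that the
# encoding of this script file does not matter.
my $latin1_bytes = "\xC4\xD6\xDC\xE4\xF6\xFC\xDF";    # ÄÖÜäöüß

# Decoding gives a character string: one element per character, UTF8 flag on.
my $chars = decode('iso-8859-1', $latin1_bytes);

# Encoding those characters as UTF-8 gives a byte string: two bytes per
# umlaut, UTF8 flag off.
my $utf8_bytes = encode('UTF-8', $chars);

printf "decoded:    length %d, flag %d, cmp %d\n",
    length($chars), is_utf8($chars) ? 1 : 0, $chars cmp $latin1_bytes;
printf "re-encoded: length %d, flag %d, cmp %d\n",
    length($utf8_bytes), is_utf8($utf8_bytes) ? 1 : 0, $utf8_bytes cmp $latin1_bytes;
```

The decoded string should report length 7, flag 1 and a comparison result of 0, while the re-encoded one should report length 14, flag 0 and -1, which is the same pattern as the two result lines above.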

A. Sinan Unur · Mar 2, 2006

First off, let me say I don't know much about this stuff. I am on the US
English version of XP. I copied and pasted the code above into Gvim, and
then ran it. I got:

D:\Home\asu1\UseNet\clpmisc> r > results.txt

D:\Home\asu1\UseNet\clpmisc> cat results.txt
ÄÖÜäöüß (1) cmp ÄÖÜäöüß (0) = 0
ÄÖÜäöüß (1) cmp ÄÖÜäöüß (0) = 0

I would be inclined to look at what changed in XML-SAX between versions
0.12 and 0.13, but then, as I said, I don't know much about encodings
etc.

I have XML-SAX-0.12 and XML-Parser-2.34 and

D:\Home\asu1\UseNet\clpmisc> perl -v

This is perl, v5.8.7 built for MSWin32-x86-multi-thread
(with 14 registered patches, see perl -V for more detail)

Copyright 1987-2005, Larry Wall

Binary build 815 [211909] provided by ActiveState
http://www.ActiveState.com
ActiveState is a division of Sophos.
Built Nov 2 2005 08:44:52

Sinan

robic0 · Mar 5, 2006

You didn't try to decode in German! You might have changed the "code page"
to German to get different character sets. It doesn't matter. I'm looking at
your characters in whatever "code page" is on my machine. UTF8 is Unicode.
It's not discernible unless you have a Unicode "aware" renderer. You can't
just change the characters on the page via cut & paste and have them turn into
Unicode. If you open or save a Unicode document from a Unicode aware editor
the represented character will not be noticeable as Unicode, so it's not
something that can be "cut 'n pasted" into a newsgroup, as code to be
tested! UTF8, even "multi-byte", is transparent to the user and only known
to the renderer. Data from a file that is read into a parser (or a Perl
program that is UTF8 aware) that is Unicode is treated as Unicode in its
variable representation and interaction with other variables. If a regex
is to be applied to Unicode data from an aware Perl parser, it works
every time.

robic0 · Mar 5, 2006

Just a followup, I know your question was with XML, but if you want to use
Unicode "outside" the 0-128 bracket for regex you might want to use the
codes as in this simple example (which just uses various "ranges"):

@UC_Nstart = (
    "\\x{C0}-\\x{D6}",
    "\\x{D8}-\\x{F6}",
    "\\x{F8}-\\x{2FF}",
    "\\x{370}-\\x{37D}",
    "\\x{37F}-\\x{1FFF}",
    "\\x{200C}-\\x{200D}",
    "\\x{2070}-\\x{218F}",
    "\\x{2C00}-\\x{2FEF}",
    "\\x{3001}-\\x{D7FF}",
    "\\x{F900}-\\x{FDCF}",
    "\\x{FDF0}-\\x{FFFD}",
    "\\x{10000}-\\x{EFFFF}",
);
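
The post does not show how such range strings are meant to be used. One plausible use, sketched here under that assumption, is to join them into a single bracketed character class. The names @ranges, $class and $name_re are hypothetical, and only the first three ranges are used to keep the example short.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A few of the ranges, kept as strings with a literal backslash so that the
# regex engine (not the string literal) interprets the \x{...} escapes.
my @ranges = (
    "\\x{C0}-\\x{D6}",
    "\\x{D8}-\\x{F6}",
    "\\x{F8}-\\x{2FF}",
);

# Join the ranges into one character class and compile the pattern once.
my $class   = '[' . join('', @ranges) . ']';
my $name_re = qr/\A$class+\z/;

# Sample strings are given as code points so the file encoding does not matter.
my %samples = (
    "\x{C4}\x{D6}\x{DC}\x{E4}\x{F6}\x{FC}\x{DF}" => 'umlauts and sharp s',
    "\x{D7}"                                       => 'multiplication sign',
    "abc"                                          => 'plain ASCII',
);

for my $text (sort keys %samples) {
    printf "%-20s %s\n", $samples{$text},
        $text =~ $name_re ? 'match' : 'no match';
}
```

With these three ranges the umlaut string matches, while U+00D7 and plain ASCII do not, because the ranges deliberately skip U+00D7 and start above the ASCII block.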

Erik Wasser · Mar 5, 2006

robic0 said:
Just a followup, I know your question was with XML, but if you want to use
Unicode "outside" the 0-128 bracket for regex you might want to use the
codes as in this simple example (which just uses various "ranges"):

My question was: why are two XML parsers giving different results? It is the
different results that confuse me, not Unicode itself.

Peter J. Holzer · Mar 5, 2006

Erik Wasser wrote:

[XML::Simple gives correct results with XML::Parser, but wrong results
with XML::SAX]

My question was: why are two XML parsers giving different results? It is the
different results that confuse me, not Unicode itself.

Looks like a bug in XML::SAX or one of the libraries it uses.
However, like Sinan, I cannot reproduce it here on a Debian Sarge
system:

perl, v5.8.4 built for i386-linux-thread-multi
XML::Simple version 2.14
XML::SAX version 0.12
XML::Parser version 2.34
libexpat1 1.95.8-3

So it may be caused by something weird in your environment.
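
Since the reports in this thread differ mainly in what is installed, a small script along these lines might make the environments easier to compare. It is only a sketch: module_version is a hypothetical helper, and the last part assumes that XML::SAX->parsers returns the list of registered SAX drivers as described in the XML::SAX documentation.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the installed version of a module, or undef if it cannot be loaded.
sub module_version {
    my ($module) = @_;
    return unless $module =~ /\A\w+(?:::\w+)*\z/;    # guard the string eval
    return unless eval "require $module; 1";
    my $version = eval { $module->VERSION };
    return defined $version ? $version : 'unknown';
}

for my $module (qw(XML::Simple XML::SAX XML::Parser)) {
    my $version = module_version($module);
    printf "%-12s %s\n", $module, defined $version ? $version : 'not installed';
}

# XML::SAX keeps a registry of installed SAX drivers, so two machines with
# the same XML::SAX version can still end up using different parsers.
if (module_version('XML::SAX')) {
    my $parsers = XML::SAX->parsers;    # array ref of hashes with a Name key
    print "registered SAX parser: $_->{Name}\n" for @$parsers;
}
```

If the list of registered SAX parsers differs between a machine that shows the problem and one that does not, that would be a natural place to look next.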

hp
