XML::Simple and utf8 woes

Guest

Dear wizards,

I use XML::Simple to parse an XML file and
also to write it out. The problem lies in the
utf8 character data contained in the XML
source. While the XMLin() function seems
to read them properly, the XMLout() function
tries to replace utf8 material by multibyte
nonsense.

Below is my minimal example, run under perl 5.8.5
on a Fedora C3 box. Just compare the output
of the script (in w.xml) with its input, in DATA.

Please advise on how to fix the broken utf8 output.

Thanks in advance,
Oliver.

#!/usr/bin/perl
use XML::Simple;
print "Reading data from XML source...\n";
$data=XMLin(\*DATA,
    ForceArray=>[manju,hauer],
    ContentKey=>'-content',
    KeyAttr=>[name],
);
print "Retrieve and display data example:\n";
$k='0004.1';
print $k.": ".
    $data->{lemma}->{$k}->{manju}->[0].
    "\n";
print "Writing data to XML file...\n";
XMLout($data,
    NumericEscape=>0,
    RootName=>'wuti',
    XMLDecl=>1,
    OutputFile=>'w.xml',
);
__DATA__
<?xml version='1.0' encoding='utf-8' standalone='yes'?>
<wuti>
<lemma name="0004.1">
<hauer>in der Morgendämmerung (H).</hauer>
<manju>farhûn suwaliyame</manju>
</lemma>
<lemma name="0004.2">
<hauer>Morgendämmerung.</hauer>
<manju>gersi fersi</manju>
</lemma>
</wuti>
 
ngoc

: Below is my minimal example, run under perl 5.8.5
: on a Fedora C3 box. Just compare the output
: of the script (in w.xml) with its input, in DATA.

I tried your code on Windows XP. It gives utf-8 output. But if I use
RootName => 'unicode here', only the output of the root name is changed
(a manual fix will help); the other parts are in utf-8. I suggest that you:

1. Save your perl program in utf-8 encoding.

2. This step is not necessary in theory, but maybe it helps:

open my $fh, '>:encoding(UTF-8)', $path or die "open($path): $!";
XMLout($ref, OutputFile => $fh);

3. Try it in a Windows XP or 2000 environment to see whether it behaves differently.
 
Guest

: I tried your code in Windows XP. It gives utf-8 output.

Really? I'll have to try tomorrow, don't have an XP box here right now.

: RootName => 'unicode here', only the output of rootname is changed
: (manual fix will help), other parts are in utf-8.

Sounds interesting, I'll try this one, too.

: 1. To save your perl program in utf-8 encoding.

That doesn't make sense; I write everything in a utf-8 environment. Did you
notice the a-umlaut and u-caret in the data?

: 2. This step in theory is not necessary. But maybe it helps

: open my $fh, '>:encoding(UTF-8)', $path or die "open($path): $!";
: XMLout($ref, OutputFile => $fh);

I had tried this already before posting, but to no avail.

: 3. Try in Windows XP or 2000 environment to see it is different

Tomorrow.

Thanks, Oliver.
 
Guest

(e-mail address removed)-berlin.de wrote:

: Really? I'll have to try tomorrow, don't have an XP box here right now.

I still don't have an XP system at hand.

If you run the code with the -CS flag given to perl, even the innocent
print statement in the middle of the code will output two characters
instead of one utf8-encoded character, and this doesn't change the broken
output of the XMLout() statement.

This is beyond any expectation created after reading the perlrun manpage.

However, if XML::Simple is instructed in the XMLout statement to escape
all non-ASCII characters, then, miraculously, the correct utf8 replacements
appear. It really drives me nuts.

Oliver.
 
fhscobey

Hi,
You might try Perl 5.8.1 too. 5.8.3 and above have had some UTF-8
issues crop back up for some reason. Our application deals 100% in
UTF-8 data, but all source code is ISO-8859-1. We really had some
issues getting UTF-8 stuff to work (we started back when 5.8.0 came
out) and found that using 5.8.1, with some well placed ...

Encode::_utf8_on($content);
Encode::_utf8_off($content);

... seemed to do the trick for us. So you might try to make sure the
UTF-8 flag is turned on for your XML data, and then try to parse it.
We are using some older versions of modules which, at the time, were
just starting to deal with the change in Perl 5.8 to treat content
internally as UTF-8 encoded. Note: I believe Perl 5.8.7 has some
issues with the Encode module, specifically with UTF-8; check
bugs.perl.org for more information.

All of this may seem strange, but I can tell you when we wrote our
application, it worked fine with Perl 5.8.0 and 5.8.1. I've tried
5.8.3|5|7 and all versions are giving us garbled data out.

Also, if you are reading your data in from a handle, you absolutely
have to declare the handle to be UTF-8 encoded [i.e. open(FH,
"<:utf8", "file");].
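
A minimal sketch of the handle-layer point, assuming nothing beyond core Perl 5.8+ modules. It writes the UTF-8 octets for "farhûn" to a scratch file (a stand-in for real input, as are all variable names here) and reads them back once raw and once through a decoding layer. Encode::decode is shown as a documented alternative to the internal _utf8_on helper; whether it suits a given application is an assumption.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);
use File::Temp qw(tempfile);

# Write "farhun" with u-circumflex as raw UTF-8 octets to a scratch file
# (hypothetical input; \xC3\xBB is the UTF-8 form of U+00FB).
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "farh\xC3\xBBn\n";
close $tmp or die "close($path): $!";

# Read without a decoding layer: Perl sees separate octets.
open my $raw, '<:raw', $path or die "open($path): $!";
chomp(my $octets = <$raw>);
close $raw;

# Read through a decoding layer: Perl sees characters.
open my $in, '<:encoding(UTF-8)', $path or die "open($path): $!";
chomp(my $chars = <$in>);
close $in;

# Decoding the octets explicitly yields the same character string.
my $decoded = decode('UTF-8', $octets);

printf "raw length: %d, layered length: %d, equal after decode: %s\n",
    length($octets), length($chars), ($decoded eq $chars ? 'yes' : 'no');
```

The raw read reports one more unit than the layered read, because the u-circumflex occupies two octets in the file but is a single character once decoded.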

Not sure if this helps you at all,
- Jeff
 
Guest

: Hi,
: You might try Perl 5.8.1 too. 5.8.3 and above have had some UTF-8
: issues crop back up for some reason. Our application deals 100% in
: UTF-8 data, but all source code is ISO-8859-1. We really had some
: issues getting UTF-8 stuff to work (we started back when 5.8.0 came
: out) and found that using 5.8.1, with some well placed ...

: Encode::_utf8_on($content);
: Encode::_utf8_off($content);

: issues with the Encode module specifically with UTF-8, check with
: bugs.perl.org for more information.

Hi Jeff,

You've really saved my day. So it's _not_ my personal failure to
understand how utf8 in Perl works, but a real problem, and a
version-dependent one too. Thank you.

Anyway, of course, when using file handles, I make sure the line
discipline is set to :utf8, but it does not always help. See my other
answer to the Perl and UTF8 posting.

Best regards,
Oliver.
 
Chronos Tachyon

[Whoops, meant to post, not mail]

The problem seems to be the absence of a "use utf8;" pragma. Perl is
assuming that your code (including the __DATA__ section) is in ISO-8859-1.

[Addendum: FWIW, your newsreader is also making the same assumption.]
 
fhscobey

Donald brings up a good point. If your source is not ISO-8859-1 (which
I believe you mentioned), you have to use the utf8 pragma. But I also
believe that if you were to try Perl 5.8.0, you would have to use this
pragma even if only the data your script deals with were UTF-8.
Starting with 5.8.1, they deprecated that use of the pragma, so it is
now only used for telling Perl what encoding your source is in.

See http://perldoc.perl.org/utf8.html for more information.
 
Guest

: The problem seems to be the absence of a "use utf8;" pragma. Perl is
: assuming that your code (including the __DATA__ section) is in ISO-8859-1.

No, I don't think so, as inserting the utf8 pragma doesn't change anything.
I tried it, and the output is still not in utf8.

: [Addendum: FWIW, your newsreader is also making the same assumption.]

That is a different story, on a different machine. My production code
runs in a true utf8 environment, this one here is only used for
communications. Thank you for the hint, nonetheless!

Oliver.
 
Guest

: Donald brings up a good point. If your source is not ISO-8859-1(which
: I believe you mentioned), you have to use the utf8 pragma. But, I also
: believe if you were to try using Perl 5.8.0, you would have to use this
: pragma even if it was only the data your script was dealing with.
: Starting with 5.8.1+, they deprecated the use of this pragma, to only
: be used for telling Perl what encoding your source was in.

: See http://perldoc.perl.org/utf8.html for more information.

I read that, and also studied the various options to switch -C (see
perlrun for that), and I am really confused why the behaviour of my
system is so out of sync with the descriptions in the documentation.

Oliver.
 
Donald King

[Whoops, I did the post-vs-mail thing again. Bad coder, no cookie.]

: : The problem seems to be the absence of a "use utf8;" pragma. Perl is
: : assuming that your code (including the __DATA__ section) is in ISO-8859-1.

: No, I don't think so, as inserting the utf8 pragma doesn't change anything.
: I tried it, and the output is still not in utf8.

FWIW, I've taken your original test case from the top of the thread and
fixed it up. It's now properly encoded in UTF-8, it uses both "use
utf8" and "binmode(STDOUT, ':utf8')" to fix the problem, and I fixed it
to run under "use strict" and "use warnings" while I was in there. You
can download it at <http://chronos-tachyon.net/~chronos/corff.pl>.

BTW, during my testing, I found that, if script.pl has a "#!perl -CS"
shebang line, "./script.pl" uses the -CS but "perl script.pl" doesn't.
I guess I was so used to -w and -T being automatically picked up from
the shebang line, I didn't realize that Perl doesn't interpret *all* the
flags there. I'd recommend explicit binmode() calls or the "open"
pragma instead of the -C flag, due to the confusion it can cause.
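
A minimal sketch of the explicit-layer approach, assuming only core Perl. The in-memory handle stands in for STDOUT or an output file so the effect can be measured, and the variable names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# U+00E4 is written as an escape so the example does not depend on the
# encoding this script file is saved in.
my $word = "Morgend\x{e4}mmerung";

# Attach the encoding layer explicitly instead of relying on -C.
# For real terminal output the equivalent call would be
#   binmode(STDOUT, ':encoding(UTF-8)');
# Here an in-memory handle is used so the octets can be inspected.
my $bytes = '';
open my $out, '>:encoding(UTF-8)', \$bytes or die "open: $!";
print {$out} $word;
close $out or die "close: $!";

printf "%d characters in, %d octets out\n", length($word), length($bytes);
```

Because the layer is set in the script itself, the behaviour is the same whether the script is started as "./script.pl" or as "perl script.pl".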
 
Guest

: [Whoops, I did the post-vs-mail thing again. Bad coder, no cookie.]

: FWIW, I've taken your original test case from the top of the thread and
: fixed it up. It's now properly encoded in UTF-8, it uses both "use
: utf8" and "binmode(STDOUT, ':utf8')" to fix the problem, and I fixed it
: to run under "use strict" and "use warnings" while I was in there. You
: can download it at <http://chronos-tachyon.net/~chronos/corff.pl>.

Hi Donald,

Thank you _very_ much for the fixed code. I ran it, and to no avail. The
problems remain. Can you tell me which environment the code worked for you?

My environment:

perl -v states:
perl v5.8.5 built for i386-linux-thread-multi

echo $LANG states:
en_US.UTF-8

in vim, opening the file in utf8 encoding succeeds (and displays correctly)

When running the file from the command line
./corff.pl

I get:
1) broken output of the print statement
2) over-interpreted representations of utf8 data in the output file w.xml.

If I disable _both_ the
# use utf8;
...
# binmode(STDOUT, ":utf8");

lines,

the output of the print statement is _correct_ (accented characters
show properly), whereas the output of the w.xml file is still garbage.

If, in a fit of desperation, I modify the output of XMLout() with
NumericEscape=>2, all I get in the output is that, e.g., the a-umlaut of
Morgend&auml;mmerung (sorry for this encoding-independent symbolic
notation here!) is represented as &#195;&#164; which happens to be the
decimal values of the two octets comprising the UTF-8 encoding of
U+00e4, or Latin small a with umlaut.

I've already considered suffering silently from now on and writing
a small filter that replaces all these bytes in the final output, but
then, I think this is deeply unsatisfying.

Thanks again,
Oliver.

PS: A small truth table when using utf8 and the binmode statements:

use utf8   binmode   result
yes        yes       print fails, XMLout fails
no         yes       print fails, XMLout fails
yes        no        print succeeds, XMLout fails
no         no        print succeeds, XMLout fails

We see that the utf8 pragma doesn't change anything even though the
data section of my script is utf8-material whereas binmode (STDOUT,':utf8')
seems to have the opposite effect of what it claims.
 
Peter J. Holzer

: <http://chronos-tachyon.net/~chronos/corff.pl>.

: Thank you _very_ much for the fixed code. I ran it, and to no avail.
: The problems remain. Can you tell me which environment the code worked
: for you?
:
: My environment:
:
: perl -v states:
: perl v5.8.5 built for i386-linux-thread-multi
:
: echo $LANG states:
: en_US.UTF-8
:
: in vim, opening the file in utf8 encoding succeeds (and displays
: correctly)

I assume that your terminal is in UTF-8 mode, too, then. You could
verify that by invoking "cat corff.pl" and checking whether it looks
correct.

: When running the file from the command line
: ./corff.pl
:
: I get:
: 1) broken output of the print statement
: 2) over-interpreted representations of utf8 data in the output file
: w.xml.

It works for me on FC3 (which is what you are using, too, if I remember
one of your previous posts correctly).

XML::Simple doesn't parse XML itself. Which XML parser are you using?
I use XML::LibXML.

For reference, here is the output of
rpm -qa | grep perl-XML
on this machine.

perl-XML-NamespaceSupport-1.08-6
perl-XML-LibXML-1.58-1
perl-XML-Dumper-0.71-2
perl-XML-SAX-0.12-7
perl-XML-Encoding-1.01-26
perl-XML-Grove-0.46alpha-27
perl-XML-Parser-2.34-5
perl-XML-Twig-3.13-6
perl-XML-LibXML-Common-0.13-7

: If, in a fit of desperation, I modify the output of XMLout() with
: NumericEscape=>2, all I get in the output is that, eg. a umlaut of
: Morgend&auml;mmerung (sorry for this encoding-independent symbolic
: notation here!) is represented as &#195;&#164;

This is definitely wrong. It should be only one entity (&#228; in this
case). So probably your parser parses the file as ISO-8859-1 instead of
UTF-8 or passes the "raw" strings on instead of converting them into
perl's internal utf-8 representation.

: I've already considered to suffer silently from now onwards and to
: write a small filter that replaces all these bytes in the final
: output, but then, I think this is deeply unsatisfying.

Don't. It looks like the error happens on input, so if you have to
resort to such crude hacks, replace the bytes just after reading the
input.
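
The diagnosis can be reproduced with the core Encode module alone. This is a sketch under stated assumptions: numeric_escape is a hypothetical helper written for illustration, not XML::Simple's own escaping code, and the ISO-8859-1 decode merely simulates a parser that hands the octets on undecoded.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: replace every non-ASCII character with a
# decimal character reference.
sub numeric_escape {
    my ($string) = @_;
    return join '', map { ord($_) > 127 ? sprintf('&#%d;', ord($_)) : $_ }
                    split //, $string;
}

my $octets = "\xC3\xA4";   # a-umlaut as it sits in a UTF-8 file

my $decoded   = decode('UTF-8', $octets);        # one character, U+00E4
my $misparsed = decode('ISO-8859-1', $octets);   # two characters, U+00C3 U+00A4

print numeric_escape($decoded),   "\n";   # prints &#228;
print numeric_escape($misparsed), "\n";   # prints &#195;&#164;

# Writing the misparsed string out as UTF-8 turns two octets into four.
my $double = encode('UTF-8', $misparsed);
print length($double), "\n";              # prints 4
```

The second line of output matches the two entities seen with NumericEscape=>2, which is consistent with the input having been mis-decoded before XMLout() ever sees it.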

hp
 
Donald King

: [Whoops, I did the post-vs-mail thing again. Bad coder, no cookie.]

: FWIW, I've taken your original test case from the top of the thread and
: fixed it up. It's now properly encoded in UTF-8, it uses both "use
: utf8" and "binmode(STDOUT, ':utf8')" to fix the problem, and I fixed it
: to run under "use strict" and "use warnings" while I was in there. You
: can download it at <http://chronos-tachyon.net/~chronos/corff.pl>.

: Hi Donald,
:
: Thank you _very_ much for the fixed code. I ran it, and to no avail. The
: problems remain. Can you tell me which environment the code worked for you?
:
: My environment:
:
: perl -v states:
: perl v5.8.5 built for i386-linux-thread-multi
:
: echo $LANG states:
: en_US.UTF-8
:
: in vim, opening the file in utf8 encoding succeeds (and displays correctly)

My environment:

Perl: v5.8.8 built for i486-linux-gnu-thread-multi
LANG: en_US.UTF-8
Vim: encoding=utf-8 fileencoding=utf-8 termencoding=utf-8
Terminal: Gnome-Terminal w/ encoding set to "Current Locale (UTF-8)"

Typing "cat corff.pl" prints the source code, complete with funny German
scribbles over the vowels. ;-)

Oh, and since it may be relevant:
XML::Simple version 2.14
XML::Parser version 2.34
XML::SAX version 0.12

Whoops, I think I just found the problem. When I was checking version
numbers, I went ahead and checked CPAN for newer versions. After
installing XML::SAX version 0.13, the code broke. Try downgrading to
0.12 and see if that fixes things. (You can find a copy at
<http://search.cpan.org/~msergeant/XML-SAX-0.12/>.)
 
Donald King

FWIW, I've been on a goose chase through the guts of XML::SAX::purePerl,
and it seems both versions are horribly buggy with UTF-8. As a quick
fix, install either XML::SAX::Expat, XML::SAX::ExpatXS, or
XML::LibXML::SAX. All 3 seem to work just fine.
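
A hedged sketch of how one might check which SAX parsers are registered and pin XML::Simple to a specific backend. It assumes XML::SAX and XML::Simple are installed (each step is skipped otherwise) and that XML::Parser is the backend wanted; $XML::Simple::PREFERRED_PARSER is the package variable described in the XML::Simple documentation, and @found is an illustrative name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @found;   # names of the SAX parsers registered on this machine

# XML::SAX keeps a registry of installed parsers; list it if available.
if (eval { require XML::SAX; 1 }) {
    @found = map { $_->{Name} } @{ XML::SAX->parsers };
}
print "registered SAX parsers: @found\n";

# Pin XML::Simple to one backend rather than taking whatever the
# registry hands out. XML::Parser (expat-based) is assumed to be installed.
if (eval { require XML::Simple; 1 }) {
    no warnings 'once';
    $XML::Simple::PREFERRED_PARSER = 'XML::Parser';
}
```

If the list shows only XML::SAX::PurePerl, that would match the situation described above, where installing one of the other three parsers changes which one gets picked up.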
 
Guest

: >
: > Typing "cat corff.pl" prints the source code, complete with funny German
: > scribbles over the vowels. ;-)

Yes, it does so.

: > Oh, and since it may be relevant:
: > XML::Simple version 2.14
: > XML::Parser version 2.34
: > XML::SAX version 0.12
: >

I'll look into that later today.

: FWIW, I've been on a goose chase through the guts of XML::SAX::purePerl,
: and it seems both versions are horribly buggy with UTF-8. As a quick
: fix, install either XML::SAX::Expat, XML::SAX::ExpatXS, or
: XML::LibXML::SAX. All 3 seem to work just fine.

That sounds appalling, all the more so as XML claims to use Unicode/utf8
as its encoding of choice. It looks as if the developers of
the above-mentioned packages have never tested them
with some true utf-8 data (perhaps including umlauts and Chinese).

Thank you very much for your efforts!

Oliver.
 
Guest

: I've been following this thread because I have been struggling with
: XML::Simple writing/sourcing an XML file in cp932 encoding. The
: NumericEscape is what resolved the writing and setting the encoding in
: the xml declaration of the cp932 encoded file to x-sjis-cp932 so
: XML::Simple would source it properly took me awhile to figure out :-(.

[ good examples snipped ]

Hi Dennis and all others who have contributed to this thread,

Thank you very much for your input.

I followed up on the idea of the broken SAX module and decided to make
other parsers available to XML::SAX. After simply installing
XML::SAX::Expat as well as XML::LibXML (my code now uses the latter,
automagically), the script finally runs flawlessly. What a difficult
delivery it was!

Thanks again to all,

Oliver.
 
