LibXML element->toString vs document->toString

Fergus McMenemie · Jul 12, 2012

Hi, I have been driven mad by the following, which took ages to track
down. What is going on? It appears to be invalid to use toString on the
document object.

#! /usr/local/bin/perl -w
use strict;
use warnings;
use utf8;
use Encode;
use XML::LibXML;
binmode(STDOUT, ":utf8");

my $src= join("",<DATA>);
print "string \$src is invalid \n" unless ( Encode::is_utf8($src,1) );
my $parser = XML::LibXML->new();
my $x = $parser->parse_string($src)->documentElement();
my $str=$x->toString(1);
print "$str\n";
print "string 1 is invalid \n" unless ( Encode::is_utf8($str,1) );

$x = $parser->parse_string($src);
$str=$x->toString(1);
print "$str\n";
print "string 2 is invalid \n" unless ( Encode::is_utf8($str,1) );

__DATA__
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<plugin name="\xc5\x81"></plugin>

Fergus McMenemie · Jul 13, 2012

Ben Morrow said:
Quoth Fergus McMenemie:

Don't do that. Encode::is_utf8 checks the state of the SvUTF8 flag,
which is internal to perl and none of your business. (The Encode
documentation is not as clear about this as it might be, because it only
became clear through experience that this is the only approach which
works.)

Agreed, the warnings are there. However, it did appear to make the
issue clearer. This example is rather goofy and posting it to Usenet
added a few more wrinkles. My original code and the real program
contained the actual characters. However, my Usenet reader would not
let me post the real chars. Hence the octets.

My issue is that document->toString does not appear to work. Please
ignore the use of is_utf8.

What are you actually trying to find out?

I have to pass references to DOM objects around all over the
place. I find I am having to make use of either documentElement()
or ownerDocument() depending on what I am doing. I would like to have
a consistent "pattern" for doing this. I would like to settle on
passing the document object around, but it is annoying that I can't then
use toString.

Fergus McMenemie · Jul 14, 2012

Ben Morrow said:
I'm afraid I don't understand. When I run the original program I get the
results I would have expected: the first prints the XML without the
<?xml?>, the second prints it with it. What is going wrong for you?

Thanks for the tip. My code now reads:-

use strict;
use warnings;
use Encode;
use XML::LibXML;
binmode(STDOUT, ":utf8");

my $src= join("",<DATA>);
$src =~ s/\\x([0-9a-f][0-9a-f])/chr hex $1/egi;
$src = Encode::decode "utf8", $src;
print "LibXML VERSION=$XML::LibXML::VERSION\n";
print "string \$src is invalid \n" unless ( Encode::is_utf8($src,1) );
my $parser = XML::LibXML->new();
my $x = $parser->parse_string($src)->documentElement();
my $str=$x->toString(1);
print "$str\n";
print "string 1 is invalid \n" unless ( Encode::is_utf8($str,1) );

$x = $parser->parse_string($src);
$str=$x->toString(1);
print "$str\n";
print "string 2 is invalid \n" unless ( Encode::is_utf8($str,1) );

__DATA__
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<plugin name="\xef\xbd\xb1\xef\xbd\xb2\xef\xbd\xb3\xef\xbd\xb4\xef\xbd\xb5"></plugin>

It fails on my Mac running OS X Snow Leopard. The 'real' version is
running with perl 5.12 on CentOS and also fails there. Not sure about the
version of LibXML.

Does it work for you?

Fergus McMenemie · Jul 14, 2012

Ben Morrow said:
Quoth Fergus McMenemie:

Agreed, the warnings are there. However, it did appear to make the
issue clearer. This example is rather goofy and posting it to Usenet
added a few more wrinkles. My original code and the real program
contained the actual characters. However, my Usenet reader would not
let me post the real chars. Hence the octets.


It can certainly be difficult, given that Usenet officially doesn't
support anything but ASCII. Unofficially, if you can get your newsreader
to produce it, articles in UTF-8 with 'Content-type: text/plain;
charset=UTF-8' seem to work perfectly well.

Another thing you can do is explicitly decode the data in the program
you post; possibly something like

my $str = <DATA>;
$str =~ s/%([0-9a-f][0-9a-f])/chr hex $1/egi;
$str = Encode::decode "utf8", $str;

This uses URL-encoding rather than backslashes; you can pick whatever is
convenient for the data you are trying to post.
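
For anyone wanting to try this, here is a minimal, self-contained sketch of the same idea applied to a single line, using only the core Encode module. The %C5%81 sample (the octets from the original post, URL-encoded) and the variable names are assumptions for illustration, not part of anyone's real program.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical sample line: the \xc5\x81 octets from the first post,
# written as %XX escapes so the article itself stays pure ASCII.
my $line = '<plugin name="%C5%81"></plugin>';

# Step 1: turn each %XX escape back into the single byte it stands for.
(my $bytes = $line) =~ s/%([0-9a-f][0-9a-f])/chr hex $1/egi;

# Step 2: decode the UTF-8 bytes into a Perl character string.
my $chars = Encode::decode('UTF-8', $bytes);

# The two-byte sequence C5 81 becomes the single character U+0141,
# so the character string is one shorter than the byte string.
printf "bytes: %d, characters: %d\n", length($bytes), length($chars);
```

After this, $chars is what you would hand to code that expects characters, and $bytes is what actually travelled through the article.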

My issue is that document->toString does not appear to work. Please
ignore the use of is_utf8.
OK.

I have to pass references to DOM objects around all over the
place. I find I am having to make use of either documentElement()
or ownerDocument() depending on what I am doing. I would like to have
a consistent "pattern" for doing this. I would like to settle on
passing the document object around, but it is annoying that I can't then
use toString.


I'm afraid I don't understand. When I run the original program I get the
results I would have expected: the first prints the XML without the
<?xml?>, the second prints it with it. What is going wrong for you?

Ben

Fergus McMenemie · Jul 17, 2012

Ben Morrow said:
Ooh, they've actually published an update. I didn't know that.

My newsreader does not properly support UTF-8; I guess lots of others still
don't either.

MacSoup - my soup's gone off!

Fergus McMenemie · Jul 17, 2012

Ben Morrow said:
Yes, it works as documented for me. Are you getting confused by the fact
that ->toString produces a byte string for whole documents, but a
character string for just an element? Read the 'ENCODINGS SUPPORT'
section in perldoc XML::LibXML: you don't want a :utf8 layer if you're
printing a whole document, because the document isn't necessarily in
UTF-8.

Duh!
Thanks, I don't know how I managed to miss that bit.
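
To make the distinction concrete for later readers, here is a minimal sketch of one way to print both forms safely. It assumes a raw output handle and an ASCII-only sample document (the &#x141; character reference stands in for the character from the first post). The variable names are illustrative; toString, documentElement and encoding are the documented XML::LibXML methods.

```perl
use strict;
use warnings;
use Encode ();
use XML::LibXML;

# ASCII-only sample source: &#x141; refers to U+0141, the character
# that the \xc5\x81 octets in the first post encode.
my $src = qq{<?xml version="1.0" encoding="utf-8"?>\n<plugin name="&#x141;"/>\n};

my $doc = XML::LibXML->new->parse_string($src);

# Element: toString returns a character string (no XML declaration).
my $elem_chars = $doc->documentElement->toString(1);

# Document: toString returns bytes, already in the document's encoding.
my $doc_bytes = $doc->toString(1);

# Keep the handle raw, encode character data explicitly, and pass the
# document bytes through untouched so nothing gets encoded twice.
binmode(STDOUT, ':raw');
print Encode::encode('UTF-8', "$elem_chars\n");
print $doc_bytes;

# If characters are wanted from the whole document, decode the bytes
# using the document's declared encoding (UTF-8 assumed as a fallback).
my $doc_chars = Encode::decode($doc->encoding || 'UTF-8', $doc_bytes);
```

If you want to pass the document object around and still get characters back, decoding with the document's own encoding, as in the last step, is one option; the other is to leave the bytes alone and never put a :utf8 layer on the handle they go to.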
