LibXML element->toString vs document->toString


Fergus McMenemie

Hi, I have been driven mad by the following, which took ages to track
down. What is going on? It appears it is invalid to use toString on the
document object.


#! /usr/local/bin/perl -w
use strict;
use warnings;
use utf8;
use Encode;
use XML::LibXML;
binmode(STDOUT, ":utf8");

my $src= join("",<DATA>);
print "string \$src is invalid \n" unless ( Encode::is_utf8($src,1) );
my $parser = XML::LibXML->new();
my $x = $parser->parse_string($src)->documentElement();
my $str=$x->toString(1);
print "$str\n";
print "string 1 is invalid \n" unless ( Encode::is_utf8($str,1) );

$x = $parser->parse_string($src);
$str=$x->toString(1);
print "$str\n";
print "string 2 is invalid \n" unless ( Encode::is_utf8($str,1) );

__DATA__
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<plugin name="\xc5\x81"></plugin>
 

Fergus McMenemie

Ben Morrow said:
Quoth (e-mail address removed) (Fergus McMenemie):

Don't do that. Encode::is_utf8 checks the state of the SvUTF8 flag,
which is internal to perl and none of your business. (The Encode
documentation is not as clear about this as it might be, because it only
became clear through experience that this is the only approach which
works.)
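
A minimal sketch of why the flag says nothing about validity, using only core Perl and Encode. The variable names are illustrative, not from the original posts. The same text can be held with or without the flag set, and the two copies still compare equal:

```perl
use strict;
use warnings;
use Encode ();

# One character (e with acute accent), written as an escape in the
# Latin-1 range. Perl normally stores this without the internal UTF8 flag.
my $native = "\xe9";

# Same text, but force the other internal representation.
my $upgraded = $native;
utf8::upgrade($upgraded);

# The strings are equal as far as the program is concerned ...
my $same_text = ($native eq $upgraded) ? 1 : 0;

# ... yet the internal flag differs, so it is not a validity check.
my $flag_native   = Encode::is_utf8($native)   ? 1 : 0;
my $flag_upgraded = Encode::is_utf8($upgraded) ? 1 : 0;

print "same text: $same_text\n";
print "flag on native copy: $flag_native\n";
print "flag on upgraded copy: $flag_upgraded\n";
```

The flag only records how perl happens to store the string internally, which is why it is a poor test for whether a string is "valid".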

Agreed, the warnings are there. However it did appear to make the
issue clearer. This example is rather goofy and posting it to USEnet
added a few more wrinkles. My original code and the real program
contained the actual characters. However my USEnet reader would not
let me post the real chars. Hence the octets.

My issue is that document->toString does not appear to work. Please
ignore the use of is_utf8.

What are you actually trying to find out?

I have to pass references to DOM objects around all over the
place. I find I am having to make use of either documentElement()
or ownerDocument() depending on what I am doing. I would like to have
a consistent "pattern" for doing this. I would like to settle on
passing the document object around, but it is annoying that I can't then
use toString.
 

Fergus McMenemie

Ben Morrow said:
I'm afraid I don't understand. When I run the original program I get the
results I would have expected: the first prints the XML without the
<?xml?>, the second prints it with it. What is going wrong for you?

Thanks for the tip. My code now reads:-

use strict;
use warnings;
use Encode;
use XML::LibXML;
binmode(STDOUT, ":utf8");

my $src= join("",<DATA>);
$src =~ s/\\x([0-9a-f][0-9a-f])/chr hex $1/egi;
$src = Encode::decode "utf8", $src;
print "LibXML VERSION=$XML::LibXML::VERSION\n";
print "string \$src is invalid \n" unless ( Encode::is_utf8($src,1) );
my $parser = XML::LibXML->new();
my $x = $parser->parse_string($src)->documentElement();
my $str=$x->toString(1);
print "$str\n";
print "string 1 is invalid \n" unless ( Encode::is_utf8($str,1) );

$x = $parser->parse_string($src);
$str=$x->toString(1);
print "$str\n";
print "string 2 is invalid \n" unless ( Encode::is_utf8($str,1) );

__DATA__
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<plugin name="\xef\xbd\xb1\xef\xbd\xb2\xef\xbd\xb3\xef\xbd\xb4\xef\xbd\xb5"></plugin>


It fails on my Mac running OS X Snow Leopard. The 'real' version
runs under perl 5.12 on CentOS and also fails there. I am not sure about
the version of LibXML.

Does it work for you?
 

Fergus McMenemie

Ben Morrow said:
Quoth (e-mail address removed) (Fergus McMenemie):
Agreed, the warnings are there. However it did appear to make the
issue clearer. This example is rather goofy and posting it to USEnet
added a few more wrinkles. My original code and the real program
contained the actual characters. However my USEnet reader would not
let me post the real chars. Hence the octets.

It can certainly be difficult, given that Usenet officially doesn't
support anything but ASCII. Unofficially, if you can get your newsreader
to produce it, articles in UTF-8 with 'Content-type: text/plain;
charset=UTF-8' seem to work perfectly well.

Another thing you can do is explicitly decode the data in the program
you post; possibly something like

my $str = <DATA>;
$str =~ s/%([0-9a-f][0-9a-f])/chr hex $1/egi;
$str = Encode::decode "utf8", $str;

This uses URL-encoding rather than backslashes; you can pick whatever is
convenient for the data you are trying to post.

My issue is that document->toString does not appear to work. Please
ignore the use of is_utf8.

OK.

I have to pass references to DOM objects around all over the
place. I find I am having to make use of either documentElement()
or ownerDocument() depending on what I am doing. I would like to have
a consistent "pattern" for doing this. I would like to settle on
passing the document object around, but it is annoying that I can't then
use toString.

I'm afraid I don't understand. When I run the original program I get the
results I would have expected: the first prints the XML without the
<?xml?>, the second prints it with it. What is going wrong for you?

Ben
 

Fergus McMenemie

Ben Morrow said:
Ooh, they've actually published an update. I didn't know that.

My newsreader does not properly support UTF-8. I guess lots of others
still don't either.

MacSoup - my soup's gone off!
 

Fergus McMenemie

Ben Morrow said:
Yes, it works as documented for me. Are you getting confused by the fact
that ->toString produces a byte string for whole documents, but a
character string for just an element? Read the 'ENCODINGS SUPPORT'
section in perldoc XML::LibXML: you don't want a :utf8 layer if you're
printing a whole document, because the document isn't necessarily in
UTF-8.

Duh!
Thanks, I don't know how I managed to miss that bit.
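
For anyone landing on this thread later, here is a rough sketch of the difference Ben describes. It assumes XML::LibXML is installed and that the input is a UTF-8 byte string. The variable names are illustrative, and decoding with actualEncoding is one possible way to get characters back, not the only one:

```perl
use strict;
use warnings;
use Encode ();
use XML::LibXML;

# UTF-8 encoded bytes, as they would arrive from a file read without a layer.
my $src = qq{<?xml version="1.0" encoding="utf-8"?>\n}
        . qq{<plugin name="\xc5\x81"></plugin>\n};

my $doc = XML::LibXML->new->parse_string($src);

# Serialising a node gives a character string.
my $elem_chars = $doc->documentElement->toString;

# Serialising the whole document gives bytes in the document's encoding.
my $doc_bytes = $doc->toString;

# To treat the document like the element, decode the bytes first.
my $doc_chars = Encode::decode($doc->actualEncoding, $doc_bytes);

# Both are now character strings, so one encoding layer suits both.
binmode(STDOUT, ':encoding(UTF-8)');
print "$elem_chars\n";
print $doc_chars;
```

If you only ever print whole documents, the alternative is to drop the layer and print the bytes from $doc->toString unchanged, since they are already encoded.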
 
