perl, XML::LibXML: encoding problems while changing attributes on an XML string

kellner

Hello,

I'm parsing a chunk of XML and would like to add attribute values
to individual tags where they are missing. This is with Perl 5.8.6,
libxml2 2.6.17, XML::LibXML 1.58.

Basically, I have the parser add the attribute values to the respective
nodes and then use the toString method of XML::LibXML::Document to
write the modified text to a scalar. Both the original and the modified
text evaluate properly as utf8, but the modified text doesn't print
properly on the console, nor does it get entered as utf8 into a MySQL
database.

I don't really understand what's going on, and on what level the
error(s) could be located (console encoding, perl encoding, XML
encoding), and would appreciate any help I can get ...

Here's the code:
------------------------------------------------

#!/usr/bin/perl

use strict;
use warnings;
use XML::LibXML;
use Encode 'decode_utf8';
use vars qw ($parser $p);
$parser = XML::LibXML->new();
my $version = XML::LibXML::LIBXML_DOTTED_VERSION;
print "libxml2 $version\n-------------\n",
      "XML::LibXML $XML::LibXML::VERSION\n-------------------\n";


$p->{text} = qq|
<p>
<q who="Blabla">pramāṇavārttikasvavṛttiṭīkā</q> And this is
some further text.<br/>And even more text.<br/>And more.
<q who="Blabla2">The second quotation!</q>.
pramāṇavārttikasvavṛttiṭīkā.
</p>|;

my $result = validate_text($p->{text});
print "$result \n";

sub validate_text {
    my $text = shift;

    # decode_utf8() returns the decoded string, so this only tests that the
    # result is a true value; it does not prove that $text is valid UTF-8.
    if (decode_utf8($text)) { print "TEXT is utf8\n"; }
    else                    { print "is not utf8\n"; }
    print "TESTING $text\n";

    my $id   = 1;
    my $doc  = $parser->parse_string($text);
    my $root = $doc->getDocumentElement;

    # Give every <q> child of the root an id attribute if it lacks one.
    my @quotations = $root->findnodes('q');
    foreach my $q (@quotations) {
        if ($q->hasAttribute('id')) {
            print "HAS ID\n";
        }
        else {
            print "NO ID\n";
            $q->setAttribute('id', "$id");
            ++$id;
        }
        my $id_new = $q->getAttribute('id');
        print "NEW ID: $id_new\n";
    }

    my $newtext = $root->toString;
    if (decode_utf8($newtext)) { print "NEW TEXT is utf8\n"; }
    else                       { print "is not utf8\n"; }
    return $newtext;
}
------------------------------------------------------------
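
A side note on the UTF-8 test in the script: a stricter check can be written with the core Encode module alone. The following is only a minimal sketch. The helper name is_valid_utf8 and the sample string are illustrative and not part of the script above. It also shows the difference between a string of UTF-8 octets and the decoded character string, which is usually where this kind of printing problem comes from.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true only if $octets is a well-formed UTF-8 byte string.
sub is_valid_utf8 {
    my ($octets) = @_;
    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
    # decode() from modifying $octets in place.
    my $ok = eval {
        Encode::decode('UTF-8', $octets,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

# UTF-8 octets for a short sample word, written as escapes so the script
# does not depend on the encoding of the source file.
my $octets = "pram\xC4\x81\xE1\xB9\x87a";
my $chars  = Encode::decode('UTF-8', $octets);    # octets -> characters

printf "valid: %d, octets: %d, characters: %d\n",
    is_valid_utf8($octets), length($octets), length($chars);

# Character strings are encoded back to octets at the output boundary,
# here explicitly rather than through an I/O layer.
my $out = Encode::encode('UTF-8', $chars);
binmode(STDOUT, ':raw');
print $out, "\n";
```

The same rule would apply before handing a string to a database driver: decide whether the scalar holds characters or octets, and encode exactly once.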

I know that I can set a document encoding by creating a new $doc
altogether, but I don't want to do this in this case, as the
createDocument method prepends an xml version string to the created
document, and this messes up the routines which process the code
afterwards.
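
One possible way around the XML declaration, without creating a new document, is to serialise the root element instead of the document and encode the result explicitly. This is a sketch under an assumption: in recent XML::LibXML releases, toString on a node returns a character string and emits no XML declaration. Version 1.58 may behave differently, in which case the explicit encode step would be wrong. The variable names and the shortened sample input are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;
use Encode ();

# Sample input as UTF-8 octets (escapes keep the script independent of the
# source file's encoding); the second <q> already carries an id.
my $xml = qq|<p><q who="Blabla">pram\xC4\x81\xE1\xB9\x87a</q>|
        . qq|<q who="Blabla2" id="7">The second quotation!</q></p>|;

my $doc  = XML::LibXML->new->parse_string($xml);
my $root = $doc->documentElement;

# Number only those <q> elements that have no id yet.
my $next_id = 1;
for my $q ($root->findnodes('q')) {
    next if $q->hasAttribute('id');
    $q->setAttribute('id', $next_id++);
}

# Assumption: toString on a node yields characters and no XML declaration,
# so the string is encoded to UTF-8 octets exactly once before output.
my $chars  = $root->toString;
my $octets = Encode::encode('UTF-8', $chars);

binmode(STDOUT, ':raw');
print $octets, "\n";
```

If the installed version already returns octets from a node's toString, the encode call should be dropped; otherwise the text would be encoded twice.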

Thanks in advance,

Birgit Kellner
 
