kellner
Hello,
I'm parsing a chunk of XML code and would like to add attribute values
to individual tags if these are lacking. This is with perl 5.8.6,
libxml2 2.6.17, XML::LibXML 1.58.
Basically, I have the parser add the attribute values to the respective
nodes and then use the toString method of XML::LibXML::Document to
write the modified text to a scalar. Both the original and the modified
text evaluate properly as utf8, but the modified text doesn't print
properly on the console, nor does it get entered as utf8 into a MySQL
database.
I don't really understand what's going on, or at what level the
error(s) might lie (console encoding, Perl encoding, XML
encoding), and would appreciate any help I can get ...
Here's the code:
------------------------------------------------
#!/usr/bin/perl
use strict;
use XML::LibXML;
use Encode 'decode_utf8';
use vars qw ($parser $p);
$parser = XML::LibXML->new();
my $version = XML::LibXML::LIBXML_DOTTED_VERSION;
print "libxml2 $version\n-------------\nXML::LibXML $XML::LibXML::VERSION\n-------------------\n";
$p->{text} = qq|
<p>
<q who="Blabla">pramāṇavārttikasvavṛttiṭīkā</q> And this is
some further text.<br/>And even more text.<br/>And more.
<q who="Blabla2">The second quotation!</q>.
pramāṇavārttikasvavṛttiṭīkā.
</p>|;
my $a = &validate_text($p->{text});
print "$a \n";
sub validate_text {
    my $text = shift;
    if (decode_utf8($text)) { print "TEXT is utf8\n"; } else { print "is not utf8\n"; }
    print "TESTING $text\n";
    my $id = 1;
    my $doc = $parser->parse_string($text);
    my $root = $doc->getDocumentElement;
    my @quotations = $root->findnodes('q');
    foreach my $q (@quotations) {
        unless ($q->hasAttribute('id')) {
            print "NO ID\n";
            $q->setAttribute('id', "$id");
            ++$id;
        }
        else { print "HAS ID\n"; }
        my $id_new = $q->getAttribute('id');
        print "NEW ID: $id_new\n";
    }
    my $newtext = $root->toString;
    if (decode_utf8($newtext)) { print "NEW TEXT is utf8\n"; } else { print "is not utf8\n"; }
    return ($newtext);
}
------------------------------------------------------------
I know that I can set a document encoding by creating a new $doc
altogether, but I don't want to do this in this case, as the
createDocument method prepends an xml version string to the created
document, and this messes up the routines which process the code
afterwards.
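In case it helps narrow this down, here is a minimal sketch of a stricter check than the decode_utf8 test in the script above. As far as I understand, decode_utf8 returns a true value for any non-empty input, because malformed bytes are replaced rather than rejected, so that test may not tell me much. The helper name describe_string and the sample strings are made up for illustration; the only library pieces it assumes are utf8::is_utf8 and Encode's decode with FB_CROAK.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode_utf8);

# Hypothetical helper (not part of the original script): report whether a
# scalar is a character string (UTF8 flag on) or a byte string, and whether
# its underlying bytes form valid UTF-8.
sub describe_string {
    my ($label, $s) = @_;
    my $is_chars = utf8::is_utf8($s) ? 1 : 0;
    # Work on a private copy of the bytes, since FB_CROAK may modify its input.
    my $bytes = $is_chars ? encode_utf8($s) : $s;
    # Strict decoding dies on malformed input instead of substituting U+FFFD.
    my $valid = eval { decode('UTF-8', $bytes, Encode::FB_CROAK); 1 } ? 1 : 0;
    return "$label: character string=$is_chars, valid UTF-8 bytes=$valid";
}

# Illustrative samples only.
my $utf8_bytes = "pram\xC4\x81";               # UTF-8 encoded bytes, flag off
my $characters = decode('UTF-8', $utf8_bytes); # decoded character string
my $latin1     = "caf\xE9 au lait";            # Latin-1 bytes, not valid UTF-8

print describe_string('utf8 bytes', $utf8_bytes), "\n";
print describe_string('characters', $characters), "\n";
print describe_string('latin1',     $latin1),     "\n";
```

On my understanding, the first sample should be reported as a byte string, the second as a character string, and the third as invalid UTF-8. I would then run the same helper on $text and $newtext to see whether toString hands back a different kind of string than the one it was given.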
Thanks in advance,
Birgit Kellner