Greetings. I'm just starting to dabble in XML, and I've run across a
problem.
I'm going through my XML document using this construct:
use XML::DOM;
my $parser = new XML::DOM::Parser
or bail ("Unable to create XML parser");
my $story = $parser->parse($data);
$out{"SOURCE"} = $story->getElementsByTagName("Source")->
item(0)->getFirstChild->getData;
$out{"DATE"} = $story->getElementsByTagName("Publication_Date")->
item(0)->getFirstChild->getData;
$out{"TEXT"} = $story->getElementsByTagName("Body_Text")->
item(0)->getFirstChild->getData or die "$!";
Everything works fine until I get to a $story where Body_Text doesn't
exist. I've looked through all of the XML::DOM docs that I can find,
but I can't find either a way to test if Body_Text exists, or what happens
when Body_Text doesn't exist. The script stops with no obvious
diagnostic output -- it appears that the "die" never happens.
Can somebody show me the light?
hymie!
http://www.smart.net/~hymowitz hymie_@_lactose.homelinux.net
===============================================================================
I've got an answer. I'm going to fly away. What have I got to lose?
--Crosby, Stills, and Nash
===============================================================================
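
On the DOM question itself, here is a minimal sketch of one way to test for the element before dereferencing it. When no element matches, item(0) returns undef, so the chained ->getFirstChild call dies with "Can't call method ... on an undefined value" before the "or die" is ever evaluated. If that message never shows up, the call may be inside an eval, or STDERR may be going somewhere you aren't looking (that part is a guess). The helper name first_text and the sample $data are made up for illustration.

```perl
use strict;
use warnings;
use XML::DOM;

# Hypothetical helper (not part of XML::DOM): return the text of the first
# <$tag> element, or undef when the element is missing or has no children.
sub first_text {
    my ($doc, $tag) = @_;
    my $nodes = $doc->getElementsByTagName($tag);   # scalar context: a NodeList
    return undef unless $nodes->getLength > 0;      # no such element
    my $child = $nodes->item(0)->getFirstChild;
    return undef unless defined $child;             # element exists but is empty
    return $child->getData;                         # assumes the first child is a text node
}

# Made-up sample document with no Body_Text element.
my $data = '<Story><Source>AP</Source>'
         . '<Publication_Date>2004-01-01</Publication_Date></Story>';

my $story = XML::DOM::Parser->new->parse($data);

my %out;
$out{SOURCE} = first_text($story, 'Source');
$out{DATE}   = first_text($story, 'Publication_Date');
$out{TEXT}   = first_text($story, 'Body_Text');
$out{TEXT}   = '' unless defined $out{TEXT};        # pick whatever default suits you

print "SOURCE=$out{SOURCE} DATE=$out{DATE} TEXT=[$out{TEXT}]\n";

$story->dispose;   # break circular references so the tree can be freed
```

Checking getLength first keeps the missing-element case an ordinary branch instead of a runtime exception.
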
The alternative to DOM is SAX, which is widely used in modern code.
It's basically a simple event-driven model that calls handlers
when the basic structural XML components are encountered. This
allows you to control going from XML to internal data structures
and/or back out to XML. Expat provides handler hooks for most
of the current W3C constructs. The ones in the sample further down
are just the basic ones.
It's up to you to extract the data into internal structures.
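
To give a feel for the event model described above, here is a small self-contained sketch using XML::Parser::Expat. It is not taken from the larger program. The %out hash, the @open stack and the sample $data are assumptions chosen to mirror the original question. With this approach a missing Body_Text simply never produces a key, so there is nothing to dereference.

```perl
use strict;
use warnings;
use XML::Parser::Expat;

# Made-up sample document with no Body_Text element.
my $data = '<Story><Source>AP</Source>'
         . '<Publication_Date>2004-01-01</Publication_Date></Story>';

my %out;    # element name => accumulated character data
my @open;   # stack of currently open element names

my $parser = XML::Parser::Expat->new;
$parser->setHandlers(
    # Start tag: remember which element we are inside.
    Start => sub { my ($p, $el) = @_; push @open, $el; },
    # End tag: leave the element.
    End   => sub { pop @open; },
    # Character data: append to the innermost open element.
    Char  => sub { my ($p, $str) = @_; $out{ $open[-1] } .= $str if @open; },
);

# parse() dies on malformed XML, so trap it and report via $@.
eval { $parser->parse($data); 1 } or warn "XML parse failed: $@";
$parser->release;

print exists $out{Body_Text}
    ? "Body_Text: $out{Body_Text}\n"
    : "no Body_Text element in this story\n";
```
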
For that, XML::Simple is a good tool. With Expat you can accumulate
nested data in a single string. XML::Simple will then create nested
Perl structures keyed by tag name, which you can inspect with Data::Dumper.
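
As a rough illustration of that last step, the sketch below hands an XML string to XML::Simple and dumps the result. The $fragment content is invented. In the real program the string would be whatever was accumulated in the Expat handlers. The ForceArray and KeyAttr settings are one reasonable choice, not the only one.

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin);
use Data::Dumper;

# Made-up fragment standing in for text accumulated by the Expat handlers.
my $fragment = '<Story><Source>AP</Source>'
             . '<Publication_Date>2004-01-01</Publication_Date></Story>';

# XMLin accepts an XML string. By default the root element is dropped and
# its children become hash keys named after their tags.
my $story = XMLin($fragment, ForceArray => 0, KeyAttr => []);

print Dumper($story);

# A missing element is just a missing hash key.
print "no Body_Text in this story\n" unless exists $story->{Body_Text};
```
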
But nobody uses XML who doesn't know ahead of time what those
structures are, both out and in. This is a way to control, validate, and
populate them. SAX gives you a much simpler model and allows
much better control of the data. If you need more information,
let me know. Getting Xerces working is a chore (you could do
without it for now; it's only being used for schema checking here).
This code sample is from a 7,000-line program I wrote that was
converted to a binary with Perl2Exe (including Xerces). I've chopped it up,
and you can't see or know what it does, so it will look nasty, but
all the clues are there for you to investigate SAX, and that's enough.
-gluck
---
This code is chopped out of a large practical XML code base and is NOT
cut & paste workable. It's just for instructional purposes,
to give the poster a flavor of SAX: Simple API for XML.
use XML::Xerces;
use XML::Parser::Expat;
use XML::Simple;
## main
{
## Initialize program / build list of xml files (ie: glob)
for (@XmlFiles)
{
/.+$dlimsep(.+)$/; ## $dlimsep is defined elsewhere for win/unix os
$XML_File = $1;
Log ($XML_File);
## Validate Schema with Xerces
## note: Xerces is being used for schema validation and
## as backup xml integrity (done elsewhere)
next if (!ValidateSchema ($_));
if (!open(SAMP, $_)) {
Log (...);
next;
}
## Parse xml and integrity check (Expat-SAX)
my $parser = new XML::Parser::Expat;
$parser->setHandlers('Start' => \&stag_h,
'End' => \&etag_h,
'Char' => \&cdata_h);
$parser->setHandlers('Comment' => \&comment_h) if ($hVars{'CommentLogging'});
eval {$parser->parse(*SAMP)};
if ($@) {
## xml integrity failed -log this error
$@ =~ s/^[\x20\n\t]+//; $@ =~ s/[\x20\n\t]+$//;
# attempt to strip off program line,col info at end
$@ =~ s/(at line [0-9]+,.+)?at .+ line [0-9]+$/$1/;
Log (...error...);
}
close(SAMP);
$parser->release;
}
}
########################################################
# EXPAT Event Handlers - start/end/content (defaults)
########################################################
##
sub stag_h # -- Start Tag --
{
my ($p, $element, %atts) = @_;
$element = uc($element);
$last_content = '';
$last_syntax_content = '';
## -- construct & Print start tag --
my $tag = "\<$element\>";
if ($XML_PRINT) {
printf ("%3d", $p->current_line);
print get_indent();
print "$tag";
print " Attr" if (keys %atts);
foreach my $key (keys %atts) {
print ", $key=".$atts{$key};
}
print "\n";
}
$tab_lev++;
## -- set Detached special content handler --
if (exists ($Content_hash{$element}) &&
$Content_hash{$element}->[1]) {
$p->setHandlers('Char' =>
$Content_hash{$element}->[0]);
}
## do something with attributes
## start keying (populating) your data structures
## set flags, etc ...
}
##
##
sub etag_h # -- End Tag --
{
my ($p, $element) = @_;
$element = uc($element);
## -- Construct & Print end tag --
my $tag = "\</$element\>";
$tab_lev--;
if ($XML_PRINT) {
printf ("%3d", $p->current_line);
print get_indent();
print "$tag\n";
}
## -- store last Content in hash (do more stuff)
## then:
$last_content = '';
## -- Restore default content handler --
if (exists ($Content_hash{$element}) &&
$Content_hash{$element}->[1]) {
$p->setHandlers('Char' => \&cdata_h);
my $last = (@Action) - 1;
my $aref = $Action[$last];
$Content_hash{$element}->[2]($last_syntax_content,
$aref, $element);
}
}
##
##
sub cdata_h # -- Default Content Data --
{
my ($p, $str) = @_;
# use original for entities, in case of a reparse
$str = $p->original_string;
# remove leading/trailing space, newline, tab
$str =~ s/^[\x20\n\t]+//; $str =~ s/[\x20\n\t]+$//;
if (length ($str) > 0)
{
if ($XML_PRINT) {
printf ("%3d", $p->current_line);
print get_indent();
print "$str (".length($str).")\n";
}
$last_content .= $str;
}
}
##
##
sub comment_h # -- Default Comment Data --
{
my ($p, $str) = @_;
# use original for entities, in case of a reparse
$str = $p->original_string;
# remove leading/trailing space, newline, tab
$str =~ s/^[\x20\n\t]+//; $str =~ s/[\x20\n\t]+$//;
if (length ($str) > 0)
{
printf (" %d,%d\n", $p->current_line, $p->current_column);
}
}
##
##
sub cdata_x_h # -- Special Content Data --
{
my ($p, $str) = @_;
cdata_h ($p, $str);
# remove leading/trailing space, newline, tab
$str =~ s/^[\x20\n\t]+//; $str =~ s/[\x20\n\t]+$//;
$last_syntax_content .= $str if (length ($str) > 0);
}
########################################################
# Xerces - too much to explain
########################################################
#
sub ValidateSchema {
my ($xfile) = @_;
#my $valerr = 0;
# Docs: http://xml.apache.org/xerces-c/apiDocs/classAbstractDOMParser.html#z869_9
my $Xparser = XML::Xerces::XercesDOMParser->new();
$Xparser->setValidationScheme(1);
$Xparser->setDoNamespaces(1);
$Xparser->setDoSchema(1);
# full constraint checking (if enabled, may be time-consuming)
#$Xparser->setValidationSchemaFullChecking(1);
$Xparser->setExternalNoNamespaceSchemaLocation($hVdef{'Schema'});
my $ERROR_HANDLER = XLoggingErrorHandler->new(\&LogX_warn,
\&LogX_error, \&LogX_ferror, );
#my $ERROR_HANDLER = XML::Xerces::PerlErrorHandler->new();
$Xparser->setErrorHandler($ERROR_HANDLER);
# no need for eval on parse with handlers.. just insurance on die
eval {$Xparser->parse
(XML::Xerces::LocalFileInputSource->new($xfile));};
if ($@) {
}
return 1;
}
## handlers (a lot more not shown)