novostik
Hello guys,
I want to walk the DOM of an XML page. For example, here is a page
converted from HTML using Tidy:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<meta name="generator" content=
"HTML Tidy for Windows (vers 14 February 2006), see www.w3.org">
<title></title>
</head>
<body>
</body>
</html>
For this page, the script should print out html---head---meta ----title.
I have used the following code in Perl:
-------------------------------------------------------------------------------------------------------------------------------------
use XML::DOM;
my $parser = new XML::DOM::Parser;
my $doc = $parser->parsefile ("ig.xml");
my $nodes=$doc->getDocumentElement();
print "\n";
print $nodes->getNodeName();
print "--";
@x=$nodes->getChildNodes();
&find(@x);
sub find
{
my (@z)=@_;
foreach $z(@z)
{
@y=$z->getChildNodes();
if($z->getNodeType == ELEMENT_NODE)
{
print $z->getNodeName();
print"--";
}
&find(@y);
}
}
# Avoid memory leaks - cleanup circular references for garbage collection
$doc->dispose;
---------------------------------------------------------------------------------------------------------------------------------------------
The problem is that it produces output for some files but gives an
error message for others, such as the Google and Yahoo homepages.
Could you please help me out with this? I was not able to fix it.
Why does it work for some pages and not for others?
Could you please suggest a solution?