How to get the DOM from an XML page

novostik

Hello guys,
I want to get the DOM of an XML page. For example, an XML
page that was converted from HTML using Tidy is:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<meta name="generator" content=
"HTML Tidy for Windows (vers 14 February 2006), see www.w3.org">
<title></title>
</head>
<body>
</body>
</html>

It should print out html--head--meta--title.

I have used the following Perl code:
-------------------------------------------------------------------------------------------------------------------------------------
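use strict;
use warnings;
use XML::DOM;    # also exports node-type constants such as ELEMENT_NODE

# The class name is case-sensitive: XML::DOM::Parser, not XML::DOM::parser.
my $parser = XML::DOM::Parser->new;
# parsefile dies if ig.xml is not well-formed XML.
my $doc  = $parser->parsefile("ig.xml");
my $root = $doc->getDocumentElement();

print "\n";
print $root->getNodeName();
print "--";
find($root->getChildNodes());

# Recursively print the name of every element node, depth first.
sub find
{
    my (@nodes) = @_;
    foreach my $node (@nodes)
    {
        if ($node->getNodeType == ELEMENT_NODE)
        {
            print $node->getNodeName();
            print "--";
        }
        find($node->getChildNodes());
    }
}

# Avoid memory leaks - cleanup circular references for garbage collection
$doc->dispose;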
---------------------------------------------------------------------------------------------------------------------------------------------


The problem is that it produces output for some files but gives an
error message for others, such as the Google and Yahoo homepages.
Could you please help me out with this? I was not able to fix it.
Why does it work for some pages and not for others?
Could you please suggest a solution?
 
John Bokma

I am guessing here, but XHTML is widely used, and widely used wrongly. Most
people using it have no clue what XHTML means, and hence use it like HTML
and end up with documents that are not well-formed. An XML parser such as
XML::DOM stops at the first well-formedness error. If you want to parse
stuff that's out on the web, use something like HTML::TreeBuilder.
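
A minimal sketch of that suggestion, assuming the HTML::TreeBuilder module from CPAN is installed. The helper name element_names and the command-line usage are made up for illustration; they are not part of the module. It prints the same kind of html--head--... list the original script was after, but it also accepts HTML that is not well-formed XML.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: return the tag names of all elements in an HTML
# string, in document (pre-order) order, starting with the root element.
sub element_names {
    my ($html) = @_;
    my $tree  = HTML::TreeBuilder->new_from_content($html);
    my @names = map { $_->tag } ($tree, $tree->descendants);
    $tree->delete;    # free the tree explicitly
    return @names;
}

# Usage (assumed file name): perl names.pl page.html
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    my $html = do { local $/; <$fh> };
    close $fh;
    print join('--', element_names($html)), "\n";
}
```

Note that HTML::TreeBuilder adds implied html, head and body elements when the input leaves them out, so the list can contain tags that are not literally in the file.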

If you make your own XHTML pages, you might want to think again, twice
even.
 
Brian McCauley

Excuse me stating the obvious, but that's not XML, it's HTML. It's tidy
HTML but still HTML. For one thing, the <meta> element is never closed,
which an XML parser rejects. IIRC it's possible to instruct "tidy" to emit
XHTML (which is XML).
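
As a rough sketch of that idea, assuming a tidy executable is on the PATH: the -asxml option asks tidy for well-formed XHTML, and -numeric asks it to write numeric rather than named entities. The helper name tidy_command and the file names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: build the tidy command line as a list, so no shell
# quoting is involved when it is passed to system().
sub tidy_command {
    my ($in, $out) = @_;
    return ('tidy', '-quiet', '-asxml', '-numeric', '-output', $out, $in);
}

# Usage (assumed file names): perl to_xhtml.pl page.html page.xml
if (@ARGV == 2) {
    my @cmd = tidy_command(@ARGV);
    system(@cmd);
    die "Could not run tidy: $!" if $? == -1;
    # tidy exits with 0 on success, 1 if there were warnings, 2 on errors.
    my $status = $? >> 8;
    die "tidy reported errors for $ARGV[0]\n" if $status >= 2;
    print "Wrote $ARGV[1]\n";
}
```

The resulting file should then be something XML::DOM::Parser can read with parsefile, although pages that tidy cannot repair will still fail.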
 
