Help to extract data from a web page

smiledragon · Aug 25, 2007

Hi, I am a newbie to XSLT. Can you help me write an XSLT stylesheet to
extract the article data from the web page below? Thanks a lot.

HTML page

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;
charset=iso-8859-1">
<title>Untitled Document</title>
</head>
<body>
... Page Header ...
Page Title
Article Title
<table border="0" cellspacing="0" cellpadding="5">
<tr>
<td>Article Date </td>
<td>25/8/2007</td>
</tr>
<tr>
<td colspan="2">Hey, I want to extract Page Title, Article
Title, Article Date and Article Content, Request By.
Please help me to write XSLT code to extract article data?
 
 
Thanks.</td>
</tr>
<tr align="right">
<td colspan="2">Author David </td>
</tr>
</table>
... Page Footer ...
</body>
</html>

XML Result Page

<?xml version="1.0" encoding="UTF-8"?>
<HTMLPage>
<PageTitle>Page Title</PageTitle>
<ArticleTitle>Article Title</ArticleTitle>
<ArticleDate>25/8/2007</ArticleDate>
<ArticleBody>
Hey, I want to extract Page Title, Article Title, Article Date
 
Thanks.
</ArticleBody>
<Author>David</Author>
</HTMLPage>

Joe Kesselman · Aug 25, 2007

(Despite its name, microsoft.public.xsl doesn't let me post to it, so
you're only going to get an answer in comp.text.xml.)

XSLT is set up to process XML, not HTML. Your HTML document will not go
through an XML parser. So the first thing you'll need to do is put it
through an HTML-to-XHTML conversion layer, such as the W3C's "tidy"
tool. (Alternatively you could feed the output of an HTML-to-XML parser,
such as NekoHTML, into an XSLT processor... but that will require a bit
more programming to hook those tools to each other.)
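
One thing to watch for once the page has been converted: tidy's XHTML output and TagSoup-based parsers normally place every element in the XHTML namespace, so unprefixed XPaths such as /html/body would match nothing. A minimal stylesheet skeleton, assuming that namespace is present (the h prefix is an arbitrary choice, and the empty body template is only a placeholder), might look like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:h="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="h">

  <xsl:output method="xml" encoding="UTF-8" indent="yes"/>

  <!-- Match the document root and emit the wrapper element from the
       desired result. -->
  <xsl:template match="/">
    <HTMLPage>
      <!-- Assumption: the converter put html and body in the XHTML
           namespace. If it did not, drop the h: prefixes. -->
      <xsl:apply-templates select="h:html/h:body"/>
    </HTMLPage>
  </xsl:template>

  <!-- Placeholder: output nothing for the body until templates for the
       individual fields are written. -->
  <xsl:template match="h:body"/>

</xsl:stylesheet>
```

If the stylesheet runs but produces an empty result, a namespace mismatch between the input and the XPaths is the first thing to check.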

After doing that... what do you mean by "extract article data"? You're
writing a program, so you need to be explicit about what it's supposed
to do. Page title and article title are easy; look for elements with
the appropriate class attribute, using XPaths with predicates.
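
As a rough sketch of the predicate approach: the class names pageTitle and articleTitle below are hypothetical, since the posted page does not show which ones it uses, and the h prefix assumes the converted input is in the XHTML namespace.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:h="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="h">

  <xsl:output method="xml" encoding="UTF-8" indent="yes"/>

  <xsl:template match="/">
    <HTMLPage>
      <!-- 'pageTitle' and 'articleTitle' are made-up class names;
           substitute whatever the real page uses. -->
      <PageTitle>
        <xsl:value-of select="normalize-space(//h:*[@class='pageTitle'])"/>
      </PageTitle>
      <ArticleTitle>
        <xsl:value-of select="normalize-space(//h:*[@class='articleTitle'])"/>
      </ArticleTitle>
    </HTMLPage>
  </xsl:template>

</xsl:stylesheet>
```

The predicate [@class='...'] only matches when the attribute value is exactly that string; an element carrying several classes would need contains() or similar.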

Article date is more of a pain since you need to search for the <td>
with the appropriate text value, then retrieve its following sibling's
value... unless you can count on the fact that it will always be in the
first <tr>, in which case you search for the second td of that tr.
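
Both variants can be sketched as follows, assuming the label cell contains exactly the text "Article Date" once whitespace is normalized, and that the converted input is in the XHTML namespace (bound here to the arbitrary prefix h):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:h="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="h">

  <xsl:output method="xml" encoding="UTF-8" indent="yes"/>

  <xsl:template match="/">
    <ArticleDate>
      <!-- Find the label cell by its text, then take the cell that
           immediately follows it in the same row. -->
      <xsl:value-of select="normalize-space(
          //h:td[normalize-space(.)='Article Date']
            /following-sibling::h:td[1])"/>
      <!-- Positional alternative, if the date is always in the first
           row of the first table:
           (//h:table//h:tr)[1]/h:td[2] -->
    </ArticleDate>
  </xsl:template>

</xsl:stylesheet>
```

The label lookup survives rows being reordered; the positional form is shorter but breaks as soon as another table or row is added ahead of it.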

Content -- Can you count on that being the second tr? If so, just
copying the contents of that seems to meet your need.
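
A sketch under that assumption (content in the second row of the first table, input in the XHTML namespace with the arbitrary prefix h):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:h="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="h">

  <xsl:output method="xml" encoding="UTF-8" indent="yes"/>

  <xsl:template match="/">
    <ArticleBody>
      <!-- Second row in document order; the descendant step tolerates
           an intervening tbody if the converter adds one. -->
      <xsl:value-of select="(//h:table//h:tr)[2]/h:td"/>
    </ArticleBody>
  </xsl:template>

</xsl:stylesheet>
```

xsl:value-of keeps only the text of the cell. To preserve inline markup, xsl:copy-of with select ending in /node() could be used instead, though the copied elements would then stay in the XHTML namespace in the result.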

Author -- Again assuming that it's reliably going to be the third tr,
this is more of a pain because you're going to have to do string
manipulation to extract the author's name.
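
The string manipulation can be fairly small. This sketch assumes the cell text always starts with the literal word "Author" followed by a space, that the author row is the third row of the first table, and that the input is in the XHTML namespace (prefix h is arbitrary):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:h="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="h">

  <xsl:output method="xml" encoding="UTF-8" indent="yes"/>

  <xsl:template match="/">
    <Author>
      <!-- normalize-space trims the cell text and collapses runs of
           whitespace; substring-after then drops the 'Author ' label. -->
      <xsl:value-of select="substring-after(
          normalize-space((//h:table//h:tr)[3]/h:td), 'Author ')"/>
    </Author>
  </xsl:template>

</xsl:stylesheet>
```

For the posted cell, "Author David ", this yields David. If the label is missing, substring-after returns an empty string, so the element comes out empty rather than wrong.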

Having broken it down to this point, you really ought to be able to
complete the task yourself by consulting a good intro-to-XSLT tutorial.
Try it, and if you run into trouble come back with specific questions.

Martin Honnen · Aug 25, 2007

Joe said:
XSLT is set up to process XML, not HTML. Your HTML document will not go
through an XML parser. So the first thing you'll need to do is put it
through an HTML-to-XHTML conversion layer, such as the W3C's "tidy"
tool. (Alternatively you could feed the output of an HTML-to-XML parser,
such as NekoHTML, into an XSLT processor... but that will require a bit
more programming to hook those tools to each other.)

If you don't want to write code to hook those tools together, you can
use TSaxon <http://ccil.org/~cowan/XML/tagsoup/tsaxon/>. It lets you
use Saxon 6.5.5 to apply XSLT 1.0 transformations to both XML and HTML
input documents.
