Extracting data from one XHTML file into another

chris_huh

Is there a way to extract data from one XHTML file and create another
one with it? I want to create a basic file with all the headlines from
a news page listed in it (like an RSS feed).
 
Martin Honnen

chris_huh said:
Is there a way to extract data from one XHTML file and create another
one with it? I want to create a basic file with all the headlines from
a news page listed in it (like an RSS feed).

Well, XHTML is supposed to be XML, so in theory you should be able to use
any XML parser or XML API to extract data, such as XPath, XSLT, or
XQuery. In practice, however, lots of XHTML is served as text/html and is
often not well-formed XML, so XML parsers might fail to process it.
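
A quick way to see whether a page is well-formed enough for an XML parser is simply to try parsing it. The following is a minimal Python sketch of that idea using only the standard library, not the ASP/MSXML setup discussed later in the thread; the function name `check_well_formed` and the two sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET

def check_well_formed(markup):
    """Return None if markup parses as XML, else a short error description."""
    try:
        ET.fromstring(markup)
        return None
    except ET.ParseError as err:
        line, column = err.position
        return f"line {line}, column {column}: {err}"

# Hypothetical snippets: the first is well-formed XML,
# the second uses HTML-style shortcuts (unclosed <p> and <br>).
GOOD = "<html><body><p>Headline</p><br /></body></html>"
BAD = "<html><body><p>Headline<br></body></html>"

print(check_well_formed(GOOD))  # None
print(check_well_formed(BAD))   # a message describing the mismatched tag
```

If the second kind of markup is what the server actually delivers, an XML API will reject it before any XPath or XSLT gets a chance to run.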
 
Joe Kesselman

Martin said:
Well, XHTML is supposed to be XML, so in theory you should be able to use
any XML parser or XML API to extract data, such as XPath, XSLT, or
XQuery. In practice, however, lots of XHTML is served as text/html and is
often not well-formed XML, so XML parsers might fail to process it.

Even if served as text/html, appropriate software could recognize and
process it as XHTML and thus XML. You'd have to either know to expect
XHTML or have a prebuffer/prescan pass to check that.

Note that if it isn't well-formed XML, it really isn't XHTML, no matter
what the document's contents claim. That's one of the major differences
between XHTML and HTML -- HTML was SGML-based and allowed some
shortcuts/sloppiness that the XML-based XHTML doesn't.
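
The prescan idea can be sketched roughly as follows: ignore the media type, try to parse the content as XML, and check that the root element is `html` in the XHTML namespace. This is a Python illustration under my own assumptions; `looks_like_xhtml` and the sample documents are hypothetical names, not anything from an existing tool.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

def looks_like_xhtml(markup):
    """True if markup is well-formed XML whose root is <html> in the XHTML namespace."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        # Not well-formed XML, so it cannot be XHTML whatever it claims.
        return False
    return root.tag == "{%s}html" % XHTML_NS

# Hypothetical inputs: one real XHTML document, one SGML-style HTML page.
XHTML_DOC = ('<html xmlns="http://www.w3.org/1999/xhtml">'
             '<head><title>t</title></head><body><p>x</p></body></html>')
TAG_SOUP = "<html><body><p>one<p>two</body></html>"

print(looks_like_xhtml(XHTML_DOC))  # True
print(looks_like_xhtml(TAG_SOUP))   # False
```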
 
chris_huh

Even if served as text/html, appropriate software could recognize and
process it as XHTML and thus XML. You'd have to either know to expect
XHTML or have a prebuffer/prescan pass to check that.

Note that if it isn't well-formed XML, it really isn't XHTML, no matter
what the document's contents claim. That's one of the major differences
between XHTML and HTML -- HTML was SGML-based and allowed some
shortcuts/sloppiness that the XML-based XHTML doesn't.

I have tried a server-based approach (using the tutorial from
the w3schools site - http://www.w3schools.com/xsl/xsl_server.asp) but
it doesn't seem to accept the XHTML file (the ASP page just
loads continuously). Is this the wrong approach? I don't know much
about XML.

What I have is an XHTML file (with a .shtml extension) which is used
as an index page for a section of a news site (I have ten sections).
I want another file generated containing an unordered list of the items on
that index page. Each index page has three top stories, and I
want another file created that holds these three stories in a <ul>.
These ten generated files (one for each section) will
then be included in the top index page. Does that explain it well?
 
Martin Honnen

chris_huh said:
I have tried a server-based approach (using the tutorial from
the w3schools site - http://www.w3schools.com/xsl/xsl_server.asp) but
it doesn't seem to accept the XHTML file (the ASP page just
loads continuously). Is this the wrong approach? I don't know much
about XML.

What I have is an XHTML file (with a .shtml extension) which is used
as an index page for a section of a news site (I have ten sections).

Aren't .shtml files usually ones making use of SSI (server-side
includes)? Are you loading the document from the file system or over
HTTP? Can you post the URL to a sample document?
 
chris_huh

Aren't .shtml files usually ones making use of SSI (server-side
includes)? Are you loading the document from the file system or over
HTTP? Can you post the URL to a sample document?

Yeah, I am using SSI to include other files in .shtml files. I can't
send a link as I am using it on a closed server.

I am using:

<!--#include virtual="/includes/navigation.sssi" -->

to include the files. I guess that means it is over HTTP. The idea (if
this is even possible) is to include each of these generated files using
the same mechanism.
 
Martin Honnen

chris_huh said:
Yeah, I am using SSI to include other files in .shtml files. I can't
send a link as I am using it on a closed server.

I am using:

<!--#include virtual="/includes/navigation.sssi" -->

to include the files. I guess that means it is over HTTP.

If you use SSI then reading a file from the file system would not
process any of those SSI instructions.

It is hard to tell what goes wrong without being able to check the
X(HT)ML documents you have. If you use classic ASP and have troubles
getting your code to work then you might want to ask in a newsgroup
dedicated to that: microsoft.public.inetserver.asp.general
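
To illustrate the SSI point: an include directive is syntactically just an XML comment, so a parser reading the raw file from disk never sees the included content. Here is a small Python sketch of that; the regular expression, the function `find_ssi_includes`, and the sample markup are my own assumptions, built around the directive quoted above.

```python
import re
import xml.etree.ElementTree as ET

# Matches SSI include directives such as <!--#include virtual="/path" -->
SSI_INCLUDE = re.compile(r'<!--#include\s+(virtual|file)="([^"]*)"\s*-->')

def find_ssi_includes(raw_markup):
    """Return the (kind, path) pairs of SSI include directives in raw_markup."""
    return SSI_INCLUDE.findall(raw_markup)

# Hypothetical raw file content, as it sits on disk before the server expands it.
RAW = ('<body><!--#include virtual="/includes/navigation.sssi" -->'
       '<p>Story</p></body>')

print(find_ssi_includes(RAW))  # [('virtual', '/includes/navigation.sssi')]

# An XML parser treats the directive as a comment and drops it,
# so only the <p> element is left under <body>.
root = ET.fromstring(RAW)
print([child.tag for child in root])  # ['p']
```

If the headlines themselves come from included files, the document would need to be fetched over HTTP so that the server expands the directives first.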
 
chris_huh

If you use SSI then reading a file from the file system would not
process any of those SSI instructions.

It is hard to tell what goes wrong without being able to check the
X(HT)ML documents you have. If you use classic ASP and have troubles
getting your code to work then you might want to ask in a newsgroup
dedicated to that: microsoft.public.inetserver.asp.general

Yeah, I suppose it could be more of an issue with ASP. The ASP code I
use is:

<%
'Load XML
set xml = Server.CreateObject("Microsoft.XMLDOM")
xml.async = false
xml.load(Server.MapPath("/iraq/index.shtml"))

'Load XSL
set xsl = Server.CreateObject("Microsoft.XMLDOM")
xsl.async = false
xsl.load(Server.MapPath("/includes/style.xsl"))

'Transform file
Response.Write(xml.transformNode(xsl))
%>

When the Server.MapPath argument is a .xml file it works OK, but when it is
a .shtml file, loaded from the file itself, it just crashes. It
could be something wrong with the .shtml file (maybe it isn't correct
XHTML) or maybe you can't use a .shtml file. That's what I wasn't sure
about.
 
Martin Honnen

chris_huh said:
Yeah, I suppose it could be more of an issue with ASP. The ASP code I
use is:

<%
'Load XML
set xml = Server.CreateObject("Microsoft.XMLDOM")
xml.async = false
xml.load(Server.MapPath("/iraq/index.shtml"))

When the Server.MapPath argument is a .xml file it works OK, but when it is
a .shtml file, loaded from the file itself, it just crashes. It
could be something wrong with the .shtml file (maybe it isn't correct
XHTML) or maybe you can't use a .shtml file. That's what I wasn't sure
about.

You can check for parse errors with MSXML as follows, put that after the
load call:
If xml.parseError.errorCode <> 0 Then
Response.Write xml.parseError.reason
End If
 
chris_huh

You can check for parse errors with MSXML as follows, put that after the
load call:
   If xml.parseError.errorCode <> 0 Then
     Response.Write xml.parseError.reason
   End If

I tried that but the script just times out.

Also, I tried to validate the .shtml file (which is XHTML) and the only
errors were about some XML markup. I put in <headline> tags for each title
so that I could extract it, and the validator says that they are wrong. Is that
not how you do this?
 
chris_huh

You can check for parse errors with MSXML as follows, put that after the
load call:
   If xml.parseError.errorCode <> 0 Then
     Response.Write xml.parseError.reason
   End If

The ASP page finally loaded and came back with:

The system cannot locate the resource specified. Error processing
resource 'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'.

Which I guess refers to the line at the top of the XHTML file. Maybe
it can't read that properly.
 
chris_huh

Does it work when you set
   xml.resolveExternals = False
before the load call?

Now it comes back with: The element 'html' is used but not declared in
the DTD/Schema.
 
Martin Honnen

chris_huh said:
Now it comes back with: The element 'html' is used but not declared in
the DTD/Schema.

Add
xml.validateOnParse = False
before the load call.
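
The effect of those two settings is a non-validating parse: the DOCTYPE is read but the DTD is neither fetched nor enforced, so custom elements such as headline no longer cause errors. As a rough Python analogy (a different parser from MSXML, so this only illustrates the concept), the standard library's ElementTree does not load external DTDs and does not validate; the sample document below is a cut-down, hypothetical version of the kind of page being discussed.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: XHTML DOCTYPE plus custom <item>/<headline> elements
# that the XHTML DTD does not declare.
DOC = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Test stories</title></head>
<body><item><headline>Story 1 headline</headline></item></body>
</html>"""

# Parsing succeeds without network access and without validation errors,
# because only well-formedness is checked.
root = ET.fromstring(DOC)
print(root.tag)  # {http://www.w3.org/1999/xhtml}html
```

Note that the root element's name comes back qualified with the XHTML namespace, which matters for any XPath expressions used afterwards.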
 
chris_huh

Add
   xml.validateOnParse = False
before the load call.

I've got the code like this now, and it just produces a blank page:

<%
'Load XML
set xml = Server.CreateObject("Microsoft.XMLDOM")
xml.async = false
xml.validateOnParse = false
xml.resolveExternals = false
xml.load(Server.MapPath("/iraq/index.shtml"))

If xml.parseError.errorCode <> 0 Then
Response.Write xml.parseError.reason
End If

'Load XSL
set xsl = Server.CreateObject("Microsoft.XMLDOM")
xsl.async = false
xsl.load(Server.MapPath("/includes/topstyle.xsl"))

'Transform file
Response.Write(xml.transformNode(xsl))
%>
 
Martin Honnen

chris_huh said:
I've got the code like this now, and it just produces a blank page:

<%
'Load XML
set xml = Server.CreateObject("Microsoft.XMLDOM")
xml.async = false
xml.validateOnParse = false
xml.resolveExternals = false
xml.load(Server.MapPath("/iraq/index.shtml"))

If xml.parseError.errorCode <> 0 Then
Response.Write xml.parseError.reason
End If

'Load XSL
set xsl = Server.CreateObject("Microsoft.XMLDOM")
xsl.async = false
xsl.load(Server.MapPath("/includes/topstyle.xsl"))

'Transform file
Response.Write(xml.transformNode(xsl))
%>

Well I would first debug the stylesheet in an XML editor to ensure it
produces the output you want before running it in ASP.
If you need help with the stylesheet then you need to share the XML
input and the XSLT stylesheet.
 
chris_huh

Well I would first debug the stylesheet in an XML editor to ensure it
produces the output you want before running it in ASP.
If you need help with the stylesheet then you need to share the XML
input and the XSLT stylesheet.

The ASP page now produces the correct markup apart from the for-each
part.

So at the moment i have this for the ASP:

<%
'Load XML
set xml = Server.CreateObject("Microsoft.XMLDOM")
xml.async = false
xml.validateOnParse = false
xml.resolveExternals = false
xml.load(Server.MapPath("/iraq/test.shtml"))

If xml.parseError.errorCode <> 0 Then
Response.Write xml.parseError.reason
End If

'Load XSL
set xsl = Server.CreateObject("Microsoft.XMLDOM")
xsl.async = false
xsl.load(Server.MapPath("/includes/topstyle.xsl"))

'Transform file
Response.Write(xml.transformNode(xsl))
%>

This for the XSL:

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="xml" version="1.0" encoding="UTF-8"
doctype-public="-//W3C//DTD XHTML 1.1//EN"
doctype-system="http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd" indent="yes"/>

<xsl:template match="/">
<html>
<head>
<title>Test</title>
</head>
<body>
<ul>
<xsl:for-each select="html/body/item">
<li>
<xsl:value-of select="headline" />
</li>
</xsl:for-each>
</ul>
</body>
</html>
</xsl:template>

</xsl:stylesheet>

And this for the XHTML:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="EN" lang="EN">
<head>

<title>Test stories</title>

</head>

<body class="iraq">

<item>
<table width="100%" border="0" cellspacing="0" cellpadding="0">
<tr>
<td width="226" rowspan="2" valign="top"><a href="#"><img src="../images/placeholder226.jpg"
alt="Story 1" name="Story1image" width="226" height="170" id="Story1image" /></a></td>
<td width="10" rowspan="2" valign="top"></td>
<td valign="top"><h2 class="itemheader"><a href="#"
class="itemlink"><headline>Story 1 headline</headline></a></h2></td>
</tr>
<tr>
<td valign="top" class="itemdescription">Story 1 summary</td>
</tr>
<tr>
<td height="10" colspan="3" valign="top"></td>
</tr>
</table></item>
<table width="100%" border="0" cellspacing="0" cellpadding="0">
<tr>
<td width="238" align="left" valign="top">
<item>
<table width="228" border="0" cellspacing="0" cellpadding="0">
<tr>
<td width="76" valign="top"><a href="#"><img src="../images/placeholder66.jpg"
alt="Story 2" name="Story2image" width="66" height="49" id="Story2image" /></a></td>
<td valign="top"><h3 class="itemheader"><a href="#"
class="itemlink"><headline>Story 2 headline</headline></a></h3></td>
</tr>
<tr>
<td colspan="2" valign="top" height="10"></td>
</tr>
<tr>
<td colspan="2" valign="top" class="itemdescription">Story 2
summary</td>
</tr>
</table></item></td>
<td align="right" valign="top"><item><table width="228" border="0"
cellspacing="0" cellpadding="0">
<tr>
<td width="76" valign="top"><a href="#"><img src="../images/placeholder66.jpg"
alt="Story 3" name="Story3image" width="66" height="49" id="Story2image2" /></a></td>
<td valign="top"><h3 class="itemheader"><a href="#"
class="itemlink"><headline>Story 3 headline</headline></a></h3></td>
</tr>
<tr>
<td colspan="2" valign="top" height="10"></td>
</tr>
<tr>
<td colspan="2" valign="top" class="itemdescription">Story 3
summary</td>
</tr>
</table></item></td>
</tr>
</table>


</body>

</html>

Obviously I want to change the XHTML to look nicer, but I am just
trying to get this working. I know the ASP and XSL work because if I
use an XML file instead of XHTML it works:

<?xml version="1.0" encoding="ISO-8859-1"?>
<news>
<item><headline>Story 1 headline</headline></item>
<item><headline>Story 2 headline</headline></item>
<item><headline>Story 3 headline</headline></item>
</news>
 
Martin Honnen

chris_huh said:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

You need
xmlns="http://www.w3.org/1999/xhtml"
xmlns:xhtml="http://www.w3.org/1999/xhtml"
exclude-result-prefixes="xhtml"
here on the xsl:stylesheet element to access XHTML elements and to
create XHTML elements.

<xsl:template match="/">
<html>
<head>
<title>Test</title>
</head>
<body>
<ul>
<xsl:for-each select="html/body/item">

You need to qualify all element names with the prefix 'xhtml' I defined
above e.g.
select="xhtml:html/xhtml:body/xhtml:item"
<li>
<xsl:value-of select="headline" />

Same here
select="xhtml:headline"

and so on, everywhere you want to access or match elements from the
source document.
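
The reason the unprefixed path selects nothing is that `html/body/item` names elements in no namespace, while every element in the source document is in the XHTML namespace. The same behaviour can be reproduced outside XSLT; below is a small Python sketch with the standard library, where the prefix mapping `NS` and the reduced sample document are my own illustrative choices.

```python
import xml.etree.ElementTree as ET

# The prefix is arbitrary; only the namespace URI has to match the document.
NS = {"xhtml": "http://www.w3.org/1999/xhtml"}

# Reduced, hypothetical version of the posted page.
DOC = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<item><headline>Story 1 headline</headline></item>
<item><headline>Story 2 headline</headline></item>
</body>
</html>"""

root = ET.fromstring(DOC)

# Unqualified names look for elements in no namespace: nothing matches.
unqualified = root.findall("body/item")

# Qualified names match the namespaced elements.
qualified = root.findall("xhtml:body/xhtml:item", NS)

print(len(unqualified), len(qualified))  # 0 2
```

This also explains why the plain XML test file worked: its news, item and headline elements are in no namespace, so the unprefixed paths matched.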
 
chris_huh

You need
   xmlns="http://www.w3.org/1999/xhtml"
   xmlns:xhtml="http://www.w3.org/1999/xhtml"
   exclude-result-prefixes="xhtml"
here on the xsl:stylesheet element to access XHTML elements and to
create XHTML elements.


You need to qualify all element names with the prefix 'xhtml' I defined
above e.g.
                     select="xhtml:html/xhtml:body/xhtml:item"


Same here
                       select="xhtml:headline"

and so on, everywhere you want to access or match elements from the
source document.

Now I have the XSL like this:

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns="http://www.w3.org/1999/xhtml"
xmlns:xhtml="http://www.w3.org/1999/xhtml"
exclude-result-prefixes="xhtml" >

<xsl:output method="xml" version="1.0" encoding="UTF-8"
doctype-public="-//W3C//DTD XHTML 1.1//EN"
doctype-system="http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd" indent="yes"/>

<xsl:template match="/">
<html>
<head>
<title>Test</title>
</head>
<body>
<ul>
<xsl:for-each select="xhtml:html/xhtml:body/xhtml:item">
<li>
<xsl:value-of select="xhtml:headline" />
</li>
</xsl:for-each>
</ul>
</body>
</html>
</xsl:template>

</xsl:stylesheet>

which works a bit better (the li items show up) but there still isn't
anything inside the li items.
 
Martin Honnen

chris_huh said:
<xsl:for-each select="xhtml:html/xhtml:body/xhtml:item">
<li>
<xsl:value-of select="xhtml:headline" />

I think those headline elements are deep down inside the table so you
either have to spell out the complete path or use
<xsl:value-of select="descendant::xhtml:headline"/>
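
The child-versus-descendant difference can be checked outside XSLT as well. The Python sketch below uses a reduced, hypothetical version of the posted page and the standard library's limited XPath support, where `.//` plays the role of the descendant axis. It also suggests that the same issue may apply to the item elements themselves: in the posted test page only the first item is a direct child of body, while the other two sit inside a layout table, so a path such as `//xhtml:item` might be needed to pick up all three.

```python
import xml.etree.ElementTree as ET

NS = {"xhtml": "http://www.w3.org/1999/xhtml"}

# Reduced, hypothetical version of the posted page: the first item is a
# direct child of body, the second is nested inside a layout table.
DOC = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<item><table><tr><td><h2><a href="#"><headline>Story 1 headline</headline></a></h2></td></tr></table></item>
<table><tr><td><item><table><tr><td><h3><headline>Story 2 headline</headline></h3></td></tr></table></item></td></tr></table>
</body>
</html>"""

root = ET.fromstring(DOC)

# Only items that are direct children of body.
body_items = root.findall("xhtml:body/xhtml:item", NS)
# Items at any depth.
all_items = root.findall(".//xhtml:item", NS)
print(len(body_items), len(all_items))  # 1 2

# headline is not a direct child of item, so the child step finds nothing...
direct = [item.find("xhtml:headline", NS) for item in all_items]
# ...while the descendant search finds it inside the table.
deep = [item.find(".//xhtml:headline", NS).text for item in all_items]
print(direct)  # [None, None]
print(deep)    # ['Story 1 headline', 'Story 2 headline']

# Build the <ul> the thread is aiming for.
ul = ET.Element("ul")
for text in deep:
    ET.SubElement(ul, "li").text = text
print(ET.tostring(ul, encoding="unicode"))
```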
 
