ASP Question: Parse HTML file?

Rob Meade

Hi all,

I'm working on a project where there are just under 1300 course files - these
are HTML files. My problem is that I need to do more with the content of
these pages, and the thought of writing 1300 ASP pages to deal with this
doesn't thrill me.

The HTML pages are provided by a training company. They seem to be
"structured" to some degree, but I'm not sure how easy it's going to be to
parse them.

Typically there are the following "sections" of each page:

Title
Summary
Topics
Technical Requirements
Copyright Information
Terms Of Use

I need to get the content for the Title, Summary, Topics, and Technical
Requirements, and lose the Copyright and Terms of Use... in addition I need
to squeeze in a new section which will display pricing information and a
link to "Add to cart" etc.

My "plan" (if you can call it that) was to have one ASP page which can parse
the appropriate HTML file based on the ASP page being passed a code in the
querystring - the code will match the filename of the HTML page (the first
part prior to the dot).
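
A rough, untested sketch of that one-page plan (the folder name "courses"
and the querystring parameter "code" are placeholders, not anything from the
real site):

```vbscript
<%
' course.asp?code=560c04 - one page serving all ~1300 course files
Dim strCode, objFSO, objFile, strHTML, strPath

strCode = Request.QueryString("code")

' Reject anything that could be used to walk the file system
If strCode = "" Or InStr(strCode, ".") > 0 _
   Or InStr(strCode, "/") > 0 Or InStr(strCode, "\") > 0 Then
  Response.Status = "404 Not Found"
  Response.End
End If

Set objFSO = CreateObject("Scripting.FileSystemObject")
strPath = Server.MapPath("courses/" & strCode & ".html")

If Not objFSO.FileExists(strPath) Then
  Response.Status = "404 Not Found"
  Response.End
End If

Set objFile = objFSO.OpenTextFile(strPath, 1) '1 = ForReading
strHTML = objFile.ReadAll()
objFile.Close

' strHTML now holds the raw course page, ready for parsing and
' re-emitting with the new pricing / "Add to cart" section
%>
```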

What I then need to do is go through the content of the HTML... this is
where I am currently stuck.

I have pasted an example of one of these pages below - if anyone can suggest
to me how I might achieve this I would be most grateful - in addition - if
anyone can explain the XML Name Space stuff in there that would be handy
too - I figure this is just a normal HTML page, as there is no declaration
or anything at the top?

Any information/suggestions would be most appreciated.

Thanks in advance for your help,

Regards

Rob


Example file:

<html>
<head>
<title>Novell 560 CNE Series: File System</title>
<meta name="Description" content="">
<link rel="stylesheet" href="../resource/mlcatstyle.css"
type="text/css">
</head>
<body class="MlCatPage">
<table class="Header" xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:fn="http://www.w3.org/2005/xpath-functions">
<tr>
<td class="Logo" colspan="2">
<img class="Logo" src="../images/logo.gif">
</td>
</tr>
<tr>
<td class="Title">
<div class="ProductTitle">
<span class="CoCat">Novell 560 CNE Series: File System</span>
</div>
<div class="ProductDetails">
<span class="SmallText">
<span class="BoldText"> Product Code: </span>
560c04<span class="BoldText"> Time: </span>
4.0 hour(s)<span class="BoldText"> CEUs: </span>
Available</span>
</div>
</td>
<td class="Back">
<div class="BackButton">
<a href="javascript:history.back()">
<img src="../images/back.gif" align="right" border="0">
</a>
</div>
</td>
</tr>
</table>
<br xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:fn="http://www.w3.org/2005/xpath-functions">
<table class="HighLevel" xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:fn="http://www.w3.org/2005/xpath-functions">
<tr>
<td class="BlockHeader">
<h3 class="sectiontext">Summary:</h3>
</td>
</tr>
<tr>
<td class="Overview">
<div class="ProductSummary">This course provides an introduction
to NetWare 5 file system concepts and management procedures.</div>
<br>
<h3 class="Sectiontext">Objectives:</h3>
<div class="FreeText">After completing this course, students will
be able to: </div>
<div class="ObjectiveList">
<ul class="listing">
<li class="ObjectiveItem">Explain the relationship of the file
system and login scripts</li>
<li class="ObjectiveItem">Create login scripts</li>
<li class="ObjectiveItem">Manage file system directories and
files</li>
<li class="ObjectiveItem">Map network drives</li>
</ul>
</div>
<br></br>
<h3 class="Sectiontext">Topics:</h3>
<div class="OutlineList">
<ul class="listing">
<li class="OutlineItem">Managing the File System</li>
<li class="OutlineItem">Volume Space</li>
<li class="OutlineItem">Examining Login Scripts</li>
<li class="OutlineItem">Creating and Executing Login
Scripts</li>
<li class="OutlineItem">Drive Mappings</li>
<li class="OutlineItem">Login Scripts and Resources</li>
</ul>
</div>
</td>
</tr>
</table>
<br xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:fn="http://www.w3.org/2005/xpath-functions">
<table class="Details" xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:fn="http://www.w3.org/2005/xpath-functions">
<tr>
<td class="BlockHeader">
<h3 class="Sectiontext">Technical Requirements:</h3>
</td>
</tr>
<tr>
<td class="Details">
<div class="ProductRequirements">200MHz Pentium with 32MB Ram. 800
x 600 minimum screen resolution. Windows 98, NT, 2000, or XP. 56K minimum
connection speed, broadband (256 kbps or greater) connection recommended.
Internet Explorer 5.0 or higher required. Flash Player 7.0 or higher
required. JavaScript must be enabled. Netscape, Firefox and AOL browsers not
supported.</div>
</td>
</tr>
</table>
<br xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:fn="http://www.w3.org/2005/xpath-functions">
<table class="Legal" xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:fn="http://www.w3.org/2005/xpath-functions">
<tr>
<td class="BlockHeader">
<h3 class="Sectiontext">Copyright Information:</h3>
</td>
</tr>
<tr>
<td class="Copyright">
<div class="ProductRequirements">Product names mentioned in this
catalog may be trademarks/servicemarks or registered trademarks/servicemarks
of their respective companies and are hereby acknowledged. All product
names that are known to be trademarks or service marks have been
appropriately capitalized. Use of a name in this catalog is for
identification purposes only, and should not be regarded as affecting the
validity of any trademark or service mark, or as suggesting any affiliation
between MindLeaders.com, Inc. and the trademark/servicemark
proprietor.</div>
<br>
<h3 class="Sectiontext">Terms of Use:</h3>
<div class="ProductUsenote"></div>
</td>
</tr>
</table>
<p align="center">
<span class="SmallText">Copyright &copy; 2006 MindLeaders. All rights
reserved.</span>
</p>
</body>
</html>
 
Mike Brind

Rob said:
Hi all,

I'm working on a project where there are just under 1300 course files - these
are HTML files. My problem is that I need to do more with the content of
these pages, and the thought of writing 1300 ASP pages to deal with this
doesn't thrill me.

The HTML pages are provided by a training company. They seem to be
"structured" to some degree, but I'm not sure how easy it's going to be to
parse them.

Typically there are the following "sections" of each page:

Title
Summary
Topics
Technical Requirements
Copyright Information
Terms Of Use

If you can identify the specific divs that hold this information (and
they are consistent across pages), you could use regex to parse the
files and pop the relevant bits into a database.
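
As a rough, untested sketch (the class names are taken from the example page
in this thread; the helper function itself is just an illustration),
something like this could pull out one section at a time:

```vbscript
' Extract the inner content of a div by its class attribute.
' Assumes the divs in these pages do not nest (the example page
' suggests they don't), so a lazy match to the first closing tag works.
Function ExtractDiv(html, className)
  Dim re, matches
  Set re = New RegExp
  re.Pattern = "<div class=""" & className & """>([\s\S]*?)</div>"
  re.IgnoreCase = True
  Set matches = re.Execute(html)
  If matches.Count > 0 Then
    ExtractDiv = matches(0).SubMatches(0)
  Else
    ExtractDiv = ""
  End If
End Function

' Usage:
' strSummary = ExtractDiv(strHTML, "ProductSummary")
' strTopics  = ExtractDiv(strHTML, "OutlineList")
```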
 
Anthony Jones

I have pasted an example of one of these pages below - if anyone can suggest
to me how I might achieve this I would be most grateful - in addition - if
anyone can explain the XML Name Space stuff in there that would be handy
too - I figure this is just a normal HTML page, as there is no declaration
or anything at the top?

These pages will have been generated via an XSLT transform, and the
transform will have made use of these namespaces. However, unless told
otherwise, an XSLT processor will output the xmlns declarations for those
namespaces even though no element belonging to them is output, which is the
case here.

That's a long-winded way of saying they don't do anything; ignore them.

It's a pity they didn't go the whole hog and output the whole page as XML -
it would be a lot easier to do what you need. Still, it's a good sign that
the content of the other 1299 pages is likely to be consistent, so Mike's
idea of scanning with RegExp should work.

Anthony.
 
Rob Meade

...
Consider displaying their page inside of an <iframe>
inside of a page that has your content.

Hi McKirahan,

Thanks for your reply - alas I need "bits" of their pages, with "bits" of my
stuff inserted in between, so including their whole page as-is unfortunately
is no good for me.

Regards

Rob
 
Rob Meade

...
If you can identify the specific divs that hold this information (and
they are consistent across pages), you could use regex to parse the
files and pop the relevant bits into a database.

Hi Mike,

Thanks for your reply.

I don't suppose by any chance you might have an example that would get me
started with that approach, would you? It sounds like it could well work.

Regards

Rob
 
Rob Meade

...
These pages will have been generated via an XSLT transform, and the
transform will have made use of these namespaces. However, unless told
otherwise, an XSLT processor will output the xmlns declarations for those
namespaces even though no element belonging to them is output, which is the
case here.

That's a long-winded way of saying they don't do anything; ignore them.

It's a pity they didn't go the whole hog and output the whole page as XML -
it would be a lot easier to do what you need. Still, it's a good sign that
the content of the other 1299 pages is likely to be consistent, so Mike's
idea of scanning with RegExp should work.

Hi Anthony,

Thanks for the reply.

I especially appreciate the explanation of why they are there - I tried
googling it last night and found some stuff about XSLT 2.0, but it didn't
really get me anywhere. I would agree that it's a shame they are not XML -
that would have been nice!

Cheers

Rob
 
McKirahan

Mike Brind said:
If you can identify the specific divs that hold this information (and
they are consistent across pages), you could use regex to parse the
files and pop the relevant bits into a database.

It would have been nice if each div class were unique.
This one is repeated:
<div class="ProductRequirements">
It's not wrong, just (potentially) inconvenient.

<td class="Details">
<div class="ProductRequirements">200MHz Pentium ...

<td class="Copyright">
<div class="ProductRequirements">Product names ...

Which div's are you interested in?


Here's a script that will extract all the div's into a new file:

Option Explicit
'*
Const cVBS = "Novell.vbs"
Const cOT1 = "Novell.htm" '= Input filename
Const cOT2 = "Novell.txt" '= Output filename
Const cDIV = "</div>"
'*
'* Declare Variables
'*
Dim intBEG
intBEG = 1
Dim arrDIV(9)
arrDIV(0) = "<div class=" & Chr(34) & "?" & Chr(34) & ">"
arrDIV(1) = "ProductTitle"
arrDIV(2) = "ProductDetails"
arrDIV(3) = "ProductSummary"
arrDIV(4) = "FreeText"
arrDIV(5) = "ObjectiveList"
arrDIV(6) = "OutlineList"
arrDIV(7) = "ProductRequirements"
arrDIV(8) = "ProductRequirements"
arrDIV(9) = "ProductUsenote"
Dim intDIV
Dim strDIV
Dim strOT1
Dim strOT2
Dim intPOS
'*
'* Declare Objects
'*
Dim objFSO
Set objFSO = CreateObject("Scripting.FileSystemObject")
Dim objOT1
Set objOT1 = objFSO.OpenTextFile(cOT1,1)
Dim objOT2
Set objOT2 = objFSO.OpenTextFile(cOT2,2,True)
'*
'* Read File, Extract "div", Write Line
'*
strOT1 = objOT1.ReadAll()
For intDIV = 1 To UBound(arrDIV)
  strOT2 = Mid(strOT1,intBEG)
  strDIV = Replace(arrDIV(0),"?",arrDIV(intDIV))
  intPOS = InStr(strOT2,strDIV)
  If intPOS > 0 Then
    intBEG = intBEG + intPOS - 1 'absolute position of this div in strOT1
    strOT2 = Mid(strOT2,intPOS)
    intPOS = InStr(strOT2,cDIV)
    strOT2 = Left(strOT2,intPOS+Len(cDIV)-1)
    objOT2.WriteLine(strOT2 & vbCrLf)
    intBEG = intBEG + intPOS + Len(cDIV) - 1 'resume after the closing tag,
                                             'so the repeated class finds
                                             'the next occurrence
  End If
Next
'*
'* Destroy Objects
'*
objOT1.Close
objOT2.Close
Set objOT1 = Nothing
Set objOT2 = Nothing
Set objFSO = Nothing
'*
'* Done!
'*
MsgBox "Done!",vbInformation,cVBS

You could modify it to loop through a list or folder of files.
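
For example (the folder path here is a placeholder), the loop-through-a-folder
version might start like this, with the extraction loop above run once per
file:

```vbscript
' Run the same extraction against every HTML file in a folder
Dim objFSO2, objFolder, objFile, strHTML
Set objFSO2 = CreateObject("Scripting.FileSystemObject")
Set objFolder = objFSO2.GetFolder("C:\courses") 'input folder - adjust
For Each objFile In objFolder.Files
  If LCase(objFSO2.GetExtensionName(objFile.Name)) = "html" Then
    strHTML = objFSO2.OpenTextFile(objFile.Path, 1).ReadAll()
    ' ... run the div-extraction loop above against strHTML,
    '     writing results to a per-course output file or a database ...
  End If
Next
```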

Note that each "class=" is defined in the stylesheet:
<link rel="stylesheet" href="../resource/mlcatstyle.css"
type="text/css">
which you should refer to when using their div's.
 
Rob Meade

...

Hi McKirahan, thank you again for your reply and example.

I should add that I won't be writing these out to another file; instead it'll
need to do it on the fly, i.e. take the original source page by the code
passed in the URL, read in the appropriate parts, and then spit out my own
layout and extra parts.

With the example you posted (below) - does it extract what's between the DIV
tags, i.e. the <tr>'s and <td>'s as well, or just the actual "text"?

Thanks again

Rob
PS: The copyright one can be excluded.
PPS: When I say it's going to happen on the fly, this would obviously depend
on how quick and efficient it is - if it turns out that, because of the
number of hits they get on the site in question, it's a bit too slow, then I
might have to have some kind of "import" process, which would obviously make
more sense anyway - this could then create new pages, or perhaps store the
information in the database.
 
McKirahan

Rob Meade said:
...

Hi McKirahan, thank you again for your reply and example.

I should add that I won't be writing these out to another file; instead it'll
need to do it on the fly, i.e. take the original source page by the code
passed in the URL, read in the appropriate parts, and then spit out my own
layout and extra parts.

With the example you posted (below) - does it extract what's between the DIV
tags, i.e. the <tr>'s and <td>'s as well, or just the actual "text"?

Thanks again

Rob
PS: The copyright one can be excluded.
PPS: When I say it's going to happen on the fly, this would obviously depend
on how quick and efficient it is - if it turns out that, because of the
number of hits they get on the site in question, it's a bit too slow, then I
might have to have some kind of "import" process, which would obviously make
more sense anyway - this could then create new pages, or perhaps store the
information in the database.

Did you try it as-is to see what you get?

I would probably put all 1300 files (pages) in a single folder.
Then run a process against each to generate 1300 new files in
a different folder. These would be posted for quick access.

Prior to posting, they could be reviewed for accuracy.

Also, instead of extracting out the div's you could just identify
where you want your stuff inserted.
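
That could be as simple as a couple of string operations - here the marker
string and the pricing markup are made up purely for illustration:

```vbscript
' Insert a pricing block just before the legal section, and drop
' everything from the legal section onwards (copyright / terms of use).
' Assumes strOT1 holds the raw page, as in the extraction script above.
Dim intLegal, strPricing
strPricing = "<div class=""Pricing"">price and Add to Cart markup here</div>"

intLegal = InStr(strOT1, "<table class=""Legal""")
If intLegal > 0 Then
  strOT1 = Left(strOT1, intLegal - 1) & strPricing & _
           vbCrLf & "</body>" & vbCrLf & "</html>"
End If
```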
 
Rob Meade

...
Did you try it as-is to see what you get?

Hi McKirahan, thanks for your reply.

Not as of yet, no - but I'm home this weekend so will be giving it a go :o)
I would probably put all 1300 files (pages) in a single folder.

They come in a /courses directory
Then run a process against each to generate 1300 new files in
a different folder. These would be posted for quick access.

I think I might have to change the process a bit but the idea is the same -
the content provider has other bits that link to these files, so they'd
still need to be in a /courses directory, but I could put them somewhere
else first, "mangle" them and then spit them out to the /courses directory
:o)
Prior to posting, they could be reviewed for accuracy.

I might check a couple - but not all 1300 - I don't wanna go mental... :oD
Also, instead of extracting out the div's you could just identify
where you want your stuff inserted.

Yeah, but there were bits I needed to lose, ie the copyright section etc..

I seem to remember a long time back a discussion about transforming pages -
I think it might have been done in an ISAPI filter or something, not sure.
From what I remember, the requested page would get grabbed, actions happen,
and then it can be spat out as a different page. I wonder if this is what
the previous company that did this adopted, because I find it hard to
believe they would have created 1300 ASP files - and yet all of the links on
the original site were <course-code>.asp as opposed to the real file
<course-code>.html - if you see what I mean...

Regards

Rob
 
McKirahan

[snip]
I seem to remember a long time back a discussion about transforming pages -
I think it might have been done in an ISAPI filter or something, not sure.
From what I remember, the requested page would get grabbed, actions happen,
and then it can be spat out as a different page. I wonder if this is what
the previous company that did this adopted, because I find it hard to
believe they would have created 1300 ASP files - and yet all of the links on
the original site were <course-code>.asp as opposed to the real file
<course-code>.html - if you see what I mean...

An approach they could have taken was to store the "sections" in a database
table -- one memo field per section -- then generate static pages from it.

Thus, the header, navigation, and footer could be modified independently.
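
Untested, and with the connection string, table, and column names invented
for the example, that one-memo-field-per-section approach might look like:

```vbscript
' One row per course, one memo/text column per section
Dim conn, rs
Set conn = CreateObject("ADODB.Connection")
conn.Open "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\data\courses.mdb"

Set rs = CreateObject("ADODB.Recordset")
rs.Open "Courses", conn, 1, 3, 2 '1=adOpenKeyset, 3=adLockOptimistic, 2=adCmdTable
rs.AddNew
rs("CourseCode") = "560c04"      'values produced by the extraction step
rs("Title")      = strTitle      '(not shown here)
rs("Summary")    = strSummary
rs("Topics")     = strTopics
rs("TechReqs")   = strTechReqs
rs.Update
rs.Close
conn.Close
```

Static pages (or the on-the-fly ASP page) could then be generated from the
table, with the header, navigation, and footer kept separate.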
 
Rob Meade

...
An approach they could have taken was to store the "sections" in a database
table -- one memo field per section -- then generate static pages from it.

Thus, the header, navigation, and footer could be modified independently.

I suspect the company does have this, but they most likely use it for the
generation of these files, which they then sell on etc...

The one thing I do have missing at the moment is a nice file that ties the
<course_code>.html file names (or just the codes) to the titles of the
courses!

They give you a "contents.html" file which has all of the courses listed and
the codes/files as hyperlinks - but again, it would mean parsing the entire
file to get at the goodies. I'm going to ask them if they have the same
thing in XML or a database or something, to hopefully make that a bit
easier.

Thanks again for your help - alas, due to my 9-month-old son, I have yet to
get around to trying your example! But I will :o)

Rob
 
Mike Brind

When you do get to try Rob's code, you will see that it opens a number
of possibilities - one of which is to insert the contents of the divs
into a database instead of writing them to 1300 text files. I really
can't understand why this is not at the top of your list of options -
manage 1300 files...? or manage 1? Hmmmm.... But then you obviously
know a lot more about your project than I do :)

If you are using Rob's code, you can insert this into it:

If intDIV = 2 Then
  Dim re, m, myMatches, pcode
  Set re = New RegExp
  With re
    .Pattern = "Product Code: </span>\s+([a-z0-9]{6})"
    .IgnoreCase = True
    .Global = True
  End With
  Set myMatches = re.Execute(strOT2)
  For Each m In myMatches
    If m.Value <> "" Then
      pcode = Replace(m.Value,"Product Code: </span>","")
      pcode = Replace(pcode," ","")
      pcode = Replace(pcode,Chr(10),"")
      pcode = Replace(pcode,Chr(13),"")
      Response.Write pcode 'or write to db
    End If
  Next
  Set re = Nothing
End If

And that will return the Product Code on its own. Change the pattern
to "<title>[^<]*</title>" and you get the title stripped out too.
 
Rob Meade

...
When you do get to try Rob's code, you will see that it opens a number
of possibilities - one of which is to insert the contents of the divs
into a database instead of writing them to 1300 text files. I really
can't understand why this is not at the top of your list of options -
manage 1300 files...? or manage 1? Hmmmm.... But then you obviously
know a lot more about your project than I do :)

If you are using Rob's code, you can insert this into it:

If intDIV = 2 Then
  Dim re, m, myMatches, pcode
  Set re = New RegExp
  With re
    .Pattern = "Product Code: </span>\s+([a-z0-9]{6})"
    .IgnoreCase = True
    .Global = True
  End With
  Set myMatches = re.Execute(strOT2)
  For Each m In myMatches
    If m.Value <> "" Then
      pcode = Replace(m.Value,"Product Code: </span>","")
      pcode = Replace(pcode," ","")
      pcode = Replace(pcode,Chr(10),"")
      pcode = Replace(pcode,Chr(13),"")
      Response.Write pcode 'or write to db
    End If
  Next
  Set re = Nothing
End If

And that will return the Product Code on its own. Change the pattern
to "<title>[^<]*</title>" and you get the title stripped out too.

Hi Mike,

Thanks for your reply - something else to try with it - very much
appreciated, thank you.

Regards

Rob
PS: It's McKirahan's code ;o)
 
