Regular Expression Validator (screen-scrape)

Sparky Arbuckle

I'm trying to scrape the news from a page on my university's server.
The HTML below is what I need to capture. I'm not very good with regular
expressions, so could someone either explain how to work out the pattern
or suggest a good tutorial site?
Thanks in advance!

<td width="70">4/12/2005</td><td>&nbsp;</td><td><b><a
href="/ucomm_news/articles/822.asp">Fairhaven College Hosts Human
Rights Film Festival April 13-17; April
20-24</a></b></td></tr><tr><td>&nbsp;</td><td>&nbsp;</td><td><P>BELLINGHAM
- Western Washington University's Fairhaven College will host the
Human Rights Film Festival on April 13-17 and April 20-24.</P>
</td></tr><tr><td colspan="3">
 
Sparky Arbuckle

I've simplified the code above.

I need everything from:

<td width="70"> to <td colspan="3">

Wouldn't it look something like:

<td width="70">(.\n)*?td colspan="3">

??
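A note on the guess above: `(.\n)*?` repeats the pair "one character, then a newline", so it only matches text where every character is followed by a line break. An alternation, `(.|\n)*?`, is probably what was intended. The sketch below uses Python rather than the thread's VB because it is easy to run on its own; the pattern syntax shown is the same in .NET. The `html` string is a shortened, hypothetical stand-in for the real page.

```python
import re

# Shortened, hypothetical stand-in for the scraped HTML.
html = (
    '<td width="70">4/12/2005</td><td><b><a\n'
    'href="/ucomm_news/articles/822.asp">Film Festival</a></b></td>\n'
    '</td></tr><tr><td colspan="3">'
)

# The guess from the post: each repetition must be one character
# immediately followed by a newline.
guess = re.compile(r'<td width="70">(.\n)*?td colspan="3">')

# Alternation: each repetition is either one non-newline character
# or a newline, so the match can span several lines.
fixed = re.compile(r'<td width="70">(?:.|\n)*?<td colspan="3">')

guess_match = guess.search(html)
fixed_match = fixed.search(html)

print(guess_match)              # None: the guess never matches
print(fixed_match is not None)  # True
```

The lazy `*?` keeps the match as short as possible, so it stops at the first closing marker.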
 
Juan T. Llibre

Hi, Sparky,

Try this:

<td width="70"[^>]*>(.*?)<td colspan="3">
or this:
<td width="70">[^>]*>(.*?)<td colspan="3">
(I'm not exactly sure which of the two.)

Note that the *first* <td colspan="3"> tag will end the match.

The general rule is:

<tag[^>]*>(.*?)<endtag>
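One caveat with the general rule: by default `.` does not match a newline, so `(.*?)` stops at the end of a line. In .NET the option that changes this is `RegexOptions.Singleline`; the Python sketch below uses the equivalent `re.DOTALL` flag, and it also shows the lazy match ending at the first closing tag. The two-item `html` string is a made-up sample, not the real page.

```python
import re

# Hypothetical sample with two news items; only the first spans two lines.
html = (
    '<td width="70">4/12/2005</td><td>first item\n'
    'continues here</td></tr><tr><td colspan="3">'
    '<td width="70">4/13/2005</td><td>second item</td></tr><tr><td colspan="3">'
)

pattern = r'<td width="70"[^>]*>(.*?)<td colspan="3">'

# Without the flag, "." cannot cross the line break inside the first item,
# so only the single-line second item is found.
without_flag = re.findall(pattern, html)

# With DOTALL, "." also matches newlines, and the lazy quantifier stops
# at the first closing marker, giving one capture per item.
with_flag = re.findall(pattern, html, re.DOTALL)

print(len(without_flag), len(with_flag))
```

If both items were captured as one long string instead, that would usually point to a greedy `.*` in place of the lazy `.*?`.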
 
Sparky Arbuckle

Thanks Juan! I got it to work by using:

lblOutput.text = funScrape(strHTML, "<td width=(.)*?<td colspan=")


Now I am trying to take it a step further. I have created a For Each
loop in my code to edit the following HTML so that I can remove
everything from the <P> tag to the </P> tag.

<td width="70">4/20/2005</td><td>&nbsp;</td><td><b><a
href="/ucomm_news/articles/834.asp">Fairhaven College to Host World
Issues Forums April 25, 27</a></b></td></tr><tr><td>&nbsp;</td><td>&nbsp;</td><td><P>BELLINGHAM
- Western Washington University's Fairhaven College will host two World
Issues Forums on April 25 and April 27. The forums are free and open to
the public.</P>
</td>

Ultimately I want to display only the date (<td width="70">DATE</td>) and
the hyperlink. I'm trying to use this For Each loop:


If objMatchCollection.Count > 0 Then
    For Each objMatch In objMatchCollection
        iStart = InStr(objMatch.Value, ">")
        iEnd = InStr(iStart, objMatch.Value, "</P>")
        iStart2 = InStr(objMatch.Value, "<P>")
        iEnd2 = Len(objMatch.Value)
        Response.Write(objMatch.Index & objMatch.Value)
    Next
Else
    Response.Write("No matches for " & strPattern)
End If


Do I need to clarify or does this make sense?
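One possible alternative to slicing each match with InStr is to capture the date and the link directly with groups, or to delete the <P>...</P> block with a replace. The sketch below is in Python because it runs on its own; named groups and a regex replace are also available in .NET. The group names (`date`, `href`, `title`) and the shortened `html` sample are assumptions made for illustration, not part of the original code.

```python
import re

# Shortened, hypothetical stand-in for one scraped news item.
html = (
    '<td width="70">4/20/2005</td><td>&nbsp;</td><td><b><a\n'
    'href="/ucomm_news/articles/834.asp">Fairhaven College to Host World\n'
    'Issues Forums April 25, 27</a></b></td></tr><tr><td>&nbsp;</td>'
    '<td>&nbsp;</td><td><P>Summary paragraph to drop.</P>\n</td>'
)

item = re.compile(
    r'<td width="70">(?P<date>[^<]*)</td>'  # date cell
    r'.*?'                                  # skip the spacer cell
    r'<a\s+href="(?P<href>[^"]*)">'         # link target
    r'(?P<title>.*?)</a>',                  # link text, may wrap lines
    re.DOTALL | re.IGNORECASE,
)

results = []
for m in item.finditer(html):
    # Collapse the line breaks inside the link text into single spaces.
    title = " ".join(m.group("title").split())
    results.append((m.group("date"), m.group("href"), title))

# Alternative: keep the markup but remove everything from <P> to </P>.
stripped = re.sub(r'<P>.*?</P>', '', html, flags=re.DOTALL | re.IGNORECASE)

print(results)
```

Capturing only the wanted pieces means the summary paragraph never has to be removed. The `re.sub` line is there for the case where the surrounding markup should be kept.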
 
