George Durzi
I'd like to screen-scrape company news from CBS MarketWatch. Consider this
URL as an example:
http://cbs.marketwatch.com/tools/quotes/news.asp?symb=MSFT
When you browse there, there are two sections: 1. News Headlines for
Microsoft Corporation, and 2. Press Releases about Microsoft Corporation.
I've already written the code to post to the page and grab the HTML into a
string. If you view the source of the page linked above, here's an
excerpt of how the news headlines look:
<TABLE WIDTH="100%" CELLPADDING="0" CELLSPACING="0" border="0" ID="Table1">
<?xml version="1.0" encoding="UTF-16" ?>
<TR class="tb01">
<TD COLSPAN="4" height="20">
<A class="lk03"
href="/tools/quotes/news.asp?siteid=mktw&symb=MSFT&property=sid&value=3140&doctype=2006">News Headlines for Microsoft Corporation (MSFT)</A>
</TD>
</TR>
<TR>
<TD NOWRAP="TRUE" width="110" valign="top">12:58pm 02/13/04</TD>
<TD valign="top">
<A class="lk01"
HREF="/news/story.asp?guid=%7B01470A47%2D936B%2D444D%2DB6FC%2DD111A9E61EE4%7D&siteid=mktw&">Market Snapshot</A>
</td>
</TR>
</TABLE>
What I'd like to do is create a dataset (or anything else I can bind to a
datagrid) containing the news items.
I noticed that the table enclosing the news items contains an XML
declaration: <?xml version="1.0" encoding="UTF-16" ?>
Would this give me an easy way to navigate this HTML as XML?
If not, what tools can I use to do this? Regular expressions?
Any tips are greatly appreciated.
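
For illustration, here is a minimal sketch of the kind of extraction described above. It is written in Python with the standard library's html.parser as a stand-in, not in the .NET code the post has in mind, and it assumes every news row looks like the excerpt: one cell with the timestamp followed by one cell holding a link. The excerpt as shown is unlikely to load in a strict XML parser (the declaration sits inside the table and the href values contain unescaped & characters), so the sketch uses a tolerant HTML parser instead. The names NewsRowParser, parse_news and SAMPLE are hypothetical.

```python
from html.parser import HTMLParser

# Sample markup copied from the excerpt in the post (wrapped hrefs rejoined).
SAMPLE = """<TABLE WIDTH="100%" CELLPADDING="0" CELLSPACING="0" border="0" ID="Table1">
<?xml version="1.0" encoding="UTF-16" ?>
<TR class="tb01">
<TD COLSPAN="4" height="20">
<A class="lk03"
href="/tools/quotes/news.asp?siteid=mktw&symb=MSFT&property=sid&value=3140&doctype=2006">News Headlines for Microsoft Corporation (MSFT)</A>
</TD>
</TR>
<TR>
<TD NOWRAP="TRUE" width="110" valign="top">12:58pm 02/13/04</TD>
<TD valign="top">
<A class="lk01"
HREF="/news/story.asp?guid=%7B01470A47%2D936B%2D444D%2DB6FC%2DD111A9E61EE4%7D&siteid=mktw&">Market Snapshot</A>
</td>
</TR>
</TABLE>"""


class NewsRowParser(HTMLParser):
    """Collects one dict per table row that has a timestamp cell and a linked headline cell."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []          # extracted news items
        self._cells = None      # text of each <td> in the current <tr>, or None outside a row
        self._href = None       # last link seen inside the current row
        self._in_cell = False   # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, so TR/TD/HREF all match.
        if tag == "tr":
            self._cells, self._href = [], None
        elif tag == "td" and self._cells is not None:
            self._cells.append("")
            self._in_cell = True
        elif tag == "a" and self._in_cell:
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._in_cell:
            self._cells[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._cells is not None:
            cells = [c.strip() for c in self._cells]
            # Assumption: a news row has exactly two cells and a link;
            # the single-cell section header row is skipped.
            if len(cells) == 2 and self._href:
                self.rows.append(
                    {"time": cells[0], "headline": cells[1], "link": self._href}
                )
            self._cells = None


def parse_news(html_text):
    """Return a list of {'time', 'headline', 'link'} dicts found in html_text."""
    parser = NewsRowParser()
    parser.feed(html_text)
    parser.close()
    return parser.rows


rows = parse_news(SAMPLE)
for row in rows:
    print(row["time"], "|", row["headline"], "|", row["link"])
```

The resulting list of dictionaries plays the role of the table of news items that would be bound to a grid. A regular expression could pull out the same fields, but it would be more sensitive to attribute order, letter case and line wrapping in the markup.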