Parsing HTML

Mohammad-Reza

Hi
I want to parse a web page (in a web service) and retrieve some of its
information. I searched MSDN and found a walkthrough (How to: Create Web
Services That Parse the Contents of a Web Page), but the walkthrough is a
little complex and the writer did not completely describe all the aspects of
the solution.
Could anyone elaborate on this walkthrough, or direct me to another (or
better) way to deal with such a problem?

Thanks in advance.
 
Scott M.

How about using the W3C Document Object Model, which was designed to do just
what you are trying to do?
 
Mohammad-Reza

I want to write a web service that extracts some information from a web page
and use that web service in a Windows application. I think the usual solution
for parsing is a little slow and costs too much (getting the HTML code and
finding the keys using loops). I want to know if there is any way in
.NET to simply extract that information (for example, a method that returns
every HTML tag of the web page with its value).
The processing time of the web service is very important to me.

Thanks in advance.
 
Scott M.

I don't know where you have gotten your information, but this is exactly
what the DOM is for.
 
John Saunders

Scott M. said:
I don't know where you have gotten your information, but this is exactly
what the DOM is for.

Scott,

I used this approach with a Windows Forms application back in 2001, with
.NET 1.0. It worked, but was a bit clumsy, and it was time-consuming. I used
the ActiveX Internet Browser control to load the page I was interested in,
and once the page was loaded, I could access the DOM from C# code. Did you
have a different technique in mind when you talk about the DOM?

Perhaps a faster technique would be to use regular expressions to parse the
HTML and find what you're looking for.
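
As a rough illustration of the regular-expression idea, here is a minimal sketch. The module name, the ExtractTitle helper, and the sample markup are made up for the example; a pattern like this only suits a small, predictable fragment and is likely to break on nested or unusual markup.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module RegexSketch
    ' Hypothetical helper: returns the text inside the first <title> element,
    ' or Nothing if no title is found.
    Function ExtractTitle(html As String) As String
        ' IgnoreCase handles <TITLE>; Singleline lets "." span line breaks.
        Dim m As Match = Regex.Match(html, "<title[^>]*>(.*?)</title>",
                                     RegexOptions.IgnoreCase Or RegexOptions.Singleline)
        If m.Success Then Return m.Groups(1).Value.Trim()
        Return Nothing
    End Function

    Sub Main()
        ' Sample markup only; note that it is ordinary HTML, not well-formed XML.
        Dim html As String = "<html><head><TITLE> Sample page </TITLE></head><body><p>Hello</body></html>"
        Console.WriteLine(ExtractTitle(html))
    End Sub
End Module
```

The regular expression does not care whether the page is well-formed, which is the main attraction here, but every piece of data you want needs its own pattern.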

John
 
Scott M.

What I had in mind was, if the HTML in question was well-formed (XHTML), you
could just load it into an XmlDocument object (from a string) and use the
XML DOM to parse from there.
 
Scott M.

Well, XHTML is XML, so you'd really be loading XML into an XMLDocument, but
once it's loaded, you can parse out whatever you like using the DOM.

Dim xmlDoc As New System.Xml.XmlDocument()
'You can load the XML in one of two ways...

'docPath represents a path to a file containing the XML
xmlDoc.Load(docPath)

'or
'Here you can load a string directly (xhtmlString holds the markup)
xmlDoc.LoadXml(xhtmlString)

'Example of getting all the paragraph tags and then the text of the first
'one using the DOM (XML names are case-sensitive; XHTML uses lowercase "p")...
Dim theParagraphs As System.Xml.XmlNodeList = xmlDoc.GetElementsByTagName("p")
Dim firstParagraphText As String = theParagraphs(0).InnerText
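
For anyone who wants to try this end to end, here is a minimal, self-contained sketch of the same idea. The module name, the GetTexts helper, and the sample XHTML string are assumptions made up for the example; a real page would need to be fetched first and would have to be well-formed.

```vbnet
Imports System
Imports System.Xml

Module XhtmlDomSketch
    ' Hypothetical helper: returns the text of every element with the given tag name.
    Function GetTexts(xhtml As String, tagName As String) As String()
        Dim doc As New XmlDocument()
        doc.LoadXml(xhtml) ' throws XmlException if the markup is not well-formed

        Dim nodes As XmlNodeList = doc.GetElementsByTagName(tagName)
        Dim result(nodes.Count - 1) As String
        For i As Integer = 0 To nodes.Count - 1
            ' InnerText concatenates the text of the element and its children.
            result(i) = nodes(i).InnerText
        Next
        Return result
    End Function

    Sub Main()
        ' Sample document only; no DOCTYPE, so no DTD lookup is involved.
        Dim xhtml As String =
            "<html xmlns=""http://www.w3.org/1999/xhtml""><body>" &
            "<p>First <b>paragraph</b></p><p>Second paragraph</p>" &
            "</body></html>"

        For Each paragraphText As String In GetTexts(xhtml, "p")
            Console.WriteLine(paragraphText)
        Next
    End Sub
End Module
```

GetElementsByTagName matches on the element name as written, so the lowercase "p" finds both paragraphs even though the document declares the XHTML namespace.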


-Scott
 
John Saunders

Scott M. said:
What I had in mind was, if the HTML in question was well-formed (XHTML),
you could just load it into an XmlDocument object (from a string) and use
the XML DOM to parse from there.

That works well for XHTML. The problem is that most web sites are still
using HTML, which is not well-formed XML.
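
To make the difference concrete, here is a small sketch that checks whether a piece of markup can be loaded as XML at all. The module name, the IsWellFormed helper, and the two sample fragments are invented for the illustration.

```vbnet
Imports System
Imports System.Xml

Module WellFormedCheckSketch
    ' Hypothetical helper: True if the markup parses as XML,
    ' False if XmlDocument rejects it.
    Function IsWellFormed(markup As String) As Boolean
        Try
            Dim doc As New XmlDocument()
            doc.LoadXml(markup)
            Return True
        Catch ex As XmlException
            Return False
        End Try
    End Function

    Sub Main()
        ' XHTML style: the empty element is closed, so the fragment loads.
        Console.WriteLine(IsWellFormed("<p>one<br />two</p>"))
        ' Typical HTML: <br> is never closed, so the parser rejects the fragment.
        Console.WriteLine(IsWellFormed("<p>one<br>two</p>"))
    End Sub
End Module
```

A check like this could sit in front of the DOM approach, with a fallback to some other technique when the page turns out not to be well-formed.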

John
 
Scott M.

But we're not talking about most web pages. We are talking about a
particular page that is being used with a web service. In other words, it's
part of the OP's application, which he should have some control over.
 
John Saunders

Scott M. said:
But we're not talking about most web pages. We are talking about a
particular page that is being used with a web service. In other words,
it's part of the OP's application, which he should have some control over.

Sorry, I didn't recall that he said it was his application. I assumed he was
scraping from somebody else's application.

Even though it's his, there may be reasons why he can't guarantee that the
page he needs will be XHTML and will be guaranteed to remain XHTML.

John
 
