Advice for HTML project


Jonathan Bale

I'm helping my boss with some scripting for a web analysis research
project. He handles vocabulary and analysis, while I am using Ruby to parse
WARC files and the actual HTML.

Anyway, I'm still fairly new to Ruby. I did the WARC parsing, but I was
wondering what I should use for the HTML parser. (Didn't want to
re-invent that wheel.) Some considerations:

* Mainly we just need to pull the content text out of the HTML
* A few tags might have special weight or significance (h1, etc.)
* Unfortunately, nearly all the HTML is broken, because all our test
data was provided by this software that truncates the data after a
certain length.
 

Marc Weber

Excerpts from Jonathan Bale's message of Wed Jul 14 01:38:31 +0200 2010:
> I'm helping my boss with some scripting for a web analysis research
> project. He handles vocabulary and analysis, while I am using Ruby to parse
> WARC files and the actual HTML.
>
> Anyway, I'm still fairly new to Ruby. I did the WARC parsing, but I was
> wondering what I should use for the HTML parser. (Didn't want to
> re-invent that wheel.) Some considerations:
Google for Nokogiri. That's one solution.

Marc Weber
 
