HTML Parsing and Indexing

mailtogops

Hi All,

I am involved in a project that collects news published on selected, known web sites in the form of HTML, RSS, etc., shortlists the items, and creates bookmarks to the news content on our own website (we will use Django for the web development). The project is currently under heavy development.

I need help with an HTML parser.

I can download the web pages from the target sites; then I have to parse them. Since they are all HTML pages with different styles and tags, it is very hard to extract the data. Our plan is to define one or more rules for each website and run the parser based on those rules; we can even write a small amount of code per site if required. However, the crawler, parser and indexer need to run unattended, and I don't know how to proceed.

I looked at a couple of Python parsers such as pyparsing, yappy and yapps, but I did not find examples of HTML parsing. Someone recommended using "lynx" to convert each page to text and then parsing that. That also looks reasonable, but I would still end up writing a huge chunk of code for each web page.

What we need is a parser that works on an HTML or text file (e.g. lynx output), applies a given set of rules, and returns a result. (Do I need magic to do this? :-( )

Sorry about my English.

Thanks & Regards,

Krish
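
One way to picture the plan above (purely hypothetical names, a sketch of the idea rather than working crawler code): each per-site rule is a small function that takes downloaded HTML and returns (title, url) pairs, and the unattended run is just a loop over a table of such rules.

    def parse_example_site(html):
        # Site-specific extraction would go here (BeautifulSoup, lxml, ...).
        return [("Sample headline", "http://example.com/story/1")]

    RULES = {
        "http://example.com/news": parse_example_site,
        # one entry per target site
    }

    def run_once(fetch):
        # fetch(url) -> HTML string; returns all bookmarks found this pass.
        bookmarks = []
        for url, rule in RULES.items():
            bookmarks.extend(rule(fetch(url)))
        return bookmarks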
 
Bernard

A combination of urllib, urllib2 and BeautifulSoup should do it. Read BeautifulSoup's documentation to learn how to browse through the DOM.
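
A minimal sketch of that combination (Python 2, since urllib2 is named; the URL and the h2/anchor structure are assumptions about the target page, i.e. the kind of per-site rule mentioned above):

    import urllib2
    from BeautifulSoup import BeautifulSoup

    url = 'http://example.com/news'          # placeholder news page
    html = urllib2.urlopen(url).read()
    soup = BeautifulSoup(html)

    # Pull out headline links; the tag structure is a site-specific guess,
    # not something BeautifulSoup can work out on its own.
    for heading in soup.findAll('h2'):
        link = heading.find('a')
        if link is not None and link.get('href'):
            print link['href'], link.string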

Andy Dingley

I am involved in a project that collects news published on selected, known web sites in the form of HTML, RSS, etc.

I just can't imagine why anyone would still want to do this.

With RSS, it's an easy (if not trivial) problem.

With HTML it's hard, it's unstable, and the legality of recycling
others' content like this is far from clear. Are you _sure_ there's
still a need to do this thoroughly awkward task? How many sites are
there that are worth scraping, permit scraping, and don't yet offer RSS?
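
For the RSS side this really is only a few lines, e.g. with the feedparser library (not mentioned in the thread, just one common choice; the feed URL is a placeholder):

    import feedparser

    # Download and parse the feed, then list its entries.
    feed = feedparser.parse('http://example.com/news/rss.xml')
    for entry in feed.entries:
        print entry.title, entry.link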
 
Stefan Behnel

I am involved in a project that collects news published on selected, known web sites in the form of HTML, RSS, etc., shortlists the items, and creates bookmarks to the news content on our own website (we will use Django for the web development). The project is currently under heavy development.

I need help with an HTML parser.

lxml includes an HTML parser which can parse straight from URLs.

http://codespeak.net/lxml/
http://cheeseshop.python.org/pypi/lxml

Stefan
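
A minimal sketch of that, with a placeholder URL and a site-specific XPath expression standing in for the per-site rule:

    from lxml import html

    # lxml.html.parse() accepts a URL directly.
    tree = html.parse('http://example.com/news')

    # Extract headline links; the XPath is the part you write per site.
    for link in tree.xpath('//h2/a'):
        print link.get('href'), link.text_content()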
 
Paul McGuire

I need help with an HTML parser.
I looked at a couple of Python parsers such as pyparsing, yappy and yapps, but I did not find examples of HTML parsing.

Geez, how hard did you look? pyparsing's wiki menu includes an
'Examples' link, which takes you to a page of examples, including three
that deal with scraping HTML. You can view the examples right in the
wiki without even downloading the package (of course, you *would* have
to download it to actually run them).

-- Paul
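
In the same spirit as those examples, a small sketch using pyparsing's makeHTMLTags (the sample HTML below is made up for illustration):

    from pyparsing import makeHTMLTags, SkipTo

    # makeHTMLTags builds expressions for an <a>...</a> pair and exposes
    # attributes such as href as named results.
    aStart, aEnd = makeHTMLTags("a")
    link = aStart + SkipTo(aEnd).setResultsName("body") + aEnd

    sample = '<p><a href="/story/1">First headline</a> and <a href="/story/2">Second</a></p>'
    for match in link.searchString(sample):
        print match.href, '->', match.body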
 
