HTML Parsing and Indexing

mailtogops

Hi All,

I am involved in a project that collects news published on selected, known web sites in the form of HTML, RSS, etc., shortlists the items, and creates bookmarks to the news content on our own website (we will use Django for the web development). The project is currently under heavy development.

I need help with an HTML parser.

I can download the web pages from the target sites; then I have to parse them. Since they are all HTML pages with different styles and tags, it is very hard to extract the data. Our plan is to define one or more rules for each website and run the parser based on those rules; we can even write a small amount of code per site if required. However, the crawler, parser and indexer need to run unattended, and I don't know how to proceed.

I looked at a couple of Python parsers such as pyparsing, yappy and yapps, but I did not find examples of HTML parsing. Someone recommended using "lynx" to convert each page to text and then parsing that. That also looks reasonable, but I would still end up writing a huge chunk of code for each web page.

What we need is a parser that works on an HTML or text file (e.g. lynx output), applies a given set of rules, and returns a result. (Do I need magic to do this? :-( )

Sorry about my English.

Thanks & Regards,

Krish
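
One way to picture the plan above (purely hypothetical names, a sketch of the idea rather than working crawler code): each per-site rule is a small function that takes downloaded HTML and returns (title, url) pairs, and the unattended run is just a loop over a table of such rules.

    def parse_example_site(html):
        # Site-specific extraction would go here (BeautifulSoup, lxml, ...).
        return [("Sample headline", "http://example.com/story/1")]

    RULES = {
        "http://example.com/news": parse_example_site,
        # one entry per target site
    }

    def run_once(fetch):
        # fetch(url) -> HTML string; returns all bookmarks found this pass.
        bookmarks = []
        for url, rule in RULES.items():
            bookmarks.extend(rule(fetch(url)))
        return bookmarks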
 
Bernard

A combination of urllib, urllib2 and BeautifulSoup should do it. Read BeautifulSoup's documentation to learn how to browse through the DOM.
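
A minimal sketch of that combination (Python 2, since urllib2 is named; the URL and the h2/anchor structure are assumptions about the target page, i.e. the kind of per-site rule mentioned above):

    import urllib2
    from BeautifulSoup import BeautifulSoup

    url = 'http://example.com/news'          # placeholder news page
    html = urllib2.urlopen(url).read()
    soup = BeautifulSoup(html)

    # Pull out headline links; the tag structure is a site-specific guess,
    # not something BeautifulSoup can work out on its own.
    for heading in soup.findAll('h2'):
        link = heading.find('a')
        if link is not None and link.get('href'):
            print link['href'], link.string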

Andy Dingley

I am involved in a project that collects news published on selected, known web sites in the form of HTML, RSS, etc.

I just can't imagine why anyone would still want to do this.

With RSS, it's an easy (if not trivial) problem.

With HTML it's hard, it's unstable, and the legality of recycling
others' content like this is far from clear. Are you _sure_ there's
still a need to do this thoroughly awkward task? How many sites are
there that are worth scraping, permit scraping, and don't yet offer RSS?
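
For the RSS side this really is only a few lines, e.g. with the feedparser library (not mentioned in the thread, just one common choice; the feed URL is a placeholder):

    import feedparser

    # Download and parse the feed, then list its entries.
    feed = feedparser.parse('http://example.com/news/rss.xml')
    for entry in feed.entries:
        print entry.title, entry.link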
 
Stefan Behnel

I am involved in a project that collects news published on selected, known web sites in the form of HTML, RSS, etc., shortlists the items, and creates bookmarks to the news content on our own website (we will use Django for the web development). The project is currently under heavy development.

I need help with an HTML parser.

lxml includes an HTML parser which can parse straight from URLs.

http://codespeak.net/lxml/
http://cheeseshop.python.org/pypi/lxml

Stefan
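
A minimal sketch of that, with a placeholder URL and a site-specific XPath expression standing in for the per-site rule:

    from lxml import html

    # lxml.html.parse() accepts a URL directly.
    tree = html.parse('http://example.com/news')

    # Extract headline links; the XPath is the part you write per site.
    for link in tree.xpath('//h2/a'):
        print link.get('href'), link.text_content()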
 
Paul McGuire

I need help with an HTML parser.
I looked at a couple of Python parsers such as pyparsing, yappy and yapps, but I did not find examples of HTML parsing.

Geez, how hard did you look? pyparsing's wiki menu includes an
'Examples' link, which takes you to a page of examples, including three
that deal with scraping HTML. You can view the examples right in the
wiki without even downloading the package (of course, you *would* have
to download it to actually run them).

-- Paul
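
In the same spirit as those examples, a small sketch using pyparsing's makeHTMLTags (the sample HTML below is made up for illustration):

    from pyparsing import makeHTMLTags, SkipTo

    # makeHTMLTags builds expressions for an <a>...</a> pair and exposes
    # attributes such as href as named results.
    aStart, aEnd = makeHTMLTags("a")
    link = aStart + SkipTo(aEnd).setResultsName("body") + aEnd

    sample = '<p><a href="/story/1">First headline</a> and <a href="/story/2">Second</a></p>'
    for match in link.searchString(sample):
        print match.href, '->', match.body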
 
