HTML parsing as good as Perl's

TLOlczyk

First, let me be very clear: I hate the language that Larry "should be
lined up against a" Wall has written. IMO it encourages people
to program with, well, only men can program that way, instead of their
heads.

However, as bad as the language is, LWP is one of the best libraries
around when it comes to web-related applications. Most notably,
I have never found a library that can parse HTML as well as
LWP's HTML parser. It is my eternal hope that I can find a library as
good and dump the language.


With the advent of Ruby on Rails, I am hopeful that there might be a
package in Ruby that gives Perl's HTML parser a run for its money.

I'm not looking for an XML parser; XML parsers just can't handle
many of the web sites I want to parse. Neither can expat, libxml2,
or some of the more popular libraries. Don't suggest I pass it through
Tidy and then parse the XML. There are a lot of pages that Tidy can't
handle.
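
To make the complaint concrete, here is a minimal sketch of the gap between a strict XML parser and a lenient HTML tokenizer. It is written in Python only because its standard library ships both kinds of parser; it is not a Ruby answer to the question. The sample markup and the names `SOUP`, `LinkAndTextCollector`, `strict_xml_accepts`, and `scrape` are invented for illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical tag soup: unclosed <br> and <b>, plus an unquoted attribute value.
SOUP = "<p>Price:<br><b>42 <a href=/buy>buy now</a></p>"


class LinkAndTextCollector(HTMLParser):
    """Collects href values and visible text without requiring well-formed markup."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; keep only the link targets.
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

    def handle_data(self, data):
        # Keep non-blank text runs, trimmed of surrounding whitespace.
        if data.strip():
            self.text.append(data.strip())


def strict_xml_accepts(markup):
    """Return True only if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def scrape(markup):
    """Run the lenient tokenizer and return (links, text runs)."""
    collector = LinkAndTextCollector()
    collector.feed(markup)
    collector.close()
    return collector.links, collector.text


if __name__ == "__main__":
    print(strict_xml_accepts(SOUP))  # False: the XML parser rejects the page
    print(scrape(SOUP))              # (['/buy'], ['Price:', '42', 'buy now'])
```

The lenient tokenizer never builds a tree, so it has nothing to reject; that is the property a scraping library needs, whatever language it is written in.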

Finally, there will be some smartass who will say that I should use
web sites that are written in good HTML. I don't have a choice of which
pages I, or the people who ask me to write scripts, take our content
from. Fine. If you have the millions to pay all those webmasters to
hire HTML gurus who will generate good HTML, let me know and
I will email you a list. As for me, I am too busy with real work on my
own projects to go around nagging people working on other things to
improve their coding style.

Thanks







The reply-to email address is (e-mail address removed).
This is an address I ignore.
To reply via email, remove 2002 and change yahoo to
interaccess,

**
Thaddeus L. Olczyk, PhD

There is a difference between
*thinking* you know something,
and *knowing* you know something.
 
James Britt

TLOlczyk said:
First, let me be very clear: I hate the language that Larry "should be
lined up against a" Wall has written. IMO it encourages people
to program with, well, only men can program that way, instead of their
heads.

However, as bad as the language is, LWP is one of the best libraries
around when it comes to web-related applications. Most notably,
I have never found a library that can parse HTML as well as
LWP's HTML parser. It is my eternal hope that I can find a library as
good and dump the language.


With the advent of Ruby on Rails, I am hopeful that there might be a
package in Ruby that gives Perl's HTML parser a run for its money.

Look at Narf, and its htmltools and xmltree.
Or Michael Neumann's Mechanize. It wraps htmltools and xmltree.


James

--

http://www.ruby-doc.org - The Ruby Documentation Site
http://www.rubyxml.com - News, Articles, and Listings for Ruby & XML
http://www.rubystuff.com - The Ruby Store for Ruby Stuff
http://www.jamesbritt.com - Playing with Better Toys
 
Mark Thomas

TLOlczyk said:
I'm not looking for an XML parser; XML parsers just can't handle
many of the web sites I want to parse. Neither can expat, libxml2,
or some of the more popular libraries.

Have you tried libxml2 in parse_html mode with the recover option on?
I've never had a problem with any site. It handles broken, nasty HTML
quite nicely.

(Disclaimer: I don't know if the Ruby bindings expose this
functionality).
 
Daniel Amelang

I did a poor man's port of BeautifulSoup once...if there's enough
interest, we could turn it into something useful. I assume you're
doing some screen scraping thing?

Here's the original BeautifulSoup. Look like what you need?

http://www.crummy.com/software/BeautifulSoup/

Would anyone be interested either as a user or a developer?

Dan
 
James Edward Gray II

Daniel Amelang said:
I did a poor man's port of BeautifulSoup once...if there's enough
interest, we could turn it into something useful. I assume you're
doing some screen scraping thing?

Here's the original BeautifulSoup. Look like what you need?

http://www.crummy.com/software/BeautifulSoup/

Would anyone be interested either as a user or a developer?

I'm not a Python guy, so I don't know the library. However, I just
browsed through the site and if you ask me, it looks downright handy.

James Edward Gray II
 
Ezra Zygmuntowicz

James Edward Gray II said:
I'm not a Python guy, so I don't know the library. However, I just
browsed through the site and if you ask me, it looks downright handy.

James Edward Gray II

+1 I would use it

-Ezra Zygmuntowicz
Yakima Herald-Republic
WebMaster
509-577-7732
(e-mail address removed)
 
