beautifulsoup vs. tidy

bruce · Jul 1, 2006

hi...

never used perl, but i have an issue trying to resolve some html that
appears to be "dirty/malformed" regarding the overall structure. in
researching validators, i came across the beautifulsoup app and wanted to
know if anybody could give me pros/cons of the app as it relates to any of
the other validation apps...

the issue i'm facing involves parsing some websites, so i'm trying to
extract information based on the DOM/XPath functions.. i'm using perl to
handle the extraction....

thanks

-bruce

Ravi Teja · Jul 1, 2006

bruce said:
hi...

never used perl, but i have an issue trying to resolve some html that
appears to be "dirty/malformed" regarding the overall structure. in
researching validators, i came across the beautifulsoup app and wanted to
know if anybody could give me pros/cons of the app as it relates to any of
the other validation apps...

the issue i'm facing involves parsing some websites, so i'm trying to
extract information based on the DOM/XPath functions.. i'm using perl to
handle the extraction....

1.) XPath is not a good idea at all with "malformed" HTML or perhaps
web pages in general.
2.) BeautifulSoup is not a validator but works well with bad HTML. Also
look at Mechanize and ClientForm.
3.) XMLStarlet is a good XML validator
(http://xmlstar.sourceforge.net/). It's not Python but you don't need
to care about the language it is written in.
4.) For a simple HTML validator, just use http://validator.w3.org/
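
To illustrate point 2, here is a minimal sketch of pulling links out of badly formed HTML. It assumes the third-party bs4 package (BeautifulSoup 4, the current packaging of the library) is installed; the messy input string is made up for the example.

```python
# Assumes the third-party bs4 package (BeautifulSoup 4) is installed.
from bs4 import BeautifulSoup

# Hypothetical messy input: unclosed <li> and <p> tags.
messy = ('<ul><li><a href="/a">Community</a>'
         '<li><a href="/b">Docs</a></ul><p>unclosed paragraph')

# "html.parser" selects Python's built-in tolerant HTML parser as the backend.
soup = BeautifulSoup(messy, "html.parser")

# Collect the href of every <a> element, in document order.
hrefs = [a["href"] for a in soup.find_all("a")]
print(hrefs)
```

Note that nothing here validates the page; the library only builds a navigable tree from whatever it is given.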

Paddy · Jul 1, 2006

bruce said:
hi...

never used perl, but i have an issue trying to resolve some html that
appears to be "dirty/malformed" regarding the overall structure. in
researching validators, i came across the beautifulsoup app and wanted to
know if anybody could give me pros/cons of the app as it relates to any of
the other validation apps...

I'm not too sure of what you are after. You mention tidy in the subject
which made me think that maybe you were trying to generate well-formed
HTML from malformed web pages that browsers can nonetheless interpret.
If that is the case then try HTML tidy:
http://www.w3.org/People/Raggett/tidy/

- Pad.

Fredrik Lundh · Jul 1, 2006

bruce said:
that's exactly what i'm trying to accomplish... i've used tidy, but it seems
to still generate warnings...

initFile -> tidy -> cleanFile -> perl app (using xpath/libxml)

the xpath/libxml functions in the perl app complain regarding the file.

what exactly do they complain about ?

</F>

Paul Boddie · Jul 1, 2006

Ravi said:
1.) XPath is not a good idea at all with "malformed" HTML or perhaps
web pages in general.

import libxml2dom
import urllib
f = urllib.urlopen("http://wiki.python.org/moin/")
s = f.read()
f.close()
# s contains HTML not XML text
d = libxml2dom.parseString(s, html=1)
# get the community-related links
for label in d.xpath("//li[.//a/text() = 'Community']//li//a/text()"):
    print label.nodeValue

Of course, lxml should be able to do this kind of thing as well. I'd be
interested to know why this "is not a good idea", though.

Paul

Matt Good · Jul 1, 2006

bruce said:
that's exactly what i'm trying to accomplish... i've used tidy, but it seems
to still generate warnings...

initFile -> tidy -> cleanFile -> perl app (using xpath/libxml)

the xpath/libxml functions in the perl app complain regarding the file. my
thought is that tidy isn't cleaning enough, or that the perl xpath/libxml
functions are too strict!

Clean HTML is not valid XML. If you want to process the output with an
XML library you'll need to tell Tidy to output XHTML. Then it should
be valid for XML processing.
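
A small sketch of the difference, using only Python's standard library. The two fragments and the helper name parses_as_xml are made up for illustration: the first is acceptable HTML but not well-formed XML, the second is the same content written XHTML-style.

```python
import xml.etree.ElementTree as ET

html_fragment = "<p>one<br>two</p>"       # <br> is never closed: fine in HTML
xhtml_fragment = "<p>one<br />two</p>"    # self-closed <br />: well-formed XML


def parses_as_xml(text):
    """Return True if a strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(parses_as_xml(html_fragment))
print(parses_as_xml(xhtml_fragment))
```

The strict parser rejects the first fragment and accepts the second, which is the kind of complaint an XML library will raise about cleaned-up HTML that was not emitted as XHTML.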

Of course BeautifulSoup is also a very nice library if you need to
extract some information, but don't necessarily require XML processing
to do it.

-- Matt Good

Ravi Teja · Jul 1, 2006

Paul said:
Ravi said:

1.) XPath is not a good idea at all with "malformed" HTML or perhaps
web pages in general.

import libxml2dom
import urllib
f = urllib.urlopen("http://wiki.python.org/moin/")
s = f.read()
f.close()
# s contains HTML not XML text
d = libxml2dom.parseString(s, html=1)
# get the community-related links
for label in d.xpath("//li[.//a/text() = 'Community']//li//a/text()"):
    print label.nodeValue

I wasn't aware that your module does html as well.

Of course, lxml should be able to do this kind of thing as well. I'd be
interested to know why this "is not a good idea", though.

No reason that you don't know already.

http://www.boddie.org.uk/python/HTML.html

"If the document text is well-formed XML, we could omit the html
parameter or set it to have a false value."

XML parsers are not required to be forgiving to be regarded as compliant.
And much HTML out there is not well formed.

Fredrik Lundh · Jul 2, 2006

Ravi said:
No reason that you don't know already.

http://www.boddie.org.uk/python/HTML.html

"If the document text is well-formed XML, we could omit the html
parameter or set it to have a false value."

XML parsers are not required to be forgiving to be regarded as compliant.
And much HTML out there is not well formed.

so? once you run it through an HTML-aware parser, the *resulting*
structure is well formed.

a site generator->converter->xpath approach is no less reliable than any
other HTML-scraping approach.
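
A rough sketch of that converter->xpath pipeline using only Python's standard library. The class name TreeFromHTML, the VOID set, and the input string are assumptions made for this example; a real converter (tidy, libxml2's HTML parser, TagSoup) recovers structure far more carefully, and the tree built here will not always match what a browser would build.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never take a closing tag (a partial list, for illustration).
VOID = {"br", "hr", "img", "input", "link", "meta"}


class TreeFromHTML(HTMLParser):
    """Forgiving HTML -> ElementTree converter (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.builder = ET.TreeBuilder()
        self.builder.start("document", {})  # synthetic single root element
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        # Bare attributes arrive with a value of None; store "" instead.
        self.builder.start(tag, {k: (v or "") for k, v in attrs})
        if tag in VOID:
            self.builder.end(tag)
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag: ignore it
        # Close everything left open inside this element, then the element.
        while self.open_tags:
            top = self.open_tags.pop()
            self.builder.end(top)
            if top == tag:
                break

    def handle_data(self, data):
        self.builder.data(data)

    def finish(self):
        self.close()  # flush any text still buffered by HTMLParser
        while self.open_tags:
            self.builder.end(self.open_tags.pop())
        self.builder.end("document")
        return self.builder.close()


# Hypothetical messy input: unclosed <li> elements and a bare <br>.
messy = '<ul><li><a href="/a">Community</a><li><a href="/b">Docs</a><br></ul>'

parser = TreeFromHTML()
parser.feed(messy)
root = parser.finish()

# ElementTree supports a limited XPath subset; ".//a" finds all descendants.
links = [(a.text, a.get("href")) for a in root.findall(".//a")]
print(links)

# The converted tree serializes to text a strict XML parser accepts.
reparsed = ET.fromstring(ET.tostring(root))
```

Once the conversion step has run, the XPath query operates on a well-formed tree, whatever state the original markup was in.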

</F>

uche.ogbuji · Jul 3, 2006

bruce said:
hi paddy...

that's exactly what i'm trying to accomplish... i've used tidy, but it seems
to still generate warnings...

initFile -> tidy -> cleanFile -> perl app (using xpath/libxml)

the xpath/libxml functions in the perl app complain regarding the file. my
thought is that tidy isn't cleaning enough, or that the perl xpath/libxml
functions are too strict!

which is why i decided to see if anyone on the python side has
experienced/solved this problem..

FWIW here's my usual approach:

http://copia.ogbuji.net/blog/2005-07-22/Beyond_HTM

Personally, I avoid Tidy. I've too often seen it crash or hang on
really bad HTML. TagSoup seems to be built like a tank. I've also
never seen BeautifulSoup choke, but I don't use it as much as TagSoup.
