beautifulsoup .vs tidy

B

bruce

hi...

never used perl, but i have an issue trying to resolve some html that
appears to be "dirty/malformed" regarding the overall structure. in
researching validators, i came across the beautifulsoup app and wanted to
know if anybody could give me pros/cons of the app as it relates to any of
the other validation apps...

the issue i'm facing involves parsing some websites, so i'm trying to
extract information based on the DOM/XPath functions.. i'm using perl to
handle the extraction....

thanks

-bruce
(e-mail address removed)
 
R

Ravi Teja

bruce said:
hi...

never used perl, but i have an issue trying to resolve some html that
appears to be "dirty/malformed" regarding the overall structure. in
researching validators, i came across the beautifulsoup app and wanted to
know if anybody could give me pros/cons of the app as it relates to any of
the other validation apps...

the issue i'm facing involves parsing some websites, so i'm trying to
extract information based on the DOM/XPath functions.. i'm using perl to
handle the extraction....

1.) XPath is not a good idea at all with "malformed" HTML or perhaps
web pages in general.
2.) BeautifulSoup is not a validator but works well with bad HTML. Also
look at Mechanize and ClientForm.
3.) XMLStarlet is a good XML validator
(http://xmlstar.sourceforge.net/). It's not Python but you don't need
to care about the language it is written in.
4.) For a simple HTML validator, Just use http://validator.w3.org/
 
P

Paddy

bruce said:
hi...

never used perl, but i have an issue trying to resolve some html that
appears to be "dirty/malformed" regarding the overall structure. in
researching validators, i came across the beautifulsoup app and wanted to
know if anybody could give me pros/cons of the app as it relates to any of
the other validation apps...
I'm not too sure of what you are after. You mention tidy in the subject
which made me think that maybe you were trying to generate well-formed
HTML from malformed webppages that nonetheless browsers can interpret.
If that is the case then try HTML tidy:
http://www.w3.org/People/Raggett/tidy/

- Pad.
 
F

Fredrik Lundh

bruce said:
that's exactly what i'm trying to accomplish... i've used tidy, but it seems
to still generate warnings...

initFile -> tidy ->cleanFile -> perl app (using xpath/livxml)

the xpath/linxml functions in the perl app complain regarding the file.

what exactly do they complain about ?

</F>
 
P

Paul Boddie

Ravi said:
1.) XPath is not a good idea at all with "malformed" HTML or perhaps
web pages in general.

import libxml2dom
import urllib
f = urllib.urlopen("http://wiki.python.org/moin/")
s = f.read()
f.close()
# s contains HTML not XML text
d = libxml2dom.parseString(s, html=1)
# get the community-related links
for label in d.xpath("//li[.//a/text() = 'Community']//li//a/text()"):
print label.nodeValue

Of course, lxml should be able to do this kind of thing as well. I'd be
interested to know why this "is not a good idea", though.

Paul
 
M

Matt Good

bruce said:
that's exactly what i'm trying to accomplish... i've used tidy, but it seems
to still generate warnings...

initFile -> tidy ->cleanFile -> perl app (using xpath/livxml)

the xpath/linxml functions in the perl app complain regarding the file. my
thought is that tidy isn't cleaning enough, or that the perl xpath/libxml
functions are too strict!

Clean HTML is not valid XML. If you want to process the output with an
XML library you'll need to tell Tidy to output XHTML. Then it should
be valid for XML processing.

Of course BeautifulSoup is also a very nice library if you need to
extract some information, but don't necessarilly require XML processing
to do it.

-- Matt Good
 
R

Ravi Teja

Paul said:
Ravi said:
1.) XPath is not a good idea at all with "malformed" HTML or perhaps
web pages in general.

import libxml2dom
import urllib
f = urllib.urlopen("http://wiki.python.org/moin/")
s = f.read()
f.close()
# s contains HTML not XML text
d = libxml2dom.parseString(s, html=1)
# get the community-related links
for label in d.xpath("//li[.//a/text() = 'Community']//li//a/text()"):
print label.nodeValue

I wasn't aware that your module does html as well.
Of course, lxml should be able to do this kind of thing as well. I'd be
interested to know why this "is not a good idea", though.

No reason that you don't know already.

http://www.boddie.org.uk/python/HTML.html

"If the document text is well-formed XML, we could omit the html
parameter or set it to have a false value."

XML parsers are not required to be forgiving to be regarded compliant.
And much HTML out there is not well formed.
 
F

Fredrik Lundh

Ravi said:
No reason that you don't know already.

http://www.boddie.org.uk/python/HTML.html

"If the document text is well-formed XML, we could omit the html
parameter or set it to have a false value."

XML parsers are not required to be forgiving to be regarded compliant.
And much HTML out there is not well formed.

so? once you run it through an HTML-aware parser, the *resulting*
structure is well formed.

a site generator->converter->xpath approach is no less reliable than any
other HTML-scraping approach.

</F>
 
U

uche.ogbuji

bruce said:
hi paddy...

that's exactly what i'm trying to accomplish... i've used tidy, but it seems
to still generate warnings...

initFile -> tidy ->cleanFile -> perl app (using xpath/livxml)

the xpath/linxml functions in the perl app complain regarding the file. my
thought is that tidy isn't cleaning enough, or that the perl xpath/libxml
functions are too strict!

which is why i decided to see if anyone on the python side has
experienced/solved this problem..

FWIW here's my usual approach:

http://copia.ogbuji.net/blog/2005-07-22/Beyond_HTM

Personally, I avoid Tidy. I've too often seen it crash or hang on
really bad HTML. TagSoup seems to be built like a tank. I've also
never seen BeautifulSoup choke, but I don't use it as much as TagSoup.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,293
Messages
2,571,501
Members
48,189
Latest member
StaciLgf76

Latest Threads

Top