HTML purifier using BeautifulSoup?

Dan Stromberg

Has anyone tried to construct an HTML janitor script using BeautifulSoup?

My situation:

I'm trying to convert a series of web pages from HTML to PalmDoc format
using Plucker, which is written in Python. The Plucker project suggests
passing HTML through "tidy" to get well-formed HTML for Plucker to work
with.

However, some of the pages I want to convert are so bad that even tidy
pukes on them.

I was thinking that BeautifulSoup might be more tolerant of really bad
HTML, which led me to the question this article started out with. :)

Thanks!
 
Jonathan Clark

I have used BeautifulSoup for screen scraping, pulling HTML into
structured form (using XML). Is that similar to a janitor script? I
used it because tidy was puking on some HTML. BeautifulSoup has been
excellent.
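
For illustration, here is a minimal sketch of the kind of janitor script being discussed. It assumes the current `bs4` package and Python's built-in "html.parser" backend (the thread does not say which BeautifulSoup version was used), and `clean_html` is a hypothetical name. It parses the broken markup and writes the tree back out, so every tag that was left open gets a closing tag.

```python
from bs4 import BeautifulSoup


def clean_html(raw_html: str) -> str:
    """Parse possibly malformed HTML and re-serialize it with all tags closed.

    Hypothetical helper for illustration; not part of Plucker or tidy.
    """
    # "html.parser" is the parser shipped with Python. Other backends
    # (lxml, html5lib) repair bad markup differently and may give
    # different output for the same input.
    soup = BeautifulSoup(raw_html, "html.parser")
    # Serializing the parse tree emits a closing tag for every element.
    return str(soup)


if __name__ == "__main__":
    broken = "<html><body><p>Unclosed paragraph<b>bold text<ul><li>one<li>two"
    print(clean_html(broken))
```

The cleaned output could then be saved to a file and handed to tidy or Plucker. Whether the result is clean enough for either tool depends on the page, so treat this as a starting point rather than a guaranteed fix.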
 
