html DOM

Sam the Cat

Is there a package that would allow me the same or similar functionality for
modifying HTML code via the DOM model as I have in JavaScript? I'd like to
parse an HTML file, then modify it and save the result. I am not trying to
do this online; rather, I would like to do this on a batch of files stored on
my hard drive. I have found several packages that allow me to parse and
dissect HTML but none that allow me to modify the object and save the
results -- perhaps I am overlooking the obvious.
 
Gabriel Genellina

On Sun, 30 Mar 2008 00:19:08 -0300, Michael Wieher wrote:
Was this not of any use?

http://www.boddie.org.uk/python/HTML.html

I think, since HTML is a subset of XML, any XML parser could be adapted to
do this...

That's not true. A perfectly valid HTML document might not even be
well-formed XML: some closing tags are not mandatory, attributes may be
unquoted, tags may be written in uppercase, etc. Example:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/1999/REC-html401-19991224/strict.dtd">
<HTML><TITLE>Invalid xml</title><p Id=Abc>a</html>

The above document validates with no errors on http://validator.w3.org.
If you are talking about XHTML documents, yes, they *should* be valid XML
documents.
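
To see the point concretely, here is a minimal sketch that feeds the example document above to a strict XML parser. It uses `xml.etree.ElementTree` from the standard library as a stand-in for "any XML parser"; the helper name `is_well_formed_xml` is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# The valid HTML 4.01 document from the post above, copied verbatim.
HTML_DOC = '''<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/1999/REC-html401-19991224/strict.dtd">
<HTML><TITLE>Invalid xml</title><p Id=Abc>a</html>'''


def is_well_formed_xml(text):
    """Return True if a strict XML parser accepts `text` (hypothetical helper)."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        # Mismatched tag case, unquoted attributes and unclosed tags all end up here.
        return False
    return True


print(is_well_formed_xml(HTML_DOC))
print(is_well_formed_xml("<html><title>ok</title></html>"))
```

The first call should report that the document is rejected, while the small well-formed snippet is accepted.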
I doubt there's an HTML-specific version, but I would imagine you
could wrap any XML parser, or really, create your own that derives from
the XML-parser-class...

The problem is that many HTML and XHTML pages that you find on the web
aren't valid, and some are ridiculously invalid. Browsers have a "quirks"
mode, and can guess the writer's intent, more or less, only because
HTML tags have some meaning. A generic XML parser, on the other hand,
usually just refuses to continue parsing an ill-formed document. You can't
simply "adapt any XML parser to do that".
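
For contrast, an HTML-aware parser gets through the same document without complaint. The sketch below uses `html.parser.HTMLParser` from the standard library, which is not mentioned in this thread and is chosen here only because it needs no installation; the `TagCollector` class is a made-up name for the illustration.

```python
from html.parser import HTMLParser

# Same valid-HTML, invalid-XML document as in the example above.
HTML_DOC = '''<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/1999/REC-html401-19991224/strict.dtd">
<HTML><TITLE>Invalid xml</title><p Id=Abc>a</html>'''


class TagCollector(HTMLParser):
    """Record every start tag together with its attributes (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names and accepts
        # unquoted attribute values.
        self.start_tags.append((tag, attrs))


collector = TagCollector()
collector.feed(HTML_DOC)
collector.close()

for tag, attrs in collector.start_tags:
    print(tag, attrs)
```

Note that this is an event-style parser, not a DOM: it tolerates the tag soup but does not by itself give you a tree to modify and save.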

BeautifulSoup, for example, does a very good job of interpreting and
extracting data from "tag soup", and may be useful to the OP.
http://www.crummy.com/software/BeautifulSoup/
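
A minimal sketch of the parse, modify, save cycle the OP asked about, assuming the current `bs4` package (`pip install beautifulsoup4`) rather than the 2008-era `BeautifulSoup` module. The function names, the `note` class, and the `*.html` pattern are all made up for the illustration.

```python
from pathlib import Path

from bs4 import BeautifulSoup  # third-party: beautifulsoup4


def add_class_to_paragraphs(html_text, class_name="note"):
    """Parse HTML leniently, set a class on every <p>, return the new markup."""
    soup = BeautifulSoup(html_text, "html.parser")
    for p in soup.find_all("p"):
        p["class"] = class_name  # attributes are edited like dict entries
    return str(soup)


def rewrite_folder(folder, pattern="*.html"):
    """Apply the change in place to every matching file; return their names."""
    changed = []
    for path in sorted(Path(folder).glob(pattern)):
        original = path.read_text(encoding="utf-8")
        path.write_text(add_class_to_paragraphs(original), encoding="utf-8")
        changed.append(path.name)
    return changed


# Hypothetical usage: rewrite_folder("/path/to/my/html/files")
```

Since this overwrites files in place, it is worth trying it on a copy of the folder first. Also assumed here is that the files are UTF-8; adjust the `encoding` arguments if they are not.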
 
David

Is there a package that would allow me the same or similar functionality for
modifying html code via the DOM model as I have in JavaScript ? I'd like to
parse an html file, then modify it and save the result. I am not trying to
do this online, rather I would like to do this on a batch of files stored on
my hard drive. I have found several packages that allow me to parse and
dissect html but none that allow me to modify the object and save the
results -- perhaps I am overlooking the obvious

Have you looked at Beautiful Soup?

http://www.crummy.com/software/BeautifulSoup/

David.
 
Paul Boddie

Is there a package that would allow me the same or similar functionality for
modifying html code via the DOM model as I have in JavaScript ? I'd like to
parse an html file, then modify it and save the result.

You could try libxml2dom which has an HTML parsing mode (like lxml and
other solutions based on libxml2):

http://www.python.org/pypi/libxml2dom

It attempts to provide a DOM API very much like that used by
JavaScript implementations.

Paul
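
To give a feel for what a JavaScript-like DOM API looks like in Python, here is a small sketch using `xml.dom.minidom` from the standard library. This is a stand-in, not libxml2dom itself: minidom only accepts well-formed input (so the sample is XHTML-style), whereas the HTML parsing mode mentioned above is what you would want for real-world pages. The sample markup is made up.

```python
from xml.dom.minidom import parseString

# Well-formed sample input; minidom would reject tag soup.
xhtml = '<html><body><p id="abc">a</p></body></html>'
doc = parseString(xhtml)

# The method names mirror the W3C DOM calls familiar from JavaScript.
first_p = doc.getElementsByTagName("p")[0]
first_p.setAttribute("class", "note")

new_p = doc.createElement("p")
new_p.appendChild(doc.createTextNode("b"))
doc.getElementsByTagName("body")[0].appendChild(new_p)

# Serialize the modified tree back to markup.
result = doc.documentElement.toxml()
print(result)
```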
 
Stefan Behnel

Sam said:
Is there a package that would allow me the same or similar functionality
for modifying html code via the DOM model as I have in JavaScript ? I'd
like to parse an html file, then modify it and save the result. I am
not trying to do this online, rather I would like to do this on a batch
of files stored on my hard drive. I have found several packages that
allow me to parse and dissect html but none that allow me to modify the
object and save the results -- perhaps I am overlooking the obvious

http://codespeak.net/lxml/lxmlhtml.html

Here are some performance comparisons of HTML parsers:

http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

Stefan
 
