Buffering HTML as HTMLParser reads it?

chrispwd · Aug 1, 2007

Hello,

I am working on a project where I'm using python to parse HTML pages,
transforming data between certain tags. Currently the HTMLParser class
is being used for this. In a nutshell, it's pretty simple -- I'm
feeding the contents of the HTML page to HTMLParser, then I am
overriding the appropriate handle_ method to handle this extracted
data. In that method, I take the found data and I transform it into
another string based on some logic.

Now, what I would like to do here is take that transformed string and
put it "back into" the HTML document. Has anybody ever implemented
something like this with HTMLParser?

I'm thinking maybe somehow have HTMLParser append each character it
reads except for data inside tags in some kind of buffer? This way I
can have the HTML contents read into a buffer, then when I do my own
handle_ overrides, I can also append to that buffer with the
transformed data. Once the HTML page is finished parsing, ideally I
would be able to print the contents of the buffer and the HTML would
be identical except for the string transformations.

I also need to make sure that all newlines, tags, spacing, etc. are
kept intact -- this part is a requirement for other reasons.

Thanks!

Paul McGuire · Aug 1, 2007

What you describe is almost exactly how pyparsing implements
transformString. See below:

from pyparsing import makeHTMLTags, replaceWith

boldStart, boldEnd = makeHTMLTags("B")

# convert <B> to <div class="emphatic"> and </B> to </div>
boldStart.setParseAction(replaceWith('<div class="emphatic">'))
boldEnd.setParseAction(replaceWith('</div>'))
converter = boldStart | boldEnd

html = "Display this in <b>bold</b>"
print(converter.transformString(html))

Prints:

Display this in <div class="emphatic">bold</div>

All text not matched by a pattern in the converter is left as-is. (My
CSS style/form may not be up to date, but I hope you get the idea.)

-- Paul

chrispwd · Aug 5, 2007

Hello,

Sorry for the delay in replying, and thank you for the info. Though, I
think either I am misunderstanding your post or it's not the solution
I'm looking for.

How does this fit into what I'm looking to do with HTMLParser?

Thanks!

Bruno Desthuilliers · Aug 6, 2007

This works the same with any SAX-style (event-based) parser. First
subclass the parser, adding a 'buffer' attribute (best is to use a
file-like object, so you can write to a stream, a file, a StringIO,
etc.) and making every handler write to this buffer. Then subclass your
customized parser and override only the handlers you need.

Q&D example implementation:

from html.parser import HTMLParser

def format_attrs(attrs):
    # attrs is a list of (name, value) pairs; value is None for a bare attribute
    return ' '.join(
        name if value is None else '%s="%s"' % (name, value)
        for name, value in attrs
    )

def format_tag(tag, attrs, formats):
    attrs = format_attrs(attrs)
    # formats[0] is used when there are no attributes, formats[1] otherwise
    return formats[bool(attrs)] % dict(tag=tag, attrs=attrs)

class BufferedHTMLParser(HTMLParser):
    START_TAG_FORMATS = ('<%(tag)s>', '<%(tag)s %(attrs)s>')
    STARTEND_TAG_FORMATS = ('<%(tag)s />', '<%(tag)s %(attrs)s />')

    def __init__(self, buffer):
        HTMLParser.__init__(self)  # the base class needs its own setup
        self.buffer = buffer

    def handle_starttag(self, tag, attrs):
        self.buffer.write(format_tag(tag, attrs, self.START_TAG_FORMATS))

    def handle_startendtag(self, tag, attrs):
        self.buffer.write(format_tag(tag, attrs, self.STARTEND_TAG_FORMATS))

    def handle_endtag(self, tag):
        self.buffer.write('</%s>' % tag)

    def handle_data(self, data):
        self.buffer.write(data)

    # etc. for all handlers

class MyParser(BufferedHTMLParser):
    def handle_data(self, data):
        # transform the text, then let the base class write it to the buffer
        data = data.replace(
            'Ni',
            "Ekky-ekky-ekky-ekky-z'Bang, zoom-Boing, z'nourrrwringmm"
        )
        BufferedHTMLParser.handle_data(self, data)

HTH
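
Since the original question asks for newlines, tags and spacing to stay intact, here is one possible way to fill in the remaining handlers of a buffering parser like the one described above. It is only a sketch for Python 3, not a definitive implementation. It relies on the standard library's get_starttag_text(), which returns a start tag exactly as written in the source, and on convert_charrefs=False, so that entity references arrive as separate events instead of being decoded. The names EchoHTMLParser, UpperCaseText and rewrite are made up for illustration. End tags, entities, comments and declarations are rebuilt from parsed pieces, so unusual source forms (for example </B > or an entity without its semicolon) would probably not round-trip exactly.

```python
import io
from html.parser import HTMLParser


class EchoHTMLParser(HTMLParser):
    """Hypothetical helper: writes each parser event back out to `out`."""

    def __init__(self, out):
        # convert_charrefs=False reports "&amp;" etc. via handle_entityref /
        # handle_charref instead of decoding them inside handle_data.
        super().__init__(convert_charrefs=False)
        self.out = out

    def handle_starttag(self, tag, attrs):
        # Raw text of the start tag, with original case, quoting and spacing.
        self.out.write(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br /> are also available as raw text.
        self.out.write(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Assumption: lowercase output is acceptable; the parser has already
        # lowercased the name and the raw end tag is not exposed.
        self.out.write('</%s>' % tag)

    def handle_data(self, data):
        self.out.write(data)

    def handle_entityref(self, name):
        self.out.write('&%s;' % name)

    def handle_charref(self, name):
        self.out.write('&#%s;' % name)

    def handle_comment(self, data):
        self.out.write('<!--%s-->' % data)

    def handle_decl(self, decl):
        self.out.write('<!%s>' % decl)

    def handle_pi(self, data):
        self.out.write('<?%s>' % data)


class UpperCaseText(EchoHTMLParser):
    """Example transformation: uppercase the text between tags."""

    def handle_data(self, data):
        super().handle_data(data.upper())


def rewrite(html, parser_cls):
    """Run `html` through `parser_cls` and return the buffered result."""
    out = io.StringIO()
    parser = parser_cls(out)
    parser.feed(html)
    parser.close()
    return out.getvalue()


sample = '<p class="x">Hello,\n  <b>world</b> &amp; more<br /></p>\n<!-- note -->\n'
print(rewrite(sample, UpperCaseText))
```

With this sample, only the text between tags is changed; the tags, whitespace, entity and comment are written back as they were. Note that the contents of script and style elements also arrive through handle_data, so a real transformation would likely need to skip those.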
