Buffering HTML as HTMLParser reads it?

chrispwd

Hello,

I am working on a project where I'm using Python to parse HTML pages,
transforming data between certain tags. Currently the HTMLParser class
is being used for this. In a nutshell, it's pretty simple -- I'm
feeding the contents of the HTML page to HTMLParser, then I am
overriding the appropriate handle_ method to handle this extracted
data. In that method, I take the found data and I transform it into
another string based on some logic.

Now, what I would like to do here is take that transformed string and
put it "back into" the HTML document. Has anybody ever implemented
something like this with HTMLParser?

I'm thinking maybe somehow have HTMLParser append each character it
reads except for data inside tags in some kind of buffer? This way I
can have the HTML contents read into a buffer, then when I do my own
handle_ overrides, I can also append to that buffer with the
transformed data. Once the HTML page is finished parsing, ideally I
would be able to print the contents of the buffer and the HTML would
be identical except for the string transformations.

I also need to make sure that all newlines, tags, spacing, etc. are
kept intact -- this part is a requirement for other reasons.

Thanks!
 
Paul McGuire

On Aug 1, 1:31 pm, (e-mail address removed) wrote:
> I'm thinking maybe somehow have HTMLParser append each character it
> reads except for data inside tags in some kind of buffer? This way I
> can have the HTML contents read into a buffer, then when I do my own
> handle_ overrides, I can also append to that buffer with the
> transformed data. Once the HTML page is finished parsing, ideally I
> would be able to print the contents of the buffer and the HTML would
> be identical except for the string transformations.
>
> I also need to make sure that all newlines, tags, spacing, etc. are
> kept intact -- this part is a requirement for other reasons.
>
> Thanks!

What you describe is almost exactly how pyparsing implements
transformString. See below:

from pyparsing import makeHTMLTags, replaceWith

# makeHTMLTags matches the tag name in either upper or lower case
boldStart, boldEnd = makeHTMLTags("B")

# convert <B> to <div class="emphatic"> and </B> to </div>
boldStart.setParseAction(replaceWith('<div class="emphatic">'))
boldEnd.setParseAction(replaceWith('</div>'))
converter = boldStart | boldEnd

html = "Display this in <b>bold</b>"
print(converter.transformString(html))

Prints:

Display this in <div class="emphatic">bold</div>

All text not matched by a pattern in the converter is left as-is. (My
CSS style/form may not be up to date, but I hope you get the idea.)

-- Paul
 
chrispwd

Hello,

Sorry for the delay in replying, and thank you for the info. Though, I
think either I am misunderstanding your post or it's not the solution
I'm looking for.

How does this fit into what I'm looking to do with HTMLParser?

Thanks!
 
Bruno Desthuilliers

(e-mail address removed) wrote:
> Hello,
>
> I am working on a project where I'm using Python to parse HTML pages,
> transforming data between certain tags. Currently the HTMLParser class
> is being used for this. In a nutshell, it's pretty simple -- I'm
> feeding the contents of the HTML page to HTMLParser, then I am
> overriding the appropriate handle_ method to handle this extracted
> data. In that method, I take the found data and I transform it into
> another string based on some logic.
>
> Now, what I would like to do here is take that transformed string and
> put it "back into" the HTML document. Has anybody ever implemented
> something like this with HTMLParser?

This works the same way with any SAX-style (event-based) parser. First
subclass the parser, giving it a 'buffer' attribute (ideally a file-like
object, so you can write to a stream, a file, a cStringIO or, on
Python 3, an io.StringIO, etc.) and making every handler write to this
buffer. Then subclass your customized parser and override only the
handlers you need.

Q&D example implementation:

from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser

def format_attrs(attrs):
    # attrs is a list of (name, value) pairs; value is None for a bare
    # attribute such as "checked"
    return ' '.join(
        name if value is None else '%s="%s"' % (name, value)
        for name, value in attrs
    )

def format_tag(tag, attrs, formats):
    attrs = format_attrs(attrs)
    # bool(attrs) picks the format: index 0 without attributes, 1 with
    return formats[bool(attrs)] % dict(tag=tag, attrs=attrs)

class BufferedHTMLParser(HTMLParser):
    START_TAG_FORMATS = ('<%(tag)s>', '<%(tag)s %(attrs)s>')
    STARTEND_TAG_FORMATS = ('<%(tag)s />', '<%(tag)s %(attrs)s />')

    def __init__(self, buffer):
        HTMLParser.__init__(self)  # let the base class set itself up
        self.buffer = buffer

    def handle_starttag(self, tag, attrs):
        self.buffer.write(format_tag(tag, attrs, self.START_TAG_FORMATS))

    def handle_startendtag(self, tag, attrs):
        self.buffer.write(format_tag(tag, attrs, self.STARTEND_TAG_FORMATS))

    def handle_endtag(self, tag):
        self.buffer.write('</%s>' % tag)

    def handle_data(self, data):
        self.buffer.write(data)

    # etc for all handlers


class MyParser(BufferedHTMLParser):
    def handle_data(self, data):
        # transform the text, then let the base class write it out
        data = data.replace(
            'Ni',
            "Ekky-ekky-ekky-ekky-z'Bang, zoom-Boing, z'nourrrwringmm"
        )
        BufferedHTMLParser.handle_data(self, data)
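
Since the original question asks for spacing and tags to be kept intact, here is a variant of the same idea that may get closer to that. It is only a sketch for Python 3's html.parser: it assumes get_starttag_text() is used to echo start tags exactly as read, and that convert_charrefs=False is passed so entities arrive as separate events. The names EchoHTMLParser, NiParser and transform are made up for this illustration. End tags are rebuilt from the lowercased tag name, so </P> would come out as </p>, and CDATA sections (unknown_decl) are not covered.

```python
from html.parser import HTMLParser
from io import StringIO


class EchoHTMLParser(HTMLParser):
    """Hypothetical helper: writes each event back out close to the source."""

    def __init__(self, buffer):
        # convert_charrefs=False keeps &amp; and &#38; as separate events
        # instead of decoding them into the text given to handle_data.
        HTMLParser.__init__(self, convert_charrefs=False)
        self.buffer = buffer

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag as it appeared in the input.
        self.buffer.write(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.buffer.write(self.get_starttag_text())

    def handle_endtag(self, tag):
        # The raw end tag is not exposed; tag arrives lowercased.
        self.buffer.write('</%s>' % tag)

    def handle_data(self, data):
        self.buffer.write(data)

    def handle_entityref(self, name):
        self.buffer.write('&%s;' % name)

    def handle_charref(self, name):
        self.buffer.write('&#%s;' % name)

    def handle_comment(self, data):
        self.buffer.write('<!--%s-->' % data)

    def handle_decl(self, decl):
        self.buffer.write('<!%s>' % decl)

    def handle_pi(self, data):
        self.buffer.write('<?%s>' % data)


class NiParser(EchoHTMLParser):
    def handle_data(self, data):
        # Only text between tags is transformed; markup passes through.
        EchoHTMLParser.handle_data(self, data.replace('Ni', 'Ekky'))


def transform(html):
    out = StringIO()
    parser = NiParser(out)
    parser.feed(html)
    parser.close()
    return out.getvalue()


source = '<!DOCTYPE html>\n<p  class="a">Ni &amp; Ni</p>\n<br />\n<!-- Ni -->\n'
result = transform(source)
print(result)
```

Only handle_data is overridden in the subclass, so comments, attributes and whitespace inside tags are left alone. If an entity is written without its trailing semicolon in the source, this sketch will add one.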

HTH
 
