HTML -> text/plain "clever" formatting


Gilles Lenfant

Hi,

I am writing an app that needs to convert HTML to text in a "clever" way
(meaning it mimics a browser's rendering where possible).
Currently I spawn "lynx" with popen2, and it does a perfect job.

But I need a 100% Python solution so that my app (a Zope product) runs on
non-Unix boxes.

Any hints?

Thanks in advance.

--Gilles
 

Karl Scalet

Gilles said:
Hi,

I am writing an app that needs to convert HTML to text in a "clever" way
(meaning it mimics a browser's rendering where possible).
Currently I spawn "lynx" with popen2, and it does a perfect job.

I've seen an example in the Python Cookbook, but I don't have my copy with
me right now. If you have the book, look at the example on sending HTML
mail. The keywords are htmllib and formatter/AbstractFormatter.
IIRC they mention that the result is not as good as what lynx produces.
I don't know whether the online version at ActiveState has this example.

But I need a 100% Python solution so that my app (a Zope product) runs on
non-Unix boxes.

That would be 100% Python, with no external dependencies.

Any hints?

Thanks in advance.

--Gilles

HTH Karl
 

Gilles Lenfant

Karl Scalet said:
I've seen an example in the Python Cookbook, but I don't have my copy with
me right now. If you have the book, look at the example on sending HTML
mail. The keywords are htmllib and formatter/AbstractFormatter.
IIRC they mention that the result is not as good as what lynx produces.
I don't know whether the online version at ActiveState has this example.

That would be 100% Python, with no external dependencies.


HTH Karl

Thanks Karl, I found what you're talking about:

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52297

I'll need to rework that TtyFormatter in depth to mimic lynx :o)

Cheers

--Gilles
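
For anyone wondering what "mimicking lynx" might involve, here is a minimal sketch of two lynx-like touches: numbered [n] markers for links with a "References" list at the end, and "* " bullets for list items. It is not the TtyFormatter from the recipe above. It assumes Python 3 and the standard html.parser module, and the names LinkNumberingParser and render are made up for illustration.

```python
from html.parser import HTMLParser
import textwrap


class LinkNumberingParser(HTMLParser):
    """Collect text, marking each <a href> with [n] and each <li> with a bullet."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []   # text fragments in document order
        self.links = []   # href values, in order of appearance

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
                self.parts.append("[%d]" % len(self.links))
        elif tag == "li":
            self.parts.append("\n* ")
        elif tag in ("p", "br"):
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def render(html, width=72):
    """Return wrapped plain text followed by a numbered list of link targets."""
    parser = LinkNumberingParser()
    parser.feed(html)
    parser.close()
    body = []
    for line in "".join(parser.parts).split("\n"):
        line = " ".join(line.split())  # collapse runs of whitespace
        if line:
            indent = "  " if line.startswith("* ") else ""
            body.append(textwrap.fill(line, width, subsequent_indent=indent))
    out = "\n".join(body)
    if parser.links:
        refs = ["%3d. %s" % (i, url) for i, url in enumerate(parser.links, 1)]
        out += "\n\nReferences\n\n" + "\n".join(refs)
    return out
```

Calling render('<p>See <a href="http://example.org/">the docs</a>.</p>') gives a "See [1]the docs." line followed by a References section. Tables, nested lists, headings and the like would all need further work before the output comes close to what lynx produces.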
 

Michel Claveau/Hamster

Hello!

Try the home-made code example below.

@-regards
--
Michel Claveau
email: http://cerbermail.com/?6J1TthIa8B


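# -*- coding: cp1252 -*-
# Python 2 only: cStringIO and htmllib do not exist in Python 3.

import cStringIO
import formatter
import urllib
import htmllib

def htdecode(a):
    """Return the plain-text rendering of the HTML string a."""
    f = cStringIO.StringIO()
    # DumbWriter writes word-wrapped text to f; AbstractFormatter feeds it parser events.
    z = formatter.AbstractFormatter(formatter.DumbWriter(f))
    p = htmllib.HTMLParser(z)
    # unquote_plus decodes URL escapes such as %20 (and turns "+" into a space).
    p.feed(urllib.unquote_plus(a))
    p.close()
    sret = f.getvalue()
    f.close()
    return sret


a = """<HTML><BODY><B> Bonjour%20!%20<BR>
Ligne 2</B></BODY></HTML>"""

print '\n--- with HTML', '-'*30
print a

b = htdecode(a)

print '\n\n--- without HTML', '-'*28
print b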
 

John J. Lee

Gilles Lenfant said:
But I need a 100% Python solution so that my app (a Zope product) runs on
non-Unix boxes.

Lynx runs on Windows too (and lots of other platforms). It is a big
pile of code, though, possibly containing buffer overflows etc...


John
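
If lynx stays in the picture, a wrapper along these lines could use it where it is installed and signal its absence elsewhere, so that a pure-Python converter can take over. This is only a sketch: it assumes Python 3's subprocess module rather than popen2, the function name lynx_dump is hypothetical, and it relies on lynx's -dump, -stdin and -force_html options.

```python
import shutil
import subprocess


def lynx_dump(html, lynx="lynx"):
    """Return lynx's text rendering of an HTML string, or None if lynx is not found."""
    exe = shutil.which(lynx)
    if exe is None:
        return None  # caller can fall back to a pure-Python converter
    result = subprocess.run(
        [exe, "-dump", "-stdin", "-force_html"],
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError if lynx exits with an error
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Returning None instead of raising when the binary is missing keeps the calling code simple: it can try lynx first and use whatever fallback it has on boxes without it.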
 

Karl Scalet

Karl said:
Hi Gilles

Actually I was talking about a different, though similar, example.
But I could not find it in the online version either, so maybe
it is only available in the printed version, sorry.
But if your recipe is good enough, why bother :)

Karl

I finally found the online version at:
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/67083
but it deviates from the paper version in that it lacks exactly
the code you're interested in, so I'll add it here.
Hopefully no one will complain about these few lines:

import htmllib, formatter, cStringIO

# html is assumed to be a string holding the HTML source to convert
textout = cStringIO.StringIO()
formtext = formatter.AbstractFormatter(formatter.DumbWriter(textout))
parser = htmllib.HTMLParser(formtext)
parser.feed(html)
parser.close()
text = textout.getvalue()

(not tested)

Karl
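
The snippets in this thread are Python 2 code: htmllib and cStringIO are not available in Python 3, and the formatter module was removed in Python 3.10. As a rough equivalent, here is a minimal sketch built on the standard html.parser module instead. It is a different technique from the AbstractFormatter/DumbWriter pair, not a port of it, and the names TextExtractor, html_to_text, BLOCK_TAGS and SKIP_TAGS are invented for this example.

```python
from html.parser import HTMLParser
import textwrap

# Tags treated as line boundaries (an illustrative choice, not a complete list).
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}
# Tags whose content should never appear in the text output.
SKIP_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collect text content, inserting newlines around block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.chunks.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


def html_to_text(html, width=72):
    """Return the text of an HTML string, one wrapped paragraph per block."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    lines = []
    for raw in "".join(parser.chunks).split("\n"):
        collapsed = " ".join(raw.split())  # collapse runs of whitespace
        if collapsed:
            lines.append(textwrap.fill(collapsed, width))
    return "\n".join(lines)
```

Fed the two-line "Bonjour !" sample posted earlier in the thread (with the %20 escapes already decoded), html_to_text returns "Bonjour !" and "Ligne 2" on separate lines. Like the DumbWriter approach, it makes no attempt at tables or link lists.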
 
