Nicolas Evrard
Hello,
I'm puzzled by this test, which I made while trying to convert an HTML
page to plain text. I cannot pass either unicode or str to feed(), so
how can I do this?
..nicoe@smarties:~$ python2.4
..Python 2.4.1c2 (#2, Mar 19 2005, 01:04:19)
..[GCC 3.3.5 (Debian 1:3.3.5-12)] on linux2
..Type "help", "copyright", "credits" or "license" for more information.
..>>> import formatter
..>>> import htmllib
..>>> html2txt = htmllib.HTMLParser(formatter.AbstractFormatter(formatter.DumbWriter()))
..>>> html2txt.feed(u'D\xe9but')
..Traceback (most recent call last):
.. File "<stdin>", line 1, in ?
.. File "/usr/lib/python2.4/sgmllib.py", line 95, in feed
.. self.goahead(0)
.. File "/usr/lib/python2.4/sgmllib.py", line 120, in goahead
.. self.handle_data(rawdata[i:j])
.. File "/usr/lib/python2.4/htmllib.py", line 65, in handle_data
.. self.formatter.add_flowing_data(data)
.. File "/usr/lib/python2.4/formatter.py", line 197, in add_flowing_data
.. self.writer.send_flowing_data(data)
.. File "/usr/lib/python2.4/formatter.py", line 421, in send_flowing_data
.. write(word)
..UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)
..>>> html2txt.feed(u'D\xe9but'.encode('latin1'))
..Traceback (most recent call last):
.. File "<stdin>", line 1, in ?
.. File "/usr/lib/python2.4/sgmllib.py", line 94, in feed
.. self.rawdata = self.rawdata + data
..UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 1: ordinal not in range(128)
..>>> html2txt.feed('Début')
..Traceback (most recent call last):
.. File "<stdin>", line 1, in ?
.. File "/usr/lib/python2.4/sgmllib.py", line 94, in feed
.. self.rawdata = self.rawdata + data
..UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
..>>>
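A note on the transcript: the second and third calls fail at `self.rawdata = self.rawdata + data`, which suggests that the unicode text buffered by the first failed call is being concatenated with a byte string on the same parser instance. The sketch below shows one way around this kind of mixing: decode once at the boundary and use a fresh parser per document. It is not the htmllib approach from the session. It assumes Python 3 is available and swaps in `html.parser.HTMLParser`. The names `TextExtractor` and `html_to_text`, and the `latin1` default, are hypothetical choices made for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects the text nodes of an HTML document."""

    def __init__(self):
        # convert_charrefs=True turns references such as &eacute; into text.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        self.chunks.append(data)

    def text(self):
        return "".join(self.chunks)


def html_to_text(raw_bytes, encoding="latin1"):
    """Decode the bytes once, then feed only text to a fresh parser.

    The default encoding is an assumption; pass the page's real encoding.
    """
    parser = TextExtractor()  # new instance, so no leftover buffered data
    parser.feed(raw_bytes.decode(encoding))
    parser.close()  # flush any text still held in the parser's buffer
    return parser.text()
```

For example, `html_to_text(u'<p>D\xe9but</p>'.encode('latin1'))` decodes the bytes before parsing, so the parser never has to combine byte strings with text.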