Unicode and MoinMoin

gdetre

Dear all,

My lab has been using a Movable Type blog for internal communication
and announcements for a couple of years, but we've now seen the light
and I've set up a MoinMoin wiki. Everything's installed beautifully, so
I'm writing scripts to export all our Movable Type blog posts to wiki
pages. So far so good.

The only issue I'm having relates to Unicode. MoinMoin and Python are
pretty unforgiving about files that contain characters that aren't
encoded properly. I've spent hours reading about
Unicode, and playing with different encoding/decoding commands, but at
this point, I just want a hacky solution that will ignore the
improperly coded characters or replace them with placeholders.

Can anyone recommend a simple surefire Unix/Python/Perl command that
will help me avoid errors like the one below? Any suggestions would be
hugely appreciated.

Thank you very much for your time,

Yours,
Greg

----

'utf8' codec can't decode byte 0x96 in position 4910: unexpected code
byte

* args = ('utf8', 'AUTHOR: blahblah\n\nTITLE: Reading Course
Readings... G. A. \x96 For references see blahblah.\n\n\n-----\n\n',
4910, 4911, 'unexpected code byte')
* encoding = 'utf8'
* end = 4911
* object = 'AUTHOR: blahblah\n\nTITLE: Reading Course Readings...
G. A. \x96 For references see blahblah.\n\n\n-----\n\n'
* reason = 'unexpected code byte'
* start = 4910
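
The attributes listed above (start, end, object, reason) are all available on the exception itself, so the offending byte can be located programmatically. A minimal sketch, assuming Python 3 (the thread uses Python 2.4, where the syntax differs slightly); the function name describe_decode_error and the context parameter are hypothetical, not part of MoinMoin or Movable Type:

```python
def describe_decode_error(data, encoding="utf-8", context=20):
    """Return None if data decodes cleanly, else details on the first bad byte.

    Hypothetical helper for illustration only.
    """
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        lo = max(exc.start - context, 0)
        return {
            "start": exc.start,                       # index of the first bad byte
            "bad_bytes": data[exc.start:exc.end],     # the byte(s) the codec rejected
            "context": data[lo:exc.end + context],    # surrounding bytes, for eyeballing
            "reason": exc.reason,                     # the codec's explanation
        }
    return None


info = describe_decode_error(b"G. A. \x96 For references see blahblah.")
```

Looking at the surrounding bytes usually makes it obvious what the character was meant to be, which helps in guessing the real source encoding.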
 
Neil Hodgson

Greg:
The only issue I'm having relates to Unicode. MoinMoin and Python are
pretty unforgiving about files that contain characters that aren't
encoded properly. I've spent hours reading about
Unicode, and playing with different encoding/decoding commands, but at
this point, I just want a hacky solution that will ignore the
improperly coded characters or replace them with placeholders.

Call the codec with the errors argument set to "ignore" or "replace".
A. \x96 For references see blahblah.\n\n\n-----\n\n', 'utf8')
Traceback (most recent call last):
File "<interactive input>", line 1, in ?
File "c:\python24\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 58:
unexpected code byte
A. \x96 For references see blahblah.\n\n\n-----\n\n', 'utf8', 'replace')
u'AUTHOR: blahblah\n\nTITLE: Reading Course Readings... G. A. \ufffd For
references see blahblah.\n\n\n-----\n\n'

BTW, it's probably in Windows-1252, where 0x96 would be a dash.
Depending on your context, it may pay to handle the exception instead of
using "replace" and attempt interpreting the bytes as Windows-1252.
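
The two approaches described here can be sketched as follows. This is an illustration only, assuming Python 3 (where the input is a bytes object rather than the Python 2 str used in the thread); decode_with_fallback is a hypothetical helper name, and the sample bytes are a shortened version of the string from the traceback:

```python
def decode_with_fallback(data, primary="utf-8", fallback="cp1252"):
    """Try the primary encoding; on failure, reinterpret the bytes as the fallback.

    Hypothetical helper for illustration. The "replace" handler on the
    fallback keeps the few bytes cp1252 leaves undefined from raising.
    """
    try:
        return data.decode(primary)
    except UnicodeDecodeError:
        return data.decode(fallback, "replace")


raw = b"G. A. \x96 For references see blahblah."

# Option 1: let the codec paper over bad bytes.
lossy = raw.decode("utf-8", "replace")    # bad byte becomes U+FFFD
dropped = raw.decode("utf-8", "ignore")   # bad byte is silently removed

# Option 2: handle the exception and assume Windows-1252.
recovered = decode_with_fallback(raw)     # 0x96 becomes U+2013 (en dash)
```

The fallback route preserves the original punctuation, but only if the guess about the source encoding is right; "replace" and "ignore" are lossy but make no such assumption.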

Neil
 
Fredrik Lundh

Neil said:
Call the codec with the errors argument set to "ignore" or "replace".

A. \x96 For references see blahblah.\n\n\n-----\n\n', 'utf8')
Traceback (most recent call last):
File "<interactive input>", line 1, in ?
File "c:\python24\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 58:
unexpected code byte
A. \x96 For references see blahblah.\n\n\n-----\n\n', 'utf8', 'replace')
u'AUTHOR: blahblah\n\nTITLE: Reading Course Readings... G. A. \ufffd For
references see blahblah.\n\n\n-----\n\n'

BTW, it's probably in Windows-1252 where it would be a dash.
Depending on your context it may pay to handle the exception instead of
using "replace" and attempt interpreting as Windows-1252.

here's one way to explicitly deal with 1252 gremlins:

http://effbot.org/zone/unicode-gremlins.htm
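
For readers who cannot reach that page, the general idea can be sketched like this. It is not the code from the linked article, only an illustration of the technique, assuming Python 3 and text that has already been decoded (for example as Latin-1) so that Windows-1252 bytes ended up as C1 control characters in the range U+0080 to U+009F. The name fix_cp1252_gremlins is hypothetical:

```python
import re

# C1 control characters: where cp1252 punctuation lands when the
# bytes were mistakenly decoded as Latin-1.
_C1_CONTROLS = re.compile("[\x80-\x9f]")


def fix_cp1252_gremlins(text):
    """Remap stray C1 control characters to their Windows-1252 meaning.

    Hypothetical helper for illustration only.
    """
    def remap(match):
        ch = match.group(0)
        try:
            return bytes([ord(ch)]).decode("cp1252")
        except UnicodeDecodeError:
            # A few byte values are undefined in cp1252; leave them alone.
            return ch

    return _C1_CONTROLS.sub(remap, text)


fixed = fix_cp1252_gremlins("G. A. \x96 For references see blahblah.")
```

Unlike a blanket "replace", this touches only the characters that are almost certainly mis-decoded Windows-1252 punctuation and leaves everything else as it was.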

</F>
 
