encoding problems

tool69 · Aug 29, 2007

Hi,

I would like to transform reST contents to HTML, but got problems
with accented chars.

Here's a rather simplified version using SVN Docutils 0.5:

%-------------------------------------------------------------

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from docutils.core import publish_parts

class Post(object):
    def __init__(self, title='', content=''):
        self.title = title
        self.content = content

    def _get_html_content(self):
        return publish_parts(self.content,
                             writer_name="html")["html_body"]
    html_content = property(_get_html_content)

# Instantiate 2 Post objects
p1 = Post()
p1.title = "First post without accented chars"
p1.content = """This is the first.
....blabla
.... end of post..."""

p2 = Post()
p2.title = "Second post with accented chars"
p2.content = """Ce poste possède des accents : é à ê è"""

for post in [p1,p2]:
    print post.title, "\n" +"-"*30
    print post.html_content

%-------------------------------------------------------------

The output gives me :

First post without accented chars
------------------------------
<div class="document">
<p>This is the first.
....blabla
.... end of post...</p>
</div>

Second post with accented chars
------------------------------
Traceback (most recent call last):
  File "C:\Documents and Settings\kib\Bureau\Projets\python\dbTest\rest_error.py", line 30, in <module>
    print post.html_content
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe8' in position 39: ordinal not in range(128)

Any idea of what I've missed ?

Thanks.

Lawrence D'Oliveiro · Aug 29, 2007

p2.content = """Ce poste possède des accents : é à ê è"""

My guess is this is being encoded as a Latin-1 string, but when you try to
output it it goes through the ASCII encoder, which doesn't understand the
accents. Try this:

p2.content = u"""Ce poste possède des accents : é à ê è""".encode("utf8")

tool69 · Aug 29, 2007

Lawrence D'Oliveiro wrote:

My guess is this is being encoded as a Latin-1 string, but when you try to
output it it goes through the ASCII encoder, which doesn't understand the
accents. Try this:

p2.content = u"""Ce poste possède des accents : é à ê è""".encode("utf8")

Thanks for your answer, Lawrence, but I still get the error.
Any other ideas?

Diez B. Roggisch · Aug 29, 2007

tool69 said:
Hi,

I would like to transform reST contents to HTML, but got problems
with accented chars.

Here's a rather simplified version using SVN Docutils 0.5:

%-------------------------------------------------------------

#!/usr/bin/env python
# -*- coding: utf-8 -*-

This declaration only affects unicode-literals.

from docutils.core import publish_parts

class Post(object):
    def __init__(self, title='', content=''):
        self.title = title
        self.content = content

    def _get_html_content(self):
        return publish_parts(self.content,
                             writer_name="html")["html_body"]
    html_content = property(_get_html_content)

Did you know that you can do this like this:

@property
def html_content(self):
    ...

?
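
A minimal sketch of the decorator form suggested here. The `render` helper is a hypothetical stand-in for `publish_parts(...)["html_body"]`, used only so the example runs without Docutils installed:

```python
def render(text):
    # Hypothetical stand-in for docutils' publish_parts(...)["html_body"].
    return u"<p>%s</p>" % text


class Post(object):
    def __init__(self, title=u"", content=u""):
        self.title = title
        self.content = content

    @property
    def html_content(self):
        # Computed on every attribute access; no separate property() call needed.
        return render(self.content)


post = Post(title=u"Demo", content=u"Hello")
html = post.html_content  # accessed like an attribute, without parentheses
```

The behaviour matches the `property(_get_html_content)` form; the decorator only avoids the extra helper name.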

# Instantiate 2 Post objects
p1 = Post()
p1.title = "First post without accented chars"
p1.content = """This is the first.
...blabla
... end of post..."""

p2 = Post()
p2.title = "Second post with accented chars"
p2.content = """Ce poste possède des accents : é à ê è"""

This needs to be a unicode-literal:

p2.content = u"""Ce poste possède des accents : é à ê è"""

Note the u in front.
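
To illustrate why the prefix matters, here is a rough sketch in Python 3 syntax, where `bytes` plays the role of the plain Python 2 string and `str` plays the role of the Python 2 unicode string. The variable names are made up for the example:

```python
# Python 3 sketch: bytes ~ Python 2 plain string, str ~ Python 2 unicode string.
raw = b"poss\xc3\xa8de"      # the UTF-8 bytes for "possède", as stored in a UTF-8 source file
text = raw.decode("utf-8")   # the text that a u"..." literal would hold directly

byte_count = len(raw)        # counts bytes: the accented letter takes two
char_count = len(text)       # counts characters: the accented letter is one
```

Without the `u`, the literal holds raw bytes whose meaning depends on an encoding; with it, the literal holds characters.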

for post in [p1,p2]:
    print post.title, "\n" +"-"*30
    print post.html_content

%-------------------------------------------------------------

The output gives me :

First post without accented chars
------------------------------
<div class="document">
<p>This is the first.
...blabla
... end of post...</p>
</div>

Second post with accented chars
------------------------------
Traceback (most recent call last):
  File "C:\Documents and Settings\kib\Bureau\Projets\python\dbTest\rest_error.py", line 30, in <module>
    print post.html_content
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe8' in position 39: ordinal not in range(128)

You need to encode a unicode string explicitly into the encoding you want
for output. Otherwise, the default (ascii) is used.

So

print post.html_content.encode("utf-8")

should work.
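
A small sketch of the same idea, written so that it also runs on Python 3 (where the unicode type is called `str`). The `to_bytes` helper is hypothetical and only wraps `.encode()`:

```python
# \xe8 is the character named in the traceback above (e with grave accent).
text = u"Ce poste poss\xe8de des accents"


def to_bytes(s, encoding="utf-8"):
    """Encode text explicitly instead of relying on a default codec."""
    return s.encode(encoding)


# The ascii codec cannot represent the accented character.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# UTF-8 can represent it; the result is a byte string ready for output.
utf8_bytes = to_bytes(text)
```

Which encoding to pick depends on where the output goes; "utf-8" is an assumption here and may not match a given console.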

Diez

tool69 · Aug 29, 2007

Diez B. Roggisch wrote:

tool69 said:

Hi,

I would like to transform reST contents to HTML, but got problems
with accented chars.

Here's a rather simplified version using SVN Docutils 0.5:

%-------------------------------------------------------------

#!/usr/bin/env python
# -*- coding: utf-8 -*-

This declaration only affects unicode-literals.

from docutils.core import publish_parts

class Post(object):
    def __init__(self, title='', content=''):
        self.title = title
        self.content = content

    def _get_html_content(self):
        return publish_parts(self.content,
                             writer_name="html")["html_body"]
    html_content = property(_get_html_content)

Did you know that you can do this like this:

@property
def html_content(self):
    ...

?

I only took some part of code from someone else
(an old TurboGears tutorial if I remember).

But you're right : decorators are better.

This needs to be a unicode-literal:

p2.content = u"""Ce poste possède des accents : é à ê è"""

Note the u in front.

You need to encode a unicode string explicitly into the encoding you want
for output. Otherwise, the default (ascii) is used.

So

print post.html_content.encode("utf-8")

should work.

That solved it : thank you so much.

Guest · Aug 29, 2007

Lawrence said:
My guess is this is being encoded as a Latin-1 string, but when you try to
output it it goes through the ASCII encoder, which doesn't understand the
accents. Try this:

p2.content = u"""Ce poste possède des accents : é à ê è""".encode("utf8")

is there a way to sort this string properly (sorted()?)
I mean first 'a' then 'à' then 'e' etc. (sorted puts accented letters at
the end). Or should I have to provide a comparison function to sorted?

Diez B. Roggisch · Aug 29, 2007

Ricardo said:
is there a way to sort this string properly (sorted()?)
I mean first 'a' then 'à' then 'e' etc. (sorted puts accented letters at
the end). Or should I have to provide a comparison function to sorted?

First of all: please don't hijack threads. Start a new one with your
specific question.

Second: this might be what you are looking for:

http://jtauber.com/blog/2006/01/27/python_unicode_collation_algorithm/

Didn't try it myself though.

Diez

Damjan · Aug 29, 2007

is there a way to sort this string properly (sorted()?)
I mean first 'a' then 'à' then 'e' etc. (sorted puts accented letters at
the end). Or should I have to provide a comparison function to sorted?

After setting the locale...

locale.strcoll()
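
A hedged sketch of what that could look like. The function name and the locale name "fr_FR.UTF-8" are assumptions; that locale may not be installed on a given system, in which case the current locale is kept and the resulting order may simply follow code points:

```python
import functools
import locale


def sort_for_locale(words, locale_name="fr_FR.UTF-8"):
    """Sort words with locale.strcoll under the given collation locale."""
    previous = locale.setlocale(locale.LC_COLLATE)  # remember the current setting
    try:
        try:
            locale.setlocale(locale.LC_COLLATE, locale_name)
        except locale.Error:
            pass  # locale not available here; fall back to the current one
        return sorted(words, key=functools.cmp_to_key(locale.strcoll))
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)  # restore the old setting


words = [u"e", u"\xe0", u"a", u"z"]  # \xe0 is a with grave accent
result = sort_for_locale(words)
```

On Python 2, `sorted(words, cmp=locale.strcoll)` would play the same role as the `cmp_to_key` wrapper used here.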
