encoding problems

tool69

Hi,

I would like to transform reST content to HTML, but I am having problems
with accented characters.

Here's a rather simplified version using SVN Docutils 0.5:

%-------------------------------------------------------------

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from docutils.core import publish_parts

class Post(object):
    def __init__(self, title='', content=''):
        self.title = title
        self.content = content

    def _get_html_content(self):
        return publish_parts(self.content,
                             writer_name="html")["html_body"]
    html_content = property(_get_html_content)

# Instantiate 2 Post objects
p1 = Post()
p1.title = "First post without accented chars"
p1.content = """This is the first.
....blabla
.... end of post..."""

p2 = Post()
p2.title = "Second post with accented chars"
p2.content = """Ce poste possède des accents : é à ê è"""

for post in [p1, p2]:
    print post.title, "\n" + "-"*30
    print post.html_content

%-------------------------------------------------------------

The output is:

First post without accented chars
------------------------------
<div class="document">
<p>This is the first.
....blabla
.... end of post...</p>
</div>

Second post with accented chars
------------------------------
Traceback (most recent call last):
  File "C:\Documents and Settings\kib\Bureau\Projets\python\dbTest\rest_error.py", line 30, in <module>
    print post.html_content
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe8' in position 39: ordinal not in range(128)

Any idea what I've missed?

Thanks.
 
Lawrence D'Oliveiro

p2.content = """Ce poste possède des accents : é à ê è"""

My guess is that this is being encoded as a Latin-1 string, but when you try
to output it, it goes through the ASCII encoder, which doesn't understand the
accents. Try this:

p2.content = u"""Ce poste possède des accents : é à ê è""".encode("utf8")
 
tool69

Lawrence D'Oliveiro wrote:
My guess is this is being encoded as a Latin-1 string, but when you try to
output it it goes through the ASCII encoder, which doesn't understand the
accents. Try this:

p2.content = u"""Ce poste possède des accents : é à ê è""".encode("utf8")

Thanks for your answer, Lawrence, but I still get the error.
Any other ideas?
 
Diez B. Roggisch

tool69 said:
Hi,

I would like to transform reST contents to HTML, but got problems
with accented chars.

Here's a rather simplified version using SVN Docutils 0.5:

%-------------------------------------------------------------

#!/usr/bin/env python
# -*- coding: utf-8 -*-


This declaration only affects unicode-literals.

from docutils.core import publish_parts

class Post(object):
    def __init__(self, title='', content=''):
        self.title = title
        self.content = content

    def _get_html_content(self):
        return publish_parts(self.content,
                             writer_name="html")["html_body"]
    html_content = property(_get_html_content)

Did you know that you can write it like this?

@property
def html_content(self):
    ...

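For reference, a minimal sketch of the decorator form as a complete class. `render_stub` is a hypothetical stand-in for the docutils `publish_parts(...)["html_body"]` call, used only so the example runs without docutils installed:

```python
# -*- coding: utf-8 -*-

def render_stub(text):
    # Hypothetical stand-in for publish_parts(text, writer_name="html")["html_body"].
    return u"<p>%s</p>" % text


class Post(object):
    def __init__(self, title=u"", content=u""):
        self.title = title
        self.content = content

    @property
    def html_content(self):
        # Evaluated on every attribute access; replaces the explicit
        # html_content = property(_get_html_content) assignment.
        return render_stub(self.content)


post = Post(u"Second post", u"é à ê è")
html = post.html_content  # accessed like an attribute, no parentheses
```

The behaviour is the same as with `property(_get_html_content)`; the decorator only avoids the separate getter name.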
# Instanciate 2 Post objects
p1 = Post()
p1.title = "First post without accented chars"
p1.content = """This is the first.
...blabla
... end of post..."""

p2 = Post()
p2.title = "Second post with accented chars"
p2.content = """Ce poste possède des accents : é à ê è"""


This needs to be a unicode-literal:

p2.content = u"""Ce poste possède des accents : é à ê è"""

Note the u in front.

for post in [p1, p2]:
    print post.title, "\n" + "-"*30
    print post.html_content

%-------------------------------------------------------------

The output gives me :

First post without accented chars
------------------------------
<div class="document">
<p>This is the first.
...blabla
... end of post...</p>
</div>

Second post with accented chars
------------------------------
Traceback (most recent call last):
  File "C:\Documents and Settings\kib\Bureau\Projets\python\dbTest\rest_error.py", line 30, in <module>
    print post.html_content
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe8' in position 39: ordinal not in range(128)

You need to encode a unicode string into the encoding you want for the
output. Otherwise, the default (ascii) is used.

So

print post.html_content.encode("utf-8")

should work.
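A minimal sketch of that point, assuming a unicode string like the one in the question. The variable names are illustrative and not taken from the original script. Encoding to ASCII fails on the accented characters, while encoding to UTF-8 produces bytes that can be written out and decoded back:

```python
# -*- coding: utf-8 -*-

text = u"Ce poste possède des accents : é à ê è"

# The implicit ASCII encoding is what raises the error in the traceback.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# An explicit encoding turns the unicode string into bytes safely.
utf8_bytes = text.encode("utf-8")

# Decoding with the same codec gives the original text back.
round_trip = utf8_bytes.decode("utf-8")
```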

Diez
 
tool69

Diez B. Roggisch wrote:
tool69 said:
Hi,

I would like to transform reST contents to HTML, but got problems
with accented chars.

Here's a rather simplified version using SVN Docutils 0.5:

%-------------------------------------------------------------

#!/usr/bin/env python
# -*- coding: utf-8 -*-


This declaration only affects unicode-literals.

from docutils.core import publish_parts

class Post(object):
    def __init__(self, title='', content=''):
        self.title = title
        self.content = content

    def _get_html_content(self):
        return publish_parts(self.content,
                             writer_name="html")["html_body"]
    html_content = property(_get_html_content)

Did you know that you can write it like this?

@property
def html_content(self):
    ...

I just took that part of the code from someone else
(an old TurboGears tutorial, if I remember correctly).

But you're right: decorators are better.
This needs to be a unicode-literal:

p2.content = u"""Ce poste possède des accents : é à ê è"""

Note the u in front.



You need to encode a unicode string into the encoding you want for the
output. Otherwise, the default (ascii) is used.

So

print post.html_content.encode("utf-8")

should work.

That solved it: thank you so much.
 
Guest

Lawrence said:
My guess is this is being encoded as a Latin-1 string, but when you try to
output it it goes through the ASCII encoder, which doesn't understand the
accents. Try this:

p2.content = u"""Ce poste possède des accents : é à ê è""".encode("utf8")

Is there a way to sort this string properly (with sorted()?)
I mean first 'a', then 'à', then 'e', etc. (sorted() puts accented letters at
the end). Or do I have to provide a comparison function to sorted()?
 
Diez B. Roggisch

Ricardo said:
is there a way to sort this string properly (sorted()?)
I mean first 'a' then 'à' then 'e' etc. (sorted puts accented letters at
the end). Or should I have to provide a comparison function to sorted?

First of all: please don't hijack threads. Start a new one with your
specific question.

Second: this might be what you are looking for:

http://jtauber.com/blog/2006/01/27/python_unicode_collation_algorithm/

Didn't try it myself though.
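As a simpler, locale-independent alternative (not the Unicode Collation Algorithm from the link above), here is a rough sketch that sorts on the letters with their accents stripped, using the standard `unicodedata` module. The helper name `accent_key` is made up for this example, and the approach ignores language-specific ordering rules:

```python
# -*- coding: utf-8 -*-
import unicodedata


def accent_key(s):
    # Decompose each character into base letter + combining marks (NFD),
    # then drop the combining marks to get the unaccented form.
    decomposed = unicodedata.normalize("NFD", s)
    base = u"".join(c for c in decomposed if not unicodedata.combining(c))
    # Sort on the unaccented form first; the original string breaks ties,
    # so 'a' comes before 'à'.
    return (base.lower(), s)


letters = [u"e", u"à", u"z", u"a", u"é"]
sorted_letters = sorted(letters, key=accent_key)
# sorted_letters is [u"a", u"à", u"e", u"é", u"z"]
```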

Diez
 
Damjan

is there a way to sort this string properly (sorted()?)
I mean first 'a' then 'à' then 'e' etc. (sorted puts accented letters at
the end). Or should I have to provide a comparison function to sorted?

After setting the locale...

locale.strcoll()
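A minimal sketch of that approach, assuming Python 2.7 or later (for `functools.cmp_to_key`) and assuming a French UTF-8 locale is installed under the name `fr_FR.UTF-8`. Locale names vary between systems, so that name is a guess:

```python
# -*- coding: utf-8 -*-
import functools
import locale

letters = [u"e", u"à", u"z", u"a", u"é"]

try:
    # Assumption: this locale name exists on the machine running the code.
    locale.setlocale(locale.LC_COLLATE, "fr_FR.UTF-8")
except locale.Error:
    # Locale not available: keep the current collation rules.
    # Accented letters may then still sort after 'z'.
    pass

# locale.strcoll compares two strings under the active LC_COLLATE setting;
# cmp_to_key adapts it to the key= interface of sorted().
sorted_letters = sorted(letters, key=functools.cmp_to_key(locale.strcoll))
```

On Python 2, `sorted(letters, cmp=locale.strcoll)` does the same thing. The result depends on the locale data installed on the system, so the order is not guaranteed to be identical everywhere.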
 
