Unicode characters

Paul Johnston

Hi,
I have a string which I convert into a list and then loop over,
printing each character's glyph and numeric value.

# -*- coding: utf-8 -*-

thestring = "abcd"
thelist = list(thestring)

for c in thelist:
    print c,
    print ord(c)

This works fine for Latin characters, but when I put in a non-ASCII
character, a two-byte character gives me two characters. For example, an
Arabic alef returns

* 216
* 167

(The first asterisk stands for the empty set symbol, the second for a double "s".)

Putting in sequential characters, i.e. alef, beh, teh marbuta, gives me
sequential listings, i.e.
216 167
216 168
216 169
So it is reading the correct details.


Is there any way to get the c in the for loop to recognise that it is
reading a multi-byte character?
I have followed the info in PEP 0263 and am using Python 2.4.3 Build
12 on a Windows box within Eclipse 3.2.0 and Python plugins 1.2.2.

Cheers Paul
 
limodou

If the string is not a unicode object, it is stored as encoded bytes, so
iterating over it only gives you the individual bytes of each character's
encoding. You can convert it to unicode; then, if a character's value is
less than 128, it is ASCII, otherwise it is a non-ASCII character that may
take several bytes when encoded. For example:

a = 'string'
# encoding_according_your_situation is a placeholder: use whatever
# encoding your bytes are really in, e.g. 'utf-8'
b = unicode(a, encoding_according_your_situation)
for i in b:
    if ord(i) < 128:
        print ord(i), 'ascii'
    else:
        print ord(i), 'multibytes'
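
To make the idea above concrete, here is a minimal sketch that decodes a UTF-8 byte string and reports, for each character, its code point and how many bytes it occupies when encoded. The helper name describe() and the sample text are my own assumptions, not something from the original post, and the characters are written as \u escapes so the file's encoding does not matter. It is written so that it should run on both Python 2 and Python 3.

```python
# Alef, beh, teh marbuta, written as escapes (illustrative sample data).
text = u"\u0627\u0628\u0629"
raw = text.encode("utf-8")  # the kind of byte string the original loop saw


def describe(raw_bytes, encoding="utf-8"):
    """Return a list of (code point, encoded length in bytes) per character.

    describe() is a hypothetical helper used only for this illustration.
    """
    result = []
    for ch in raw_bytes.decode(encoding):
        result.append((ord(ch), len(ch.encode(encoding))))
    return result


for code_point, width in describe(raw):
    if code_point < 128:
        kind = "ascii"
    else:
        kind = "multibytes"
    print("%d %s (%d bytes)" % (code_point, kind, width))
```

Each of the three Arabic letters comes back as a single character that takes two bytes in UTF-8, which is why the byte-wise loop printed two numbers per letter.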
 
Diez B. Roggisch

Use unicode objects instead of byte strings. The string literal above is
_not_ affected by the coding: header whatsoever.

The header only matters for

u"some text"

literals, where it tells Python how to decode the source bytes into a
unicode object.

Normal string literals are just bytes: because your encoding is set
properly in the editor, a multibyte character you type is stored as its
encoded bytes.

In a nutshell: try the above using u"abcd".
Diez
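
A small sketch of the difference described above, contrasting the byte values of a UTF-8 byte string with the code points of the equivalent unicode object. The function names byte_values() and code_points() are assumptions made up for this illustration, and the letters are given as \u escapes rather than typed directly. It should behave the same on Python 2 and Python 3.

```python
text = u"\u0627\u0628\u0629"  # alef, beh, teh marbuta as a unicode object
raw = text.encode("utf-8")    # the same text as a UTF-8 byte string


def byte_values(raw_bytes):
    """Numeric value of every byte (hypothetical helper).

    Slicing one byte at a time keeps this working on Python 2 and 3.
    """
    return [ord(raw_bytes[i:i + 1]) for i in range(len(raw_bytes))]


def code_points(unicode_text):
    """Numeric value of every character (hypothetical helper)."""
    return [ord(ch) for ch in unicode_text]


print(byte_values(raw))   # [216, 167, 216, 168, 216, 169]
print(code_points(text))  # [1575, 1576, 1577]
```

The first list matches the 216 167, 216 168, 216 169 pairs from the original question; the second gives one number per letter, which is what the loop was expected to print.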
 
