unicode converting

Maxim Kasimov

There are a few questions I can't find answers to in the manual:
1. How do I determine the internal encoding of Python unicode strings (UTF-8, UTF-16, ...)?
2. How do I convert a string to UCS-2?

(Python 2.2.3 on freebsd4)
 
Diez B. Roggisch

Maxim said:
There are a few questions I can't find answers to in the manual:
1. How do I determine the internal encoding of Python unicode strings
(UTF-8, UTF-16, ...)?

It shouldn't be your concern, but you can choose it at build time with
"./configure --enable-unicode=ucs2" or "--enable-unicode=ucs4". You can't
set it to utf-8 or utf-16.
2. how to convert string to UCS-2

s = ... # some ucs-2 string
s.decode("utf-16")

might give you the right results for most cases:

http://mail.python.org/pipermail/python-dev/2002-May/024193.html
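
To make the direction explicit: decode() turns encoded bytes into a unicode string, and encode() goes the other way. Below is a minimal sketch in Python 3 syntax (the thread is about Python 2, where the text type is unicode and literals are written u'...'). The sample bytes are made up for illustration.

```python
# Hypothetical sample: "hi" as UTF-16 bytes with a little-endian BOM.
raw = b"\xff\xfeh\x00i\x00"

# decode: bytes -> text. The BOM tells the codec the byte order and is dropped.
text = raw.decode("utf-16")

# encode: text -> bytes. "utf-16-le" fixes the byte order and writes no BOM.
wire = text.encode("utf-16-le")

print(repr(text))
print(wire)
```

So data that has to leave the program in a UTF-16 or UCS-2 style encoding goes through encode(), not decode().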
 
Maxim Kasimov

Diez said:
It shouldn't be your concern, but you can choose it at build time with
"./configure --enable-unicode=ucs2" or "--enable-unicode=ucs4". You can't
set it to utf-8 or utf-16.
Does that mean Python's internal unicode format is UCS-2 or UCS-4?
I'm concerned with this question because I need to send data to an external
application in UCS-2 encoding.
s = ... # some ucs-2 string
s.decode("utf-16")
Not _from_ UCS-2, but _to_ UCS-2. For example:
s = ... # some utf-16 string
d = encode_to_ucs2(s)
 
Christos TZOTZIOY Georgiou

Does that mean Python's internal unicode format is UCS-2 or UCS-4?
I'm concerned with this question because I need to send data to an external
application in UCS-2 encoding.

If unicode_data references your unicode data, all you have to send is:

unicode_data.encode('utf-16') # maybe utf-16be for network order

You should not care about internal encoding of unicode objects.
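
For reference, here is a small sketch of how the three UTF-16 codec names differ in what they emit. It uses Python 3 syntax, and the sample text is arbitrary; in Python 2 the literal would be u"Hi".

```python
import codecs

text = "Hi"  # arbitrary sample text

with_bom = text.encode("utf-16")   # native byte order, BOM prepended
little = text.encode("utf-16-le")  # little-endian, no BOM
big = text.encode("utf-16-be")     # big-endian (network order), no BOM

# codecs.BOM_UTF16 is the BOM in the machine's native byte order.
print(with_bom.startswith(codecs.BOM_UTF16))
print(little)
print(big)
```

Whether the receiving application expects a BOM, and which byte order it wants, is something to check on its side; the explicit -le and -be names avoid depending on the sending machine.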
 
Leif K-Brooks

Maxim said:
Does that mean Python's internal unicode format is UCS-2 or UCS-4?
I'm concerned with this question because I need to send data to an external
application in UCS-2 encoding.

The internal format Python stores Unicode strings in is an
implementation detail; it has nothing to do with how you send data. To
do that, you encode your string into a suitable encoding:

>>> u'Some Unicode text.'.encode('utf-16')
'\xff\xfeS\x00o\x00m\x00e\x00 \x00U\x00n\x00i\x00c\x00o\x00d\x00e\x00 \x00t\x00e\x00x\x00t\x00.\x00'
 
Maxim Kasimov

Christos said:
If unicode_data references your unicode data, all you have to send is:

unicode_data.encode('utf-16') # maybe utf-16be for network order
Is a UTF-16 string the same as UCS-2? My question is how to get a string encoded as UCS-2.
 
Serge Orlov

Maxim said:
Is a UTF-16 string the same as UCS-2? My question is how to get a string
encoded as UCS-2.

utf-16 is basically a superset of ucs-2. See here for more detail:
http://www.azillionmonkeys.com/qed/unicode.html
If you ensure that ord() of each output character is < 0x10000,
you'll get valid ucs-2 output when you use the utf-16 encoding. If you
build python with --enable-unicode=ucs2, no character can be >= 0x10000,
so you don't have to check. On the other hand: 1) you won't even be able
to input characters >= 0x10000 into your application, 2) premature
optimization is bad, and 3) there is a note in the README: to compile
Python 2.3 with Tkinter, you will need to pass the --enable-unicode=ucs4
flag to ./configure.

Serge.
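
The ord() check described above could be sketched roughly as follows. The name encode_ucs2 is borrowed from Maxim's placeholder earlier in the thread and is hypothetical, as is the byteorder parameter. The sketch uses Python 3 syntax and additionally rejects surrogate code points, on the assumption that the receiving application wants strict UCS-2 (on a narrow Python 2 build a character above 0xFFFF is stored as two surrogates, each of which is below 0x10000).

```python
def encode_ucs2(text, byteorder="little"):
    """Encode text as UCS-2 bytes without a BOM, or raise ValueError.

    Hypothetical helper: the name and the byteorder parameter are
    assumptions made for this sketch, not an existing API.
    """
    if byteorder not in ("little", "big"):
        raise ValueError("byteorder must be 'little' or 'big'")
    for ch in text:
        cp = ord(ch)
        # UCS-2 covers only U+0000..U+FFFF and has no surrogate pairs.
        if cp > 0xFFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError("not representable in UCS-2: U+%04X" % cp)
    codec = "utf-16-le" if byteorder == "little" else "utf-16-be"
    # With every character in the BMP, UTF-16 output is also valid UCS-2.
    return text.encode(codec)


print(encode_ucs2("hi"))
print(encode_ucs2("hi", byteorder="big"))
```

Raising an error instead of silently dropping or replacing characters is a design choice here; a caller that prefers lossy output would need a different policy.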
 
Christos TZOTZIOY Georgiou

3) There is a note in README: To compile
Python2.3 with Tkinter, you will need to pass --enable-unicode=ucs4
flag to ./configure

I thought this applied to Tkinter as pre-built on recent RedHat systems. Does
it also apply to FreeBSD? On Windoze, Mandrake and SuSE python has UCS-2
unicode and Tkinter is working just fine.
 
Maxim Kasimov

Serge said:
utf-16 is basically a superset of ucs-2. See here for more detail:
http://www.azillionmonkeys.com/qed/unicode.html
If you ensure that ord() of each output character is < 0x10000
you'll get valid ucs-2 output if you use utf-16 encoding. If you
build python with --enable-unicode=ucs2 no character can be >= 0x10000
so you don't have to check.

Thank you very much! That's exactly what I need.
 
Serge Orlov

Christos said:
I thought this applied to Tkinter as pre-built on recent RedHat
systems. Does it also apply to FreeBSD?

I don't know. I didn't notice that it was about RedHat.
On Windoze, Mandrake and SuSE python has UCS-2
unicode and Tkinter is working just fine.

Did you build python on Mandrake and SuSE yourself? I had the impression
that ucs-4 builds are preferred on Linux. At least python on RedHat EL3
and SUSE ES9 is built with --enable-unicode=ucs4.

Serge.
 
Christos TZOTZIOY Georgiou

Did you build python on Mandrake and SuSE yourself? I had the impression
that ucs-4 builds are preferred on Linux. At least python on RedHat EL3
and SUSE ES9 is built with --enable-unicode=ucs4.

tzot@tril/home/tzot/tmp
$ py
Python 2.4 (#8, Mar 2 2005, 11:12:44)
[GCC 3.3.3 (SuSE Linux)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.maxunicode
65535
>>> import Tkinter
>>>
You have new mail in /var/mail/tzot
tzot@tril/home/tzot/tmp
$ python
Python 2.3.3 (#1, Aug 31 2004, 13:51:39)
[GCC 3.3.3 (SuSE Linux)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys, Tkinter
>>> sys.maxunicode
1114111
>>>

2.4 built by me, 2.3.3 by SuSE.

I see. So on SuSE 9.1 professional too, Python and Tcl/Tk are pre-built with
ucs-4. My Mandrake installation is at home and I can't check now. Sorry for
the misinformation about SuSE.
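
The same sys.maxunicode check can be wrapped in a small function. This is only a sketch, and the function name is made up for illustration. Note that from Python 3.3 onward the narrow/wide distinction no longer exists and sys.maxunicode is always 1114111.

```python
import sys


def unicode_build_width():
    """Return "ucs2" for a narrow build and "ucs4" for a wide one.

    Hypothetical helper name, used only for this sketch.
    """
    # sys.maxunicode is 65535 (0xFFFF) on narrow builds
    # and 1114111 (0x10FFFF) on wide builds.
    return "ucs2" if sys.maxunicode == 0xFFFF else "ucs4"


print(unicode_build_width())
```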
 
