Can I get the 8bit-string representation of any unicode string?

wanghz · Feb 12, 2006

Hello, everyone.

I have a problem when I'm processing unicode strings. Is it possible
to get the 8bit-string representation of any unicode string?

Suppose I get a unicode string:
a = u'\xc8\xce\xcf\xcd\xc6\xeb';
then, by
a.encode('latin-1');
I can get the 8bit-string representation of it, that is, the physical
storage format of this string.

But for another kind of unicode string, say:
b = u'\u4efb\u8d24\u9f50';
I have to:
b.encode('utf-8')
to get the 8bit-string format of it.

Since these unicode strings are given by an external library function,
I don't know which kind a unicode string belongs to before I get it at
runtime. So, I wonder if there is a unified way to get the 8bit-string
representation, say, byte-by-byte, of any unicode string?

Thank you very much.

Kent Johnson · Feb 12, 2006

wanghz wrote:

> Hello, everyone.
>
> I have a problem when I'm processing unicode strings. Is it possible
> to get the 8bit-string representation of any unicode string?

Yes, if you can be more precise about what you mean by '8bit-string
representation'. Likely candidates are
b.encode('utf-8')
b.encode('utf_16_be')
b.encode('utf_16_le')
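
To make the difference between these candidates concrete, here is a minimal sketch using the second string from the question. It assumes Python 3, where the u'' prefix is still accepted and encode() returns a bytes object; under the Python 2 used in this thread the results are 8-bit str objects with the same byte values. The variable names are only for illustration.

```python
# The second example string from the question (three CJK characters).
b = u'\u4efb\u8d24\u9f50'

utf8 = b.encode('utf-8')          # three bytes per character for this string
utf16_be = b.encode('utf_16_be')  # two bytes per character, high byte first
utf16_le = b.encode('utf_16_le')  # two bytes per character, low byte first

# Same text, three different byte sequences.
print(len(utf8), len(utf16_be), len(utf16_le))
print(utf16_be != utf16_le)
```

Each result is a valid 8-bit representation of the same text, which is why the encoding has to be chosen explicitly.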

Kent

Fredrik Lundh · Feb 12, 2006

wanghz wrote:

> I have a problem when I'm processing unicode strings. Is it possible
> to get the 8bit-string representation of any unicode string?
>
> Suppose I get a unicode string:
> a = u'\xc8\xce\xcf\xcd\xc6\xeb';
> then, by
> a.encode('latin-1');
> I can get the 8bit-string representation of it, that is, the physical
> storage format of this string.
>
> But for another kind of unicode string, say:
> b = u'\u4efb\u8d24\u9f50';
> I have to:
> b.encode('utf-8')
> to get the 8bit-string format of it.

latin-1 and utf-8 are two different 8-bit representations (encodings) of
Unicode.

> Since these unicode strings are given by an external library function,
> I don't know which kind a unicode string belongs to before I get it at
> runtime. So, I wonder if there is a unified way to get the 8bit-string
> representation, say, byte-by-byte, of any unicode string?

since the Unicode character set contains 1.1 million code points, and a
single byte can contain 256 different values, it should be fairly obvious
that there's no "8 bit byte by byte" representation of a Unicode string.
you need to decide what 8-bit encoding to use, and stick to that.
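
A small sketch of that point, using the two strings from the question. It assumes Python 3 (bytes results, u'' prefix still accepted); the variable names are illustrative only. latin-1 can only represent code points 0 through 255, so it works for the first string and fails for the second, while utf-8 handles both and decodes back to the original text.

```python
a = u'\xc8\xce\xcf\xcd\xc6\xeb'   # every code point is below 256
b = u'\u4efb\u8d24\u9f50'          # every code point is above 255

# latin-1 maps code points 0..255 to one byte each, so it works for a ...
a_latin1 = a.encode('latin-1')

# ... but it has no byte for the characters in b.
try:
    b.encode('latin-1')
    latin1_ok_for_b = True
except UnicodeEncodeError:
    latin1_ok_for_b = False

# utf-8 can encode both strings, and decoding restores the originals.
a_utf8 = a.encode('utf-8')
b_utf8 = b.encode('utf-8')
roundtrip_ok = (a_utf8.decode('utf-8') == a and b_utf8.decode('utf-8') == b)

print(latin1_ok_for_b, len(a_latin1), len(a_utf8), len(b_utf8), roundtrip_ok)
```

So if one encoding must serve for whatever the library returns, utf-8 (or one of the utf-16 variants) is a safe choice, as long as the same encoding is used when decoding.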

</F>

wanghz · Feb 12, 2006

Thank you all for your replies

I may have misunderstood it. I will think about it carefully.

By the way, does Python have an interface like iconv in libc for
C/C++? Or, how can I convert a string from one encoding into
another?

Thank you so much.

Fredrik Lundh · Feb 12, 2006

wanghz wrote:

> I may have misunderstood it. I will think about it carefully.
>
> By the way, does Python have an interface like iconv in libc for
> C/C++? Or, how can I convert a string from one encoding into
> another?

if b is an 8-bit string containing an encoded unicode string,

u = b.decode(encoding)

or

u = unicode(b, encoding)

gives you a unicode string. to encode the unicode string back to another
byte string, use the encode method.

b = u.encode(encoding)
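
Put together, the two steps give an iconv-style conversion. The following is a minimal sketch, assuming Python 3, where the 8-bit string is a bytes object and the unicode() built-in no longer exists (str(b, encoding) is the equivalent). transcode is a hypothetical helper name, not a standard library function.

```python
def transcode(data, src_encoding, dst_encoding):
    """Convert encoded bytes from one encoding to another.

    Hypothetical helper: decode to a Unicode string first, then encode
    that string with the target encoding.
    """
    text = data.decode(src_encoding)   # bytes -> Unicode string
    return text.encode(dst_encoding)   # Unicode string -> bytes


# Example: take the utf-8 form of the string from the question and
# convert it to big-endian utf-16.
utf8_bytes = u'\u4efb\u8d24\u9f50'.encode('utf-8')
utf16_bytes = transcode(utf8_bytes, 'utf-8', 'utf_16_be')
print(len(utf8_bytes), len(utf16_bytes))
```

The source encoding has to be known in advance; decoding with the wrong one either raises UnicodeDecodeError or silently produces the wrong characters.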

</F>

wanghz · Feb 12, 2006

Hi,

I see. Thank you for your help!

Regards,
hongzheng
