Andrew Fong
I need to ...
1) Truncate long unicode (UTF-8) strings based on their length in
BYTES. For example, u'\u4000\u4001\u4002 abc' has a length of 7 but
takes up 13 bytes in UTF-8. Since u'\u4000' takes up 3 bytes, I want
truncate(u'\u4000\u4001\u4002 abc',3) == u'\u4000' -- as compared to
u'\u4000\u4001\u4002 abc'[:3] == u'\u4000\u4001\u4002'.
2) I don't want to accidentally chop any unicode characters in half.
If the byte limit would normally cut a unicode character in
two, then I just want to drop the whole character, not leave an orphaned
byte. So truncate(u'\u4000\u4001\u4002 abc',4) == u'\u4000' ... as
opposed to getting UnicodeDecodeError.
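The figures in the two points above can be checked directly. This is only a small sketch; the variable names (s, encoded, char_count, byte_count, cut_is_clean) are made up for illustration, and the u'' literals should work on Python 2.6 as well as 3.3 and later.

```python
s = u'\u4000\u4001\u4002 abc'
encoded = s.encode('utf-8')

char_count = len(s)        # 7 characters
byte_count = len(encoded)  # 13 bytes: 3 * 3 for the CJK characters + 4 ASCII

# Slicing the unicode string counts characters, not bytes.
first_three_chars = s[:3]  # u'\u4000\u4001\u4002'

# Slicing the encoded bytes at 4 cuts the second character in half,
# so a strict decode of that slice fails.
try:
    encoded[:4].decode('utf-8')
    cut_is_clean = True
except UnicodeDecodeError:
    cut_is_clean = False
```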
I'm using Python 2.6, so I have access to things like bytearray. Are
there any built-in ways to do something like this already? Or do I
just have to iterate over the unicode string?
-- Andrew
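
One possible approach that avoids iterating over the string, offered as a sketch rather than a definitive answer: encode to UTF-8, slice the bytes, and decode with the 'ignore' error handler so that an incomplete trailing sequence is dropped. The function name truncate comes from the examples above; the parameter names text and max_bytes are assumptions. It also assumes the input is a unicode string and that silently discarding a partial character is acceptable.

```python
def truncate(text, max_bytes):
    """Return the longest prefix of text whose UTF-8 encoding
    fits in max_bytes, without splitting a character."""
    if max_bytes <= 0:
        return text[:0]  # empty string of the same type
    encoded = text.encode('utf-8')
    if len(encoded) <= max_bytes:
        return text  # already short enough, nothing to cut
    # The bytes came from a valid encode, so the only invalid part of
    # the slice can be a cut-off sequence at the very end; 'ignore'
    # drops those orphaned bytes instead of raising UnicodeDecodeError.
    return encoded[:max_bytes].decode('utf-8', 'ignore')
```

With the string from the post, truncate(u'\u4000\u4001\u4002 abc', 3) and truncate(u'\u4000\u4001\u4002 abc', 4) both give u'\u4000'. The cost is one full encode of the input, which may matter for very long strings.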