How do I display unicode value stored in a string variable using ord()

Paul Rubin

Chris Angelico said:
"asdfqwer"[4:]
'qwer'

That's a not uncommon operation when parsing strings or manipulating
data. You'd need to completely rework your algorithms to maintain a
position somewhere.

Scanning 4 characters (or a few dozen, say) to peel off a token in
parsing a UTF-8 string is no big deal. It gets more expensive if you
want to index far more deeply into the string. I'm asking how often
that is done in real code. Obviously one can concoct hypothetical
examples that would suffer.
 
Chris Angelico

Chris Angelico said:
"asdfqwer"[4:]
'qwer'

That's a not uncommon operation when parsing strings or manipulating
data. You'd need to completely rework your algorithms to maintain a
position somewhere.

Scanning 4 characters (or a few dozen, say) to peel off a token in
parsing a UTF-8 string is no big deal. It gets more expensive if you
want to index far more deeply into the string. I'm asking how often
that is done in real code. Obviously one can concoct hypothetical
examples that would suffer.

Sure, four characters isn't a big deal to step through. But it still
makes indexing and slicing operations O(N) instead of O(1), plus you'd
have to zark the whole string up to where you want to work. It'd be
workable, but you'd have to redo your algorithms significantly; I
don't have a Python example of parsing a huge string, but I've done it
in other languages, and when I can depend on indexing being a cheap
operation, I'll happily do exactly that.

ChrisA
 
Paul Rubin

Chris Angelico said:
Sure, four characters isn't a big deal to step through. But it still
makes indexing and slicing operations O(N) instead of O(1), plus you'd
have to zark the whole string up to where you want to work.

I know some systems chop the strings into blocks of (say) a few
hundred chars, so you can immediately get to the correct
block, then scan into the block to get to the desired char offset.

I don't have a Python example of parsing a huge string, but I've done
it in other languages, and when I can depend on indexing being a cheap
operation, I'll happily do exactly that.

I'd be interested to know what the context was, where you parsed
a big unicode string in a way that required random access to
the nth character in the string.
 
Terry Reedy

print(timeit("c in a", "c = '…'; a = 'a'*1000+c"))
# .6 in 3.2.3, 1.2 in 3.3.0

This does not make sense to me and I will ask about it.

I did ask on the pydev list and paraphrased responses include:
1. 'My system gives opposite ratios.'
2. 'With a default of 1000000 repetitions in a loop, the reported times
are microseconds per operation and thus not practically significant.'
3. 'There is a stringbench.py with a large number of such micro benchmarks.'

I believe there are also whole-application benchmarks that try to mimic
real-world mixtures of operations.

People making improvements must consider performance on multiple systems
and multiple benchmarks. If someone wants to work on search speed, they
cannot just optimize that one operation on one system.
 
Chris Angelico

I'd be interested to know what the context was, where you parsed
a big unicode string in a way that required random access to
the nth character in the string.

It's something I've done in C/C++ fairly often. Take one big fat
buffer, slice it and dice it as you get the information you want out
of it. I'll retain and/or calculate indices (when I'm not using
pointers, but that's a different kettle of fish). Generally, I'm
working with pure ASCII, but port those same algorithms to Python and
you'll easily be able to read in a file in some known encoding and
manipulate it as Unicode.

It's not so much 'random access to the nth character' as an efficient
way of jumping forward. For instance, if I know that the next thing is
a literal string of n characters (that I don't care about), I want to
skip over that and keep parsing. The Adobe Message Format is
particularly noteworthy in this, but it's a stupid format and I don't
recommend people spend too much time reading up on it (unless you like
that sensation of your brain trying to escape through your ear).
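
A minimal sketch of the "jump forward" pattern, assuming a made-up length-prefixed format of `<count>:<payload>` fields. This is not the Adobe format, and `parse_fields` is a hypothetical name used only for illustration. The parser keeps a position and skips each payload by index rather than inspecting it.

```python
def parse_fields(text):
    """Split a string of '<count>:<payload>' fields (a hypothetical format)."""
    fields = []
    pos = 0
    while pos < len(text):
        colon = text.index(":", pos)   # end of the decimal length prefix
        count = int(text[pos:colon])   # number of characters in the payload
        start = colon + 1
        fields.append(text[start:start + count])
        pos = start + count            # jump straight past the payload
    return fields


print(parse_fields("5:hello3:a…c4:qwer"))   # ['hello', 'a…c', 'qwer']
```

Every step here is a slice or an index at a known character offset, which is why the cost of indexing matters for this style of parser.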

ChrisA
 
Paul Rubin

Chris Angelico said:
Generally, I'm working with pure ASCII, but port those same algorithms
to Python and you'll easily be able to read in a file in some known
encoding and manipulate it as Unicode.

If it's pure ASCII, you can use the bytes or bytearray type.

It's not so much 'random access to the nth character' as an efficient
way of jumping forward. For instance, if I know that the next thing is
a literal string of n characters (that I don't care about), I want to
skip over that and keep parsing.

I don't understand how this is supposed to work. You're going to read a
large unicode text file (let's say it's UTF-8) into a single big string?
So the runtime library has to scan the encoded contents to find the
highest numbered codepoint (let's say it's mostly ascii but has a few
characters outside the BMP), expand it all (in this case) to UCS-4
giving 4x memory bloat and requiring decoding all the UTF-8 regardless,
and now we should worry about the efficiency of skipping n characters?

Since you have to decode the n characters regardless, I'd think this
skipping part should only be an issue if you have to do it a lot of
times.
 
Steven D'Aprano

This is a long post. If you don't feel like reading an essay, skip to the
very bottom and read my last few paragraphs, starting with "To recap".


Can you explain the issue of "breaking surrogate pairs apart" a little
more? Switching between encodings based on the string contents seems
silly at first glance.

Forget encodings! We're not talking about encodings. Encodings are used
for converting text as bytes for transmission over the wire or storage on
disk. PEP 393 talks about the internal representation of text within
Python, the C-level data structure.

In 3.2, that data structure depends on a compile-time switch. In a
"narrow build", text is stored using two-bytes per character, so the
string "len" (as in the name of the built-in function) will be stored as

006c 0065 006e

(or possibly 6c00 6500 6e00, depending on whether your system is
LittleEndian or BigEndian), plus object-overhead, which I shall ignore.

Since most identifiers are ASCII, that's already using twice as much
memory as needed. This standard data structure is called UCS-2, and it
only handles characters in the Basic Multilingual Plane, the BMP (roughly
the first 64000 Unicode code points). I'll come back to that.

In a "wide build", text is stored as four-bytes per character, so "len"
is stored as either:

0000006c 00000065 0000006e
6c000000 65000000 6e000000

Now memory is cheap, but it's not *that* cheap, and no matter how much
memory you have, you can always use more.

This system is called UCS-4, and it can handle the entire Unicode
character set, for now and forever. (If we ever need more than four bytes'
worth of characters, it won't be called Unicode.)

Remember I said that UCS-2 can only handle the 64K characters
[technically: code points] in the Basic Multilingual Plane? There's an
extension to UCS-2 called UTF-16 which extends it to the entire Unicode
range. Yes, that's the same name as the UTF-16 encoding, because it's
more or less the same system.

UTF-16 says "let's represent characters in the BMP by two bytes, but
characters outside the BMP by four bytes." There's a neat trick to this:
the BMP doesn't use the entire two-byte range, so there are some byte
pairs which are illegal in UCS-2 -- they don't correspond to *any*
character. UTF-16 used those byte pairs to signal "this is half a
character, you need to look at the next pair for the rest of the
character".

Nifty hey? These pairs-of-pseudocharacters are called "surrogate pairs".
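
As a rough illustration of the trick, here is how a code point above the BMP is split into two 16-bit surrogates. The function name `to_surrogate_pair` is made up for this sketch; the arithmetic is the standard UTF-16 rule, cross-checked against Python's own encoder.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need a surrogate pair")
    offset = code_point - 0x10000       # 20 bits of payload
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x10000)
print(hex(high), hex(low))              # 0xd800 0xdc00

# Cross-check against Python's own UTF-16 encoder.
assert chr(0x10000).encode("utf-16-be") == bytes.fromhex("d800dc00")
```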

Except this comes at a big cost: you can no longer tell how long a string
is by counting the number of bytes, which is fast, because sometimes four
bytes is two characters and sometimes it's one and you can't tell which
it will be until you actually inspect all four bytes.

Copying sub-strings now becomes either slow, or buggy. Say you want to
grab the 10th character in a string. The fast way using UCS-2 is to
simply grab bytes 18 and 19 (remember characters are pairs of bytes and we
start counting at zero) and you're done. Fast and safe if you're willing
to give up the non-BMP characters.

It's also fast and safe if you use UCS-4, but then everything takes twice
as much space, so you probably end up spending so much time copying null
bytes that you're probably slower anyway. Especially when your OS starts
paging memory like mad.

But in UTF-16, indexing can be fast or safe but not both. Maybe bytes 18
and 19 are half of a surrogate pair, and you've now split the pair and
ended up with an invalid string. That's what Python 3.2 does, it fails to
handle surrogate pairs properly:

py> s = chr(0xFFFF + 1)
py> a, b = s
py> a
'\ud800'
py> b
'\udc00'


I've just split a single valid Unicode character into two invalid
characters. Python 3.2 will (probably) mindlessly process those two
non-characters, and the only sign I have that I did something wrong is that
my data is now junk.

Since any character can be a surrogate pair, you have to scan every pair
of bytes in order to index a string, or work out its length, or copy a
substring. It's not enough to just check if the last pair is a surrogate.

When you don't, you have bugs like this from Python 3.2:

py> s = "01234" + chr(0xFFFF + 1) + "6789"
py> s[9] == '9'
False
py> s[9], len(s)
('8', 11)

Which is now fixed in Python 3.3.

So variable-width data structures like UTF-8 or UTF-16 are crap for the
internal representation of strings -- they are either fast or correct but
cannot be both.

But UCS-2 is sub-optimal, because it can only handle the BMP, and UCS-4
is too because ASCII-only strings like identifiers end up being four
times as big as they need to be. 1-byte schemes like Latin-1 are
unspeakable because they only handle 256 characters, fewer if you don't
count the C0 and C1 control codes.

PEP 393 to the rescue! What if you could encode pure-ASCII strings like
"len" using one byte per character, and BMP strings using two bytes per
character (UCS-2), and fall back to four bytes (UCS-4) only when you
really need it?

The benefits are:

* Americans and English-Canadians and Australians and other barbarians of
that ilk who only use ASCII save a heap of memory;

* people who mostly use BMP characters only pay the cost of four bytes
per character for strings that actually *need* four bytes per
character;

* people who use lots of non-BMP characters are no worse off.

The costs are:

* string routines need to be smarter -- they have to handle three
different data structures (ASCII, UCS-2, UCS-4) instead of just one;

* there's a certain amount of overhead when creating a string -- you have
to work out which in-memory format to use, and that's not necessarily
trivial, but at least it's a once-off cost when you create the string;

* people who misunderstand what's going on get all upset over micro-
benchmarks.

Strings are immutable so I don't understand why
not use UTF-8 or UTF-16 for everything. UTF-8 is more efficient in
Latin-based alphabets and UTF-16 may be more efficient for some other
languages. I think even UCS-4 doesn't completely fix the surrogate pair
issue if it means the only thing I can think of.

To recap:

* Variable-byte formats like UTF-8 and UTF-16 mean that basic string
operations are not O(1) but are O(N). That means they are slow, or buggy,
pick one.

* Fixed width UCS-2 doesn't handle the full Unicode range, only the BMP.
That's better than it sounds: the BMP supports most character sets, but
not all. Still, there are people who need the supplementary planes, and
UCS-2 lets them down.

* Fixed width UCS-4 does handle the full Unicode range, without
surrogates, but at the cost of using 2-4 times more string memory for the
vast majority of users.

* PEP 393 doesn't use variable-width characters, but variable-width
strings. Instead of choosing between 1, 2 and 4 bytes per character, it
chooses *per string*. This keeps basic string operations O(1) instead of
O(N), saves memory where possible, while still supporting the full
Unicode range without a compile-time option.
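
A minimal sketch of the "width per string" idea, in pure Python rather than CPython's actual C code. The name `bytes_per_char` is hypothetical. The 1-byte threshold is assumed to be code points below 256, which is what the size measurements posted later in this thread point to; treat the thresholds as an assumption of the sketch.

```python
def bytes_per_char(s):
    """Width a per-string fixed-width scheme would pick for s (1, 2 or 4)."""
    highest = max(map(ord, s), default=0)
    if highest < 0x100:      # every character fits in one byte
        return 1
    if highest < 0x10000:    # every character is in the BMP
        return 2
    return 4                 # at least one supplementary-plane character


for sample in ("len", "abcœ…", "abc" + chr(0x10000)):
    print(ascii(sample), bytes_per_char(sample))
```

The width is decided once, from the widest character in the string, so a single supplementary-plane character makes the whole string use four bytes per character.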
 
Steven D'Aprano

No offense here. But this is an *american* answer.

I am not American.

I am not aware that computers outside of the USA, and Australia, have
unlimited amounts of memory. You must be very lucky.

The same story as the coding of text files, where "utf-8 == ascii" and
the rest of the world doesn't count.

UTF-8 is not ASCII.
 
Steven D'Aprano

As I understand (I think) the underlying mechanism, I can only say, it is
not a surprise that it happens.

Imagine an editor, I type an "a", internally the text is saved as ascii,
then I type an "é", the text can only be saved in at least latin-1. Then
I enter an "€", the text becomes an internal ucs-4 "string". Then I remove
the "€" and so on.

Firstly, that is not what Python does. For starters, € is in the BMP, and
so is nearly every character you're ever going to use unless you are
Asian or a historian using some obscure ancient script. NONE of the
examples you have shown in your emails have included 4-byte characters,
they have all been ASCII or UCS-2.

You are suffering from a misunderstanding about what is going on and
misinterpreting what you have seen.


In *both* Python 3.2 and 3.3, both é and € are represented by two bytes.
That will not change. There is a tiny amount of fixed overhead for
strings, and that overhead is slightly different between the versions,
but you'll never notice the difference.

Secondly, how a text editor or word processor chooses to store the text
that you type is not the same as how Python does it. A text editor is not
going to be creating a new immutable string after every key press. That
will be slow slow SLOW. The usual way is to keep a buffer for each
paragraph, and add and subtract characters from the buffer.

Intuitively I expect there is some kind of slowdown from all these
"string" conversions.

Your intuition is wrong. Strings are not converted from ASCII to UCS-2 to
UCS-4 on the fly, they are converted once, when the string is created.

The tests we ran earlier, e.g.:

('ab…' * 1000).replace('…', 'œ…')

show the *worst possible case* for the new string handling, because all
we do is create new strings. First we create a string 'ab…', then we
create another string 'ab…'*1000, then we create two new strings '…' and
'œ…', and finally we call replace and create yet another new string.

But in real applications, once you have created a string, you don't just
immediately create a new one and throw the old one away. You likely do
work with that string:

steve@runes:~$ python3.2 -m timeit "s = 'abcœ…'*1000; n = len(s); flag = s.startswith(('*', 'a'))"
100000 loops, best of 3: 2.41 usec per loop

steve@runes:~$ python3.3 -m timeit "s = 'abcœ…'*1000; n = len(s); flag = s.startswith(('*', 'a'))"
100000 loops, best of 3: 2.29 usec per loop

Once you start doing *real work* with the strings, the overhead of
deciding whether they should be stored using 1, 2 or 4 bytes begins to
fade into the noise.

When I tested this flexible representation a few months ago, at the
first alpha release, this is precisely what I tested: string
manipulations which force this internal change, and I concluded the
result is not brilliant. Really, a factor of 0.n up to 10.

Like I said, if you really think that there is a significant, repeatable
slow-down on Windows, report it as a bug.

Does any body know a way to get the size of the internal "string" in
bytes?

sys.getsizeof(some_string)

steve@runes:~$ python3.2 -c "from sys import getsizeof as size; print(size('abcœ…'*1000))"
10030
steve@runes:~$ python3.3 -c "from sys import getsizeof as size; print(size('abcœ…'*1000))"
10038


As I said, there is a *tiny* overhead difference. But identifiers will
generally be smaller:

steve@runes:~$ python3.2 -c "from sys import getsizeof as size; print(size(size.__name__))"
48
steve@runes:~$ python3.3 -c "from sys import getsizeof as size; print(size(size.__name__))"
34

You can check the object overhead by looking at the size of the empty
string.
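
A small sketch of that check, assuming CPython (the numbers reported by `sys.getsizeof` are implementation- and build-specific, so no expected output is shown). The helper `per_char_cost` is a made-up name; it subtracts the fixed overhead by comparing two strings of different lengths.

```python
import sys


def per_char_cost(ch, n=100):
    """Bytes added per extra copy of ch, measured with sys.getsizeof."""
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) // n


print("empty string overhead:", sys.getsizeof(""))
for ch in ("e", "é", "€", chr(0x10000)):
    # ascii() keeps the output readable on terminals without these glyphs.
    print(ascii(ch), per_char_cost(ch))
```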
 
Steven D'Aprano

The change does not just benefit ASCII users. It primarily benefits
anybody using a wide unicode build with strings mostly containing only
BMP characters.

Just to be clear:

If you have many strings which are *mostly* BMP, but have one or two non-
BMP characters in *each* string, you will see no benefit.

But if you have many strings which are all BMP, and only a few strings
containing non-BMP characters, then you will see a big benefit.

Even for narrow build users, there is the benefit that
with approximately the same amount of memory usage in most cases, they
no longer have to worry about non-BMP characters sneaking in and
breaking their code.

Yes! +1000 on that.

There is some additional benefit for Latin-1 users, but this has nothing
to do with Python. If Python is going to have the option of a 1-byte
representation (and as long as we have the flexible representation, I
can see no reason not to),

The PEP explicitly states that it only uses a 1-byte format for ASCII
strings, not Latin-1:

"ASCII-only Unicode strings will again use only one byte per character"

and later:

"If the maximum character is less than 128, they use the PyASCIIObject
structure"

and:

"The data and utf8 pointers point to the same memory if the string uses
only ASCII characters (using only Latin-1 is not sufficient)."

then it is going to be Latin-1 by definition,

Certainly not, either in fact or in principle. There are a large number
of 1-byte encodings, Latin-1 is hardly the only one.

because that's what 1-byte Unicode (UCS-1, if you will) is. If you have
an issue with that, take it up with the designers of Unicode.

The designers of Unicode have never created a standard "1-byte Unicode"
or UCS-1, as far as I can determine.

The Unicode standard covers more than a million code points, far too
many to fit in a single byte. There is some historical justification for
using "Unicode" to mean UCS-2, but with the standard being extended
beyond the BMP, that is no longer valid.

See http://www.cl.cam.ac.uk/~mgk25/unicode.html for more details.


I think what you are trying to say is that the Unicode designers
deliberately matched the Latin-1 standard for Unicode's first 256 code
points. That's not the same thing though: there is no Unicode standard
mapping to a single byte format.
 
Steven D'Aprano

"a" will be stored as 1 byte/codepoint.

Adding "é", it will still be stored as 1 byte/codepoint.

Wrong. It will be 2 bytes, just like it already is in Python 3.2.

I don't know where people are getting this myth that PEP 393 uses Latin-1
internally, it does not. Read the PEP, it explicitly states that 1-byte
formats are only used for ASCII strings.

Adding "€", it will still be stored as 2 bytes/codepoint.

That is correct.
 
Steven D'Aprano

The problem with strings containing surrogate pairs is that you could
inadvertently slice the string in the middle of the surrogate pair.

That's the *least* of the problems with surrogate pairs. That would be
easy to fix: check the point of the slice, and back up or forward if
you're on a surrogate pair. But that's not good enough, because the
surrogates could be anywhere in the string. You have to touch every
single character in order to know how many there are.

The problem with surrogate pairs is that they make basic string
operations O(N) instead of O(1).
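
To make the O(N) point concrete, here is a rough sketch that counts characters in a sequence of UTF-16 code units. The helper names are hypothetical, and the input is assumed to be well-formed UTF-16. The count cannot be derived from the number of code units alone; every unit has to be inspected.

```python
import struct


def utf16_code_units(s):
    """Return the UTF-16 code units of s as a list of ints."""
    raw = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


def count_code_points(units):
    """Count characters by skipping the second half of each surrogate pair."""
    return sum(1 for u in units if not 0xDC00 <= u <= 0xDFFF)


s = "01234" + chr(0xFFFF + 1) + "6789"
units = utf16_code_units(s)
print(len(units), count_code_points(units))   # 11 10
```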
 
Peter Otten

Steven said:
Wrong. It will be 2 bytes, just like it already is in Python 3.2.

I don't know where people are getting this myth that PEP 393 uses Latin-1
internally, it does not. Read the PEP, it explicitly states that 1-byte
formats are only used for ASCII strings.

From

Python 3.3.0a4+ (default:10a8ad665749, Jun 9 2012, 08:57:51)
[GCC 4.6.1] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> [sys.getsizeof("é"*i) for i in range(10)]
[49, 74, 75, 76, 77, 78, 79, 80, 81, 82]
>>> [sys.getsizeof("e"*i) for i in range(10)]
[49, 50, 51, 52, 53, 54, 55, 56, 57, 58]
>>> sys.getsizeof("é"*101)-sys.getsizeof("é")
100
>>> sys.getsizeof("e"*101)-sys.getsizeof("e")
100
>>> sys.getsizeof("€"*101)-sys.getsizeof("€")
200

I infer that

(1) both ASCII and Latin1 strings require one byte per character.
(2) Latin1 strings have a constant overhead of 24 bytes (on a 64bit system)
over ASCII-only.
 
Steven D'Aprano

Scanning 4 characters (or a few dozen, say) to peel off a token in
parsing a UTF-8 string is no big deal. It gets more expensive if you
want to index far more deeply into the string. I'm asking how often
that is done in real code.

It happens all the time.

Let's say you've got a bunch of text, and you use a regex to scan through
it looking for a match. Let's ignore the regular expression engine, since
it has to look at every character anyway. But you've done your search and
found your matching text and now want everything *after* it. That's not
exactly an unusual use-case.

mo = re.search(pattern, text)
if mo:
    start, end = mo.span()
    result = text[end:]


Easy-peasy, right? But behind the scenes, you have a problem: how does
Python know where text[end:] starts? With fixed-size characters, that's
O(1): Python just moves forward end*width bytes into the string. Nice and
fast.

With variable-sized characters, Python has to start from the beginning
again, and inspect each byte or pair of bytes. This turns the slice
operation into O(N) and the combined op (search + slice) into O(N**2),
and that starts getting *horrible*.
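
A minimal sketch of what that scan looks like if the text were held as UTF-8 bytes. The function `utf8_offset` is hypothetical and assumes valid UTF-8. To find where character number `end` starts, it has to walk the bytes from the beginning, counting every byte that is not a continuation byte.

```python
def utf8_offset(data, n):
    """Byte offset of the n-th code point (0-based) in valid UTF-8 bytes."""
    seen = 0
    for offset, byte in enumerate(data):
        if byte & 0xC0 != 0x80:    # not a continuation byte: a character starts here
            if seen == n:
                return offset
            seen += 1
    if seen == n:
        return len(data)           # n equals the character count: end of data
    raise IndexError("character index out of range")


text = "aé€" + chr(0x10000) + "z"
data = text.encode("utf-8")
end = 3
# Slicing the bytes at the computed offset matches slicing the str by index.
assert data[utf8_offset(data, end):].decode("utf-8") == text[end:]
```

The loop runs once per byte before the target, so the cost grows with the offset, unlike the fixed-width case where the offset is a single multiplication.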

As always, "everything is fast for small enough N", but you *really*
don't want O(N**2) operations when dealing with large amounts of data.

Insisting that the regex functions only ever return offsets to valid
character boundaries doesn't help you, because the string slice method
cannot know where the indexes came from.

I suppose you could have a "fast slice" and a "slow slice" method, but
really, that sucks, and besides all that does is pass responsibility for
tracking character boundaries to the developer instead of the language,
and you know damn well that they will get it wrong and their code will
silently do the wrong thing and they'll say that Python sucks and we
never used to have this problem back in the good old days with ASCII. Boo
sucks to that.

UCS-4 is an option, since that's fixed-width. But it's also bulky. For
typical users, you end up wasting memory. That is the complaint driving
PEP 393 -- memory is cheap, but it's not so cheap that you can afford to
multiply your string memory by four just in case somebody someday gives
you a character in one of the supplementary planes.

If you have oodles of memory and small data sets, then UCS-4 is probably
all you'll ever need. I hear that the club for people who have all the
memory they'll ever need is holding their annual general meeting in a
phone-booth this year.

You could say "Screw the full Unicode standard, who needs more than 64K
different characters anyway?" Well apart from Asians, and historians, and
a bunch of other people. If you can control your data and make sure no
non-BMP characters are used, UCS-2 is fine -- except Python doesn't
actually use that.

You could do what Python 3.2 narrow builds do: use UTF-16 and leave it up
to the individual programmer to track character boundaries, and we know
how well that works. Luckily the supplementary planes are only rarely
used, and people who need them tend to buy more memory and use wide
builds. People who only need a few non-BMP characters in a narrow build
generally just cross their fingers and hope for the best.

You could add a whole lot more heavyweight infrastructure to strings,
turn them into souped-up ropes-on-steroids. All those extra indexes mean
that you don't save any memory. Because the objects are so much bigger
and more complex, your CPU cache goes to the dogs and your code still
runs slow.

Which leaves us right back where we started, PEP 393.

Obviously one can concoct hypothetical examples that would suffer.

If you think "slicing at arbitrary indexes" is a hypothetical example, I
don't know what to say.
 
Paul Rubin

Steven D'Aprano said:
This is a long post. If you don't feel like reading an essay, skip to the
very bottom and read my last few paragraphs, starting with "To recap".

I'm very flattered that you took the trouble to write that excellent
exposition of different Unicode encodings in response to my post. I can
only hope some readers will benefit from it. I regret that I wasn't
more clear about the perspective I posted from, i.e. that I'm already
familiar with how those encodings work.

After reading all of it, I still have the same skepticism on the main
point as before, but I think I see what the issue in contention is.

This standard data structure is called UCS-2 ... There's an extension
to UCS-2 called UTF-16

My own understanding is UCS-2 simply shouldn't be used any more.
Unicode was historically supposed to be a 16-bit character set, but that
turned out to not be enough, so the supplementary planes were added.
UCS-2 thus became obsolete and UTF-16 superseded it in 1996. UTF-16 in
turn is rather clumsy and the later UTF-8 is better in a lot of ways,
but both of these are at least capable of encoding all the character
codes.

* Variable-byte formats like UTF-8 and UTF-16 mean that basic string
operations are not O(1) but are O(N). That means they are slow, or buggy,
pick one.

This I don't see. What are the basic string operations?

* Examine the first character, or first few characters ("few" = "usually
bounded by a small constant") such as to parse a token from an input
stream. This is O(1) with either encoding.

* Slice off the first N characters. This is O(N) with either encoding
if it involves copying the chars. I guess you could share references
into the same string, but if the slice reference persists while the
big reference is released, you end up not freeing the memory until
later than you really should.

* Concatenate two strings. O(N) either way.

* Find length of string. O(1) either way since you'd store it in
the string header when you build the string in the first place.
Building the string has to have been an O(N) operation in either
representation.

And finally:

* Access the nth char in the string for some large random n, or maybe
get a small slice from some random place in a big string. This is
where fixed-width representation is O(1) while variable-width is O(N).

What I'm not convinced of, is that the last thing happens all that
often.

Meanwhile, an example of the 393 approach failing: I was involved in a
project that dealt with terabytes of OCR data of mostly English text.
So the chars were mostly ascii, but there would be occasional non-ascii
chars including supplementary plane characters, either because of
special symbols that were really in the text, or the typical OCR
confusion emitting those symbols due to printing imprecision. That's a
natural for UTF-8 but the PEP-393 approach would bloat up the memory
requirements by a factor of 4.

py> s = chr(0xFFFF + 1)
py> a, b = s

That looks like Python 3.2 is buggy and that sample should just throw an
error. s is a one-character string and should not be unpackable.

I realize the folks who designed and implemented PEP 393 are very smart
cookies and considered stuff carefully, while I'm just an internet user
posting an immediate impression of something I hadn't seen before (I
still use Python 2.6), but I still have to ask: if the 393 approach
makes sense, why don't other languages do it?

Ropes of UTF-8 segments seems like the most obvious approach and I
wonder if it was considered. By that I mean pick some implementation
constant k (say k=128) and represent the string as a UTF-8 encoded byte
array, accompanied by a vector n//k pointers into the byte array, where
n is the number of codepoints in the string. Then you can reach any
offset analogously to reading a random byte on a disk, by seeking to the
appropriate block, and then reading the block and getting the char you
want within it. Random access is then O(1) though the constant is
higher than it would be with fixed width encoding.
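
Roughly what I have in mind, as a pure-Python sketch rather than anything a real runtime does. The class name `IndexedUtf8` and its attributes are made up for illustration, and it stores one offset per started block of k code points.

```python
class IndexedUtf8:
    """UTF-8 bytes plus the byte offset of every k-th code point (a sketch)."""

    def __init__(self, text, k=128):
        self.k = k
        self.data = text.encode("utf-8")
        self.length = len(text)                 # number of code points
        self.block_offsets = []
        count = 0
        for offset, byte in enumerate(self.data):
            if byte & 0xC0 != 0x80:             # first byte of a code point
                if count % k == 0:
                    self.block_offsets.append(offset)
                count += 1

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        if not 0 <= index < self.length:
            raise IndexError("index out of range")
        offset = self.block_offsets[index // self.k]   # jump to the right block
        for _ in range(index % self.k):                # scan at most k-1 characters
            offset += 1
            while self.data[offset] & 0xC0 == 0x80:
                offset += 1
        end = offset + 1
        while end < len(self.data) and self.data[end] & 0xC0 == 0x80:
            end += 1
        return self.data[offset:end].decode("utf-8")


s = "é" * 300 + chr(0x10000) + "z"
t = IndexedUtf8(s, k=128)
assert all(t[i] == s[i] for i in range(len(s)))
```

Lookup cost is bounded by k regardless of string length, and the index adds one integer per k code points on top of the UTF-8 bytes.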
 
Paul Rubin

Steven D'Aprano said:
result = text[end:]

If end is not near the end of the original string, then this is O(N)
even with fixed-width representation, because of the char copying.

If it is near the end, by knowing where the string data area
ends, I think it should be possible to scan backwards from
the end, recognizing what bytes can be the beginning of code points and
counting off the appropriate number. This is O(1) if "near the end"
means "within a constant".

You could say "Screw the full Unicode standard, who needs more than 64K

No, if you're claiming the language supports Unicode, it should be
the whole standard.

You could do what Python 3.2 narrow builds do: use UTF-16 and leave it
up to the individual programmer to track character boundaries,

I'm surprised the Python 3 implementers even considered that approach,
much less went ahead with it. It's obviously wrong.

You could add a whole lot more heavyweight infrastructure to strings,
turn them into souped-up ropes-on-steroids.

I'm not persuaded that PEP 393 isn't even worse.
 
Chris Angelico

Steven D'Aprano said:
result = text[end:]

If end is not near the end of the original string, then this is O(N)
even with fixed-width representation, because of the char copying.

If it is near the end, by knowing where the string data area
ends, I think it should be possible to scan backwards from
the end, recognizing what bytes can be the beginning of code points and
counting off the appropriate number. This is O(1) if "near the end"
means "within a constant".

Only if you know exactly where the end is (which requires storing and
maintaining a character length - this may already be happening, I
don't know). But that approach means you need to have code for both
ways (forward search or reverse), and of course it relies on your
encoding being reverse-scannable in this way (as UTF-8 is, but not
all).
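
For what it's worth, the reverse scan itself is short in UTF-8. A minimal sketch, assuming valid UTF-8 bytes and a known end position; `utf8_tail` is a hypothetical helper, not an existing API.

```python
def utf8_tail(data, n):
    """Return the last n code points of valid UTF-8 bytes by scanning backwards."""
    if n <= 0:
        return b""
    pos = len(data)
    remaining = n
    while pos > 0 and remaining > 0:
        pos -= 1
        if data[pos] & 0xC0 != 0x80:   # reached the first byte of a character
            remaining -= 1
    return data[pos:]                  # everything, if fewer than n characters exist


data = ("abc€" + chr(0x10000) + "é").encode("utf-8")
print(utf8_tail(data, 2).decode("utf-8") == chr(0x10000) + "é")   # True
```

This works only because UTF-8 continuation bytes are distinguishable from lead bytes; an encoding without that property could not be scanned backwards this way.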

And of course, taking the *entire* rest of the string isn't the only
thing you do. What if you want to take the next six characters after
that index? That would be constant time with a fixed-width storage
format.

ChrisA
 
Paul Rubin

Chris Angelico said:
And of course, taking the *entire* rest of the string isn't the only
thing you do. What if you want to take the next six characters after
that index? That would be constant time with a fixed-width storage
format.

How often is this an issue in practice?

I wonder how other languages deal with this. The examples I can think
of are poor role models:

1. C/C++ - unicode impaired, other than a wchar type

2. Java - bogus UCS-2-like(?) representation for historical reasons.
Also has some modified UTF-8 for reasons that made no sense and
that I don't remember

3. Haskell - basic string type is a linked list of code points.
"hello" is five list nodes. New Data.Text library (much more
efficient) uses something like ropes, I think, with UTF-16 underneath.

4. Erlang - I think like Haskell. Efficiently handles byte blocks.

5. Perl 6 -- ???

6. Ruby - ??? (but probably quite slow like the rest of Ruby)

7. Objective C -- ???

8, 9 ... (any other important ones?)
 
wxjmfauth

About the examples contested by Steven:

eg: timeit.timeit("('ab…' * 10).replace('…', 'œ…')")


And it is good enough to show the problem. Period. The
rest (you have to do this, you should not do this, why
are you using these characters - amazing and stupid
question -) does not count.

The real problem is elsewhere. *Americans* do not want
a character to occupy 4 bytes in *their* memory. The rest
of the world does not count.

The same thing happens with the utf-8 coding scheme.
Technically, it is fine. But after n years of usage,
one should recognize it just became an ascii2. Especially
for those who understand nothing in that field and are
not even aware that characters are "coded". I'm the first
to think this is legitimate.

Memory or "ability to treat all text in the same and equal
way"?

End note. This kind of discussion is not specific to
Python, it always happens when there is some kind of
conflict between ascii and non-ascii users.

Have a nice day.

jmf
 
Steven D'Aprano

Steven D'Aprano wrote:
I don't know where people are getting this myth that PEP 393 uses
Latin-1 internally, it does not. Read the PEP, it explicitly states
that 1-byte formats are only used for ASCII strings.

From

Python 3.3.0a4+ (default:10a8ad665749, Jun 9 2012, 08:57:51) [GCC
4.6.1] on linux
Type "help", "copyright", "credits" or "license" for more information.
import sys
[sys.getsizeof("é"*i) for i in range(10)]
[49, 74, 75, 76, 77, 78, 79, 80, 81, 82]

Interesting. Say, I don't suppose you're using a 64-bit build? Because
that would explain why your sizes are so much larger than mine:

py> [sys.getsizeof("é"*i) for i in range(10)]
[25, 38, 39, 40, 41, 42, 43, 44, 45, 46]


py> [sys.getsizeof("€"*i) for i in range(10)]
[25, 40, 42, 44, 46, 48, 50, 52, 54, 56]

py> c = chr(0xFFFF + 1)
py> [sys.getsizeof(c*i) for i in range(10)]
[25, 44, 48, 52, 56, 60, 64, 68, 72, 76]


On re-reading the PEP more closely, it looks like I did misunderstand the
internal implementation, and strings which fit exactly in Latin-1 will
also use 1 byte per character. There are three structures used:

PyASCIIObject
PyCompactUnicodeObject
PyUnicodeObject

and the third one comes in three variant forms, for 1-byte, 2-byte and 4-
byte data. So I stand corrected.
 
