How do I display unicode value stored in a string variable using ord()

W

wxjmfauth

Sorry guys, I'm not stupid (I think). I can open IDLE with
Py 3.2 ou Py 3.3 and compare strings manipulations. Py 3.3 is
always slower. Period.

Now, the reason. I think it is due the "flexible represention".

Deeper reason. The "boss" do not wish to hear from a (pure)
ucs-4/utf-32 "engine" (this has been discussed I do not know
how many times).

jmf
 
W

wxjmfauth

Sorry guys, I'm not stupid (I think). I can open IDLE with
Py 3.2 ou Py 3.3 and compare strings manipulations. Py 3.3 is
always slower. Period.

Now, the reason. I think it is due the "flexible represention".

Deeper reason. The "boss" do not wish to hear from a (pure)
ucs-4/utf-32 "engine" (this has been discussed I do not know
how many times).

jmf
 
C

Chris Angelico

Sorry guys, I'm not stupid (I think). I can open IDLE with
Py 3.2 ou Py 3.3 and compare strings manipulations. Py 3.3 is
always slower. Period.

Ah, but what about all those other operations that use strings under
the covers? As mentioned, namespace lookups do, among other things.
And how is performance in the (very real) case where a C routine wants
to return a value to Python as a string, where the data is currently
guaranteed to be ASCII (previously using PyUnicode_FromString, now
able to use PyUnicode_FromKindAndData)? Again, I'm sure this has been
gone into in great detail before the PEP was accepted (am I
negative-bikeshedding here? "atomic reactoring"???), and I'm sure that
the gains outweigh the costs.

ChrisA
 
M

Mark Lawrence

Sorry guys, I'm not stupid (I think). I can open IDLE with
Py 3.2 ou Py 3.3 and compare strings manipulations. Py 3.3 is
always slower. Period.

Proof that is acceptable to everybody please, not just yourself.
 
S

Steven D'Aprano

Le samedi 18 août 2012 14:27:23 UTC+2, Steven D'Aprano a écrit :
[...]
The problem with UCS-4 is that every character requires four bytes.
[...]

I'm aware of this (and all the blah blah blah you are explaining). This
always the same song. Memory.

Exactly. The reason it is always the same song is because it is an
important song.

Let me ask. Is Python an 'american" product for us-users or is it a tool
for everybody [*]?

It is a product for everyone, which is exactly why PEP 393 is so
important. PEP 393 means that users who have only a few non-BMP
characters don't have to pay the cost of UCS-4 for every single string in
their application, only for the ones that actually require it. PEP 393
means that using Unicode strings is now cheaper for everybody.

You seem to be arguing that the way forward is not to make Unicode
cheaper for everyone, but to make ASCII strings more expensive so that
everyone suffers equally. I reject that idea.

Is there any reason why non ascii users are somehow penalized compared
to ascii users?

Of course there is a reason.

If you want to represent 1114111 different characters in a string, as
Unicode supports, you can't use a single byte per character, or even two
bytes. That is a fact of basic mathematics. Supporting 1114111 characters
must be more expensive than supporting 128 of them.

But why should you carry the cost of 4-bytes per character just because
someday you *might* need a non-BMP character?


This flexible string representation is a regression (ascii users or
not).

No it is not. It is a great step forward to more efficient Unicode.

And it means that now Python can correctly deal with non-BMP characters
without the nonsense of UTF-16 surrogates:

steve@runes:~$ python3.3 -c "print(len(chr(1114000)))" # Right!
1
steve@runes:~$ python3.2 -c "print(len(chr(1114000)))" # Wrong!
2

without doubling the storage of every string.

This is an important step towards making the full range of Unicode
available more widely.

I recognize in practice the real impact is for many users closed to zero

Then what's the problem?

(including me) but I have shown (I think) that this flexible
representation is, by design, not as optimal as it is supposed to be.

You have not shown any real problem at all.

You have shown untrustworthy, edited timing results that don't match what
other people are reporting.

Even if your timing results are genuine, you haven't shown that they make
any difference for real code that does useful work.
 
W

wxjmfauth

Le samedi 18 août 2012 19:28:26 UTC+2, Mark Lawrence a écrit :
Proof that is acceptable to everybody please, not just yourself.
I cann't, I'm only facing the fact it works slower on my
Windows platform.

As I understand (I think) the undelying mechanism, I
can only say, it is not a surprise that it happens.

Imagine an editor, I type an "a", internally the text is
saved as ascii, then I type en "é", the text can only
be saved in at least latin-1. Then I enter an "€", the text
become an internal ucs-4 "string". The remove the "€" and so
on.

Intuitively I expect there is some kind slow down between
all these "strings" conversion.

When I tested this flexible representation, a few months
ago, at the first alpha release. This is precisely what,
I tested. String manipulations which are forcing this internal
change and I concluded the result is not brillant. Realy,
a factor 0.n up to 10.

This are simply my conclusions.

Related question.

Does any body know a way to get the size of the internal
"string" in bytes? In the narrow or wide build it is easy,
I can encode with the "unicode_internal" codec. In Py 3.3,
I attempted to toy with sizeof and stuct, but without
success.

jmf
 
W

wxjmfauth

Le samedi 18 août 2012 19:28:26 UTC+2, Mark Lawrence a écrit :
Proof that is acceptable to everybody please, not just yourself.
I cann't, I'm only facing the fact it works slower on my
Windows platform.

As I understand (I think) the undelying mechanism, I
can only say, it is not a surprise that it happens.

Imagine an editor, I type an "a", internally the text is
saved as ascii, then I type en "é", the text can only
be saved in at least latin-1. Then I enter an "€", the text
become an internal ucs-4 "string". The remove the "€" and so
on.

Intuitively I expect there is some kind slow down between
all these "strings" conversion.

When I tested this flexible representation, a few months
ago, at the first alpha release. This is precisely what,
I tested. String manipulations which are forcing this internal
change and I concluded the result is not brillant. Realy,
a factor 0.n up to 10.

This are simply my conclusions.

Related question.

Does any body know a way to get the size of the internal
"string" in bytes? In the narrow or wide build it is easy,
I can encode with the "unicode_internal" codec. In Py 3.3,
I attempted to toy with sizeof and stuct, but without
success.

jmf
 
P

Paul Rubin

Steven D'Aprano said:
(There is an extension to UCS-2, UTF-16, which encodes non-BMP characters
using two code points. This is fragile and doesn't work very well,
because string-handling methods can break the surrogate pairs apart,
leaving you with invalid unicode string. Not good.) ....
With PEP 393, each Python string will be stored in the most efficient
format possible:

Can you explain the issue of "breaking surrogate pairs apart" a little
more? Switching between encodings based on the string contents seems
silly at first glance. Strings are immutable so I don't understand why
not use UTF-8 or UTF-16 for everything. UTF-8 is more efficient in
Latin-based alphabets and UTF-16 may be more efficient for some other
languages. I think even UCS-4 doesn't completely fix the surrogate pair
issue if it means the only thing I can think of.
 
W

wxjmfauth

Le samedi 18 août 2012 19:59:18 UTC+2, Steven D'Aprano a écrit :
Le samedi 18 août 2012 14:27:23 UTC+2, Steven D'Aprano a écrit :
[...]
The problem with UCS-4 is that every character requires four bytes.
[...]
I'm aware of this (and all the blah blah blah you are explaining). This
always the same song. Memory.



Exactly. The reason it is always the same song is because it is an

important song.
No offense here. But this is an *american* answer.

The same story as the coding of text files, where "utf-8 == ascii"
and the rest of the world doesn't count.

jmf
 
M

MRAB

Le samedi 18 août 2012 19:28:26 UTC+2, Mark Lawrence a écrit :
I cann't, I'm only facing the fact it works slower on my
Windows platform.

As I understand (I think) the undelying mechanism, I
can only say, it is not a surprise that it happens.

Imagine an editor, I type an "a", internally the text is
saved as ascii, then I type en "é", the text can only
be saved in at least latin-1. Then I enter an "€", the text
become an internal ucs-4 "string". The remove the "€" and so
on.
[snip]

"a" will be stored as 1 byte/codepoint.

Adding "é", it will still be stored as 1 byte/codepoint.

Adding "€", it will still be stored as 2 bytes/codepoint.

But then you wouldn't be adding them one at a time in Python, you'd be
building a list and then joining them together in one operation.
 
R

rusi

Of course there is a reason.

If you want to represent 1114111 different characters in a string, as
Unicode supports, you can't use a single byte per character, or even two
bytes. That is a fact of basic mathematics. Supporting 1114111 characters
must be more expensive than supporting 128 of them.

But why should you carry the cost of 4-bytes per character just because
someday you *might* need a non-BMP character?

I am reminded of: http://answers.microsoft.com/thread/720108ee-0a9c-4090-b62d-bbd5cb1a7605

Original above does not open for me but here's a copy that does:

http://onceuponatimeinindia.blogspot.in/2009/07/hard-drive-weight-increasing.html
 
M

MRAB

Can you explain the issue of "breaking surrogate pairs apart" a little
more? Switching between encodings based on the string contents seems
silly at first glance. Strings are immutable so I don't understand why
not use UTF-8 or UTF-16 for everything. UTF-8 is more efficient in
Latin-based alphabets and UTF-16 may be more efficient for some other
languages. I think even UCS-4 doesn't completely fix the surrogate pair
issue if it means the only thing I can think of.
On a narrow build, codepoints outside the BMP are stored as a surrogate
pair (2 codepoints). On a wide build, all codepoints can be represented
without the need for surrogate pairs.

The problem with strings containing surrogate pairs is that you could
inadvertently slice the string in the middle of the surrogate pair.
 
M

Mark Lawrence

Le samedi 18 août 2012 19:59:18 UTC+2, Steven D'Aprano a écrit :
Le samedi 18 août 2012 14:27:23 UTC+2, Steven D'Aprano a écrit :

The problem with UCS-4 is that every character requires four bytes.

I'm aware of this (and all the blah blah blah you are explaining). This
always the same song. Memory.



Exactly. The reason it is always the same song is because it is an

important song.
No offense here. But this is an *american* answer.

The same story as the coding of text files, where "utf-8 == ascii"
and the rest of the world doesn't count.

jmf

Thinking about it I entirely agree with you. Steven D'Aprano strikes me
as typically American, in the same way that I'm typically Brazilian :)
 
T

Terry Reedy

Sorry guys, I'm not stupid (I think). I can open IDLE with
Py 3.2 ou Py 3.3 and compare strings manipulations. Py 3.3 is
always slower. Period.

You have not tried enough tests ;-).

On my Win7-64 system:
from timeit import timeit

print(timeit(" 'a'*10000 "))
3.3.0b2: .5
3.2.3: .8

print(timeit("c in a", "c = '…'; a = 'a'*10000"))
3.3: .05 (independent of len(a)!)
3.2: 5.8 100 times slower! Increase len(a) and the ratio can be made as
high as one wants!

print(timeit("a.encode()", "a = 'a'*1000"))
3.2: 1.5
3.3: .26

Similar with encoding='utf-8' added to call.

Jim, please stop the ranting. It does not help improve Python. utf-32 is
not a panacea; it has problems of time, space, and system compatibility
(Windows and others). Victor Stinner, whatever he may have once thought
and said, put a *lot* of effort into making the new implementation both
correct and fast.

On your replace example
1.2918679017971044

I do not see the point of changing both length and replacement. For me,
the time is about the same for either replacement. I do see about the
same slowdown ratio for 3.3 versus 3.2 I also see it for pure search
without replacement.

print(timeit("c in a", "c = '…'; a = 'a'*1000+c"))
# .6 in 3.2.3, 1.2 in 3.3.0

This does not make sense to me and I will ask about it.
 
W

wxjmfauth

Le samedi 18 août 2012 20:40:23 UTC+2, rusi a écrit :

I thing it's time to leave the discussion and to go to bed.

You can take the problem the way you wish, Python 3.3 is "slower"
than Python 3.2.

If you see the present status as an optimisation, I'm condidering
this as a regression.

I'm pretty sure a pure ucs-4/utf-32 can only be, by nature,
the correct solution.

To be extreme, tools using pure utf-16 or utf-32 are, at least,
considering all the citizen on this planet in the same way.

jmf
 
M

Mark Lawrence

Le samedi 18 août 2012 20:40:23 UTC+2, rusi a écrit :

I thing it's time to leave the discussion and to go to bed.

In plain English, duck out cos I'm losing.
You can take the problem the way you wish, Python 3.3 is "slower"
than Python 3.2.

I'll ask for the second time. Provide proof that is acceptable to
everybody and not just yourself.
If you see the present status as an optimisation, I'm condidering
this as a regression.

Considering does not equate to proof. Where are the figures which back
up your claim?
I'm pretty sure a pure ucs-4/utf-32 can only be, by nature,
the correct solution.

I look forward to seeing your patch on the bug tracker. If and only if
you can find something that needs patching, which from the course of
this thread I think is highly unlikely.
 
C

Chris Angelico

Can you explain the issue of "breaking surrogate pairs apart" a little
more? Switching between encodings based on the string contents seems
silly at first glance. Strings are immutable so I don't understand why
not use UTF-8 or UTF-16 for everything. UTF-8 is more efficient in
Latin-based alphabets and UTF-16 may be more efficient for some other
languages. I think even UCS-4 doesn't completely fix the surrogate pair
issue if it means the only thing I can think of.

UTF-8 is highly inefficient for indexing. Given a buffer of (say) a
few thousand bytes, how do you locate the 273rd character? You have to
scan from the beginning. The same applies when surrogate pairs are
used to represent single characters, unless the representation leaks
and a surrogate is indexed as two - which is where the breaking-apart
happens.

ChrisA
 
P

Paul Rubin

Chris Angelico said:
UTF-8 is highly inefficient for indexing. Given a buffer of (say) a
few thousand bytes, how do you locate the 273rd character?

How often do you need to do that, as opposed to traversing the string by
iteration? Anyway, you could use a rope-like implementation, or an
index structure over the string.
 
C

Chris Angelico

How often do you need to do that, as opposed to traversing the string by
iteration? Anyway, you could use a rope-like implementation, or an
index structure over the string.

Well, imagine if Python strings were stored in UTF-8. How would you slice it?
'qwer'

That's a not uncommon operation when parsing strings or manipulating
data. You'd need to completely rework your algorithms to maintain a
position somewhere.

ChrisA
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,146
Messages
2,570,832
Members
47,374
Latest member
anuragag27

Latest Threads

Top