How do I display unicode value stored in a string variable using ord()

wxjmfauth · Aug 18, 2012

Sorry guys, I'm not stupid (I think). I can open IDLE with
Py 3.2 or Py 3.3 and compare string manipulations. Py 3.3 is
always slower. Period.

Now, the reason. I think it is due to the "flexible representation".

Deeper reason. The "boss" does not wish to hear about a (pure)
ucs-4/utf-32 "engine" (this has been discussed I do not know
how many times).

jmf

Chris Angelico · Aug 18, 2012

Sorry guys, I'm not stupid (I think). I can open IDLE with
Py 3.2 or Py 3.3 and compare string manipulations. Py 3.3 is
always slower. Period.

Ah, but what about all those other operations that use strings under
the covers? As mentioned, namespace lookups do, among other things.
And how is performance in the (very real) case where a C routine wants
to return a value to Python as a string, where the data is currently
guaranteed to be ASCII (previously using PyUnicode_FromString, now
able to use PyUnicode_FromKindAndData)? Again, I'm sure this has been
gone into in great detail before the PEP was accepted (am I
negative-bikeshedding here? "atomic reactoring"???), and I'm sure that
the gains outweigh the costs.

ChrisA

Mark Lawrence · Aug 18, 2012

Sorry guys, I'm not stupid (I think). I can open IDLE with
Py 3.2 or Py 3.3 and compare string manipulations. Py 3.3 is
always slower. Period.

Proof that is acceptable to everybody please, not just yourself.

Steven D'Aprano · Aug 18, 2012

On Saturday, 18 August 2012 14:27:23 UTC+2, Steven D'Aprano wrote:

[...]
The problem with UCS-4 is that every character requires four bytes.
[...]

I'm aware of this (and all the blah blah blah you are explaining). This
is always the same song. Memory.

Exactly. The reason it is always the same song is because it is an
important song.

Let me ask. Is Python an "American" product for US users or is it a tool
for everybody [*]?

It is a product for everyone, which is exactly why PEP 393 is so
important. PEP 393 means that users who have only a few non-BMP
characters don't have to pay the cost of UCS-4 for every single string in
their application, only for the ones that actually require it. PEP 393
means that using Unicode strings is now cheaper for everybody.

You seem to be arguing that the way forward is not to make Unicode
cheaper for everyone, but to make ASCII strings more expensive so that
everyone suffers equally. I reject that idea.

Is there any reason why non ascii users are somehow penalized compared
to ascii users?

Of course there is a reason.

If you want to represent 1114111 different characters in a string, as
Unicode supports, you can't use a single byte per character, or even two
bytes. That is a fact of basic mathematics. Supporting 1114111 characters
must be more expensive than supporting 128 of them.

But why should you carry the cost of 4-bytes per character just because
someday you *might* need a non-BMP character?

This flexible string representation is a regression (ascii users or
not).

No it is not. It is a great step forward to more efficient Unicode.

And it means that now Python can correctly deal with non-BMP characters
without the nonsense of UTF-16 surrogates:

steve@runes:~$ python3.3 -c "print(len(chr(1114000)))" # Right!
1
steve@runes:~$ python3.2 -c "print(len(chr(1114000)))" # Wrong!
2

without doubling the storage of every string.

This is an important step towards making the full range of Unicode
available more widely.
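
For illustration, the difference can be reproduced on a current interpreter without a narrow build at hand. This is a minimal sketch, assuming Python 3.3 or later; the helper name `utf16_code_units` is made up for this example and does not come from the thread. It counts the 16-bit units that UTF-16 needs, which is what len() reported on a narrow 3.2 build.

```python
def utf16_code_units(s):
    """Count the 16-bit code units needed to encode s as UTF-16.

    Hypothetical helper, used here only to mimic a narrow build's len().
    """
    return len(s.encode("utf-16-le")) // 2

ch = chr(1114000)              # a code point outside the BMP
code_points = len(ch)          # Python 3.3+ counts code points
units = utf16_code_units(ch)   # UTF-16 needs a surrogate pair for it
print(code_points, units)
```

For a string that stays inside the BMP the two counts agree, so the discrepancy only shows up once a non-BMP character is present.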

I recognize in practice the real impact is for many users close to zero

Then what's the problem?

(including me) but I have shown (I think) that this flexible
representation is, by design, not as optimal as it is supposed to be.

You have not shown any real problem at all.

You have shown untrustworthy, edited timing results that don't match what
other people are reporting.

Even if your timing results are genuine, you haven't shown that they make
any difference for real code that does useful work.

wxjmfauth · Aug 18, 2012

On Saturday, 18 August 2012 19:28:26 UTC+2, Mark Lawrence wrote:

Proof that is acceptable to everybody please, not just yourself.

I can't. I'm only facing the fact that it works slower on my
Windows platform.

As I understand (I think) the underlying mechanism, I
can only say it is not a surprise that it happens.

Imagine an editor: I type an "a", internally the text is
saved as ascii, then I type an "é", the text can only
be saved in at least latin-1. Then I enter an "€", the text
becomes an internal ucs-4 "string". Then remove the "€" and so
on.

Intuitively I expect there is some kind of slowdown between
all these "string" conversions.

When I tested this flexible representation a few months
ago, at the first alpha release, this is precisely what
I tested: string manipulations which force this internal
change, and I concluded the result is not brilliant. Really,
a factor of 0.n up to 10.

These are simply my conclusions.

Related question.

Does anybody know a way to get the size of the internal
"string" in bytes? In the narrow or wide build it is easy:
I can encode with the "unicode_internal" codec. In Py 3.3,
I attempted to toy with sizeof and struct, but without
success.

jmf
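
One way to approach the size question on CPython 3.3 and later is sys.getsizeof. The sketch below is an illustration under stated assumptions, not something posted in the thread: it relies on CPython's implementation-specific size accounting, and `bytes_per_char` is a hypothetical helper that estimates the per-character storage by comparing two strings whose lengths differ by one.

```python
import sys

def bytes_per_char(ch, n=100):
    """Estimate the storage used per character in a string made only of ch.

    Hypothetical helper; depends on CPython-specific sys.getsizeof behaviour.
    """
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)

# 'a' (ASCII), 'é' (Latin-1), '€' (BMP), and one non-BMP code point
for ch in ("a", "\u00e9", "\u20ac", chr(0x10FF90)):
    print(hex(ord(ch)), bytes_per_char(ch), sys.getsizeof(ch * 100))
```

The total reported by sys.getsizeof also includes the object header, so only the difference between two lengths says something about the width chosen for the characters.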

Paul Rubin · Aug 18, 2012

Steven D'Aprano said:
(There is an extension to UCS-2, UTF-16, which encodes non-BMP characters
using two code points. This is fragile and doesn't work very well,
because string-handling methods can break the surrogate pairs apart,
leaving you with invalid unicode string. Not good.) ....
With PEP 393, each Python string will be stored in the most efficient
format possible:

Can you explain the issue of "breaking surrogate pairs apart" a little
more? Switching between encodings based on the string contents seems
silly at first glance. Strings are immutable so I don't understand why
not use UTF-8 or UTF-16 for everything. UTF-8 is more efficient in
Latin-based alphabets and UTF-16 may be more efficient for some other
languages. I think even UCS-4 doesn't completely fix the surrogate pair
issue if it means the only thing I can think of.

wxjmfauth · Aug 18, 2012

On Saturday, 18 August 2012 19:59:18 UTC+2, Steven D'Aprano wrote:

On Saturday, 18 August 2012 14:27:23 UTC+2, Steven D'Aprano wrote:

[...]
The problem with UCS-4 is that every character requires four bytes.
[...]

I'm aware of this (and all the blah blah blah you are explaining). This
is always the same song. Memory.

Exactly. The reason it is always the same song is because it is an
important song.

No offense here. But this is an *american* answer.

The same story as the coding of text files, where "utf-8 == ascii"
and the rest of the world doesn't count.

jmf

MRAB · Aug 18, 2012

On Saturday, 18 August 2012 19:28:26 UTC+2, Mark Lawrence wrote:
I can't. I'm only facing the fact that it works slower on my
Windows platform.

As I understand (I think) the underlying mechanism, I
can only say it is not a surprise that it happens.

Imagine an editor: I type an "a", internally the text is
saved as ascii, then I type an "é", the text can only
be saved in at least latin-1. Then I enter an "€", the text
becomes an internal ucs-4 "string". Then remove the "€" and so
on.

[snip]

"a" will be stored as 1 byte/codepoint.

Adding "é", it will still be stored as 1 byte/codepoint.

Adding "€", it will then be stored as 2 bytes/codepoint.

But then you wouldn't be adding them one at a time in Python, you'd be
building a list and then joining them together in one operation.
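
The list-and-join idiom mentioned above might look like the following. It is only a sketch: the variable names and the tiny "editor buffer" scenario are invented for illustration.

```python
# Collect the pieces in a list; a list is cheap to append to and to edit.
pieces = []
for ch in ("a", "\u00e9", "\u20ac"):   # 'a', 'é', '€' typed one at a time
    pieces.append(ch)

pieces.pop()            # remove the '€' again, as in the editor scenario
pieces.append("!")      # keep typing

text = "".join(pieces)  # build the final string in one operation
print(text)
```

With this pattern the string object is created once, at the join, instead of being rebuilt after every keystroke.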

rusi · Aug 18, 2012

Of course there is a reason.

If you want to represent 1114111 different characters in a string, as
Unicode supports, you can't use a single byte per character, or even two
bytes. That is a fact of basic mathematics. Supporting 1114111 characters
must be more expensive than supporting 128 of them.

But why should you carry the cost of 4-bytes per character just because
someday you *might* need a non-BMP character?

I am reminded of: http://answers.microsoft.com/thread/720108ee-0a9c-4090-b62d-bbd5cb1a7605

Original above does not open for me but here's a copy that does:

http://onceuponatimeinindia.blogspot.in/2009/07/hard-drive-weight-increasing.html

MRAB · Aug 18, 2012

Can you explain the issue of "breaking surrogate pairs apart" a little
more? Switching between encodings based on the string contents seems
silly at first glance. Strings are immutable so I don't understand why
not use UTF-8 or UTF-16 for everything. UTF-8 is more efficient in
Latin-based alphabets and UTF-16 may be more efficient for some other
languages. I think even UCS-4 doesn't completely fix the surrogate pair
issue if it means the only thing I can think of.

On a narrow build, codepoints outside the BMP are stored as a surrogate
pair (2 codepoints). On a wide build, all codepoints can be represented
without the need for surrogate pairs.

The problem with strings containing surrogate pairs is that you could
inadvertently slice the string in the middle of the surrogate pair.
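
The effect can be imitated on any recent Python by working on UTF-16 code units directly, which is roughly what indexing on a narrow build did. This is a sketch under that assumption; `utf16_units` is a hypothetical helper and the sample text is arbitrary.

```python
import struct

def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints (hypothetical helper)."""
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

text = "a" + chr(0x1F600) + "b"   # one non-BMP character in the middle
units = utf16_units(text)         # four units: the middle two form a surrogate pair

head = units[:2]                  # a "slice" that cuts the pair in half
ends_in_lone_surrogate = 0xD800 <= head[-1] <= 0xDBFF

try:
    struct.pack("<%dH" % len(head), *head).decode("utf-16-le")
    decodes_cleanly = True
except UnicodeDecodeError:
    decodes_cleanly = False       # the half pair is not valid UTF-16

print([hex(u) for u in units], ends_in_lone_surrogate, decodes_cleanly)
```

The slice ends in a high surrogate with no partner, so the result is no longer valid text.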

Mark Lawrence · Aug 18, 2012

On Saturday, 18 August 2012 19:59:18 UTC+2, Steven D'Aprano wrote:

On Saturday, 18 August 2012 14:27:23 UTC+2, Steven D'Aprano wrote:

[...]
The problem with UCS-4 is that every character requires four bytes.
[...]

I'm aware of this (and all the blah blah blah you are explaining). This
is always the same song. Memory.

Exactly. The reason it is always the same song is because it is an
important song.

No offense here. But this is an *american* answer.

The same story as the coding of text files, where "utf-8 == ascii"
and the rest of the world doesn't count.

jmf

Thinking about it I entirely agree with you. Steven D'Aprano strikes me
as typically American, in the same way that I'm typically Brazilian

Mark Lawrence · Aug 18, 2012

I am reminded of: http://answers.microsoft.com/thread/720108ee-0a9c-4090-b62d-bbd5cb1a7605

Original above does not open for me but here's a copy that does:

http://onceuponatimeinindia.blogspot.in/2009/07/hard-drive-weight-increasing.html

ROFLMAO doesn't adequately sum up how much I laughed.

Terry Reedy · Aug 18, 2012

Sorry guys, I'm not stupid (I think). I can open IDLE with
Py 3.2 or Py 3.3 and compare string manipulations. Py 3.3 is
always slower. Period.

You have not tried enough tests ;-).

On my Win7-64 system:
from timeit import timeit

print(timeit(" 'a'*10000 "))
3.3.0b2: .5
3.2.3: .8

print(timeit("c in a", "c = '…'; a = 'a'*10000"))
3.3: .05 (independent of len(a)!)
3.2: 5.8 100 times slower! Increase len(a) and the ratio can be made as
high as one wants!

print(timeit("a.encode()", "a = 'a'*1000"))
3.2: 1.5
3.3: .26

Similar with encoding='utf-8' added to call.

Jim, please stop the ranting. It does not help improve Python. utf-32 is
not a panacea; it has problems of time, space, and system compatibility
(Windows and others). Victor Stinner, whatever he may have once thought
and said, put a *lot* of effort into making the new implementation both
correct and fast.

On your replace example

1.2918679017971044

I do not see the point of changing both length and replacement. For me,
the time is about the same for either replacement. I do see about the
same slowdown ratio for 3.3 versus 3.2. I also see it for pure search
without replacement.

print(timeit("c in a", "c = '…'; a = 'a'*1000+c"))
# .6 in 3.2.3, 1.2 in 3.3.0

This does not make sense to me and I will ask about it.

wxjmfauth · Aug 18, 2012

On Saturday, 18 August 2012 20:40:23 UTC+2, rusi wrote:

I am reminded of: http://answers.microsoft.com/thread/720108ee-0a9c-4090-b62d-bbd5cb1a7605

Original above does not open for me but here's a copy that does:

http://onceuponatimeinindia.blogspot.in/2009/07/hard-drive-weight-increasing.html

I think it's time to leave the discussion and to go to bed.

You can take the problem the way you wish, Python 3.3 is "slower"
than Python 3.2.

If you see the present status as an optimisation, I consider
it a regression.

I'm pretty sure a pure ucs-4/utf-32 can only be, by nature,
the correct solution.

To be extreme, tools using pure utf-16 or utf-32 are, at least,
considering all the citizens on this planet in the same way.

jmf

Mark Lawrence · Aug 18, 2012

On Saturday, 18 August 2012 20:40:23 UTC+2, rusi wrote:

I think it's time to leave the discussion and to go to bed.

In plain English, duck out cos I'm losing.

You can take the problem the way you wish, Python 3.3 is "slower"
than Python 3.2.

I'll ask for the second time. Provide proof that is acceptable to
everybody and not just yourself.

If you see the present status as an optimisation, I consider
it a regression.

Considering does not equate to proof. Where are the figures which back
up your claim?

I'm pretty sure a pure ucs-4/utf-32 can only be, by nature,
the correct solution.

I look forward to seeing your patch on the bug tracker. If and only if
you can find something that needs patching, which from the course of
this thread I think is highly unlikely.

Chris Angelico · Aug 19, 2012

Can you explain the issue of "breaking surrogate pairs apart" a little
more? Switching between encodings based on the string contents seems
silly at first glance. Strings are immutable so I don't understand why
not use UTF-8 or UTF-16 for everything. UTF-8 is more efficient in
Latin-based alphabets and UTF-16 may be more efficient for some other
languages. I think even UCS-4 doesn't completely fix the surrogate pair
issue if it means the only thing I can think of.

UTF-8 is highly inefficient for indexing. Given a buffer of (say) a
few thousand bytes, how do you locate the 273rd character? You have to
scan from the beginning. The same applies when surrogate pairs are
used to represent single characters, unless the representation leaks
and a surrogate is indexed as two - which is where the breaking-apart
happens.
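
To make the scanning cost concrete, here is a minimal sketch of locating the n-th character in a UTF-8 buffer. The function name `utf8_offset` and the sample bytes are assumptions made for this example; the point is only that the loop has to walk the buffer from the start.

```python
def utf8_offset(buf, n):
    """Return the byte offset of the n-th code point (0-based) in UTF-8 bytes.

    Hypothetical helper: scans linearly, since UTF-8 characters vary in width.
    """
    count = -1
    for pos, byte in enumerate(buf):
        # Bytes of the form 10xxxxxx continue a character; anything else starts one.
        if byte & 0xC0 != 0x80:
            count += 1
            if count == n:
                return pos
    raise IndexError("code point index out of range")

buf = "a\u00e9\u20acb".encode("utf-8")   # 'a', 'é', '€', 'b': 1 + 2 + 3 + 1 bytes
offsets = [utf8_offset(buf, i) for i in range(4)]
print(offsets)
```

With a fixed-width representation the same lookup is a single multiplication, which is the trade-off being discussed here.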

ChrisA

Paul Rubin · Aug 19, 2012

Chris Angelico said:
UTF-8 is highly inefficient for indexing. Given a buffer of (say) a
few thousand bytes, how do you locate the 273rd character?

How often do you need to do that, as opposed to traversing the string by
iteration? Anyway, you could use a rope-like implementation, or an
index structure over the string.

Chris Angelico · Aug 19, 2012

How often do you need to do that, as opposed to traversing the string by
iteration? Anyway, you could use a rope-like implementation, or an
index structure over the string.

Well, imagine if Python strings were stored in UTF-8. How would you slice it?

>>> "asdfqwer"[4:]
'qwer'

That's a not uncommon operation when parsing strings or manipulating
data. You'd need to completely rework your algorithms to maintain a
position somewhere.

ChrisA
