How do I display unicode value stored in a string variable using ord()

wxjmfauth

On Sunday, 19 August 2012 at 10:56:36 UTC+2, Steven D'Aprano wrote:
internal implementation, and strings which fit exactly in Latin-1 will

And this is the crucial point. latin-1 is an obsolete and unusable
coding scheme (esp. for European languages).

We come back to the point I mentioned above. Microsoft knows this, ditto
for Apple, ditto for "TeX", ditto for the foundries.
Even "ISO" has recognized its error and produced iso-8859-15.

The question? Why is it still used?

jmf
 
Peter Otten

Steven D'Aprano wrote:
I don't know where people are getting this myth that PEP 393 uses
Latin-1 internally, it does not. Read the PEP, it explicitly states
that 1-byte formats are only used for ASCII strings.

From

Python 3.3.0a4+ (default:10a8ad665749, Jun 9 2012, 08:57:51)
[GCC 4.6.1] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> [sys.getsizeof("é"*i) for i in range(10)]
[49, 74, 75, 76, 77, 78, 79, 80, 81, 82]

Interesting. Say, I don't suppose you're using a 64-bit build? Because
that would explain why your sizes are so much larger than mine:

py> [sys.getsizeof("é"*i) for i in range(10)]
[25, 38, 39, 40, 41, 42, 43, 44, 45, 46]


py> [sys.getsizeof("€"*i) for i in range(10)]
[25, 40, 42, 44, 46, 48, 50, 52, 54, 56]

Yes, I am using a 64-bit build. I thought that

would convey that. The corresponding data structure

typedef struct {
    PyASCIIObject _base;
    Py_ssize_t utf8_length;
    char *utf8;
    Py_ssize_t wstr_length;
} PyCompactUnicodeObject;

makes for 12 extra bytes on 32 bit, and both Py_ssize_t and pointers double
in size (from 4 to 8 bytes) on 64 bit. I'm sure you can do the maths for the
embedded PyASCIIObject yourself.
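
The per-character growth visible in the listings above (1 byte per "é", 2 bytes per "€") can be checked independently of the header size. The following is a minimal sketch, assuming CPython 3.3 or later, where PEP 393 is in effect; the helper name `bytes_per_char` is made up for illustration.

```python
import sys

def bytes_per_char(ch, n=100):
    """Estimate storage per character by differencing two string sizes.

    Subtracting the two sizes cancels the fixed object header, which is
    the part that differs between 32-bit and 64-bit builds.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

for ch in ("a", "é", "€", "\U000110F3"):
    print("U+%04X" % ord(ch), bytes_per_char(ch), "byte(s) per character")
```

Because only the difference is reported, the result should be the same on 32-bit and 64-bit builds, even though the absolute `sys.getsizeof` values are not.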
 
Chris Angelico

The date stamp is different but the Python version is the same

Check out what 'sys.maxunicode' is in each of those Pythons. It's
possible that one is a wide build and the other narrow.

ChrisA
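
The check suggested here takes only a few lines. This is a small sketch; the variable name `is_wide` is illustrative. Narrow builds exist only in Python 3.2 and earlier, so on 3.3 or later this should always report a wide range.

```python
import sys

# 0xFFFF means a narrow build (possible only on Python 3.2 and earlier);
# 0x10FFFF means a wide build, or any Python 3.3+.
is_wide = sys.maxunicode > 0xFFFF
print(hex(sys.maxunicode), "wide" if is_wide else "narrow")
```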
 
wxjmfauth

On Sunday, 19 August 2012 at 11:37:09 UTC+2, Peter Otten wrote:


You know, the technical aspect is one thing. Understanding
the coding of the characters as a whole is something
else. The important point is not the coding per se, the
relevant point is the set of characters a coding may
represent.

You can build the most sophisticated mechanism you wish,
if it does not take that point into account, it will
always fail or be suboptimal.

This is precisely the weak point of this flexible
representation. It uses latin-1 and latin-1 is for
most users simply unusable.

Fascinating, isn't it? Devs are developing sophisticated
tools based on a non-working basis.

jmf
 
Chris Angelico

This is precisely the weak point of this flexible
representation. It uses latin-1 and latin-1 is for
most users simply unusable.

No, it uses Unicode, and as an optimization, attempts to store the
codepoints in less than four bytes for most strings. The fact that a
one-byte storage format happens to look like latin-1 is rather
coincidental.

ChrisA
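
The selection rule described here depends only on the largest code point in the string. The following is a rough sketch of that rule in pure Python, not CPython's actual C implementation; the function name `storage_width` is hypothetical.

```python
def storage_width(s):
    """Return the bytes per code point a PEP 393 style string would need."""
    largest = max(map(ord, s), default=0)
    if largest < 0x100:      # fits in one byte (ASCII and the next 128 code points)
        return 1
    if largest < 0x10000:    # fits in two bytes (rest of the Basic Multilingual Plane)
        return 2
    return 4                 # supplementary planes

for text in ("abc", "café", "prix: 5 €", "\U000110F3"):
    print(repr(text), storage_width(text))
```

No encoding table is consulted anywhere: the one-byte case is simply "every code point is below 256".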
 
Mark Lawrence

About the examples contested by Steven:

eg: timeit.timeit("('ab…' * 10).replace('…', 'œ…')")


And it is good enough to show the problem. Period. The
rest (you have to do this, you should not do this, why
are you using these characters - amazing and stupid
question -) does not count.

The real problem is elsewhere. *Americans* do not want
a character to occupy 4 bytes in *their* memory. The rest
of the world does not count.

The same thing happens with the utf-8 coding scheme.
Technically, it is fine. But after n years of usage,
one should recognize it just became an ascii2. Especially
for those who understand nothing in that field and are
not even aware that characters are "coded". I'm the first
to think, this is legitimate.

Memory or "ability to treat all text in the same and equal
way"?

End note. This kind of discussion is not specific to
Python, it always happens when there is some kind of
conflict between ascii and non ascii users.

Have a nice day.

jmf

Roughly translated. "I've been shot to pieces and having seen Monty
Python and the Holy Grail I know what to do. Run away, run away"
 
Steven D'Aprano

Steven D'Aprano said:
result = text[end:]

if end is not near the end of the original string, then this is O(N) even
with fixed-width representation, because of the char copying.

Technically, yes. But it's a straight copy of a chunk of memory, which
means it's fast: your OS and hardware try to make straight memory
copies as fast as possible. Big-Oh analysis frequently glosses over
implementation details like that.

Of course, that assumption gets shaky when you start talking about extra
large blocks, and it falls apart completely when your OS starts paging
memory to disk.

But if it helps to avoid irrelevant technical details, change it to
text[end:end+10] or something.

if it is near the end, by knowing where the string data area ends, I
think it should be possible to scan backwards from the end, recognizing
what bytes can be the beginning of code points and counting off the
appropriate number. This is O(1) if "near the end" means "within a
constant".

You know, I think you are misusing Big-Oh analysis here. It really
wouldn't be helpful for me to say "Bubble Sort is O(1) if you only sort
lists with a single item". Well, yes, that is absolutely true, but that's
a special case that doesn't give you any insight into why using Bubble
Sort as your general purpose sort routine is a terrible idea.

Using variable-sized strings like UTF-8 and UTF-16 for in-memory
representations is a terrible idea because you can't assume that people
will only ever want to index the first or last character. On average,
you need to scan half the string, one character at a time. In Big-Oh, we
can ignore the factor of 1/2 and just say we scan the string, O(N).

That's why languages tend to use fixed character arrays for strings.
Haskell is an exception, using linked lists which require traversing the
string to jump to an index. The manual even warns:

If you think of a Text value as an array of Char values (which it is
not), you run the risk of writing inefficient code.

An idiom that is common in some languages is to find the numeric offset
of a character or substring, then use that number to split or trim the
searched string. With a Text value, this approach would require two O(n)
operations: one to perform the search, and one to operate from wherever
the search ended.
[end quote]

http://hackage.haskell.org/packages/archive/text/0.11.2.2/doc/html/Data-Text.html
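
To make the O(N) cost concrete, here is a minimal sketch of indexing by code point into UTF-8 bytes. The helper names `lead_byte_width` and `utf8_char_at` are invented for this illustration, and the input is assumed to be valid UTF-8.

```python
def lead_byte_width(byte):
    """Number of bytes in the UTF-8 sequence that starts with `byte`."""
    if byte < 0x80:
        return 1
    if byte >> 5 == 0b110:
        return 2
    if byte >> 4 == 0b1110:
        return 3
    if byte >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte: %#x" % byte)

def utf8_char_at(data, index):
    """Return the code point at position `index` in UTF-8 encoded `data`.

    Sequences vary in width, so the only way to find position `index`
    is to step over every earlier code point: O(N) in the index.
    """
    pos = 0
    for _ in range(index):
        pos += lead_byte_width(data[pos])   # IndexError if we run off the end
    width = lead_byte_width(data[pos])
    return data[pos:pos + width].decode("utf-8")

data = "aé€\U000110F3z".encode("utf-8")
print([utf8_char_at(data, i) for i in range(5)])
```

With a fixed-width representation the same lookup is a single multiplication and a memory read, which is the trade-off being argued over here.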
 
wxjmfauth

On Sunday, 19 August 2012 at 12:26:44 UTC+2, Chris Angelico wrote:
No, it uses Unicode, and as an optimization, attempts to store the
codepoints in less than four bytes for most strings. The fact that a
one-byte storage format happens to look like latin-1 is rather
coincidental.

And this is the common basic mistake. You do not push your
argument far enough. A character may "fall" accidentally in latin-1.
The problem lies in those European characters which cannot fall in this
coding. This *is* the cause of the negative side effects.
If you are using a correct coding scheme, like cp1252, mac-roman or
iso-8859-15, you will never see such a negative side effect.
Again, the problem is not the result, the encoded character. The critical
part is the character which may cause this side effect.
You should think "character set" and not encoded "code point", considering
this kind of expression makes sense in an 8-bit coding scheme.

jmf
 
Dave Angel

On Sunday, 19 August 2012 at 12:26:44 UTC+2, Chris Angelico wrote:
And this is the common basic mistake. You do not push your
argument far enough. A character may "fall" accidentally in latin-1.
The problem lies in those European characters which cannot fall in this
coding. This *is* the cause of the negative side effects.
If you are using a correct coding scheme, like cp1252, mac-roman or
iso-8859-15, you will never see such a negative side effect.
Again, the problem is not the result, the encoded character. The critical
part is the character which may cause this side effect.
You should think "character set" and not encoded "code point", considering
this kind of expression makes sense in an 8-bit coding scheme.

jmf

But that choice was made decades ago when Unicode picked its second 128
characters. The internal form used in this PEP is simply the low-order
byte of the Unicode code point. Trying to scan the string deciding if
converting to cp1252 (for example) would be a much more expensive
operation than seeing how many bytes it'd take for the largest code point.
 
Dave Angel

(pardon the resend, but I accidentally omitted a couple of words)
On Sunday, 19 August 2012 at 12:26:44 UTC+2, Chris Angelico wrote:
And this is the common basic mistake. You do not push your
argument far enough. A character may "fall" accidentally in latin-1.
The problem lies in those European characters which cannot fall in this
coding. This *is* the cause of the negative side effects.
If you are using a correct coding scheme, like cp1252, mac-roman or
iso-8859-15, you will never see such a negative side effect.
Again, the problem is not the result, the encoded character. The critical
part is the character which may cause this side effect.
You should think "character set" and not encoded "code point", considering
this kind of expression makes sense in an 8-bit coding scheme.

jmf

But that choice was made decades ago when Unicode picked its second 128
characters. The internal form used in this PEP is simply the low-order
byte of the Unicode code point. Trying to scan the string deciding if
converting to cp1252 (for example) would work, would be a much more
expensive operation than seeing how many bytes it'd take for the largest
code point.

The 8 bit form is used if all the code points are less than 256. That
is a simple description, and simple code. As several people have said,
the fact that this byte matches one of the DECODED forms is coincidence.
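
A short sketch of why the match with latin-1 falls out for free while cp1252 would need a table lookup. The helper name `low_order_bytes` is hypothetical; the codec names are the standard Python ones.

```python
def low_order_bytes(s):
    """Store each code point as one byte; only valid if all are below 256."""
    return bytes(ord(ch) for ch in s)

s = "café"
# Latin-1 assigns byte N to code point N, so the two always agree.
print(low_order_bytes(s) == s.encode("latin-1"))

# cp1252 can represent the euro sign, but only through a mapping table:
euro = "€"
print(hex(ord(euro)))            # the code point itself is far above 255
print(euro.encode("cp1252"))     # cp1252 maps it to a single byte
```

The first comparison holds for any string whose code points are all below 256, which is why finding the largest code point is the only scan needed.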
 
wxjmfauth

On Sunday, 19 August 2012 at 14:29:17 UTC+2, Dave Angel wrote:
But that choice was made decades ago when Unicode picked its second 128
characters. The internal form used in this PEP is simply the low-order
byte of the Unicode code point. Trying to scan the string deciding if
converting to cp1252 (for example) would be a much more expensive
operation than seeing how many bytes it'd take for the largest code point.

You are absolutely right. (I'm quite comfortable with Unicode).
If Python wishes to perpetuate this, let's call it, design mistake
or annoyance, it will continue to live with problems.

People (tools) who chose pure utf-16 or utf-32 are not suffering
from this issue.

*My* final comment on this thread.

In August 2012, after 20 years of development, Python is not
able to display a piece of text correctly on a Windows console
(eg cp65001).

I downloaded the go language, zero experience, I did not manage
to display a piece of text incorrectly. (This is by the way *the*
reason why I tested it). Where the problems are coming from, I have
no idea.

I find this situation quite comic. Python is able to
produce this:
'0x1.199999999999ap+0'

but it is not able to display a piece of text!

Try to convince end users IEEE 754 is more important than the
ability to read/write a piece of text, something a 6-year-old kid
has learned at school :)

(I'm not suffering from this kind of effect, as a Windows user,
I'm always working via gui, it still remains, the problem exists.)

Regards,
jmf
 
Steven D'Aprano

My own understanding is UCS-2 simply shouldn't be used any more.

Pretty much. But UTF-16 with lax support for surrogates (that is,
surrogates are included but treated as two characters) is essentially
UCS-2 with the restriction against surrogates lifted. That's what Python
currently does, and Javascript.

http://mathiasbynens.be/notes/javascript-encoding
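
To illustrate what "treated as two characters" means, here is a small sketch of how a supplementary-plane code point becomes a surrogate pair in UTF-16. The function name `to_surrogates` is made up for this example; the arithmetic is the standard UTF-16 rule.

```python
def to_surrogates(code_point):
    """Split a supplementary-plane code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)      # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits of the offset
    return high, low

ch = "\U000110F3"
units = len(ch.encode("utf-16-le")) // 2   # number of 16-bit code units
print(len(ch), units, [hex(u) for u in to_surrogates(ord(ch))])
```

On a narrow build of Python 3.2 or earlier, `len(ch)` would report the number of code units; with PEP 393 it reports the number of code points.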

The reality is that support for the Unicode supplementary planes is
pretty poor. Even when applications support it, most fonts don't have
glyphs for the characters. Anything which makes handling of Unicode
supplementary characters better is a step forward.

This I don't see. What are the basic string operations?

The ones I'm specifically referring to are indexing and copying
substrings. There may be others.

* Examine the first character, or first few characters ("few" = "usually
bounded by a small constant") such as to parse a token from an input
stream. This is O(1) with either encoding.

That's actually O(K), for K = "a few", whatever "a few" means. But we
know that anything is fast for small enough N (or K in this case).

* Slice off the first N characters. This is O(N) with either encoding
if it involves copying the chars. I guess you could share references
into the same string, but if the slice reference persists while the
big reference is released, you end up not freeing the memory until
later than you really should.

As a first approximation, memory copying is assumed to be free, or at
least constant time. That's not strictly true, but Big Oh analysis is
looking at algorithmic complexity. It's not a substitute for actual
benchmarks.

Meanwhile, an example of the 393 approach failing: I was involved in a
project that dealt with terabytes of OCR data of mostly English text.

I assume that this wasn't one giant multi-terabyte string.

So the chars were mostly ascii, but there would be occasional non-ascii
chars including supplementary plane characters, either because of
special symbols that were really in the text, or the typical OCR
confusion emitting those symbols due to printing imprecision. That's a
natural for UTF-8 but the PEP-393 approach would bloat up the memory
requirements by a factor of 4.

Not necessarily. Presumably you're scanning each page into a single
string. Then only the pages containing a supplementary plane char will be
bloated, which is likely to be rare. Especially since I don't expect your
OCR application would recognise many non-BMP characters -- what does
U+110F3, "SORA SOMPENG DIGIT THREE", look like? If the OCR software
doesn't recognise it, you can't get it in your output. (If you do, the
OCR software has a nasty bug.)

Anyway, in my ignorant opinion the proper fix here is to tell the OCR
software not to bother trying to recognise Imperial Aramaic, Domino
Tiles, Phaistos Disc symbols, or Egyptian Hieroglyphs if you aren't
expecting them in your source material. Not only will the scanning go
faster, but you'll get fewer wrong characters.
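
The per-string nature of the cost can be sketched as follows, assuming CPython 3.3 or later. The page length and the one-in-a-hundred ratio are invented figures for illustration, not data from the OCR project.

```python
import sys

page = "x" * 10000                  # stand-in for one page of ASCII OCR text
odd_page = page + "\U000110F3"      # same page plus one supplementary-plane character

# Hypothetical corpus: 1 page in 100 contains such a character.
pages = [page] * 99 + [odd_page]
total = sum(sys.getsizeof(p) for p in pages)
all_ascii = sys.getsizeof(page) * len(pages)

print("one affected page grows by a factor of",
      round(sys.getsizeof(odd_page) / sys.getsizeof(page), 2))
print("the whole corpus grows by a factor of",
      round(total / all_ascii, 2))
```

Only the affected string pays the four-bytes-per-character price, so the overall factor depends on how many pages contain such characters.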


[...]
I realize the folks who designed and implemented PEP 393 are very smart
cookies and considered stuff carefully, while I'm just an internet user
posting an immediate impression of something I hadn't seen before (I
still use Python 2.6), but I still have to ask: if the 393 approach
makes sense, why don't other languages do it?

There has to be a first time for everything.

Ropes of UTF-8 segments seems like the most obvious approach and I
wonder if it was considered.

Ropes have been considered and rejected because while they are
asymptotically fast, in common cases the added complexity actually makes
them slower. Especially for immutable strings where you aren't inserting
into the middle of a string.

http://mail.python.org/pipermail/python-dev/2000-February/002321.html

PyPy has revisited ropes and uses, or at least used, ropes as their
native string data structure. But that's ropes of *bytes*, not UTF-8.

http://morepypy.blogspot.com.au/2007/11/ropes-branch-merged.html
 
Steven D'Aprano

This is precisely the weak point of this flexible representation. It
uses latin-1 and latin-1 is for most users simply unusable.

That's very funny.

Are you aware that your post is entirely Latin-1?

Fascinating, isn't it? Devs are developing sophisticated tools based on a
non-working basis.

At the end of the day, PEP 393 fixes some major design limitations of the
Unicode implementation in the "narrow build" Python, while saving memory
for people using the "wide build". Everybody wins here. Your objection
appears to be based more on some sort of philosophical objection to
Latin-1 than on any genuine problem.
 
Mark Lawrence

On Sunday, 19 August 2012 at 14:29:17 UTC+2, Dave Angel wrote:

You are absolutely right. (I'm quite comfortable with Unicode).
If Python wishes to perpetuate this, let's call it, design mistake
or annoyance, it will continue to live with problems.

Please give a precise description of the design mistake and what you
would do to correct it.

People (tools) who chose pure utf-16 or utf-32 are not suffering
from this issue.

*My* final comment on this thread.

In August 2012, after 20 years of development, Python is not
able to display a piece of text correctly on a Windows console
(eg cp65001).

Examples please.

I downloaded the go language, zero experience, I did not manage
to display a piece of text incorrectly. (This is by the way *the*
reason why I tested it). Where the problems are coming from, I have
no idea.

I find this situation quite comic. Python is able to
produce this:

'0x1.199999999999ap+0'

but it is not able to display a piece of text!

So you keep saying, but when asked for examples or evidence nothing gets
produced.

Try to convince end users IEEE 754 is more important than the
ability to read/write a piece of text, something a 6-year-old kid
has learned at school :)

(I'm not suffering from this kind of effect, as a Windows user,
I'm always working via gui, it still remains, the problem exists.)

Windows is a law unto itself. Its problems are hardly specific to Python.

Regards,
jmf

Now two or three times you've said you're going but have come back. If
you come again could you please provide examples and or evidence of what
you're on about, because you still have me baffled.
 
wxjmfauth

On Sunday, 19 August 2012 at 15:46:34 UTC+2, Mark Lawrence wrote:
Please give a precise description of the design mistake and what you
would do to correct it.

Examples please.

So you keep saying, but when asked for examples or evidence nothing gets
produced.

Windows is a law unto itself. Its problems are hardly specific to Python.

Now two or three times you've said you're going but have come back. If
you come again could you please provide examples and or evidence of what
you're on about, because you still have me baffled.

--
Cheers.

Mark Lawrence.
Yesterday, I went to bed.
More seriously.

I cannot give you more numbers than those I gave.
As an end user, I noticed and experienced that my random tests
are always slower in Py3.3 than in Py3.2 on my Windows platform.

It is up to you, the core developers, to give an explanation
of this behaviour.

As I understand the coding of characters a little bit,
I pointed out that this is most probably due to this flexible
string representation (with arguments appearing randomly
in the misc. messages, mainly latin-1).

I cannot do more.

(I stupidly spoke about factors 0.1 to ..., you should
of course read 1.1 to ...)

jmf
 
Mark Lawrence

I cannot give you more numbers than those I gave.
As an end user, I noticed and experienced that my random tests
are always slower in Py3.3 than in Py3.2 on my Windows platform.

Once again you refuse to supply anything to back up what you say.

It is up to you, the core developers, to give an explanation
of this behaviour.

Core developers cannot give an explanation for something that doesn't
exist, except in your imagination. Unless you can produce the evidence
that supports your claims, including details of OS, benchmarks used and
so on and so forth.

As I understand the coding of characters a little bit,
I pointed out that this is most probably due to this flexible
string representation (with arguments appearing randomly
in the misc. messages, mainly latin-1).

I cannot do more.

(I stupidly spoke about factors 0.1 to ..., you should
of course read 1.1 to ...)

jmf

I suspect that I'll be dead and buried long before you can produce
anything concrete in the way of evidence. I've thrown down the gauntlet
several times, do you now have the courage to pick it up, or are you
going to resort to the FUD approach that you've been using throughout
this thread?
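
For reference, a reproducible report of the kind being asked for might look like the sketch below. The statement is the one quoted earlier in the thread; the repeat counts and the variable names `stmt` and `best` are arbitrary choices made here, and no timing results are implied.

```python
import platform
import sys
import timeit

# Statement quoted earlier in the thread.
stmt = "('ab…' * 10).replace('…', 'œ…')"

# Take the best of several runs to reduce noise from other processes.
best = min(timeit.repeat(stmt, number=100000, repeat=5))

print("OS:      ", platform.platform())
print("Python:  ", sys.version.replace("\n", " "))
print("Best of 5: %.4f s per 100000 loops" % best)
```

Running the same script under each interpreter on the same machine gives numbers that others can compare and try to reproduce.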
 