How do I display the Unicode value stored in a string variable using ord()?
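For reference, a minimal sketch of what the thread title asks for, assuming
Python 3 (the example character is arbitrary):

s = "€"                             # any one-character string
print(ord(s))                       # 8364, the decimal code point
print(hex(ord(s)))                  # 0x20ac
print("U+{:04X}".format(ord(s)))    # U+20AC, the usual Unicode notation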

DJC

Not necessarily. Presumably you're scanning each page into a single
string. Then only the pages containing a supplementary plane char will be
bloated, which is likely to be rare. Especially since I don't expect your
OCR application would recognise many non-BMP characters -- what does
U+110F3, "SORA SOMPENG DIGIT THREE", look like? If the OCR software
doesn't recognise it, you can't get it in your output. (If you do, the
OCR software has a nasty bug.)

Anyway, in my ignorant opinion the proper fix here is to tell the OCR
software not to bother trying to recognise Imperial Aramaic, Domino
Tiles, Phaistos Disc symbols, or Egyptian Hieroglyphs if you aren't
expecting them in your source material. Not only will the scanning go
faster, but you'll get fewer wrong characters.

Consider the automated recognition of a CAPTCHA. As the chars have to be
entered by the user on a keyboard, only the most basic charset can be
used, so the problem of which chars are possible is quite limited.
 
wxjmfauth

On Sunday 19 August 2012 16:48:48 UTC+2, Mark Lawrence wrote:
Once again you refuse to supply anything to back up what you say.

Core developers cannot give an explanation for something that doesn't
exist, except in your imagination. Unless you can produce the evidence
that supports your claims, including details of OS, benchmarks used and
so on and so forth.

I suspect that I'll be dead and buried long before you can produce
anything concrete in the way of evidence. I've thrown down the gauntlet
several times, do you now have the courage to pick it up, or are you
going to resort to the FUD approach that you've been using throughout
this thread?

--
Cheers.

Mark Lawrence.

I do not remember the tests I have done at the time of the first alpha
release. It was with an interactive interpreter. I paid particular
attention to testing the chars you can find in the range 128..256
in all 8-bit coding schemes, chars I suspected to be problematic.

Here is a short test again, a single random test, the first
idea that came to my mind.

Py 3.2.3: 4.99396356635981

Py 3.3b2: 7.560455708007855

Maybe not so demonstrative. It shows at least that we
are far away from the 10-30% "announced".
195.31250000000003


jmf
 
Terry Reedy

About the examples contested by Steven:
eg: timeit.timeit("('ab…' * 10).replace('…', 'œ…')")
And it is good enough to show the problem. Period.

Repeating a false claim over and over does not make it true. Two people
on pydev claim that 3.3 is *faster* on their systems (one unspecified,
one OSX10.8).
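For anyone who wants to reproduce the comparison themselves, here is a
minimal sketch using the timeit module directly; the statements are the
ones quoted in this thread, and the absolute numbers depend entirely on
the interpreter build and the machine:

import timeit

cases = {
    "ascii replace":     "('abc' * 1000).replace('c', 'de')",
    "non-ascii replace": "('ab…' * 1000).replace('…', 'œ…')",
}
for label, stmt in cases.items():
    # number=10000 mirrors the command-line runs quoted later in the thread
    print(label, timeit.timeit(stmt, number=10000))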
 
wxjmfauth

On Sunday 19 August 2012 19:03:34 UTC+2, Blind Anagram wrote:
"Steven D'Aprano" wrote in message




On Sat, 18 Aug 2012 01:09:26 -0700, wxjmfauth wrote:



[...]

If you can consistently replicate a 100% to 1000% slowdown in string

handling, please report it as a performance bug:



http://bugs.python.org/



Don't forget to report your operating system.



====================================================

For interest, I ran your code snippets on my laptop (Intel core-i7 1.8GHz)
running Windows 7 x64.

Running Python from a Windows command prompt, I got the following on Python
3.2.3 and 3.3 beta 2:

"python33\python" -m timeit "('abc' * 1000).replace('c', 'de')"
10000 loops, best of 3: 39.3 usec per loop
"python33\python" -m timeit "('ab…' * 1000).replace('…', '……')"
10000 loops, best of 3: 51.8 usec per loop
"python33\python" -m timeit "('ab…' * 1000).replace('…', 'x…')"
10000 loops, best of 3: 52 usec per loop
"python33\python" -m timeit "('ab…' * 1000).replace('…', 'œ…')"
10000 loops, best of 3: 50.3 usec per loop
"python33\python" -m timeit "('ab…' * 1000).replace('…', '€…')"
10000 loops, best of 3: 51.6 usec per loop
"python33\python" -m timeit "('XYZ' * 1000).replace('X', 'éç')"
10000 loops, best of 3: 38.3 usec per loop
"python33\python" -m timeit "('XYZ' * 1000).replace('Y', 'p?')"
10000 loops, best of 3: 50.3 usec per loop

"python32\python" -m timeit "('abc' * 1000).replace('c', 'de')"
10000 loops, best of 3: 24.5 usec per loop
"python32\python" -m timeit "('ab…' * 1000).replace('…', '……')"
10000 loops, best of 3: 24.7 usec per loop
"python32\python" -m timeit "('ab…' * 1000).replace('…', 'x…')"
10000 loops, best of 3: 24.8 usec per loop
"python32\python" -m timeit "('ab…' * 1000).replace('…', 'œ…')"
10000 loops, best of 3: 24 usec per loop
"python32\python" -m timeit "('ab…' * 1000).replace('…', '€…')"
10000 loops, best of 3: 24.1 usec per loop
"python32\python" -m timeit "('XYZ' * 1000).replace('X', 'éç')"
10000 loops, best of 3: 24.4 usec per loop
"python32\python" -m timeit "('XYZ' * 1000).replace('Y', 'p?')"
10000 loops, best of 3: 24.3 usec per loop

This is an average slowdown by a factor of close to 2.3 on 3.3 when compared
with 3.2.

I am not posting this to perpetuate this thread but simply to ask whether,
as you suggest, I should report this as a possible problem with the beta?

I use win7 pro 32 bits on Intel.

Thanks for reporting these numbers.
To be clear: I'm not complaining, but the fact that
there is a slowdown is a clear indication (in my mind)
that there is a point somewhere.

jmf
 
Terry Reedy

Meanwhile, an example of the 393 approach failing:

I am completely baffled by this, as this example is one where the 393
approach potentially wins.
I was involved in a
project that dealt with terabytes of OCR data of mostly English text.
So the chars were mostly ascii,

3.3 stores ascii pages 1 byte/char rather than 2 or 4.
but there would be occasional non-ascii
chars including supplementary plane characters, either because of
special symbols that were really in the text, or the typical OCR
confusion emitting those symbols due to printing imprecision.

I doubt that there are really any non-bmp chars. As Steven said, reject
such false identifications.
That's a natural for UTF-8

3.3 would convert to utf-8 for storage on disk.
but the PEP-393 approach would bloat up the memory
requirements by a factor of 4.

3.2- wide builds would *always* use 4 bytes/char. Is not occasionally
better than always?
py> s = chr(0xFFFF + 1)
py> a, b = s

That looks like Python 3.2 is buggy and that sample should just throw an
error. s is a one-character string and should not be unpackable.

That looks like a 3.2- narrow build. Such builds treat unicode strings as
sequences of code units rather than sequences of codepoints. That is not an
implementation bug, but a compromise design that goes back about a decade,
to when unicode was added to Python. At that time, there were only a few
defined non-BMP chars and their usage was extremely rare. There are now
more extended chars than BMP chars and usage will become more common
even in English text.
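A small sketch of the narrow-build behaviour being described, assuming it is
run under both a 3.2 narrow build and 3.3 (the comments state the expected
difference):

import sys

s = chr(0xFFFF + 1)          # first supplementary-plane code point
print(sys.maxunicode)        # 65535 on a narrow build, 1114111 otherwise
print(len(s))                # 2 on a narrow build (a surrogate pair), else 1
try:
    a, b = s
    print("unpacked into surrogates:", hex(ord(a)), hex(ord(b)))
except ValueError:
    print("cannot unpack a one-character string")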

Pre 3.3, there are really two sub-versions of every Python version: a
narrow build and a wide build, with poorly documented differences in
behavior for any string containing extended chars. That is, and would have
become, an increasing problem as extended chars are increasingly used. If
you want to say that what was once a practical
compromise has become a design bug, I would not argue. In any case, 3.3
fixes that split and returns Python to being one cross-platform language.
I realize the folks who designed and implemented PEP 393 are very smart
cookies and considered stuff carefully, while I'm just an internet user
posting an immediate impression of something I hadn't seen before (I
still use Python 2.6), but I still have to ask: if the 393 approach
makes sense, why don't other languages do it?

Python has often copied or borrowed, with adjustments. This time it is
the first to do so. We will see how it goes, but it has been tested for nearly a
year already.
Ropes of UTF-8 segments seems like the most obvious approach and I
wonder if it was considered. By that I mean pick some implementation
constant k (say k=128) and represent the string as a UTF-8 encoded byte
array, accompanied by a vector n//k pointers into the byte array, where
n is the number of codepoints in the string. Then you can reach any
offset analogously to reading a random byte on a disk, by seeking to the
appropriate block, and then reading the block and getting the char you
want within it. Random access is then O(1) though the constant is
higher than it would be with fixed width encoding.

I would call it O(k), where k is a selectable constant. Slowing access
by a factor of 100 is hardly acceptable to me. For strings less than k,
access is O(len). I believe slicing would require re-indexing.
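To make the rope-style proposal above concrete, here is a toy sketch of a
UTF-8 buffer with a block index of byte offsets at every k-th code point, as
Paul Rubin describes. The class name, the default k and the lookup loop are
all illustrative, not anything from CPython; the point is that a random
access costs a walk of at most k-1 code points within a block, which is the
O(k) Terry refers to:

class Utf8Index:
    def __init__(self, text, k=128):
        self.k = k
        self.data = text.encode('utf-8')
        self.length = len(text)
        # byte offsets of code points 0, k, 2k, ... (about n//k entries)
        self.offsets = [0]
        pos = 0
        for i, ch in enumerate(text):
            if i and i % k == 0:
                self.offsets.append(pos)
            pos += len(ch.encode('utf-8'))

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        block, rem = divmod(i, self.k)
        j = self.offsets[block]          # seek to the start of the block
        for _ in range(rem):             # then walk at most k-1 code points
            b = self.data[j]
            j += 1 if b < 0x80 else 2 if b < 0xE0 else 3 if b < 0xF0 else 4
        b = self.data[j]
        width = 1 if b < 0x80 else 2 if b < 0xE0 else 3 if b < 0xF0 else 4
        return self.data[j:j + width].decode('utf-8')

s = Utf8Index("ab€cd" * 1000, k=16)
print(len(s), s[2], s[4002])             # 5000 € €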

As 393 was near adoption, I proposed a scheme using utf-16 (narrow
builds) with a supplementary index of extended chars when there are any.
That makes access O(1) if there are none and O(log(k)), where k is the
number of extended chars in the string, if there are some.
 
Paul Rubin

Terry Reedy said:
I am completely baffled by this, as this example is one where the 393
approach potentially wins.

What? The 393 approach is supposed to avoid memory bloat and that
does the opposite.
3.3 stores ascii pages 1 byte/char rather than 2 or 4.

But they are not ascii pages, they are (as stated) MOSTLY ascii.
E.g. the characters are 99% ascii but 1% non-ascii, so 393 chooses
a much more memory-expensive encoding than UTF-8.
I doubt that there are really any non-bmp chars.

You may be right about this. I thought about it some more after
posting and I'm not certain that there were supplemental characters.
As Steven said, reject such false identifications.

Reject them how?
3.3 would convert to utf-8 for storage on disk.

They are already in utf-8 on disk though that doesn't matter since
they are also compressed.
3.2- wide builds would *always* use 4 bytes/char. Is not occasionally
better than always?

The bloat is in comparison with utf-8, in that example.
That looks like a 3.2- narrow build. Such builds treat unicode strings
as sequences of code units rather than sequences of codepoints. That is
not an implementation bug, but a compromise design that goes back about
a decade, to when unicode was added to Python.

I thought the whole point of Python 3's disruptive incompatibility with
Python 2 was to clean up past mistakes and compromises, of which unicode
headaches were near the top of the list. So I'm surprised they seem to
have repeated a mistake there.
I would call it O(k), where k is a selectable constant. Slowing access
by a factor of 100 is hardly acceptable to me.

If k is constant then O(k) is the same as O(1). That is how O notation
works. I wouldn't believe the 100x figure without seeing it measured in
real-world applications.
 
Terry Reedy

I cannot give you more numbers than those I gave.
As an end user, I noticed in my experiments that my random tests
are always slower in Py3.3 than in Py3.2 on my Windows platform.

And I gave other examples where 3.3 is *faster* on my Windows, which you
have thus far not even acknowledged, let alone tried.
It is up to you, the core developers, to give an explanation
for this behaviour.

System variation, unimportance of sub-microsecond variations, and
attention to more important issues.

Other developers say 3.3 is generally faster on their systems (OSX 10.8,
and one unspecified). To talk about speed sensibly, one must run the full
stringbench.py benchmark and real applications on multiple Windows, *nix,
and Mac systems. Python is not optimized for your particular current
computer.
 
Ian Kelly

The PEP explicitly states that it only uses a 1-byte format for ASCII
strings, not Latin-1:

I think you misunderstand the PEP then, because that is empirically false.

Python 3.3.0b2 (v3.3.0b2:4972a8f1b2aa, Aug 12 2012, 15:23:35) [MSC
v.1600 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getsizeof(bytes(range(256)).decode('latin1'))
329

The constructed string contains all 256 Latin-1 characters, so if
Latin-1 strings must be stored in the 2-byte format, then the size
should be at least 512 bytes. It is not, so I think it must be using
the 1-byte encoding.
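A quick sketch along the same lines, showing the three storage widths that
PEP 393 actually uses; the absolute sys.getsizeof figures vary by version
and platform, but the per-character growth is what matters:

import sys

samples = {
    "ascii":   "a" * 1000,            # max char < 128   -> 1 byte/char
    "latin-1": "é" * 1000,            # max char < 256   -> still 1 byte/char
    "bmp":     "€" * 1000,            # max char < 65536 -> 2 bytes/char
    "astral":  "\U0001D11E" * 1000,   # above the BMP    -> 4 bytes/char
}
for name, s in samples.items():
    print(name, len(s), sys.getsizeof(s))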

"ASCII-only Unicode strings will again use only one byte per character"

This says nothing one way or the other about non-ASCII Latin-1 strings.
"If the maximum character is less than 128, they use the PyASCIIObject
structure"

Note that this only describes the structure of "compact" string
objects, which I have to admit I do not fully understand from the PEP.
The wording suggests that it only uses the PyASCIIObject structure,
not the derived structures. It then says that for compact ASCII
strings "the UTF-8 data, the UTF-8 length and the wstr length are the
same as the length of the ASCII data." But these fields are part of
the PyCompactUnicodeObject structure, not the base PyASCIIObject
structure, so they would not exist if only PyASCIIObject were used.
It would also imply that compact non-ASCII strings are stored
internally as UTF-8, which would be surprising.
and:

"The data and utf8 pointers point to the same memory if the string uses
only ASCII characters (using only Latin-1 is not sufficient)."

This says that if the data are ASCII, then the 1-byte representation
and the utf8 pointer will share the same memory. It does not imply
that the 1-byte representation is not used for Latin-1, only that it
cannot also share memory with the utf8 pointer.
 
wxjmfauth

Just for the story.

Five minutes after I closed my interactive interpreter windows,
the day I tested this stuff, I thought:
"Too bad I did not note the extremely bad cases I found; I'm pretty
sure this problem will arrive on the table".

jmf
 
Terry Reedy

In August 2012, after 20 years of development, Python is not able to
display a piece of text correctly on a Windows console (eg cp65001).

cp65001 is known to not work right. It has been very frustrating. Bug
Microsoft about it, and indeed their whole policy of still dividing the
world into code page regions, even in their next version, instead of
moving toward unicode and utf-8, at least as an option.
I downloaded the Go language, with zero experience, and I did not manage
to display a piece of text incorrectly. (This is, by the way, *the* reason
why I tested it.) Where the problems are coming from, I have no
idea.

If go can display all unicode chars on a Windows console, perhaps you
can do some research and find out how they do so. Then we could consider
copying it.
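As a purely illustrative workaround (not a fix for the cp65001 problem
itself), one can at least check what encoding the console is using and
escape anything it cannot represent instead of letting print() raise:

import sys

print(sys.stdout.encoding)      # e.g. 'cp850', 'cp1252' or 'cp65001'

s = "œ € Ω"
print(s.encode(sys.stdout.encoding, errors="backslashreplace")
       .decode(sys.stdout.encoding))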
 
Mark Lawrence

Just for the story.

Five minutes after I closed my interactive interpreter windows,
the day I tested this stuff, I thought:
"Too bad I did not note the extremely bad cases I found; I'm pretty
sure this problem will arrive on the table".

jmf

How convenient.
 
wxjmfauth

On Sunday 19 August 2012 19:48:06 UTC+2, Paul Rubin wrote:
But they are not ascii pages, they are (as stated) MOSTLY ascii.
E.g. the characters are 99% ascii but 1% non-ascii, so 393 chooses
a much more memory-expensive encoding than UTF-8.

Imagine a US banking application, everything in ascii,
except ... the € currency symbol, code point 0x20ac.

Well, it seems some software producers know what they
are doing.
Traceback (most recent call last):
File "<eta last command>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac'
in position 0: ordinal not in range(256)

jmf
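A short reproduction of the point, assuming nothing beyond the standard
codecs: the euro sign lies outside Latin-1, so an "ascii plus €" text can
neither be encoded to latin-1 nor stored at one byte per character:

s = "Total: 100 €"
try:
    s.encode("latin-1")
except UnicodeEncodeError as exc:
    print(exc)                            # '\u20ac' is not in range(256)
print(len(s), len(s.encode("utf-8")))     # utf-8 spends 3 bytes on the € only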
 
Paul Rubin

Ian Kelly said:

Please try:

print (type(bytes(range(256)).decode('latin1')))

to make sure that what comes back is actually a unicode string rather
than a byte string.
 
Ian Kelly

Please try:

print (type(bytes(range(256)).decode('latin1')))

to make sure that what comes back is actually a unicode string rather
than a byte string.

As I understand it, the decode method never returns a byte string in
Python 3, but if you insist:
<class 'str'>
 
Ian Kelly

Note that this only describes the structure of "compact" string
objects, which I have to admit I do not fully understand from the PEP.
The wording suggests that it only uses the PyASCIIObject structure,
not the derived structures. It then says that for compact ASCII
strings "the UTF-8 data, the UTF-8 length and the wstr length are the
same as the length of the ASCII data." But these fields are part of
the PyCompactUnicodeObject structure, not the base PyASCIIObject
structure, so they would not exist if only PyASCIIObject were used.
It would also imply that compact non-ASCII strings are stored
internally as UTF-8, which would be surprising.

Oh, now I get it. I had missed the part where it says "character data
immediately follow the base structure". And the bit about the "UTF-8
data, the UTF-8 length and the wstr length" are not describing the
contents of those fields, but rather where the data can be alternatively
found since the fields don't exist.
 
Mark Lawrence

On Sunday 19 August 2012 19:48:06 UTC+2, Paul Rubin wrote:

Imagine a US banking application, everything in ascii,
except ... the € currency symbol, code point 0x20ac.

Well, it seems some software producers know what they
are doing.

Traceback (most recent call last):
File "<eta last command>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac'
in position 0: ordinal not in range(256)

jmf

Well that's it then, the world stock markets will all collapse tonight
when the news leaks out that those stupid Americans haven't yet realised
that much of Europe (with at least one very noticeable and sensible
exception :) uses Euros. I'd better sell all my stock holdings fast.
 
Steven D'Aprano

If k is constant then O(k) is the same as O(1). That is how O notation
works.

You might as well say that if N is constant, O(N**2) is constant too and
just like magic you have now made Bubble Sort a constant-time sort
function!

That's not how it works.

Of course *if* k is constant, O(k) is constant too, but k is not
constant. In context we are talking about string indexing and slicing.
There is no value of k, say, k = 2, for which you can say "People will
sometimes ask for string[2] but never ask for string[3]". That is absurd.

Since k can vary from 0 to N-1, we can say that the average string index
lookup is k = (N-1)//2 which clearly depends on N.
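Spelling out the averaging step with a quick check (N is arbitrary here):

N = 1000
print(sum(range(N)) / N, (N - 1) / 2)    # both 499.5: the mean index grows with N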
 
