The difference between the two is that the former is bounded by a
constant that is fundamental to the algorithm at hand, whereas the
latter is bounded only by available resources. By any practical
consideration the latter must be considered a variable, but the former
need not be.
Above, Paul discusses the asymptotic growth of a variable as O(S), where S
is shoe size, but this measure really makes no sense in the first place.
The question that this notation seeks to answer is, "What is the big-Oh
behaviour of this variable as shoe size increases indefinitely (i.e. to
infinity)?" and the answer in this case is "Shoe size does not increase
indefinitely"; the question is invalid.
A more logical question might be, "How much material do I need to
construct N shoes of size S?" The answer to this question would
presumably be some constant factor of N * S**2, which is O(N * S**2).
Although N can be assumed to vary freely (up to nonsensical quantities
like the mass of the entire universe), S is clearly bounded by the
constraints of actual shoes, so we can safely treat S as a constant and
call it O(N).
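To make that concrete, here is a toy version of the cost model (the constant K and the function are purely illustrative, not real shoemaking figures):

```python
# Toy cost model: material(N, S) = K * N * S**2, with K an arbitrary
# constant. S is bounded in practice, so for a fixed S the cost is O(N).
K = 3  # hypothetical material constant per unit of area

def material_needed(n_shoes, shoe_size):
    """Material for n_shoes of size shoe_size: a constant factor of N * S**2."""
    return K * n_shoes * shoe_size ** 2

# Doubling N doubles the cost: linear in N...
assert material_needed(200, 9) == 2 * material_needed(100, 9)
# ...while S only ranges over a small fixed interval, so it acts as a constant.
```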
Why the hell are we talking about shoe sizes?
Let's not lose sight of the actual question at hand: how expensive is it
to fetch a character at some arbitrary index in a string?
With fixed-width characters, average time is O(1), best-case time is O(1)
and worst-case time is O(1).
With non-fixed width characters, average time and worst-case time are
both O(N), and best-case ("give me the character at index 0") is O(1).
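A quick sketch shows why the two cases differ (the function names here are illustrative, not any particular implementation):

```python
# Fixed-width vs variable-width indexing, sketched over raw bytes.

def index_fixed(buf: bytes, i: int, width: int) -> bytes:
    # Fixed-width: the byte offset is a single multiplication, so the
    # cost is O(1) regardless of i or the length of the string.
    return buf[i * width:(i + 1) * width]

def index_utf8(buf: bytes, i: int) -> bytes:
    # Variable-width (UTF-8): the byte offset of character i cannot be
    # computed, so we must scan from the start counting characters: O(N).
    count, start = 0, 0
    for pos in range(len(buf)):
        # A UTF-8 continuation byte has the bit pattern 10xxxxxx;
        # anything else starts a new character.
        if buf[pos] & 0xC0 != 0x80:
            if count == i:
                start = pos
            elif count == i + 1:
                return buf[start:pos]
            count += 1
    return buf[start:]

s = "héllo"
utf32 = s.encode("utf-32-le")  # 4 bytes per character
assert index_fixed(utf32, 1, 4).decode("utf-32-le") == "é"
assert index_utf8(s.encode("utf-8"), 1).decode("utf-8") == "é"
```

Note that `index_utf8(buf, 0)` returns immediately after examining one byte, which is exactly the O(1) best case described above.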
But that's a degenerate case that gives us absolutely no insight into the
cost of the operation. It's like the observation that "sorting a list of
one item takes constant time, regardless of the algorithm". Um, yes, it
does. So what?
Invented constraints like "people only care about indexes really close to
zero, or to the end of the string" effectively confuse the best possible
case with the average case. In real life, people do care about random
access to strings, and the average case matters more than the best case.
That's why strings are typically implemented as arrays of fixed-width
characters rather than linked lists; Haskell, which uses a linked list
for its native String type, explicitly warns you of that fact and that
your string-handling code is likely to be slow.
Of course, you can trade space and complexity for time, e.g. with ropes,
or with cached pointers to character locations in the hope that indexes
will be clustered rather than scattered randomly. These more complex
implementations typically use as much memory as, and often perform no
better than, a dead simple implementation that uses a plain array of
fixed-width characters. Most harmful of all, it means that a simple
indexing operation may be fast on average but occasionally degrade to
slow. The only thing worse than "This program is slow" is "This program
is *occasionally* slow, and I can't predict when".
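The cached-pointer idea can be sketched in a few lines; this is an illustration of the technique over UTF-8 bytes, not any real library's implementation:

```python
# Remember the last (character index, byte offset) pair so that nearby
# lookups only scan the gap since the previous lookup. Clustered access
# is fast; a jump far from the cached position still costs a long scan.

class CachedUTF8String:
    def __init__(self, s: str):
        self._buf = s.encode("utf-8")
        self._last_index = 0    # character index of the cached position
        self._last_offset = 0   # byte offset of that character

    def _advance(self, offset: int) -> int:
        # Step past one UTF-8 character starting at offset.
        offset += 1
        while offset < len(self._buf) and self._buf[offset] & 0xC0 == 0x80:
            offset += 1  # skip continuation bytes (10xxxxxx)
        return offset

    def char_at(self, i: int) -> str:
        if i < self._last_index:
            # Simplification: restart from the front on backward jumps.
            self._last_index, self._last_offset = 0, 0
        while self._last_index < i:
            self._last_offset = self._advance(self._last_offset)
            self._last_index += 1
        end = self._advance(self._last_offset)
        return self._buf[self._last_offset:end].decode("utf-8")

s = CachedUTF8String("naïve café")
assert s.char_at(2) == "ï"
assert s.char_at(3) == "v"  # clustered access: scans only one character
```

Exactly as described above, `char_at` is fast when successive indexes are close together and degrades to a full scan when they are not.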
Not surprisingly, almost all programming languages in common usage use
arrays of fixed-width characters as their native string type. Python 3.3
will do the same, with the memory optimization that the fixed width is
selected per string, from 1, 2, or 4 bytes according to the minimum width
needed, rather than being fixed at 2 or 4 bytes when you compile Python
itself.
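The width-selection rule is simple enough to sketch directly (this is the idea behind Python 3.3's per-string representation, not CPython's actual code):

```python
# Pick the narrowest fixed width -- 1, 2, or 4 bytes -- that can hold
# every code point in the string.

def storage_width(s: str) -> int:
    if not s:
        return 1
    highest = max(map(ord, s))
    if highest < 0x100:       # fits in Latin-1
        return 1
    if highest < 0x10000:     # fits in the Basic Multilingual Plane
        return 2
    return 4                  # astral code points need the full width

assert storage_width("hello") == 1
assert storage_width("héllo") == 1            # é is U+00E9: still 1 byte
assert storage_width("hello\u2603") == 2      # snowman is U+2603
assert storage_width("hello\U0001F600") == 4  # emoji beyond the BMP
```

The payoff is that a pure-ASCII string costs one byte per character while indexing stays O(1), because every character within a given string has the same width.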
This is a big step forward, achieved with a pretty simple algorithm, and
it should help encourage the use of Unicode by taking much of the sting
out of the perennial complaint that "Unicode wastes space". Not so
wasteful any more.