How do I display the Unicode value stored in a string variable using ord()?
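
For anyone landing here with the original question, a minimal interactive
sketch (the euro sign is just an example character):

>>> s = '€'
>>> ord(s)            # the code point as an integer
8364
>>> hex(ord(s))       # the same value in hex; matches U+20AC
'0x20ac'
>>> chr(8364)         # and back again
'€'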

Steven D'Aprano

On Sun, Aug 19, 2012 at 12:33 AM, Steven D'Aprano wrote:


I think you misunderstand the PEP then, because that is empirically
false.

Yes I did misunderstand. Thank you for the clarification.
 

Steven D'Aprano

"Steven D'Aprano" wrote in message
If you can consistently replicate a 100% to 1000% slowdown in string
handling, please report it as a performance bug:

http://bugs.python.org/

Don't forget to report your operating system.
[...]

This is an average slowdown by a factor of close to 2.3 on 3.3 when
compared with 3.2.

I am not posting this to perpetuate this thread but simply to ask
whether, as you suggest, I should report this as a possible problem with
the beta?

Possibly, if it is consistent and non-trivial. Serious performance
regressions are bugs. Trivial ones, not so much.

Thanks to Terry Reedy, who has already asked the Python devs about this
issue, we know that they aren't hugely interested in micro-benchmarks in
isolation. If you want the bug report to be taken seriously, you will need
to run the full Python string benchmark. The results of that would be
interesting to see.
 

Terry Reedy

Running Python from a Windows command prompt, I got the following on
Python 3.2.3 and 3.3 beta 2:

python33\python" -m timeit "('abc' * 1000).replace('c', 'de')"
10000 loops, best of 3: 39.3 usec per loop
python33\python" -m timeit "('ab…' * 1000).replace('…','……')"
10000 loops, best of 3: 51.8 usec per loop
python33\python" -m timeit "('ab…' * 1000).replace('…','x…')"
10000 loops, best of 3: 52 usec per loop
python33\python" -m timeit "('ab…' * 1000).replace('…','œ…')"
10000 loops, best of 3: 50.3 usec per loop
python33\python" -m timeit "('ab…' * 1000).replace('…','€…')"
10000 loops, best of 3: 51.6 usec per loop
python33\python" -m timeit "('XYZ' * 1000).replace('X', 'éç')"
10000 loops, best of 3: 38.3 usec per loop
python33\python" -m timeit "('XYZ' * 1000).replace('Y', 'p?')"
10000 loops, best of 3: 50.3 usec per loop

python32\python" -m timeit "('abc' * 1000).replace('c', 'de')"
10000 loops, best of 3: 24.5 usec per loop
python32\python" -m timeit "('ab…' * 1000).replace('…','……')"
10000 loops, best of 3: 24.7 usec per loop
python32\python" -m timeit "('ab…' * 1000).replace('…','x…')"
10000 loops, best of 3: 24.8 usec per loop
python32\python" -m timeit "('ab…' * 1000).replace('…','œ…')"
10000 loops, best of 3: 24 usec per loop
python32\python" -m timeit "('ab…' * 1000).replace('…','€…')"
10000 loops, best of 3: 24.1 usec per loop
python32\python" -m timeit "('XYZ' * 1000).replace('X', 'éç')"
10000 loops, best of 3: 24.4 usec per loop
python32\python" -m timeit "('XYZ' * 1000).replace('Y', 'p?')"
10000 loops, best of 3: 24.3 usec per loop

This is one test repeated 7 times with essentially irrelevant
variations. The difference is less on my system (50%). Others report
seeing 3.3 as faster. When I asked on pydev, the answer was don't bother
making a tracker issue unless I was personally interested in
investigating why search is relatively slow in 3.3 on Windows. Any
change would have to not slow other operations or severely impact search
on other systems. I suggest the same answer to you.

If you seriously want to compare old and new unicode, go to
http://hg.python.org/cpython/file/tip/Tools/stringbench/stringbench.py
and click raw to download. Run on 3.2 and 3.3, ignoring the bytes times.
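
Something like this, as a sketch (the interpreter paths follow the ones used
above; stringbench prints its own timing report, so just capture and compare
the two runs):

python32\python stringbench.py > bench-3.2.txt
python33\python stringbench.py > bench-3.3.txt
fc bench-3.2.txt bench-3.3.txt

fc is the Windows file-compare command; diff does the same job elsewhere.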

Here is a version of the first comparison from stringbench (with the needed
import added so it runs as-is):
from timeit import timeit
print(timeit("('NOW IS THE TIME FOR ALL GOOD PEOPLE TO COME TO THE AID OF PYTHON' * 10).lower()"))
Results are 5.6 for 3.2 and 0.8 for 3.3. WOW! 3.3 is 7 times faster!

OK, not fair. I cherry-picked. The 7 times speedup in 3.3 is likely at
least partly independent of the PEP 393 unicode change. The same test in
stringbench for bytes is twice as fast in 3.3 as in 3.2, but only 2x, not
7x. In fact, it may have been the bytes/unicode comparison in 3.2 that
suggested that unicode case conversion of ASCII chars might be made faster.

The sum of the 3.3 unicode times is 109, versus 110 for 3.3 bytes and 125
for 3.2 unicode. This unweighted sum is not really fair, since the raw
times vary by a factor of at least 100. But it does suggest that anyone
claiming that 3.3 unicode is overall 'slower' than 3.2 unicode has some
work to do.

There is also this. On my machine, the lowest bytes-time/unicode-time
for 3.3 is .71. This suggests that there is not a lot of fluff left in
the unicode code, and that not much is lost by the bytes to unicode
switch for strings.
 

Chris Angelico

How convenient.

Not really. Even if he HAD copied-and-pasted those worst-cases, it'd
prove nothing. Maybe his system just chose to glitch right then. It's
always possible to find statistical outliers that take way way longer
than everything else.

Watch this. Python 3.2 on Windows is optimized for adding 1 to numbers.

C:\Documents and Settings\M>\python32\python -m timeit -r 1 "a=1+1"
10000000 loops, best of 1: 0.0654 usec per loop

C:\Documents and Settings\M>\python32\python -m timeit -r 1 "a=1+1"
10000000 loops, best of 1: 0.0654 usec per loop

C:\Documents and Settings\M>\python32\python -m timeit -r 1 "a=1+1"
10000000 loops, best of 1: 0.0654 usec per loop

C:\Documents and Settings\M>\python32\python -m timeit -r 1 "a=1+2"
10000000 loops, best of 1: 0.0711 usec per loop

Now, as long as I don't tell you that during the last test I had quite
a few other processes running, including VLC playing a movie and two
Python processes running "while True: pass", this will look like a
significant performance difference. So now, I'm justified in
complaining about how suboptimal Python is when adding 2 to a number,
which I can assure you is a VERY common case.
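
(For what it's worth, one way to make such micro-timings a little less
noise-prone is to take the best of several repeats, which is what timeit's
-r option does and what timeit.repeat exposes from code. A sketch:

from timeit import repeat
# best of 5 runs of one million executions each; the minimum is the run
# least disturbed by whatever else the machine is doing
best = min(repeat("a = 1 + 2", repeat=5, number=1000000))
print(best / 1000000, "seconds per loop")

It does not make a loaded machine honest, but it filters out the worst of
the outliers.)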

ChrisA
 

Terry Reedy

Well, it seems some software producers know what they
are doing.

Traceback (most recent call last):
File "<eta last command>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac'
in position 0: ordinal not in range(256)

Yes, Python lets you choose your byte encoding from those and a hundred
others. I believe all the codecs are now tested in both directions. It
was not an easy task.
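
A quick way to see which of those byte encodings can represent the euro sign
(codec names as Python spells them; the outputs below are what 3.x prints):

>>> '\u20ac'.encode('latin-1')       # ISO-8859-1 has no euro, so this raises
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in position 0: ordinal not in range(256)
>>> '\u20ac'.encode('iso-8859-15')   # Latin-9 added the euro at 0xA4
b'\xa4'
>>> '\u20ac'.encode('cp1252')        # Windows-1252 put it at 0x80
b'\x80'
>>> '\u20ac'.encode('utf-8')
b'\xe2\x82\xac'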

As to the examples: Latin-1 dates to 1985 and before and the 1988
version was published as a standard in 1992.
https://en.wikipedia.org/wiki/Latin-1
"The name euro was officially adopted on 16 December 1995."
https://en.wikipedia.org/wiki/Euro
No wonder Latin-1 does not contain the Euro sign. International
standards organizations' standards are relatively fixed. (The Unicode
consortium will not even correct misspelled character names.) Instead,
new standards with a new number are adopted.

For better or worse, private mappings are more flexible. In its Mac
mapping Apple "replaced the generic currency sign ¤ with the euro sign
€". (See the Latin-1 reference.) Great if you use Euros, not so great if
you were using the previous sign for something else.

Microsoft changed an unneeded code to the Euro for Windows cp-1252.
https://en.wikipedia.org/wiki/Windows-1252
"It is very common to mislabel Windows-1252 text with the charset label
ISO-8859-1. A common result was that all the quotes and apostrophes
(produced by "smart quotes" in Microsoft software) were replaced with
question marks or boxes on non-Windows operating systems, making text
difficult to read. Most modern web browsers and e-mail clients treat the
MIME charset ISO-8859-1 as Windows-1252 in order to accommodate such
mislabeling. This is now standard behavior in the draft HTML 5
specification, which requires that documents advertised as ISO-8859-1
actually be parsed with the Windows-1252 encoding.[1]"
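
The mislabeling is easy to reproduce from Python (a small sketch; 0x93 and
0x94 are the cp1252 "smart quotes"):

>>> data = b'\x93hello\x94'   # what Windows "smart quotes" software produces
>>> data.decode('cp1252')     # decoded with the correct label
'“hello”'
>>> data.decode('latin-1')    # mislabeled as ISO-8859-1: invisible C1 controls
'\x93hello\x94'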

Lots of fun. Too bad Microsoft won't push utf-8 so we can all
communicate text with much less chance of ambiguity.
 

Chris Angelico

Python has often copied or borrowed, with adjustments. This time it is the
first. We will see how it goes, but it has been tested for nearly a year
already.

Maybe it wasn't consciously borrowed, but whatever innovation is done,
there's usually an obscure beardless language that did it earlier. :)

Pike has a single string type, which can use the full Unicode range.
If all codepoints are <256, the string width is 8 (measured in bits);
if <65536, width is 16; otherwise 32. Using the inbuilt count_memory
function (similar to the Python function used somewhere earlier in
this thread, but which I can't at present put my finger to), I find
that for strings of 16 bytes or more, there's a fixed 20-byte header
plus the string content, stored in the correct number of bytes. (Pike
strings, like Python ones, are immutable and do not need expansion
room.)
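
The CPython 3.3 equivalent is easy to poke at with sys.getsizeof (a sketch;
the exact header overhead varies by build, so only the growth per character
matters here):

import sys
for ch in 'a', '\xe9', '\u20ac', '\U0001f40d':
    s = ch * 1000
    print(repr(ch), sys.getsizeof(s))
# the ASCII and Latin-1-range strings grow by about 1 byte per character,
# the BMP one ('\u20ac') by 2, and the astral one by 4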

However, Python goes a bit further by making it VERY clear that this
is a mere optimization, and that Unicode strings and bytes strings are
completely different beasts. In Pike, it's possible to forget to
encode something before (say) writing it to a socket. Everything works
fine while you have only ASCII characters in the string, and then
breaks when you have a >255 codepoint - or perhaps worse, when you
have a 127<x<256, and the other end misinterprets it.

Really, the only viable alternative to PEP 393 is a fixed 32-bit
representation - it's the only way that's guaranteed to provide
equivalent semantics. The new storage format is guaranteed to take no
more memory than that, and provide equivalent functionality.

ChrisA
 

Roy Smith

Chris Angelico said:
Really, the only viable alternative to PEP 393 is a fixed 32-bit
representation - it's the only way that's guaranteed to provide
equivalent semantics. The new storage format is guaranteed to take no
more memory than that, and provide equivalent functionality.

In the primordial days of computing, using 8 bits to store a character
was a profligate waste of memory. What on earth did people need with
TWO cases of the alphabet (not to mention all sorts of weird
punctuation)? Eventually, memory became cheap enough that the
convenience of using one character per byte (not to mention 8-bit bytes)
outweighed the costs. And crazy things like sixbit and rad-50 got swept
into the dustbin of history.

So it may be with utf-8 someday.

Clearly, the world has moved to a 32-bit character set. Not all parts
of the world know that yet, or are willing to admit it, but that doesn't
negate the fact that it's true. Equally clearly, the concept of one
character per byte is a big win. The obvious conclusion is that
eventually, when memory gets cheap enough, we'll all be doing utf-32 and
all this transcoding nonsense will look as antiquated as rad-50 does
today.
 

Paul Rubin

Steven D'Aprano said:
Of course *if* k is constant, O(k) is constant too, but k is not
constant. In context we are talking about string indexing and slicing.
There is no value of k, say, k = 2, for which you can say "People will
sometimes ask for string[2] but never ask for string[3]". That is absurd.

The context was parsing, e.g. recognizing a token like "a" or "foo" in a
human-written chunk of text. Occasionally it might be "sesquipedalian"
or some even worse outlier, but one can reasonably put a fixed and
relatively small upper bound on the expected value of k. That makes the
amortized complexity O(1), I think.
 

88888 Dihedral

"Steven D'Aprano" wrote in message




On Sat, 18 Aug 2012 01:09:26 -0700, wxjmfauth wrote:



[...]

If you can consistently replicate a 100% to 1000% slowdown in string

handling, please report it as a performance bug:



http://bugs.python.org/



Don't forget to report your operating system.



====================================================

For interest, I ran your code snippets on my laptop (Intel core-i7 1.8GHz)

running Windows 7 x64.



Running Python from a Windows command prompt, I got the following on Python

3.2.3 and 3.3 beta 2:



python33\python" -m timeit "('abc' * 1000).replace('c', 'de')"

10000 loops, best of 3: 39.3 usec per loop

python33\python" -m timeit "('ab…' * 1000).replace('…', '……')"

10000 loops, best of 3: 51.8 usec per loop

python33\python" -m timeit "('ab…' * 1000).replace('…', 'x…')"

10000 loops, best of 3: 52 usec per loop

python33\python" -m timeit "('ab…' * 1000).replace('…', 'œ…')"

10000 loops, best of 3: 50.3 usec per loop

python33\python" -m timeit "('ab…' * 1000).replace('…', '€…')"

10000 loops, best of 3: 51.6 usec per loop

python33\python" -m timeit "('XYZ' * 1000).replace('X', 'éç')"

10000 loops, best of 3: 38.3 usec per loop

python33\python" -m timeit "('XYZ' * 1000).replace('Y', 'p?')"

10000 loops, best of 3: 50.3 usec per loop



python32\python" -m timeit "('abc' * 1000).replace('c', 'de')"

10000 loops, best of 3: 24.5 usec per loop

python32\python" -m timeit "('ab…' * 1000).replace('…', '……')"

10000 loops, best of 3: 24.7 usec per loop

python32\python" -m timeit "('ab…' * 1000).replace('…', 'x…')"

10000 loops, best of 3: 24.8 usec per loop

python32\python" -m timeit "('ab…' * 1000).replace('…', 'œ…')"

10000 loops, best of 3: 24 usec per loop

python32\python" -m timeit "('ab…' * 1000).replace('…', '€…')"

10000 loops, best of 3: 24.1 usec per loop

python32\python" -m timeit "('XYZ' * 1000).replace('X', 'éç')"

10000 loops, best of 3: 24.4 usec per loop

python32\python" -m timeit "('XYZ' * 1000).replace('Y', 'p?')"

10000 loops, best of 3: 24.3 usec per loop



This is an average slowdown by a factor of close to 2.3 on 3.3 when compared
with 3.2.

I am not posting this to perpetuate this thread but simply to ask whether,
as you suggest, I should report this as a possible problem with the beta?

Hmm, another set of functions for speeding up ASCII string operations might
be needed. But it is better that Python 3.3 first supports Unicode strings
that are easy for people working in different languages to use.

Anyway, I think Cython and Pyrex can be used to tackle this problem.
 

Terry Reedy

I should have added 'that I know of' ;-)
Maybe it wasn't consciously borrowed, but whatever innovation is done,
there's usually an obscure beardless language that did it earlier. :)

Pike has a single string type, which can use the full Unicode range.
If all codepoints are <256, the string width is 8 (measured in bits);
if <65536, width is 16; otherwise 32. Using the inbuilt count_memory
function (similar to the Python function used somewhere earlier in
this thread, but which I can't at present put my finger to), I find
that for strings of 16 bytes or more, there's a fixed 20-byte header
plus the string content, stored in the correct number of bytes. (Pike
strings, like Python ones, are immutable and do not need expansion
room.)

It is even possible that someone involved was vaguely aware that there
was an antecedent. The PEP makes no such claim that I can see, but lays
out the problem and goes right to the details of a Python implementation.
However, Python goes a bit further by making it VERY clear that this
is a mere optimization, and that Unicode strings and bytes strings are
completely different beasts. In Pike, it's possible to forget to
encode something before (say) writing it to a socket. Everything works
fine while you have only ASCII characters in the string, and then
breaks when you have a >255 codepoint - or perhaps worse, when you
have a 127<x<256, and the other end misinterprets it.

Python writes strings to file objects, including open sockets, without
creating a bytes object -- IF the file is opened in text mode, which
always has an associated encoding, even if the default 'ascii'. From
what you say, this is what Pike is missing.

I am pretty sure that the obvious optimization has already been done.
The internal bytes of all-ascii text can safely be sent to a file with
ascii (or ascii-compatible) encoding without intermediate 'decoding'. I
remember several patches of that sort. If a string is internally ucs2
and the file is declared ucs2 or utf-16 encoding, then again, pairs of
bytes can go directly (possibly with a byte swap).
 

Chris Angelico

Python writes strings to file objects, including open sockets, without
creating a bytes object -- IF the file is opened in text mode, which always
has an associated encoding, even if the default 'ascii'. From what you say,
this is what Pike is missing.

In text mode, the library does the encoding, but an encoding still happens.
I am pretty sure that the obvious optimization has already been done. The
internal bytes of all-ascii text can safely be sent to a file with ascii (or
ascii-compatible) encoding without intermediate 'decoding'. I remember
several patches of that sort. If a string is internally ucs2 and the file is
declared ucs2 or utf-16 encoding, then again, pairs of bytes can go directly
(possibly with a byte swap).

Maybe it doesn't take any memory change, but there is a data type
change. A Unicode string cannot be sent over the network; an encoding
is needed.

In Pike, I can take a string like "\x20AC" (or "\u20ac" or
"\U000020ac", same thing) and manipulate it as a one-character string,
but I cannot write it to a file or file-like object. I can, however,
pass it through a codec (and there's string_to_utf8() for the
convenience of the common case), and get back something like
"\xe2\x82\xac", which is a three-byte string. The thing is, though,
that this new string is of exactly the same data type as the original:
'string'. Which means that I could have a string containing Latin-1
but not ASCII characters, and Pike will happily write it to a socket
without raising a compile-time or run-time error. Python, under the
same circumstances, would either raise an error or quietly (and
correctly) encode the data.
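
For contrast, the same round trip in Python, where the two types cannot be
confused (a sketch using an in-memory binary buffer in place of a socket):

>>> import io
>>> s = '\u20ac'
>>> buf = io.BytesIO()
>>> buf.write(s.encode('utf-8'))   # explicit encode gives a distinct bytes object
3
>>> buf.write(s)                   # handing over the str itself is a TypeError
Traceback (most recent call last):
  ...
TypeError: ...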

But this is a relatively trivial point, in the scheme of things.
Python has an excellent model now for handling Unicode strings, and I
would STRONGLY recommend everyone to upgrade to 3.3.

ChrisA
 

Steven D'Aprano

In the primordial days of computing, using 8 bits to store a character
was a profligate waste of memory. What on earth did people need with
TWO cases of the alphabet

That's obvious, surely? We need two cases so that we can distinguish
helping Jack off a horse from helping jack off a horse.

(not to mention all sorts of weird
punctuation)? Eventually, memory became cheap enough that the
convenience of using one character per byte (not to mention 8-bit bytes)
outweighed the costs. And crazy things like sixbit and rad-50 got swept
into the dustbin of history.

8 bit bytes are much older than 8 bit characters. For a long time, ASCII
characters used only 7 bits out of the 8.

So it may be with utf-8 someday.

Only if you believe that people's ability to generate data will remain
lower than people's ability to install more storage.

Every few years, new sizes for storage media come out. The first thing
that happens is that people say "40 megabytes? I'll NEVER fill this hard
drive up!". The second thing that happens is that they say "Dammit, my 40
MB hard drive is full, and a new one is too expensive, better delete some
files." Followed shortly by "400 megabytes? I'll NEVER use that much
space!" -- wash, rinse, repeat, through megabytes, gigabytes, terabytes,
and it will happen for petabytes next.

So long as our ability to generate data continues to outstrip our ability
to store it, compression and memory-efficient storage schemes will remain
in demand.
 

Roy Smith

So it may be with utf-8 someday.

Only if you believe that people's ability to generate data will remain
lower than people's ability to install more storage.

We're not talking *data*, we're talking *text*. Most of those
whatever-bytes people are generating are images, video, and music. Text
is a pittance compared to those.

In any case, text on disk can easily be stored compressed. I would
expect the UTF-8 and UTF-32 versions of a text file to compress to just
about the same size.
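
That expectation is easy to test on any file you have handy (a sketch using
zlib; 'some_file.txt' is a placeholder name, and the exact ratio will depend
on the text):

import zlib
text = open('some_file.txt', encoding='utf-8').read()
for enc in ('utf-8', 'utf-32'):
    raw = text.encode(enc)
    print(enc, len(raw), 'raw ->', len(zlib.compress(raw)), 'compressed')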
 

Michael Torrie

Five minutes after I closed my interactive interpreter windows,
the day I tested this stuff, I thought:
"Too bad I did not note the extremely bad cases I found; I'm pretty
sure this problem will arrive on the table."

Reading through this thread (which is entertaining), I am reminded of
the old saying, "premature optimization is the root of all evil." This
"problem" that you have discovered, if fixed the way you propose
(always using a 4-byte UCS-4 representation internally), would be just
such a premature optimization. It would come at a high cost with very
little real-world impact.

As others have made abundantly clear, the overhead of changing internal
string representations is a cost that's only manifest during the
creation of the immutable string object. If your code is doing a lot of
operations on immutable strings, which by definition creates new
immutable string objects, then the real speed problem is in your
algorithm. If you are working on a string as if it were a buffer, doing
many searches, replaces, etc, then you need to work on an object
designed for IO, such as io.StringIO. If implemented half correctly, I
imagine that StringIO uses internally the widest possible character
representation in the buffer. I could be wrong here.
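
For example, building a large string out of many small pieces goes through a
buffer object rather than a chain of throwaway immutable strings (a sketch;
''.join() on a list is the other usual idiom):

import io

buf = io.StringIO()
for i in range(10000):
    buf.write('line %d\n' % i)   # no new immutable string per append
result = buf.getvalue()          # one final string at the end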

As to your other problem, Python generally tries to follow Unicode
encoding rules to the letter. Thus if a piece of text cannot be
represented in the character set of the terminal, then Python will
properly err out. Other languages you have tried likely fudge it
somehow: they display what they can, or something similar. In general
the Windows command window is an outdated thing that no serious
programmer can rely on to display Unicode text. Use a proper GUI API,
or use a better terminal that can handle utf-8.

The TLDR version: You're right that converting string representations
internally incurs overhead, but if your program is slow because of this
you're doing it wrong. It's not symptomatic of some python disease.
 

Steven D'Aprano

Only if you believe that people's ability to generate data will remain
lower than people's ability to install more storage.

We're not talking *data*, we're talking *text*. Most of those
whatever-bytes people are generating are images, video, and music. Text
is a pittance compared to those.

Paul Rubin already told you about his experience using OCR to generate
multiple terabytes of text, and how he would not be happy if that was
stored in UCS-4.

HTML is text. XML is text. SVG is text. Source code is text. Email is
text. (Well, it's actually bytes, but it looks like ASCII text.) Log
files are text, and they can fill a hard drive pretty quickly. Lots of
data is text.

Pittance or not, I do not believe that people will widely abandon compact
storage formats like UTF-8 and Latin-1 for UCS-4 any time soon. Given
that we're still trying to convince people to use UTF-8 over ASCII, I
reckon it will be at least 40 years before there's even a slim chance of
migrating from UTF-8 to UCS-4 in a widespread manner. In the IT world,
that's close enough to "never" -- we might not even be using Unicode in
2052.

In any case, time will tell who is right.
 

rusi

On Sunday, 19 August 2012 at 19:48:06 UTC+2, Paul Rubin wrote:
Well, it seems some software producers know what they
are doing.


Traceback (most recent call last):
  File "<eta last command>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac'
in position 0: ordinal not in range(256)

<facetious>
You want the Euro-sign in iso-8859-1??
I object. I want the rupee sign ( ₹ ) http://en.wikipedia.org/wiki/Indian_rupee_sign

And while we are at it, why not move it (both?) into ASCII?
</facetious>

The problem(s) are:
1. We don't really understand what you are objecting to.
2. UTF-8, like Huffman coding, is a prefix code:
http://en.wikipedia.org/wiki/Prefix_code#Prefix_codes_in_use_today
Like Huffman coding, it compresses based on a statistical argument.
3. Unlike Huffman coding, the statistics are very political: "Is the
Euro more important or Chinese ideograms?" depends on whom you ask
(see the byte-length sketch below).
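
A quick sketch of that byte-length gradient (UTF-8 lengths of a few sample
characters):

>>> for ch in 'A', 'é', '€', '₹', '中':
...     print(ch, len(ch.encode('utf-8')), 'byte(s)')
...
A 1 byte(s)
é 2 byte(s)
€ 3 byte(s)
₹ 3 byte(s)
中 3 byte(s)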
 

Paul Rubin

Steven D'Aprano said:
Paul Rubin already told you about his experience using OCR to generate
multiple terabytes of text, and how he would not be happy if that was
stored in UCS-4.

That particular text was stored on disk as compressed XML that had UTF-8
in the data fields, but I think Roy is right that it would have
compressed to around the same size in UCS-4. Converting it to UCS-4 on
input would have bloated up the memory footprint and that was the issue
of concern to me.
Pittance or not, I do not believe that people will widely abandon compact
storage formats like UTF-8 and Latin-1 for UCS-4 any time soon.

Looking at http://www.icu-project.org/ the C++ classes seem to use
UTF-16, sort of like Python 3.2 :(. I'm not certain of this though.
 

Oscar Benjamin

Steven D'Aprano said:
Of course *if* k is constant, O(k) is constant too, but k is not
constant. In context we are talking about string indexing and slicing.
There is no value of k, say, k = 2, for which you can say "People will
sometimes ask for string[2] but never ask for string[3]". That is
absurd.


The context was parsing, e.g. recognizing a token like "a" or "foo" in a
human-written chunk of text. Occasionally it might be "sesquipedalian"
or some even worse outlier, but one can reasonably put a fixed and
relatively small upper bound on the expected value of k. That makes the
amortized complexity O(1), I think.

No it doesn't. It is still O(k). The point of big O notation is to
understand the asymptotic behaviour of one variable as it becomes
large because of changes in other variables. If k is small then you
can often guess that O(k) is small. To say that an operation is O(k),
however, is a statement about what happens when k is big (and it is not
refuted by saying that k is typically not big).

Oscar
 

Roy Smith

Michael Torrie said:
Python generally tries to follow unicode
encoding rules to the letter. Thus if a piece of text cannot be
represented in the character set of the terminal, then Python will
properly err out. Other languages you have tried, likely fudge it
somehow.

And if you want the "fudge it somehow" behavior (which is often very
useful!), there's always http://pypi.python.org/pypi/Unidecode/
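
A minimal sketch of how that package is normally used (assuming it has been
installed; the transliterations are approximate by design):

>>> from unidecode import unidecode
>>> unidecode('Ünïcödé – déjà vu')
'Unicode - deja vu'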
 

Paul Rubin

Oscar Benjamin said:
No it doen't. It is still O(k). The point of big O notation is to
understand the asymptotic behaviour of one variable as it becomes
large because of changes in other variables.

Actually, two separate problems got mixed together late at night. In
neither case is k an independent variable that ranges over all possible
values. In both cases it is selected or observed by measurement (i.e.
it is a dependent variable determined by something that is itself not
independent).

1) Access in a rope: here, k is basically determined by the pointer size
of the computer, which in CPython (the implementation we're discussing)
is 4 or 8 bytes (a constant) in all instances AFAIK. k should be big
enough that the pointer and allocation overhead is small compared to
bloating the strings with UCS-2 or UCS-4, and small enough to not add
much scan time. It seems realistic to say k <= 128 for this (several
times smaller is probably fine). 128 is of course a constant and not a
variable. We are not concerned about hypothetical computers with
billion-bit pointers.

2) Parsing tokens: here, k is drawn from a fixed probability
distribution such that its expectation value is again a small constant
(this is an assertion about what the data looks like in typical parsing
applications in the real world, not about what is mathematically
possible). The average-case complexity (I said "amortized" earlier but
should have said "average-case") is proportional to that constant,
which is to say, O(1). Of course there is still more wiggle room about
what is "typical".

Analogy: how big a box is required to hold a pair of shoes? In a purely
theoretical sense we might say O(S) where S is the shoe size, treating
shoe size as an arbitrary independent variable. But in the real world,
shoe size is controlled by the size of human feet, which is bounded by a
constant for biological reasons. You don't have to consider shoes the
size of Jupiter. So it is O(1).
 
