Fábio Santos
Does this mean he passes the Turing test?
Yes, absolutely. The original Turing test was defined in terms of five
minutes of analysis, and Dihedral and jmf have clearly been
indistinguishably human across that approximate period.
FYI, you're better off going to http://pypi.python.org/pypi/regex
because that will take you to the latest version.
On 07/12/2013 07:16 PM, MRAB wrote:
Thank you everyone for the suggestions. I have not tried them yet.
Devyn Collier Johnson
"Variable number of bytes" is a problematic way of saying it. UTF-8 is a
variable-width encoding scheme in which each character occupies 1, 2, 3,
or 4 bytes, depending on the Unicode character. As you can imagine, this
sort of encoding scheme would be very slow to do slicing with (looking
up a character at a certain position), since finding position n means
scanning from the start of the string. Python instead uses fixed-width
storage within each string, so it preserves O(1) lookup speeds, but
Python will use 1, 2, or 4 bytes per every character in the string,
depending on what is needed. Just in case the OP might have
misunderstood what you are saying.
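The 1/2/4-byte behavior is easy to observe on CPython 3.3+. A small
sketch (the exact byte counts are CPython implementation details, not
language guarantees):

```python
import sys

# PEP 393: each string is stored at the narrowest fixed width that
# fits its widest character: 1, 2, or 4 bytes per character.
ascii_s = "a" * 100             # Latin-1 range: 1 byte/char
bmp_s = "\u20ac" * 100          # Euro sign: 2 bytes/char
astral_s = "\U0001F600" * 100   # astral emoji: 4 bytes/char

# Wider characters cost more memory, but indexing stays O(1).
assert sys.getsizeof(ascii_s) < sys.getsizeof(bmp_s) < sys.getsizeof(astral_s)
assert astral_s[42] == "\U0001F600"   # direct index, no scanning
```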
jmf sees the case where a string is promoted from one width to another,
and thinks that the brief slowdown in string operations to accomplish
this is a problem. In reality I have never seen anyone use the kinds of
string operations his pseudo-benchmarks use, and in general Python 3's
string behavior is pretty fast, and apparently much more correct than
if jmf's ideas of Unicode were implemented.
Short example. Writing an editor with something like the
FSR is simply impossible (properly).
I've screwed up plenty of times in Python, but can write code like a pro
when I'm feeling better (on SSI and Medicaid). An editor can be built simply,
but it's preference that makes the difference. Some might have used tkinter,
gtk, wxpython or other toolkits for the task.
I think the main issue in responding is your library preference, or widget
set preference. These can make you right with some in your response, or
wrong with others who have a preferred GUI library that coincides with
their personal cognitive structure.
Sorry, you are not understanding Unicode: what a Unicode
Transformation Format (UTF) is, what the goal of a UTF is, and
why it is important for an implementation to work with a UTF.
Short example. Writing an editor with something like the
FSR is simply impossible (properly).
Really? Enlighten me.
Personally, I would never use a UTF as an *in-memory* representation for a
Unicode string if it were up to me. Why? Because UTF characters are
not uniform in byte width, so accessing positions within the string is
terribly slow and always has to be done by starting at the beginning of
the string. That's at minimum O(n) compared to FSR's O(1). Surely you
understand this. Do you dispute this fact?
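To make the O(n) point concrete, here is a hypothetical indexing routine
over raw UTF-8 bytes; the function name is made up for illustration, and
it assumes well-formed input. Note that it must walk every preceding
character:

```python
def utf8_index(data: bytes, n: int) -> str:
    """Return the n-th code point of UTF-8 `data` by linear scan.

    In a variable-width encoding, the lead byte of each character
    says how many bytes it occupies, so reaching index n requires
    skipping the n characters before it: O(n), not O(1).
    """
    def width(lead: int) -> int:
        if lead < 0x80:
            return 1        # ASCII
        if lead < 0xE0:
            return 2        # 2-byte sequence
        if lead < 0xF0:
            return 3        # 3-byte sequence
        return 4            # 4-byte sequence (astral plane)

    pos = 0
    for _ in range(n):                  # skip the first n characters
        pos += width(data[pos])
    return data[pos:pos + width(data[pos])].decode("utf-8")
```

For example, utf8_index("héllo\U0001F600".encode("utf-8"), 5) has to
step over five characters of varying byte width before it can return
the emoji.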
Frankly, Python's strings are a *terrible* internal representation
for an editor widget - not because of PEP 393, but simply because
they are immutable, and every keypress would result in a rebuilding
of the string. On the flip side, I could quite plausibly imagine
using a list of strings; whenever text gets inserted, the string gets
split at that point, and a new string created for the insert (which
also means that an Undo operation simply removes one entire string).
In this usage, the FSR is beneficial, as it's possible to have
different strings at different widths.
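A toy version of that list-of-strings buffer might look like this. The
class and method names are invented for illustration, not taken from any
real editor, and a real widget would need much more (deletion, cursors,
and so on):

```python
class PieceBuffer:
    """Toy list-of-strings text buffer (illustrative names only).

    Inserting splits the piece containing the offset, so every
    insertion becomes its own string; undo removes the most
    recently inserted piece.
    """

    def __init__(self, text=""):
        self.pieces = [text] if text else []
        self._inserted = []   # indices of inserted pieces (LIFO undo)

    def text(self):
        return "".join(self.pieces)

    def insert(self, offset, s):
        start = 0
        for idx, piece in enumerate(self.pieces):
            if offset <= start + len(piece):
                cut = offset - start
                # Split the piece; the insert is a standalone string,
                # so it may have a different PEP 393 width.
                self.pieces[idx:idx + 1] = [piece[:cut], s, piece[cut:]]
                self._inserted.append(idx + 1)
                return
            start += len(piece)
        self.pieces.append(s)
        self._inserted.append(len(self.pieces) - 1)

    def undo(self):
        # Only reliable for undoing the most recent insert, since
        # later splits would shift earlier recorded indices.
        if self._inserted:
            del self.pieces[self._inserted.pop()]
```

Inserting "," at offset 5 of "hello world" leaves the pieces as
["hello", ",", " world"], and undo() deletes just the "," piece.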
But mainly, I'm just wondering how many people here have any basis
from which to argue the point he's trying to make. I doubt most of
us have (a) implemented an editor widget, or (b) tested multiple
different internal representations to learn the true pros and cons
of each.
Maybe, but simply thinking logically, FSR and UCS-4 are equivalent in
pros and cons,
and the cons of using UCS-2 (the old narrow builds) are
well known: UCS-2 simply cannot represent all of Unicode correctly.
I used exactly this, a list of strings, for a Python-coded text-only mock
editor to replace the tk Text widget in idle tests. It works fine for the
purpose. For small test texts, the inefficiency of immutable strings is not
relevant.
Tk apparently uses a C-coded btree rather than a Python list. All details
are hidden, unless one finds and reads the source ;-), but it uses C
arrays rather than Python strings.
For my purpose, the mock Text works the same in 2.7 and 3.3+.
They both have the pro that indexing is direct *and correct*. The cons are
different.
Python's narrow builds, at least for several releases, were in between UCS-2
and UTF-16 in that they used surrogates to represent all Unicode code points
but did not correct indexing for the presence of astral chars. This is a
nuisance for those who do use astral chars, such as emoji and CJK name
chars, on an everyday basis.
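The narrow-build indexing problem can still be demonstrated today by
counting UTF-16 code units, which is what a narrow build's len() and
indexing actually operated on:

```python
s = "a\U0001F600b"          # three code points: 'a', an emoji, 'b'

# On a PEP 393 (or old wide) build, len() counts code points.
assert len(s) == 3

# A narrow build stored UTF-16 code units; the astral char takes a
# surrogate pair, so the same string measured as *four* units there,
# and s[1] returned half a character.
units = len(s.encode("utf-16-le")) // 2
assert units == 4
```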
Thanks for that report! And yes, it's going to behave exactly the same
way, because its underlying structure is an ordered list of ordered
lists of Unicode codepoints, ergo 3.3/PEP 393 is merely a question of
performance. But if you put your code onto a narrow build, you'll have
issues as seen below.
If nobody had ever thought of doing a multi-format string
representation, I could well imagine the Python core devs debating
whether the cost of UTF-32 strings is worth the correctness and
consistency improvements... and most likely concluding that narrow
builds get abolished. And if any other language (eg ECMAScript)
decides to move from UTF-16 to UTF-32, I would wholeheartedly support
the move, even if it broke code to do so.
To my mind, exposing UTF-16 surrogates to the application is a bug
to be fixed, not a feature to be maintained.
But since we can get the best of both worlds with only
a small amount of overhead, I really don't see why anyone should be
objecting.
It is definitely not a feature, but a proper UTF-16 implementation would not
expose them except to codecs, just as with the PEP 393 implementation. (In
both cases, I am excluding sys.getsizeof as 'exposing to the
application'.)
I presume you are referring to the PEP 393 1-2-4 byte implementation. Given
how well it has been optimized, I think it was the right choice for Python.
But a language that now uses UCS-2 or defective UTF-16 on all platforms might
find the auxiliary array an easier fix.
I'm referring here to objections like jmf's, and also to threads like this:
http://mozilla.6506.n7.nabble.com/Flexible-String-Representation-full-Unicode-for-ES6-td267585.html
According to the ECMAScript people, UTF-16 and exposing surrogates to
the application is a critical feature to be maintained. I disagree.
But it's not my language, so I'm stuck with it. (I ended up writing a
little wrapper function in C that detects unpaired surrogates, but
that still doesn't deal with the possibility that character indexing
can create a new character that was never there to start with.)
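A Python sketch of such an unpaired-surrogate check over 16-bit code
units (the wrapper described above was in C; this just shows the logic,
with an invented function name):

```python
def has_unpaired_surrogate(units):
    """Scan a sequence of 16-bit code units for a lone surrogate.

    A high surrogate (D800-DBFF) must be immediately followed by a
    low surrogate (DC00-DFFF); a low surrogate must never appear on
    its own. Anything else is an unpaired surrogate.
    """
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:               # high surrogate
            nxt = units[i + 1] if i + 1 < len(units) else None
            if nxt is None or not (0xDC00 <= nxt <= 0xDFFF):
                return True                     # no low half follows
            i += 2                              # valid pair, skip both
        elif 0xDC00 <= u <= 0xDFFF:             # lone low surrogate
            return True
        else:
            i += 1                              # ordinary BMP unit
    return False
```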
I don't fully understand
why making strings simply "unicode" in javascript breaks compatibility
with older scripts. What operations are performed on strings that
making unicode an abstract type would break?
Heh... The Grey Wolf (1999 Jeep Cherokee) is "female", but the
In the absence of specific information to the contrary, I'll usually refer
to computers and programs by masculine pronouns; no particular reason
for it, just the same sort of convention that has most ships and boats
being "she".