Unicode as valid naming symbols

Marko Rauhamaa · Apr 1, 2014

Chris Angelico said:
I don't find it more readable to cast something as recursive; compare
these two tight loops:

    (let find-divisor ((c 2))
      (cond
       ((= c i)
        (format #t "~S\n" i)
        (display-primes (1+ count) (1+ i)))
       ((= (remainder i c) 0)
        (display-primes count (1+ i)))
       (else
        (find-divisor (1+ c)))))))))

    for ( factor = 2 ; factor <= i - 1 ; factor++ )
        if ( i%factor == 0 ) break;
    if ( factor == i )
    {
        printf("%d\n",i);
        count--;
    }

In the first one, you start doing something, and if you don't have a
termination point, you recurse - which means you have to name this
loop as a function. In the second, you simply iterate,

I implemented the loops in the scheme way. Recursion is how iteration is
done by the Believers. Traditional looping structures are available to
scheme, but if you felt the need for them, you might as well program in
Python.

On the other hand, I didn't look for the most elegant implementation
idiom but tried to translate the original rather mechanically--in good
and bad.

My view is definitely that the C version is WAY more readable than the
Scheme one.

Yes, scheme is an acquired taste. As is Python. My experienced bash/C
colleague was baffled by some Python idioms (not in my code, I might
add) that looked pretty clear to me.

Marko

Chris Angelico · Apr 1, 2014

I implemented the loops in the scheme way. Recursion is how iteration is
done by the Believers. Traditional looping structures are available to
scheme, but if you felt the need for them, you might as well program in
Python.

Then I'm happily a pagan who uses while loops instead of recursion.
Why should every loop become a named function?

    find_divisor: for ( factor = 2 ; i%factor ; factor++ )
    {
        if ( factor == i )
        {
            printf("%d\n",i);
            count--;
            break;
        }
    }

Does that label add anything? If you really need to put a name to
every loop you ever write, there's something wrong with the code; some
loops' purposes should be patently obvious by their body. All you do
is add duplicate information that might be wrong.

ChrisA

Ned Batchelder · Apr 1, 2014

That's reasonable. The Pc category doesn't have much in it:

http://www.fileformat.info/info/unicode/category/Pc/list.htm

If the definition of "characters permitted in identifiers" is derived
exclusively from the Unicode categories, including Pc would make fine
sense. Probably the definition should be: First character is L* or Pc,
subsequent characters are L*, N*, or Pc, and either Mn or M*
(combining characters). Or something like that.

Maybe I'm misunderstanding the discussion... It seems like we're talking
about a hypothetical definition of identifiers based on Unicode
character categories, but there's no need: Python 3 has defined
precisely that. From the docs
(https://docs.python.org/3/reference/lexical_analysis.html#identifiers):

---<snip>---------

Python 3.0 introduces additional characters from outside the ASCII range
(see PEP 3131). For these characters, the classification uses the
version of the Unicode Character Database as included in the unicodedata
module.

Identifiers are unlimited in length. Case is significant.

identifier   ::=  xid_start xid_continue*
id_start     ::=  <all characters in general categories Lu, Ll, Lt, Lm,
                  Lo, Nl, the underscore, and characters with the
                  Other_ID_Start property>
id_continue  ::=  <all characters in id_start, plus characters in the
                  categories Mn, Mc, Nd, Pc and others with the
                  Other_ID_Continue property>
xid_start    ::=  <all characters in id_start whose NFKC normalization
                  is in "id_start xid_continue*">
xid_continue ::=  <all characters in id_continue whose NFKC
                  normalization is in "id_continue*">

The Unicode category codes mentioned above stand for:

Lu - uppercase letters
Ll - lowercase letters
Lt - titlecase letters
Lm - modifier letters
Lo - other letters
Nl - letter numbers
Mn - nonspacing marks
Mc - spacing combining marks
Nd - decimal numbers
Pc - connector punctuations
Other_ID_Start - explicit list of characters in PropList.txt to
support backwards compatibility
Other_ID_Continue - likewise

All identifiers are converted into the normal form NFKC while parsing;
comparison of identifiers is based on NFKC.
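
For anyone who wants to poke at these rules interactively, here is a minimal sketch using only the standard library. The helper name `describe` and the sample name are illustrative choices, not anything from the docs; `str.isidentifier()`, `unicodedata.normalize()` and `unicodedata.category()` are the real calls doing the work.

```python
import unicodedata


def describe(name):
    """Return (is_identifier, NFKC form, category of each character)."""
    return (
        name.isidentifier(),
        unicodedata.normalize("NFKC", name),
        [unicodedata.category(ch) for ch in name],
    )


# U+FB01 is the "fi" ligature; NFKC folds it into the two letters "fi".
ligature_name = "\ufb01le"
print(describe(ligature_name))

# Assigning through the ligature spelling binds the normalized name,
# because the parser converts identifiers to NFKC.
namespace = {}
exec(ligature_name + " = 1", namespace)
print("file" in namespace)
```

The ligature spelling and the plain ASCII spelling end up as one and the same name, which is what "comparison of identifiers is based on NFKC" amounts to in practice.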

Rustom Mody · Apr 1, 2014

I'd really rather not have a drastically different concept of "name"
to every other language's definition! Reading over COBOL code is
confusing in ways that reading, say, Ruby code isn't; the ? and !
suffixes aren't nearly as confusing as:

http://www.math-cs.gordon.edu/courses/cs323/COBOL/cobol.html
"""
COBOL identifiers are 1-30 alphanumeric characters, at least one of
which must be non-numeric. In certain contexts it is permissible to use
a totally numeric identifier; however, that usage is discouraged.
Hyphens may be included in an identifier anywhere except the first or
last character.
"""

Hyphens in names! Ugh! That means subtraction!

Just temporarily switch to a domain other than programming, one that
has not been under the absolute hegemony of ASCII for 40 years, and you
may get different results. See the first item here:
http://searchengineland.com/9-seo-quirks-you-should-be-aware-of-146465

Chris Angelico · Apr 1, 2014

Maybe I'm misunderstanding the discussion... It seems like we're talking
about a hypothetical definition of identifiers based on Unicode character
categories, but there's no need: Python 3 has defined precisely that. From
the docs
(https://docs.python.org/3/reference/lexical_analysis.html#identifiers):

"Python 3.0 introduces **additional characters** from outside the
ASCII range" - emphasis mine.

Python currently has - at least, per that documentation - a hybrid
system with ASCII characters defined in the classic way, and non-ASCII
characters defined by their Unicode character classes. I'm talking
about a system that's _purely_ defined by Unicode character classes.
It may turn out that the class list exactly compasses the ASCII
characters listed, though, in which case you'd be right: it's not
hypothetical.

In any case, Pc is included, which I should have checked beforehand.
So that part is, as you say, not hypothetical. Go for it! Use 'em.

ChrisA

Rustom Mody · Apr 1, 2014

"Python 3.0 introduces **additional characters** from outside the
ASCII range" - emphasis mine.

Python currently has - at least, per that documentation - a hybrid
system with ASCII characters defined in the classic way, and non-ASCII
characters defined by their Unicode character classes. I'm talking
about a system that's _purely_ defined by Unicode character classes.
It may turn out that the class list exactly compasses the ASCII
characters listed, though, in which case you'd be right: it's not
hypothetical.

In any case, Pc is included, which I should have checked beforehand.
So that part is, as you say, not hypothetical. Go for it! Use 'em.

Dunno if you really mean it or are just saying...

Steven gave the example the other day of confusing the identifiers
A and Ð. There must be easily hundreds (thousands?) of other such confusables.

So you think that's nice, but APL(-ese) and Scheme(-ish) are not...???

Confused by your stand...

Personally I don't believe that Unicode was designed with
programming languages in mind.

Assuming that Unicode categories will naturally and easily fit
programming-language lexical/syntax categories is rather naive.
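
To make the confusables worry concrete, here is a hedged sketch. The three capital letters are picked here purely for illustration (Latin, Greek and Cyrillic lookalikes); they are not the pair from Steven's example. The point is only that each one is a valid identifier on its own and that NFKC normalization does not merge them.

```python
import unicodedata

# Three letters that render almost identically in many fonts.
lookalikes = ["A", "\u0391", "\u0410"]  # Latin, Greek, Cyrillic

for ch in lookalikes:
    # unicodedata.name() is the easiest way to tell them apart.
    print(repr(ch), unicodedata.name(ch), ch.isidentifier())

# NFKC leaves all three distinct, so they would be three different names.
normalized = {unicodedata.normalize("NFKC", ch) for ch in lookalikes}
print(len(normalized))
```

So the category-based definition says nothing about visual similarity; catching such names would take a separate check, for instance in a linter or code review.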

Ian Kelly · Apr 1, 2014

"Python 3.0 introduces **additional characters** from outside the
ASCII range" - emphasis mine.

Python currently has - at least, per that documentation - a hybrid
system with ASCII characters defined in the classic way, and non-ASCII
characters defined by their Unicode character classes. I'm talking
about a system that's _purely_ defined by Unicode character classes.
It may turn out that the class list exactly compasses the ASCII
characters listed, though, in which case you'd be right: it's not
hypothetical.

The only ASCII character not encompassed is _, which is explicitly
permitted to start an identifier (for obvious reasons), whereas
characters in Pc are more generally only permitted to continue
identifiers.

There are also explicit lists of extra permitted characters in
PropList.txt for backward compatibility (once a character is
permitted, it should remain permitted even if its Unicode category
changes). There are currently 4 extra starting characters and 12
extra continuing characters, but none of these are ASCII.
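
The start/continue asymmetry is easy to check from Python itself. A small sketch, assuming U+203F UNDERTIE as a representative non-ASCII Pc character (an arbitrary pick, not one named in the thread):

```python
import unicodedata

UNDERSCORE = "_"      # U+005F LOW LINE
UNDERTIE = "\u203f"   # U+203F UNDERTIE, another connector punctuation

# Both characters report the same general category.
for ch in (UNDERSCORE, UNDERTIE):
    print(unicodedata.name(ch), unicodedata.category(ch))

# The underscore may start an identifier; the undertie may only continue one.
starts_with_underscore = (UNDERSCORE + "a").isidentifier()
starts_with_undertie = (UNDERTIE + "a").isidentifier()
continues_with_undertie = ("a" + UNDERTIE + "b").isidentifier()
print(starts_with_underscore, starts_with_undertie, continues_with_undertie)
```

Both characters are in Pc, yet only the underscore is accepted in first position, which matches the special case in the `id_start` rule.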

Marko Rauhamaa · Apr 1, 2014

Chris Angelico said:
Then I'm happily a pagan who uses while loops instead of recursion.
Why should every loop become a named function?

Every language has its idioms. The principal aesthetic motivation for
named-let loops is the avoidance of (set!), I think. Secondarily, you
get to shift gears in the middle of your loops; something you can often,
but not always, accomplish in Python with break, return and continue.
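
For comparison, here is one way the same prime search might look with Python's own loop idioms. It is a sketch, not a translation anyone in the thread posted; the function name `first_primes_iterative` and the choice to return a list instead of printing are assumptions made for illustration. The `for`/`else` clause plays the role of the "no divisor found" branch.

```python
def first_primes_iterative(n):
    """Return the first n primes using plain loops, break and for/else."""
    primes = []
    i = 2
    while len(primes) < n:
        for factor in range(2, i):
            if i % factor == 0:
                break              # divisor found: i is composite
        else:
            primes.append(i)       # loop finished without break: i is prime
        i += 1
    return primes


print(first_primes_iterative(5))
```

Here `break` and the `else` clause do the "gear shifting"; no loop needs a name.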

Don't get me wrong. Python has its own idioms and avoiding loops in
Python would be equally blasphemous. In C++ you avoid void pointers like
the plague, in C you celebrate them.

Marko

Rustom Mody · Apr 2, 2014

Marko Rauhamaa said:

Every language has its idioms. The principal aesthetic motivation for
named-let loops is the avoidance of (set!), I think. Secondarily, you
get to shift gears in the middle of your loops; something you can often,
but not always, accomplish in Python with break, return and continue.

You are forgetting the main point: In scheme, in a named-let, the name
chosen was very often 'loop' (if I remember the PC scheme manuals
correctly). IOW if you had a dozen loops implemented with
named-letted-tail-recursion, you could call all of them 'loop'. How
is that different from calling all of them 'while' or 'for' ?

Don't get me wrong. Python has its own idioms and avoiding loops in
Python would be equally blasphemous. In C++ you avoid void pointers like
the plague, in C you celebrate them.

Yeah... I guess that is the issue.
People brought up on imperative (which includes OO) programming think
recursion and iteration are fundamentally different, just as assembly
language programmers think of memory and registers as fundamentally
different. Sure, they are, but if you are a C programmer the distinction is
irrelevant 99% of the time!

It continues downward... For an assembly language programmer, the
distinction between memory and cache memory is not one he needs to make
99% of the time. Not so for the hardware engineer.

Rustom Mody · Apr 2, 2014

You are forgetting the main point: In scheme, in a named-let, the name
chosen was very often 'loop' (if I remember the PC scheme manuals
correctly). IOW if you had a dozen loops implemented with
named-letted-tail-recursion, you could call all of them 'loop'. How
is that different from calling all of them 'while' or 'for' ?

Umm... I see from your prime number example that there are nested loops
in which sometimes you restart the inner and sometimes the outer.
So you could not possibly call both of them 'loop'.

So "you could call all of them 'loop'" is an overstatement.
"A good many" may be more appropriate?

Marko Rauhamaa · Apr 2, 2014

Rustom Mody said:
Umm... I see from your prime number example that there are nested
loops in which sometimes you restart the inner and sometimes the
outer. So you could not possibly call both of them 'loop'.

Correct. I could call them "inner" and "outer". After all, the code uses
variables like "i", "c" and "n".

However, it doesn't hurt to use variable/function/loop names that convey
meaning.
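
To show why two distinct names are needed, here is a rough Python rendering of the named-let structure. It is only a sketch under stated assumptions: the enclosing Scheme definition is not shown in the thread, so the stopping rule (`count == n`) is a guess, and the names `first_primes_recursive`, `outer` and `inner` are invented for illustration. Python does not eliminate tail calls, so this is only suitable for small `n`.

```python
def first_primes_recursive(n):
    """Return the first n primes using two mutually recursive 'loops'."""
    found = []

    def outer(count, i):
        # Assumed stopping rule; the thread does not show this part.
        if count == n:
            return

        def inner(c):
            if c == i:                    # no divisor below i: i is prime
                found.append(i)
                outer(count + 1, i + 1)   # restart the outer loop
            elif i % c == 0:              # divisor found: i is composite
                outer(count, i + 1)       # restart the outer loop
            else:
                inner(c + 1)              # restart the inner loop

        inner(2)

    outer(0, 2)
    return found


print(first_primes_recursive(5))
```

Because `inner` calls both itself and `outer`, the two loops cannot share a single name such as 'loop'.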

Marko
