Unicode Support in Ruby, Perl, Python, Emacs Lisp

Xah Lee · Oct 7, 2010

here's my experiences dealing with unicode in various langs.

I looked at Ruby 2 years ago. One problem i found is that it does not
support Unicode well. I just checked today, and it still doesn't. Just
do a web search of blogs and forums for “ruby unicode”. e.g.: Source,
Source, Source, Source.

Perl's exceedingly lousy unicode support hack is well known. In fact
it is the primary reason i “switched” to python for my scripting needs
in 2005. (See: Unicode in Perl and Python)

Python 2.x's unicode support is also not ideal. You have to declare
your source code's encoding with a header like
「#-*- coding: utf-8 -*-」, and you have to declare your string as
unicode with “u”, e.g. 「u"林花謝了春紅"」. In regex, you have to use the
unicode flag, such as 「re.search(r'\.html$',child,re.U)」. And when
processing files, you have to read in with
「unicode(inF.read(),'utf-8')」, and for printing out unicode you have
to do 「outF.write(outtext.encode('utf-8'))」. If you are processing
lots of files, and one of the files contains a bad char or doesn't use
the encoding you expected, your python script chokes dead in the
middle, and you don't even know which file it is or which line unless
your code prints file names.
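
For comparison, here is a minimal sketch of the "which file choked" problem in Python 3 (which the post itself has not tried). It reads each file as bytes, decodes explicitly, and records the file name and byte offset on failure instead of dying mid-run. The function name `read_text_files` and the two demo files are assumptions made up for illustration.

```python
import os
import tempfile


def read_text_files(paths, encoding="utf-8"):
    """Decode each file; return (texts, failures).

    texts maps path -> decoded str.
    failures maps path -> (byte offset of the first bad byte, reason).
    """
    texts, failures = {}, {}
    for path in paths:
        with open(path, "rb") as in_f:
            raw = in_f.read()
        try:
            texts[path] = raw.decode(encoding)
        except UnicodeDecodeError as err:
            # err.start is an offset into raw, i.e. into the file.
            failures[path] = (err.start, err.reason)
    return texts, failures


# Demo (hypothetical files): one valid UTF-8 file, one with a stray 0xFF byte.
demo_dir = tempfile.mkdtemp()
good_path = os.path.join(demo_dir, "good.txt")
bad_path = os.path.join(demo_dir, "bad.txt")
with open(good_path, "wb") as f:
    f.write("林花謝了春紅".encode("utf-8"))
with open(bad_path, "wb") as f:
    f.write(b"abc\xff")

texts, failures = read_text_files([good_path, bad_path])
for path, (offset, reason) in failures.items():
    print("cannot decode %s at byte %d: %s" % (path, offset, reason))
```

The good file still gets processed, and the bad one is reported by name rather than aborting the whole batch.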

Also, if the output shell doesn't support unicode or doesn't match
the encoding specified in your python print, you get gibberish.
It is often a headache to figure out the locale settings, what
encoding the terminal supports or is configured to handle, the encoding
of your file, and which encoding “print” is using. It gets more
complex if you are going thru a network, such as ssh. (most shells and
terminals, as of 2010-10, in practice still have problems dealing
with unicode, e.g. Windows Console, PuTTY. The exception is Mac's
Apple Terminal.)
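
One way to take some of the guesswork out of this is to ask Python what it thinks the output and locale encodings are, and to test whether a string survives a given encoding before printing it. This is a rough Python 3 sketch; the helper names `describe_output_encoding` and `encodable` are made up for illustration, and what they report depends on your terminal and locale.

```python
import locale
import sys


def describe_output_encoding(stream=None):
    """Report the encoding of an output stream and of the current locale."""
    stream = sys.stdout if stream is None else stream
    return {
        # Some redirected or replaced streams have no encoding attribute.
        "stream": getattr(stream, "encoding", None),
        "locale": locale.getpreferredencoding(False),
    }


def encodable(text, encoding):
    """Return True if text can be encoded in the given encoding without error."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


info = describe_output_encoding()
print("stdout encoding:", info["stream"])
print("locale encoding:", info["locale"])
print("fits in ascii?", encodable("林花謝了春紅", "ascii"))
```

If the stream encoding and the terminal's real encoding disagree, the check passes but the screen still shows gibberish, which is exactly the headache described above.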

Python 3 supposedly fixed the unicode problem, but i haven't used it.
Last time, i looked into whether i should adopt python 3, but
apparently it isn't used much. (See: Python 3 Adoption) (and i'm quite
pissed that Python is going more and more into OOP mumbo jumbo with
lots of ad hoc syntax (e.g. “views”, “iterators”, “list comprehension”).)

I'll have to say, as far as text processing goes, the most beautiful
lang with respect to unicode is emacs lisp. In elisp code (e.g.
Generate a Web Links Report with Emacs Lisp), i don't have to declare
any of the unicode or encoding stuff. I simply write code to process
string or buffer text, without even having to know what encoding it
is. Emacs the environment takes care of all that.

It seems that javascript and PHP also support unicode well, but i
don't have extensive experience with them. I suppose that elisp, php,
javascript, all support unicode well because these langs have to deal
with unicode in practical day-to-day situations.

Bigos · Oct 9, 2010

Maybe you have checked the wrong version. There are two versions of
Ruby out there; one supports unicode and the other doesn't. The latest
version, i.e. the 1.9.x branch, has made some progress in that regard.
Please check the following links to see if they solve your problem.

http://nuclearsquid.com/writings/ruby-1-9-encodings.html
http://loopkid.net/articles/2008/07/07/ruby-1-9-utf-8-mostly-works
http://stackoverflow.com/questions/1627767/rubys-stringgsub-unicode-and-non-word-characters

I think the latest recommended version of Ruby is ruby 1.9.2p0, please try
it to see if it works for you. Of course it is not as good as Lisp,
and in Rails code you see people writing the same sequences of
characters over and over again, but some people like it because it is
better than other languages they used before. If it's a stepping stone
towards Lisp then it is a good thing imho.

Xah Lee · Oct 10, 2010

2010-10-09

Sean McAfee said:
I think your assessment is antiquated. I've been doing Unicode
programming with Perl for about three years, and it's generally quite
wonderfully transparent.

you are probably right. The last period i did serious perl is 1998 to
2004. Since then, i have pretty much lost contact with the perl
community.

i have like 5 years of 8 hours day experience with perl... the app we
wrote is probably the largest perl web app at the time, say within the
top 10 largest perl web apps, during the dot com days.

spent 2 years with python, about 2005 and 2006, but mostly just
personal dabbling.

my dilemma is this... i am really tired of perl, so i thought python is
my solution. Comparing the syntax, semantics, etc, i really do find
python better, but to know python as well as i know perl, or to know
a lang really as an expert (e.g. intimately familiar with all the ins
and outs of constructs, idioms, their speeds, libraries out there,
their nature, which are used, their bugs etc), takes years. So,
whenever i have this psychological urge to totally ditch perl and hug
python 100% ... but it takes a huge amount of time to dig into a lang
well again, so sometimes i thought of sticking with my perl due to my
existing knowledge and forthwith stop wasting valuable time, but then,
whenever i work in perl with its hack nature and crooked community
(all those mongers ****), especially the syntax for nested list/hash
that's more than 3 levels (and my code almost always relies on nested
list/hash to do things since i am a functional programer), and compare
to python's syntax on nested structure, i ask myself again, is this
shit really what i want to keep on at?

and python 3 comes in, and over the years i learned, that Guido really
hates functional programing (he understands it nil), and python is
moving more into oop mumbo jumbo with more special syntaxes and
special semantics. (and perl is trivially far more capable at
functional programing than python) So, this puts a damnation in my
mental struggle for python.

in the end i really haven't decided on anything, as usual... it's not
really a concrete, answerable question anyway, it's just psy struggle on
some fuzzy ideal about efficiency and perfect lang.

and there's ruby... (among others) and because i'm such a douchebag for
langs, now and then i suppose i waste my time to venture and read
about ruby, the unconscious excuse is that maybe ruby will turn out to
simply solve all my life's problems, but nagging in the back of my
mind is the reality that, yeah, go spend 3 years 8 hours a day on
ruby, then possibly it'll be practically useful to me as i do with
perl already, and, no, it won't bring you anything extra as far as
lang goes, for that you go to OCaml/F#, erlang, Mathematica ... and
who knows what kinda hidden needle in the eye i'll discover on my road
in ruby.

btw, this is all just a geek's mental disorder, common with many who
are into lang design and beauty etc type of shit. (high percentage of
this crowd hang in newsgroups) But the reality is that, this
psychological problem really doesn't have much practical justification
... it's just
fret, fret, fret. Fret, fret, fret. Years of fretting, while others
have written great apps all over the web.

in practice, i do not even have a need for perl or python in my work
since about 2006, except a few find/replace scripts for text
processing that i've written in the past. And, since about 2007, i've
been increasingly writing lots and lots more in elisp. (and this emacs
beast, is really a true love more than anything) So these days, almost
all of my scripts are in elisp. (and my job these days is mainly just
text processing programing)

• 〈Xah on Programing Languages〉
http://xahlee.org/Periodic_dosage_dir/comp_lang.html

On the programmers' web site stackoverflow.com, I flag questions with
the "unicode" tag, and of questions that mention a specific language,
Python and C++ seem to come up the most often.

It's not quite perfect, though. I recently discovered that if I enter a
Chinese character using my Mac's Chinese input method, and then enter
the same character using a Japanese input method, Emacs regards them as
different characters, even though they have the same Unicode code point.
For example, from describe-char:

character: 一 (43323, #o124473, #xa93b, U+4E00)
character: 一 (55404, #o154154, #xd86c, U+4E00)

that's because you are using a version before emacs 23. Try switching
to emacs 23; it uses utf-8 to represent chars internally.

On saving and reverting a file containing such text, the characters are
"normalized" to the Japanese version.

I suppose this might conceivably be the correct behavior, but it sure
was a surprise that (equal "一" "一") can be nil.

(equal "一" "一")

with emacs 23.*, this evals to true.
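
The difference can be pictured with a toy model in Python 3. This is only an illustration of the reasoning, not how any Emacs version actually stores characters: the `Char` record is invented here, and the numbers are the ones from the describe-char output quoted above. If equality looks at an editor-internal code, the two inputs differ. If it looks at the Unicode code point, they are the same.

```python
from collections import namedtuple

# Toy model (an assumption for illustration): a character as an
# editor-internal code plus its Unicode code point.
Char = namedtuple("Char", ["internal", "code_point"])

# Values copied from the describe-char output quoted above.
from_chinese_input = Char(internal=0xA93B, code_point=0x4E00)
from_japanese_input = Char(internal=0xD86C, code_point=0x4E00)

# Comparing internal codes says "different"; comparing code points says "same".
same_internal = from_chinese_input.internal == from_japanese_input.internal
same_code_point = from_chinese_input.code_point == from_japanese_input.code_point

# A Python 3 str is a sequence of code points, so these all compare equal.
py_equal = chr(0x4E00) == "\u4e00" == "一"

print(same_internal, same_code_point, py_equal)
```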

• 〈New Features in Emacs 23〉
http://xahlee.org/emacs/emacs23_features.html

• 〈Emacs and Unicode Tips〉
http://xahlee.org/emacs/emacs_n_unicode.html

• 〈All about Unicode〉
http://xahlee.org/Periodic_dosage_dir/unicode.html

Xah ∑ xahlee.org ☄

Steven D'Aprano · Oct 10, 2010

Maybe you have checked the wrong version. There are two versions of
Ruby out there; one supports unicode and the other doesn't.

Please don't feed the trolls. Xah Lee is a known troll who cross-posts to
irrelevant newsgroups with his blatherings. He is not interested in
learning anything which challenges his opinions, and rarely if ever
engages in dialog with those who respond.

Since your reply has little or nothing to do with the newsgroups you have
sent it to, it is also spamming. While we're all extremely impressed by
your assertion that Lisp is the bestest programming language evar, please
keep your fan-boy gushing to comp.lang.lisp and don't cross-post again.

Followups to /dev/null.

David Kastrup · Oct 10, 2010

Sean McAfee said:
I think your assessment is antiquated. I've been doing Unicode
programming with Perl for about three years, and it's generally quite
wonderfully transparent.

On the programmers' web site stackoverflow.com, I flag questions with
the "unicode" tag, and of questions that mention a specific language,
Python and C++ seem to come up the most often.

It's not quite perfect, though. I recently discovered that if I enter a
Chinese character using my Mac's Chinese input method, and then enter
the same character using a Japanese input method, Emacs regards them as
different characters, even though they have the same Unicode code point.
For example, from describe-char:

character: 一 (43323, #o124473, #xa93b, U+4E00)
character: 一 (55404, #o154154, #xd86c, U+4E00)

On saving and reverting a file containing such text, the characters are
"normalized" to the Japanese version.

I suppose this might conceivably be the correct behavior, but it sure
was a surprise that (equal "一" "一") can be nil.

Your headers state:

User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.3 (darwin)

That's an old version of Emacs, more than 2 years old. 23.1 was
released more than a year ago. The current version is 23.2.

Nobody · Oct 10, 2010

It's not quite perfect, though. I recently discovered that if I enter a
Chinese character using my Mac's Chinese input method, and then enter
the same character using a Japanese input method, Emacs regards them as
different characters, even though they have the same Unicode code point.
For example, from describe-char:

character: 一 (43323, #o124473, #xa93b, U+4E00)
character: 一 (55404, #o154154, #xd86c, U+4E00)

On saving and reverting a file containing such text, the characters are
"normalized" to the Japanese version.

I don't know about GNU Emacs, but XEmacs doesn't use Unicode internally,
it uses byte-strings with associated encodings. Some of us like it that
way, as converting to Unicode may not be reversible, and it's often
important to preserve exact byte sequences.
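
The reversibility point can be sketched in Python 3 (used here only as an illustration; this is not how XEmacs stores text). Decoding arbitrary bytes as UTF-8 with replacement throws information away, while the `surrogateescape` error handler keeps the undecodable bytes recoverable. The sample bytes are made up for the demo.

```python
# Sample input (an assumption): a Latin-1 0xE9 byte, a space, then the
# UTF-8 encoding of U+4E00. The 0xE9 byte is not valid UTF-8 here.
raw = b"caf\xe9 \xe4\xb8\x80"

# Lossy: the invalid byte becomes U+FFFD, so the original byte is gone.
lossy = raw.decode("utf-8", errors="replace")
lossy_roundtrip = lossy.encode("utf-8")

# Reversible: the invalid byte is carried along as a lone surrogate
# and turned back into the same byte when encoding.
kept = raw.decode("utf-8", errors="surrogateescape")
kept_roundtrip = kept.encode("utf-8", errors="surrogateescape")

print("replace round-trips:", lossy_roundtrip == raw)
print("surrogateescape round-trips:", kept_roundtrip == raw)
```

Keeping byte strings with an associated encoding, as described above, avoids the question entirely, at the cost of having to track the encoding everywhere.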

FWIW, I'd expect Ruby to have worse support for Unicode, as its creator is
Japanese. Unicode is still far more popular in locales which historically
used ASCII or "almost ASCII" (e.g. ISO-646-*, ISO-8859-*) encodings than
in locales which had to use a radically different encoding.

Steven D'Aprano · Oct 10, 2010

On Sun, 10 Oct 2010 11:34:02 +0200, David Kastrup wrote:
[unnecessary quoting removed]

Your headers state:

User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.3 (darwin)

Please stop spamming multiple newsgroups. I'm sure this is of great
interest to the Emacs newsgroup, but not to the Python one.

Followups to /dev/null.
