Xah Lee
Here are my experiences dealing with unicode in various langs.
Unicode Support in Ruby, Perl, Python, Emacs Lisp
Xah Lee, 2010-10-07
I looked at Ruby 2 years ago. One problem i found is that it does not
support Unicode well. I just checked today, and it still doesn't. Just do
a web search of blogs and forums for “ruby unicode”, e.g.: Source,
Source, Source, Source.
Perl's exceedingly lousy unicode support hack is well known. In fact
it is the primary reason i “switched” to python for my scripting needs
in 2005. (See: Unicode in Perl and Python)
Python 2.x's unicode support is also not ideal. You have to declare
your source code with a header like 「#-*- coding: utf-8 -*-」, and you
have to declare your string as unicode with “u”, e.g. 「u"林花謝了春紅"」. In
regex, you have to use the unicode flag, such as
「re.search(r'\.html$',child,re.U)」. And when processing files, you have to read in with
「unicode(inF.read(),'utf-8')」, and for printing out unicode you have to
do 「outF.write(outtext.encode('utf-8'))」. If you are processing lots of
files, and one of the files contains a bad char or doesn't use the
encoding you expected, your python script chokes dead in the middle,
and you don't even know which file it is or which line unless your code
prints file names.
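One way around the "which file, which line" problem is to decode each file inside a try block and record where the first bad byte sits. The following is a minimal sketch in Python 3 (not the Python 2 discussed above); the helper name `find_decode_errors` and the sample file names are made up for illustration.

```python
import os
import tempfile


def find_decode_errors(paths, encoding="utf-8"):
    """Return (path, line_number, reason) for each file that fails to decode.

    Hypothetical helper for illustration; files that decode cleanly are skipped.
    """
    problems = []
    for path in paths:
        with open(path, "rb") as f:
            raw = f.read()
        try:
            raw.decode(encoding)
        except UnicodeDecodeError as err:
            # err.start is the byte offset of the first undecodable byte;
            # counting the newlines before it gives a 1-based line number.
            line_number = raw.count(b"\n", 0, err.start) + 1
            problems.append((path, line_number, err.reason))
    return problems


# Usage: one valid UTF-8 file, and one with a stray Latin-1 byte on line 2.
tmpdir = tempfile.mkdtemp()
good = os.path.join(tmpdir, "good.txt")
bad = os.path.join(tmpdir, "bad.txt")
with open(good, "wb") as f:
    f.write("林花謝了春紅\n".encode("utf-8"))
with open(bad, "wb") as f:
    f.write(b"first line\ncaf\xe9 second line\n")

report = find_decode_errors([good, bad])
for path, line_number, reason in report:
    print("%s: line %d: %s" % (os.path.basename(path), line_number, reason))
```

Reading the file as bytes first means the script keeps going past a bad file and can list every offender at the end, instead of dying on the first one.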
Also, if the output shell doesn't support unicode or doesn't match
the encoding specified in your python print, you get gibberish.
It is often a headache to figure out the locale settings, what
encoding the terminal supports or is configured to handle, the encoding
of your file, and which encoding the “print” is using. It gets more
complex if you are going thru a network, such as ssh. (Most shells and
terminals, as of 2010-10, in practice, still have problems dealing
with unicode, e.g. Windows Console, PuTTY. The exception is Mac's
Apple Terminal.)
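When debugging this kind of mismatch, it can help to have the script report which encodings it thinks it is working with. This is a rough sketch in Python 3, using only standard `sys` and `locale` calls; the function names `describe_encodings` and `can_print` are invented here, and the values reported will differ from machine to machine.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings Python reports for this environment."""
    return {
        # getattr guards against stdout/stdin being replaced by an object
        # that has no encoding attribute (e.g. under some test runners).
        "stdout": getattr(sys.stdout, "encoding", None),
        "stdin": getattr(sys.stdin, "encoding", None),
        "locale preferred": locale.getpreferredencoding(False),
        "filesystem": sys.getfilesystemencoding(),
        "default": sys.getdefaultencoding(),
    }


def can_print(text, encoding):
    """Return True if text can be encoded in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except (UnicodeEncodeError, LookupError):
        # LookupError covers an encoding name Python does not know.
        return False


for name, value in describe_encodings().items():
    print("%-18s %s" % (name, value))

sample = "林花謝了春紅"
print("utf-8 ok:", can_print(sample, "utf-8"))
print("ascii ok:", can_print(sample, "ascii"))
```

Note that this only shows what Python believes; the terminal at the other end of an ssh session can still be configured differently.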
Python 3 supposedly fixed the unicode problem, but i haven't used it.
Last time i looked into whether i should adopt python 3,
it apparently wasn't used much. (See: Python 3 Adoption) (And i'm quite
pissed that Python is going more and more into OOP mumbo jumbo with
lots of ad hoc syntax, e.g. “views”, “iterators”, “list comprehension”.)
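For comparison, here is a rough sketch of how the Python 2 steps listed earlier look in Python 3. It goes by Python 3's documented behavior, not by anything tested for this post; the file name and sample strings are made up.

```python
import os
import re
import tempfile

# Python 3 source files are UTF-8 by default, so no coding header is needed,
# and every str literal is unicode without the u prefix.
text = "林花謝了春紅"

# Patterns given as str match in unicode mode by default (no re.U needed).
child = "報告.html"
is_html = re.search(r"\.html$", child) is not None
word = re.match(r"\w+", text).group(0)

# open() takes the encoding directly, replacing the manual
# unicode(...) / .encode(...) calls on read and write.
path = os.path.join(tempfile.mkdtemp(), "out.txt")
with open(path, "w", encoding="utf-8") as out_f:
    out_f.write(text)
with open(path, "r", encoding="utf-8") as in_f:
    round_trip = in_f.read()

print(len(text), is_html, word == text, round_trip == text)
```

A file in an unexpected encoding still raises an error on read, so the bad-file problem does not go away by itself.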
I'll have to say, as far as text processing goes, the most beautiful
lang with respect to unicode is emacs lisp. In elisp code (e.g.
Generate a Web Links Report with Emacs Lisp), i don't have to declare
any of the unicode or encoding stuff. I simply write code to process
strings or buffer text, without even having to know what encoding it
is. Emacs the environment takes care of all that.
It seems that javascript and PHP also support unicode well, but i
don't have extensive experience with them. I suppose that elisp, php,
javascript, all support unicode well because these langs have to deal
with unicode in practical day-to-day situations.