| On Thursday, 6 June 2013 at 3:44:52 PM UTC+3, Steven D'Aprano wrote:
| > py> s = '999-Eυχή-του-Ιησού'
| > py> bytes_as_utf8 = s.encode('utf-8')
| > py> t = bytes_as_utf8.decode('iso-8859-7', errors='replace')
| > py> print(t)
| > 999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ
|
| errors='replace' means don't break in case of error?
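Yes. Bytes that are valid iso-8859-7 decode normally; any byte that
cannot be decoded is replaced with U+FFFD (the replacement character)
instead of raising an exception, so the result is slightly mangled
rather than an error.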
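A minimal sketch of the difference in Python 3. The single sample
character is my own choice for illustration, not something from your
filename handling: its second UTF-8 byte (0xAE) is unassigned in
ISO-8859-7, so strict decoding fails on it.

```python
# 'ή' (U+03AE) encodes to the two UTF-8 bytes 0xCE 0xAE.
data = 'ή'.encode('utf-8')

# Default (strict) decoding raises on the byte ISO-8859-7 does not define.
try:
    data.decode('iso-8859-7')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors='replace' keeps going and substitutes U+FFFD for the bad byte.
replaced = data.decode('iso-8859-7', errors='replace')

print(strict_failed)
print(ascii(replaced))
```

The 0xCE byte still decodes (to a capital xi), which is why the result
is mangled rather than empty.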
| You took the unicode string 's' and encoded it to a utf-8 bytestring.
| Then how is it possible to ask for the utf-8 bytestring to decode
| back to a unicode string using a different charset than the
| one used for encoding, and this actually printed the filename in
| greek-iso?
It is easily possible, as shown above. Does it make sense? Normally
not, but Steven is demonstrating how your "mv" exercises have
behaved: a rename using utf-8, then a _display_ using iso-8859-7.
| > So that demonstrates part of your problem: even though your Linux system
| > is using UTF-8, your terminal is probably set to ISO-8859-7. The
| > interaction between these will lead to strange and disturbing Unicode
| > errors.
|
| Yes, I feel this is the problem too.
| It's a wonder to me why putty used greek-iso by default instead of utf-8 !!
Putty will get its terminal setting from the system you came from.
I suppose Windows of some kind. If you look at Putty's settings you
may be able to specify UTF-8 explicitly; not sure. If you can, do
that. At least there will be one less layer of confusion to debug.
| Please explain this to me, because now that I begin to understand
| these encode/decode things I begin to like them!
|
| a) WHAT does it mean when a linux system is set to use utf-8?
It means the locale settings _for the current process_ are set for
UTF-8. The "locale" command will show you the current state. There
will also be some system settings with defaults for stuff started
up by the system. On CentOS and RedHat that is probably the file:
/etc/sysconfig/i18n
_However_, when you ssh in to the system using Putty or another ssh
client, the settings at your local end are passed to the remote ssh
session. In this way different people using different locales can
ssh in and get the locales they expect to use.
Of course, if the locale settings differ and these people are working
on the same files and text, madness will ensue.
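If you would rather check from inside Python than with the "locale"
command, something like the following sketch reports what the current
process sees. The function name locale_report is my own invention; the
variables it looks at are the usual POSIX ones, and your system may set
others as well.

```python
import locale
import os


def locale_report(environ=os.environ):
    """Collect the locale-related settings visible to this process."""
    names = ('LC_ALL', 'LC_CTYPE', 'LANG')
    # None means the variable is not set in this environment.
    report = {name: environ.get(name) for name in names}
    # The encoding Python would pick by default for text I/O.
    report['preferred_encoding'] = locale.getpreferredencoding(False)
    return report


for key, value in locale_report().items():
    print(key, '=', value)
```

Run it once from the ssh session and once from a CGI script; if the two
reports differ, the two processes are not using the same locale.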
| b) WHAT does it mean when a terminal client is set to use utf-8?
It means the _display_ end of the terminal will render characters
using UTF-8. Data comes from the remote system as a sequence of
bytes. The terminal receives these bytes and _decodes_ them using
utf-8 (or whatever) in order to decide what characters to display.
| c) WHAT happens when the two of them try to work together?
If everything matches, it is all good. If the locales do not match,
the mismatch will result in an undesired bytes<->characters
encode/decode step somewhere, and something will display incorrectly
or be entered as input incorrectly.
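Both directions can be simulated in a few lines of Python. This is only
a sketch: the two helper functions are hypothetical names of mine, and a
real terminal does this byte by byte rather than on whole strings, but
the encode/decode mismatch is the same.

```python
def shown_on_terminal(text, system_encoding, terminal_encoding):
    """Encode as the remote system would, decode as the terminal would."""
    data = text.encode(system_encoding)
    return data.decode(terminal_encoding, errors='replace')


def received_from_keyboard(text, terminal_encoding, system_encoding):
    """Encode as the terminal sends keystrokes, decode as the system would."""
    data = text.encode(terminal_encoding)
    return data.decode(system_encoding, errors='replace')


name = 'του'
matched = shown_on_terminal(name, 'utf-8', 'utf-8')
mangled_output = shown_on_terminal(name, 'utf-8', 'iso-8859-7')
mangled_input = received_from_keyboard(name, 'iso-8859-7', 'utf-8')

# ascii() makes the invisible control characters visible.
print(ascii(matched))
print(ascii(mangled_output))
print(ascii(mangled_input))
```

Only the matched case gives back the original text; the other two are
the display and input failures described above.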
| > So I believe I understand how your file name has become garbage. To fix
| > it, make sure that your terminal is set to use UTF-8, and then rename it.
| > Do the same with every file in the directory until the problem goes away.
|
| (e-mail address removed) [~/www/cgi-bin]# echo $LS_OPTIONS
| --color=tty -F -a -b -T 0
|
| Is this okay? Is the '-b' option for displaying a filename in binary mode?
Roughly. With GNU ls, "-b" prints nongraphic characters as C-style
backslash escapes. "man ls" will tell you.
Personally, I "unalias ls" on RedHat systems (and any other system
where an alias has been set up). I want ls to do what I say, not
what someone else thought was a good idea.
| Indeed I have changed putty to use 'utf-8' and 'ls -l' now displays
| the file in correct greek letters. Switching putty's encoding back
| to 'greek-iso', the *displayed* filenames show as mojibake.
Exactly so.
| WHAT is being displayed and what is actually stored as bytes are two different things, right?
Yes. Display requires the byte stream to be decoded. Wrong decoding
displays the wrong characters/glyphs.
| Ευχη του Ιησου.mp3
| EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ
|
| is the way the filename is displayed in the terminal depending
| on the encoding the terminal uses, correct? But no matter *how* it's
| being displayed, those two are the same file?
In principle, yes. Nothing has changed on the filesystem itself.
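You can convince yourself of this with a throwaway directory. The sketch
below assumes a typical Linux filesystem, which stores names as plain
bytes; the file name is a made-up example, not your actual file.

```python
import os
import tempfile

# Hypothetical file name, stored as UTF-8 bytes.
name_bytes = 'Ευχή.mp3'.encode('utf-8')

with tempfile.TemporaryDirectory() as tmp:
    tmp_bytes = os.fsencode(tmp)
    # Create an empty file using the raw bytes as its name.
    with open(os.path.join(tmp_bytes, name_bytes), 'wb'):
        pass
    # A bytes path makes listdir return the stored names undecoded.
    stored = os.listdir(tmp_bytes)[0]

# One stored name, two ways of looking at it.
as_utf8 = stored.decode('utf-8')
as_greek_iso = stored.decode('iso-8859-7', errors='replace')

print(stored)
print(ascii(as_utf8))
print(ascii(as_greek_iso))
```

The stored bytes are what the filesystem holds; the two decoded strings
are just two views of them.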
Cheers,
--
Cameron Simpson
You write code in a proportional serif? No wonder you got extra
semicolons falling all over the place.
No, I *dream* about writing code in a proportional serif font.
It's much more exciting than my real life.
/* dan: THE Anti-Ged -- Ignorant Yank (tm) #1, none-%er #7 */
Dan Nitschke