Changing filenames from Greeklish => Greek (subprocess complain)

  • Thread starter Νικόλαος Κούρας

Carlos Nepomuceno

Date: Tue, 4 Jun 2013 18:28:17 -0700
Subject: Re: Changing filenames from Greeklish => Greek (subprocess complain)
From: (e-mail address removed) [...]
Just a reminder to everyone that the OP originally went by the name of
Ferrous Cranus:
http://redwing.hutman.net/~mreed/warriorshtm/ferouscranus.htm

He's told there's a missing parenthesis, he dismisses the claim. He's
given code that demonstrates the missing parenthesis, and he acts
confused. The list is rapidly becoming his support group for _his
business_, and the bulk of it has very little to do with Python
itself.

I've been struggling for a month to get an inheritance chain working
with fresnel lenses, should I be posting every single bug I hit here
every 10 minutes then bump them 10 minutes later when no one responds?
Is that what the list is for now? We don't do people's home work for
them, so why are we doing his _work_ for him?

I once had the naive expectation that his obduracy would end! lol
 
Νικόλαος Κούρας

On Wednesday, 5 June 2013 4:28:17 AM UTC+3, user alex23 wrote:
Lele, the output of:

from collections import Counter
stmt = "cur.execute('''SELECT url FROM files WHERE url = %s''', ( fullpath, )"
chars_count = Counter(stmt)
print("Number of '(': %d" % chars_count['('])
print("Number of ')': %d" % chars_count[')'])

Number of '(': 2
Number of ')': 1
What do you make of this, please?



Just a reminder to everyone that the OP originally went by the name of
Ferrous Cranus:
http://redwing.hutman.net/~mreed/warriorshtm/ferouscranus.htm

He's told there's a missing parenthesis, he dismisses the claim. He's
given code that demonstrates the missing parenthesis, and he acts
confused. The list is rapidly becoming his support group for _his
business_, and the bulk of it has very little to do with Python
itself.

I've been struggling for a month to get an inheritance chain working
with fresnel lenses, should I be posting every single bug I hit here
every 10 minutes then bump them 10 minutes later when no one responds?
Is that what the list is for now? We don't do people's home work for
them, so why are we doing his _work_ for him?

As you have seen, I've been struggling for days now to get a solution to this, and the closing parenthesis is not the problem here; Unicode is.

YOU of all people should not speak at all, because you haven't helped me a bit.
It's funny how knowledgeable people who in fact tried to help me treat me with respect, while people like you who have never been of any help tend to just bitch all the way along.
 
Chris Angelico

YOU of all people should not speak at all, because you haven't helped me a bit.
It's funny how knowledgeable people who in fact tried to help me treat me with respect, while people like you who have never been of any help tend to just bitch all the way along.


You sure don't know respect when you don't see it.

ChrisA
 
Νικόλαος Κούρας

On Tuesday, 4 June 2013 10:31:20 PM UTC+3, user Lele Gaifax wrote:
The code above was my (failed) attempt to focus your attention on why
one of your scripts raised a SyntaxError: translating that code in plain
english, that line (the "stmt" variable above) contains *two* open
brackets, and *one* close bracket.

Lele, I am sorry for that. These days I do nothing all day long but try to solve 2 issues, one of them being files.py with these encoding issues. I missed the parenthesis because I was tired. Just added it.

I believe that in order to be able to solve this I have to

a) Find out the actual encoding my Greek filenames are in, after the rename from English to Greek chars took place on CentOS. How can I check that?

b) Finding out (a) will help tell Python to decode 'fullpath' from the weird, yet-to-be-discovered encoded bytestream to 'utf-8', like:

cur.execute('''SELECT url FROM files WHERE url = %s''', (fullpath.decode('weird_bytestream') ) )

Is this the right approach? I went to sleep yesterday and my mind was still bothered with this encoding problem I'm dealing with.
 
Νικόλαος Κούρας

On Wednesday, 5 June 2013 12:47:17 AM UTC+3, user Chris Angelico wrote:
For some reason you have an invalid Unicode codepoint in your string. Fix that.

Can you be more clear please?
My string is "Ευχή του Ιησού.mp3". Just a Greek filename with spaces.
Is there a problem when a filename contains both English and Greek letters?
Isn't it still a unicode stream?
 
Νικόλαος Κούρας

On Wednesday, 5 June 2013 6:44:38 AM UTC+3, user Νικόλαος Κούρας wrote:
On Wednesday, 5 June 2013 12:47:17 AM UTC+3, user Chris Angelico wrote:
Can you be more clear please?
My string is "Ευχή του Ιησού.mp3". Just a Greek filename with spaces.
Is there a problem when a filename contains both English and Greek letters?
Isn't it still a unicode stream?


I can't actually check what the encoding of a filename stored on the HDD is. It should be UTF-8, but it is not. It's probably whatever encoding I had on Windows. Perhaps I should make sure "root" and "fullpath" are bytes. Then the returned filenames should be bytes as well?

Is this achievable by doing the following?
print( root.decode('utf-8'), fullpath.decode('utf-8') )
 
Νικόλαος Κούρας

One of my Greek filenames is "Ευχή του Ιησού.mp3".
Just a Greek filename with spaces.
Is there a problem when a filename contains both English and Greek letters?
Isn't it still a unicode string?

All I did on my CentOS was 'mv "Euxi tou Ihsou.mp3" "Ευχή του Ιησού.mp3"'

and the displayed filename after 'ls -l' returned was:

is -rw-r--r-- 1 nikos nikos 3511233 Jun 4 14:11 \305\365\367\336\ \364\357\365\ \311\347\363\357\375.mp3

Is there no way at all to check the charset used to store it on the HDD?
It should be UTF-8, but it doesn't look like it.
Is there some Linux command or some Python command that will print out the actual encoding of '\305\365\367\336\ \364\357\365\ \311\347\363\357\375.mp3' ?
 
alex23

As you have seen, I've been struggling for days now to get a solution to this, and the closing parenthesis is not the problem here; Unicode is.

Oh really?
if they are unicode then i really see no trouble when trying to:
cur.execute('''SELECT url FROM files WHERE url = %s''', ( fullpath, )
but [t]his is what i'm still getting:
[Tue Jun 04 19:50:16 2013] [error] [client 46.12.95.59] data = cur.fetchone() #URL is unique, so should only be one
[Tue Jun 04 19:50:16 2013] [error] [client 46.12.95.59] ^
[Tue Jun 04 19:50:16 2013] [error] [client 46.12.95.59] SyntaxError: invalid syntax

Unicode is not producing the SyntaxError you're seeing here.
YOU of all people should not speak at all, because you haven't helped me a bit.

Yeah, advising you _not_ to do this crap on a production machine was
clearly lost on you. That's not my failing, though, it's yours.
It's funny how knowledgeable people who in fact tried to help me treat me with respect, while people like you who have never been of any help tend to just bitch all the way along.

1) For many of us, this is our _profession_ and you're asking us to
provide you with _free_ support while doing SFA to resolve your
inadequate understanding.
2) If it names itself after a troll, and it trolls like a troll,
there's a pretty good chance it's a troll.
3) Your whining and begging is treating _us_ with no respect, so I
guess we're all even.

Your whole approach is one of cargo cult programming and it's tedious.
Sysadmin, educate thyself!
 
Νικόλαος Κούρας

On Wednesday, 5 June 2013 7:47:40 AM UTC+3, user alex23 wrote:
As you have seen, I've been struggling for days now to get a solution to this, and the closing parenthesis is not the problem here; Unicode is.

Oh really?

if they are unicode then i really see no trouble when trying to:
cur.execute('''SELECT url FROM files WHERE url = %s''', ( fullpath, )
but [t]his is what i'm still getting:
[Tue Jun 04 19:50:16 2013] [error] [client 46.12.95.59] data = cur.fetchone() #URL is unique, so should only be one
[Tue Jun 04 19:50:16 2013] [error] [client 46.12.95.59] ^
[Tue Jun 04 19:50:16 2013] [error] [client 46.12.95.59] SyntaxError: invalid syntax

Unicode is not producing the SyntaxError you're seeing here.

YOU of all people should not speak at all, because you haven't helped me a bit.

Yeah, advising you _not_ to do this crap on a production machine was
clearly lost on you. That's not my failing, though, it's yours.

It's funny how knowledgeable people who in fact tried to help me treat me with respect, while people like you who have never been of any help tend to just bitch all the way along.

1) For many of us, this is our _profession_ and you're asking us to
provide you with _free_ support while doing SFA to resolve your
inadequate understanding.
2) If it names itself after a troll, and it trolls like a troll,
there's a pretty good chance it's a troll.
3) Your whining and begging is treating _us_ with no respect, so I
guess we're all even.

Your whole approach is one of cargo cult programming and it's tedious.
Sysadmin, educate thyself!

Keep bitching, professional pythoneer, you are doing great.
I'm too tired to even reply to your rumblings.
 
Michael Torrie

One of my Greek filenames is "Ευχή του Ιησού.mp3". Just a Greek
filename with spaces. Is there a problem when a filename contains both
English and Greek letters? Isn't it still a unicode string?

All I did on my CentOS was 'mv "Euxi tou Ihsou.mp3" "Ευχή του
Ιησού.mp3"'

and the displayed filename after 'ls -l' returned was:

is -rw-r--r-- 1 nikos nikos 3511233 Jun 4 14:11 \305\365\367\336\
\364\357\365\ \311\347\363\357\375.mp3

Is there no way at all to check the charset used to store it on the HDD?
It should be UTF-8, but it doesn't look like it. Is there some Linux
command or some Python command that will print out the actual
encoding of '\305\365\367\336\ \364\357\365\
\311\347\363\357\375.mp3' ?

I can see that you are starting to understand things. I can't answer
your question (don't know the answer), but you're correct about one
thing. A filename is just a sequence of bytes. We'd hope it would be
utf-8, but it could be anything. Even worse, it's not possible to tell
from a byte stream what encoding it is unless we just try one and see
what happens. Text editors, for example, have to either make a guess
(utf-8 is a good one these days), or ask, or try to read from the first
line of the file using ascii and see if there's a source code character
set command to give it an idea.
 
Steven D'Aprano

What on earth is this damn error? Michael tried to explain to me about
surrogates, but I don't think I understand it.

Encoding has been giving me trouble for years now.

[Tue Jun 04 20:19:53 2013] [error] [client 46.12.95.59] Original exception was:
[Tue Jun 04 20:19:53 2013] [error] [client 46.12.95.59] Traceback (most recent call last):
[Tue Jun 04 20:19:53 2013] [error] [client 46.12.95.59]   File "files.py", line 72, in <module>
[Tue Jun 04 20:19:53 2013] [error] [client 46.12.95.59]     cur.execute('''SELECT url FROM files WHERE url = %s''', (fullpath,) )
[Tue Jun 04 20:19:53 2013] [error] [client 46.12.95.59]   File "/usr/local/lib/python3.3/site-packages/PyMySQL3-0.5-py3.3.egg/pymysql/cursors.py", line 108, in execute
[Tue Jun 04 20:19:53 2013] [error] [client 46.12.95.59]     query = query.encode(charset)
[Tue Jun 04 20:19:53 2013] [error] [client 46.12.95.59] UnicodeEncodeError: 'utf-8' codec can't encode character '\\udcd3' in position 61: surrogates not allowed



PLEASE TELL ME WHAT TO TRY, PLEASE FOR THE LOVE OF GOD, I AM SO
FRUSTRATED NOT BEING ABLE TO DEAL WITH THIS.

Calm down. I know it is frustrating.

On a Linux system, the file system stores bytes, and only bytes. The file
system does no validation of the bytes you give, except to check that
there are no 0x00 and 0x2f bytes (ASCII '\0' and '/') in the file name.
That's all.

So, if one program thinks that it should be sending file names in, say,
UTF-16 or ISO-8859-7 encoding, it will take a string like "Νικόλαος"
and the file system will see bytes like these:

py> s = 'Νικόλαος'
py> s.encode('UTF-16be')
b'\x03\x9d\x03\xb9\x03\xba\x03\xcc\x03\xbb\x03\xb1\x03\xbf\x03\xc2'

py> s.encode('iso-8859-7')
b'\xcd\xe9\xea\xfc\xeb\xe1\xef\xf2'


Notice that the same string gives you completely different bytes. And
likewise, the same bytes will give you different strings, depending on
the encoding you use.


Now, if you try to read the file name using a program that expects UTF-8,
it will either see some sort of mojibake garbage characters, or get some
sort of error:

py> s.encode('UTF-16be').decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9d in position 1:
invalid start byte

py> s.encode('iso-8859-7').decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcd in position 0:
invalid continuation byte


Somehow, I don't know how because I didn't see it happen, you have one or
more files in that directory where the file name as bytes is invalid when
decoded as UTF-8, but your system is set to use UTF-8. So to fix this you
need to rename the file using some tool that doesn't care quite so much
about encodings. Use the bash command line to rename each file in turn
until the problem goes away.
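
For readers wondering where the '\udcd3' in the traceback comes from, here is a minimal sketch. It assumes the file name was turned into a str with the UTF-8 codec and the surrogateescape error handler, which is how Python 3 normally handles undecodable file name bytes on a UTF-8 Linux system. The sample bytes are one of the names from the `ls -b` listing posted later in this thread; the variable names are only illustrative.

```python
# File name bytes as stored on disk, copied from the ls -b listing (octal escapes).
raw_name = b"\323\352\335\370\357\365 \335\355\341\355 \341\361\351\350\354\374.exe"

# Assumption: this mirrors what Python 3 does for file names on a UTF-8 system.
# Bytes that are not valid UTF-8 are smuggled through as lone surrogates.
as_str = raw_name.decode("utf-8", "surrogateescape")
print(ascii(as_str[0]))  # the first byte, 0xd3, shows up as a lone surrogate

# A strict UTF-8 encode (like the one the database driver performs) rejects surrogates.
try:
    as_str.encode("utf-8")
    strict_encode_failed = False
except UnicodeEncodeError as exc:
    strict_encode_failed = True
    print(exc)

# The original bytes are still recoverable with the same error handler.
recovered = as_str.encode("utf-8", "surrogateescape")
print(recovered == raw_name)
```

So the surrogate is not a corrupted character. It is a placeholder for a raw byte (here 0xd3) that could not be decoded as UTF-8.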
 
Steven D'Aprano

One of my Greek filenames is "Ευχή του Ιησού.mp3". Just a Greek filename
with spaces.
Is there a problem when a filename contain both english and greek
letters? Isn't it still a unicode string?

No problem, and Unicode includes both English and Greek letters.

All i did in my CentOS was 'mv "Euxi tou Ihsou.mp3" "Ευχή του Ιησού.mp3"

That's not what you wrote earlier. You said you used FileZilla to
transfer the files from Windows 8.

and the displayed filename after 'ls -l' returned was:

is -rw-r--r-- 1 nikos nikos 3511233 Jun 4 14:11 \305\365\367\336\
\364\357\365\ \311\347\363\357\375.mp3

Is there no way at all to check the charset used to store it on the HDD? It
should be UTF-8, but it doesn't look like it. Is there some Linux
command or some Python command that will print out the actual encoding
of '\305\365\367\336\ \364\357\365\ \311\347\363\357\375.mp3' ?

You have misunderstood.

The Linux file system does not track encodings. It just stores bytes.

There is no *reliable* way to guess the encoding that a bunch of bytes
came from. If your bytes look like

0x48 0x65 0x6c 0x6c 0x6f 0x20 0x77 0x6f 0x72 0x6c 0x64 0x21

(ASCII "Hello world!") then you might *guess* that the encoding is ASCII,
or UTF-8, or Latin-1. But in general, you can't go from the bytes to the
encoding. Encodings are out-of-band information.
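
A small sketch of that point, using only the standard codecs. The first byte string is the one listed above; the second is the ISO-8859-7 encoding of a Greek name shown earlier in this thread. The variable names are illustrative.

```python
# The twelve bytes listed above.
hello = bytes([0x48, 0x65, 0x6c, 0x6c, 0x6f, 0x20, 0x77, 0x6f, 0x72, 0x6c, 0x64, 0x21])

# Pure ASCII bytes decode to the same text under all three encodings,
# so the bytes alone cannot tell you which one was used.
hello_decodings = {enc: hello.decode(enc) for enc in ("ascii", "utf-8", "latin-1")}
print(hello_decodings)

# Non-ASCII bytes decode "successfully" under more than one encoding too,
# but to different text; only one of the results is what the writer meant.
greek = b"\xcd\xe9\xea\xfc\xeb\xe1\xef\xf2"
greek_decodings = {enc: greek.decode(enc) for enc in ("iso-8859-7", "latin-1")}
print(greek_decodings)
```

Both decodes of the second byte string succeed without error, which is why a wrong guess tends to produce mojibake rather than an exception.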
 
Νικόλαος Κούρας

On Wednesday, 5 June 2013 8:40:39 AM UTC+3, user Michael Torrie wrote:
I can see that you are starting to understand things. I can't answer
your question (don't know the answer), but you're correct about one
thing. A filename is just a sequence of bytes. We'd hope it would be
utf-8, but it could be anything. Even worse, it's not possible to tell
from a byte stream what encoding it is unless we just try one and see
what happens. Text editors, for example, have to either make a guess
(utf-8 is a good one these days), or ask, or try to read from the first
line of the file using ascii and see if there's a source code character
set command to give it an idea.


Um, even if we don't actually know the encoding CentOS used to store the filename on the HDD, is there a way to tell Python to just open the bytestream as it is?

I don't know if it's possible, but I am looking for a way to skip the encoding, since we have no way of knowing what it is.

This is very weird because:


(e-mail address removed) [~]# locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
(e-mail address removed) [~]#

All I did was a simple rename from English to Greek. Since the locale is set to use UTF-8, shouldn't the result on the HDD be a UTF-8 bytestream?
 
Steven D'Aprano

Please run these commands, and show what result they give:

alias ls

printf %q\\n *.mp3

ls -b *.mp3


Do you have an answer for this yet? Better still, change the last two
commands to this:


printf %q\\n *

ls -b *

If all else fails, you could just rename the troublesome file and
hopefully the problem will go away:

mv *Ο.mp3 1.mp3
mv 1.mp3 Eυχή του Ιησού.mp3


Of course that second command is wrong, it needs quotes:

mv 1.mp3 "Eυχή του Ιησού.mp3"
 
Νικόλαος Κούρας

On Wednesday, 5 June 2013 9:03:41 AM UTC+3, user Steven D'Aprano wrote:
The Linux file system does not track encodings. It just stores bytes.
There is no *reliable* way to guess the encoding that a bunch of bytes
came from. If your bytes look like
0x48 0x65 0x6c 0x6c 0x6f 0x20 0x77 0x6f 0x72 0x6c 0x64 0x21
(ASCII "Hello world!") then you might *guess* that the encoding is ASCII,
or UTF-8, or Latin-1. But in general, you can't go from the bytes to the
encoding. Encodings are out-of-band information.


Your explanation of encoding/decoding is excellent and I am saving this, Steven!
So what I understand now is:

encoding = string -> (some charset used) -> charset bytes
decoding = bytes -> (have to know what charset has been used) -> original string

Have I understood correctly that the *key* to the whole encode/decode process is the charset that was used, or has to be used?

string = 'Ευχή του Ιησού.mp3'
above string in unknown charset bytes = '\305\365\367\336\364\357\365\ \311\347\363\357\375.mp3'

We don't know the key (charset) used, but we do know the original form of the string, so it occurred to me that if we write a Python script to decode the mojibake bytestream with all available charsets, then at some point the original string will appear back!


Won't you agree, Steven? Of course, even if that is likely to work, I don't know how to write it.


Here are the commands you asked for.
-----------------------------------------
(e-mail address removed) [~/www/data/apps]# printf %q\n\n *
100\ Mythoi\ tou\ Aiswpou.pdfnnAnekdotologio.exennBattleship.exenn$'\323\352\335\370\357\365 \335\355\341\355 \341\361\351\350\354\374.exe'nnKosmas\ o\ Aitwlos\ -\ Profiteies.pdfnnLuxor\ Evolved.exennMonopoly.exenn$'\305\365\367\336 \364\357\365 \311\347\363\357\375.mp3'nnOnline\ Movie\ Player.zipnnO\ Nomos\ tou\ Merfy\ v1-2-3.zipnnOrthodoxo\ Imerologio.exennPac-Man.exennScrabble.exennTo\ 1o\ mou\ vivlio\ gia\ to\ skaki.pdfnnVivlos\ gia\ Atheofovous.pdfnnV-Radio\ v2.4.msinnni
(e-mail address removed) [~/www/data/apps]# ls -b *
100\ Mythoi\ tou\ Aiswpou.pdf* Online\ Movie\ Player.zip*
Anekdotologio.exe* O\ Nomos\ tou\ Merfy\ v1-2-3.zip
Battleship.exe Orthodoxo\ Imerologio.exe*
\323\352\335\370\357\365\ \335\355\341\355\ \341\361\351\350\354\374.exe Pac-Man.exe
Kosmas\ o\ Aitwlos\ -\ Profiteies.pdf* Scrabble.exe
Luxor\ Evolved.exe To\ 1o\ mou\ vivlio\ gia\ to\ skaki.pdf*
Monopoly.exe Vivlos\ gia\ Atheofovous.pdf*
\305\365\367\336\ \364\357\365\ \311\347\363\357\375.mp3 V-Radio\ v2.4.msi
(e-mail address removed) [~/www/data/apps]#
 
MRAB

I can see that you are starting to understand things. I can't answer
your question (don't know the answer), but you're correct about one
thing. A filename is just a sequence of bytes. We'd hope it would be
utf-8, but it could be anything. Even worse, it's not possible to tell
from a byte stream what encoding it is unless we just try one and see
what happens. Text editors, for example, have to either make a guess
(utf-8 is a good one these days), or ask, or try to read from the first
line of the file using ascii and see if there's a source code character
set command to give it an idea.

From the previous posts I guessed that the filename might be encoded
using ISO-8859-7:
'Ευχή\\ του\\ Ιησού.mp3'

Yes, that looks the same.
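
That guess can be checked with a short sketch like the one below. It takes the bytes shown by `ls` earlier in the thread and the file name the OP says he typed, then tries a hand-picked list of candidate encodings. The helper name `matching_encodings` and the candidate list are illustrative assumptions, not something posted in the thread.

```python
# Bytes from the ls output (octal escapes), with the escaped spaces written as plain spaces.
raw = b"\305\365\367\336 \364\357\365 \311\347\363\357\375.mp3"
# The name the OP says he typed.
expected = "Ευχή του Ιησού.mp3"


def matching_encodings(raw_bytes, expected_text, candidates):
    """Return the candidate encodings under which raw_bytes decodes to expected_text."""
    matches = []
    for encoding in candidates:
        try:
            if raw_bytes.decode(encoding) == expected_text:
                matches.append(encoding)
        except UnicodeDecodeError:
            # These bytes are not even valid in this encoding.
            pass
    return matches


# Hypothetical shortlist; extend it with whatever encodings seem plausible.
candidates = ["utf-8", "iso-8859-7", "cp1253", "cp737", "latin-1"]
matches = matching_encodings(raw, expected, candidates)
print(matches)
```

More than one encoding may match, because related Greek code pages share most letter positions. That is another reason the bytes alone cannot name their encoding.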
 
Νικόλαος Κούρας

On Wednesday, 5 June 2013 7:44:14 PM UTC+3, user MRAB wrote:

'����\\ ���\\ �����.mp3'

Yes, that looks the same.


You are decoding the "unknown" filename bytestream, pretending to know that Greek ISO was used to encode it into bytes.

But if that was the case, then the original string would have to be 'Ευχή του Ιησου.mp3' and not '����\\ ���\\ �����.mp3'
 
Νικόλαος Κούρας

On Wednesday, 5 June 2013 8:56:36 AM UTC+3, user Steven D'Aprano wrote:

Somehow, I don't know how because I didn't see it happen, you have one or
more files in that directory where the file name as bytes is invalid when
decoded as UTF-8, but your system is set to use UTF-8. So to fix this you
need to rename the file using some tool that doesn't care quite so much
about encodings. Use the bash command line to rename each file in turn
until the problem goes away.

But renaming via shell access, like mv 'Euxi tou Ihsou.mp3' 'Ευχή του Ιησου.mp3', led to that unknown encoding of this bytestream: '\305\365\367\336\ \364\357\365\ \311\347\363\357\375.mp3'

But please tell me, Steven, which Linux tool do you think can encode the weird filename to proper UTF-8 'Ευχή του Ιησου.mp3'?

Or we can write a script, as I suggested, to decode the bytestream back using all sorts of available charsets, boiling down to the original Greek letters.
 
MRAB

On Wednesday, 5 June 2013 8:56:36 AM UTC+3, user Steven D'Aprano wrote:

Somehow, I don't know how because I didn't see it happen, you have one or
more files in that directory where the file name as bytes is invalid when
decoded as UTF-8, but your system is set to use UTF-8. So to fix this you
need to rename the file using some tool that doesn't care quite so much
about encodings. Use the bash command line to rename each file in turn
until the problem goes away.

But renaming via shell access, like mv 'Euxi tou Ihsou.mp3' 'Ευχή του Ιησου.mp3', led to that unknown encoding of this bytestream: '\305\365\367\336\ \364\357\365\ \311\347\363\357\375.mp3'

But please tell me, Steven, which Linux tool do you think can encode the weird filename to proper UTF-8 'Ευχή του Ιησου.mp3'?

Or we can write a script, as I suggested, to decode the bytestream back using all sorts of available charsets, boiling down to the original Greek letters.

Using Python, I think you could get the filenames using os.listdir,
passing the directory name as a bytestring so that it'll return the
names as bytestrings.

Then, for each name, you could decode from its current encoding and
encode to UTF-8 and rename the file, passing the old and new paths to
os.rename as bytestrings.
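
A minimal sketch of those two steps follows. It assumes the current encoding really is ISO-8859-7 (the guess made earlier in the thread), and it treats any name that already decodes as UTF-8 as fine. The function names `reencode_name` and `rename_to_utf8` and the `dry_run` flag are hypothetical, introduced here only for illustration.

```python
import os


def reencode_name(name, source_encoding="iso-8859-7"):
    """Return the UTF-8 bytes for a file name, or None if it is already valid UTF-8."""
    try:
        name.decode("utf-8")
        return None  # plain ASCII and proper UTF-8 names are left alone
    except UnicodeDecodeError:
        pass
    # Assumption: every non-UTF-8 name uses source_encoding. A byte that is
    # undefined in that encoding raises UnicodeDecodeError here.
    return name.decode(source_encoding).encode("utf-8")


def rename_to_utf8(directory, source_encoding="iso-8859-7", dry_run=True):
    """Rename non-UTF-8 file names in directory (given as bytes); return (old, new) pairs."""
    changed = []
    for old in os.listdir(directory):  # bytes in, bytes out
        new = reencode_name(old, source_encoding)
        if new is None:
            continue
        if not dry_run:
            os.rename(os.path.join(directory, old), os.path.join(directory, new))
        changed.append((old, new))
    return changed
```

Something like `rename_to_utf8(b"/path/to/dir")` would first list what it would change; passing `dry_run=False` performs the renames. Trying this on a copy of the directory rather than on the production machine seems prudent.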
 
