Changing filenames from Greeklish => Greek (subprocess complain)

  • Thread starter Νικόλαος Κούρας

nagia.retsina

On Friday, 7 June 2013 4:25:40 AM UTC+3, Steven D'Aprano wrote:
MRAB tells you to work with the bytes, because the filenames' bytes are
invalid decoded as UTF-8. If you fix the file names by renaming using a
terminal set to UTF-8, then they will be valid and you can forget about
working with bytes.

Yes, but 'putty' seems to always forget when I tell it to use UTF-8 for display, and it always picks up Win8's default charset; it doesn't have a save-options dialog. I can't always remember to switch to the UTF-8 charset, or to rename so many Greek filenames from the terminal all the time.
Working with bytes is only for when the file names are turned to garbage.
Your file names (some of them) are turned to garbage. Fix them, and then
use file names as strings.

Can't. '~/data/apps/' is filled every day with more and more files, which are uploaded via the FileZilla client, which I think behaves pretty much like putty, uploading filenames as greek-iso bytes.

So that garbage will happen every day due to the 'Putty' & 'FileZilla' clients.

So files.py, before doing its work, must automatically convert the filenames from Greek bytes to UTF-8 bytes.

Here is what i have up until now.

=================================================
# Collect filenames of the path dir as bytes
filename_bytes = os.listdir( b'/home/nikos/public_html/data/apps/' )

# Iterate over all filenames in the path dir
for filename in filenames_bytes:
    # Compute 'path/to/filename' in bytes
    filepath_bytes = b'/home/nikos/public_html/data/apps/' + b'filename'
    try:
        filepath = filepath_bytes.decode('utf-8')
    except UnicodeDecodeError:
        try:
            filepath = filepath_bytes.decode('iso-8859-7')

            # Rename filename from greek bytes => utf-8 bytes
            os.rename( filepath_bytes filepath.encode('utf-8') )
        except UnicodeDecodeError:
            print "I give up! This filename is unreadable!"
=========================================

This is the best i can come up with, but after:

(e-mail address removed) [~/www/cgi-bin]# python files.py
  File "files.py", line 75
    os.rename( filepath_bytes filepath.encode('utf-8') )
                                     ^
SyntaxError: invalid syntax
(e-mail address removed) [~/www/cgi-bin]#
============================================


I am seeing the caret pointing at filepath, but I can't follow what it is trying to tell me. No parenthesis is missing or added this time due to speed and tiredness.

This rename statement tries to convert the Greek-byte filepath to a UTF-8-byte filepath.

I can't see why this is wrong, though.
 
Chris Angelico

Yes, but 'putty' seems to always forget when I tell it to use UTF-8 for display, and it always picks up Win8's default charset; it doesn't have a save-options dialog. I can't always remember to switch to the UTF-8 charset, or to rename so many Greek filenames from the terminal all the time.


I use PuTTY too (though that'll change when I next upgrade Traal, as
I'll no longer have any Windows clients), and it's set to UTF-8 in the
Window|Translation page. Far as I know, those settings are all saved
into the Saved Sessions settings, back on the Session page.

ChrisA
 
Νικόλαος Κούρας

| On Thursday, 6 June 2013 3:44:52 PM UTC+3, Steven D'Aprano wrote:
| > py> s = '999-Eυχή-του-Ιησού'
| > py> bytes_as_utf8 = s.encode('utf-8')
| > py> t = bytes_as_utf8.decode('iso-8859-7', errors='replace')
| > py> print(t)
| > 999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ
|
| Does errors='replace' mean don't break in case of an error?

Yes. The result will be correct for correct iso-8859-7 and slightly mangled
for something that would not decode smoothly.
How can it be correct? We have encoded our string in utf-8 and then we
tried to decode it as greek-iso? How can this possibly be correct?
| You took the unicode string 's' and utf-8 bytestringed it.
| Then how is it possible to ask for the utf8-bytestring to be decoded
| back to a unicode string using a different charset than the
| one used for encoding, and this actually printed the filename in
| greek-iso?

It is easily possible, as shown above. Does it make sense? Normally
not, but Steven is demonstrating how your "mv" exercises have
behaved: a rename using utf-8, then a _display_ using iso-8859-7.
Same as above, I don't understand it at all, since different
charsets (encodings) are used in the encode/decode process.
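
The sequence being discussed (encode with one charset, decode with another) can be reproduced in a few lines. This is only a sketch: the sample name is the one from Steven's session, and the variable names are made up for illustration. The point it tries to show is that decoding with the wrong charset does not damage anything on disk; it only yields different characters on screen, while the underlying bytes stay intact and still decode correctly with the charset they were written in.

```python
# Sketch: the same bytes viewed through two different charsets.
name = '999-Eυχή-του-Ιησού'          # sample name from the session above

utf8_bytes = name.encode('utf-8')     # what a rename done in UTF-8 stores

# A terminal set to ISO-8859-7 interprets those same bytes as Greek-ISO.
# errors='replace' substitutes U+FFFD for bytes that have no meaning in
# ISO-8859-7 instead of raising UnicodeDecodeError.
shown_as_greek_iso = utf8_bytes.decode('iso-8859-7', errors='replace')

# The bytes themselves were never changed, so decoding them with the
# charset they were written in recovers the original name.
shown_as_utf8 = utf8_bytes.decode('utf-8')

print(shown_as_greek_iso == name)   # wrong charset: garbled display
print(shown_as_utf8 == name)        # right charset: correct display
```

So the "decode" in the session is not a claim that the result is right; it imitates what a terminal with the wrong charset shows.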
|
| a) WHAT does it mean when a linux system is set to use utf-8?

It means the locale settings _for the current process_ are set for
UTF-8. The "locale" command will show you the current state.
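
From inside Python, roughly the same information can be inspected as sketched below. The two variable names are illustrative only. Note that on Linux the filesystem itself just stores bytes; the encodings reported here describe how the current process interprets those bytes, not a property of the disk.

```python
import locale
import sys

# Encoding implied by the current process's locale settings
# (on Linux, derived from LANG / LC_ALL / LC_CTYPE).
locale_encoding = locale.getpreferredencoding(False)

# Encoding Python uses to convert filenames between bytes and str.
fs_encoding = sys.getfilesystemencoding()

print('locale encoding:    ', locale_encoding)
print('filesystem encoding:', fs_encoding)
```

Running this once from the SSH shell and once from the CGI script may give different answers, since each process has its own environment.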
Does that mean that, when a linux application needs to save a filename to
the linux filesystem, the app checks the filesystem's 'locale', so as to
encode the filename using the utf-8 charset?
And likewise, when a linux application wants to decode a filename, does it
also check the filesystem's 'locale' setting to know what charset it must
use to decode the filename correctly back to the original string?

So the locale is used by the filesystem itself and by linux apps to know how to
read (decode) and write (encode) filenames from/into the system's hdd?
| c) WHAT happens when the two of them try to work together?

If everything matches, it is all good. If the locales do not match,
the mismatch will result in an undesired bytes<->characters
encode/decode step somewhere, and something will display incorrectly
or be entered as input incorrectly.

Can't quite grasp the idea:

local end: Win8, locale = greek-iso
remote end: CentOS 6.4, locale = utf-8

FileZilla by default uses "do not know what charset" to upload filenames
Putty by default uses greek-iso to display filenames


WHAT can someone expect to happen when all of the above work together?
A mess of course, but I want to hear in detail each step of the mess as it
emerges.
 
Νικόλαος Κούρας

On Friday, 7 June 2013 9:46:53 AM UTC+3, Chris Angelico wrote:
I use PuTTY too (though that'll change when I next upgrade Traal, as
I'll no longer have any Windows clients), and it's set to UTF-8 in the
Window|Translation page. Far as I know, those settings are all saved
into the Saved Sessions settings, back on the Session page.

ChrisA

Session settings, afaik, are for putty to remember hosts to connect to, not terminal options. I might be wrong though. No matter how many times I change its options, the next time I run it, it always defaults back.

I'll google Traal right now.
You should also take a look at the 'Secure Shell' extension for Chrome that I just found.

Seems a great plugin for Chrome. You'll definitely like it, I did!
 
Lele Gaifax

File "files.py", line 75
os.rename( filepath_bytes filepath.encode('utf-8') )
^
SyntaxError: invalid syntax

I am seeing the caret pointing at filepath, but I can't follow what it
is trying to tell me.

As already explained, often a SyntaxError is introduced by *preceding*
"text", so you must look at your code with a "wider eye".
This rename statement tries to convert the Greek-byte filepath to a
UTF-8-byte filepath.

Yes: and that usually implies that the *function* accepts (at least) *two*
arguments, specifically the source and the target names, right? How many
arguments are you actually giving to the os.rename() function above?
I can't see why this is wrong, though.

Try harder. I won't be giving you further pointers to your
SyntaxErrors; you *must* learn how to detect and fix those by yourself.

ciao, lele.
 
Chris Angelico

I'll google Traal right now.

The one thing you're actually willing to go research, and it's
actually something that won't help you. Traal is the name of my
personal laptop. Spend your Googletrons on something else. :)

ChrisA
 
Νικόλαος Κούρας

On Friday, 7 June 2013 10:09:29 AM UTC+3, Lele Gaifax wrote:
As already explained, often a SyntaxError is introduced by *preceding*
"text", so you must look at your code with a "wider eye".

That's what I hate about error reporting. You have a syntax error someplace and the error report points you to another line, so you have to check the whole code again.
Well, I just did; I see no syntactical errors.
Yes: and that usually implies that the *function* accepts (at least) *two*
arguments, specifically the source and the target names, right? How many
arguments are you actually giving to the os.rename() function above?

I'm giving it two.
os.rename( filepath_bytes filepath.encode('utf-8') )

1st = filepath_bytes
2nd = filepath.encode('utf-8')

Source and Target respectively.
 
Michael Weylandt

On Friday, 7 June 2013 10:09:29 AM UTC+3, Lele Gaifax wrote:


That's what I hate about error reporting. You have a syntax error someplace and the error report points you to another line, so you have to check the whole code again.
Well, I just did; I see no syntactical errors.


i'm giving it two.
os.rename( filepath_bytes filepath.encode('utf-8')

Missing comma, which is, after all, just a matter of syntax so it can't matter, right?
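
One way to compare the two spellings without touching any files is to hand them to the built-in compile(), which parses source text without running it. The sketch below is illustrative only; the helper name parses() is made up, and the exact wording and column of the error message vary between Python versions.

```python
# Sketch: check whether each spelling of the call is valid Python syntax.
broken = "os.rename( filepath_bytes filepath.encode('utf-8') )"
fixed = "os.rename( filepath_bytes, filepath.encode('utf-8') )"


def parses(source):
    """Return True if source is syntactically valid Python."""
    try:
        # compile() only parses; the names need not exist.
        compile(source, '<example>', 'exec')
    except SyntaxError as exc:
        print('SyntaxError at column', exc.offset, ':', exc.msg)
        return False
    return True


broken_ok = parses(broken)   # two expressions with nothing between them
fixed_ok = parses(fixed)     # arguments separated by a comma
print(broken_ok, fixed_ok)
```

Unlike the shell's 'mv source target', a Python call separates its arguments with commas; whitespace alone does not separate them.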
 
Νικόλαος Κούρας

os.rename( filepath_bytes filepath.encode('utf-8')
Missing comma, which is, after all, just a matter of syntax so it can't matter, right?
I doubted that os.rename's arguments must be comma separated.
But after reading the docs:

os.rename(src, dst) <http://docs.python.org/2/library/os.html#os.rename>

Rename the file or directory src to dst. If dst is a
directory, OSError
<http://docs.python.org/2/library/exceptions.html#exceptions.OSError> will
be raised. On Unix, if dst exists and is a file, it will be replaced
silently if the user has permission. The operation may fail on some
Unix flavors if src and dst are on different filesystems. If
successful, the renaming will be an atomic operation (this is a
POSIX requirement). On Windows, if dst already exists, OSError
<http://docs.python.org/2/library/exceptions.html#exceptions.OSError> will
be raised even if it is a file; there may be no way to implement an
atomic rename when dst names an existing file.

Availability: Unix, Windows.

Indeed it has to be:

os.rename( filepath_bytes, filepath.encode('utf-8') )

'mv source target' didn't require commas, so I thought it was safe to assume that os.rename did not either.


I am happy to announce that after correcting many idiotic errors like commas, missing colons and undeclared variables, this surrogate error is the last one I get.
I still don't understand what surrogate means. In English it means replacement.
Here is the code:


#========================================================
# Collect filenames of the path dir as bytes
filename_bytes = os.listdir( b'/home/nikos/public_html/data/apps/' )

# Iterate over all filenames in the path dir
for filename in filename_bytes:
    # Compute 'path/to/filename' in bytes
    filepath_bytes = b'/home/nikos/public_html/data/apps/' + b'filename'
    try:
        filepath = filepath_bytes.decode('utf-8')
    except UnicodeDecodeError:
        try:
            filepath = filepath_bytes.decode('iso-8859-7')

            # Rename current filename from greek bytes => utf-8 bytes
            os.rename( filepath_bytes, filepath.encode('utf-8') )
        except UnicodeDecodeError:
            print( '''I give up! This filename is unreadable! ''')


#========================================================
# Get filenames of the apps directory as unicode
filenames = os.listdir( '/home/nikos/public_html/data/apps/' )

# Load'em
for filename in filenames:
    try:
        # Check the presence of a file against the database and insert if it doesn't exist
        cur.execute('''SELECT url FROM files WHERE url = %s''', (filename,) )
        data = cur.fetchone()        #filename is unique, so should only be one

        if not data:
            # First time for file; primary key is automatic, hit is defaulted
            cur.execute('''INSERT INTO files (url, host, lastvisit) VALUES (%s, %s, %s)''', (filename, host, lastvisit) )
    except pymysql.ProgrammingError as e:
        print( repr(e) )


#========================================================
filenames = os.listdir( '/home/nikos/public_html/data/apps/' )
filenames = ()

# Build a set of 'path/to/filename' based on the objects of path dir
for filename in filenames:
    filenames.add( filename )

# Delete spurious
cur.execute('''SELECT url FROM files''')
data = cur.fetchall()

# Check database's filenames against path's filenames
for filename in data:
    if filename not in filenames:
        cur.execute('''DELETE FROM files WHERE url = %s''', (filename,) )



=================================

[Fri Jun 07 11:08:17 2013] [error] [client 79.103.41.173] File "/home/nikos/public_html/cgi-bin/files.py", line 88, in <module>
[Fri Jun 07 11:08:17 2013] [error] [client 79.103.41.173] cur.execute('''SELECT url FROM files WHERE url = %s''', filename )
[Fri Jun 07 11:08:17 2013] [error] [client 79.103.41.173] File "/usr/local/lib/python3.3/site-packages/PyMySQL3-0.5-py3.3.egg/pymysql/cursors.py", line 108, in execute
[Fri Jun 07 11:08:17 2013] [error] [client 79.103.41.173] query = query.encode(charset)
[Fri Jun 07 11:08:17 2013] [error] [client 79.103.41.173] UnicodeEncodeError: 'utf-8' codec can't encode character '\\udcce' in position 35: surrogates not allowed
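
The '\udcce' in that message is a lone surrogate code point. When os.listdir() is given a str path, Python 3 decodes each name with the filesystem encoding and the 'surrogateescape' error handler (PEP 383), so a byte such as 0xCE that is not valid UTF-8 comes back as U+DCCE instead of raising an error. A strict UTF-8 encode, which is what the database driver attempts, then refuses it. The sketch below reproduces this with a made-up Greek name; it assumes the name was uploaded as ISO-8859-7 bytes, as discussed earlier in the thread.

```python
# Sketch: how a Greek-ISO filename turns into lone surrogates.
raw_name = 'Ευχή'.encode('iso-8859-7')      # hypothetical name, as uploaded

# Roughly what os.listdir('...') does with a str path under a UTF-8 locale:
# undecodable bytes are carried through as U+DC80..U+DCFF.
as_str = raw_name.decode('utf-8', errors='surrogateescape')
print(ascii(as_str))

# A strict encode rejects those code points.
try:
    as_str.encode('utf-8')
    strict_encode_failed = False
except UnicodeEncodeError as exc:
    strict_encode_failed = True
    print(exc.reason)

# The original bytes are still recoverable, and so is the real name.
recovered_bytes = as_str.encode('utf-8', errors='surrogateescape')
real_name = recovered_bytes.decode('iso-8859-7')
print(ascii(real_name))
```

In other words, the error suggests that at least one file in the directory still has a non-UTF-8 name when the second os.listdir() runs.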
 
Steven D'Aprano

Can't. '~/data/apps/' is filled every day with more and more files, which
are uploaded via the FileZilla client, which I think behaves pretty much
like putty, uploading filenames as greek-iso bytes.


Well, that is certainly a nuisance. Try something like this:

# Untested.

dir = b'/home/nikos/public_html/data/apps/'  # This must be bytes.
files = os.listdir(dir)
for name in files:
    pathname_as_bytes = dir + name
    for encoding in ('utf-8', 'iso-8859-7', 'latin-1'):
        try:
            pathname = pathname_as_bytes.decode(encoding)
        except UnicodeDecodeError:
            continue
        # Rename to something valid in UTF-8.
        if encoding != 'utf-8':
            os.rename(pathname_as_bytes, pathname.encode('utf-8'))
        assert os.path.exists(pathname)
        break
    else:
        # This only runs if we never reached the break.
        raise ValueError('unable to clean filename %r'%pathname_as_bytes)
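
A variation on the same idea, split so that the decision can be checked without touching the disk, might look like the sketch below. The function names utf8_name() and normalize_directory() are made up for illustration, and two choices differ from the code above on purpose: there is no 'latin-1' fallback (so names that fit neither charset get reported instead of renamed), and an existing target is never overwritten. A Greek-ISO name whose bytes happen to also be valid UTF-8 would be left alone, so treat this as a starting point rather than a finished tool.

```python
import os


def utf8_name(name_bytes, fallbacks=('iso-8859-7',)):
    """Return name_bytes re-encoded as UTF-8, or None if no charset fits."""
    try:
        name_bytes.decode('utf-8')
        return name_bytes                      # already valid UTF-8
    except UnicodeDecodeError:
        pass
    for encoding in fallbacks:
        try:
            return name_bytes.decode(encoding).encode('utf-8')
        except UnicodeDecodeError:
            continue
    return None


def normalize_directory(directory):
    """Rename every non-UTF-8 entry in directory (a bytes path)."""
    for name in os.listdir(directory):         # bytes in, bytes out
        new_name = utf8_name(name)
        if new_name is None:
            print('cannot decode', ascii(name))
        elif new_name != name:
            src = os.path.join(directory, name)
            dst = os.path.join(directory, new_name)
            if os.path.exists(dst):
                print('skipping, target exists:', ascii(new_name))
            else:
                os.rename(src, dst)
```

It would be called with a bytes path, for example normalize_directory(b'/home/nikos/public_html/data/apps/'), before any code lists the directory as str.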
 
Roel Schroeven

Νικόλαος Κούρας wrote:
Session settings, afaik, are for putty to remember hosts to connect to,
not terminal options. I might be wrong though. No matter how many times
I change its options, the next time I run it, it always defaults back.

Putty can most definitely remember its settings:
- Start PuTTY; you should get the "PuTTY Configuration" window
- Select a session in the list of sessions
- Click Load
- Change any setting you want to change
- Go back to Session in the Category treeview
- Click Save

HTH

--
"People almost invariably arrive at their beliefs not on the basis of
proof but on the basis of what they find attractive."
-- Pascal Blaise

(e-mail address removed)
 
