Changing filenames from Greeklish => Greek (subprocess complain)

  • Thread starter Νικόλαος Κούρας

MRAB

First of all, thank you for helping me, MRAB.
After making some alterations to your code I have this:

----------------------------------------
import os
import sys

# Give the path as a bytestring so that we'll get the filenames as bytestrings
path = b"/home/nikos/public_html/data/apps/"

# Setting TESTING to True will make it print out what renamings it will do, but not actually do them
TESTING = True

# Walk through the files.
for root, dirs, files in os.walk( path ):
    for filename in files:
        try:
            # Is this name encoded in UTF-8?
            filename.decode('utf-8')
        except UnicodeDecodeError:
            # Decoding from UTF-8 failed, which means that the name is not valid UTF-8
            # It appears that the filenames are encoded in ISO-8859-7, so decode from that and re-encode to UTF-8
            new_filename = filename.decode('iso-8859-7').encode('utf-8')

            old_path = os.path.join(root, filename)
            new_path = os.path.join(root, new_filename)
            if TESTING:
                print( '''<br>Will rename {!r} ---> {!r}<br><br>'''.format( old_path, new_path ) )
            else:
                print( '''<br>Renaming {!r} ---> {!r}<br><br>'''.format( old_path, new_path ) )
                os.rename( old_path, new_path )
sys.exit(0)
-------------------------

and the output can be seen here: http://superhost.gr/cgi-bin/files.py

We are in test mode, so I don't know what the encodings will be when the renaming actually takes place.

Shall I switch off test mode and try it for real?
The first one is '/home/nikos/public_html/data/apps/Ευχή του Ιησού.mp3'.

The second one is '/home/nikos/public_html/data/apps/Σκέψου έναν αριθμό.exe'.

These names are currently encoded in ISO-8859-7, but will be encoded in
UTF-8 if you turn off test mode.

If you're happy for that change to happen, then go ahead.
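
To make the effect of the renaming concrete, here is a minimal sketch (not part of the script above) of the same check applied to a single name. The helper name `to_utf8_name` and the sample filename are made up for illustration, and the sketch assumes, as the script does, that any name that is not valid UTF-8 is ISO-8859-7.

```python
def to_utf8_name(raw_name: bytes) -> bytes:
    """Return raw_name re-encoded as UTF-8 if it is not already valid UTF-8."""
    try:
        raw_name.decode('utf-8')
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is ISO-8859-7 (Greek).
        return raw_name.decode('iso-8859-7').encode('utf-8')
    # Already valid UTF-8 (plain ASCII included): leave it alone.
    return raw_name


greek = 'Ευχή του Ιησού.mp3'
legacy = greek.encode('iso-8859-7')      # how the name is stored before the fix
fixed = to_utf8_name(legacy)             # how it would be stored afterwards

print(fixed == greek.encode('utf-8'))
print(to_utf8_name(fixed) == fixed)      # running it a second time changes nothing
```

Because names that already decode as UTF-8 are returned untouched, running the fix twice should be harmless.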
 
Νικόλαος Κούρας

On Thursday, 6 June 2013 3:50:52 PM UTC+3, MRAB wrote:
If you're happy for that change to happen, then go ahead.

I have made some modifications to the code you provided, but I think something that doesn't occur to me needs fixing.


For example, I switched:

# Give the path as a bytestring so that we'll get the filenames as bytestrings
path = b"/home/nikos/public_html/data/apps/"

# Walk through the files.
for root, dirs, files in os.walk( path ):
    for filename in files:

to:

# Give the path as a bytestring so that we'll get the filenames as bytestrings
path = os.listdir( b'/home/nikos/public_html/data/apps/' )

# iterate over all filenames in the apps directory
for fullpath in path
    # Grabbing just the filename from path
    filename = fullpath.replace( '/home/nikos/public_html/data/apps/', '' )


I don't know if it has the same effect.
Here is the whole snippet:


=============================================
# Give the path as a bytestring so that we'll get the filenames as bytestrings
path = os.listdir( b'/home/nikos/public_html/data/apps/' )

# iterate over all filenames in the apps directory
for fullpath in path
    # Grabbing just the filename from path
    filename = fullpath.replace( '/home/nikos/public_html/data/apps/', '' )
    try:
        # Is this name encoded in utf-8?
        filename.decode('utf-8')
    except UnicodeDecodeError:
        # Decoding from UTF-8 failed, which means that the name is not valid utf-8

        # It appears that this filename is encoded in greek-iso, so decode from that and re-encode to utf-8
        new_filename = filename.decode('iso-8859-7').encode('utf-8')

        # rename filename form greek bytestream-> utf-8 bytestream
        old_path = os.path.join(root, filename)
        new_path = os.path.join(root, new_filename)
        os.rename( old_path, new_path )


#============================================================
# Compute a set of current fullpaths
path = os.listdir( '/home/nikos/public_html/data/apps/' )

# Load'em
for fullpath in path:
    try:
        # Check the presence of a file against the database and insert if it doesn't exist
        cur.execute('''SELECT url FROM files WHERE url = %s''', (fullpath,) )
        data = cur.fetchone() #URL is unique, so should only be one

        if not data:
            # First time for file; primary key is automatic, hit is defaulted
            cur.execute('''INSERT INTO files (url, host, lastvisit) VALUES (%s, %s, %s)''', (fullpath, host, lastvisit) )
    except pymysql.ProgrammingError as e:
        print( repr(e) )
==================================================================

The error is:
[Thu Jun 06 21:10:23 2013] [error] [client 79.103.41.173] File "files.py", line 64
[Thu Jun 06 21:10:23 2013] [error] [client 79.103.41.173] for fullpath in path
[Thu Jun 06 21:10:23 2013] [error] [client 79.103.41.173] ^
[Thu Jun 06 21:10:23 2013] [error] [client 79.103.41.173] SyntaxError: invalid syntax


Doesn't os.listdir( ... ) return a list of all filenames?

But then again, when the replacing takes place to shorten the full path to just the filename, I think it doesn't work because os.listdir was called with a bytestring and not with a string...

What am I doing wrong?
 
Νικόλαος Κούρας

On Thursday, 6 June 2013 3:44:52 PM UTC+3, Steven D'Aprano wrote:
py> s = '999-Eυχή-του-Ιησού'
py> bytes_as_utf8 = s.encode('utf-8')
py> t = bytes_as_utf8.decode('iso-8859-7', errors='replace')
py> print(t)
999-EΟΟΞ�-ΟΞÎΟ-ΞΞ·ΟΞÎΟ

Does errors='replace' mean "don't break in case of an error"?
You took the Unicode string 's' and encoded it to a UTF-8 bytestring.
Then how is it possible to ask for that UTF-8 bytestring to be decoded back to a Unicode string using a different charset from the one used for encoding, and how did this actually print the filename the way it appears in greek-iso?

So that demonstrates part of your problem: even though your Linux system
is using UTF-8, your terminal is probably set to ISO-8859-7. The
interaction between these will lead to strange and disturbing Unicode
errors.

Yes, I feel this is the problem too.
It's a wonder to me why PuTTY used greek-iso by default instead of UTF-8!

Please explain this to me, because now that I am beginning to understand these encode/decode things I am beginning to like them!

a) WHAT does it mean when a linux system is set to use utf-8?
b) WHAT does it mean when a terminal client is set to use utf-8?
c) WHAT happens when the two of them try to work together?

So I believe I understand how your file name has become garbage. To fix
it, make sure that your terminal is set to use UTF-8, and then rename it.
Do the same with every file in the directory until the problem goes away.

(e-mail address removed) [~/www/cgi-bin]# echo $LS_OPTIONS
--color=tty -F -a -b -T 0

Is this okay? Is the '-b' option for displaying a filename in binary mode?

Indeed, I have changed PuTTY to use 'utf-8' and 'ls -l' now displays the file in correct Greek letters. When I switch PuTTY's encoding back to 'greek-iso', the *displayed* filenames show up as mojibake.

What is being displayed and what is actually stored as bytes are two different things, right?

Ευχη του Ιησου.mp3
EΟΟΞ�-ΟΞÎΟ-ΞΞ·ΟΞÎΟ

Is this just the way the filename is displayed in the terminal, depending on the encoding the terminal uses? And no matter *how* it is displayed, those two are the same file?
 
Lele Gaifax

Νικόλαος Κούρας said:
...
# Load'em
for fullpath in path:
try:
...

The error is:
[Thu Jun 06 21:10:23 2013] [error] [client 79.103.41.173] File "files.py", line 64
[Thu Jun 06 21:10:23 2013] [error] [client 79.103.41.173] for fullpath in path
[Thu Jun 06 21:10:23 2013] [error] [client 79.103.41.173] ^
[Thu Jun 06 21:10:23 2013] [error] [client 79.103.41.173] SyntaxError: invalid syntax


Doesn't os.listdir( ...) returns a list with all filenames?

You should *read* and *understand* the error message!

This is the same kind of confusion you had when I pointed you at the
missing closing bracket some days ago, when you missed the meaning of the
error and assumed its source was related to something completely
different...

In the specific case, your line 64 is missing an ending colon (":").
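
As a small illustration of how much the interpreter already reports, the following sketch compiles a two-line snippet with the same mistake and prints the fields carried by the `SyntaxError`. The snippet text and the `files.py` label are just stand-ins for the real script.

```python
source = "for fullpath in path\n    print(fullpath)\n"

caught = None
try:
    # compile() only parses the text; nothing is executed.
    compile(source, "files.py", "exec")
except SyntaxError as err:
    caught = err
    # lineno is the offending line, offset the 1-based column the caret points at.
    print("line:", err.lineno, "column:", err.offset)
    print("message:", err.msg)
    if err.text is not None:
        print(err.text.rstrip("\n"))
        if err.offset:
            print(" " * (err.offset - 1) + "^")
```

The caret usually lands where the parser gave up, which for a missing colon is the end of the `for` line. Recent Python releases (3.10 and later, as far as I know) word the message as "expected ':'" instead of the generic "invalid syntax".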

ciao, lele.
 
Νικόλαος Κούρας

On Thursday, 6 June 2013 10:03:02 PM UTC+3, Lele Gaifax wrote:
# Load'em
for fullpath in path:

The error is:
[Thu Jun 06 21:10:23 2013] [error] [client 79.103.41.173] File "files.py", line 64
[Thu Jun 06 21:10:23 2013] [error] [client 79.103.41.173] for fullpath in path
[Thu Jun 06 21:10:23 2013] [error] [client 79.103.41.173] ^
[Thu Jun 06 21:10:23 2013] [error] [client 79.103.41.173] SyntaxError: invalid syntax
Doesn't os.listdir( ...) returns a list with all filenames?

You should *read* and *understand* the error message!

This is the same kind of confusion you had when I pointed you at the
missing closing bracket some days ago, when you missed the meaning of the
error and assumed its source was related to something completely
different...

In the specific case, your line 64 is missing an ending colon (":").

ciao, lele.

Oh my God, it was that simple, and I was banging my head trying to see where I had made a syntax error. I missed the colon! Well, the error should have said "Hey man, you missed a colon!"; that would help a lot.

Now the error, after fixing that, has changed to:

[Thu Jun 06 22:13:49 2013] [error] [client 79.103.41.173] filename = fullpath.replace( '/home/nikos/public_html/data/apps/', '' )
[Thu Jun 06 22:13:49 2013] [error] [client 79.103.41.173] TypeError: expected bytes, bytearray or buffer compatible object


but that's because of these lines:

# Give the path as a bytestring so that we'll get the filenames as bytestrings
path = os.listdir( b'/home/nikos/public_html/data/apps/' )

# iterate over all filenames in the apps directory
for fullpath in path:
    # Grabbing just the filename from path
    filename = fullpath.replace( '/home/nikos/public_html/data/apps/', '' )

I can remove the bytestring argument from os.listdir, but then this will not work.
MRAB has told me that I need to handle those paths and filenames as bytestrings and not as Unicode strings.
 
MRAB

On 06/06/2013 19:13, Νικόλαος Κούρας wrote:

On Thursday, 6 June 2013 3:50:52 PM UTC+3, MRAB wrote:
> If you're happy for that change to happen, then go ahead.

I have made some modifications to the code you provided me but i think something that doesnt accur to me needs fixing.

for example i switched:

# Give the path as a bytestring so that we'll get the filenames as bytestrings
path = b"/home/nikos/public_html/data/apps/"

# Walk through the files.
for root, dirs, files in os.walk( path ):
    for filename in files:

to:

# Give the path as a bytestring so that we'll get the filenames as bytestrings
path = os.listdir( b'/home/nikos/public_html/data/apps/' )

os.listdir returns a list of the names of the objects in the given directory.



# iterate over all filenames in the apps directory

Exactly, all the names.



for fullpath in path
# Grabbing just the filename from path

The name is a bytestring. Note, name, NOT full path.

The following line will fail because the name is a bytestring, and you can't mix bytestrings with Unicode strings:


filename = fullpath.replace( '/home/nikos/public_html/data/apps/', '' )
           ^ bytestring      ^ Unicode string                      ^ Unicode string


I dont know if it has the same effect:
Here is the the whole snippet:

=============================================
# Give the path as a bytestring so that we'll get the filenames as bytestrings
path = os.listdir( b'/home/nikos/public_html/data/apps/' )

# iterate over all filenames in the apps directory
for fullpath in path
    # Grabbing just the filename from path
    filename = fullpath.replace( '/home/nikos/public_html/data/apps/', '' )
    try:
        # Is this name encoded in utf-8?
        filename.decode('utf-8')
    except UnicodeDecodeError:
        # Decoding from UTF-8 failed, which means that the name is not valid utf-8

        # It appears that this filename is encoded in greek-iso, so decode from that and re-encode to utf-8
        new_filename = filename.decode('iso-8859-7').encode('utf-8')

        # rename filename form greek bytestream-> utf-8 bytestream
        old_path = os.path.join(root, filename)
        new_path = os.path.join(root, new_filename)
        os.rename( old_path, new_path )

#============================================================
# Compute a set of current fullpaths
path = os.listdir( '/home/nikos/public_html/data/apps/' )

# Load'em
for fullpath in path:
    try:
        # Check the presence of a file against the database and insert if it doesn't exist
        cur.execute('''SELECT url FROM files WHERE url = %s''', (fullpath,) )
        data = cur.fetchone() #URL is unique, so should only be one

        if not data:
            # First time for file; primary key is automatic, hit is defaulted
            cur.execute('''INSERT INTO files (url, host, lastvisit) VALUES (%s, %s, %s)''', (fullpath, host, lastvisit) )
    except pymysql.ProgrammingError as e:
        print( repr(e) )
==================================================================

The error is:
[Thu Jun 06 21:10:23 2013] [error] [client 79.103.41.173] File "files.py", line 64
[Thu Jun 06 21:10:23 2013] [error] [client 79.103.41.173] for fullpath in path
[Thu Jun 06 21:10:23 2013] [error] [client 79.103.41.173] ^
[Thu Jun 06 21:10:23 2013] [error] [client 79.103.41.173] SyntaxError: invalid syntax

Doesn't os.listdir( ...) returns a list with all filenames?

But then again when replacing take place to shert the fullpath to just the filane i think it doesn't not work because the os.listdir was opened as bytestring and not as a string....

What am i doing wrong?

You're changing things without checking what they do!
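
Both points can be checked in a throwaway directory. The sketch below is only an illustration: the temporary directory and the `example.txt` name are stand-ins, not the real apps directory.

```python
import os
import tempfile

mixed_error = None
with tempfile.TemporaryDirectory() as tmp:
    # Create one empty file to list.
    open(os.path.join(tmp, 'example.txt'), 'w').close()

    # A bytes argument gives bytes results, and only bare names, never full paths.
    names = os.listdir(os.fsencode(tmp))
    print(names)

    name = names[0]
    try:
        # Mixing a bytes object with str arguments is rejected.
        name.replace('.txt', '')
    except TypeError as err:
        mixed_error = err
        print('TypeError:', err)

    # With bytes arguments the same call works.
    stripped = name.replace(b'.txt', b'')

    # A full path, when one is needed, has to be built explicitly.
    full_path = os.path.join(os.fsencode(tmp), name)
    print(stripped, full_path)
```

Since `os.listdir` already returns bare names, there is nothing to strip off in the first place.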
 
Νικόλαος Κούρας

On Thursday, 6 June 2013 10:42:25 PM UTC+3, MRAB wrote:
If you're happy for that change to happen, then go ahead.

I have made some modifications to the code you provided me but i think something that doesnt accur to me needs fixing.


for example i switched:

# Give the path as a bytestring so that we'll get the filenames as bytestrings
path = b"/home/nikos/public_html/data/apps/"

# Walk through the files.
for root, dirs, files in os.walk( path ):
for filename in files:

to:

# Give the path as a bytestring so that we'll get the filenames as bytestrings
path = os.listdir( b'/home/nikos/public_html/data/apps/' )


os.listdir returns a list of the names of the objects in the given
directory.




# iterate over all filenames in the apps directory


Exactly, all the names.




for fullpath in path
# Grabbing just the filename from path


The name is a bytestring. Note, name, NOT full path.



The following line will fail because the name is a bytestring,
and you can't mix bytestrings with Unicode strings:


filename = fullpath.replace( '/home/nikos/public_html/data/apps/', '' )
           ^ bytestring      ^ Unicode string                      ^ Unicode string


I dont know if it has the same effect:
Here is the the whole snippet:


=============================================
# Give the path as a bytestring so that we'll get the filenames as bytestrings
path = os.listdir( b'/home/nikos/public_html/data/apps/' )

# iterate over all filenames in the apps directory
for fullpath in path
    # Grabbing just the filename from path
    filename = fullpath.replace( '/home/nikos/public_html/data/apps/', '' )
    try:
        # Is this name encoded in utf-8?
        filename.decode('utf-8')
    except UnicodeDecodeError:
        # Decoding from UTF-8 failed, which means that the name is not valid utf-8

        # It appears that this filename is encoded in greek-iso, so decode from that and re-encode to utf-8
        new_filename = filename.decode('iso-8859-7').encode('utf-8')

        # rename filename form greek bytestream-> utf-8 bytestream
        old_path = os.path.join(root, filename)
        new_path = os.path.join(root, new_filename)
        os.rename( old_path, new_path )


#============================================================
# Compute a set of current fullpaths
path = os.listdir( '/home/nikos/public_html/data/apps/' )

# Load'em
for fullpath in path:
    try:
        # Check the presence of a file against the database and insert if it doesn't exist
        cur.execute('''SELECT url FROM files WHERE url = %s''', (fullpath,) )
        data = cur.fetchone() #URL is unique, so should only be one

        if not data:
            # First time for file; primary key is automatic, hit is defaulted
            cur.execute('''INSERT INTO files (url, host, lastvisit) VALUES (%s, %s, %s)''', (fullpath, host, lastvisit) )
    except pymysql.ProgrammingError as e:
        print( repr(e) )
==================================================================

The error is:
[Thu Jun 06 21:10:23 2013] [error] [client 79.103.41.173] File "files.py", line 64
[Thu Jun 06 21:10:23 2013] [error] [client 79.103.41.173] for fullpath in path
[Thu Jun 06 21:10:23 2013] [error] [client 79.103.41.173] ^
[Thu Jun 06 21:10:23 2013] [error] [client 79.103.41.173] SyntaxError: invalid syntax


Doesn't os.listdir( ...) returns a list with all filenames?

But then again when replacing take place to shert the fullpath to just the filane i think it doesn't not work because the os.listdir was opened as bytestring and not as a string....

What am i doing wrong?


You're changing things without checking what they do!

Ah yes, it returns filenames, not path/to/filenames.



#========================================================
# Give the path as a bytestring so that we'll get the filenames as bytestrings
path = os.listdir( b'/home/nikos/public_html/data/apps/' )

# iterate over all filenames in the apps directory
for filename in path:
    # Grabbing just the filename from path
    try:
        # Is this name encoded in utf-8?
        filename.decode('utf-8')
    except UnicodeDecodeError:
        # Decoding from UTF-8 failed, which means that the name is not valid utf-8

        # It appears that this filename is encoded in greek-iso, so decode from that and re-encode to utf-8
        new_filename = filename.decode('iso-8859-7').encode('utf-8')

        # rename filename form greek bytestream-> utf-8 bytestream
        old_path = os.path.join(root, filename)
        new_path = os.path.join(root, new_filename)
        os.rename( old_path, new_path )


#========================================================
# Compute a set of current fullpaths
path = os.listdir( '/home/nikos/public_html/data/apps/' )

# Load'em
for filename in path:
    try:
        # Check the presence of a file against the database and insert if it doesn't exist
        cur.execute('''SELECT url FROM files WHERE url = %s''', (filename,) )
        data = cur.fetchone() #URL is unique, so should only be one

        if not data:
            # First time for file; primary key is automatic, hit is defaulted
            cur.execute('''INSERT INTO files (url, host, lastvisit) VALUES (%s, %s, %s)''', (filename, host, lastvisit) )
    except pymysql.ProgrammingError as e:
        print( repr(e) )


# Delete spurious
cur.execute('''SELECT url FROM files''')
data = cur.fetchall()

for fullpath in data:
    if fullpath not in "What should be written here in place of ditched set"
        cur.execute('''DELETE FROM files WHERE url = %s''', (fullpath,) )

=============================

a) Is it correct that the first time I call os.listdir() with a bytestring to grab the filenames as bytestrings, and the second time normally to grab the filenames as Unicode strings?

b) My procedure for deleting spurious entries is messed up now that I have ditched the set fullpaths().
 
Νικόλαος Κούρας

Actually, about the procedure for deleting spurious entries, I am happy with myself for coming up with this:

# Delete spurious
cur.execute('''SELECT url FROM files''')
data = cur.fetchall()

for filename in path
    url = '/home/nikos/public_html/data/apps/' + filename
    urls.add( url )

for url in data:
    if url not in urls
        cur.execute('''DELETE FROM files WHERE url = %s''', (url,) )


I didn't try it yet though; I need an answer to my previous post's question first:

a) Is it correct that the first time I call os.listdir() with a bytestring to grab the filenames as bytestrings, and the second time normally to grab the filenames as Unicode strings?
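
For the spurious-entry check, a sketch of the set logic may help, with the database left out entirely. The two lists below are invented stand-ins for `os.listdir(...)` and `cur.fetchall()`; the sketch assumes each fetched row is a one-element tuple, which is how DB-API cursors normally return a single selected column.

```python
# Stand-in for os.listdir('/home/nikos/public_html/data/apps/')
on_disk = ['a.exe', 'b.mp3']

# Stand-in for cur.fetchall(): one tuple per row, one element per selected column.
rows = [('a.exe',), ('old.zip',)]

# set() builds an empty or filled set; () would be an empty tuple, which has no .add().
filenames = set(on_disk)

# Unpack each row before comparing; a tuple like ('a.exe',) never equals the string 'a.exe'.
spurious = [url for (url,) in rows if url not in filenames]
print(spurious)
```

Each name in `spurious` would then be passed to the `DELETE` statement. Note that the names on disk and the values stored in the `url` column have to use the same form (bare filename or full path) for the comparison to mean anything.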
 
Lele Gaifax

Νικόλαος Κούρας said:
Now the error, after fixing that, has changed to:

[Thu Jun 06 22:13:49 2013] [error] [client 79.103.41.173] filename = fullpath.replace( '/home/nikos/public_html/data/apps/', '' )
[Thu Jun 06 22:13:49 2013] [error] [client 79.103.41.173] TypeError: expected bytes, bytearray or buffer compatible object

MRAB has told me that i need to open those paths and filenames as bytestreams and not as unicode strings.

Yes, that way the function will return a list of bytes
instances. Knowing that, consider the following example, that should
ring a bell:

$ python3
Python 3.3.2 (v3.3.2:d047928ae3f6, May 16 2013, 09:59:04)
[GCC 4.7.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):

ciao, lele.
 
Νικόλαος Κούρας

On Thursday, 6 June 2013 11:25:15 PM UTC+3, Lele Gaifax wrote:
Now the error, after fixing that, has changed to:
[Thu Jun 06 22:13:49 2013] [error] [client 79.103.41.173] filename = fullpath.replace( '/home/nikos/public_html/data/apps/', '' )
[Thu Jun 06 22:13:49 2013] [error] [client 79.103.41.173] TypeError: expected bytes, bytearray or buffer compatible object
MRAB has told me that I need to handle those paths and filenames as bytestrings and not as Unicode strings.

Yes, that way the function will return a list of bytes
instances. Knowing that, consider the following example, that should
ring a bell:

$ python3
Python 3.3.2 (v3.3.2:d047928ae3f6, May 16 2013, 09:59:04)
[GCC 4.7.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: expected bytes, bytearray or buffer compatible object
b'/path'

Ah yes, very logical, I should have thought of that.
Thanks, here is what I have up until now, with many corrections.


#========================================================
# Get filenames of the apps directory as bytestrings
path = os.listdir( b'/home/nikos/public_html/data/apps/' )

# iterate over all filenames in the apps directory
for filename in path:
    # Grabbing just the filename from path
    try:
        # Is this name encoded in utf-8?
        filename.decode('utf-8')
    except UnicodeDecodeError:
        # Decoding from UTF-8 failed, which means that the name is not valid utf-8

        # It appears that this filename is encoded in greek-iso, so decode from that and re-encode to utf-8
        new_filename = filename.decode('iso-8859-7').encode('utf-8')

        # rename filename form greek bytestreams --> utf-8 bytestreams
        old_path = b'/home/nikos/public_html/data/apps/' + b'filename')
        new_path = b'/home/nikos/public_html/data/apps/' + b'new_filename')
        os.rename( old_path, new_path )


#========================================================
# Get filenames of the apps directory as unicode
path = os.listdir( '/home/nikos/public_html/data/apps/' )

# Load'em
for filename in path:
    try:
        # Check the presence of a file against the database and insert if it doesn't exist
        cur.execute('''SELECT url FROM files WHERE url = %s''', (filename,) )
        data = cur.fetchone() #URL is unique, so should only be one

        if not data:
            # First time for file; primary key is automatic, hit is defaulted
            cur.execute('''INSERT INTO files (url, host, lastvisit) VALUES (%s, %s, %s)''', (filename, host, lastvisit) )
    except pymysql.ProgrammingError as e:
        print( repr(e) )


#========================================================
# Empty set that will be filled in with 'path/to/filename' of path dir
urls = ()

# Build a set of 'path/to/filename' based on the objects of path dir
for filename in path
    url = '/home/nikos/public_html/data/apps/' + filename
    urls.add( url )

# Delete spurious
cur.execute('''SELECT url FROM files''')
data = cur.fetchall()

# Check database's urls against path's urls
for url in data:
    if url not in urls
        cur.execute('''DELETE FROM files WHERE url = %s''', (url,) )
==================================

I think it's ready! But I want to hear from you before I try it! :)

Νικόλαος Κούρας

Has some errors:

#========================================================
# Get filenames of the apps directory as bytestrings
path = os.listdir( b'/home/nikos/public_html/data/apps/' )

# iterate over all filenames in the apps directory
for filename in path:
    # Grabbing just the filename from path
    try:
        # Is this name encoded in utf-8?
        filename.decode('utf-8')
    except UnicodeDecodeError:
        # Decoding from UTF-8 failed, which means that the name is not valid utf-8

        # It appears that this filename is encoded in greek-iso, so decode from that and re-encode to utf-8
        new_filename = filename.decode('iso-8859-7').encode('utf-8')

        # rename filename form greek bytestreams --> utf-8 bytestreams
        old_path = b'/home/nikos/public_html/data/apps/' + b'filename')
        new_path = b'/home/nikos/public_html/data/apps/' + b'new_filename')
        os.rename( old_path, new_path )


#========================================================
# Get filenames of the apps directory as unicode
path = os.listdir( '/home/nikos/public_html/data/apps/' )

# Load'em
for filename in path:
    try:
        # Check the presence of a file against the database and insert if it doesn't exist
        cur.execute('''SELECT url FROM files WHERE url = %s''', (filename,) )
        data = cur.fetchone() #filename is unique, so should only be one

        if not data:
            # First time for file; primary key is automatic, hit is defaulted
            cur.execute('''INSERT INTO files (url, host, lastvisit) VALUES (%s, %s, %s)''', (filename, host, lastvisit) )
    except pymysql.ProgrammingError as e:
        print( repr(e) )


#========================================================
path = os.listdir( '/home/nikos/public_html/data/apps/' )
filenames = ()

# Build a set of 'path/to/filename' based on the objects of path dir
for filename in path
    filenames.add( filename )

# Delete spurious
cur.execute('''SELECT url FROM files''')
data = cur.fetchall()

# Check database's filenames against path's filenames
for filename in data:
    if filename not in filenames
        cur.execute('''DELETE FROM files WHERE url = %s''', (filename,) )
-------------------------------

The only problem now is the bytestrings:

(e-mail address removed) [~/www/cgi-bin]# [Thu Jun 06 23:50:42 2013] [error] [client 79.103.41.173] File "files.py", line 78
[Thu Jun 06 23:50:42 2013] [error] [client 79.103.41.173] old_path = b'/home/nikos/public_html/data/apps/' + b'filename')
[Thu Jun 06 23:50:42 2013] [error] [client 79.103.41.173] ^
[Thu Jun 06 23:50:42 2013] [error] [client 79.103.41.173] SyntaxError: invalid syntax


I don't know how to join a bytestring path to a bytestring filename.

Νικόλαος Κούρας

I'm very sorry for the continuous pastes.
I didn't include the whole thing before.
Here it is:


#========================================================
# Get filenames of the path dir as bytestrings
path = os.listdir( b'/home/nikos/public_html/data/apps/' )

# iterate over all filenames in the apps directory
for filename in path:
    # Grabbing just the filename from path
    try:
        # Is this name encoded in utf-8?
        filename.decode('utf-8')
    except UnicodeDecodeError:
        # Decoding from UTF-8 failed, which means that the name is not valid utf-8

        # It appears that this filename is encoded in greek-iso, so decode from that and re-encode to utf-8
        new_filename = filename.decode('iso-8859-7').encode('utf-8')

        # rename filename form greek bytestreams --> utf-8 bytestreams
        old_path = b'/home/nikos/public_html/data/apps/' + b'filename')
        new_path = b'/home/nikos/public_html/data/apps/' + b'new_filename')
        os.rename( old_path, new_path )


#========================================================
# Get filenames of the apps directory as unicode
path = os.listdir( '/home/nikos/public_html/data/apps/' )

# Load'em
for filename in path:
    try:
        # Check the presence of a file against the database and insert if it doesn't exist
        cur.execute('''SELECT url FROM files WHERE url = %s''', (filename,) )
        data = cur.fetchone() #filename is unique, so should only be one

        if not data:
            # First time for file; primary key is automatic, hit is defaulted
            cur.execute('''INSERT INTO files (url, host, lastvisit) VALUES (%s, %s, %s)''', (filename, host, lastvisit) )
    except pymysql.ProgrammingError as e:
        print( repr(e) )


#========================================================
path = os.listdir( '/home/nikos/public_html/data/apps/' )
filenames = ()

# Build a set of 'path/to/filename' based on the objects of path dir
for filename in path
    filenames.add( filename )

# Delete spurious
cur.execute('''SELECT url FROM files''')
data = cur.fetchall()

# Check database's filenames against path's filenames
for filename in data:
    if filename not in filenames
        cur.execute('''DELETE FROM files WHERE url = %s''', (filename,) )
=====================================

Just the bytestring error is left, and then I believe it's ready this time.
 
Lele Gaifax

Νικόλαος Κούρας said:
Thanks, here is what I have up until now, with many corrections.

I'm afraid many more are needed :)
...
# rename filename form greek bytestreams --> utf-8 bytestreams
old_path = b'/home/nikos/public_html/data/apps/' + b'filename')
new_path = b'/home/nikos/public_html/data/apps/' + b'new_filename')
os.rename( old_path, new_path )

a) there are two syntax errors, you have spurious close brackets there
b) you are basically assigning *constant* expressions to both variables,
most probably not what you meant
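
To spell out point (b): `b'filename'` is a literal eight-byte string, not the value of the variable `filename`. A minimal sketch of the difference, using a made-up sample name in place of a real entry from `os.listdir`:

```python
import os

directory = b'/home/nikos/public_html/data/apps/'

# Hypothetical sample: a name as os.listdir(bytes) would return it, in ISO-8859-7.
filename = 'Ευχή.mp3'.encode('iso-8859-7')
new_filename = filename.decode('iso-8859-7').encode('utf-8')

# A bytes literal is a constant: this path ends in the text "filename" for every file.
constant = directory + b'filename'

# Using the variables builds a different path for each file.
old_path = os.path.join(directory, filename)
new_path = os.path.join(directory, new_filename)

print(constant)
print(old_path)
print(new_path)
```

`os.path.join` accepts bytes as long as every argument is bytes, so nothing has to be decoded just to build the path.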

ciao, lele.
 
Lele Gaifax

Νικόλαος Κούρας said:
The only problem now is the bytestrings:

*One*, not the *only*.
(e-mail address removed) [~/www/cgi-bin]# [Thu Jun 06 23:50:42 2013] [error] [client 79.103.41.173] File "files.py", line 78
[Thu Jun 06 23:50:42 2013] [error] [client 79.103.41.173] old_path = b'/home/nikos/public_html/data/apps/' + b'filename')
[Thu Jun 06 23:50:42 2013] [error] [client 79.103.41.173] ^
[Thu Jun 06 23:50:42 2013] [error] [client 79.103.41.173] SyntaxError: invalid syntax


Dont know how to add a bytestremed path to a bytestream filename

Come on Niklos, either you learn from what I (and others) try to teach
you, or I'm afraid you won't get more hints... this list cannot become
your remote editor tool!

*Read* the error message, *look* at the arrow (i.e. the caret character
"^"), *understand* what that is trying to tell you...

ciao, lele.
 
MRAB

I'm afraid many more are needed :)


a) there are two syntax errors, you have spurious close brackets there
b) you are basically assigning *constant* expressions to both variables,
most probably not what you meant
Yet again, he's changed things unnecessarily, and the code was meant
only as a one-time fix to correct the encoding of some filenames. :-(
 
Cameron Simpson

| We are in test mode so i dont know if when renaming actually take place what the encodings will be.
| Shall i switch off test mode and try it for real?

I would make a copy. Since you're renaming stuff, hard links would do:

cp -rpl original-dir test-dir

Then test stuff in test-dir.
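
The same idea can be tried from Python. The sketch below is only an illustration in a temporary directory, with invented file names: it builds a hard-linked copy using `shutil.copytree` with `os.link` as the copy function, renames a file in the copy, and checks that the original directory still lists the old name. It assumes a filesystem that supports hard links.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as base:
    original = os.path.join(base, 'original-dir')
    test = os.path.join(base, 'test-dir')
    os.mkdir(original)
    with open(os.path.join(original, 'sample.txt'), 'w') as f:
        f.write('data')

    # Recreate the directory tree, hard-linking each file instead of copying its contents.
    shutil.copytree(original, test, copy_function=os.link)

    # Renaming inside the copy changes only the copy's directory entry.
    os.rename(os.path.join(test, 'sample.txt'), os.path.join(test, 'renamed.txt'))

    original_names = os.listdir(original)
    test_names = os.listdir(test)
    same_file = os.path.samefile(os.path.join(original, 'sample.txt'),
                                 os.path.join(test, 'renamed.txt'))

print(original_names, test_names, same_file)
```

Hard links share file contents, so this protects against bad renames but not against a script that modifies the files themselves.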
 
Cameron Simpson

| On Thursday, 6 June 2013 3:44:52 PM UTC+3, Steven D'Aprano wrote:
| > py> s = '999-Eυχή-του-Ιησού'
| > py> bytes_as_utf8 = s.encode('utf-8')
| > py> t = bytes_as_utf8.decode('iso-8859-7', errors='replace')
| > py> print(t)
| > 999-EΟΟΞ�-ΟΞÎΟ-ΞΞ·ΟΞÎΟ
|
| errors='replace' mean dont break in case or error?

Yes. The result will be correct for correct iso-8859-7 and slightly mangled
for something that would not decode smoothly.
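
A short sketch of the difference between the default strict handling and `errors='replace'`. It uses the word 'Ευχή' from the example above; as far as I can tell, its UTF-8 form contains the byte 0xAE, which Python's ISO-8859-7 codec leaves unassigned, and that is where the replacement character comes from.

```python
data = 'Ευχή'.encode('utf-8')

strict_failed = False
try:
    # The default error handler is 'strict': an unmappable byte raises an exception.
    data.decode('iso-8859-7')
except UnicodeDecodeError as err:
    strict_failed = True
    print('strict:', err)

# With errors='replace' the unmappable byte becomes U+FFFD and decoding carries on.
replaced = data.decode('iso-8859-7', errors='replace')
print(repr(replaced))
```

The result is still the wrong text, since the bytes were never ISO-8859-7 to begin with; `errors='replace'` only keeps the decode from stopping.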

| You took the unicode 's' string you utf-8 bytestringed it.
| Then how its possible to ask for the utf8-bytestring to decode
| back to unicode string with the use of a different charset that the
| one used for encoding and thsi actually printed the filename in
| greek-iso?

It is easily possible, as shown above. Does it make sense? Normally
not, but Steven is demonstrating how your "mv" exercises have
behaved: a rename using utf-8, then a _display_ using iso-8859-7.

| > So that demonstrates part of your problem: even though your Linux system
| > is using UTF-8, your terminal is probably set to ISO-8859-7. The
| > interaction between these will lead to strange and disturbing Unicode
| > errors.
|
| Yes i feel this is the problem too.
| Its a wonder to me why putty used by default greek-iso instead of utf-8 !!

Putty will get its terminal setting from the system you came from.
I suppose Windows of some kind. If you look at Putty's settings you
may be able to specify UTF-8 explicitly; not sure. If you can, do
that. At least there will be one less layer of confusion to debug.

| Please explain this to me, because now that I begin to understand
| these encode/decode things I begin to like them!
|
| a) WHAT does it mean when a linux system is set to use utf-8?

It means the locale settings _for the current process_ are set for
UTF-8. The "locale" command will show you the current state. There
will also be some system settings with defaults for stuff started
up by the system. On CentOS and RedHat that is probably the file:

/etc/sysconfig/i18n

_However_, when you ssh in to the system using Putty or another ssh
client, the settings at your local end are passed to the remote ssh
session. In this way different people using different locales can
ssh in and get the locales they expect to use.

Of course, if the locale settings differ and these people are working
on the same files and text, madness will ensue.
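
From inside Python, the settings the current process actually ended up with can be inspected directly. A minimal sketch, assuming a CPython 3 interpreter; the helper name `describe_encodings` is hypothetical, and the values it reports depend entirely on the environment it runs in.

```python
import locale
import os
import sys


def describe_encodings():
    """Collect the encoding-related settings visible to this process."""
    return {
        # Locale-related environment variables (any of them may be unset).
        "env": {name: os.environ.get(name)
                for name in ("LC_ALL", "LC_CTYPE", "LANG")},
        # Encoding the locale suggests for text data.
        "preferred": locale.getpreferredencoding(False),
        # Encoding Python uses to convert file names to and from bytes.
        "filesystem": sys.getfilesystemencoding(),
    }


for key, value in describe_encodings().items():
    print(key, value)
```

Running this once from an ssh session and once from the CGI script would show whether the two processes see the same locale.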

| b) WHAT does it mean when a terminal client is set to use utf-8?

It means the _display_ end of the terminal will render characters
using UTF-8. Data comes from the remote system as a sequence of
bytes. The terminal receives these bytes and _decodes_ them using
utf-8 (or whatever) in order to decide what characters to display.

| c) WHAT happens when the two of them try to work together?

If everything matches, it is all good. If the locales do not match,
the mismatch will result in an undesired bytes<->characters
encode/decode step somewhere, and something will display incorrectly
or be entered as input incorrectly.

| > So I believe I understand how your file name has become garbage. To fix
| > it, make sure that your terminal is set to use UTF-8, and then rename it.
| > Do the same with every file in the directory until the problem goes away.
|
| (e-mail address removed) [~/www/cgi-bin]# echo $LS_OPTIONS
| --color=tty -F -a -b -T 0
|
| Is this okay? Is the '-b' option for displaying a filename in binary mode?

Probably. "man ls" will tell you.

Personally, I "unalias ls" on RedHat systems (and any other system
where an alias has been set up). I want ls to do what I say, not
what someone else thought was a good idea.

| Indeed, I have changed putty to use 'utf-8' and 'ls -l' now displays
| the file in correct Greek letters. Switching putty's encoding back
| to 'greek-iso', the *displayed* filenames show up as mojibake.

Exactly so.

| WHAT is being displayed and what is actually stored as bytes are two different things, right?

Yes. Display requires the byte stream to be decoded. Wrong decoding
displays the wrong characters/glyphs.

| Ευχη του Ιησου.mp3
| EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ
|
| is the way the filename is displayed in the terminal depending
| on the encoding the terminal uses, correct? But no matter *how* it's
| being displayed, those two are the same file?

In principle, yes. Nothing has changed on the filesystem itself.

Cheers,
--
Cameron Simpson <[email protected]>

You write code in a proportional serif? No wonder you got extra
semicolons falling all over the place.
No, I *dream* about writing code in a proportional serif font.
It's much more exciting than my real life.
/* dan: THE Anti-Ged -- Ignorant Yank (tm) #1, none-%er #7 */
Dan Nitschke (e-mail address removed) (e-mail address removed)
 

Steven D'Aprano

| I can remove the binary opening from os.listdir but then this will not
| work. MRAB has told me that I need to open those paths and filenames as
| bytestreams and not as unicode strings.

Do you understand why?

If you do not understand *why* we tell you to do a thing, then you have
no understanding and are doing Cargo Cult programming:

http://en.wikipedia.org/wiki/Cargo_cult_programming
http://en.wikipedia.org/wiki/Cargo_cult


MRAB tells you to work with the bytes, because the file names' bytes are
invalid when used as UTF-8. If you fix the file names by renaming using a
terminal set to UTF-8, then they will be valid and you can forget about
working with bytes.

Working with bytes is only for when the file names are turned to garbage.
Your file names (some of them) are turned to garbage. Fix them, and then
use file names as strings.
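
As a rough sketch of why bytes are needed only while the names are broken: listing a directory with a bytes path returns the raw names, which can then be tested for UTF-8 validity. The helper names `to_utf8_name` and `plan_renames` are made up here, and ISO-8859-7 as the legacy encoding is the assumption made earlier in this thread, not something the code can detect.

```python
import os


def to_utf8_name(raw_name, legacy_encoding="iso-8859-7"):
    """Return raw_name re-encoded as UTF-8 unless it already is valid UTF-8.

    legacy_encoding is a guess about how the broken names were written.
    """
    try:
        raw_name.decode("utf-8")
    except UnicodeDecodeError:
        return raw_name.decode(legacy_encoding).encode("utf-8")
    return raw_name


def plan_renames(directory):
    """List (old, new) byte paths for names needing a fix; rename nothing."""
    base = os.fsencode(directory)
    plan = []
    # A bytes argument makes os.listdir return the names as raw bytes.
    for raw_name in os.listdir(base):
        fixed = to_utf8_name(raw_name)
        if fixed != raw_name:
            plan.append((os.path.join(base, raw_name),
                         os.path.join(base, fixed)))
    return plan
```

Once `plan_renames` comes back empty, every name decodes as UTF-8 and ordinary string paths are enough.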
 

Steven D'Aprano

| SyntaxError: invalid syntax
|
|
| Don't know how to add a bytestreamed path to a bytestream filename


Nikos, READ THE ERROR MESSAGE!!!

The error doesn't say anything about *adding*. It is a SyntaxError.

Please stop flooding us with dozens and dozens of trivial posts asking
the same questions over and over again. There are well over 120 posts in
this thread, it is impossible for anyone to keep track of it.


* Do not send a new post every time you make a small change to the code.

* Do not send a new post every time you make a typo and get a SyntaxError.

* READ THE ERROR MESSAGES and try to understand them first.

* SyntaxError means YOU HAVE MADE A TYPING MISTAKE.

* SyntaxError means that your code is not executed at all. Not a
single line of code is run. If no code is running, the problem
cannot possibly be with "add" or some other operation.

If your car will not start, the problem cannot be with the brakes.

If your program will not start, the problem cannot be with adding
two byte strings.

* You can fix syntax errors yourself. READ THE CODE that has the
syntax error and LOOK FOR WHAT IS WRONG. Then fix it.

* Don't tell us when you have fixed it. Nobody cares. Just fix it.

Here is the line of code again:

old_path = b'/home/nikos/public_html/data/apps/' + b'filename')


There is a syntax error in this line of code. Hint: here are some simple
examples of the same syntax error:

a = b + c)
x = y * z)
alist.sort())
assert 1+1 == 2)

Can you see the common factor? Each of those lines will give the same
syntax error as your line.
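
The point that no code runs at all can be checked by handing the source to `compile()` as a string: the error is raised while parsing, before any statement is executed. A minimal sketch; the helper name `syntax_error_in` is made up for illustration.

```python
def syntax_error_in(source):
    """Return the SyntaxError raised when compiling source, or None."""
    try:
        # compile() only parses the text; it never runs it.
        compile(source, "<example>", "exec")
    except SyntaxError as exc:
        return exc
    return None


line = "old_path = b'/home/nikos/public_html/data/apps/' + b'filename')"
error = syntax_error_in(line)
if error is not None:
    # The exception carries the position of the problem on the line.
    print("line", error.lineno, "column", error.offset)
```

The undefined names in lines such as `a = b + c)` never matter, because parsing fails before any name is looked up.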
 

Steven D'Aprano

| On Thursday, 6 June 2013 at 3:44:52 PM UTC+3, user Steven D'Aprano
| wrote:
|
|
| Does errors='replace' mean don't break in case of error?

Please try reading the documentation for yourself before asking for help.

http://docs.python.org/3/library/stdtypes.html#bytes.decode


Yes, errors='replace' will mean that any time there is a decoding error,
the official Unicode "U+FFFD REPLACEMENT CHARACTER" will be used instead
of raising an error. Read the docs above, and follow the link, for more
information.
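
A short sketch of the difference between the error handlers, using two bytes taken from the example earlier in the thread. The variable names are illustrative only.

```python
# b"\xce\xae" is the UTF-8 encoding of the Greek letter eta with tonos.
# Read as ISO-8859-7, 0xCE is a capital xi, while 0xAE has no character
# assigned in Python's ISO-8859-7 codec.
data = b"\xce\xae"

try:
    data.decode("iso-8859-7")  # errors="strict" is the default
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# The undecodable byte becomes U+FFFD REPLACEMENT CHARACTER.
replaced = data.decode("iso-8859-7", errors="replace")

# The undecodable byte is silently dropped.
ignored = data.decode("iso-8859-7", errors="ignore")
```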

| You took the unicode
| 's' string and utf-8 bytestringed it.

The word is "encoded".

Encoding: Unicode string => bytes
Decoding: bytes => Unicode string

| Then how is it possible to ask for
| the utf8-bytestring to decode back to a unicode string using a
| different charset than the one used for encoding, and this actually
| printed the filename in greek-iso?

Bytes are bytes, no matter where they come from. Bytes don't remember
whether they were from a Unicode string, or a float, or an integer, or a
list of pointers. All they know is that they are a sequence of values,
each value is 8 bits.

So bytes don't remember what charset (encoding) made them. If I have a
set of bytes, I can *try* to do anything I like with them:

* decode those bytes as ASCII
* decode those bytes as UTF-8
* decode those bytes as ISO-8859-7
* decode those bytes as a list of floats
* decode those bytes as a binary tree of pointers

If the bytes are not actually ASCII, or UTF-8, etc., then I will get
garbage, or an error.
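
To make that concrete, here is a sketch that takes one set of bytes and tries several of the interpretations listed above. The sample text and the helper name `try_decode` are illustrative assumptions.

```python
import struct

data = "Ευχή".encode("utf-8")  # 8 bytes, two per Greek letter


def try_decode(raw, encoding):
    """Return the decoded text, or None if raw is invalid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


as_ascii = try_decode(data, "ascii")  # None: bytes above 0x7F are not ASCII
as_utf8 = try_decode(data, "utf-8")   # the original text
as_greek_iso = data.decode("iso-8859-7", errors="replace")  # garbage text
as_floats = struct.unpack("<2f", data)  # two meaningless 32-bit floats
```

Nothing in `data` says which of these readings is the right one; that knowledge has to come from outside the bytes.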

| Yes, I feel this is the problem too.
| It's a wonder to me why putty used greek-iso by default instead of utf-8!

Putty is probably getting the default charset from the Windows 8 system
you are using, and Windows is probably using Greek ISO-8859-7 for
compatibility with legacy data going back to Windows 95 or even DOS.

Someday everyone will use UTF-8, and this nonsense will be over.

| Please explain this to me, because now that I begin to understand these
| encode/decode things I begin to like them!

Start here:

http://www.joelonsoftware.com/articles/Unicode.html

http://nedbatchelder.com/text/unipain.html


| a) WHAT does it mean when a linux system is set to use utf-8?

The Linux file system just treats file names as bytes. Any byte except
0x00 and 0x2f (ASCII '\0' and '/') is legal in file names, so the Linux
file system will store any other bytes.

But the applications on a Linux system don't work with bytes, they work
with text strings. You want to see a file name like "My Music.mp3", not
bytes like 0x4d79204d757369632e6d7033. So the applications need to know
how to encode their text strings (file names) into bytes, and how to
decode the file system bytes back into strings.

On Linux, there is a standard setting for doing this, the locale, which
by default is set to use UTF-8 as the standard encoding. So well-behaved
Linux applications will, directly or indirectly, interpret the bytes-on-
disk in file names as UTF-8, because that's what the locale tells them to
do.
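
The "My Music.mp3" example can be reproduced in a few lines. This is a sketch with illustrative variable names; `os.fsencode` and `os.fsdecode` use whatever file name encoding Python picked up from its environment.

```python
import os

name = "My Music.mp3"

# Text -> bytes, shown as hex, and back again.
raw = name.encode("utf-8")
hex_form = raw.hex()  # '4d79204d757369632e6d7033'
round_trip = bytes.fromhex(hex_form).decode("utf-8")

# The same conversion using the encoding Python selected for file names.
fs_raw = os.fsencode(name)
fs_name = os.fsdecode(fs_raw)
```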

On Windows, there is a completely different setting for doing this,
probably in the Registry.

| b) WHAT does it mean when a terminal client is set to use utf-8?

Terminals need to accept bytes from the keyboard, and display them as
text to the user. So they need to know what encoding to use to change
bytes like 0x4d79204d757369632e6d7033 into something that is readable to
a human being, "My Music.mp3". That is the encoding.

| c) WHAT happens when the two of them try to work together?

If they are set to the same encoding, everything just works.

If they are set to different encodings, you will probably have problems,
just as you are having problems.

| (e-mail address removed) [~/www/cgi-bin]# echo $LS_OPTIONS
| --color=tty -F -a -b -T 0
|
| Is this okay? Is the '-b' option for displaying a filename in binary
| mode?

That's fine.

| Indeed, I have changed putty to use 'utf-8' and 'ls -l' now displays the
| file in correct Greek letters. Switching putty's encoding back to
| 'greek-iso', the *displayed* filenames show up as mojibake.
|
| WHAT is being displayed and what is actually stored as bytes are two
| different things, right?

Correct.

The bytes 0x20 0x40 mean " @" (space at-sign) in ASCII or UTF-8 (and
also many other encodings), but they mean CJK UNIFIED IDEOGRAPH-4020 in
little-endian UTF-16, they are invalid in UTF-32, and they mean the
number 16416 as a little-endian 16-bit integer. Bytes are just sets of
8-bit values. The *meaning* of those 8-bit values depends on you, not
the bytes themselves.
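
A sketch of reading those same two bytes in several ways, assuming little-endian byte order wherever the order matters. The variable names are illustrative.

```python
import struct
import unicodedata

pair = b"\x20\x40"

as_ascii = pair.decode("ascii")            # space followed by at-sign
as_utf16 = pair.decode("utf-16-le")        # a single character, U+4020
utf16_name = unicodedata.name(as_utf16)    # its official Unicode name
as_uint16 = struct.unpack("<H", pair)[0]   # 0x4020 as an unsigned integer

try:
    pair.decode("utf-32-le")
    utf32_ok = True
except UnicodeDecodeError:
    # Two bytes cannot hold even one 4-byte UTF-32 code unit.
    utf32_ok = False
```

With big-endian order the same pair would instead be U+2040 in UTF-16 and the integer 8256, which is one more way the meaning depends on the reader rather than on the bytes.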

| is the way the filename is displayed in the terminal depending on the
| encoding the terminal uses, correct? But no matter *how* it's being
| displayed, those two are the same file?

That's a hard question to answer. Sometimes yes, but not necessarily. It
will depend on how the terminal works, and how confused it gets.
 
