Hello everyone,
I'm reading the rows from a CSV file. csv.DictReader puts
those rows into dictionaries.
The actual files contain old and new translations of software
strings. The dictionary containing the row data looks like this:
o={'TermID':'4', 'English':'System Administration',
'Polish':'Zarzadzanie systemem'}
I put those dictionaries into a list:
oldl=[x for x in orig] # where orig=csv.DictReader(ofile ...
...and then search for matching source terms in two loops:
for o in oldl:
    for n in newl:
        if n['English'] == o['English']:
            ...
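For illustration, here is a tiny self-contained version of that search; the data below is made up (only the first term actually has a counterpart in the new list):

oldl = [{'TermID': '4', 'English': 'System Administration',
         'Polish': 'Zarzadzanie systemem'},
        {'TermID': '5', 'English': 'Connection Error',
         'Polish': 'Blad polaczenia'}]
newl = [{'TermID': '7', 'English': 'System Administration',
         'Polish': 'Administracja systemem'}]

for o in oldl:
    for n in newl:
        if n['English'] == o['English']:
            print o['TermID'], o['Polish'], '->', n['Polish']
# prints: 4 Zarzadzanie systemem -> Administracja systemem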
Now, this works. However, not only is this very un-Pythonic, it is also
very inefficient: the complexity is O(n**2), so it scales up very
badly.
What I want to know is whether there is some elegant and efficient
way of doing this, i.e. finding all the dictionaries dx_1 ... dx_n
contained in a list (or a dictionary) dy, where each dx_i contains
a specific value. Or possibly just the first such dictionary, dx_1.
I HAVE to search by the values corresponding to the key 'English', since
there are big gaps in both files (i.e. there are a lot of rows
in the old file that do not correspond to any row in the new
file and vice versa). I don't want to do ugly things like converting
a dictionary to a string just so I can use the string.find() method.
Obviously it does not have to be implemented this way. If
the data structures here could be designed in a proper
(Pythonesque ;-) way, great.
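To make it more concrete, I suppose what I am fishing for is something
like an index keyed on the value I search by. Just a sketch of what I
imagine (not tested, the names are mine):

# build an index: English term -> list of row dicts containing it
index = {}
for n in newl:
    index.setdefault(n['English'], []).append(n)

# all dictionaries whose 'English' value equals a given term ([] if none)
matches = index.get('System Administration', [])

# ...or just the first one
if matches:
    first = matches[0]

That would turn each lookup into a single dictionary access instead of
a scan over the whole list.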
I do realize that this resembles doing some operation on
matrices, but I have never tried doing something like this in
Python.
#---------- Code follows ---------
import sys
import csv
class excelpoldialect(csv.Dialect):
    delimiter=';'
    doublequote=True
    lineterminator='\r\n'
    quotechar='"'
    quoting=csv.QUOTE_MINIMAL
    skipinitialspace=False
epdialect=excelpoldialect()
csv.register_dialect('excelpol',epdialect)
try:
    ofile=open(sys.argv[1],'rb')
except IOError:
    print "Old file %s could not be opened" % (sys.argv[1])
    sys.exit(1)
try:
    tfile=open(sys.argv[2],'rb')
except IOError:
    print "New file %s could not be opened" % (sys.argv[2])
    sys.exit(1)
titles=csv.reader(ofile, dialect='excelpol').next()
orig=csv.DictReader(ofile, titles, dialect='excelpol')
transl=csv.DictReader(tfile, titles, dialect='excelpol')
cfile=open('cmpfile.csv','wb')
titles.append('New')
titles.append('RowChanged')
cm=csv.DictWriter(cfile,titles, dialect='excelpol')
cm.writerow(dict(zip(titles,titles)))
print titles
print "-------------"
oldl=[x for x in orig]
newl=[x for x in transl]
all=[]
# O(n*m) nested scan: match old and new rows on the 'English' source term
for o in oldl:
    for n in newl:
        if n['English'] == o['English']:
            if n['Polish'] == o['Polish']:
                status=''
            else:
                status='CHANGED'
            combined={'TermID': o['TermID'], 'English': o['English'],
                      'Polish': o['Polish'], 'New': n['Polish'],
                      'RowChanged': status}
            cm.writerow(combined)
            all.append(combined)
# duplicates
dfile=open('dupes.csv','wb')
dupes=csv.DictWriter(dfile,titles,dialect='excelpol')
dupes.writerow(dict(zip(titles,titles)))
"""for i in xrange(0,len(all)-2):
for j in xrange(i+1, len(all)-1):
if (all['English']==all[j]['English']) and
all['RowChanged']=='CHANGED':
dupes.writerow(all)
dupes.writerow(all[j])"""
cfile.close()
ofile.close()
tfile.close()
dfile.close()
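For what it is worth, here is roughly how I imagine the comparison loop
above would change if the new rows were indexed by their 'English' value
first. This is an untested sketch (new_by_english is my own name), not
what I am actually running:

# one pass over newl to build the index
new_by_english = {}
for n in newl:
    new_by_english.setdefault(n['English'], []).append(n)

# one pass over oldl; each lookup is a dictionary access, not a scan
for o in oldl:
    for n in new_by_english.get(o['English'], []):
        if n['Polish'] == o['Polish']:
            status=''
        else:
            status='CHANGED'
        combined={'TermID': o['TermID'], 'English': o['English'],
                  'Polish': o['Polish'], 'New': n['Polish'],
                  'RowChanged': status}
        cm.writerow(combined)
        all.append(combined)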