Pythonic search of list of dictionaries

Bulba!

Hello everyone,

I'm reading the rows from a CSV file. csv.DictReader puts
those rows into dictionaries.

The actual files contain old and new translations of software
strings. The dictionary containing the row data looks like this:

o={'TermID':'4', 'English':'System Administration',
'Polish':'Zarzadzanie systemem'}

I put those dictionaries into the list:

oldl=[x for x in orig] # where orig=csv.DictReader(ofile ...

...and then search for matching source terms in two loops:

for o in oldl:
    for n in newl:
        if n['English'] == o['English']:
            ...

Now, this works. However, not only is this very un-Pythonic, it is also
very inefficient: the complexity is O(n**2), so it scales up very
badly.

What I want to know is if there is some elegant and efficient
way of doing this, i.e. finding all the dictionaries dx_1 ... dx_n,
contained in a list (or a dictionary) dy, where dx_i contains
a specific value. Or possibly just the first dx_1 dictionary.
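
Written naively, what I mean is something like this (dy here being a
list of the row dictionaries):

hits = [d for d in dy if d['English'] == value]   # all matching dictionaries
if hits:
    first = hits[0]                               # or just the first one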

I HAVE to search for values corresponding to key 'English', since
there are big gaps in both files (i.e. there are a lot of rows
in the old file that do not correspond to rows in the new
file and vice versa). I don't want to do ugly things like converting
the dictionary to a string so I could use the string.find() method.

Obviously it does not have to be implemented this way. If
data structures here could be designed in a proper
(Pythonesque ;-) way, great.

I do realize that this resembles doing some operation on
matrices. But I have never tried doing something like this in
Python.


#---------- Code follows ---------

import sys
import csv

class excelpoldialect(csv.Dialect):
    delimiter=';'
    doublequote=True
    lineterminator='\r\n'
    quotechar='"'
    quoting=0
    skipinitialspace=False

epdialect=excelpoldialect()
csv.register_dialect('excelpol',epdialect)


try:
    ofile=open(sys.argv[1],'rb')
except IOError:
    print "Old file %s could not be opened" % (sys.argv[1])
    sys.exit(1)

try:
    tfile=open(sys.argv[2],'rb')
except IOError:
    print "New file %s could not be opened" % (sys.argv[2])
    sys.exit(1)


titles=csv.reader(ofile, dialect='excelpol').next()
orig=csv.DictReader(ofile, titles, dialect='excelpol')
transl=csv.DictReader(tfile, titles, dialect='excelpol')

cfile=open('cmpfile.csv','wb')
titles.append('New')
titles.append('RowChanged')
cm=csv.DictWriter(cfile,titles, dialect='excelpol')
cm.writerow(dict(zip(titles,titles)))


print titles
print "-------------"

oldl=[x for x in orig]
newl=[x for x in transl]

all=[]

for o in oldl:
    for n in newl:
        if n['English'] == o['English']:
            if n['Polish'] == o['Polish']:
                status=''
            else:
                status='CHANGED'
            combined={'TermID': o['TermID'], 'English': o['English'],
                      'Polish': o['Polish'], 'New': n['Polish'], 'RowChanged': status}
            cm.writerow(combined)
            all.append(combined)


# duplicates

dfile=open('dupes.csv','wb')
dupes=csv.DictWriter(dfile,titles,dialect='excelpol')
dupes.writerow(dict(zip(titles,titles)))

"""for i in xrange(0,len(all)-2):
    for j in xrange(i+1, len(all)-1):
        if (all[i]['English']==all[j]['English']) and \
           all[i]['RowChanged']=='CHANGED':
            dupes.writerow(all[i])
            dupes.writerow(all[j])"""

cfile.close()
ofile.close()
tfile.close()
dfile.close()
 
Tim Hochberg

Bulba! said:
Hello everyone,

I'm reading the rows from a CSV file. csv.DictReader puts
those rows into dictionaries.

The actual files contain old and new translations of software
strings. The dictionary containing the row data looks like this:

o={'TermID':'4', 'English':'System Administration',
'Polish':'Zarzadzanie systemem'}

I put those dictionaries into the list:

oldl=[x for x in orig] # where orig=csv.DictReader(ofile ...

..and then search for matching source terms in two loops:

for o in oldl:
    for n in newl:
        if n['English'] == o['English']:
            ...

Now, this works. However, not only is this very un-Pythonic, it is also
very inefficient: the complexity is O(n**2), so it scales up very
badly.

What I want to know is if there is some elegant and efficient
way of doing this, i.e. finding all the dictionaries dx_1 ... dx_n,
contained in a list (or a dictionary) dy, where dx_i contains
a specific value. Or possibly just the first dx_1 dictionary.

Sure, just do a little preprocessing. Something like (untested):

####

def make_map(l):
    # This assumes that each English key is unique in a given l;
    # if it's not you'll have to use a list of o instead of o itself.
    map = {}
    for d in l:
        if 'English' in d:
            key = d['English']
            map[key] = d
    return map

old_map = make_map(oldl)
new_map = make_map(newl)

for engphrase in old_map:
    if engphrase in new_map:
        o = old_map[engphrase]
        n = new_map[engphrase]
        if n['Polish'] == o['Polish']:
            status=''
        else:
            status='CHANGED'
        # process....

####

I've assumed that the English key is unique in both the old and new
lists. If it's not this will need some adjustment. However, your
original algorithm is going to behave weirdly in that case anyway
(spitting out multiple lines with the same id, but potentially different
new terms and update status).
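
For instance, a grouping variant (equally untested; the name
make_multimap is just for illustration) might look like:

def make_multimap(l):
    # Like make_map, but keeps *all* rows for a given English phrase,
    # so duplicates are collected instead of silently overwritten.
    map = {}
    for d in l:
        if 'English' in d:
            map.setdefault(d['English'], []).append(d)
    return map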

Hope that's useful.

-tim
 
Skip Montanaro

Bulba> I put those dictionaries into the list:

Bulba> oldl=[x for x in orig] # where orig=csv.DictReader(ofile ...

Bulba> ..and then search for matching source terms in two loops:

Bulba> for o in oldl:
Bulba>     for n in newl:
Bulba>         if n['English'] == o['English']:
Bulba>             ...

Bulba> Now, this works. However, not only this is very un-Pythonic, but
Bulba> also very inefficient: the complexity is O(n**2), so it scales up
Bulba> very badly.

How about using sets?

oenglish = set([item['English'] for item in oldl])
nenglish = set([item['English'] for item in newl])

matching = oenglish & nenglish

Once you have those that match, you can constrain your outer loop to just
those cases where

o['English'] in matching

If you're not using 2.4 yet, then get sets via:

from sets import Set as set

That's still not all that Pythonic, but should be a bit faster.
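
In code, the constrained loop might look something like this (untested):

for o in oldl:
    if o['English'] not in matching:
        continue    # skip old rows whose English phrase has no counterpart
    for n in newl:
        if n['English'] == o['English']:
            # ... same processing as in the original loop ...
            break   # stop at the first match (assumes 'English' is unique in newl)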

You might want to sort your lists by the 'English' key. I don't know how to
use the new key arg to list.sort(), but you can still do it the
old-fashioned way:

oldl.sort(lambda a,b: cmp(a['English'], b['English']))
newl.sort(lambda a,b: cmp(a['English'], b['English']))

Once sorted, you can then march through the lists in parallel, which should
give you an O(n) algorithm.

Skip
 
Scott David Daniels

Skip said:
...lotsa great stuff ...
You might want to sort your lists by the 'English' key. I don't know how to
use the new key arg to list.sort(), but you can still do it the
old-fashioned way:

oldl.sort(lambda a,b: cmp(a['English'], b['English']))
newl.sort(lambda a,b: cmp(a['English'], b['English']))

To complete the thought, for 2.4 and after the new-fashioned way is:

import operator

oldl.sort(key=operator.itemgetter('English'))
newl.sort(key=operator.itemgetter('English'))
Once sorted, you can then march through the lists in parallel, which should
give you an O(n) algorithm.
But overall you will have O(n log n) because of the sorts.

--Scott David Daniels
 
Marc 'BlackJack' Rintsch

I put those dictionaries into the list:

oldl=[x for x in orig] # where orig=csv.DictReader(ofile ...

If you don't "do" anything with each `x` you can write this as:

oldl = list(orig)

Ciao,
Marc 'BlackJack' Rintsch
 
John Machin

Bulba! wrote:
[big snip]

Forget the csv-induced dicts for the moment, they're just an artifact
of your first solution attempt. Whether english = csv_row[1], or
english = csv_row_dict["english"], doesn't matter yet. Let's take a few
steps back, and look at what you are trying to do through a telescope
instead of a microscope.

Core basic root problem: you have a table with 3 columns, id, english,
polish. Nothing is said about id so let's assume nothing. Evidently
contents of "english" is the "key" for this exercise, but it's not
necessarily unique. You have two instances of the table, and you want
to "diff" them using "english" as the key.

You want to collect together all rows that have the same value of
"english", and then process them somehow. You need to define an object
containing a list of all "left" rows and a list of all "right" rows.

Processing the contents of the object: do whatever you have to with
obj.left_list and obj.right_list depending on the lengths; you have 3
cases of length to consider (0, 1, many). 3 x 3 = 9, but both being zero
would be a bug (which you should of course assert does not exist),
so that leaves 8 cases to think about.
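
A skeleton of that dispatch, with the real work left as comments, might
look like this (untested):

def process(english, obj):
    # Decide what to do based on how many old/new rows share this phrase.
    n_old = len(obj.left_list)
    n_new = len(obj.right_list)
    assert n_old or n_new    # both zero would be a bug
    if n_old == 1 and n_new == 1:
        pass    # the simple case: compare the two Polish translations
    elif n_new == 0:
        pass    # phrase exists only in the old file
    elif n_old == 0:
        pass    # phrase exists only in the new file
    else:
        pass    # duplicates on one or both sides: the "many" cases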

Now, how do we get the rows together:

(a) The sort method

Step 1: sort each dataset on (english, id, polish) or (english, polish,
id) -- your choice; sorting on the whole record instead of just english
makes the ordering predictable and repeatable.

Step 2: read the two sorted datasets ("left" and "right") in parallel:

when left key < right key: do_stuff(); read another left record
when left key > right key: converse
when left key == right key: do_stuff(); read another record for both
where do_stuff() includes appending to left_list and right_list as
appropriate and at the right moment, process a completed object.
This is a little tricky, and handling end of file needs a little care.
However this algorithm can be implemented with minimal memory ("core"),
no disk drive at all :-O and a minimum of 3 (preferably 4+)
serially-readable rewindable re-writable storage devices e.g. magnetic
tape drives. Once upon a time, it was *the* way of maintaining a
database (left = old database, right = transactions, output file -> new
database).

(b) Fast forwarding 30+ years, let's look at the dictionary method,
assuming you have enough memory to hold all your data at once:

Step 1: read the "left" table; for each row, if english not in mydict,
then do mydict[english] = MyObject(). In any case, do
mydict[english].left_list.append(row)
Step 2: same for the "right" table.
Step 3: for english, obj in mydict.iteritems(): process(english, obj)
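
In rough (untested) code, with process() standing for whatever you
decide to do with each completed object (as sketched above):

class MyObject(object):
    def __init__(self):
        self.left_list = []     # rows from the "left" (old) table
        self.right_list = []    # rows from the "right" (new) table

mydict = {}
for row in orig:                                # Step 1: the "left" table
    english = row['English']
    if english not in mydict:
        mydict[english] = MyObject()
    mydict[english].left_list.append(row)
for row in transl:                              # Step 2: the "right" table
    english = row['English']
    if english not in mydict:
        mydict[english] = MyObject()
    mydict[english].right_list.append(row)
for english, obj in mydict.iteritems():         # Step 3
    process(english, obj)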

As your datasets are stored in MS Excel spreadsheets, N < 64K so
whether your solution is O(N) or O(N*log(N)) doesn't matter too much.
You are however correct to avoid O(N**2) solutions.

Hoping this sketch of the view through the telescope (and the
rear-vision mirror!) is helpful,
John
 
Bulba!

(b) Fast forwarding 30+ years, let's look at the dictionary method,
assuming you have enough memory to hold all your data at once:

Step 1: read the "left" table; for each row, if english not in mydict,
then do mydict[english] = MyObject(). In any case, do
mydict[english].left_list.append(row)
Step 2: same for the "right" table.
Step 3: for english, obj in mydict.iteritems(): process(english, obj)
As your datasets are stored in MS Excel spreadsheets, N < 64K so
whether your solution is O(N) or O(N*log(N)) doesn't matter too much.
You are however correct to avoid O(N**2) solutions.

Following advice of two posters here (thanks) I have written two
versions of the same program, and both of them work, but the
difference in speed is drastic, about 6 seconds vs 190 seconds
for about 15000 processed records, taken from 2 lists of
dictionaries.

I've read "Python Performance Tips" at

http://manatee.mojam.com/~skip/python/fastpython.html

...but still don't understand why the difference is so big.

Both versions use local variables, etc. Both have their
lists initially sorted. Both essentially use a loop with a
conditional for comparison, then process the record in the
same way. The overhead of the second version is that it also
uses the cmp() function and two additional integer
variables - that should not slow the program down _so much_.

I have measured the run of snippet 2 with time checkpoints
written to a log (writing the time delta to the log every 200 records),
and even made a graph of the time deltas in a spreadsheet; in fact,
after an initial slowdown, snippet 2 looks exactly linear,
like this:

^ (time)
|
| /-----------
| /
|/
---------------> (# of records written)

So yes, it would scale to big files. However, why is it so
frigging slow?!



# snippet 1, this runs in about 6 seconds
!def prepend(l):
!    map = {}
!    for d in l:
!        key = d['English']
!        map[key] = d
!    return map
!
!old_map = prepend(oldl)
!new_map = prepend(newl)
!
!for engphrase in old_map:
!    if engphrase in new_map:
!        o = old_map[engphrase]
!        n = new_map[engphrase]
!        cm.writerow(matchpol(o,n))


# snippet 2, this needs 190 seconds
!while 1:
!    if len(oldl) == 0 or len(newl) == 0:
!        break
!    if oldl[o]['English'] == newl[n]['English']:
!        cm.writerow(matchpol(oldl[o], newl[n]))
!        del oldl[o]
!        del newl[n]
!        o, n = 0, 0
!        continue
!    elif cmp(oldl[o]['English'], newl[n]['English']) < 0:
!        if o == len(oldl):
!            cm.writerow(newl[0])
!            del(newl[0])
!            o, n = 0, 0
!            continue
!        o+=1
!    elif cmp(oldl[o]['English'], newl[n]['English']) > 0:
!        if n == len(newl):
!            cm.writerow(newl[0])
!            del(oldl[0])
!            o, n = 0, 0
!            continue
!        n+=1
 
Steven Bethard

Bulba! said:
Following advice of two posters here (thanks) I have written two
versions of the same program, and both of them work, but the
difference in speed is drastic, about 6 seconds vs 190 seconds
for about 15000 processed records, taken from 2 lists of
dictionaries.

I've read "Python Performance Tips" at

http://manatee.mojam.com/~skip/python/fastpython.html

..but still don't understand why the difference is so big.
[snip]

# snippet 1, this runs in about 6 seconds
!def prepend(l):
!    map = {}
!    for d in l:
!        key = d['English']
!        map[key] = d
!    return map
!
!old_map = prepend(oldl)
!new_map = prepend(newl)
!
!for engphrase in old_map:
!    if engphrase in new_map:
!        o = old_map[engphrase]
!        n = new_map[engphrase]
!        cm.writerow(matchpol(o,n))


# snippet 2, this needs 190 seconds
!while 1:
!    if len(oldl) == 0 or len(newl) == 0:
!        break
!    if oldl[o]['English'] == newl[n]['English']:
!        cm.writerow(matchpol(oldl[o], newl[n]))
!        del oldl[o]
!        del newl[n]
!        o, n = 0, 0
!        continue
!    elif cmp(oldl[o]['English'], newl[n]['English']) < 0:
!        if o == len(oldl):
!            cm.writerow(newl[0])
!            del(newl[0])
!            o, n = 0, 0
!            continue
!        o+=1
!    elif cmp(oldl[o]['English'], newl[n]['English']) > 0:
!        if n == len(newl):
!            cm.writerow(newl[0])
!            del(oldl[0])
!            o, n = 0, 0
!            continue
!        n+=1

I believe you're running into the fact that deleting from anywhere but
the end of a list in Python is O(n), where n is the number of items in
the list. Consider:

---------- test.py ----------
def delfromstart(lst):
    while lst:
        del lst[0]

def delfromend(lst):
    for i in range(len(lst)-1, -1, -1):
        del lst[i]
-----------------------------

[D:\Steve]$ python -m timeit -s "import test" "test.delfromstart(range(1000))"
1000 loops, best of 3: 1.09 msec per loop

[D:\Steve]$ python -m timeit -s "import test" "test.delfromend(range(1000))"
1000 loops, best of 3: 301 usec per loop

Note that Python lists are implemented basically as arrays, which means
that deleting an item from anywhere but the end of the list is O(n)
because all items in the list must be moved down to fill the hole.

Repeated deletes from a list are generally not the way to go, as your
example shows. =)

Steve
 
John Machin

Bulba! said:
(b) Fast forwarding 30+ years, let's look at the dictionary method,
assuming you have enough memory to hold all your data at once:

Step 1: read the "left" table; for each row, if english not in mydict,
then do mydict[english] = MyObject(). In any case, do
mydict[english].left_list.append(row)
Step 2: same for the "right" table.
Step 3: for english, obj in mydict.iteritems(): process(english,
obj)
As your datasets are stored in MS Excel spreadsheets, N < 64K so
whether your solution is O(N) or O(N*log(N)) doesn't matter too much.
You are however correct to avoid O(N**2) solutions.

Following advice of two posters here (thanks) I have written two
versions of the same program, and both of them work, but the
difference in speed is drastic, about 6 seconds vs 190 seconds
for about 15000 processed records, taken from 2 lists of
dictionaries.

I've read "Python Performance Tips" at

http://manatee.mojam.com/~skip/python/fastpython.html

..but still don't understand why the difference is so big.

Both versions use local variables, etc. Both have their
lists initially sorted. Both essentially use a loop with
conditional for comparison,
then process the record in the
same way.

"process the record in the same way"??? That's an interesting use of
"same".
The overhead of second version is that it also
uses cmp() function and two additional integer
variables - that should not slow the program _so much_.

I have measured the run of snippet 2 with time checkpoints
written to a log (writing the time delta to the log every 200 records),
and even made a graph of the time deltas in a spreadsheet; in fact,
after an initial slowdown, snippet 2 looks exactly linear,
like this:

^ (time)
|
| /-----------
| /
|/
---------------> (# of records written)

So yes, it would scale to big files.

On your empirical evidence, as presented. However, do read on ...
However, why is it so
frigging slow?!

Mainly, because you are (unnecessarily) deleting the first item of a
list. This requires copying the remaining items. It is O(N), not O(1).
You are doing this O(N) times, so the overall result is O(N**2). Your
graph has no obvious explanation; after how many cycles does the speed
become constant?

Secondly, you are calling cmp() up to THREE times when once is enough.
Didn't it occur to you that your last elif needed an else to finish it
off, and the only possible action for the else suite was "assert
False"?

It would appear after reading your "snippet 2" a couple of times that
you are trying to implement the old 3-tape update method.

It would also appear that you are assuming/hoping that there is never
more than one instance of a phrase in either list.

You need something a little more like the following.

Note that in your snippet2 it was not very obvious what you want to do
in the case where a phrase is in "new" but not in "old", and vice versa
-- under one circumstance (you haven't met "end of file") you do
nothing, but in the other circumstance you do something but seem to
have not only a polarity problem but also a copy-paste-edit problem. In
the following code I have abstracted the real requirements as
handle_XXX_unmatched()

!o = n = 0
!lenold = len(oldl)
!lennew = len(newl)
!while o < lenold and n < lennew:
!    cmp_result = cmp(oldl[o]['English'], newl[n]['English'])
!    if cmp_result == 0:
!        # Eng phrase is in both "new" and "old"
!        cm.writerow(matchpol(oldl[o], newl[n]))
!        o += 1
!        n += 1
!    elif cmp_result < 0:
!        # Eng phrase is in "old", not in "new"
!        handle_old_unmatched(o)
!        o += 1
!    else:
!        assert cmp_result > 0 # :)
!        # Eng phrase is in "new", not in "old"
!        handle_new_unmatched(n)
!        n += 1
!while o < lenold:
!    # EOF on new, some old remain
!    handle_old_unmatched(o)
!    o += 1
!while n < lennew:
!    # EOF on old, some new remain
!    handle_new_unmatched(n)
!    n += 1
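
(Purely for illustration -- the status values below are invented -- the
two handlers might do no more than flag the unmatched row and write it
out:)

!def handle_old_unmatched(o):
!    # Invented example: phrase appears only in "old" -- flag and write it.
!    row = dict(oldl[o])
!    row['New'] = ''
!    row['RowChanged'] = 'OLD-ONLY'
!    cm.writerow(row)
!
!def handle_new_unmatched(n):
!    # Invented example: phrase appears only in "new".
!    row = dict(newl[n])
!    row['New'] = row['Polish']    # the new translation
!    row['Polish'] = ''            # no old translation existed
!    row['RowChanged'] = 'NEW-ONLY'
!    cm.writerow(row)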

Some general hints: Try stating your requirements clearly, to yourself
first and then to us. Try to ensure that your code is meeting those
requirements before you bother timing it. Try not to use single-letter
names -- in particular, using l (that's "L".lower(), not 1 i.e.
str(4-3)) is barf-inducing and makes people likely not to want to read
your code.

HTH,
John
 
Bulba!

"process the record in the same way"??? That's an interesting use of
"same".

Meaning that when records with the same 'english' key are found, I just
write the comparison result to the result file.
On your empirical evidence, as presented. However, do read on ...
Mainly, because you are (unnecessarily) deleting the first item of a
list. This requires copying the remaining items. It is O(N), not O(1).
You are doing this O(N) times, so the overall result is O(N**2). Your
graph has no obvious explanation; after how many cycles does the speed
become constant?

After some 2000 records. I was "tracking" it with printouts of
variables and initially the main loop was taking advantage of
records being sorted, with the first condition (with keys
being equal) quickly writing the records into a file, and the
further into the dataset, the greater the distance was between
the beginning of the list, where the search started, and the
appropriate record, i.e. either o or n was becoming steadily
higher. This could explain linearity further down the path: the
list is becoming gradually shorter, so the deletion times
were decreasing linearly, but in the meantime the search time
was linearly increasing.
Secondly, you are calling cmp() up to THREE times when once is enough.
Didn't it occur to you that your last elif needed an else to finish it
off, and the only possible action for the else suite was "assert
False"?

Sure, but I was trying to make it shorter and absolutely clear
what I mean in this place (when people see the comparison in every
place, they see immediately what it was, they don't have to
recall the variable). Obviously I'd optimise it in practice. It
was supposed to be "proof of concept" rather than any working
code.

BTW, this particular example didn't work out, but I still found
Python to be the most convenient language for prototyping, I
wrote smth like five versions of this thing, just to learn
dictionaries, lists, sets, etc. In no other language would I have
done it so quickly.
It would appear after reading your "snippet 2" a couple of times that
you are trying to implement the old 3-tape update method.

Well, ahem, I admit you got me curious just how it would work
with tapes (never used them), so I was sort of trying to simulate
that - it's just a bit weird undertaking, I did it rather to explore
the issue and try to learn smth rather than to get working code.

Deleting the first item from a list was to be a convenient
equivalent of forwarding the reading head on the tape one
record ahead, because I assumed that any deletion from the
list was to take more or less the same time, just like reading
heads on tapes were probably reading the records with similar
speeds regardless of what length of tape was already read.
It would also appear that you are assuming/hoping that there is never
more than one instance of a phrase in either list.

Sure. Because using advice of Skip Montanaro I initially used sets
to eliminate duplicates. It's just not shown for brevity. If the
keys are guaranteed to be unique, it makes it easier to think
about the algorithm.
You need something a little more like the following.
Note that in your snippet2 it was not very obvious what you want to do
in the case where a phrase is in "new" but not in "old", and vice versa
-- under one circumstance (you haven't met "end of file") you do
nothing, but in the other circumstance you do something but seem to
have not only a polarity problem but also a copy-paste-edit problem.

What exactly is a "polarity problem"?

I concede that the code is rather contrived, but I didn't bother
to do smth like writing classes or functions that would simulate
a tape reader, so the result is rather ugly. I only posted it
because I could not explain why it was so slow.
In
the following code I have abstracted the real requirements as
handle_XXX_unmatched()

Thanks, I'll have to try that out.
 
Bulba!

Note that Python lists are implemented basically as arrays, which means
that deleting an item from anywhere but the end of the list is O(n)
because all items in the list must be moved down to fill the hole.

Ouch...



--
I have come to kick ass, chew bubble gum and do the following:

from __future__ import py3k

And it doesn't work.
 
John Machin

Bulba! said:
Sure, but I was trying to make it shorter and absolutely clear
what I mean in this place (when people see the comparison in every
place, they see immediately what it was, they don't have to
recall the variable). Obviously I'd optimise it in practice. It
was supposed to be "proof of concept" rather than any working
code.

Three is shorter than one? See immediately? I have to disagree. People
would be more put off by looking at your overly-complicated comparison
three times and having to check character-by-character that it was
doing exactly the same thing each time (i.e. your copy/paste had
worked) than by "recalling" a sensibly-named variable like
"cmp_result".
BTW, this particular example didn't work out, but I still found
Python to be the most convenient language for prototyping, I
wrote smth like five versions of this thing, just to learn
dictionaries, lists, sets, etc. In no other language I'd do
it so quickly.

NOW we're on the same bus.
Well, ahem, I admit you got me curious just how it would work
with tapes (never used them), so I was sort of trying to simulate
that - it's just a bit weird undertaking, I did it rather to explore
the issue and try to learn smth rather than to get working code.

The 3-tape technique is worth understanding, for handling datasets that
won't fit into memory. More generally, when learning, you *need* to get
working code out of your practice exercises. Otherwise what are you
learning? You don't want to wait until you have two 10GB datasets to
"diff" before you start on thinking through, implementing and testing
what to do on end-of-file.
Deleting the first item from a list was to be a convenient
equivalent of forwarding the reading head on the tape one
record ahead, because I assumed that any deletion from the
list was to take more or less the same time, just like reading
heads on tapes were probably reading the records with similar
speeds regardless of what length of tape was already read.

A reasonable assumption; however if the reading stopped and restarted,
an inordinate amount of time was taken accelerating the drive up to
reading speed again. The idea was to keep the tape moving at all times.
This required techniques like double-buffering, plus neat clean brief
fast processing routines.

.... but you mixed in moving the indexes (with e.g. o+=1) with confusing
results.

Deleting the first item of a list in this circumstance reminds me of
the old joke: "Q: How many members of military organisation X does it
take to paint the barracks? A: 201, 1 to hold the paint-brush and 200
to lift the barracks and rotate it."

Tip 1: Once you have data in memory, don't move it, move a pointer or
index over the parts you are inspecting.

Tip 2: Develop an abhorrence of deleting data.
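
Applied to your loop, Tip 1 in miniature:

# Scan with an index; the list itself is never modified.
o = 0
while o < len(oldl):
    row = oldl[o]      # look at the current record in place
    # ... process row ...
    o += 1             # "advance the read head" by moving the index, not by deleting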
Sure. Because using advice of Skip Montanaro I initially used sets
to eliminate duplicates.

I see two problems with your implementation of Skip's not-unreasonable
suggestion. One: You've made the exercise into a multi-pass algorithm.
Two: As I posted earlier, you need to get all the instances of the same
key together at the same time. Otherwise you get problems. Suppose you
have in each list, two translations apple -> polish1 and apple ->
polish2. How can you guarantee that you remove the same duplicate from
each list, if you remove duplicates independently from each list?
It's just not shown for brevity.

Then say so. On the evidence, brevity was not a plausible explanation
for the absence. :)
If the
keys are guaranteed to be unique, it makes it easier to think
about the algorithm.

Unfortunately in the real world you can't just imagine the problems
away. Like I said, you have to think about the requirements first -- 9
cases of (0, 1, many) x (0, 1, many); what do you need to do? Then
think about an implementation.
What exactly is a "polarity problem"?

You appeared to be treating "old" like I would have expected you to
treat "new" and vice versa.
I concede that the code is rather contrived, but I didn't bother
to do smth like writing classes or functions that would simulate
a tape reader,
so the result is rather ugly. I only posted it
because I could not explain why it were so slow.

Was what I posted ugly? Did you see in it any classes or functions
simulating tape drives?
 
Andrea Griffini

Tip 1: Once you have data in memory, don't move it, move a pointer or
index over the parts you are inspecting.

Tip 2: Develop an abhorrence of deleting data.

I've to admit that I also found strange that deleting the
first element from a list is not O(1) in python. My wild
guess was that the extra addition and normalization required
to have insertion in amortized O(1) and deletion in O(1) at
both ends of a random access sequence was going to have
basically a negligible cost for normal access (given the
overhead that is already present in python).

But I'm sure this idea is too obvious for not having been
proposed, and so there must be reasons for refusing it
(may be the cost to pay for random access once measured was
found being far from negligible, or that the extra memory
overhead per list - one int for remembering where the live
data starts - was also going to be a problem).

Andrea
 
John Machin

Andrea said:
I've to admit that I also found strange that deleting the
first element from a list is not O(1) in python. My wild
guess was that the extra addition and normalization required
to have insertion in amortized O(1) and deletion in O(1) at
both ends of a random access sequence was going to have
basically a negligible cost for normal access (given the
overhead that is already present in python).

But I'm sure this idea is too obvious for not having been
proposed, and so there must be reasons for refusing it
(may be the cost to pay for random access once measured was
found being far from negligible, or that the extra memory
overhead per list - one int for remembering where the live
data starts - was also going to be a problem).

My wild guess: Not a common use case. Double-ended queue is a special
purpose structure.

Note that the OP could have implemented the 3-tape update simulation
efficiently by reading backwards i.e. del alist[-1]
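
i.e. something like this untested sketch, with both lists already
sorted on 'English' and then reversed:

oldl.reverse()
newl.reverse()
while oldl and newl:
    o_row, n_row = oldl[-1], newl[-1]      # the *smallest* remaining keys
    c = cmp(o_row['English'], n_row['English'])
    if c == 0:
        cm.writerow(matchpol(o_row, n_row))
        del oldl[-1]                       # deleting at the end is O(1)
        del newl[-1]
    elif c < 0:
        del oldl[-1]                       # old phrase with no match in new
    else:
        del newl[-1]                       # new phrase with no match in old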

<humour>
Suggested projects for you, in increasing order of difficulty:

1. Grab the source code (listobject.c) and report back on how you would
implement your proposal.
2. Convince folk that your implementation is faster and more robust and
has better internal documentation than anything the timbot could ever
write.
3. Write a PEP that didn't cause a flamewar.
</humour>
 
Kent Johnson

Andrea said:
I've to admit that I also found strange that deleting the
first element from a list is not O(1) in python. My wild
guess was that the extra addition and normalization required
to have insertion in amortized O(1) and deletion in O(1) at
both ends of a random access sequence was going to have
basically a negligible cost for normal access (given the
overhead that is already present in python).

This was added to Python 2.4 as collections.deque
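
For example:

from collections import deque

d = deque(['a', 'b', 'c'])
d.popleft()          # removes 'a' in O(1), unlike del somelist[0]
d.appendleft('z')    # O(1) at the left end as well
print d              # deque(['z', 'b', 'c'])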

Kent
 
Andrea Griffini

My wild guess: Not a common use case. Double-ended queue is a special
purpose structure.

Note that the OP could have implemented the 3-tape update simulation
efficiently by reading backwards i.e. del alist[-1]

Note that I was just trying to say that it's not obvious
that list insertion at the first element is O(n); because
there are "less naive" implementations that can do better.
For a lower level language O(n) is probably what 99% of
programmers indeed would expect, but for a VHLL like
python this is IMO not the case.

I remember, when working with PowerBuilder a few years ago
(a RAD environment for client-server applications), being greatly
surprised to find that even adding at the end of a
list was O(n) in that language... where is the line?
After all "smart" reallocation is still a tradeoff (extra
"wasted" space traded for diminished copying) ...

Andrea
 
Nick Coghlan

John said:
My wild guess: Not a common use case. Double-ended queue is a special
purpose structure.

As Kent said, the suggestion of making index 0 insertions and deletions on lists
more efficient was made, and the decision was to leave list alone and provide
collections.deque instead. This let deque sacrifice some of list's flexibility
in favour of increased speed.

Appropriate parts of the core which needed a FIFO were then updated to use the
new data type, while everything else continues to use lists.

Cheers,
Nick.
 
Bulba!

I've to admit that I also found strange that deleting the
first element from a list is not O(1) in python. My wild
guess was that the extra addition and normalization required
to have insertion in amortized O(1) and deletion in O(1) at
both ends of a random access sequence was going to have
basically a negligible cost for normal access (given the
overhead that is already present in python).

Something similar happened here - I _assumed_ that since lists in
Python can hold arbitrary objects, the low-level C implementation
was something like an "array" of pointers to those objects (or offsets of
those objects stored elsewhere).

If so, deleting an element from a list would merely be deleting a
pointer (or zeroing a particular offset to get this object "released"),
which would have negligible cost.

I've read that Python's memory management technique
is 'reference counting' - it would seem only logical that
the list is an array (or linked list or linked list of arrays
maybe, as it can grow in size arbitrarily) of such
references.
But I'm sure this idea is too obvious for not having been
proposed, and so there must be reasons for refusing it
(may be the cost to pay for random access once measured was
found being far from negligible, or that the extra memory
overhead per list - one int for remembering where the live
data starts - was also going to be a problem).

But either the book-keeping of offsets or pointers or references
to list elements has to be done by the Python interpreter, oops,
VIRTUAL MACHINE, somehow anyway.

I don't see why deleting an element from a list should be O(n), while
saying L[0]='spam' when L[0] previously was, say, 's', should not have
the O(n) cost, if a list in Python is just an array containing the
objects themselves?

Why should JUST deletion have an O(n) cost?



--
I have come to kick ass, chew bubble gum and do the following:

from __future__ import py3k

And it doesn't work.
 
Andrea Griffini

I don't see why deleting an element from a list should be O(n), while
saying L[0]='spam' when L[0] previously was, say, 's', should not have
the O(n) cost, if a list in Python is just an array containing the
objects themselves?

Why should JUST deletion have an O(n) cost?

Because after deletion L[1] moved to L[0], L[2] moved to L[1],
L[3] moved to L[2] and so on. To delete the first element you
have to move n-1 pointers and this is where O(n) comes from.
When you reassign any element there is no need to move the
others around, so that's why you have O(1) complexity.
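
Spelled out, "del L[0]" has to do roughly this:

# Conceptually, what "del L[0]" costs:
def del_first(L):
    for i in range(1, len(L)):
        L[i-1] = L[i]    # shift every remaining pointer down one slot: O(n)
    del L[-1]            # drop the now-duplicated last slot: O(1)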

With a data structure slightly more complex than an array
you can have random access in O(1), deletion of elements
O(1) at *both ends* and insertion in amortized O(1) at
*both ends*. This data structure is called a double-ended
queue (nickname "deque") and is available in Python.

The decision was that for the basic list object the overhead
added by deques for element access (it's still O(1), but a
bit more complex than just bare pointer arithmetic) and, I
guess, the hassle of changing a lot of working code and
breaking compatibility with extensions that manipulate lists
directly (no idea if such a thing exists) was not worth the gain.

The gain would have been that people who don't know what O(n)
means and who use lists for long FIFOs would get fast
programs anyway without understanding why. With the current
solution they just have to use deques instead of lists.

After thinking about it for a while I agree that this is a
reasonable choice. The gain is anyway IMO very little,
because if a programmer doesn't understand what O(n) is,
then the probability that any reasonably complex program
they write is going to be fast is zero anyway... time would
just be wasted somewhere else for no reason.

Andrea
 
