Beginner Question : Iterators and zip

moogyd

Hi group,

I have a basic question on the zip built-in function.

I am writing a simple text file comparison script, that compares line
by line and character by character. The output is the original file,
with an X in place of any characters that are different.

I have managed a solution for a fixed (3) number of files, but I want
a solution of any number of input files.

The outline of my solution:

for vec in zip(vec_list[0],vec_list[1],vec_list[2]):
    res = ''
    for entry in zip(vec[0],vec[1],vec[2]):
        if len(set(entry)) > 1:
            res = res+'X'
        else:
            res = res+entry[0]
    outfile.write(res)

So vec is a tuple containing a line from each file, and then entry is
a tuple containing a character from each line.

Two questions:
1) What is the general solution? Using zip in this way looks wrong. Is
there another function that does what I want?
2) I am using set to remove any repeated characters. Is there a
"better" way?

Any other comments/suggestions appreciated.

Thanks,

Steven
 
bruno.desthuilliers

1) What is the general solution. Using zip in this way looks wrong. Is
there another function that does what I want

zip is (mostly) ok. What you're missing is how to use it for any
arbitrary number of sequences. Try this instead:

>>> lists = [range(5), range(5,11), range(11, 16)]
>>> lists
[[0, 1, 2, 3, 4], [5, 6, 7, 8, 9, 10], [11, 12, 13, 14, 15]]
>>> for item in zip(*lists):
...     print item
...
(0, 5, 11)
(1, 6, 12)
(2, 7, 13)
(3, 8, 14)
(4, 9, 15)
>>> lists = [range(5), range(5,11), range(11, 16), range(16, 20)]
>>> for item in zip(*lists):
...     print item
...
(0, 5, 11, 16)
(1, 6, 12, 17)
(2, 7, 13, 18)
(3, 8, 14, 19)

The only caveat with zip() is that it will only use as many items as
there are in your shortest sequence, i.e.:

>>> zip(range(3), range(10))
[(0, 0), (1, 1), (2, 2)]
>>> zip(range(30), range(10))
[(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7), (8,
8), (9, 9)]

So you'd better pad your sequences to make them as long as the longest
one. There are idioms for doing this using the itertools module's
chain and repeat iterators, but I'll leave a concrete example as an
exercise to the reader !-)

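One possible take on the padding exercise, as a minimal sketch rather than the idiom the poster had in mind. The helper name pad is hypothetical, and islice is an addition not mentioned in the post; it is needed here because chain(seq, repeat(fill)) never ends on its own. Written for Python 3.

```python
from itertools import chain, islice, repeat

def pad(seq, length, fill=None):
    """Return seq as a list, extended with `fill` up to exactly `length` items."""
    # chain(seq, repeat(fill)) yields seq followed by an endless run of fill;
    # islice cuts that stream off at the requested length.
    return list(islice(chain(seq, repeat(fill)), length))

lists = [list(range(3)), list(range(10, 15))]
longest = max(len(s) for s in lists)

# Pad every sequence to the longest length, then zip as usual.
padded = [pad(s, longest) for s in lists]
rows = list(zip(*padded))
# rows == [(0, 10), (1, 11), (2, 12), (None, 13), (None, 14)]
```

With the padding in place, zip no longer silently drops the tail of the longer sequence; the fill value marks where a shorter one ran out.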
2) I am using set to remove any repeated characters. Is there a
"better" way ?

That's probably what I'd do too.

Any other comments/suggestions appreciated.

There's a difflib package in the standard lib. Did you give it a try ?
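For reference, a small sketch of what difflib offers, using SequenceMatcher on two made-up strings. Note the assumptions: difflib compares two sequences at a time and aligns them (so it copes with insertions and deletions), which is not the same thing as the strict position-by-position comparison across N files asked about above.

```python
import difflib

a = "hello world"
b = "hallo world"

# SequenceMatcher(isjunk, a, b); None means no element is treated as junk.
matcher = difflib.SequenceMatcher(None, a, b)

# Each opcode is (tag, a_start, a_end, b_start, b_end).
opcodes = matcher.get_opcodes()
for tag, i1, i2, j1, j2 in opcodes:
    print(tag, repr(a[i1:i2]), repr(b[j1:j2]))
```

The 'replace' entries identify the spans that differ; the 'equal' entries cover the text the two strings share.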
 
Terry Reedy

1) What is the general solution. Using zip in this way looks wrong. Is
there another function that does what I want

zip(*vec_list) will zip together all entries in vec_list
Do be aware that zip stops on the shortest iterable. So if vec[1] is
shorter than vec[0] and matches otherwise, your output line will be
truncated. Or if vec[1] is longer and vec[0] matches as far as it goes,
there will be no signal either.

res = res + whatever can be written as res += whatever

2) I am using set to remove any repeated characters. Is there a
"better" way ?

I might have written a third loop to compare vec[0] to vec[1]..., but
your set solution is easier and prettier.

If speed is an issue, don't rebuild the output line char by char. Just
change what is needed in a mutable copy. I like this better anyway.

res = list(vec[0]) # if all ascii, in 3.0 use bytearray
for n, entry in enumerate(zip(vec[0],vec[1],vec[2])):
    if len(set(entry)) > 1:
        res[n] = 'X'
outfile.write(''.join(res)) # in 3.0, write(res)

tjr
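Putting the two suggestions from this reply together (zip(*...) for any number of inputs, plus editing a mutable copy of the first line), the whole comparison might look roughly like this in Python 3. The function names mask_differences and compare_files and the sample data are made up for illustration; vec_list is assumed to hold one list of lines per file.

```python
def mask_differences(lines, marker="X"):
    """Return lines[0] with `marker` at every position where the lines disagree."""
    res = list(lines[0])                       # mutable copy of the first line
    for n, chars in enumerate(zip(*lines)):    # one tuple of characters per column
        if len(set(chars)) > 1:                # more than one distinct character
            res[n] = marker
    return "".join(res)

def compare_files(vec_list, marker="X"):
    """Apply mask_differences to each group of corresponding lines."""
    return [mask_differences(vec, marker) for vec in zip(*vec_list)]

vec_list = [
    ["abc\n", "hello\n"],   # lines of file 1
    ["abd\n", "hello\n"],   # lines of file 2
    ["abc\n", "hexlo\n"],   # lines of file 3
]
result = compare_files(vec_list)
# result == ['abX\n', 'heXlo\n']
```

The truncation caveat still applies: positions past the end of the shortest line are never compared, and files with extra lines are cut off at the shortest file.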
 
moogyd

On 12 Jul, 20:55, (e-mail address removed) wrote:

zip is (mostly) ok. What you're missing is how to use it for any
arbitrary number of sequences. Try this instead:

>>> lists = [range(5), range(5,11), range(11, 16)]
>>> lists
[[0, 1, 2, 3, 4], [5, 6, 7, 8, 9, 10], [11, 12, 13, 14, 15]]
>>> for item in zip(*lists):
...     print item
...
(0, 5, 11)
(1, 6, 12)
(2, 7, 13)
(3, 8, 14)
(4, 9, 15)

What is this *lists operation called? I am having trouble finding any
reference to it in the Python docs or the book Learning Python.

There's a difflib package in the standard lib. Did you give it a try ?

I'll check it out, but I am a newbie, so I am writing this as a
(useful) learning exercise.

Thanks for the help

Steven
 
Terry Reedy

What is this *lists operation called? I am having trouble finding any
reference to it in the Python docs or the book Learning Python.

One might call this argument unpacking, but
Language Manual / Expressions / Primaries / Calls
simply calls it *expression syntax.
"If the syntax *expression appears in the function call, expression must
evaluate to a sequence. Elements from this sequence are treated as if
they were additional positional arguments; if there are positional
arguments x1,...,xN, and expression evaluates to a sequence
y1,...,yM, this is equivalent to a call with M+N positional arguments
x1,...,xN,y1,...,yM."

See Compound Statements / Function definitions for the mirror syntax in
definitions.

tjr
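A short illustration of both directions of the * syntax described above, as a sketch in Python 3. The function name describe and the sample values are invented for the example.

```python
def describe(*args):
    """Mirror syntax in a definition: extra positional arguments arrive as a tuple."""
    return args

lists = [[0, 1, 2], [5, 6, 7], [11, 12, 13]]

# In a call, *lists spreads the elements out as separate positional arguments,
# so these two calls receive exactly the same arguments.
unpacked = describe(*lists)
explicit = describe(lists[0], lists[1], lists[2])

# Ordinary positional arguments can precede the unpacked ones (the x1..xN case).
mixed = describe("x1", *lists)

# This is why zip(*lists) works for any number of sequences.
zipped = list(zip(*lists))
# zipped == [(0, 5, 11), (1, 6, 12), (2, 7, 13)]
```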
 
cokofreedom

zip(*vec_list) will zip together all entries in vec_list
Do be aware that zip stops on the shortest iterable. So if vec[1] is
shorter than vec[0] and matches otherwise, your output line will be
truncated. Or if vec[1] is longer and vec[0] matches as far as it goes,
there will be no signal either.

Do note that from Python 3.0 there is another form of zip
(itertools.zip_longest, available as itertools.izip_longest in Python
2.6) that reads until all the iterables are exhausted, padding the
shorter ones with a settable fill value. Very useful!
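A minimal Python 3 sketch of how that could apply to the character comparison, assuming an empty string is an acceptable fill value for positions past the end of a shorter line. The sample lines are made up.

```python
from itertools import zip_longest

lines = ["abc", "abcde", "ab"]

# Unlike zip, zip_longest keeps going until the longest input is exhausted,
# substituting fillvalue where a shorter input has run out.
columns = list(zip_longest(*lines, fillvalue=""))

# A column counts as different if it holds more than one distinct value,
# so a missing character (the fill value) is flagged as well.
masked = "".join("X" if len(set(col)) > 1 else col[0] for col in columns)
# masked == 'abXXX'
```

Here the output is as long as the longest line, so a length mismatch shows up as a run of X instead of being silently dropped.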
 
moogyd

Thanks,

It's starting to make sense :)

Steven
 
