are elements of a list in sequence in list b

Matthew_WARREN · Feb 8, 2008

Hallo,

I need to search list a for the sequence of list b

First I went

a=[1,2,34,4,5,6]
b=[2,3,4]
a in b

Click to expand...

Click to expand...

False

So

''.join([ v.__str__() for v in b ]) in ''.join([ v.__str__() for v in a ])

s=SomeObject()
a=[1,2,3,[s,3,[4,s,9]],s,4]
b=[3,[s,3,[4,s,9]],s,4]
''.join([ v.__str__() for v in b ]) in ''.join([ v.__str__() for v in a

Click to expand...

Click to expand...

])
True

Was my second guess. I thought about set() but it doesn't see element order
as important.

That feels like a kludge to me, but makes python do the work of searching.

I know there are fast algorithmic ways of achieving it by hand, but my
lists are only 20 to 30 elements long, and I will be making max 100
searches at a time, probably about once every 2 or 3 seconds.

So speed not greatly important, A feeling I've written good code and taken
a good approach is. I've hunted for a recipe but couldn't find one to look
at and see how it was done.

Matt.
--
--

This message and any attachments (the "message") is
intended solely for the addressees and is confidential.
If you receive this message in error, please delete it and
immediately notify the sender. Any use not in accord with
its purpose, any dissemination or disclosure, either whole
or partial, is prohibited except formal approval. The internet
can not guarantee the integrity of this message.
BNP PARIBAS (and its subsidiaries) shall (will) not
therefore be liable for the message if modified.
Do not print this message unless it is necessary,
consider the environment.

---------------------------------------------

Ce message et toutes les pieces jointes (ci-apres le
"message") sont etablis a l'intention exclusive de ses
destinataires et sont confidentiels. Si vous recevez ce
message par erreur, merci de le detruire et d'en avertir
immediatement l'expediteur. Toute utilisation de ce
message non conforme a sa destination, toute diffusion
ou toute publication, totale ou partielle, est interdite, sauf
autorisation expresse. L'internet ne permettant pas
d'assurer l'integrite de ce message, BNP PARIBAS (et ses
filiales) decline(nt) toute responsabilite au titre de ce
message, dans l'hypothese ou il aurait ete modifie.
N'imprimez ce message que si necessaire,
pensez a l'environnement.

Paul Hankin · Feb 8, 2008

[email protected] said:
I need to search list a for the sequence of list b

def list_contains(a, b):
return any(a[i:i+len(b)] == b for i in range(len(a) - len(b) + 1))

list_contains(range(1, 7), [2, 3, 4])

Arnaud Delobelle · Feb 8, 2008

[email protected] said:
[email protected] said:

I need to search list a for the sequence of list b

Click to expand...

def list_contains(a, b):
return any(a[i:i+len(b)] == b for i in range(len(a) - len(b) + 1))

list_contains(range(1, 7), [2, 3, 4])

This is more careful but not necessarilly faster in python:

def issubseq(s, l):
indices = [0]
try:
for x in l:
indices = [i+1 for i in indices if s == x]
indices.append(0)
return indices[0] == len(s)
except IndexError:
return True

issubseq('m&e', 'spam&eggs)

Tomek Paczkowski · Feb 8, 2008

Hallo,

I need to search list a for the sequence of list b

I guess most pythonic would be sth like this:

Code:

def naive_seq_match(sequence, pattern):
    """Serches for pattern in sequence. If pattern is found
    returns match start index, else returns None"""
    plen = len(pattern)
    for ind in xrange(len(sequence) - plen + 1):
        if sequence[ind:ind+plen] == pattern:
            return ind
    return None

It can be used with every sequence, not only lists. If its not fast enough
(its pessimistic complexity is O(n^2)), you can try Knuth-Morris-Pratt
algorithm (its O(n), but requires some precomputation).

Steven D'Aprano · Feb 8, 2008

Hallo,

I need to search list a for the sequence of list b

First I went

a=[1,2,34,4,5,6]
b=[2,3,4]
a in b

Click to expand...

Click to expand...

False

You have the order of a and b mixed up there. As you have it, you are
searching b for a.

So

''.join([ v.__str__() for v in b ]) in ''.join([ v.__str__() for v in a
])

Ergh!!! What's with calling the special method __str__? This will do the
same thing and be infinitely less ugly:

''.join([str(v) for v in b]) in ''.join([str(v) for v in a])

This is even simpler:

str(b)[1:-1] in str(a)

but it is still a nasty kludge.

Unfortunately, both kludges fail with some fairly obvious sets of data.
For example:

a = [1, 2, 3, '4+4 is 8', 12]
b = [3, 4]
''.join([v.__str__() for v in b]) in ''.join([v.__str__() for v in a])

Click to expand...

Click to expand...

True

a = [1, 2, '()[]{}', 3]
b = [[]]
str(b)[1:-1] in str(a)

Click to expand...

Click to expand...

True

We could mess about with different versions, maybe use repr() instead of
str(), but it's a losing battle: there will always be some set of data
that will fail.

[snip]

I know there are fast algorithmic ways of achieving it by hand, but my
lists are only 20 to 30 elements long, and I will be making max 100
searches at a time, probably about once every 2 or 3 seconds.

What you're trying to do is the equivalent of substring searching. There
are simple algorithms that do this, and there are fast algorithms, but
the simple ones aren't fast and the fast ones aren't simple.

However, for your use case, the simple algorithm is probably fast enough:

def is_sublist(b, a):
"""Return True if b is a sub-list of a, otherwise False"""
# Get the simple cases out of the way first, for speed.
if len(a) < len(b): return False
elif a == b: return True
elif not b: return False # Or maybe True, if you prefer.
lb = len(b)
for i in xrange(len(a)-lb+1):
if a[i:i+lb] == b:
return True
return False

bearophileHUGS · Feb 9, 2008

Steven D'Aprano:

What you're trying to do is the equivalent of substring searching. There
are simple algorithms that do this, and there are fast algorithms, but
the simple ones aren't fast and the fast ones aren't simple.
However, for your use case, the simple algorithm is probably fast enough:

def is_sublist(b, a):
"""Return True if b is a sub-list of a, otherwise False"""
# Get the simple cases out of the way first, for speed.
if len(a) < len(b): return False
elif a == b: return True
elif not b: return False # Or maybe True, if you prefer.
lb = len(b)
for i in xrange(len(a)-lb+1):
if a[i:i+lb] == b:
return True
return False

I think a compromise can be found, this seem to work, and fast enough
(Psyco helps a lot of this kind of code):

def issubseq(sub, items):
"""issubseq(sub, items): return true if the sequence 'sub' is a
contiguous
subsequence of the 'items' sequence.
Traceback (most recent call last):
...
TypeError: issubseq() takes exactly 2 arguments (0 given) Traceback (most recent call last):
...
TypeError: issubseq() takes exactly 2 arguments (1 given)

>>> issubseq(1, [1, 2, 3])

Click to expand...

Click to expand...

Traceback (most recent call last):
...
TypeError: object of type 'int' has no len()

>>> isi = lambda s,i: int(issubseq(s,i))
>>> isi([], []) 1
>>> isi("a", "") 0
>>> isi("", "a") 1
>>> isi("", "aaa") 1
>>> isi("a", "a") 1
>>> isi("ab", "bab") 1
>>> isi("ab", "bab") 1
>>> isi("ba", "bbb") 0
>>> isi("bab", "ab") 0
>>> isi(("a", "b"), ("a","a","b")) 1
>>> isi(("a", "b"), ("a","a","c")) 0
>>> isi([1,2,1], [3,5, 1,2,4, 1,2,1, 6]) 1
>>> isi([1,2,1], [3,5, 1,2,4, 1,2,3, 6]) 0
>>> l = [1] * 100 + [1,2,3] + [4] * 100
>>> isi([1,2,3], l) 1
>>> l = [1] * 100 + [1,2,4] + [5] * 100
>>> isi([1,2,3], l)

Click to expand...

Click to expand...

0
"""
len_sub = len(sub)
len_items = len(items)
if len_sub == 0:
return True
if len_sub > len_items:
return False
if len_items == 0:
return False
if sub == items:
return True

table = [0] * (len_sub + 1)

# building prefix-function
m = 0
for i in xrange(1, len_sub):
while m > 0 and sub[m] != sub:
m = table[m - 1]
if sub[m] == sub:
m += 1
table = m

# searching
m, i = 0, 0
for x in items:
while m > 0 and sub[m] != x:
m = table[m - 1]
if sub[m] == x:
m += 1
if m == len_sub:
return True
i += 1

return False

def _is_sublist(sub, items):
"""Return True if sub is a sub-list of items, otherwise False.
Traceback (most recent call last):
...
TypeError: _is_sublist() takes exactly 2 arguments (0 given) Traceback (most recent call last):
...
TypeError: _is_sublist() takes exactly 2 arguments (1 given)

>>> _is_sublist(1, [1, 2, 3])

Click to expand...

Click to expand...

Click to expand...

Traceback (most recent call last):
...
TypeError: object of type 'int' has no len()

>>> isi = lambda s,i: int(_is_sublist(s,i))
>>> isi([], []) 1
>>> isi("a", "") 0
>>> isi("", "a") 1
>>> isi("", "aaa") 1
>>> isi("a", "a") 1
>>> isi("ab", "bab") 1
>>> isi("ab", "bab") 1
>>> isi("ba", "bbb") 0
>>> isi("bab", "ab") 0
>>> isi(("a", "b"), ("a","a","b")) 1
>>> isi(("a", "b"), ("a","a","c")) 0
>>> isi([1,2,1], [3,5, 1,2,4, 1,2,1, 6]) 1
>>> isi([1,2,1], [3,5, 1,2,4, 1,2,3, 6]) 0
>>> l = [1] * 100 + [1,2,3] + [4] * 100
>>> isi([1,2,3], l) 1
>>> l = [1] * 100 + [1,2,4] + [5] * 100
>>> isi([1,2,3], l)

Click to expand...

Click to expand...

Click to expand...

0
"""
# By Steven D'Aprano, modified
len_sub = len(sub)
len_items = len(items)
if len_sub == 0:
return True
if len_sub > len_items:
return False
if len_items == 0:
return False
if sub == items:
return True

for i in xrange(len_items - len_sub + 1):
if items[i : i+len_sub] == sub:
return True
return False

if __name__ == "__main__":
import doctest
doctest.testmod()
print "Doctests finished.\n"

def subsequence_benchmark(with_psyco=False):
from random import randrange
from time import clock

N = 100000
NBIG = 1000

small_sub = [1, 2, 3]
big_sub = [randrange(10) for _ in xrange(NBIG)]

small_present = [1] * N + small_sub + [N] * N
small_absent1 = [1] * N + small_sub[:-1] + [N] * N
small_absent2 = small_sub[:-1] * N
big_present = [1] * N + big_sub + [N] * N
big_absent = [1] * N + big_sub[:-1] + [N] * N

test_cases = [(small_sub, small_present), (small_sub,
small_absent1),
(small_sub, small_absent2),
(big_sub, big_present), (big_sub, big_absent)]
algos = [issubseq, _is_sublist]

if with_psyco:
import psyco
for algo in algos:
psyco.bind(algo)
print "With psyco"
else:
print "Without psyco"

print "N, NBIG =", N, NBIG
for sub, seq in test_cases:
print "len(sub),len(seq)= %d %d" % (len(sub), len(seq))
result = []
for finder in algos:
t0 = clock()
r = finder(sub, seq)
t1 = clock()
result.append(r)
print " %s: %s (time: %.02f)" %(finder.func_name, r,
round(t1-t0,2))
assert(all(r==True for r in result) or all(r==False for r
in result))

#subsequence_benchmark(with_psyco=True)

There are ways to use less memory or to go faster, but they may be
overkill:
http://www.icu-project.org/docs/papers/efficient_text_searching_in_java.html

Bye,
bearophile

Steven D'Aprano · Feb 9, 2008

def issubseq(sub, items):
"""issubseq(sub, items): return true if the sequence 'sub' is a
contiguous subsequence of the 'items' sequence.

[snip]

A stylistic note for you...

I believe that it is considered an abuse of doctest to write a function
with 28 lines of code and 19 tests (about two tests per three LOC). Your
second function (adapted from mine) with 18 tests and only 14 LOC is even
more abusive.

(At least, *I* consider it an abuse of doctest.)

As I understand it, doctests are not supposed to be used to cover trivial
and uninteresting cases, and many of your tests were trivial and
uninteresting. Doc tests are meant to be first and foremost
*documentation*. For extensive, cover-every-case testing, you should be
using unit testing.

For example, given that your docstring stated that both arguments to
issubseq() are sequences, I question the need to have a doctest showing
issubseq() fail when one of the arguments is not a sequence. In my
opinion, that *doc* test is redundant (it shows nothing the reader hasn't
already seen), and should be a unit test.

And another thing... your timing code does this:

from time import clock

Linux, Unix and Macintosh users of your code may hate you for this. As
the timeit module explains:

... on Windows, clock() has microsecond granularity but time()'s
granularity is 1/60th of a second; on Unix, clock() has 1/100th
of a second granularity and time() is much more precise.

If you're timing fast code, it's probably a good idea to follow Tim
Peters' recommendation and use time() on Unix and clock() on Windows. An
easy way to do that is by importing timeit.default_timer.

Paddy · Feb 9, 2008

def issubseq(sub, items):
"""issubseq(sub, items): return true if the sequence 'sub' is a
contiguous subsequence of the 'items' sequence.

Click to expand...

[snip]

A stylistic note for you...

I believe that it is considered an abuse of doctest to write a function
with 28 lines of code and 19 tests (about two tests per three LOC). Your
second function (adapted from mine) with 18 tests and only 14 LOC is even
more abusive.

(At least, *I* consider it an abuse of doctest.)

Hi Steven,
I disagree with you there. A large problem, along with poor
documentation, is no testing.
If doctest makes it much easier to test, use doctest. If you prefer
other methods then use them, but applaud Bearophile for not only
testing, but showing his tests. If it doesn't interest you then skip
it, but it at least has shown that Bearophile thought to test his
code.

The blatent and interesting bugs are usually the first thought of, and
found. It usually takes thorough analysis of the program under testing
before it is wise to remove a test and trivial, or uninteresting are
not usually reasons to remove a test wothout qualification.

Bearophile might want to use another file for his doctests, but that
would hardly help when posting to usenet.

- Paddy.

bearophileHUGS · Feb 9, 2008

Steven D'Aprano:

I believe that it is considered an abuse of doctest to write a function
with 28 lines of code and 19 tests

I agree with you. Most of my functions/methods have more compact
doctests (putting things on many lines has the advantage of letting
you locate the failing tests in a simpler way).

In my opinion, that *doc* test is
redundant (it shows nothing the reader hasn't
already seen), and should be a unit test.

Doctests are simpler and faster to use for me, and they have the
advantage of keeping tests close to the tested things, instead of
scattering things in two or more places.
Generally between the choice of less tests or more doctests, I prefer
more doctests. If you have infinite time you don't need to take such
choices.

... on Windows, clock() has microsecond granularity but time()'s
granularity is 1/60th of a second; on Unix, clock() has 1/100th
of a second granularity and time() is much more precise.

I did know this... Maybe the a single function can be put there, fit
for everyone.

An easy way to do that is by importing timeit.default_timer.

I didn't know about this, thank you. I think the timeit docs don't
show that function well enough, I can't find it in a quick search. And
that name is quite long to write, so maybe a shorter name is better
(like "timer").

Bye and thank you,
bearophile

brain stuck. whats occurring here?	10	Feb 7, 2008
Fw: Undeliverable Message	1	Jan 25, 2008
validate string is valid maths	2	Jan 28, 2008
Measuring memory using ruby-prof - help required	0	May 30, 2011
Filter sober in c++ don't pass test	0	Dec 2, 2023
very simple Genetic Algorithm completed	4	Jan 31, 2008
Report to Recipient(s)	0	Feb 27, 2004
FreeRide problem	1	Oct 12, 2004

are elements of a list in sequence in list b

Matthew_WARREN

Paul Hankin

Arnaud Delobelle

Tomek Paczkowski

Steven D'Aprano

bearophileHUGS

Steven D'Aprano

Paddy

bearophileHUGS

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads