Identifying the start of good data in a list

T

tkpmep

I have a list that starts with zeros, has sporadic data, and then has
good data. I define the point at which the data turns good to be the
first index with a non-zero entry that is followed by at least 4
consecutive non-zero data items (i.e. a week's worth of non-zero
data). For example, if my list is [0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8,
9], I would define the point at which data turns good to be 4 (1
followed by 2, 3, 4, 5).

I have a simple algorithm to identify this changepoint, but it looks
crude: is there a cleaner, more elegant way to do this?

flag = True
i=-1
j=0
while flag and i < len(retHist)-1:
i += 1
if retHist == 0:
j = 0
else:
j += 1
if j == 5:
flag = False

del retHist[:i-4]

Thanks in advance for your help

Thomas Philips
 
B

bearophileHUGS

First solutions I have found, not much tested beside the few doctests:

from itertools import islice

def start_good1(alist, good_ones=4):
"""
Maybe more efficient for Python
>>> start_good = start_good1
>>> start_good([0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) 4
>>> start_good([]) -1
>>> start_good([0, 0]) -1
>>> start_good([0, 0, 0]) -1
>>> start_good([0, 0, 0, 0, 1]) -1
>>> start_good([0, 0, 1, 0, 1, 2, 3]) -1
>>> start_good([0, 0, 1, 0, 1, 2, 3, 4]) 4
>>> start_good([0, 0, 1, 0, 1, 2, 3, 4, 5]) 4
>>> start_good([1, 2, 3, 4]) 0
>>> start_good([1, 2, 3]) -1
>>> start_good([0, 0, 1, 0, 1, 2, 0, 4])
-1
"""
for i in xrange(len(alist) - good_ones + 1):
if all(islice(alist, i, i+good_ones)):
return i
return -1



def start_good2(alist, good_ones=4):
"""
Maybe more efficient for Psyco
>>> start_good = start_good2
>>> start_good([0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) 4
>>> start_good([]) -1
>>> start_good([0, 0]) -1
>>> start_good([0, 0, 0]) -1
>>> start_good([0, 0, 0, 0, 1]) -1
>>> start_good([0, 0, 1, 0, 1, 2, 3]) -1
>>> start_good([0, 0, 1, 0, 1, 2, 3, 4]) 4
>>> start_good([0, 0, 1, 0, 1, 2, 3, 4, 5]) 4
>>> start_good([1, 2, 3, 4]) 0
>>> start_good([1, 2, 3]) -1
>>> start_good([0, 0, 1, 0, 1, 2, 0, 4])
-1
"""
n_good = 0
for i, el in enumerate(alist):
if alist:
if n_good == good_ones:
return i - good_ones
else:
n_good += 1
else:
n_good = 0
if n_good == good_ones:
return len(alist) - good_ones
else:
return -1


if __name__ == "__main__":
import doctest
doctest.testmod()
print "Doctests done\n"

Bye,
bearophile
 
B

bearophileHUGS

Sorry, in the Psyco version replace this line:
for i, el in enumerate(alist):

With:
for i in xrange(len(alist)):

because Psyco doesn't digest enumerate well.

Bye,
bearophile
 
M

Matthew Fitzgibbons

I have a list that starts with zeros, has sporadic data, and then has
good data. I define the point at which the data turns good to be the
first index with a non-zero entry that is followed by at least 4
consecutive non-zero data items (i.e. a week's worth of non-zero
data). For example, if my list is [0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8,
9], I would define the point at which data turns good to be 4 (1
followed by 2, 3, 4, 5).

I have a simple algorithm to identify this changepoint, but it looks
crude: is there a cleaner, more elegant way to do this?

flag = True
i=-1
j=0
while flag and i < len(retHist)-1:
i += 1
if retHist == 0:
j = 0
else:
j += 1
if j == 5:
flag = False

del retHist[:i-4]

Thanks in advance for your help

Thomas Philips


Maybe this will do?

reHist = [0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
count = 0
for i, d in enumerate(reHist):
if d == 0:
count = 0
else:
count += 1
if count == 5:
break
else:
raise Exception("No data found")
reHist = reHist[i-4:]
print reHist


-Matt
 
M

Mensanator

I have a list that starts with zeros, has sporadic data, and then has
good data. I define the point at  which the data turns good to be the
first index with a non-zero entry that is followed by at least 4
consecutive non-zero data items (i.e. a week's worth of non-zero
data). For example, if my list is [0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8,
9], I would define the point at which data turns good to be 4 (1
followed by 2, 3, 4, 5).

I have a simple algorithm to identify this changepoint, but it looks
crude: is there a cleaner, more elegant way to do this?

    flag = True
    i=-1
    j=0
    while flag and i < len(retHist)-1:
        i += 1
        if retHist == 0:
            j = 0
        else:
            j += 1
            if j == 5:
                flag = False

    del retHist[:i-4]

Thanks in advance for your help

Thomas Philips


Here's my attempt:

LL = [0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

i = 0

while (i<len(LL)) and (0 in LL[i:i+5]):
i += 1

print i, LL[i:i+5]

##
## 4 [1, 2, 3, 4, 5]
##
 
E

Emile van Sebille

I have a list that starts with zeros, has sporadic data, and then has
good data. I define the point at which the data turns good to be the
first index with a non-zero entry that is followed by at least 4
consecutive non-zero data items (i.e. a week's worth of non-zero
data). For example, if my list is [0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8,
9], I would define the point at which data turns good to be 4 (1
followed by 2, 3, 4, 5).

I have a simple algorithm to identify this changepoint, but it looks
crude: is there a cleaner, more elegant way to do this?
.... if 0 not in retHist[ii:ii+5]:
.... break

Well, to the extent short and sweet is elegant...

Emile
 
T

tdmj

I have a list that starts with zeros, has sporadic data, and then has
good data. I define the point at which the data turns good to be the
first index with a non-zero entry that is followed by at least 4
consecutive non-zero data items (i.e. a week's worth of non-zero
data). For example, if my list is [0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8,
9], I would define the point at which data turns good to be 4 (1
followed by 2, 3, 4, 5).

I have a simple algorithm to identify this changepoint, but it looks
crude: is there a cleaner, more elegant way to do this?

flag = True
i=-1
j=0
while flag and i < len(retHist)-1:
i += 1
if retHist == 0:
j = 0
else:
j += 1
if j == 5:
flag = False

del retHist[:i-4]

Thanks in advance for your help

Thomas Philips


With regular expressions:

import re

hist = [0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
hist_str = ''.join(str(i) for i in hist)
match = re.search(r'[1-9]{5, }', hist_str)
hist = hist[-5:] if match is None else hist[match.start():]

Or slightly more concise:

import re

hist = [0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
match = re.search(r'[1-9]{5, }', ''.join(str(i) for i in hist))
hist = hist[-5:] if match is None else hist[match.start():]

Tommy McDaniel
 
T

tkpmep

I have a list that starts with zeros, has sporadic data, and then has
good data. I define the point at  which the data turns good to be the
first index with a non-zero entry that is followed by at least 4
consecutive non-zero data items (i.e. a week's worth of non-zero
data). For example, if my list is [0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8,
9], I would define the point at which data turns good to be 4 (1
followed by 2, 3, 4, 5).
I have a simple algorithm to identify this changepoint, but it looks
crude: is there a cleaner, more elegant way to do this?

 >>> for ii,dummy in enumerate(retHist):
...     if 0 not in retHist[ii:ii+5]:
...         break

 >>> del retHist[:ii]

Well, to the extent short and sweet is elegant...

Emile

This is just what the doctor ordered. Thank you, everyone, for the
help.

Sincerely

Thomas Philips
 
T

Terry Reedy

Matthew said:
(e-mail address removed) wrote:
reHist = [0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
count = 0
for i, d in enumerate(reHist):
if d == 0:
count = 0
else:
count += 1
if count == 5:
break
else:
raise Exception("No data found")
reHist = reHist[i-4:]
print reHist

This is what I would have suggested, except that the 'if count' test
should be left under the else clause, as in the original, so I consider
it the best of the responses ;-)

I thought of the repeated slicing alternative, but it would be slightly
slower. However, for occasional runs, the difference would be trivial.

Worrying about what Psyco does for this problem is rather premature
optimization.

My quarter's worth....

tjr
 
S

Steven D'Aprano

I have a list that starts with zeros, has sporadic data, and then has
good data. I define the point at which the data turns good to be the
first index with a non-zero entry that is followed by at least 4
consecutive non-zero data items (i.e. a week's worth of non-zero data).
For example, if my list is [0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9], I
would define the point at which data turns good to be 4 (1 followed by
2, 3, 4, 5).
....

With regular expressions:

Good grief. If you're suggesting that as a serious proposal, and not just
to prove it can be done, that's surely an example of "when all you have
is a hammer, everything looks like a nail" thinking.

In this particular case, your regex "solution" gives the wrong result,
indicating that you didn't test your code before posting. Hint:

re.search(r'[1-9]{5, }', "123456")

returns None.

The obvious fix for that specific bug is to use r'[1-9]{5,5}', but even
that will fail. Hint: what happens if an item has more than one digit?

Before posting another regex solution, make sure it does the right thing
with this:

[0, 0, 101, 0, 1002, 203, 3050, 4105, 5110, 623, 777]
 
G

Gerard flanagan

I have a list that starts with zeros, has sporadic data, and then has
good data. I define the point at which the data turns good to be the
first index with a non-zero entry that is followed by at least 4
consecutive non-zero data items (i.e. a week's worth of non-zero
data). For example, if my list is [0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8,
9], I would define the point at which data turns good to be 4 (1
followed by 2, 3, 4, 5).

I have a simple algorithm to identify this changepoint, but it looks
crude: is there a cleaner, more elegant way to do this?

flag = True
i=-1
j=0
while flag and i < len(retHist)-1:
i += 1
if retHist == 0:
j = 0
else:
j += 1
if j == 5:
flag = False

del retHist[:i-4]

Thanks in advance for your help

Thomas Philips




data = [0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


def itergood(indata):
indata = iter(indata)
buf = []
while len(buf) < 4:
buf.append(indata.next())
if buf[-1] == 0:
buf[:] = []
for x in buf:
yield x
for x in indata:
yield x


for d in itergood(data):
print d
 
G

George Sakkis

I have a list that starts with zeros, has sporadic data, and then has
good data. I define the point at which the data turns good to be the
first index with a non-zero entry that is followed by at least 4
consecutive non-zero data items (i.e. a week's worth of non-zero
data). For example, if my list is [0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8,
9], I would define the point at which data turns good to be 4 (1
followed by 2, 3, 4, 5).
I have a simple algorithm to identify this changepoint, but it looks
crude: is there a cleaner, more elegant way to do this?
for ii,dummy in enumerate(retHist):
... if 0 not in retHist[ii:ii+5]:
... break
del retHist[:ii]
Well, to the extent short and sweet is elegant...

This is just what the doctor ordered. Thank you, everyone, for the
help.

Note that the version above (as well as most others posted) fail for
boundary cases; check out bearophile's doctest to see some of them.
Below are two more versions that pass all the doctests: the first
works only for lists and modifies them in place and the second works
for arbitrary iterables:

def clean_inplace(seq, good_ones=4):
start = 0
n = len(seq)
while start < n:
try: end = seq.index(0, start)
except ValueError: end = n
if end-start >= good_ones:
break
start = end+1
del seq[:start]

def clean_iter(iterable, good_ones=4):
from itertools import chain, islice, takewhile, dropwhile
iterator = iter(iterable)
is_zero = float(0).__eq__
while True:
# consume all zeros up to the next non-zero
iterator = dropwhile(is_zero, iterator)
# take up to `good_ones` non-zeros
good = list(islice(takewhile(bool,iterator), good_ones))
if not good: # iterator exhausted
return iterator
if len(good) == good_ones:
# found `good_ones` consecutive non-zeros;
# chain them to the rest items and return them
return chain(good, iterator)

HTH,
George
 
G

George Sakkis

I have a list that starts with zeros, has sporadic data, and then has
good data. I define the point at which the data turns good to be the
first index with a non-zero entry that is followed by at least 4
consecutive non-zero data items (i.e. a week's worth of non-zero
data). For example, if my list is [0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8,
9], I would define the point at which data turns good to be 4 (1
followed by 2, 3, 4, 5).
I have a simple algorithm to identify this changepoint, but it looks
crude: is there a cleaner, more elegant way to do this?
flag = True
i=-1
j=0
while flag and i < len(retHist)-1:
i += 1
if retHist == 0:
j = 0
else:
j += 1
if j == 5:
flag = False

del retHist[:i-4]
Thanks in advance for your help
Thomas Philips

data = [0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

def itergood(indata):
indata = iter(indata)
buf = []
while len(buf) < 4:
buf.append(indata.next())
if buf[-1] == 0:
buf[:] = []
for x in buf:
yield x
for x in indata:
yield x

for d in itergood(data):
print d


This seems the most efficient so far for arbitrary iterables. With a
few micro-optimizations it becomes:

from itertools import chain

def itergood(indata, good_ones=4):
indata = iter(indata); get_next = indata.next
buf = []; append = buf.append
while len(buf) < good_ones:
next = get_next()
if next: append(next)
else: del buf[:]
return chain(buf, indata)

$ python -m timeit -s "x = 1000*[0, 0, 0, 1, 2, 3] + [1,2,3,4]; from
itergood import itergood" "list(itergood(x))"
100 loops, best of 3: 3.09 msec per loop

And with Psyco enabled:
$ python -m timeit -s "x = 1000*[0, 0, 0, 1, 2, 3] + [1,2,3,4]; from
itergood import itergood" "list(itergood(x))"
1000 loops, best of 3: 466 usec per loop

George
 
B

bearophileHUGS

George Sakkis:
This seems the most efficient so far for arbitrary iterables.

This one probably scores well with Psyco ;-)


def start_good3(seq, good_ones=4):
"""
>>> start_good = start_good3
>>> start_good([0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) 4
>>> start_good([]) -1
>>> start_good([0, 0]) -1
>>> start_good([0, 0, 0]) -1
>>> start_good([0, 0, 0, 0, 1]) -1
>>> start_good([0, 0, 1, 0, 1, 2, 3]) -1
>>> start_good([0, 0, 1, 0, 1, 2, 3, 4]) 4
>>> start_good([0, 0, 1, 0, 1, 2, 3, 4, 5]) 4
>>> start_good([1, 2, 3, 4]) 0
>>> start_good([1, 2, 3]) -1
>>> start_good([0, 0, 1, 0, 1, 2, 0, 4])
-1
"""
n_good = 0
pos = 0
for el in seq:
if el:
if n_good == good_ones:
return pos - good_ones
else:
n_good += 1
elif n_good:
n_good = 0
pos += 1
if n_good == good_ones:
return pos - good_ones
else:
return -1

Bye,
bearophile
 
C

castironpi

George Sakkis:


This one probably scores well with Psyco ;-)

def start_good3(seq, good_ones=4):
    n_good = 0
    pos = 0
    for el in seq:
        if el:
            if n_good == good_ones:
                return pos - good_ones
            else:
                n_good += 1
        elif n_good:
                n_good = 0
        pos += 1
    if n_good == good_ones:
        return pos - good_ones
    else:
        return -1

Bye,
bearophile

There, that's the regular machine for it. Too much thinking in
objects, and you can't even write a linked list anymore, right?
 
G

George Sakkis

George Sakkis:


This one probably scores well with Psyco ;-)

I think if you update this so that it returns the "good" iterable
instead of the starting index, it is equivalent to Gerard's solution.

George
 
G

George Sakkis

There, that's the regular machine for it. Too much thinking in
objects, and you can't even write a linked list anymore, right?

And you're still wondering why do people killfile you or think you're
a failed AI project...
 
C

castironpi

And you're still wondering why do people killfile you or think you're
a failed AI project...

Just jumping on the bandwagon, George. And you see, everyone else's
passed the doctests perfectly. Were all the running times O( n* k )?
 
G

Gerard flanagan

George said:
I have a list that starts with zeros, has sporadic data, and then has
good data. I define the point at which the data turns good to be the
first index with a non-zero entry that is followed by at least 4
consecutive non-zero data items (i.e. a week's worth of non-zero
data). For example, if my list is [0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8,
9], I would define the point at which data turns good to be 4 (1
followed by 2, 3, 4, 5).
I have a simple algorithm to identify this changepoint, but it looks
crude: is there a cleaner, more elegant way to do this?
flag = True
i=-1
j=0
while flag and i < len(retHist)-1:
i += 1
if retHist == 0:
j = 0
else:
j += 1
if j == 5:
flag = False
del retHist[:i-4]
Thanks in advance for your help
Thomas Philips

data = [0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

def itergood(indata):
indata = iter(indata)
buf = []
while len(buf) < 4:
buf.append(indata.next())
if buf[-1] == 0:
buf[:] = []
for x in buf:
yield x
for x in indata:
yield x

for d in itergood(data):
print d


This seems the most efficient so far for arbitrary iterables. With a
few micro-optimizations it becomes:

from itertools import chain

def itergood(indata, good_ones=4):
indata = iter(indata); get_next = indata.next
buf = []; append = buf.append
while len(buf) < good_ones:
next = get_next()
if next: append(next)
else: del buf[:]
return chain(buf, indata)

$ python -m timeit -s "x = 1000*[0, 0, 0, 1, 2, 3] + [1,2,3,4]; from
itergood import itergood" "list(itergood(x))"
100 loops, best of 3: 3.09 msec per loop

And with Psyco enabled:
$ python -m timeit -s "x = 1000*[0, 0, 0, 1, 2, 3] + [1,2,3,4]; from
itergood import itergood" "list(itergood(x))"
1000 loops, best of 3: 466 usec per loop

George
--


I always forget the 'del slice' method for clearing a list, thanks.

I think that returning a `chain` means that the function is not itself a
generator. And so if the indata has length less than or equal
to the threshold (good_ones), an unhandled StopIteration is raised
before the return statement is reached.


G.
 
C

castironpi

Below are two more versions that pass all the doctests: the first
works only for lists and modifies them in place and the second works
for arbitrary iterables:

def clean_inplace(seq, good_ones=4):
    start = 0
    n = len(seq)
    while start < n:
        try: end = seq.index(0, start)
        except ValueError: end = n
        if end-start >= good_ones:
            break
        start = end+1
    del seq[:start]

def clean_iter(iterable, good_ones=4):
    from itertools import chain, islice, takewhile, dropwhile
    iterator = iter(iterable)
    is_zero = float(0).__eq__
    while True:
        # consume all zeros up to the next non-zero
        iterator = dropwhile(is_zero, iterator)
        # take up to `good_ones` non-zeros
        good = list(islice(takewhile(bool,iterator), good_ones))
        if not good: # iterator exhausted
            return iterator
        if len(good) == good_ones:
            # found `good_ones` consecutive non-zeros;
            # chain them to the rest items and return them
            return chain(good, iterator)

HTH,
George

You gave me an idea-- maybe an arbitrary 'lookahead' iterable could be
useful. I haven't seen them that much on the newsgroup, but more than
once. IOW a buffered consumer. Something that you could check a
fixed number of next elements of. You might implement it as a
iterator with a __getitem__ method.

Example, unproduced:
import itertools
a= itertools.count( )
a.next() 0
a.next() 1
a[ 3 ] 5
a.next() 2
a[ 3 ]
6

Does this make sense at all?
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,982
Messages
2,570,185
Members
46,736
Latest member
AdolphBig6

Latest Threads

Top