D
Dave Angel
I think you're misunderstanding what I need to do. I have a set of rows from the database with tool, time, and message. The user has specified a string and a time threshold. From that set of rows I need to return all the rows that contain the user's string and all the other rows that match the tool from the matched rows and have a time within the threshold.
cdata has all the rows. messageTimes has the times of all the matched messages, keyed by tool. In determine() I don't look though cdata - it gets one element from cdata and I see if that should be selected because it either matches the user's message, or it's within the threshold of one that did match.
Here's my original code:
# record time for each message matching the specified message for each tool
messageTimes = {}
for row in cdata: # tool, time, message
if self.message in row[2]:
messageTimes[row[0], row[1]] = 1
# now pull out each message that is within the time diff for each matched message
# as well as the matched messages themselves
def determine(tup):
if self.message in tup[2]: return True # matched message
for (tool, date_time) in messageTimes:
if tool == tup[0]:
if abs(date_time-tup[1]) <= tdiff:
return True
return False
cdata[:] = [tup for tup in cdata if determine(tup)]
Here's the code now:
# Parse data and make a list of the time for each message matching the specified message for each tool
messageTimes = defaultdict(list) # a dict with sorted lists
for row in cdata: # tool, time, message
if self.message in row[2]:
messageTimes[row[0]].append(row[1])
# now pull out each message that is within the time context for each matched message
# as well as the matched messages themselves
# return true is we should keep this message
def determine(tup):
if self.message in tup[2]: return True # matched message
if seconds == 0: return False # no time context specified
times = messageTimes[tup[0]] # get the list of matched messages for this tool
le = bisect.bisect_right(times, tup[1]) # find time less than or equal to tup[1]
ge = bisect.bisect_left(times, tup[1]) # find time greater then to equal to tup[1]
return (le and tup[1]-times[le-1] <= tdiff) or (ge != len(times) and times[ge]-tup[1] <= tdiff)
cdata = [tup for tup in cdata if determine(tup)]
In my test case, cdata started with 600k rows, 30k matched the users string, and a total of 110k needed to be returned (which is what cdata ended up with) - the 30k that matched the string, and 80k that were within the time threshold.
I think the point you may have missed is the tool - I only return a row if it's the same tool as a matched message and within the threshold.
I hope I've explained this better. Thanks again.
That is better, and the point I missed noticing before is that
messageTimes is substantially smaller than cdata; it's already been
filtered down by looking for self.message in its row[2]. The code was
there, but I didn't relate. Remember I was bothered that you didn't
look at tup[2] when you were comparing for time-similarity. Well, you
did that implicitly, since messageTimes was already filtered. Sorry
about that.
That also lowers my expectations for improvement ratio, since instead of
600,000 * 600,000, we're talking "only" 600,000 * 30,000, 5% as much. So
now my expectations are only 4:1 to 10:1.
Still, there's room for improvement. (1) You should only need one
bisect in determine, and (2) if you remember the last result for each
tool, you could speed that one up some.
(1) Instead of getting both and le and a ge, get just one, by searching
for tup[1]-tdiff. Then by comparing that row's value against
tup[1]+tdiff, you can return immediately the boolean, the expression
being about half of the one you've now got.
(2) Make a dict of ints, keyed by the tool, and initialized to zero.
Call that dict "found." Each time you do a bisect, specify a range
starting at found[tool]. Similarly, store the result of the bisect in
found[tool]. This should gradually restrict the range searched by
bisect, which COULD save some time. It works because everything's sorted.
Naturally, make these changes independently, and time the changes. In
particular, it's possible that #2 will degrade performance instead of
improving it. But #1 should double performance, pretty close.
I hope these were clear enough. I don't want to write the changes
myself, since with no test data, I could easily make a mess of it.
.....
I can think of other changes which are less likely to make substantial
improvement, and which might degrade things. For one drastic example,
instead of any messageTimes storage, you could have ("flags") a list of
bool, same size as ctimes, which tracked the state of each line. You
loop through cdata, identifying and marking any line whose tup[2]
matches. And for each match, you scan backwards and forwards till the
time gets out of range (in one direction stopping at time-tdiff or 0 or
tool change, and in the other direction at time+tdiff or len, or
toolchange), marking each one within tdiff.
Then after one pass through cdata, you can have a very simple list
comprehension, something like:
cdata = [tup for index, tup in enumerate(cdata) if flags[index]]
Will it be faster? Depends on the number of expected matches (eg.
30,000 out of 300,000 is 10%), and how much data forward and backwards
you need to scan.
.....
A very simple difference? Perhaps using implicit unpacking instead of
using tup[0] etc. will be faster. It'll certainly be easier to read.
for tool, time, text in cdata:
if self.message in text:
messageTimes[tool]. append(time)
def determine(tool, time, text):
and call it with determine(*tup)