marc magrans de abril
Dear colleagues,
I was writing a small program to classify log files for a cluster of
PCs; I just wanted to simplify a quite repetitive task of finding
errors and so on.
My first naive implementation was something like:
patterns = []
while logs:
    # take the first remaining log line as the reference pattern
    pattern = logs[0]
    # keep only the lines that are further than THRESHOLD from it
    new_logs = [l for l in logs if dist(pattern, l) > THRESHOLD]
    # remember the pattern and how many lines it absorbed
    entry = (len(logs) - len(new_logs), pattern)
    patterns.append(entry)
    logs = new_logs
where dist(...) is the Levenshtein distance (i.e. edit distance) and
logs is a list of about 1.5M log lines (a 700 MB file). I thought
Python would be an easy choice, although not a really fast one.
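For reference, a minimal self-contained sketch of the idea (with a
textbook dynamic-programming Levenshtein standing in for my real
dist(), and made-up THRESHOLD and sample logs) would be:

def dist(a, b):
    # classic O(len(a)*len(b)) edit-distance DP, kept one row at a time
    prev = range(len(b) + 1)
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

THRESHOLD = 3
logs = ["error at node 1", "error at node 2", "disk full on node 7"]

patterns = []
while logs:
    pattern = logs[0]
    new_logs = [l for l in logs if dist(pattern, l) > THRESHOLD]
    patterns.append((len(logs) - len(new_logs), pattern))
    logs = new_logs

print(patterns)   # [(2, 'error at node 1'), (1, 'disk full on node 7')]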
I was not surprised when the first iteration of the while loop took
~10 min. I thought "not bad, let's see how long it takes". However,
the second iteration never seemed to finish.
I got a big surprise when I replaced the list comprehension with an
explicit loop and a print:
new_logs = []
for count, l in enumerate(logs):
    print count
    if dist(pattern, l) > THRESHOLD:
        new_logs.append(l)
The surprise was that the displayed counter ran ~10 times slower
during the second iteration of the while loop.
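To put rough numbers on this, I could also time each pass of the
while loop instead of printing every counter value (the prints
themselves are not free); a sketch, assuming dist(), THRESHOLD and
logs as above:

import time

patterns = []
npass = 0
while logs:
    start = time.time()
    pattern = logs[0]
    new_logs = [l for l in logs if dist(pattern, l) > THRESHOLD]
    patterns.append((len(logs) - len(new_logs), pattern))
    logs = new_logs
    npass += 1
    # one line per pass: elapsed time and how many logs remain
    print("pass %d: %.1f s, %d logs left"
          % (npass, time.time() - start, len(logs)))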
I am a little lost. Does anyone know the reason for this behavior? How
should I write a program that deals with large data sets in Python?
Thanks a lot!
marc magrans de abril