M
MooMaster
A similar discussion has already occurred, over 4 years ago:
http://groups.google.com/group/comp...9928?lnk=gst&q=list+in+place#5dff55826a199928
Nevertheless, I have a use-case where such a discussion comes up. For
my data mining class I'm writing an implementation of the bisecting
KMeans clustering algorithm (if you're not familiar with clustering
and are interested, this gives a decent example based overview:
http://rakaposhi.eas.asu.edu/cse494/notes/f02-clustering.ppt). Given a
CSV dataset of n records, we are to cluster them accordingly.
The dataset is generalizable enough to have any kind of data-type
(strings, floats, booleans, etc) for each of the record's columnar
values, for example here's a couple of records from the famous iris
dataset:
5.1,3.5,1.4,0.2,Iris-setosa
6.4,3.2,4.5,1.5,Iris-versicolor
Now we can't calculate a meaningful Euclidean distance for something
like "Iris-setosa" and "Iris-versicolor" unless we use string-edit
distance or something overly complicated, so instead we'll use a
simple quantization scheme of enumerating the set of values within the
column domain and replacing the strings with numbers (i.e. Iris-setosa
= 1, iris-versicolor=2).
So I'm reading in values from a file, and for each column I need to
dynamically discover the range of possible values it can take and
quantize if necessary. This is the solution I've come up with:
<code>
def createInitialCluster(fileName):
#get the data from the file
points = []
with open(fileName, 'r') as f:
for line in f:
points.append(line.rstrip('\n'))
#clean up the data
fixedPoints = []
for point in points:
dimensions = [quantize(i, points, point.split(",").index(i))
for i in point.split(",")]
print dimensions
fixedPoints.append(Point(dimensions))
#return an initial cluster of all the points
return Cluster(fixedPoints)
def quantize(stringToQuantize, pointList, columnIndex):
#if it's numeric, no need to quantize
if(isNumeric(stringToQuantize)):
return float(stringToQuantize)
#first we need to discover all the possible values of this column
domain = []
for point in pointList:
domain.append(point.split(",")[columnIndex])
#make it a set to remove duplicates
domain = list(Set(domain))
#use the index into the domain as the number representing this
value
return float(domain.index(stringToQuantize))
#harvested from http://www.rosettacode.org/wiki/IsNumeric#Python
def isNumeric(string):
try:
i = float(string)
except ValueError:
return False
return True
</code>
It works, but it feels a little ugly, and not exactly Pythonic. Using
two lists I need the original point list to read in the data, then the
dimensions one to hold the processed point, and a fixedPoint list to
make objects out of the processed data. If my dataset is in the order
of millions, this'll nuke the memory. I tried something like:
for point in points:
point = Point([quantize(i, points, point.split(",").index(i)) for i
in point.split(",")])
but when I print out the points afterward, it doesn't keep the
changes.
What's a more efficient way of doing this?
http://groups.google.com/group/comp...9928?lnk=gst&q=list+in+place#5dff55826a199928
Nevertheless, I have a use-case where such a discussion comes up. For
my data mining class I'm writing an implementation of the bisecting
KMeans clustering algorithm (if you're not familiar with clustering
and are interested, this gives a decent example based overview:
http://rakaposhi.eas.asu.edu/cse494/notes/f02-clustering.ppt). Given a
CSV dataset of n records, we are to cluster them accordingly.
The dataset is generalizable enough to have any kind of data-type
(strings, floats, booleans, etc) for each of the record's columnar
values, for example here's a couple of records from the famous iris
dataset:
5.1,3.5,1.4,0.2,Iris-setosa
6.4,3.2,4.5,1.5,Iris-versicolor
Now we can't calculate a meaningful Euclidean distance for something
like "Iris-setosa" and "Iris-versicolor" unless we use string-edit
distance or something overly complicated, so instead we'll use a
simple quantization scheme of enumerating the set of values within the
column domain and replacing the strings with numbers (i.e. Iris-setosa
= 1, iris-versicolor=2).
So I'm reading in values from a file, and for each column I need to
dynamically discover the range of possible values it can take and
quantize if necessary. This is the solution I've come up with:
<code>
def createInitialCluster(fileName):
#get the data from the file
points = []
with open(fileName, 'r') as f:
for line in f:
points.append(line.rstrip('\n'))
#clean up the data
fixedPoints = []
for point in points:
dimensions = [quantize(i, points, point.split(",").index(i))
for i in point.split(",")]
print dimensions
fixedPoints.append(Point(dimensions))
#return an initial cluster of all the points
return Cluster(fixedPoints)
def quantize(stringToQuantize, pointList, columnIndex):
#if it's numeric, no need to quantize
if(isNumeric(stringToQuantize)):
return float(stringToQuantize)
#first we need to discover all the possible values of this column
domain = []
for point in pointList:
domain.append(point.split(",")[columnIndex])
#make it a set to remove duplicates
domain = list(Set(domain))
#use the index into the domain as the number representing this
value
return float(domain.index(stringToQuantize))
#harvested from http://www.rosettacode.org/wiki/IsNumeric#Python
def isNumeric(string):
try:
i = float(string)
except ValueError:
return False
return True
</code>
It works, but it feels a little ugly, and not exactly Pythonic. Using
two lists I need the original point list to read in the data, then the
dimensions one to hold the processed point, and a fixedPoint list to
make objects out of the processed data. If my dataset is in the order
of millions, this'll nuke the memory. I tried something like:
for point in points:
point = Point([quantize(i, points, point.split(",").index(i)) for i
in point.split(",")])
but when I print out the points afterward, it doesn't keep the
changes.
What's a more efficient way of doing this?