Mining for Groups Using Clustering Algorithms

Cluster Analysis

Cluster analysis groups data objects based only on information found in the data that describes the objects and their relationships. The goal is that the objects within a group be similar to one another and different from the objects in other groups. The greater the similarity within a group and the greater the difference between groups, the better or more distinct the clustering. [ 1 ]

This page contains a brief discussion of several important clustering algorithms.

K-means Clustering

K-means is a prototype-based, partitional clustering technique that attempts to find a user-specified number of clusters (K), which are represented by their centroids. The centroid almost never corresponds to an actual data point in K-means clustering. K-means requires a proximity measure for a pair of objects, since it groups nearby points into clusters.
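
The later examples (adapted from [ 2 ]) use a Pearson-correlation-based distance as this proximity measure. A minimal sketch of that measure is given below so the listings that follow can be run as they are; this particular definition is an assumption taken from [ 2 ], and any other pairwise distance, such as Euclidean distance, could be substituted.

from math import sqrt

def pearson(v1, v2):
    # Sums and sums of squares of each vector
    sum1, sum2 = sum(v1), sum(v2)
    sum1Sq = sum(v ** 2 for v in v1)
    sum2Sq = sum(v ** 2 for v in v2)
    # Sum of the element-wise products
    pSum = sum(v1[i] * v2[i] for i in range(len(v1)))
    # Pearson correlation coefficient r
    num = pSum - (sum1 * sum2 / len(v1))
    den = sqrt((sum1Sq - sum1 ** 2 / len(v1)) * (sum2Sq - sum2 ** 2 / len(v1)))
    if den == 0:
        return 0
    # Return a distance: 0.0 for perfectly correlated vectors, larger otherwise
    return 1.0 - num / den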

The basic K-means algorithm is simple. It first chooses K initial centroids, where K is a user-specified parameter, namely, the number of clusters desired. Each point is then assigned to the closest centroid, and each collection of points assigned to a centroid is a cluster. The centroid of each cluster is then updated based on the points assigned to it. The assignment and update steps are repeated until no point changes clusters, or equivalently, until the centroids remain the same. The operation of K-means is illustrated more formally in the following pseudocode. [ 1 ]

   
Algorithm: Basic K-means algorithm.
step 1: Select K points as initial centroids.
step 2: repeat
        {
           Form K clusters by assigning each point to its closest centroid.
           Recompute the centroid of each cluster.
        }
step 3: until the centroids do not change.
   

Here is a Python example of K-means clustering. The function takes the data rows and the desired number of clusters (k) as input and assigns each row to its nearest centroid. [ 2 ]

   
import random

def kcluster(rows, distance=pearson, k=4):
    # Determine the minimum and maximum values for each dimension
    ranges = [(min([row[i] for row in rows]), max([row[i] for row in rows]))
              for i in range(len(rows[0]))]

    # Create k randomly placed centroids within those ranges
    clusters = [[random.random() * (ranges[i][1] - ranges[i][0]) + ranges[i][0]
                 for i in range(len(rows[0]))] for j in range(k)]

    lastmatches = None
    for t in range(100):
        print('Iteration %d' % t)
        bestmatches = [[] for i in range(k)]

        # Find which centroid is the closest for each row
        for j in range(len(rows)):
            row = rows[j]
            bestmatch = 0
            for i in range(k):
                d = distance(clusters[i], row)
                if d < distance(clusters[bestmatch], row):
                    bestmatch = i
            bestmatches[bestmatch].append(j)

        # If the assignments are the same as last time, the process is complete
        if bestmatches == lastmatches:
            break
        lastmatches = bestmatches

        # Move each centroid to the average of the rows assigned to it
        for i in range(k):
            avgs = [0.0] * len(rows[0])
            if len(bestmatches[i]) > 0:
                for rowid in bestmatches[i]:
                    for m in range(len(rows[rowid])):
                        avgs[m] += rows[rowid][m]
                for j in range(len(avgs)):
                    avgs[j] /= len(bestmatches[i])
                clusters[i] = avgs

    return bestmatches
   
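
A brief usage sketch follows; the toy rows, the fixed random seed, and the choice of k = 2 are illustrative assumptions, not part of the original example. The function returns, for each of the k centroids, the list of row indices assigned to it.

import random
random.seed(1)  # illustrative: makes the random centroid placement repeatable

# Toy data: the first two rows rise, the last two fall, so a
# correlation-based distance should separate them into two clusters
rows = [
    [1.0, 2.0, 3.0],
    [1.5, 2.5, 3.5],
    [3.0, 2.0, 1.0],
    [3.5, 2.6, 1.1],
]

matches = kcluster(rows, distance=pearson, k=2)
print(matches)  # e.g. [[0, 1], [2, 3]] (order depends on the random centroids)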

Hierarchical Clustering

Hierarchical clustering techniques are a second important category of clustering methods. There are two basic approaches for generating a hierarchical clustering: agglomerative approaches start with the points as individual clusters and, at each step, merge the closest pair of clusters; divisive approaches start with one all-inclusive cluster and, at each step, split a cluster, until only clusters of individual points remain. [ 1 ]

The basic agglomerative hierarchical clustering technique starts with individual points as clusters and successively merges the two closest clusters until only one cluster remains. It is expressed more formally in the following pseudocode. [ 1 ]

   
Basic hierarchical clustering algorithm:
step 1: Compute the proximity matrix, if necessary.
step 2: repeat
        {
           Merge the closest two clusters.
           Update the proximity matrix to reflect the proximity between
           the new cluster and the original clusters.
        }
step 3: until only one cluster remains.
   

Here is a Python example of hierarchical clustering. The code begins by creating a group of clusters that are just the original items. The main loop of the function searches for the two closest clusters by trying every possible pair and computing the distance between them. The closest pair is merged into a single cluster, and the data for this new cluster is the average of the data for the two old clusters. This process is repeated until only one cluster remains. Since computing all these distances can be very time consuming, it is a good idea to cache the result for each pair: it will be needed again and again until one of the clusters in the pair is merged into another cluster. [ 2 ]

   
   
# The bicluster helper class from [ 2 ]: each node in the tree stores its data
# vector, its two children, and the distance at which the children were merged
class bicluster:
    def __init__(self, vec, left=None, right=None, distance=0.0, id=None):
        self.vec = vec
        self.left = left
        self.right = right
        self.distance = distance
        self.id = id

def hcluster(rows, distance=pearson):
    distances = {}
    currentclustid = -1
    # Clusters are initially just the rows
    clust = [bicluster(rows[i], id=i) for i in range(len(rows))]

    while len(clust) > 1:
        lowestpair = (0, 1)
        closest = distance(clust[0].vec, clust[1].vec)

        # Loop through every pair looking for the smallest distance
        for i in range(len(clust)):
            for j in range(i + 1, len(clust)):
                # distances is the cache of distance calculations
                if (clust[i].id, clust[j].id) not in distances:
                    distances[(clust[i].id, clust[j].id)] = distance(clust[i].vec, clust[j].vec)
                d = distances[(clust[i].id, clust[j].id)]
                if d < closest:
                    closest = d
                    lowestpair = (i, j)

        # Calculate the average of the two clusters
        mergevec = [(clust[lowestpair[0]].vec[i] + clust[lowestpair[1]].vec[i]) / 2.0
                    for i in range(len(clust[0].vec))]

        # Create the new cluster
        newcluster = bicluster(mergevec, left=clust[lowestpair[0]],
                               right=clust[lowestpair[1]],
                               distance=closest, id=currentclustid)

        # Cluster ids that weren't in the original set are negative
        currentclustid -= 1
        del clust[lowestpair[1]]
        del clust[lowestpair[0]]
        clust.append(newcluster)

    return clust[0]
   
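
A brief usage sketch: the toy rows are the same illustrative data as above, and printclust is a small helper (adapted from the printclust function in [ 2 ]) that prints the resulting tree with one level of indentation per merge.

def printclust(clust, labels=None, n=0):
    # Indent to show the hierarchy
    print(' ' * n, end='')
    if clust.id < 0:
        # Negative id means this node was created by a merge
        print('-')
    else:
        # Non-negative id means this is one of the original rows
        print(clust.id if labels is None else labels[clust.id])
    if clust.left is not None:
        printclust(clust.left, labels=labels, n=n + 1)
    if clust.right is not None:
        printclust(clust.right, labels=labels, n=n + 1)

rows = [
    [1.0, 2.0, 3.0],
    [1.5, 2.5, 3.5],
    [3.0, 2.0, 1.0],
    [3.5, 2.6, 1.1],
]
tree = hcluster(rows, distance=pearson)
printclust(tree)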

DBSCAN

DBSCAN is a density-based clustering technique. It locates regions of high density that are separated from one another by regions of low density. Density is estimated for a particular point in the data set by counting the number of points within a specified radius, Eps, of that point. [ 1 ]
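
As a concrete illustration of this density estimate, the sketch below returns the points that fall within a radius Eps of a given point; the function name eps_neighborhood and the use of Euclidean distance are assumptions made for illustration.

from math import sqrt

def eps_neighborhood(points, p, eps):
    # Indices of all points whose Euclidean distance to p is at most eps
    def euclidean(a, b):
        return sqrt(sum((a[i] - b[i]) ** 2 for i in range(len(a))))
    return [i for i, q in enumerate(points) if euclidean(p, q) <= eps]

# The density at p is simply the size of its Eps-neighborhood:
# density = len(eps_neighborhood(points, p, eps))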

The density-based approach allows us to classify a point as belonging to one of three categories: core points, which have more than a specified number of points (MinPts) within Eps; border points, which are not core points but fall within the Eps-neighborhood of a core point; and noise points, which are neither core points nor border points.

Given the definitions of core points, border points, and noise points, the basic DBSCAN algorithm finds the clusters with the following approach. Any two core points that are close enough, that is, within a distance Eps of one another, are put in the same cluster. Likewise, any border point that is close enough to a core point is put in the same cluster as the core point. Noise points are discarded. The formal pseudocode of the DBSCAN algorithm is given below. [ 1 ]

   
   
Basic DBSCAN algorithm
step 1: Label all points as core, border, or noise points.
step 2: Eliminate noise points.
step 3: Put an edge between all core points that are within Eps of each other.
step 4: Make each group of connected core points into a separate cluster.
step 5: Assign each border point to one of the clusters of its associated core points.
   
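
The sketch below follows these five steps directly; it is a minimal illustration rather than the reference implementation from [ 1 ]. It reuses the eps_neighborhood helper defined earlier, and the parameter min_pts (the MinPts threshold for core points) and the simple graph traversal over core points are assumptions made for illustration.

def dbscan(points, eps, min_pts):
    n = len(points)

    # Step 1: label points; a point is a core point if its Eps-neighborhood
    # (including itself) contains at least min_pts points
    neighborhoods = [eps_neighborhood(points, p, eps) for p in points]
    core_set = set(i for i in range(n) if len(neighborhoods[i]) >= min_pts)

    # Steps 3-4: put an edge between core points within Eps of each other and
    # make each connected group of core points into a separate cluster
    labels = [None] * n          # None = noise or not yet assigned
    cluster_id = 0
    for i in core_set:
        if labels[i] is not None:
            continue
        stack = [i]
        labels[i] = cluster_id
        while stack:
            j = stack.pop()
            for nb in neighborhoods[j]:
                if nb in core_set and labels[nb] is None:
                    labels[nb] = cluster_id
                    stack.append(nb)
        cluster_id += 1

    # Step 5: assign each border point to the cluster of one of its core neighbors
    for i in range(n):
        if labels[i] is None:
            for nb in neighborhoods[i]:
                if nb in core_set:
                    labels[i] = labels[nb]
                    break

    # Step 2 (implicitly): points still labelled None are noise and are discarded
    return labels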

References

[1] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Introduction to Data Mining, Addison Wesley.

[2] Toby Segaran, Programming Collective Intelligence, First Edition, O'Reilly Media, Inc.