Portfolio Assignment 07

Discovering Groups Using Clustering Algorithms

Task: Implement and discuss hierarchical clustering and K-means clustering

Chapter 3 of our Collective Intelligence book covers simple clustering of word vectors and the visualization of the resulting clusters. The premise is that clustering blogs by word frequency can reveal groups of blogs that frequently write about similar subjects or write in similar styles. Finding these groups is useful for searching, cataloging, and exploring large amounts of online text.
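
To make "clustering blogs based on word frequencies" concrete, here is a minimal sketch of one way to measure how far apart two blogs are. It assumes each blog is a list of word counts over the same vocabulary and uses a Pearson-correlation distance. The function name and the sample counts are hypothetical, chosen only for illustration; your clusters.py should follow the text.

```python
from math import sqrt

def pearson_distance(v1, v2):
    """Return 1 - Pearson correlation: 0.0 means perfectly correlated."""
    n = len(v1)
    sum1, sum2 = sum(v1), sum(v2)
    sum1_sq = sum(x * x for x in v1)
    sum2_sq = sum(y * y for y in v2)
    p_sum = sum(x * y for x, y in zip(v1, v2))

    num = p_sum - (sum1 * sum2 / n)
    den_sq = (sum1_sq - sum1 ** 2 / n) * (sum2_sq - sum2 ** 2 / n)
    if den_sq <= 0:
        # Assumption: a constant vector carries no signal, so treat the
        # pair as uncorrelated instead of dividing by zero.
        return 1.0
    return 1.0 - num / sqrt(den_sq)

# Hypothetical word counts for three blogs over the same four words.
blog_a = [10, 0, 3, 1]
blog_b = [20, 0, 6, 2]  # same proportions as blog_a, just a longer blog
blog_c = [0, 9, 1, 7]   # emphasizes different words

print(pearson_distance(blog_a, blog_b))
print(pearson_distance(blog_a, blog_c))
```

A correlation-based distance ignores overall blog length, so blog_a and blog_b come out close even though blog_b has twice as many words, while blog_c, which favors different words, comes out far away.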

We will also use the Zebo data to identify 'groups of things people want.'

Do not spend time generating the word count file. The data for your clustering script is here for you to download.

Do not spend time parsing the Zebo data. The Zebo data is here to download.

Specific Tasks

For this assignment, focus on the clustering and not the jpg generation. We'll cover that in the next assignment.

In a nutshell, you're responsible for implementing clusters.py as detailed in our text. The tasks below will be discussed in class.

  1. Implement functions that generate clusters using the hierarchical and K-means algorithms.
  2. Explain the hierarchical and K-means clustering algorithms in your own words, using diagrams where they help. Compare and contrast the two.
  3. Write, in pseudocode, a high-level data mining algorithm that uses K-means clustering to determine whether meaningful clusters exist. Document the decisions your algorithm makes and the cases in which it fails. Because what counts as 'meaningful' depends on context, consider several contexts in your explanation. For example, given a particular dataset, do you expect the clusters to be dense? How does the distance from objects to their centroids compare with the distance between the centroids themselves? What about outliers?
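
As a starting point for discussing tasks 1 and 3, the following is a minimal sketch of K-means plus one possible "meaningfulness" score. It is not the version from our text. It assumes Euclidean distance, seeds the centroids by sampling k rows (a simplification), and uses hypothetical names (kmeans, separation_ratio) and made-up sample points. The score compares the average object-to-centroid distance against the smallest centroid-to-centroid distance.

```python
import random
from math import sqrt

def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_point(points):
    # Column-wise average of a non-empty list of equal-length rows.
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def kmeans(rows, k, distance=euclidean, max_iter=100, seed=0):
    """Return (assignment, centroids); assignment[i] is the cluster of rows[i]."""
    rng = random.Random(seed)
    # Assumption: seed centroids by sampling k distinct rows.
    centroids = [list(r) for r in rng.sample(rows, k)]
    assignment = None
    for _ in range(max_iter):
        # Assign every row to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: distance(row, centroids[c]))
            for row in rows
        ]
        if new_assignment == assignment:
            break  # no row changed cluster, so we have converged
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [rows[i] for i, a in enumerate(assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_point(members)
    return assignment, centroids

def separation_ratio(rows, assignment, centroids, distance=euclidean):
    """Mean object-to-centroid distance / smallest centroid-to-centroid distance."""
    spread = sum(
        distance(row, centroids[a]) for row, a in zip(rows, assignment)
    ) / len(rows)
    gaps = [
        distance(centroids[i], centroids[j])
        for i in range(len(centroids))
        for j in range(i + 1, len(centroids))
    ]
    min_gap = min(gaps)
    return float("inf") if min_gap == 0 else spread / min_gap

# Made-up data: two tight, well-separated groups of points.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centers = kmeans(points, k=2)
print(labels, separation_ratio(points, labels, centers))
```

A small ratio suggests dense clusters that sit far apart; a ratio near or above 1 suggests the clusters overlap. The cutoff for "meaningful" is a decision you must justify for your context. Note that a single outlier can inflate the spread or capture a centroid of its own, which is one of the failure cases worth documenting.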