Chapter 3 of our Collective Intelligence book covers simple clustering of word vectors and the visualization of these clusters. The premise is that clustering blogs by their word frequencies can reveal groups of blogs that frequently write about similar subjects or write in similar styles. Finding these groups is useful for searching, cataloging, and exploring large amounts of online text.
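To make "clustering by word frequencies" concrete, here is a minimal sketch of one way to score how close two blogs are, using a Pearson-correlation distance over their word-count vectors. The function name and the choice to return 1.0 when a vector has no variance are assumptions made for illustration; check them against the text before relying on them.

```python
from math import sqrt


def pearson_distance(v1, v2):
    """Return 1 - Pearson correlation: 0.0 for perfectly correlated
    vectors, 2.0 for perfectly anti-correlated ones."""
    n = len(v1)
    if n == 0 or n != len(v2):
        raise ValueError("vectors must be non-empty and of equal length")

    sum1, sum2 = sum(v1), sum(v2)
    sum1_sq = sum(x * x for x in v1)
    sum2_sq = sum(x * x for x in v2)
    product_sum = sum(a * b for a, b in zip(v1, v2))

    numerator = product_sum - (sum1 * sum2) / n
    # Clamp at zero so floating-point rounding cannot produce a negative
    # value under the square root.
    var1 = max(sum1_sq - (sum1 ** 2) / n, 0.0)
    var2 = max(sum2_sq - (sum2 ** 2) / n, 0.0)
    denominator = sqrt(var1 * var2)

    if denominator == 0:
        # Assumption: treat a constant vector as uncorrelated with anything.
        return 1.0
    return 1.0 - numerator / denominator
```

Two blogs that use the same words in the same proportions score near 0 even if one blog is much longer than the other, which is why a correlation-based score suits word counts.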
We will also use the Zebo data to identify 'groups of things people want.'
Do not spend time generating the word count file. The data for your clustering script is here for you to download.
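As a starting point for reading the downloaded data, the sketch below parses a word count table into names, words, and count rows. It assumes a tab-separated layout in which the first row lists the words and each later row holds a blog name followed by its counts; confirm this against the actual file. The function name and the file name in the usage note are placeholders.

```python
def read_word_counts(lines):
    """Parse a tab-separated word-count table.

    Assumed layout (verify against the real file):
      row 0:  <label> TAB word1 TAB word2 ...
      row i:  <blog name> TAB count1 TAB count2 ...
    """
    rows = [line.rstrip("\r\n").split("\t") for line in lines if line.strip()]
    if not rows:
        raise ValueError("no data rows found")

    words = rows[0][1:]                      # column headers
    blog_names = [row[0] for row in rows[1:]]
    counts = [[float(x) for x in row[1:]] for row in rows[1:]]
    return blog_names, words, counts
```

Because the function accepts any iterable of lines, it can be called on an open file, for example `with open("word_counts.txt") as f: names, words, data = read_word_counts(f)`, where `word_counts.txt` stands in for whatever you named the download.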
Do not spend time parsing the Zebo data. The Zebo data is here to download.
For this assignment, focus on the clustering, not the JPG generation. We'll cover that in the next assignment.
In a nutshell, you're responsible for implementing clusters.py as detailed in our text. The tasks below will be discussed in class.
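For orientation, here is a rough sketch of the kind of bottom-up (agglomerative) clustering loop that clusters.py is built around. It is not the text's implementation: the class name, field names, the Euclidean stand-in distance, and the averaging merge rule are all assumptions chosen to keep the example short. Pass in whichever distance function the text specifies.

```python
from math import sqrt


def euclidean(v1, v2):
    """Stand-in distance for this sketch only."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))


class Cluster:
    """A tree node: a leaf holds one row, a branch holds two merged children."""

    def __init__(self, vec, left=None, right=None, distance=0.0, node_id=None):
        self.vec = vec
        self.left = left
        self.right = right
        self.distance = distance   # distance between the two merged children
        self.node_id = node_id     # row index for leaves, negative for branches


def hcluster(rows, distance=euclidean):
    """Repeatedly merge the two closest clusters until one tree remains."""
    if not rows:
        raise ValueError("need at least one row to cluster")
    clusters = [Cluster(list(row), node_id=i) for i, row in enumerate(rows)]
    next_branch_id = -1

    while len(clusters) > 1:
        # Find the closest pair (the first pair wins on ties).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = distance(clusters[i].vec, clusters[j].vec)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best

        # Assumption: the merged cluster sits at the average of its children.
        merged_vec = [(a + b) / 2.0
                      for a, b in zip(clusters[i].vec, clusters[j].vec)]
        merged = Cluster(merged_vec, left=clusters[i], right=clusters[j],
                         distance=d, node_id=next_branch_id)
        next_branch_id -= 1

        # Delete the higher index first so the lower index stays valid.
        del clusters[j]
        del clusters[i]
        clusters.append(merged)

    return clusters[0]


def leaf_ids(cluster):
    """Collect the original row indices beneath a cluster."""
    if cluster.left is None and cluster.right is None:
        return [cluster.node_id]
    return leaf_ids(cluster.left) + leaf_ids(cluster.right)
```

This version recomputes every pairwise distance on each pass, which is slow on a full blog data set; caching distances between pairs of cluster ids is a natural improvement once the basic loop works.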