Project 5: Clustering

Objective

Demonstrate your understanding of classic clustering algorithms by implementing one in the programming language of your choice.

Requirements

Create a subdirectory inside your repository called project05.

Using the programming language of your choice, implement one of the following clustering algorithms: K-means, hierarchical, or density-based. Your implementation should use one of the similarity metric functions you created in Project 4. You'll likely need to modify that function to "fit" this project, so once your clusterer is ready to measure proximity, I recommend that you simply copy and paste the code from your original similarity metric implementation and refactor from there.
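As one illustration of how a clusterer can take a proximity function as a plug-in, here is a minimal K-means sketch in Python. It is only one possible design, not the required one. The names kmeans, centroid, and euclidean are hypothetical, and euclidean is a stand-in for whichever Project 4 metric you choose to reuse. Seeding the centroids by random sampling is also just an assumption for the sake of the example.

```python
import random

def euclidean(a, b):
    """Stand-in for a Project 4 metric: straight-line distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def kmeans(points, k, distance=euclidean, max_iter=100, seed=0):
    """Return (centroids, assignments) after repeated assign/update passes."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: seed with k random data points
    assignments = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: distance(p, centroids[c])) for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the clustering is stable
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = centroid(members)
    return centroids, assignments

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignments = kmeans(points, k=2)
```

Because distance is a parameter, swapping in a different metric means passing a different function rather than editing the loop. Note that the mean-based update step pairs most naturally with Euclidean distance; other metrics may call for a different way of choosing a cluster representative.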

To develop, test and illustrate your implementation, create a simple program that demonstrates or "tests" your clusterings against the classic iris dataset. You can use a unit test suite if you wish, but simple "poor-man's testing" is fine for this assignment.

The design of your implementation is entirely up to you. Here is a csv version of iris (attributes are sepal length, sepal width, petal length, and petal width). Here is a sqlite3 database containing the same data.
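If you go the csv route, reading the data might look something like the Python sketch below. It assumes each row holds the four numeric attributes in the order listed above, optionally followed by a species label, and that any header row can be skipped because it fails to parse as numbers. Check the actual file before relying on either assumption. The function name load_iris and the sample rows are made up for illustration.

```python
import csv
import io

def load_iris(lines):
    """Parse rows into (features, label) pairs; skips blank or non-numeric rows."""
    records = []
    for row in csv.reader(lines):
        if len(row) < 4:
            continue  # blank or malformed line
        try:
            features = tuple(float(v) for v in row[:4])
        except ValueError:
            continue  # most likely a header row
        # Assumption: a fifth column, if present, is the species label.
        label = row[4].strip() if len(row) > 4 else None
        records.append((features, label))
    return records

# Two made-up rows standing in for the real file.
sample = io.StringIO(
    "sepal_length,sepal_width,petal_length,petal_width,species\n"
    "5.1,3.5,1.4,0.2,setosa\n"
    "7.0,3.2,4.7,1.4,versicolor\n"
)
records = load_iris(sample)
```

To read a real file, pass an open file object instead of the StringIO, for example open("iris.csv", newline=""), where the file name is whatever you saved the download as. Keeping the label separate from the features makes it easy to cluster on the measurements alone and compare against the species afterwards.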

Your "test" program should create the clusterings and evaluate them by calculating the sum of squared errors (SSE) for each cluster. Print a simple summary of the clusters that includes the number of members and the SSE for each. How does your output compare to results generated from Weka or KNIME?
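For reference, the SSE of a cluster is usually taken to be the sum, over its members, of the squared distance from each member to the cluster centroid. A rough Python sketch of the summary follows, assuming squared Euclidean distance and a list of cluster labels parallel to the list of points. The names centroid, sse, and summarize are hypothetical, and the three points are made up.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def sse(points):
    """Sum of squared Euclidean distances from each point to the centroid."""
    c = centroid(points)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in points)

def summarize(points, assignments):
    """Map each cluster label to (member count, SSE)."""
    clusters = {}
    for p, a in zip(points, assignments):
        clusters.setdefault(a, []).append(p)
    return {
        label: (len(members), sse(members))
        for label, members in sorted(clusters.items())
    }

# Made-up data: two points in cluster 0, one point in cluster 1.
points = [(1.0, 1.0), (1.0, 3.0), (8.0, 8.0)]
assignments = [0, 0, 1]
summary = summarize(points, assignments)
for label, (size, error) in summary.items():
    print(f"cluster {label}: {size} members, SSE = {error:.2f}")
```

Here cluster 0 has centroid (1, 2), so each of its two members contributes a squared distance of 1 and the SSE is 2.00, while the single-member cluster 1 has an SSE of 0.00. If your clusterer uses a non-Euclidean metric, decide explicitly whether to report SSE in that metric or in Euclidean terms, since tools such as Weka may differ from your choice.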

You'll likely want to commit after each sensible step (a day's work, a completed feature or class, etc.). Don't forget to push your work when finished. Get stuck? Check in on Piazza or drop by office hours.

Grading Criteria (1000 points)

You must have one clustering implementation working correctly.

You must have some sort of executable program/script or test suite that demonstrates your clustering implementation at work.

Due Date

This assignment is due by midnight on Wednesday, September 28.