Project 3: Data Exploration & Visualization

Objective

Gain experience using common data mining tools to quickly generate summary statistics that help paint a picture of a dataset. Gain an understanding of Processing as a flexible data visualization tool.

Requirements

This assignment has two parts.

Summary Statistics with Weka and Knime

Create a subdirectory inside your repository called project01/summary_stats.

Using both Weka and Knime, generate summary statistics for a dataset of your choice and capture a screenshot from each tool. I recommend the classic Iris, Weather, or Mushroom datasets, but you may use any dataset or sample you wish. If you need assistance finding data, start a discussion on Piazza.

If you cannot figure out how to view summary statistics in these tools, start a discussion on Piazza. Keep in mind, though, that both tools have basic documentation to get you started. This should not take long; if you spend more than an hour on it, seek assistance via Piazza.
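If you want a quick sanity check on the numbers in your screenshots, a rough sketch like the one below computes a count, minimum, maximum, and mean for one numeric column. It does not replace the Weka and Knime screenshots. The sample file and its five values are made up for illustration; substitute a column from whatever dataset you chose.

```shell
# Hypothetical sample file: a header row followed by one numeric value per line.
# These values are hand-typed for illustration, not taken from a real dataset.
data=$(mktemp)
printf 'value\n5.1\n4.9\n4.7\n4.6\n5.0\n' > "$data"

# Skip the header (NR > 1), then track count, sum, min, and max of column 1.
stats=$(awk -F, 'NR > 1 {
    v = $1 + 0
    if (n == 0 || v < min) min = v
    if (n == 0 || v > max) max = v
    sum += v
    n++
}
END { if (n > 0) printf "n=%d min=%.2f max=%.2f mean=%.2f", n, min, max, sum / n }' "$data")

echo "$stats"
rm -f "$data"
```

For a file with several comma-separated columns, change $1 to the column you want to summarize.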

Save two (or more) screenshots of these summary stats inside your summary_stats directory. Add, commit, and push:

git add project01/summary_stats
git commit -am "Your meaningful commit message."
git push

Data Visualization with Processing

While common data mining tools can be used to quickly generate traditional data visualizations, the best data miners have more powerful, flexible visualization tools at their disposal. This assignment is designed to introduce you to Processing as a tool to keep in your toolbox, to be used when you need to generate a specific data visualization. Why Processing? It is productive, provides you access to the entire Java ecosystem, allows for easy animation and interaction, and the "Processing model" has been implemented in many languages. If you learn Processing, you can use similar frameworks in other languages.

Your goal for this project is to experience how you might use Processing to provide different perspectives on your data. The dataset used is simple -- some "random" numbers chosen by humans. Rather than focus on the data, the point of this assignment is to learn a bit about Processing, and to see how much mileage you can get out of a small amount of code.

First, download and install Processing. Next, create a subdirectory in your repo called project01/visualization. Now download this boilerplate code and extract it to your visualization directory. Be sure to add and commit.

Important: I have added a .gitignore file so that Credentials.pde does not become part of your repository. Do not add Credentials.pde to your repo, or everyone will have access to your Google credentials.

Update the file Credentials.pde with your Google username and password.
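Before committing, it may be worth confirming that Git really is ignoring Credentials.pde. The sketch below demonstrates the check in a throwaway repository so that nothing touches your real one. It assumes the provided .gitignore contains a rule matching Credentials.pde (the actual contents of that file are not shown here), and sketch.pde is a hypothetical stand-in for the boilerplate files.

```shell
# Scratch demo in a throwaway repository; nothing here touches your real repo.
demo=$(mktemp -d)
git init -q "$demo"

mkdir -p "$demo/project01/visualization"
# ASSUMPTION: the provided .gitignore contains a rule like this one.
printf 'Credentials.pde\n' > "$demo/project01/visualization/.gitignore"
# Hypothetical file names standing in for the extracted boilerplate.
touch "$demo/project01/visualization/Credentials.pde" \
      "$demo/project01/visualization/sketch.pde"

# check-ignore prints the path if an ignore rule matches it, nothing otherwise.
ignored=$(git -C "$demo" check-ignore project01/visualization/Credentials.pde)
echo "ignored: $ignored"

# Stage the directory, then list what was actually staged.
git -C "$demo" add project01/visualization
staged=$(git -C "$demo" ls-files)
echo "$staged"
```

In your own repository, running git check-ignore -v Credentials.pde from the visualization directory should print the rule that matches. If it prints nothing, the file is not being ignored and should not be added.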

Work your way through this online tutorial and be sure to commit your code at every sensible iteration of your work. Note that the author suffers from semicolon-itis. While I have removed code formatting issues and semicolons from the provided codebase, you must be aware of this dreaded disease while reading the author's code in the tutorial.

Don't forget to push your work when finished.

Grading Criteria (500 points)

You must have at least two screenshots (one from Weka and one from Knime) that simply demonstrate summary stats generated by these tools.

You must complete the "random numbers" visualization tutorial and your repository must show a history of work. One or two commit messages saying "doing some work" is not sufficient.
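To review your own history before submitting, something like the following sketch may help. Both helper functions are hypothetical conveniences, not part of the assignment or of any grading tool; they only wrap standard git commands and must be run from inside your repository.

```shell
# Hypothetical helper: number of commits reachable from HEAD that touched a path.
commits_touching() {
    git rev-list --count HEAD -- "$1"
}

# Hypothetical helper: one line per commit that touched a path, newest first.
history_for() {
    git log --oneline -- "$1"
}

# Example usage, from the top of your repository:
#   commits_touching project01/visualization
#   history_for project01/visualization
```

A commit count alone says little about quality; reading the one-line messages is a better check that each commit describes a distinct step of the tutorial.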

Due Date

This assignment is due by midnight on Monday, September 12.