CSCI 568: Data Mining

Final Project

Due Friday, Dec 12 by 5PM.

Objective

The purpose of this project is to familiarize you with the data mining process by using a modern programming toolkit to apply a variety of data mining strategies.

Tools

This project uses Orange, a suite of data mining tools accessible through C++, Python, or GUI widgets.

Deliverables

This project will require you to do five things:

  1. Read and briefly summarize all documents on the reading list.
  2. Complete the tutorial steps and record relevant notes about the Orange Python API.
  3. Using the tutorial and documentation examples as a guide, complete your own data mining process against a dataset of your choosing.
  4. Submit complete documentation of items 1-3 above.
  5. Commit all example code to your repository.

Your grade is based on your documentation of this project and your code repository.

What should be in the documentation?

Reading List

A reading list is provided here, with targeted questions about each. Include your answers / summaries in the final documentation.

The Orange home page.

The Orange 'screenshots' page.

From Experimental Machine Learning to Interactive Data Mining.

Orange Widgets & Visual Programming, and Orange and Visual Programming.

Orange Widgets for Functional Genomics.

The Orange Tutorial

For each step in the tutorial below, unless otherwise noted, write a short summary of what you accomplished, complete the Python examples provided and commit your Python code to your repository.

Your Own Data Mining Process

After you've completed the tutorial steps above, you should have a good understanding of what tools are available to you in Orange. Now it's time to try some of these approaches on a dataset of your own choosing. For this part of the project, you must:

Your data mining process doesn't have to be perfect, or even yield incredibly interesting results; the important thing is the process. So don't be afraid to try something fun even if it may not yield amazing results.

Some resources to help you:

And here are some suggestions for sources of data:

Important: Be sure your process attempts to follow the general outline of acquire, parse, filter/preprocess, mine, and postprocess (represent, refine & interact). Some of these steps may be trivial (like acquire & parse) and others more evident (filter, mine).
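
To make the outline above concrete, here is a minimal sketch of the five stages as separate Python functions. It deliberately uses only the standard library rather than Orange, so none of it reflects the Orange API; the inline dataset, the column names (height, weight, label), and every function name are hypothetical and chosen only for illustration. The "mine" step is a simple nearest-centroid model, standing in for whatever technique you choose.

```python
import csv
import io
from collections import defaultdict

# Hypothetical inline dataset standing in for a downloaded file.
RAW_CSV = """height,weight,label
1.0,2.0,a
1.2,1.8,a
,2.1,a
5.0,6.0,b
5.2,5.8,b
"""


def acquire():
    """Acquire: return the raw data (a string here, instead of a file or URL)."""
    return RAW_CSV


def parse(raw):
    """Parse: turn raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(raw)))


def preprocess(rows):
    """Filter/preprocess: drop rows with missing values and convert numbers."""
    clean = []
    for row in rows:
        if all(row[key] != "" for key in ("height", "weight", "label")):
            point = (float(row["height"]), float(row["weight"]))
            clean.append((point, row["label"]))
    return clean


def mine(examples):
    """Mine: compute one centroid per class (a nearest-centroid model)."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for (x, y), label in examples:
        acc = sums[label]
        acc[0] += x
        acc[1] += y
        acc[2] += 1
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}


def classify(model, point):
    """Assign a point to the class whose centroid is closest."""
    def squared_distance(label):
        cx, cy = model[label]
        return (cx - point[0]) ** 2 + (cy - point[1]) ** 2
    return min(model, key=squared_distance)


def postprocess(model):
    """Postprocess: represent the model as readable text lines."""
    return ["%s: centroid=(%.2f, %.2f)" % (label, cx, cy)
            for label, (cx, cy) in sorted(model.items())]


model = mine(preprocess(parse(acquire())))
report = postprocess(model)
for line in report:
    print(line)
```

Keeping each stage in its own function makes it easy to document what you did at each step, and to swap the hand-written "mine" step for an Orange learner once you have worked through the tutorial.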