Applied Theory & Practice: Final Project
Objectives
Your goal with the final project is to:
- Demonstrate your ability to explore the data methodically, documenting your discoveries
- Properly define summary statistics
- Discover ways of transforming attribute values to simplify mining
- Detect anomalies (and decide what to do with them)
- Evaluate the data quality (and take appropriate action)
- Choose a particular mining strategy, given your understanding of the dataset and mining goal
- Execute your mining strategy, and document your process in a professional manner
Scenario
You've just been hired by Big Data International, which has harvested some ski-industry-related data. [here]
"I just know there's got to be patterns in the data," remarks the CTO, "but every time I start to try to mine the data I get nowhere. I mean, I just can't think of the right magic query!" You hold your tongue and envision the happy day of your impending paycheck.
The dataset is actually culled from real-world data. Some attributes have been obfuscated and patterns in multiple attribute subsets have been introduced. You can imagine that this data represents a person's rating, their score on a survey, the number of prizes they accepted, the number of punishments they received, and lastly a suite of binary attributes representing ski resorts they selected.
The pointy-headed CTO expects answers to the following:
- How might you describe the data? (summary stats of attributes, sets of attributes, correlations) Be as thorough as possible.
- Can you transform the data to simplify your mining approach?
- Are there problems with the data quality? If so, what problems? How will you handle them?
- Are there anomalies in the data? If so, can you identify them?
- Are there distinguished groups of data objects? If so, what characterizes these groups best?
- Are there associations across the ski resorts selected? If so, what are they?
Your analysis is not limited to these questions; they are mentioned here as a guide. What else can you discover?
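As a possible starting point for the description and anomaly questions above, the following minimal sketch computes summary statistics, a Pearson correlation, and IQR-based outlier flags using only the Python standard library. The attribute names (rating, survey_score) and all values are made-up placeholders, not taken from the actual dataset, and the 1.5 x IQR cutoff is a common convention rather than a requirement of this project.

```python
import math
import statistics


def summarize(values):
    """Return basic summary statistics for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "q1": q1,
        "median": statistics.median(values),
        "q3": q3,
        "max": max(values),
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical columns, invented purely for illustration.
rating = [1, 2, 3, 4, 5, 6, 7, 50]
survey_score = [2, 4, 6, 8, 10, 12, 14, 16]

if __name__ == "__main__":
    print(summarize(rating))
    print(pearson(rating, survey_score))
    print(iqr_outliers(rating))
```

Whether a flagged value is an error to remove or a genuine but unusual observation is a judgment you should document; the sketch only surfaces candidates.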
Grading Criteria
You will be evaluated on the quality of your final project documentation, which should be added to your portfolio. Your documentation should include topic areas describing:
- Exploration of the dataset
- Summary statistics
- Data preprocessing (cleansing, transformation)
- Discussion of mining strategy choice
- Description of your mining process
- Conclusions
You may engage in the following mining paths:
- A mining process using at least four strategies with a toolkit of your choice.
- A suite of eight different visualizations created from the data.
- Your own programmatic implementation of two mining strategies.
You may mix portions of these requirements if you wish, but please consult the instructor to make sure what you plan to submit is sufficient.
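If you choose to implement a mining strategy yourself, one possible shape for the resort-association question is sketched below. It computes support, confidence, and lift for every ordered pair of binary attributes. The resort names and rows are invented placeholders, and the min_support threshold is an arbitrary assumption; a full frequent-itemset method such as Apriori would extend this beyond pairs.

```python
from itertools import permutations


def pair_rules(rows, min_support=0.3):
    """Return (antecedent, consequent, support, confidence, lift) tuples.

    rows: list of dicts mapping attribute name -> 0 or 1.
    Only pairs whose joint support reaches min_support are kept.
    """
    n = len(rows)
    items = sorted(rows[0])
    # Fraction of rows in which each single item is selected.
    freq = {a: sum(r[a] for r in rows) / n for a in items}
    rules = []
    for a, b in permutations(items, 2):
        both = sum(1 for r in rows if r[a] and r[b]) / n
        if both < min_support or freq[a] == 0 or freq[b] == 0:
            continue
        confidence = both / freq[a]   # estimate of P(b | a)
        lift = confidence / freq[b]   # > 1 suggests positive association
        rules.append((a, b, both, confidence, lift))
    return rules


# Hypothetical binary resort selections, invented for illustration only.
rows = [
    {"resort_a": 1, "resort_b": 1, "resort_c": 0},
    {"resort_a": 1, "resort_b": 1, "resort_c": 0},
    {"resort_a": 1, "resort_b": 0, "resort_c": 1},
    {"resort_a": 0, "resort_b": 0, "resort_c": 1},
]

if __name__ == "__main__":
    for rule in pair_rules(rows):
        print(rule)
```

Lift is included alongside confidence because a high-confidence rule can simply reflect a resort that nearly everyone selects.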
Due December 17 @ 11:59PM. No exceptions.