Final Project: Mining Process with Real World Ski Resort Data Set

Project Goal

The ski resort data set was collected from the real world, so we can expect it to contain noise and anomalies caused by variance and errors in the data collection process. The goal of the project is to explore the data set methodically, separate out the noisy data through data pre-processing, and discover the underlying patterns using a suitable mining strategy. The full mining process and the analysis results help us understand and characterize the patterns inside the data set.

Ski Resort Data Set

The Ski Resort Data Set was provided in the Data Mining course by Yong Bakos. It is a raw data file in .csv format. I used WEKA to load the data file and saved the data set as final.arff, in WEKA's default ARFF file format, for further processing. It contains 989 data objects, and each object has 16 attributes. The data schema is shown below. All attributes are numeric or nominal. The attributes rating, Survey, Prize, and Punishment represent the overall assessment from a subject. The other attributes Aspen, Snowmass, ..., Eldora represent the different ski resorts a subject rates.
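For readers unfamiliar with ARFF, the CSV-to-ARFF step can be approximated in a few lines of Python. This is a minimal sketch and not WEKA's own converter: the helper names csv_to_arff and is_number, the relation name, and the three-row sample are all hypothetical, and the sketch assumes a column is numeric only when every non-missing value parses as a number.

```python
import csv
import io


def is_number(text):
    """Return True if text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def csv_to_arff(csv_text, relation="ski_resort"):
    """Convert CSV text (header row first) into ARFF text.

    A column is declared 'numeric' if all of its non-missing values are
    numbers; otherwise it is declared nominal with its distinct values.
    Empty cells are written as '?', the ARFF missing-value marker.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    for col, name in enumerate(header):
        values = [row[col] for row in data if row[col] not in ("", "?")]
        if values and all(is_number(v) for v in values):
            lines.append("@attribute %s numeric" % name)
        else:
            labels = ",".join(sorted(set(values)))
            lines.append("@attribute %s {%s}" % (name, labels))
    lines += ["", "@data"]
    for row in data:
        lines.append(",".join(v if v != "" else "?" for v in row))
    return "\n".join(lines) + "\n"


# Hypothetical three-row sample; the real file has 989 rows and 16 attributes.
sample = "rating,Survey,Loveland\n3,yes,1\n5,no,0\n4,yes,\n"
arff = csv_to_arff(sample)
print(arff)
```

The sketch does not quote nominal labels, so values containing spaces or special characters would need extra handling before WEKA could read the result.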

The attributes are visualized in the plot below. Four attributes are numeric, ten are binary, and two are nominal.

Data Mining Toolkit

The following data mining algorithms and tasks are performed in WEKA 3.6. WEKA is open source software that contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. At the time of writing, WEKA 3.6 is the latest stable version. I installed WEKA 3.6 on Windows XP with JDK 1.6.

Data Pre-Processing

I first looked at the data objects with missing values and removed the incomplete data from the data set. Then I used a density-based clustering algorithm to detect whether there are anomalies in the data set. If the clustering algorithm finds anomalous data points that lie far from most of the other data points, those points are probably noise and are removed from the data set as well.
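The idea of flagging points that lie far from dense regions can be sketched with the noise rule from DBSCAN. The text does not say which WEKA clusterer or parameters were used, so this is an illustration of the general technique only: the function name find_noise, the parameters eps and min_pts, and the toy points are all assumptions.

```python
import math


def find_noise(points, eps, min_pts):
    """Return indices of points that DBSCAN would label as noise.

    A point is a core point if at least min_pts points (itself included)
    lie within distance eps. A point is noise if it is not a core point
    and has no core point within distance eps.
    """
    neighbor_lists = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    core = {i for i, nbrs in enumerate(neighbor_lists) if len(nbrs) >= min_pts}
    return [
        i
        for i, nbrs in enumerate(neighbor_lists)
        if i not in core and not any(j in core for j in nbrs)
    ]


# Toy data: four points in a tight square plus one far-away point.
toy_points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
noise = find_noise(toy_points, eps=1.5, min_pts=3)
print(noise)
```

This brute-force version compares every pair of points, which is acceptable for a data set of roughly a thousand objects but would need a spatial index for much larger data.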

Remove Instances Containing Missing Value

The Explorer tool in WEKA shows useful summary statistics for each attribute, including the minimum, maximum, mean, standard deviation, the number of distinct values, and whether any values are missing. From these statistics it is straightforward to see that the Loveland and CrestedButte attributes hold unexpected values in certain data objects. Since Loveland and CrestedButte each represent a selected ski resort, we can infer that they should be binary attributes like the other ski resort attributes. The values "1-" and "Q" are therefore very likely data errors, and the objects containing them should be removed from the data set. In addition, the Silverton attribute has missing values in 42 data records, about 4% of the whole data set. I decided to remove the instances containing missing values as well, because we have no prior knowledge about the incomplete data records.
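The kind of per-attribute summary described above can be reproduced outside WEKA to show how stray values surface. The following is a minimal sketch with hypothetical column samples; the function name summarize is an assumption, and the sample standard deviation used here may not match WEKA's exact computation.

```python
import math


def summarize(values, missing="?"):
    """Summarize one attribute column given as a list of strings."""
    present = [v for v in values if v != missing]
    summary = {
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
    }
    try:
        numbers = [float(v) for v in present]
    except ValueError:
        # At least one value is not a number: report the distinct labels.
        summary["labels"] = sorted(set(present))
        return summary
    if numbers:
        mean = sum(numbers) / len(numbers)
        summary.update(min=min(numbers), max=max(numbers), mean=mean)
        if len(numbers) > 1:
            variance = sum((x - mean) ** 2 for x in numbers) / (len(numbers) - 1)
            summary["stddev"] = math.sqrt(variance)
    return summary


# Hypothetical samples: a binary column with a stray "1-" and one with a "?".
loveland = ["0", "1", "1-", "1", "0"]
silverton = ["0", "?", "1", "0"]
loveland_summary = summarize(loveland)
silverton_summary = summarize(silverton)
print(loveland_summary)
print(silverton_summary)
```

A binary attribute that reports three distinct labels, as the Loveland sample does here, is the signal that prompts a closer look at the offending value.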

Unclear "1-" value in Loveland Attribute
To analyze the data objects containing the "1-" value in the Loveland attribute, I used the filter filters.unsupervised.instance.SubsetByExpression to select the subset of "1-" data objects. 43 data objects contain the "1-" value, and they have the same value in every attribute. The visualized plot of the "1-" subset shows that they share the same distribution. Records this consistent were more likely introduced by one particular error than by genuine responses. Thus it is reasonable to remove the 43 "1-" data objects from the data set.
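The two checks in this step, selecting the rows with a suspicious value and confirming that they are identical in every attribute, can be sketched in plain Python. This does not use WEKA's filter; the helper names select_subset and all_identical and the toy rows are hypothetical.

```python
def select_subset(rows, attribute, value):
    """Return the rows whose given attribute equals the given value."""
    return [row for row in rows if row[attribute] == value]


def all_identical(rows):
    """Return True if every row has the same value in every attribute."""
    return len({tuple(sorted(row.items())) for row in rows}) <= 1


# Toy rows with three of the sixteen attributes; all values are made up.
toy_rows = [
    {"rating": "3", "Loveland": "1-", "Silverton": "0"},
    {"rating": "3", "Loveland": "1-", "Silverton": "0"},
    {"rating": "5", "Loveland": "1", "Silverton": "1"},
    {"rating": "2", "Loveland": "0", "Silverton": "0"},
]
suspicious = select_subset(toy_rows, "Loveland", "1-")
print(len(suspicious), all_identical(suspicious))
```

When the selected subset turns out to be a block of exact duplicates, that supports treating it as a systematic recording error.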
Unclear "Q" value in CrestedButte Attribute
I used the same filter to identify the subset of "Q" data objects, this time with the filter expression ATT10 is 'Q'. Here ATT10 refers to the CrestedButte attribute. Applied to the data set, the filter selects all the "Q" data objects. The subset has 42 data objects. As before, all of them have the same value in every attribute, as shown in the following visualized plot of the "Q" subset. For the same reason, I removed the 42 "Q" data objects from the data set as well.
Missing value "?" in Silverton Attribute
The same filter is used again to select the subset containing the missing value "?" in the Silverton attribute. The "?" subset contains 42 data objects, and all of them have the same value in every attribute, as shown in the following plot. I removed the "?" subset from the data set as well.
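Taken together, the three removals amount to dropping every row that matches any of the three conditions. The sketch below shows this with hypothetical toy rows and helper names. The text does not say whether the three subsets overlap, so the count at the end holds only under the assumption that they are disjoint.

```python
def is_flagged(row):
    """Return True if the row matches any of the three removal conditions."""
    return (
        row["Loveland"] == "1-"
        or row["CrestedButte"] == "Q"
        or row["Silverton"] == "?"
    )


def remove_flagged(rows):
    """Keep only the rows that match none of the removal conditions."""
    return [row for row in rows if not is_flagged(row)]


# Toy rows with made-up values for the three affected attributes.
toy_rows = [
    {"Loveland": "1-", "CrestedButte": "0", "Silverton": "0"},
    {"Loveland": "1", "CrestedButte": "Q", "Silverton": "1"},
    {"Loveland": "0", "CrestedButte": "1", "Silverton": "?"},
    {"Loveland": "1", "CrestedButte": "1", "Silverton": "0"},
    {"Loveland": "0", "CrestedButte": "0", "Silverton": "1"},
]
cleaned = remove_flagged(toy_rows)

# Assumption: the 43, 42, and 42 flagged objects are disjoint subsets.
remaining_if_disjoint = 989 - (43 + 42 + 42)
print(len(cleaned), remaining_if_disjoint)
```

If some objects fall into more than one subset, fewer than 127 objects are removed and more than 862 remain.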

Detect Anomalies

Mining Processes

Simple KMeans Clustering Analyze

Naive Bayesian Classifier Analyse

Decision Tree Classifier

Association Rule Mining

Conclusion