Decision Tree

Perhaps the simplest classifier to interpret and understand visually is the decision tree classifier. As the name implies, a decision tree classifier is a visual representation of a series of if-then statements arranged in a tree, displaying the sequence of decisions the classifier makes to split the data.

In stark contrast to how easy the tree is to understand visually is the complexity of training one. Dividing a dataset is simple when the attributes are nominal or binary, but the task becomes difficult when the attribute is a string, such as the name of a book. The second difficulty is the notion of "best": deciding which attribute divides the dataset best.
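
As a rough illustration of the splitting step, here is a minimal sketch assuming rows are stored as Python lists with the outcome in the last column; the function name divide_set and the row layout are illustrative assumptions, not from any particular library:

from typing import Any, List, Tuple

def divide_set(rows: List[list], column: int, value: Any) -> Tuple[list, list]:
    """Split rows into two groups based on the value in the given column.

    Numeric attributes are split with an inequality (>=); nominal or
    string attributes are split with an equality test.
    """
    if isinstance(value, (int, float)):
        matches = lambda row: row[column] >= value
    else:
        matches = lambda row: row[column] == value

    set1 = [row for row in rows if matches(row)]
    set2 = [row for row in rows if not matches(row)]
    return set1, set2

Training then amounts to trying each candidate (column, value) split and keeping the one whose resulting groups score best on a purity metric, such as the ones below.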

In order to measure the "goodness" of a split, several purity metrics can be used.

1. Gini impurity
The Gini impurity is the expected error rate if one of the outcomes in the dataset were randomly assigned to one of the items in the dataset. If every item belongs to the same outcome, the error rate is zero. Otherwise, find the probability of each possible outcome by dividing the number of times that outcome occurs by the total number of rows in the dataset. The impurity is the sum of the products of the probabilities of each pair of distinct outcomes (equivalently, one minus the sum of the squared probabilities). In the end, the impurity gives the overall chance that a row is randomly assigned to the wrong outcome, where a higher value indicates a worse split. A sketch of this calculation appears after this list.

2. Entropy
This is the amount of disorder in the dataset. It is determined by calculating the frequency with which each outcome appears in the dataset (the number of times it appears divided by the total number of rows), denoted p(i) for an outcome i. Entropy is then the negative sum over all outcomes of p(i) times the log of p(i). The greater the disorder in the dataset, the higher the entropy. A sketch of this calculation also follows the list.
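
As a minimal sketch of the Gini impurity calculation, again assuming each row stores its outcome in the last column (the row layout and the function name gini_impurity are assumptions for illustration):

from collections import Counter
from typing import List

def gini_impurity(rows: List[list]) -> float:
    """Return the probability that a randomly chosen row would be
    mislabeled if labels were assigned at random according to the
    outcome frequencies in `rows`."""
    total = len(rows)
    if total == 0:
        return 0.0
    counts = Counter(row[-1] for row in rows)
    impurity = 0.0
    for outcome1, count1 in counts.items():
        p1 = count1 / total
        for outcome2, count2 in counts.items():
            if outcome1 == outcome2:
                continue
            # Chance of drawing outcome1 but assigning outcome2 at random.
            impurity += p1 * (count2 / total)
    return impurity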
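
A similar sketch of the entropy calculation, under the same assumption about the row layout:

from collections import Counter
from math import log2
from typing import List

def entropy(rows: List[list]) -> float:
    """Return the entropy of the outcome distribution in `rows`:
    the negative sum of p(i) * log2(p(i)) over all outcomes i."""
    total = len(rows)
    if total == 0:
        return 0.0
    counts = Counter(row[-1] for row in rows)
    return -sum((c / total) * log2(c / total) for c in counts.values())

For example, a set whose outcomes are ['A', 'A', 'B', 'B'] has a Gini impurity of 0.5 and an entropy of 1.0, while a set containing only one outcome scores zero on both metrics.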

The main issues with this type of classifier are that decision trees are only effective when the number of possible results is severely limited. As the number of possible results grows, so does the number of leaves on the tree. The second issue is that the nodes can only be split via inequalities (less than, greater than). Also, a decision tree is a bad choice when the dataset contains several attributes consisting only of ones and zeroes.

Next: K-Nearest Neighbors
Back to: Bayesian Classifier