Decision Tree Classifier

Classification and Decision Tree Classifier Introduction

The classification technique is a systematic approach to building classification models from an input data set. For example, decision tree classifiers, rule-based classifiers, neural networks, support vector machines, and naive Bayes classifiers are different techniques for solving a classification problem. Each technique adopts a learning algorithm to identify a model that best fits the relationship between the attribute set and the class label of the input data. Therefore, a key objective of the learning algorithm is to build a predictive model that accurately predicts the class labels of previously unknown records.

Decision Tree Classifier is a simple and widely used classification technique. It applies a straightforward idea to solve the classification problem. Decision Tree Classifier poses a series of carefully crafted questions about the attributes of the test record. Each time it receives an answer, a follow-up question is asked until a conclusion about the class label of the record is reached.

Decision Tree Based Method

The decision tree classifier organizes a series of test questions and conditions in a tree structure. The following figure [1] shows an example decision tree for predicting whether a person cheats. In the decision tree, the root and internal nodes contain attribute test conditions to separate records that have different characteristics. Each terminal (leaf) node is assigned a class label, Yes or No.

Once the decision tree has been constructed, classifying a test record is straightforward. Starting from the root node, we apply the test condition to the record and follow the appropriate branch based on the outcome of the test. This leads us either to another internal node, for which a new test condition is applied, or to a leaf node. When we reach a leaf node, the class label associated with it is assigned to the record. As shown in the following figure [1], tracing the path in the decision tree for the test record terminates at a leaf node labeled No, which becomes the predicted class label.
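To make the traversal concrete, here is a minimal sketch of such a classification routine. The node representation (plain dicts with 'attribute', 'branches', and 'label' keys) and the attribute names are hypothetical, invented for illustration; the decisionnode class used in the code from [2] later in this article plays the same role.

def classify(record, node):
    # A leaf node carries the final class label.
    if 'label' in node:
        return node['label']
    # An internal node tests one attribute and follows the matching branch.
    outcome = record[node['attribute']]
    return classify(record, node['branches'][outcome])

# Example: a tiny hypothetical tree with two attribute tests.
tree = {
    'attribute': 'refund',
    'branches': {
        'yes': {'label': 'No'},
        'no': {
            'attribute': 'marital_status',
            'branches': {
                'single': {'label': 'Yes'},
                'married': {'label': 'No'},
            },
        },
    },
}

print(classify({'refund': 'no', 'marital_status': 'married'}, tree))  # -> 'No'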

Build A Decision Tree

Building an optimal decision tree is the key problem in decision tree classifiers. In general, many decision trees can be constructed from a given set of attributes. While some of the trees are more accurate than others, finding the optimal tree is computationally infeasible because of the exponential size of the search space.

However, various efficient algorithms have been developed to construct a reasonably accurate, albeit suboptimal, decision tree in a reasonable amount of time. These algorithms usually employ a greedy strategy that grows a decision tree by making a series of locally optimal decisions about which attribute to use for partitioning the data. For example, Hunt's algorithm, ID3, C4.5, CART, and SPRINT are greedy decision tree induction algorithms.

Hunt's Algorithm

Hunt's algorithm grows a decision tree in a recursive fashion by partitioning the training records into successively purer subsets. Let Dt be the set of training records that reach a node t. The general recursive procedure is defined as follows [1]:

  1. If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
  2. If Dt is an empty set, then t is a leaf node labeled by the default class, yd.
  3. If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets.

The procedure is applied recursively to each subset until all the records in a subset belong to the same class. Hunt's algorithm assumes that each combination of attribute values has a unique class label during the procedure. If all the records associated with Dt have identical attribute values except for the class label, then it is not possible to split these records any further. In this case, the node is declared a leaf node with the same class label as the majority class of the training records associated with it. A sketch of the procedure in code is given below.
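The following is a minimal Python sketch of this recursion, not the exact formulation in [1]. It assumes records are tuples whose last element is the class label, and it uses a hypothetical choose_attribute_test helper to stand in for the attribute selection step discussed in the next section.

from collections import Counter

def hunt(records, choose_attribute_test, default_class=None):
    # Case 2: Dt is empty -> leaf labeled with the default class.
    if not records:
        return {'label': default_class}

    labels = [r[-1] for r in records]
    majority = Counter(labels).most_common(1)[0][0]

    # Case 1: all records in Dt belong to the same class -> leaf with that class.
    if len(set(labels)) == 1:
        return {'label': labels[0]}

    # Records identical in every attribute except the label cannot be split
    # any further -> leaf labeled with the majority class.
    if len(set(r[:-1] for r in records)) == 1:
        return {'label': majority}

    # Case 3: choose an attribute test and recurse on each resulting subset.
    # choose_attribute_test is a stand-in for the selection step; it returns
    # (test_description, {outcome: subset_of_records}).
    test, partitions = choose_attribute_test(records)
    return {'test': test,
            'branches': {outcome: hunt(subset, choose_attribute_test, majority)
                         for outcome, subset in partitions.items()}}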

Determine the Best Attribute Test Condition

The decision tree induction algorithm must provide a method for specifying the test condition for different attribute types, as well as an objective measure for evaluating the goodness of each test condition.

First, the specification of an attribute test condition and its corresponding outcomes depends on the attribute type. We can perform a two-way or multi-way split, and discretize or group attribute values as needed. Binary attributes lead to two-way split test conditions. For nominal attributes with many values, the test condition can be expressed as a multi-way split on each distinct value, or as a two-way split obtained by grouping the attribute values into two subsets. Similarly, ordinal attributes can also produce binary or multi-way splits, as long as the grouping does not violate the order property of the attribute values. For continuous attributes, the test condition can be expressed as a comparison test with two outcomes or as a range query; alternatively, we can discretize the continuous values into a nominal attribute and then perform a two-way or multi-way split. The sketch below illustrates these cases.
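As an illustration only (the attribute names, values, and thresholds here are made up, not taken from [1]), the following sketch shows how each kind of test condition might be expressed over a single record:

# Hypothetical record layout: a dict mapping attribute name -> value.
record = {'home_owner': 'yes', 'marital_status': 'single',
          'shirt_size': 'medium', 'annual_income': 85.0}

# Binary attribute: a two-way split.
is_home_owner = record['home_owner'] == 'yes'

# Nominal attribute: multi-way split (one branch per distinct value) ...
branch = record['marital_status']            # 'single' / 'married' / 'divorced'
# ... or a two-way split by grouping the values into two subsets.
in_group = record['marital_status'] in {'single', 'divorced'}

# Ordinal attribute: grouping must respect the order of the values.
size_order = ['small', 'medium', 'large', 'extra_large']
is_medium_or_smaller = size_order.index(record['shirt_size']) <= size_order.index('medium')

# Continuous attribute: comparison test with two outcomes ...
high_income = record['annual_income'] >= 80.0
# ... or discretization into ranges followed by a multi-way split.
income_bucket = ('low' if record['annual_income'] < 40.0
                 else 'medium' if record['annual_income'] < 80.0
                 else 'high')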

Since there are many ways to specify test conditions from a given training set, we need a measure to determine the best way to split the records. The goal is to find the test condition that produces the most homogeneous class distribution in the child nodes, that is, the split that most improves the purity of the nodes. The larger the degree of purity, the better the class distribution.

To determine how well a test condition performs, we compare the degree of impurity of the parent node before splitting with the degree of impurity of the child nodes after splitting. The larger their difference, the better the test condition. Common measures of node impurity include entropy, the Gini index, and the classification error.
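As a minimal sketch (assuming, as in the code from [2] shown later, that each row is a list whose last element is the class label), entropy and the Gini index can be computed from the class proportions of a set of rows. This entropy function is compatible in spirit with the one the buildtree code below expects, though it is not a verbatim copy of the version in [2].

from math import log2

def class_proportions(rows):
    # Fraction of rows in each class; the class label is the last column.
    counts = {}
    for row in rows:
        counts[row[-1]] = counts.get(row[-1], 0) + 1
    return [c / len(rows) for c in counts.values()]

def entropy(rows):
    # Entropy: sum of -p * log2(p) over the class proportions.
    return sum(-p * log2(p) for p in class_proportions(rows))

def gini(rows):
    # Gini index: 1 - sum of p^2 over the class proportions.
    return 1.0 - sum(p * p for p in class_proportions(rows))

# A pure node has impurity 0; a 50/50 node is maximally impure.
print(entropy([['a', 'Yes'], ['b', 'Yes']]))   # 0.0
print(entropy([['a', 'Yes'], ['b', 'No']]))    # 1.0
print(gini([['a', 'Yes'], ['b', 'No']]))       # 0.5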

Stop the Split Procedure

A stopping condition is also needed to terminate the tree-growing process. A possible strategy is to continue expanding a node until either all the records belong to the same class or all the records have identical attribute values. Although these are sufficient conditions to stop the decision tree induction algorithm, some algorithms also apply other criteria to terminate the tree-growing procedure earlier.

Algorithm for Decision Tree Induction

The decision tree induction algorithm works by recursively selecting the best attribute to split the data and expanding the leaf nodes of the tree until the stopping criterion is met. The choice of the best split test condition is determined by comparing the impurity of the child nodes, and also depends on which impurity measure is used. After building the decision tree, a tree-pruning step can be performed to reduce the size of the decision tree. Decision trees that are too large are susceptible to a phenomenon known as overfitting. Pruning helps by trimming the branches of the initial tree in a way that improves the generalization capability of the decision tree.

Here is an example recursive function [2] that builds the tree by choosing the best dividing criterion for the given data set. It is called with a list of rows and then loops through every column (except the last one, which holds the result), finds every possible value for that column, and divides the data set into two new subsets. It calculates the weighted-average entropy for every pair of new subsets by multiplying each set's entropy by the fraction of the items that ended up in that set, and remembers which pair has the lowest entropy. If the best pair of subsets doesn't have a lower weighted-average entropy than the current set, that branch ends and the counts of the possible outcomes are stored. Otherwise, buildtree is called on each set, and the results of those calls are attached to the True and False branches of the node, eventually constructing an entire tree.

   
def buildtree(rows, scoref=entropy):
    if len(rows) == 0: return decisionnode()
    current_score = scoref(rows)

    # Set up some variables to track the best criteria
    best_gain = 0.0
    best_criteria = None
    best_sets = None

    column_count = len(rows[0]) - 1
    for col in range(0, column_count):
        # Generate the list of different values in
        # this column
        column_values = {}
        for row in rows:
            column_values[row[col]] = 1
        # Now try dividing the rows up for each value
        # in this column
        for value in column_values.keys():
            (set1, set2) = divideset(rows, col, value)

            # Information gain
            p = float(len(set1)) / len(rows)
            gain = current_score - p * scoref(set1) - (1 - p) * scoref(set2)
            if gain > best_gain and len(set1) > 0 and len(set2) > 0:
                best_gain = gain
                best_criteria = (col, value)
                best_sets = (set1, set2)

    # Create the subbranches
    if best_gain > 0:
        trueBranch = buildtree(best_sets[0])
        falseBranch = buildtree(best_sets[1])
        return decisionnode(col=best_criteria[0], value=best_criteria[1],
                            tb=trueBranch, fb=falseBranch)
    else:
        return decisionnode(results=uniquecounts(rows))
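Note that buildtree relies on a few helpers defined elsewhere in [2]: decisionnode (the tree node class), divideset (splits rows on a column and value), uniquecounts (counts class labels, assumed to be in the last column), and an entropy scoring function like the one sketched earlier. The following is an approximate sketch of those helpers, adapted from [2]; treat it as illustrative rather than a verbatim copy.

class decisionnode:
    # col/value: the attribute test; tb/fb: true/false branches;
    # results: class counts stored at a leaf node.
    def __init__(self, col=-1, value=None, results=None, tb=None, fb=None):
        self.col = col
        self.value = value
        self.results = results
        self.tb = tb
        self.fb = fb

def divideset(rows, column, value):
    # Numeric columns split on >= value; other columns split on == value.
    if isinstance(value, (int, float)):
        split_function = lambda row: row[column] >= value
    else:
        split_function = lambda row: row[column] == value
    set1 = [row for row in rows if split_function(row)]
    set2 = [row for row in rows if not split_function(row)]
    return (set1, set2)

def uniquecounts(rows):
    # Count how many rows fall into each class (last column).
    results = {}
    for row in rows:
        results[row[-1]] = results.get(row[-1], 0) + 1
    return results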
   

The next example function [2] prunes the built decision tree. Pruning involves checking pairs of nodes that have a common parent to see if merging them would increase the entropy by less than a specified threshold. If so, the leaves are merged into a single node with all the possible outcomes. This helps avoid overfitting and stops the tree from making predictions that are more confident than what can really be gleaned from the data.

When the prune function is called on the root node, it traverses all the way down the tree to the nodes that only have leaf nodes as children. It creates a combined list of results from both of the leaves and tests the entropy. If the change in entropy is less than the mingain parameter, the leaves are deleted and all their results are moved to their parent node. The combined node then becomes a possible candidate for deletion and merging with another node.

def prune(tree, mingain):
    # If the branches aren't leaves, then prune them
    if tree.tb.results == None:
        prune(tree.tb, mingain)
    if tree.fb.results == None:
        prune(tree.fb, mingain)

    # If both the subbranches are now leaves, see if they
    # should be merged
    if tree.tb.results != None and tree.fb.results != None:
        # Build a combined dataset
        tb, fb = [], []
        for v, c in tree.tb.results.items():
            tb += [[v]] * c
        for v, c in tree.fb.results.items():
            fb += [[v]] * c

        # Test the reduction in entropy
        delta = entropy(tb + fb) - (entropy(tb) + entropy(fb)) / 2

        if delta < mingain:
            # Merge the branches into a single leaf
            tree.tb, tree.fb = None, None
            tree.results = uniquecounts(tb + fb)
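Under the same assumptions as above (rows whose last column is the class label, and the helper functions sketched earlier), building and then pruning a tree might look like the following; the tiny data set is made up purely for illustration:

# A made-up training set: [referrer, pages_viewed, class_label]
my_data = [
    ['google',   21, 'Premium'],
    ['google',   18, 'Premium'],
    ['direct',    5, 'None'],
    ['direct',    2, 'None'],
    ['google',    3, 'Basic'],
    ['slashdot',  4, 'Basic'],
]

tree = buildtree(my_data)   # grow the full tree
prune(tree, 0.1)            # mild pruning: merge only near-identical leaves
prune(tree, 1.0)            # aggressive pruning: merge unless the gain is large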

Conclusion

Finding an optimal decision tree is an NP-complete problem. Many decision tree algorithms employ a heuristic-based approach or greedy strategy to guide their search in the vast hypothesis space. Decision tree construction techniques are generally computationally inexpensive, making it possible to quickly construct models even when the training set size is very large. Furthermore, once a decision tree has been built, classifying a test record is extremely fast.

References

[1] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Introduction to Data Mining, Addison Wesley.

[2] Toby Segaran, Programming Collective Intelligence, First Edition, O'Reilly Media, Inc.