K-Nearest Neighbors

At the highest level, the k-nearest neighbors algorithm takes a set of items whose class labels are known and compares a new item against that set. The algorithm finds the items in the data set that are most similar to the new item (see "Similarity Metrics") and combines the values of those neighbors to predict the class label. The number of neighbors used to make this prediction is defined as k, hence the name k-nearest neighbors.
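To make the idea concrete, here is a minimal sketch in Python. The function names, the use of Euclidean distance, and the majority-vote rule for combining neighbor labels are illustrative assumptions, not details prescribed by the text above.

```python
from collections import Counter
import math


def euclidean(a, b):
    # Straight-line distance between two equal-length feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, new_item, k=3):
    # train: a list of (features, label) pairs whose class labels are known.
    # Sort the labeled items by distance to the new item and keep the k closest.
    neighbors = sorted(train, key=lambda pair: euclidean(pair[0], new_item))[:k]
    # Predict by majority vote over the k neighbors' labels (the classification
    # analogue of averaging the neighbor values).
    labels = [label for _, label in neighbors]
    return Counter(labels).most_common(1)[0][0]


# Example: the two nearby "A" items outvote the distant "B" item.
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.0), "B")]
print(knn_predict(train, (1.1, 0.9), k=3))  # -> "A"
```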

There are some issues associated with this algorithm. First, the value chosen for k strongly affects the predicted class label. If k is too small, the prediction becomes overly sensitive to noise in the few closest items; if k is too large, accuracy drops because distant, less relevant items are included in the vote. Finding the right k is an optimization problem, and that issue is outside the scope of this discussion of the k-nearest neighbors classifier.
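Although tuning k is out of scope here, one common way to treat it as an optimization problem is a simple search over candidate values scored on held-out data. The sketch below reuses the hypothetical `knn_predict` function from the previous example; the candidate values of k are arbitrary defaults.

```python
def choose_k(train, validation, candidate_ks=(1, 3, 5, 7, 9)):
    # Evaluate each candidate k on a held-out validation set and keep the
    # value with the highest classification accuracy.
    best_k, best_accuracy = candidate_ks[0], -1.0
    for k in candidate_ks:
        correct = sum(
            1
            for features, label in validation
            if knn_predict(train, features, k=k) == label
        )
        accuracy = correct / len(validation)
        if accuracy > best_accuracy:
            best_k, best_accuracy = k, accuracy
    return best_k
```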

Another issue arises when the neighbors of a given test item are "too far" away. There are several methods for dealing with this scenario; the primary one is to weight the neighbors according to their distance. Two common weight functions are described below and sketched in code after the list.

1. Inverse function. This method takes the weight to be the inverse of the distance. For cases where items are nearly identical to the query (distance close to zero), a very small constant can be added to the distance before taking the inverse. Note that the inverse function applies heavy weights to items in close proximity and weights approximately equal to zero to far-away items; it is also heavily affected by noise in the data set.

2. Subtraction function. The subtraction function subtracts the distance from some non-negative constant and uses this value as the weight. The problem with this function is that the weight eventually reaches zero, so sufficiently distant items are ignored entirely.
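The two weight functions above might look like the following sketch, which reuses the hypothetical `euclidean` helper from the first example; the constants and defaults are illustrative choices, not values from the text.

```python
def inverse_weight(dist, const=0.1):
    # Inverse weighting: adding a small constant keeps nearly identical items
    # (distance close to zero) from receiving enormous weights.
    return 1.0 / (dist + const)


def subtraction_weight(dist, const=1.0):
    # Subtraction weighting: once the distance exceeds the constant, the
    # weight bottoms out at zero and that neighbor is effectively ignored.
    return max(0.0, const - dist)


def weighted_knn_predict(train, new_item, k=3, weight=inverse_weight):
    # Accumulate each label's total weight over the k nearest neighbors
    # and return the label with the largest weighted vote.
    neighbors = sorted(train, key=lambda pair: euclidean(pair[0], new_item))[:k]
    votes = {}
    for features, label in neighbors:
        votes[label] = votes.get(label, 0.0) + weight(euclidean(features, new_item))
    return max(votes, key=votes.get)
```

Passing `weight=subtraction_weight` instead of the default shows the trade-off directly: distant neighbors stop contributing altogether rather than contributing a small amount.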
