Discussion of Similarity Metrics

Similarity between data objects can be measured in several ways. The first is distance. Because a machine sees a data object only as a set of attribute values, the distance between two objects is computed from the differences between their corresponding attributes; the Euclidean Distance, for example, is the square root of the sum of the squared per-attribute differences. Another approach measures how the attributes of the two objects vary together relative to each object's mean attribute value. This measure of similarity is the Pearson Correlation coefficient.
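The two measures described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes each data object is an equal-length list of numbers, and the function names `euclidean_distance` and `pearson_correlation` are chosen here for illustration only. Returning 0.0 when an object has no variation is also an assumption, since the correlation is undefined in that case.

```python
import math

def euclidean_distance(a, b):
    """Square root of the summed squared per-attribute differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson_correlation(a, b):
    """Correlation of the attributes' deviations from each object's mean."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    dev_a = [x - mean_a for x in a]
    dev_b = [y - mean_b for y in b]
    covariance = sum(da * db for da, db in zip(dev_a, dev_b))
    scale = math.sqrt(sum(da * da for da in dev_a) * sum(db * db for db in dev_b))
    if scale == 0:
        # Assumption: treat a constant object as uncorrelated (undefined case).
        return 0.0
    return covariance / scale

p = [1.0, 2.0, 3.0]
q = [2.0, 4.0, 6.0]
print(euclidean_distance(p, q))   # distance grows with the gap between values
print(pearson_correlation(p, q))  # q rises and falls exactly as p does
```

Note that `p` and `q` are far apart by distance yet perfectly correlated, which is why the two measures can disagree about how similar two objects are.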

Data objects are not always groups of numbers; their attributes may instead be boolean values. In that case, a better metric is the ratio of the number of matching attributes to the total number of attributes, which is the idea behind the Jaccard Coefficient. The Jaccard Coefficient counts only attributes that are present in at least one of the two objects, so attributes absent from both are ignored.
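As a rough sketch of this ratio, assuming each data object is an equal-length list of booleans (the function name `jaccard_coefficient` is illustrative, and returning 1.0 when both objects are entirely absent is a convention assumed here, not something the text specifies):

```python
def jaccard_coefficient(a, b):
    """Attributes present in both objects divided by attributes present in either."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    if either == 0:
        # Assumption: two objects with no attributes present count as identical.
        return 1.0
    return both / either

a = [True, True, False, False]
b = [True, False, True, False]
print(jaccard_coefficient(a, b))  # 1 shared attribute out of 3 present in either
```

The last position, absent from both objects, does not affect the result.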

Yet another case arises when the data objects contain too many attributes to analyze every dimension (as required by the distance-based family of similarity metrics) and the data is asymmetric (more values are absent than present) but holds values that are not simply boolean, so the Jaccard Coefficient is a poor fit. In this event, the cosine similarity may be used. Document comparison is a typical example: each document is represented by its word frequencies, and the normalized dot product of the two frequency vectors serves as the measure of similarity.
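The document-comparison example might look like the following sketch. It assumes documents are plain strings split on whitespace and lowercased, which is a simplification chosen here; the function name `cosine_similarity` and the zero result for an empty document are likewise assumptions.

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Normalized dot product of the two documents' word-frequency vectors."""
    freq_a = Counter(doc_a.lower().split())
    freq_b = Counter(doc_b.lower().split())
    # Only words present in both documents contribute to the dot product.
    dot = sum(freq_a[w] * freq_b[w] for w in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        # Assumption: an empty document is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)

print(cosine_similarity("the cat sat", "the cat ran"))
print(cosine_similarity("the cat sat", "dogs bark loudly"))  # no shared words
```

Because absent words contribute nothing to the dot product, only the words that actually occur need to be stored, which suits data with many more absent values than present ones.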

More detail is included on their respective pages:

  1. Euclidean Distance
  2. Pearson Correlation
  3. Jaccard/Tanimoto Similarity
  4. Cosine Similarity
