Discussion of Similarity Metrics

Jaccard / Tanimoto Coefficient

Analysis

In some case, each attribute is binary such that each bit represents the absence of presence of a characteristic, thus, it is better to determine the similarity via the overlap, or intersection, of the sets.

Simply put, the Tanimoto Coefficient uses the ratio of the intersecting set to the union set as the measure of similarity. Represented as a mathematical equation:



In this equation, N represents the number of attributes in each object (a,b). C in this case is the intersection set.

Python Implementation

# Inputs: two lists
# Output: the Tanimoto Coefficient
def tanimoto (list1, list2):
  intersection = [common_item for common_item in list1 if common_item in list2]
  return float(len(c))/(len(a) + len(b) - len(c))
  

References

The previous content is based on Chapter 3 of the following book:
Segaran, Toby. Programming Collective Intelligence: Building Smart Web 2.0 Applications. Sebastopol, CA: O'Reilly Media, 2007.


Next: Cosine Similarity
Back to: Pearson Correlation