Data Mining Portfolio

Similarity Techniques

Definition

Similarity can be roughly described as a measure of how alike two or more objects are. It is closely related to the numerical distance between data objects, and is typically expressed as a value between 0 (not similar at all) and 1 (completely similar). Depending on the similarity metric used, the triangle inequality may or may not hold between objects. More generally, two properties must be maintained for a similarity measure: its value must fall within the range of 0 to 1, and it must be symmetric. Symmetry states that for all x and for all y, the similarity of x and y must be the same as the similarity of y and x.
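The two properties above can be checked mechanically for any candidate measure. The following is a minimal sketch, not part of the original portfolio: the helper name check_similarity_properties and the toy measure inverse_gap are hypothetical and chosen only for illustration.

```python
from itertools import product


def check_similarity_properties(similarity, objects, tolerance=1e-9):
    """Return True if `similarity` stays in [0, 1] and is symmetric on `objects`."""
    for x, y in product(objects, repeat=2):
        s_xy = similarity(x, y)
        s_yx = similarity(y, x)
        # Range property: every score must lie between 0 and 1
        if not (-tolerance <= s_xy <= 1.0 + tolerance):
            return False
        # Symmetry property: sim(x, y) must equal sim(y, x)
        if abs(s_xy - s_yx) > tolerance:
            return False
    return True


# Hypothetical similarity used only for illustration:
# 1 when the numbers are equal, shrinking toward 0 as they move apart.
def inverse_gap(x, y):
    return 1.0 / (1.0 + abs(x - y))


print(check_similarity_properties(inverse_gap, [0.0, 1.5, 4.0]))
```

A measure such as a plain difference (x - y) would fail both checks, since it can be negative and changes sign when the arguments are swapped.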

Similarity Metrics

Euclidean Distance

Euclidean Distance between a pair of objects refers to the straight-line (metric) distance between the objects. This value is found by taking the square root of the sum of squared differences between each of their attributes. For example, if there are two objects, A and B, with attributes x, y, and z, to determine the Euclidean distance between the two one need only:

1. Find the differences between each pair of attributes:

(x_A - x_B)
(y_A - y_B)
(z_A - z_B)

2. Square the differences between each pair of attributes:

(x_A - x_B)²
(y_A - y_B)²
(z_A - z_B)²

3. Sum all the squared values from step 2:

Sum of Squared Differences = (x_A - x_B)² + (y_A - y_B)² + (z_A - z_B)²

4. Take the square root of the sum from step 3:

Euclidean Distance = sqrt( Sum of Squared Differences ) = sqrt( (x_A - x_B)² + (y_A - y_B)² + (z_A - z_B)² )
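The steps above can be sketched directly in Python. This is a minimal illustration, assuming each object is given as a plain sequence of numeric attributes; the function name euclidean and the sample values are made up for the example.

```python
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two equal-length attribute sequences."""
    if len(a) != len(b):
        raise ValueError("objects must have the same number of attributes")
    # Take the difference of each attribute pair and square it
    squared_differences = [(p - q) ** 2 for p, q in zip(a, b)]
    # Sum the squares, then take the square root
    return sqrt(sum(squared_differences))


object_a = (1.0, 2.0, 3.0)  # attributes x, y, z of A (made-up values)
object_b = (4.0, 6.0, 3.0)  # attributes x, y, z of B (made-up values)
print(euclidean(object_a, object_b))  # differences 3, 4, 0 -> sqrt(25) = 5.0
```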

# This is an implementation of a Euclidean Distance function in Python
# as guided by the Programming Collective Intelligence book by Toby Segaran

from math import sqrt

# Returns a distance-based similarity score between the movie preferences of
# two people
def euclidean_distance(movie_preferences, person1, person2):

    # Gets the list of shared items. Shared items are the items that
    # both people have rated
    shared_items = [item for item in movie_preferences[person1]
                    if item in movie_preferences[person2]]

    # If they have no ratings in common, return 0
    if len(shared_items) == 0:
        return 0

    # Add up the squares of all the differences
    sum_of_squares = sum(pow(movie_preferences[person1][item] - movie_preferences[person2][item], 2)
                         for item in shared_items)

    # Convert the distance into a similarity: identical ratings give 1,
    # and the score approaches 0 as the distance grows
    return 1 / (1 + sqrt(sum_of_squares))
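The function above returns 1 / (1 + distance) rather than the raw distance, which turns a distance (0 for identical ratings, unbounded above) into a score in the 0 to 1 range expected of a similarity. A small sketch of just that mapping follows; the helper name distance_to_similarity is hypothetical.

```python
def distance_to_similarity(distance):
    """Map a non-negative distance to a similarity score in (0, 1]."""
    if distance < 0:
        raise ValueError("a distance cannot be negative")
    # Distance 0 gives 1.0; the score shrinks toward 0 as distance grows
    return 1.0 / (1.0 + distance)


for d in (0.0, 1.0, 9.0):
    print(d, distance_to_similarity(d))
```

Adding 1 to the denominator also avoids a division by zero when the two people have identical ratings.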

Pearson's Correlation Coefficient

Correlation is the measure of the linear relationship between the attributes of two objects. Pearson's correlation coefficient is one such measure between two objects, A and B, such that:

Pearson's Correlation Coefficient = covariance(A, B) / ( standardDeviation(A) * standardDeviation(B) )
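The formula above can be computed literally from its parts. The following is a minimal sketch under a few assumptions: each object is a list of numeric attributes of equal length, the sample (n - 1) forms of covariance and standard deviation are used, and a zero standard deviation is treated as a correlation of 0. The function name pearson is chosen only for this example.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation as covariance(a, b) / (std(a) * std(b))."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_a = sum(a) / float(n)
    mean_b = sum(b) / float(n)
    # Sample covariance and sample standard deviations (both divide by n - 1)
    covariance = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / (n - 1)
    std_a = sqrt(sum((p - mean_a) ** 2 for p in a) / (n - 1))
    std_b = sqrt(sum((q - mean_b) ** 2 for q in b) / (n - 1))
    # Assumption: if either object never varies, report 0 instead of dividing by 0
    if std_a == 0 or std_b == 0:
        return 0.0
    return covariance / (std_a * std_b)


print(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # perfectly linear, increasing
print(pearson([1.0, 2.0, 3.0], [6.0, 4.0, 2.0]))  # perfectly linear, decreasing
```

The first call should give a value at or very near 1 and the second a value at or very near -1, since both pairs lie exactly on a line.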

# This is an implementation of a Pearson Correlation function in Python
# as guided by the Programming Collective Intelligence book by Toby Segaran

from math import sqrt

# Returns the Pearson correlation coefficient between two people's
# movie preferences
def pearson_correlation_coefficient(movie_preferences, person1, person2):
    # Gets the list of mutually rated items
    mutual_items = [item for item in movie_preferences[person1]
                    if item in movie_preferences[person2]]

    # Finds the number of mutually rated items, stored as a float so the
    # divisions below are never truncated to integers
    elements = float(len(mutual_items))

    # If there are no ratings in common, return 0
    if elements == 0:
        return 0

    # Add up all the preferences
    sum_person1 = sum(movie_preferences[person1][it] for it in mutual_items)
    sum_person2 = sum(movie_preferences[person2][it] for it in mutual_items)

    # Sum up the squares of each person's preferences
    squared_person1 = sum(pow(movie_preferences[person1][it], 2) for it in mutual_items)
    squared_person2 = sum(pow(movie_preferences[person2][it], 2) for it in mutual_items)

    # Sum of the products of the mutual items
    product_sum = sum(movie_preferences[person1][it] * movie_preferences[person2][it] for it in mutual_items)

    # Calculation of the Pearson score
    num = product_sum - (sum_person1 * sum_person2 / elements)
    den = sqrt((squared_person1 - pow(sum_person1, 2) / elements) * (squared_person2 - pow(sum_person2, 2) / elements))

    # A zero denominator means at least one person's ratings do not vary
    if den == 0:
        return 0
    return num / den

Jaccard Similarity Coefficient

The Jaccard Coefficient is a similarity coefficient that measures the similarity between objects made up of purely binary attributes. Binary attributes in this sense could correspond to market basket data, where each attribute represents an item in a store, a value of 1 represents a purchase, and a value of 0 indicates that the item wasn't purchased. In particular, the Jaccard coefficient is well suited to handling asymmetric binary attributes, because it only uses attributes (items) that at least one of the two objects has purchased. It is calculated by taking the number of attributes for which both objects have a value of 1 and dividing it by the number of attributes for which at least one object has a value of 1.

Jaccard Coefficient = f₁₁ / ( f₀₁ + f₁₀ + f₁₁ )

Where f is the count of attributes (items) for which a given pair of binary values occurs:
01 means that the first object did not purchase the item, but the second did
10 means that the first object purchased the item, but the second did not
11 means that both objects purchased the item
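The counts in the formula above can be tallied in a single pass over two binary vectors. This is a minimal sketch, assuming each object is a list of 0/1 values of equal length; the function name jaccard, the sample baskets, and the choice to return 0.0 when neither object has any purchase are assumptions made for the example.

```python
def jaccard(x, y):
    """Jaccard coefficient f11 / (f01 + f10 + f11) for two binary vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    f11 = f10 = f01 = 0
    for a, b in zip(x, y):
        if a == 1 and b == 1:
            f11 += 1  # both purchased the item
        elif a == 1 and b == 0:
            f10 += 1  # only the first purchased it
        elif a == 0 and b == 1:
            f01 += 1  # only the second purchased it
        # 00 matches are deliberately not counted
    denominator = f01 + f10 + f11
    if denominator == 0:
        return 0.0  # assumption: no purchases at all is scored as 0
    return f11 / float(denominator)


basket_a = [1, 0, 0, 1, 1, 0]  # made-up purchase data
basket_b = [1, 1, 0, 0, 1, 0]
print(jaccard(basket_a, basket_b))  # f11 = 2, f10 = 1, f01 = 1 -> 0.5
```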

Tanimoto Coefficient

The Tanimoto Coefficient, otherwise known as the extended Jaccard Coefficient, is very similar in principle to the Jaccard Coefficient from before. In fact, the Tanimoto Coefficient is the general form of the Jaccard Coefficient: if it is restricted to working with only binary attributes, the Tanimoto Coefficient reduces to Jaccard.

Tanimoto(x,y) = x · y / ( || x ||² + || y ||² - x · y )

Where || x || means the length (Euclidean norm) of the vector x, and x · y is the dot product of x and y
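The formula above can be evaluated with nothing more than dot products, since || x ||² equals x · x. The following is a minimal sketch for equal-length numeric vectors; the names dot and tanimoto_coefficient, the sample vectors, and the choice to return 0.0 for two all-zero vectors are assumptions made for the example.

```python
def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def tanimoto_coefficient(x, y):
    """Tanimoto (extended Jaccard): x.y / (||x||^2 + ||y||^2 - x.y)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    xy = dot(x, y)
    # ||x||^2 is the dot product of x with itself
    denominator = dot(x, x) + dot(y, y) - xy
    if denominator == 0:
        return 0.0  # assumption: two all-zero vectors are scored as 0
    return xy / float(denominator)


# Binary vectors: same value as the Jaccard formula, 2 / (3 + 3 - 2) = 0.5
print(tanimoto_coefficient([1, 0, 0, 1, 1, 0], [1, 1, 0, 0, 1, 0]))
# Non-binary vectors: 3 / (5 + 3 - 3) = 0.6
print(tanimoto_coefficient([2.0, 0.0, 1.0], [1.0, 1.0, 1.0]))
```

For 0/1 vectors the dot product counts the 11 matches and each squared length counts that object's purchases, which is why the result agrees with the Jaccard formula.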

# This is an implementation of a Tanimoto coefficient function in Python
# as guided by the Programming Collective Intelligence book by Toby Segaran

# Implementation based on the Tanimoto coefficient, which is the ratio
# of the intersection set (only the items that are in both sets)
# to the union set (all the items in either set).
# This returns 1 minus that ratio, a value between 0 and 1: a value of 1
# indicates nobody who wants the first item wants the second one, and 0
# means that exactly the same set of people want the 2 items
def tanimoto(vector1, vector2):
    set1, set2, shared = 0, 0, 0

    for i in range(len(vector1)):
        if vector1[i] != 0:
            set1 += 1  # in vector1
        if vector2[i] != 0:
            set2 += 1  # in vector2
        if vector1[i] != 0 and vector2[i] != 0:
            shared += 1  # in both

    # Compute the result only after every position has been counted
    union = set1 + set2 - shared
    if union == 0:
        print("Divided by 0")
        return None
    return 1.0 - (float(shared) / union)

Cosine Similarity

Cosine Similarity is a similarity metric that can be used to measure the similarity of two text documents. Documents can be represented by vectors in which each attribute represents the frequency of a word in the document. Even though documents may have many attributes, the word vectors are actually very sparse, because any single document uses only a small share of a language's vocabulary. Like Jaccard, Cosine Similarity ignores all 00 matches in the similarity calculation. Like the Tanimoto coefficient, it can also handle non-binary values. Because two documents can be represented as vectors, the documents are similar (disregarding magnitude) when the angle between their vectors is small. If the Cosine Similarity is 1, the angle between the two vectors is 0 degrees. If the Cosine Similarity is 0, the angle between the two vectors is 90 degrees (the vectors are perpendicular), which for word-frequency vectors means the documents share no words and are very dissimilar.

Cosine(x,y) = ( x · y) / ( || x || || y || )
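The formula above can be sketched for word-frequency vectors as follows. This is a minimal illustration, assuming both documents are already encoded as equal-length lists of word counts over the same vocabulary; the function name cosine_similarity, the sample documents, and the choice to return 0.0 for an all-zero vector are assumptions made for the example.

```python
from math import sqrt


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    # Positions where either count is 0 contribute nothing to the dot product
    dot_product = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: an empty document is scored as 0
    return dot_product / (norm_x * norm_y)


doc_a = [3, 0, 2, 0, 0, 1]  # made-up word counts
doc_b = [6, 0, 4, 0, 0, 2]  # same proportions as doc_a, twice the length
doc_c = [0, 5, 0, 1, 0, 0]  # no words in common with doc_a

print(cosine_similarity(doc_a, doc_b))  # approximately 1: same direction
print(cosine_similarity(doc_a, doc_c))  # 0.0: perpendicular vectors
```

doc_b is simply doc_a repeated twice, so the score ignores the difference in document length, which is the "disregarding magnitude" property described above.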
