The Euclidean distance between a pair of objects is the straight-line (metric) distance between them. It is found by taking the square root of the sum of the squared differences between each of their attributes. For example, given two objects A and B with attributes x, y, and z, one squares the differences between their x values, their y values, and their z values, adds the three squares together, and takes the square root of the total.
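Written as a formula for the two objects A and B described above (the subscript notation, such as A_x for attribute x of object A, is an assumption introduced here for illustration):

```latex
d(A, B) = \sqrt{(A_x - B_x)^2 + (A_y - B_y)^2 + (A_z - B_z)^2}
```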
    # An implementation of a Euclidean distance function in Python,
    # as guided by the book Programming Collective Intelligence by Toby Segaran.
    from math import sqrt

    # Returns a distance-based similarity score between the movie preferences
    # of two people.
    def euclidean_distance(movie_preferences, person1, person2):
        # Collect the shared items, i.e. the items that both people have rated
        shared_items = [item for item in movie_preferences[person1]
                        if item in movie_preferences[person2]]

        # If they have no ratings in common, return 0
        if len(shared_items) == 0:
            return 0

        # Add up the squares of all the differences
        sum_of_squares = sum(pow(movie_preferences[person1][item] -
                                 movie_preferences[person2][item], 2)
                             for item in shared_items)

        # Turn the distance into a similarity score: identical ratings give 1,
        # and the score approaches 0 as the distance grows
        return 1 / (1 + sqrt(sum_of_squares))
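Note that the function above returns a similarity score rather than the raw distance. The following is a minimal sketch of the two steps kept separate, assuming each object is given as a sequence of numbers; the names `euclidean` and `distance_to_similarity` and the sample objects are hypothetical and chosen only for illustration.

```python
from math import sqrt


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length numeric sequences."""
    if len(a) != len(b):
        raise ValueError("both objects need the same number of attributes")
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_to_similarity(d):
    """Map a distance in [0, inf) to a score in (0, 1]; 1 means identical."""
    return 1.0 / (1.0 + d)


# Hypothetical objects with attributes (x, y, z)
A = (1.0, 2.0, 3.0)
B = (4.0, 6.0, 3.0)

d = euclidean(A, B)  # sqrt(3^2 + 4^2 + 0^2)
print(d, distance_to_similarity(d))
```

Keeping the raw distance available is useful when the distance itself is needed, for instance for nearest-neighbour lookups, while the score is more convenient for ranking.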
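Correlation measures the linear relationship between the attributes of two objects. Pearson's correlation coefficient is one such measure: for two objects A and B, it is the covariance of their attribute values divided by the product of their standard deviations. It ranges from -1 (a perfect negative linear relationship) to 1 (a perfect positive linear relationship).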
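In the sum-based form that the code below follows, the coefficient can be written out as shown here. The symbols are assumptions introduced for illustration: a_i and b_i are the values of attribute i for A and B, and n is the number of attributes the two objects share.

```latex
r = \frac{\sum_i a_i b_i - \frac{1}{n} \sum_i a_i \sum_i b_i}
         {\sqrt{\left( \sum_i a_i^2 - \frac{1}{n} \left( \sum_i a_i \right)^2 \right)
                \left( \sum_i b_i^2 - \frac{1}{n} \left( \sum_i b_i \right)^2 \right)}}
```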
    # An implementation of a Pearson correlation function in Python,
    # as guided by the book Programming Collective Intelligence by Toby Segaran.
    from math import sqrt

    # Returns the Pearson correlation coefficient between two people's
    # movie preferences.
    def pearson_correlation_coefficient(movie_preferences, person1, person2):
        # Get the list of mutually rated items
        mutual_items = [item for item in movie_preferences[person1]
                        if item in movie_preferences[person2]]

        # Number of mutually rated items, stored as a float so that the
        # divisions below are never truncated to integers
        elements = float(len(mutual_items))

        # If there are no ratings in common, return 0
        if elements == 0:
            return 0

        ratings1 = [movie_preferences[person1][it] for it in mutual_items]
        ratings2 = [movie_preferences[person2][it] for it in mutual_items]

        # Add up all the preferences
        sum_person1 = sum(ratings1)
        sum_person2 = sum(ratings2)

        # Sum up the squares of each person's ratings
        squared_person1 = sum(pow(r, 2) for r in ratings1)
        squared_person2 = sum(pow(r, 2) for r in ratings2)

        # Sum of the products of the mutual items
        product_sum = sum(r1 * r2 for r1, r2 in zip(ratings1, ratings2))

        # Calculation of the Pearson score
        num = product_sum - (sum_person1 * sum_person2 / elements)
        den = sqrt((squared_person1 - pow(sum_person1, 2) / elements) *
                   (squared_person2 - pow(sum_person2, 2) / elements))
        if den == 0:
            return 0
        return num / den
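As a cross-check on the sum-based calculation above, the same coefficient can be computed directly from its definition as covariance over the product of standard deviations. This is only a sketch on plain lists of ratings; the function name `pearson` and the sample ratings are hypothetical.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length, non-empty rating lists."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("need two non-empty lists of equal length")
    mean_a = sum(a) / float(n)
    mean_b = sum(b) / float(n)
    # Co-variation of the two lists around their means
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    # Variation of each list around its own mean
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    den = sqrt(var_a * var_b)
    # A constant list has no variation, so the correlation is undefined;
    # returning 0.0 here mirrors the convention of the function above
    return cov / den if den != 0 else 0.0


# Hypothetical ratings: the second person always rates twice as high
print(pearson([1, 2, 3], [2, 4, 6]))
# The third person rates in exactly the opposite order
print(pearson([1, 2, 3], [3, 2, 1]))
```

Because the coefficient only looks at how the ratings move together, a person who consistently rates higher than another can still be perfectly correlated with them, which a distance-based score would not capture.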
The Jaccard Coefficient is a similarity coefficient that applies only to objects with purely binary attributes. Binary attributes in this sense could correspond to market basket data, where each attribute stands for an item in a store, a value of 1 represents a purchase, and a value of 0 indicates that the item was not purchased. The Jaccard coefficient is particularly suited to asymmetric binary attributes, because it only uses the attributes for which at least one of the two objects has a value of 1 and ignores the attributes that are 0 for both. It is computed by taking the number of attributes for which both objects have a value of 1 and dividing it by the number of attributes for which at least one of the objects has a value of 1.
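A minimal sketch of that calculation, assuming each object is a list of 0/1 values with one entry per store item. The function name `jaccard` and the sample baskets are hypothetical, and returning 0.0 when neither object contains a purchase is an assumption made here, since the ratio is undefined in that case.

```python
def jaccard(a, b):
    """Jaccard coefficient of two equal-length binary (0/1) vectors."""
    if len(a) != len(b):
        raise ValueError("both objects need the same number of attributes")
    # Attributes where both objects have a 1 (1-1 matches)
    both = sum(1 for x, y in zip(a, b) if x and y)
    # Attributes where at least one object has a 1; 0-0 matches are ignored
    either = sum(1 for x, y in zip(a, b) if x or y)
    if either == 0:
        return 0.0  # assumption: no purchases at all is treated as no similarity
    return float(both) / either


# Hypothetical baskets over five store items
basket_a = [1, 0, 1, 1, 0]
basket_b = [1, 1, 0, 1, 0]
print(jaccard(basket_a, basket_b))  # 2 shared purchases out of 4 items bought by either
```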
The Tanimoto Coefficient, otherwise known as the extended Jaccard Coefficient, is very similar in principle to the Jaccard Coefficient. In fact, the Tanimoto Coefficient is the general form of the Jaccard Coefficient: when it is restricted to binary attributes, it reduces to Jaccard.
    # An implementation of a Tanimoto coefficient function in Python,
    # as guided by the book Programming Collective Intelligence by Toby Segaran.
    #
    # The Tanimoto coefficient is the ratio of the intersection set (only the
    # items that are in both sets) to the union set (all the items in either
    # set). Note that this function returns 1 minus that ratio, so the result
    # is a distance between 0 and 1: a value of 1 indicates that nobody who
    # wants the first item wants the second one, and 0 means that exactly the
    # same set of people want the two items.
    def tanimoto(vector1, vector2):
        set1, set2, shared = 0, 0, 0
        for i in range(len(vector1)):
            if vector1[i] != 0:
                set1 += 1  # in vector1
            if vector2[i] != 0:
                set2 += 1  # in vector2
            if vector1[i] != 0 and vector2[i] != 0:
                shared += 1  # in both
        try:
            return 1.0 - (float(shared) / (set1 + set2 - shared))
        except ZeroDivisionError:
            # Both vectors are all zeros, so the union is empty and the ratio
            # is undefined; the function then returns None
            print("Divided by 0")
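The function above only checks whether each entry is zero or not, so it effectively works on binary data. For real-valued vectors, the extended Jaccard form divides the dot product by the sum of the squared lengths minus the dot product. The sketch below assumes plain lists of numbers; the name `tanimoto_general` is hypothetical, and it returns the similarity itself rather than 1 minus the similarity.

```python
def tanimoto_general(a, b):
    """Extended Jaccard (Tanimoto) similarity of two numeric vectors."""
    if len(a) != len(b):
        raise ValueError("both vectors need the same length")
    dot = sum(x * y for x, y in zip(a, b))
    sq_a = sum(x * x for x in a)  # squared length of a
    sq_b = sum(y * y for y in b)  # squared length of b
    den = sq_a + sq_b - dot
    if den == 0:
        return 0.0  # assumption: two all-zero vectors are treated as dissimilar
    return float(dot) / den


# On binary vectors the result matches the Jaccard coefficient:
# 2 shared ones out of 4 positions where either vector has a one
print(tanimoto_general([1, 0, 1, 1, 0], [1, 1, 0, 1, 0]))
# Identical real-valued vectors give a similarity of 1
print(tanimoto_general([1, 2, 3], [1, 2, 3]))
```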
Cosine Similarity is a similarity metric that can be used to measure the similarity of two text documents. A document can be represented by a vector in which each attribute holds the frequency of a word in that document. Even though such vectors may have many attributes, they are very sparse, because any one document uses only a small fraction of the words in a language. Like Jaccard, the Cosine Similarity ignores all 0-0 matches, and like the Tanimoto coefficient, it can handle non-binary values. It is the cosine of the angle between the two vectors, so it disregards their magnitudes: two documents are similar when the angle between their vectors is small. A Cosine Similarity of 1 means that the angle between the two vectors is 0 degrees, while a Cosine Similarity of 0 means that the vectors are perpendicular (90 degrees apart), which indicates that the documents are very dissimilar.
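A minimal sketch of the calculation, assuming the documents have already been turned into word-frequency lists over the same vocabulary. The function name `cosine_similarity` and the sample vectors are hypothetical, and returning 0.0 for an empty document (an all-zero vector) is an assumption, since the angle is undefined in that case.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("both vectors need the same length")
    # Positions where both entries are 0 add nothing to any of the sums,
    # which is why 0-0 matches have no effect on the result
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty document is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical word counts over a four-word vocabulary
doc1 = [2, 1, 0, 0]
doc2 = [4, 2, 0, 0]  # same word proportions as doc1, twice as long
doc3 = [0, 0, 3, 1]  # shares no words with doc1

print(cosine_similarity(doc1, doc2))  # close to 1: same direction
print(cosine_similarity(doc1, doc3))  # 0.0: perpendicular vectors
```

Since only the direction of the vectors matters, a long document and a short one with the same word proportions are treated as equally similar to any third document.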