Published on Sep 04,2014

The Tanimoto coefficient, also known as the Jaccard coefficient, is the ratio of the size of the intersection, or overlap, between the preferred items of two users to the size of the union of the users’ preferred items. In other words, it is the number of items for which both users express some preference, divided by the number of items for which either user expresses some preference. TanimotoCoefficientSimilarity is the implementation in Apache Mahout that is derived from the Tanimoto coefficient.
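Written as a formula, the definition looks as follows. The symbols A and B are introduced here only for illustration and stand for the sets of items preferred by the two users:

```latex
T(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}
```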

Based on the above definition, the question that arises is: what exactly does the Tanimoto coefficient do, and how does it determine whether two users are similar?

Exemplification of Tanimoto Coefficient

Let’s assume there are two users, user1 and user2. The coefficient counts the preferences that the two users have in common. This intersection is then divided by the total number of distinct items preferred by either user, and the resulting number indicates how closely related the two users are.

The above diagram shows both users with the same items in their lists. In this case the intersection covers all the items, so the coefficient is 1, which means the two users have identical preferences. At the other extreme, if the users have no items in common, the coefficient is 0. In general, the number of matching preferences divided by the total number of distinct items preferred by either user gives the Tanimoto coefficient.
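The calculation can be sketched in a few lines of plain Python. This is only an illustration of the ratio itself, not Mahout's TanimotoCoefficientSimilarity; the function name, the item labels, and the choice to return 0.0 when neither user has any preferences are assumptions made for this example.

```python
def tanimoto_similarity(items_a, items_b):
    """Tanimoto (Jaccard) coefficient of two collections of preferred item IDs."""
    a, b = set(items_a), set(items_b)
    union = a | b
    if not union:
        # Neither user has expressed a preference, so the ratio is undefined.
        # Returning 0.0 is a choice made for this sketch.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical users and items, used only to exercise the function.
user1 = {"item1", "item2", "item3"}
user2 = {"item1", "item2", "item3"}
user3 = {"item2", "item3", "item4", "item5"}

print(tanimoto_similarity(user1, user2))  # identical lists -> 1.0
print(tanimoto_similarity(user1, user3))  # 2 shared items out of 5 distinct -> 0.4
```

Only the presence of an item in a user's list matters here; no rating or preference value is involved.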

The Tanimoto coefficient is related to LogLikelihood Similarity, which is likewise used when preference values are not available. LogLikelihood Similarity draws on statistics, specifically on hypothesis testing, where you formulate your own hypothesis.

Let’s assume you want to analyse some data. For this, there are certain rules in statistics. We will create a hypothesis that is based on the data.

We do not assume that the hypothesis we create is true; there is also a counter-hypothesis to test it against.

We’ll take the example of the Sensex and the government. Based on some data, we claim that ‘when the Indian government changes, the Sensex will go up’. To test this, we have to create two hypotheses.

Hypothesis 1: The government will change and Sensex will change.

Hypothesis 2: The government will change and Sensex will not change.

In this case, the second one, ‘Sensex will not change’, can be called the null hypothesis: the default assumption, which is taken to hold unless the data gives evidence against it. Because we are assuming that the Sensex will increase, we can also take ‘the government will change but the Sensex will decrease’ as a counter-hypothesis. Based on the data, we then try to work out which hypothesis is supported, which is a statistical procedure.

Now, how does LogLikelihood Similarity come into the picture?

It is similar to the Tanimoto coefficient in that it also looks at the intersection of the two users’ preferred items. By definition, however, it focuses on how unlikely it is that this overlap is due to chance alone. The less likely the overlap is to be a coincidence, the higher the similarity; if the overlap is about what chance would produce, the similarity is low.
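To make the idea concrete, here is a minimal Python sketch of a log-likelihood similarity for two users' item sets. It builds a 2x2 table of counts (items preferred by both users, by only one of them, and by neither), computes the log-likelihood ratio from that table, and maps the result into the range 0 to 1 with 1 - 1/(1 + LLR). The function names, the toy catalogue of 100 items, and the final mapping step are assumptions made for this illustration; they follow how Mahout's implementation is commonly described and are not copied from it.

```python
import math


def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """Log-likelihood ratio (G statistic) of a 2x2 contingency table."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


def log_likelihood_similarity(items_a, items_b, num_items):
    """Similarity in [0, 1); num_items is the size of the whole catalogue."""
    a, b = set(items_a), set(items_b)
    both = len(a & b)
    only_a = len(a) - both
    only_b = len(b) - both
    neither = num_items - len(a | b)
    llr = log_likelihood_ratio(both, only_a, only_b, neither)
    return 1.0 - 1.0 / (1.0 + llr)


# Hypothetical data: a catalogue of 100 items and three users with 10 items each.
catalog_size = 100
user_a = set(range(0, 10))
user_b = set(range(0, 10))           # same 10 items as user_a
user_c = {0} | set(range(10, 19))    # shares only item 0 with user_a

sim_ab = log_likelihood_similarity(user_a, user_b, catalog_size)
sim_ac = log_likelihood_similarity(user_a, user_c, catalog_size)
print(sim_ab)  # large overlap that chance is unlikely to produce: close to 1
print(sim_ac)  # one shared item is what chance would predict here: close to 0
```

Unlike the Tanimoto coefficient, this score depends on the total number of items in the catalogue, because that total determines how much overlap chance alone would produce.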

