Pearson correlation coefficient
By definition, the Pearson correlation coefficient is a measure of the linear correlation between two variables, with values between -1 and 1. It can be used when you are trying to find a similarity between entities.
Let’s say that you have a list of movie ratings:
data = {
  'Davor' => {
    'Stonehearst Asylum' => 5.0,
    'Predestination' => 4.0,
    'Shutter Island' => 5.0,
    'The Witch' => 2.0,
    'The Reaping' => 3.0
  },
  'John' => {
    'Shutter Island' => 4.0,
    'Predestination' => 4.0,
    'Back to the future' => 5.0,
    'The Godfather' => 1.5,
    'The Witch' => 3.0
  },
  'Tina' => {
    'Back to the future' => 3.0,
    'Predestination' => 2.0,
    'The Witch' => 5.0,
    'Shutter Island' => 3.0
  }
}
Now, the easiest way to measure the difference between two people’s ratings is the Euclidean distance.
Formula for the Euclidean distance
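For reference, the Euclidean distance is usually written as follows. The symbols are assumptions chosen for illustration: p and q are the two people’s ratings and n is the number of movies both have rated.

```latex
d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
```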
There are a few problems with the Euclidean distance:
- its range is 0 to ∞, where 0 means the entities are the same (but this could easily be rescaled to [0, 1])
- it doesn’t quantify how well two data objects fit a line
- it is sensitive to scale, so normalized and unnormalized data give different results
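As a rough sketch of the points above, a Euclidean-based similarity could look like this in Ruby. The method name `euclidean_similarity`, the sample `ratings` hash, and the 1 / (1 + d) rescaling are assumptions made for illustration; that rescaling is just one common way to squeeze the distance into the (0, 1] range, with 1 meaning identical ratings.

```ruby
# Sketch: similarity based on the Euclidean distance over the movies
# that both people have rated. Names and scaling are illustrative only.
def euclidean_similarity(data, person1, person2)
  shared = data[person1].keys & data[person2].keys
  return 0.0 if shared.empty?

  # Sum of squared rating differences over the shared movies
  sum_sq = shared.sum do |key|
    (data[person1][key].to_f - data[person2][key].to_f)**2
  end
  distance = Math.sqrt(sum_sq)

  # 1.0 for identical ratings, approaching 0 as the distance grows
  1.0 / (1.0 + distance)
end

# Hypothetical sample data, not taken from the ratings above
ratings = {
  'A' => { 'Shutter Island' => 5.0, 'The Witch' => 2.0 },
  'B' => { 'Shutter Island' => 5.0, 'The Witch' => 2.0 },
  'C' => { 'Shutter Island' => 2.0, 'The Witch' => 5.0 }
}

puts euclidean_similarity(ratings, 'A', 'B') # identical ratings
puts euclidean_similarity(ratings, 'A', 'C') # opposite tastes
```

Note that the score only reflects absolute differences: two people with the same taste but different rating habits would still look far apart.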
On the other hand, the Pearson correlation coefficient handles these issues pretty well. Let’s say that you have a user who rates almost all good movies with a 3 (and other movies with a 1 or a 2). You could easily say that his 3 is actually a 5. The Pearson correlation coefficient does that normalization.
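To make the normalization idea concrete, here is a minimal sketch. It assumes two hypothetical raters with the same preferences, one of whom never goes above 3; the `standardize` helper is not part of the original code. It subtracts each rater’s mean and divides by the standard deviation, which is the kind of rescaling the Pearson coefficient does implicitly.

```ruby
# Sketch: put ratings on a common scale by subtracting the mean and
# dividing by the (population) standard deviation.
# Assumes the ratings are not all identical, otherwise std would be 0.
def standardize(ratings)
  mean = ratings.sum / ratings.length
  std = Math.sqrt(ratings.sum { |r| (r - mean)**2 } / ratings.length)
  ratings.map { |r| (r - mean) / std }
end

harsh    = [3.0, 2.0, 1.0] # never rates above 3
generous = [5.0, 4.0, 3.0] # same preferences, shifted up by 2

# Both print the same standardized values
p standardize(harsh)
p standardize(generous)
```

After standardizing, the harsh rater’s 3 and the generous rater’s 5 land on the same value, even though their raw ratings differ by 2 on every movie.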
Formula for the Pearson correlation coefficient
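One common computational form of the coefficient, and the one that matches the sums used in the Ruby code in this post, is given below. The symbols are assumptions chosen for illustration: x and y are the two people’s ratings for the n movies they have both rated.

```latex
r = \frac{\sum_{i=1}^{n} x_i y_i - \frac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n}}
         {\sqrt{\left(\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right)
                \left(\sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}\right)}}
```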
The Pearson correlation coefficient written in Ruby:
# Pearson correlation between two people's ratings, based only on the
# movies that both of them have rated.
def pearson(data, person1, person2)
  # Movies rated by both people
  shared = data[person1].keys & data[person2].keys
  len = shared.length
  return 0 if len == 0

  sum1 = sum2 = sum1_sq = sum2_sq = psum = 0
  shared.each do |key|
    x = data[person1][key].to_f
    y = data[person2][key].to_f
    sum1 += x
    sum2 += y
    sum1_sq += x ** 2
    sum2_sq += y ** 2
    psum += x * y
  end

  num = psum - (sum1 * sum2 / len)
  den = ((sum1_sq - (sum1 ** 2) / len) * (sum2_sq - (sum2 ** 2) / len)) ** 0.5
  # A zero denominator means at least one person gave the same rating to
  # every shared movie, so the correlation is undefined; return 0.
  den == 0 ? 0 : num / den
end
Let’s try it out:
pearson(data, 'Davor', 'John')
#=> 0.9449111825230686
pearson(data, 'Davor', 'Tina')
#=> -0.7857142857142856
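A score close to 1 means two people rate movies in a similar way, and a negative score means their tastes tend to be opposite. Building on that, the following sketch ranks everyone else by similarity to a given person. The `top_matches` helper is hypothetical; it takes the similarity measure as a block, so it does not depend on any particular implementation.

```ruby
# Hypothetical helper: rank everyone except `person` by similarity.
# The block receives two names and must return a numeric score.
def top_matches(data, person, limit = 3)
  others = data.keys - [person]
  scored = others.map { |other| [other, yield(person, other)] }
  # Highest score first
  scored.sort_by { |_, score| -score }.first(limit)
end

# Usage, assuming `data` and `pearson` from above are in scope:
#   top_matches(data, 'Davor') { |a, b| pearson(data, a, b) }
```

With the ratings above, John would be listed before Tina for Davor, since 0.94 is greater than -0.79.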