By definition, the Pearson correlation coefficient is a measure of the linear correlation between two variables, with a value between -1 and 1. It can be used when you are trying to find a similarity between entities.

Let’s say that you have a list of movie ratings:

data = {
  'Davor' => {
    'Stonehearst Asylum' => 5.0,
    'Predestination' => 4.0,
    'Shutter Island' => 5.0,
    'The Witch' => 2.0,
    'The Reaping' => 3.0
  },
  'John' => {
    'Shutter Island' => 4,
    'Predestination' => 4.0,
    'Back to the future' => 5.0,
    'The Godfather' => 1.5,
    'The Witch' => 3
  },
  'Tina' => {
    'Back to the future' => 3.0,
    'Predestination' => 2.0,
    'The Witch' => 5.0,
    'Shutter Island' => 3
  }
}

Now, the easiest way to measure the difference between two users' ratings is the Euclidean distance.

Formula for the Euclidean distance:

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$
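To make the distance concrete, here is a minimal sketch in Ruby. The helper name `euclidean_distance` is hypothetical, and so are two choices it makes: it compares only the movies both people have rated, and it returns `nil` when they share none.

```ruby
# Euclidean distance between two people's ratings.
# Assumption: only movies rated by BOTH people are compared.
def euclidean_distance(ratings1, ratings2)
  shared = ratings1.keys & ratings2.keys
  return nil if shared.empty? # nothing to compare

  sum_sq = shared.sum do |movie|
    (ratings1[movie].to_f - ratings2[movie].to_f) ** 2
  end
  Math.sqrt(sum_sq)
end

davor = {
  'Stonehearst Asylum' => 5.0,
  'Predestination' => 4.0,
  'Shutter Island' => 5.0,
  'The Witch' => 2.0,
  'The Reaping' => 3.0
}
john = {
  'Shutter Island' => 4,
  'Predestination' => 4.0,
  'Back to the future' => 5.0,
  'The Godfather' => 1.5,
  'The Witch' => 3
}

distance = euclidean_distance(davor, john)
puts distance # about 1.414, i.e. the square root of 2
```

Davor and John share three movies, and their ratings differ by 0, 1 and 1, so the distance is the square root of 2.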

There are a few problems with the Euclidean distance (ED):

  1. it ranges from 0 to ∞, where 0 means the entities are the same (but this can easily be rescaled to [0, 1])
  2. it doesn’t quantify how well two data objects fit a line
  3. it gives different results for normalized and unnormalized data
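The rescaling mentioned in the first point could look something like the sketch below. The mapping 1 / (1 + d) is one common choice rather than the only one, and `distance_to_similarity` is a hypothetical name. Strictly speaking it produces values in (0, 1], because the result never reaches 0 for a finite distance.

```ruby
# Map a distance in [0, ∞) to a similarity in (0, 1].
# A distance of 0 (identical ratings) becomes a similarity of 1,
# and larger distances shrink towards 0.
def distance_to_similarity(distance)
  1.0 / (1.0 + distance)
end

puts distance_to_similarity(0.0)          # 1.0, identical ratings
puts distance_to_similarity(Math.sqrt(2)) # roughly 0.414
puts distance_to_similarity(100.0)        # close to 0
```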

On the other hand, the Pearson correlation coefficient handles these issues pretty well. Let’s say that you have a user who rates almost all good movies with a 3 (and other movies with a 1 or 2). You could easily say that their 3 is actually a 5. The Pearson correlation coefficient accounts for that difference in scale.
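A small sketch can illustrate the effect. The two raters and their ratings below are made up for this example, and `pearson_between` is a hypothetical compact helper that takes two rating hashes directly. The strict rater scores every movie exactly 2 points lower than the generous one, so the two agree completely on which movies are better.

```ruby
# Compact Pearson correlation over the movies both people rated.
def pearson_between(ratings1, ratings2)
  shared = ratings1.keys & ratings2.keys
  n = shared.length
  return 0 if n == 0

  xs = shared.map { |movie| ratings1[movie].to_f }
  ys = shared.map { |movie| ratings2[movie].to_f }

  num = xs.zip(ys).sum { |x, y| x * y } - xs.sum * ys.sum / n
  den = Math.sqrt((xs.sum { |x| x ** 2 } - xs.sum ** 2 / n) *
                  (ys.sum { |y| y ** 2 } - ys.sum ** 2 / n))
  den == 0 ? 0 : num / den
end

# Made-up ratings: same taste, different scale.
generous = { 'Movie A' => 5, 'Movie B' => 4, 'Movie C' => 3 }
strict   = { 'Movie A' => 3, 'Movie B' => 2, 'Movie C' => 1 }

# Euclidean distance over the same three movies, for comparison.
euclidean = Math.sqrt(generous.keys.sum { |m| (generous[m] - strict[m]) ** 2 })

puts euclidean                          # about 3.46, looks like a big difference
puts pearson_between(generous, strict)  # 1.0, a perfect positive correlation
```

The Euclidean distance treats the constant 2-point offset as disagreement, while the correlation ignores it.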

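Formula for the Pearson correlation coefficient:

$$r = \frac{\sum x_i y_i - \frac{\sum x_i \sum y_i}{n}}{\sqrt{\left(\sum x_i^2 - \frac{(\sum x_i)^2}{n}\right)\left(\sum y_i^2 - \frac{(\sum y_i)^2}{n}\right)}}$$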
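As a rough check by hand, take x to be Davor's ratings and y to be John's ratings for the three movies both of them rated (Predestination, Shutter Island and The Witch); the symbols x, y and n are labels chosen for this example. The sums are Σx = 11, Σy = 11, Σx² = 45, Σy² = 41 and Σxy = 42, with n = 3. Plugging them in gives:

```latex
r = \frac{42 - \frac{11 \cdot 11}{3}}
         {\sqrt{\left(45 - \frac{11^2}{3}\right)\left(41 - \frac{11^2}{3}\right)}}
  = \frac{5/3}{\sqrt{\frac{14}{3} \cdot \frac{2}{3}}}
  = \frac{5}{\sqrt{28}}
  \approx 0.945
```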

The Pearson correlation coefficient written in Ruby:

def pearson(data, person1, person2)
  # Movies rated by both people
  shared = data[person1].keys & data[person2].keys

  len = shared.length

  # No movies in common, so there is nothing to correlate
  return 0 if len == 0

  sum1 = sum2 = sum1_sq = sum2_sq = psum = 0

  shared.each do |key|
    rating1 = data[person1][key].to_f
    rating2 = data[person2][key].to_f

    # Sums of the ratings
    sum1 += rating1
    sum2 += rating2

    # Sums of the squared ratings
    sum1_sq += rating1 ** 2
    sum2_sq += rating2 ** 2

    # Sum of the products
    psum += rating1 * rating2
  end

  num = psum - (sum1 * sum2 / len)
  den = ((sum1_sq - (sum1 ** 2) / len) * (sum2_sq - (sum2 ** 2) / len)) ** 0.5

  # A zero denominator means one person gave every shared movie the same rating
  den == 0 ? 0 : num / den
end

Let’s try it out:

pearson(data, 'Davor', 'John')
#=> 0.9449111825230686

pearson(data, 'Davor', 'Tina')
#=> -0.7857142857142856