BigSnarf blog

Infosec FTW

Creating my first algorithm from scratch – Euclidean distance and Pearson correlation

For this part of the exercise, I pick two IP addresses and calculate how similar people are using Euclidean distance and Pearson correlation. I created a small dataset stored as a nested dictionary, keyed first by person and then by IP address. I did the calculations by hand, but Python's pandas can work the numbers easily. I calculate the distance of Lisa from Kirk by isolating 1.1.1.1 and 2.2.2.2 and plotting those two values on a graph. I do the same for each combination of people and each combination of IP addresses. I even find people who are very similar and one who is not as similar. This model can help me understand clusters and identify baseline conversations between people and the IP addresses they visit. Somehow it all makes sense to me.
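As a quick check of the Lisa and Kirk step, here is a minimal sketch that treats each person's values for 1.1.1.1 and 2.2.2.2 as an (x, y) point and measures the straight-line distance between them. The values are taken from the dataset in this post; the variable names are only for illustration.

```python
from math import hypot

# (value for 1.1.1.1, value for 2.2.2.2), copied from the talkers dataset
lisa = (2.5, 3.5)
kirk = (3.0, 3.5)

# Euclidean distance between the two points on the 1.1.1.1 / 2.2.2.2 plane
distance = hypot(lisa[0] - kirk[0], lisa[1] - kirk[1])
print(distance)  # 0.5
```

The two only differ on 1.1.1.1, so the distance is just that 0.5 gap. These are also the coordinates you would put on a scatter plot.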

talkers = {
    'Lisa': {'1.1.1.1': 2.5, '2.2.2.2': 3.5,
             '3.3.3.3': 3.0, '4.4.4.4': 3.5, '5.5.5.5': 2.5,
             '6.6.6.6': 3.0},
    'Kirk': {'1.1.1.1': 3.0, '2.2.2.2': 3.5,
             '3.3.3.3': 1.5, '4.4.4.4': 5.0, '6.6.6.6': 3.0,
             '5.5.5.5': 3.5},
    'Phillip': {'1.1.1.1': 2.5, '2.2.2.2': 3.0,
                '4.4.4.4': 3.5, '6.6.6.6': 4.0},
    'Dan': {'2.2.2.2': 3.5, '3.3.3.3': 3.0,
            '6.6.6.6': 4.5, '4.4.4.4': 4.0,
            '5.5.5.5': 2.5},
    'James': {'1.1.1.1': 3.0, '2.2.2.2': 4.0,
              '3.3.3.3': 2.0, '4.4.4.4': 3.0, '6.6.6.6': 3.0,
              '5.5.5.5': 2.0},
    'Britney': {'1.1.1.1': 3.0, '2.2.2.2': 4.0,
                '6.6.6.6': 3.0, '4.4.4.4': 5.0, '5.5.5.5': 3.5},
    'Toby': {'2.2.2.2': 4.5, '5.5.5.5': 1.0, '4.4.4.4': 4.0},
}

from math import sqrt

# Returns a distance-based similarity score for person1 and person2
def sim_distance(prefs, person1, person2):
    # Collect the items (IP addresses) that both people have values for
    shared_items = [item for item in prefs[person1] if item in prefs[person2]]
    # If they have no items in common, return 0
    if not shared_items:
        return 0
    # Add up the squares of all the differences
    sum_of_squares = sum((prefs[person1][item] - prefs[person2][item]) ** 2
                         for item in shared_items)
    # The Euclidean distance is the square root of that sum. Mapping it
    # through 1/(1+d) gives 1.0 for identical values and smaller scores
    # as the two people move further apart.
    return 1 / (1 + sqrt(sum_of_squares))
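The title also mentions Pearson correlation, but only the distance function appears in this post. Below is a minimal sketch of what a Pearson-based score could look like, assuming the same nested-dictionary layout. The name sim_pearson and the small sample dictionary are my own illustration (the sample reuses Lisa's and Kirk's values from the dataset), not code from the original exercise.

```python
from math import sqrt

def sim_pearson(prefs, person1, person2):
    """Pearson correlation over the items both people share (hypothetical helper)."""
    shared = [item for item in prefs[person1] if item in prefs[person2]]
    n = len(shared)
    if n == 0:
        return 0.0  # nothing in common, so no basis for a score

    x = [prefs[person1][item] for item in shared]
    y = [prefs[person2][item] for item in shared]
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Covariance and variances (unnormalised; the 1/n factors cancel out)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)

    denom = sqrt(var_x * var_y)
    if denom == 0:
        return 0.0  # one person has constant values; correlation is undefined
    return cov / denom

# Lisa and Kirk, copied from the talkers dataset
sample = {
    'Lisa': {'1.1.1.1': 2.5, '2.2.2.2': 3.5, '3.3.3.3': 3.0,
             '4.4.4.4': 3.5, '5.5.5.5': 2.5, '6.6.6.6': 3.0},
    'Kirk': {'1.1.1.1': 3.0, '2.2.2.2': 3.5, '3.3.3.3': 1.5,
             '4.4.4.4': 5.0, '5.5.5.5': 3.5, '6.6.6.6': 3.0},
}

print(round(sim_pearson(sample, 'Lisa', 'Kirk'), 3))  # 0.396
```

Unlike the distance score, Pearson looks at whether two people's values rise and fall together, so it is less sensitive to one person having consistently higher numbers than the other.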
