For this part of the exercise, I look at two IP addresses at a time and calculate similarity using Euclidean distance and Pearson correlation. I created a small dataset as a nested dictionary: each person maps to the IP addresses they talked to, and each address has a numeric score. I did the calculations by hand, but Python's pandas can work the numbers easily. I calculate the distance of Lisa from Kirk by isolating 1.1.1.1 and 2.2.2.2 and plotting those two scores on a graph. I do the same for each combination of people and each combination of IP addresses. I even find people who are very similar and one who is not as similar. This model can help reveal clusters and establish a baseline for conversations between people and the IP addresses they visit. Somehow it all makes sense to me.
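To make the Lisa and Kirk comparison concrete, here is a minimal sketch of the hand calculation for the two isolated addresses. The scores are copied from the dataset; the helper name euclidean_distance is my own choice for illustration and is not part of the code in this post.

```python
from math import sqrt

# Scores for the two isolated addresses, copied from the talkers dataset.
lisa = {'1.1.1.1': 2.5, '2.2.2.2': 3.5}
kirk = {'1.1.1.1': 3.0, '2.2.2.2': 3.5}

def euclidean_distance(a, b):
    """Straight-line distance between two people over the addresses they share.

    Hypothetical helper for illustration only.
    """
    shared = [ip for ip in a if ip in b]
    return sqrt(sum((a[ip] - b[ip]) ** 2 for ip in shared))

distance = euclidean_distance(lisa, kirk)
print(distance)  # 0.5
```

The two people agree exactly on 2.2.2.2, so the whole distance comes from the 0.5 gap on 1.1.1.1. On a graph with one address per axis, that is the length of the line between the two points.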
talkers = {
    'Lisa': {'1.1.1.1': 2.5, '2.2.2.2': 3.5,
             '3.3.3.3': 3.0, '4.4.4.4': 3.5, '5.5.5.5': 2.5,
             '6.6.6.6': 3.0},
    'Kirk': {'1.1.1.1': 3.0, '2.2.2.2': 3.5,
             '3.3.3.3': 1.5, '4.4.4.4': 5.0, '6.6.6.6': 3.0,
             '5.5.5.5': 3.5},
    'Phillip': {'1.1.1.1': 2.5, '2.2.2.2': 3.0,
                '4.4.4.4': 3.5, '6.6.6.6': 4.0},
    'Dan': {'2.2.2.2': 3.5, '3.3.3.3': 3.0,
            '6.6.6.6': 4.5, '4.4.4.4': 4.0,
            '5.5.5.5': 2.5},
    'James': {'1.1.1.1': 3.0, '2.2.2.2': 4.0,
              '3.3.3.3': 2.0, '4.4.4.4': 3.0, '6.6.6.6': 3.0,
              '5.5.5.5': 2.0},
    'Britney': {'1.1.1.1': 3.0, '2.2.2.2': 4.0,
                '6.6.6.6': 3.0, '4.4.4.4': 5.0, '5.5.5.5': 3.5},
    'Toby': {'2.2.2.2': 4.5, '5.5.5.5': 1.0, '4.4.4.4': 4.0},
}
from math import sqrt

# Returns a distance-based similarity score for person1 and person2
def sim_distance(prefs, person1, person2):
    # Get the list of shared items
    si = {}
    for item in prefs[person1]:
        if item in prefs[person2]:
            si[item] = 1

    # If they have no ratings in common, return 0
    if len(si) == 0:
        return 0

    # Add up the squares of all the differences
    sum_of_squares = sum(pow(prefs[person1][item] - prefs[person2][item], 2)
                         for item in si)

    # Map the total into a 0-to-1 score: identical ratings give 1
    return 1 / (1 + sum_of_squares)
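Here is a rough sketch of how the function might be run over every combination of people. To keep it short and self-contained it uses only three people from the dataset and restates the same scoring rule. The names talkers_subset and rank_pairs are hypothetical and exist only for this example.

```python
from itertools import combinations

# A three-person subset of the talkers dataset, to keep the example short.
talkers_subset = {
    'Lisa': {'1.1.1.1': 2.5, '2.2.2.2': 3.5, '3.3.3.3': 3.0,
             '4.4.4.4': 3.5, '5.5.5.5': 2.5, '6.6.6.6': 3.0},
    'Kirk': {'1.1.1.1': 3.0, '2.2.2.2': 3.5, '3.3.3.3': 1.5,
             '4.4.4.4': 5.0, '6.6.6.6': 3.0, '5.5.5.5': 3.5},
    'Toby': {'2.2.2.2': 4.5, '5.5.5.5': 1.0, '4.4.4.4': 4.0},
}

def sim_distance(prefs, person1, person2):
    """Same scoring rule as above: 1 / (1 + sum of squared differences)."""
    shared = [item for item in prefs[person1] if item in prefs[person2]]
    if not shared:
        return 0
    sum_of_squares = sum((prefs[person1][item] - prefs[person2][item]) ** 2
                         for item in shared)
    return 1 / (1 + sum_of_squares)

def rank_pairs(prefs):
    """Hypothetical helper: score every pair of people, most similar first."""
    scores = [(sim_distance(prefs, a, b), a, b)
              for a, b in combinations(prefs, 2)]
    return sorted(scores, reverse=True)

for score, a, b in rank_pairs(talkers_subset):
    print(f'{a} - {b}: {score:.3f}')
# Lisa - Toby: 0.222
# Lisa - Kirk: 0.148
# Kirk - Toby: 0.108
```

A score closer to 1 means the two people scored their shared addresses more alike. Within this subset Lisa and Toby come out closest, and Kirk and Toby furthest apart.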