Using machine learning to identify password hacking attempts
December 20, 2012
Adversarial machine learning – “the study of effective machine learning techniques against an adversarial opponent”
Looking at failed attempts is just one feature in trying to identify password hacking attempts. The traditional tool is a dashboard that helps humans identify password usage abuse. Finding the “bad guys” in a sea of authenticated users is not trivial, and building an algorithm to catch them is a tough problem.
Machine learning can help identify potential abuses. Below are a few password and login features that I can think of right now:
- Source IP address
- Browser information
- User agent string
- Cookies in the browser
- Time of the login
- Location of login
- Incorrect password guesses
- Origin-bound certificates
- One-time SMS codes
- One-time email code requests
- One-time website code requests
- Typing dynamics and history
- Additional authentication from unknown devices
- Secondary authentication on top of password
- Usage behaviour
- Password change behaviour
- New user login locations
- New user login IPs
- New user login devices
- Time of day probability of login
- Location probability of login
- Device probability of login
- Destination IP probability of login
- Login attempts from small botnets
- Login attempts from large botnets (a million or more)
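The probability-style features in the list above (time of day, location, device) could be estimated from each user's own login history. Here is a minimal sketch, assuming a simple per-user frequency count with Laplace smoothing. The `LoginProfile` class, the `alpha` parameter and the sample history for "alice" are all hypothetical, made up for illustration.

```python
from collections import Counter, defaultdict


class LoginProfile:
    """Per-user frequency model over one categorical login feature."""

    def __init__(self, n_categories, alpha=1.0):
        self.n_categories = n_categories  # e.g. 24 for hour of day
        self.alpha = alpha                # Laplace smoothing strength (assumed)
        self.counts = defaultdict(Counter)

    def observe(self, user, value):
        """Record one successful login with the given feature value."""
        self.counts[user][value] += 1

    def probability(self, user, value):
        """Smoothed probability that this user logs in with this value."""
        seen = self.counts[user]
        total = sum(seen.values())
        return (seen[value] + self.alpha) / (total + self.alpha * self.n_categories)


# Hypothetical history: "alice" mostly logs in during the morning.
hour_model = LoginProfile(n_categories=24)
for hour in [9, 9, 10, 9, 11, 10, 9, 9]:
    hour_model.observe("alice", hour)

usual = hour_model.probability("alice", 9)    # a common hour for alice
unusual = hour_model.probability("alice", 3)  # an hour never seen before
print(usual, unusual)
```

A low probability on its own is weak evidence. The idea would be to feed it into a model alongside the other features rather than to alert on it directly.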
Potential Machine Learning Algorithms
- A variation of Google’s PageRank algorithm that leverages prior knowledge of both malicious and benign domains to assign a rank to new domains encountered through user logins, locations, etc., based on the above features.
- A personalized PageRank variation that focuses on what happens just before and just after logins, combined with the features learned by the algorithm above.
- A fast-flux algorithm for domains and pop-up IP addresses that appear to be new or are only valid for a limited period of time. Combined with information from spam nets, botnets, Google Safe Browsing, malware, blacklists and whitelists, it can be leveraged to alert on a high probability of risk.
- A password history analysis algorithm that looks for patterns in password changes, combined with the algorithms above. Time, location, device, passwords, password history, password change history, source and destination are all strong features.
- User modeling: age, sex, group, location
- User recommendation: cosine similarity, collaborative filtering, ARL ranking users. Maybe even features based on login histories to identify genuine logins.
- User reputation: PageRank or weighted PageRank for user reputations
- Analysis of pairwise user interactions, with a logistic regression model to predict the strength of user ties
- Analytics and metrics by country, language and user login history, using cohort and session analysis to identify anomalies
- Techniques built around hashtags, geotags, entities or conversation threads can be reused by substituting logins, servers contacted, geo, etc.
- Top users and user rankings based on contact, combined with recency metrics and calendar metrics
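To make the personalized PageRank idea above a little more concrete, here is a rough sketch, assuming logins are modeled as an undirected graph linking users to the IP addresses they logged in from, with random restarts that jump back to a set of known-bad seed nodes. The function name, the graph layout and the sample users and IPs are hypothetical. This is plain power iteration, not any particular product's algorithm.

```python
def personalized_pagerank(edges, seeds, damping=0.85, iterations=50):
    """Power iteration on an undirected graph; restarts jump to `seeds`.

    Assumes every seed appears in at least one edge.
    """
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    nodes = list(neighbors)

    # Restart distribution: uniform over the seed nodes, zero elsewhere.
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)

    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for n in nodes:
            # Each node splits its damped rank evenly among its neighbors.
            share = damping * rank[n] / len(neighbors[n])
            for m in neighbors[n]:
                nxt[m] += share
        rank = nxt
    return rank


# Hypothetical login graph: each edge is (user, source IP of a login).
edges = [
    ("alice", "ip:10.0.0.5"),
    ("bob", "ip:10.0.0.5"),
    ("bob", "ip:203.0.113.7"),
    ("mallory", "ip:203.0.113.7"),
    ("mallory", "ip:198.51.100.9"),
]
known_bad = {"ip:198.51.100.9"}  # e.g. taken from a blacklist

scores = personalized_pagerank(edges, known_bad)
for node, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{node:20s} {score:.4f}")
```

Nodes that sit close to the known-bad seed end up with higher scores, so in this toy graph "mallory" ranks above "bob", who ranks above "alice". Seeding with known-good nodes instead would give a reputation score in the other direction.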