BigSnarf blog

Infosec FTW

Predictive Analytics Part 2 – Supervised and Unsupervised learning

This is the second part of my journey into making a predictive analytics engine for DFIR/Infosec.  In my previous blog entry, Unlocking Level 6, I discussed going down the rabbit hole of making your own algorithms for predictive analytics.  After spending the last week researching, I’m going to experiment with packages and software already built.  Truthfully, I just don’t have enough experience to make a reliable algorithm from thin air.

Here’s my toolbox so far:

  • Data manipulation: Python, Vim
  • Interactive analysis: Excel
  • Visualizations: Gephi, Tableau, d3.js, ggplot2, matplotlib, graphInsight, pythonD3
  • General Purpose: IPython, Python, NumPy, SciPy, Java, Chrome Developer Tools
  • Statistical Toolkit: R, RStudio, caret, ggplot2
  • Predictive: Mahout, Weka, scikit-learn
  • Version Control: git
  • Cluster: 4 node CDH3
  • Search Cluster: 2 node Elasticsearch
  • Sandbox cluster: 70 node Hadoop

I’ve set some goals:

  • Create a system that requires less human intervention to operate effectively. Current log analyzers are not intelligent.
  • Create a system capable of detecting known and unknown intrusions intelligently and automatically
  • Distinguish normal network activity from abnormal and malicious attacks with minimal human input
Some observations on traditional IDS:
  • Pattern matching algorithms
  • Stateful pattern matching algorithms that inspect the whole data stream
  • Protocol decode-based analysis, which I call the bounce fuzzing algorithm
  • Heuristic-based analysis, which evaluates traffic using pre-programmed algorithmic logic
  • Anomaly detection, which tries to flag anomalous actions based on what it learned in training from patterns assumed to be normal
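To make the difference between the first and last approaches concrete, here is a minimal sketch in Python. The signature strings, the requests-per-minute feature, and the three-standard-deviation cutoff are all assumptions I picked for illustration; they are not taken from any real IDS.

```python
import statistics

# Hypothetical signatures; a real IDS ships far more, and far better, rules.
SIGNATURES = ["/etc/passwd", "cmd.exe", "' OR 1=1"]


def matches_signature(payload):
    """Pattern matching: flag a payload containing any known-bad substring."""
    return any(sig in payload for sig in SIGNATURES)


def fit_baseline(normal_values):
    """Anomaly detection, training step: summarize traffic assumed normal."""
    return statistics.mean(normal_values), statistics.stdev(normal_values)


def is_anomalous(value, baseline, max_z=3.0):
    """Flag a value more than max_z standard deviations from the baseline mean."""
    mean, stdev = baseline
    return abs(value - mean) > max_z * stdev


# Toy numbers: requests per minute from one host during a quiet period.
baseline = fit_baseline([48, 52, 50, 47, 53, 51, 49, 50])

print(matches_signature("GET /index.html"))
print(matches_signature("GET /../../etc/passwd"))
print(is_anomalous(51, baseline))
print(is_anomalous(400, baseline))
```

The signature check can only catch what someone already wrote a rule for, while the baseline check can flag something never seen before, at the cost of false alarms whenever normal behavior shifts.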
I learned that there are two types of learning algorithms: supervised learning and unsupervised learning. Supervised learning is the machine learning task of inferring a function from labeled training data. Unsupervised learning refers to the problem of trying to find hidden structure in unlabeled data. Below are two types of supervised methods that I like:
  • Classification will be used to predict a category, bucketing traffic as abnormal vs. normal
  • Regression will be used to predict a continuous output value, such as a score for how abnormal the traffic looks
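A rough sketch of how the two differ in practice: regression returns a number, and classification returns a bucket. The feature (failed logins per minute), the risk scores, and the 0.5 threshold below are toy values I made up, and the line fit is plain one-variable least squares rather than anything from a particular package.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Toy labeled data: failed logins per minute -> hand-assigned risk score.
failed_logins = [0, 10, 20, 30, 40]
risk_scores = [0.1, 0.2, 0.5, 0.8, 0.9]

slope, intercept = fit_line(failed_logins, risk_scores)


def risk_score(logins):
    """Regression output: a continuous value."""
    return slope * logins + intercept


def bucket(score, threshold=0.5):
    """Classification output: one of two categories."""
    return "abnormal" if score >= threshold else "normal"


for logins in (5, 25):
    score = risk_score(logins)
    print(logins, round(score, 2), bucket(score))
```

Thresholding a regression score is only one way to get a classifier; most of the algorithms listed next predict the category directly.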
Some supervised learning algorithms:
  • Neural Networks
  • Naive Bayes
  • Nearest Neighbor
  • Regression models
  • Support Vector Machines (SVMs)
  • Decision Trees
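As an example of one entry from this list, here is a small nearest neighbor classifier written with only the standard library. The two features (connection duration and kilobytes sent) and every data point are invented for illustration; real work would use a package such as scikit-learn or Weka from the toolbox above.

```python
import math
from collections import Counter

# Toy labeled training set: (duration_seconds, kilobytes_sent) -> label.
TRAINING = [
    ((1.0, 2.0), "normal"),
    ((1.5, 1.8), "normal"),
    ((1.2, 2.2), "normal"),
    ((9.0, 0.1), "attack"),
    ((8.5, 0.3), "attack"),
    ((9.5, 0.2), "attack"),
]


def knn_predict(point, training, k=3):
    """Majority vote among the k training points closest to `point`."""
    nearest = sorted(training, key=lambda item: math.dist(point, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


print(knn_predict((1.1, 2.1), TRAINING))
print(knn_predict((9.0, 0.2), TRAINING))
```

The labels are what make this supervised: the classifier can only answer with categories that appeared in its training data.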
Some unsupervised learning algorithms:
  • K-nearest neighbor used as a distance-based outlier score (the k-NN classifier itself is supervised)
  • Neural-network-based approaches for meeting a threshold
  • Partition-based clustering (for example, k-means)
  • Hierarchical clustering
  • Probabilistic clustering
  • Gaussian mixture models (GMMs)
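For a feel of how clustering works without labels, here is a bare-bones k-means sketch (k-means is a common partition-style clustering method). The points and the choice of starting centroids are toy assumptions; I pass the starting centroids in explicitly so the run is repeatable.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate assignment and centroid update steps."""
    assignments = []
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for every point.
        assignments = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # converged: nothing moved
        centroids = new_centroids
    return centroids, assignments


# Toy unlabeled points forming two obvious groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2), (8.0, 8.0), (8.2, 7.8), (7.8, 8.2)]
centroids, assignments = kmeans(points, [points[0], points[-1]])
print(assignments)
```

No point was ever tagged normal or abnormal here. The algorithm only groups similar points, and deciding what a small or distant cluster means is still up to the analyst.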
Some applications using learning algorithms:
  • Spam detection
  • Handwriting recognition
  • Google Street View
  • Speech recognition
  • Facial Recognition
  • Netflix recommendation
  • Robotic navigation
In the next blog post I will select a couple of methods to detect abnormal traffic. I will set up an environment and experiment with the KDD Cup 99 dataset, which is derived from DARPA intrusion detection evaluation data, to see if we can identify abnormal traffic. Types of traffic to analyze:
    –Class 0 normal
    –Class 1 probe
    –Class 2 denial of service (DOS)
    –Class 3 user-to-root (U2R)
    –Class 4 remote-to-local (R2L)
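Here is a sketch of how raw KDD Cup 99 labels could be mapped onto these five classes. It assumes the usual file layout, where each record is a comma-separated line whose last field is the label followed by a period, and it covers only the attack names I know from the training set. The sample records are shortened, made-up lines (real ones carry 41 features before the label).

```python
CLASS_NAMES = {0: "normal", 1: "probe", 2: "DoS", 3: "U2R", 4: "R2L"}

# Training-set attack names grouped by class; the test set adds names that
# are not listed here and would raise KeyError in class_of().
ATTACKS_BY_CLASS = {
    1: ["ipsweep", "nmap", "portsweep", "satan"],
    2: ["back", "land", "neptune", "pod", "smurf", "teardrop"],
    3: ["buffer_overflow", "loadmodule", "perl", "rootkit"],
    4: ["ftp_write", "guess_passwd", "imap", "multihop", "phf", "spy",
        "warezclient", "warezmaster"],
}

LABEL_TO_CLASS = {"normal": 0}
for class_id, names in ATTACKS_BY_CLASS.items():
    for name in names:
        LABEL_TO_CLASS[name] = class_id


def class_of(record_line):
    """Return the class number for one comma-separated KDD record line."""
    label = record_line.strip().split(",")[-1].rstrip(".")
    return LABEL_TO_CLASS[label]


# Shortened, illustrative records only.
samples = [
    "0,tcp,http,SF,181,5450,normal.",
    "0,icmp,ecr_i,SF,1032,0,smurf.",
    "0,tcp,telnet,SF,1511,2957,buffer_overflow.",
]
for line in samples:
    print(CLASS_NAMES[class_of(line)])
```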
Learn More:
This course is an introduction to machine learning, data mining, and statistical pattern recognition. Topics include: (i) Supervised learning (parametric/non-parametric algorithms, support vector machines, kernels, neural networks). (ii) Unsupervised learning (clustering, dimensionality reduction, recommender systems, deep learning). (iii) Best practices in machine learning (bias/variance theory; innovation process in machine learning and AI). (iv) Reinforcement learning. The course also draws on numerous case studies and applications, so you'll learn how to apply learning algorithms to building smart robots (perception, control), text understanding (web search, anti-spam), computer vision, medical informatics, audio, database mining, and other areas.

This list shows some of the methods you can use with PyBrain.

Supervised Learning

  • Back-Propagation
  • R-Prop
  • Support-Vector-Machines (LIBSVM interface)
  • Evolino

Unsupervised Learning

  • K-Means Clustering
  • PCA/pPCA
  • LSH for Hamming and Euclidean Spaces
  • Deep Belief Networks
