BigSnarf blog

Infosec FTW

Monthly Archives: March 2012

Visualization: On the search for the Gold Pot at the ends of the rainbows – Gephi Visualization of Twitter Friends and Followers

Visualization: Archived all my Twitter Timeline with RStudio and built this twitter analysis graphic

Visualization: CIRCOS and Twitter Analysis of Edge Node Lists

Installing Python, MatPlotLib and iPython on Mac OSX 10.7.3


Thanks to @dpbrown. Thanks, Daniel, for the post on getting iPython on my…

Mac OS X 10.7.3 with 32-bit Python 2.7.2, MatPlotLib 1.1.0 and iPython 0.12.  Note: currently only the 32-bit version of Python works consistently with MatPlotLib and iPython.

  1. Install Python 2.7.2:
    1. Download the prebuilt ‘python-2.7.2-macosx10.3.dmg’.
    2. Double click the DMG image and double click the pkg.
    3. Open a terminal and run the command ‘python -V’ to verify that you have ‘Python 2.7.2’.
  2. Install MatPlotLib 1.1.0:
    1. Download the prebuilt ‘
    2. Mount the DMG image and run the contained installer.
    3. Verify this worked by opening a terminal, running python and then ‘import matplotlib’ followed by ‘print matplotlib.__version__’ which should return ‘1.1.0’.
  3. Finally iPython 0.12:
    1. Download the iPython source ‘
    2. Extract the zip file.
    3. cd into extracted directory ‘ipython-0.12′.
    4. Run the command ‘sudo python setup.py install’ and enter your password when prompted.
    5. Verify this by running iPython with MatPlotLib via ‘ipython --pylab’ and then ‘x = randn(10000)’ followed by ‘hist(x, 100)’; a chart window like the image above should pop up.
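The verification steps above can also be scripted. This is a minimal sketch using only the standard library; the helper name `installed_version` is made up for illustration, and it assumes the packages are imported under the names `matplotlib` and `IPython`.

```python
import importlib


def installed_version(module_name):
    """Return the module's __version__ string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    # Not every module exposes __version__, so fall back to a marker string.
    return getattr(module, "__version__", "unknown")


if __name__ == "__main__":
    # Module names are assumptions based on the packages installed above.
    for name in ("matplotlib", "IPython"):
        print("%s %s" % (name, installed_version(name)))
```

A result of `None` means the import failed, which usually points back to the 32-bit vs. 64-bit Python mismatch noted at the top of the post.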

Predictive Analytics Part 2 – Supervised and Unsupervised learning

This is the second part of my journey into making a predictive analytics engine for DFIR/Infosec.  In my previous blog entry, Unlocking Level 6, I discussed going down the rabbit hole of making your own algorithms for predictive analytics.  After spending the last week researching, I’m going to experiment with packages and software already built.  Truthfully, I just don’t have enough experience to make a reliable algorithm from thin air.

Here’s my toolbox so far:

  • Data manipulation: Python, Vim
  • Interactive analysis: Excel
  • Visualizations: Gephi, Tableau, d3.js, ggplot2, matplotlib, graphInsight, pythonD3
  • General Purpose: iPython, Python, NumPy, Scipy, Java, Chrome Developer Toolkit
  • Statistical Toolkit: R, RStudio, caret, ggplot2
  • Predictive: Mahout, Weka, scikit-learn
  • Version Control: git
  • Cluster: 4 node CDH3
  • Search Cluster: 2 Node Elasticsearch
  • Sandbox cluster: 70 node Hadoop

I’ve set some goals:

  • Goal is to create a system that requires less human intervention to operate effectively. Current log analyzers are not intelligent.
  • Goal is to create a system capable of detecting known and unknown intrusions intelligently and automatically
  • Goal is to distinguish normal network activities from abnormal and malicious attacks with minimum human input
Some observations on traditional IDS:
  • Pattern Matching algorithms
  • Stateful Pattern Matching whole data stream algorithms
  • Protocol Decode-Based Analysis, which I call a bounce fuzzing algorithm
  • Heuristic-Based Analysis that looks at traffic based on pre-programmed algorithmic logic
  • Anomaly Detection tries to find out anomalous actions based on the learning of its previous training experience with patterns assumed as normal
I learned that there are two types of learning algorithms: 1. Supervised Learning 2. Unsupervised Learning. Supervised learning is the machine learning task of inferring a function from supervised (labeled) training data. In machine learning, unsupervised learning refers to the problem of trying to find hidden structure in unlabeled data. Below are two types of methods that I like:
  • Classification will be used to bucket traffic into classes, such as abnormal (attack) vs. normal
  • Regression will be used to predict an output value that separates abnormal (attack) traffic from normal traffic
List of some of the Supervised Learning Algorithms:
  • Neural Networks
  • Naive Bayes
  • Nearest Neighbor
  • Regression models
  • Support Vector Machines (SVMs)
  • Decision Trees
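To make the supervised case concrete, here is a minimal sketch of a one-nearest-neighbor classifier in plain Python. The feature vectors (connections per minute, failed logins) and their labels are made-up toy values, and `nearest_neighbor_predict` is a hypothetical helper, not part of any library in my toolbox.

```python
from math import sqrt


def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbor_predict(training, sample):
    """Return the label of the labeled training example closest to sample."""
    closest = min(training, key=lambda row: euclidean(row[0], sample))
    return closest[1]


# Toy labeled data (assumed values): (connections per minute, failed logins)
training = [
    ((5.0, 0.0), "normal"),
    ((8.0, 1.0), "normal"),
    ((120.0, 30.0), "abnormal"),
    ((200.0, 45.0), "abnormal"),
]

print(nearest_neighbor_predict(training, (7.0, 0.0)))
print(nearest_neighbor_predict(training, (150.0, 40.0)))
```

The labels in `training` are what make this supervised: the algorithm never has to discover what "abnormal" means, it only copies the label of the closest known example.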
List of some of the Unsupervised Learning Algorithms:
  • K-means clustering
  • Neural network based approaches for meeting a threshold
  • Partition-based clustering
  • Hierarchical clustering
  • Probabilistic based clustering
  • Gaussian Mixture Models (GMM)
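For contrast with the supervised case, here is a rough sketch of clustering on unlabeled data, using a tiny k-means written in plain Python. The points are made-up toy values, the naive "first k points" initialization and fixed iteration count are simplifying assumptions, and `kmeans` is a hypothetical helper rather than a library call.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10):
    """Tiny k-means: returns (centroids, assignments) for a list of tuples."""
    centroids = list(points[:k])  # naive init: first k points (an assumption)
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(dim) / float(len(members)) for dim in zip(*members)
                )
    return centroids, assignments


# Toy unlabeled data (assumed values): (connections per minute, failed logins)
points = [(5.0, 0.0), (120.0, 30.0), (8.0, 1.0),
          (200.0, 45.0), (6.0, 0.0), (150.0, 40.0)]
centroids, assignments = kmeans(points, k=2)
print(assignments)
```

No labels go in, so the output is only cluster numbers; deciding which cluster is the "abnormal" one is still up to the analyst.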
Some applications using learning algorithms:
  • SPAM detection
  • Handwriting detection
  • Google Streetview
  • Speech recognition
  • Facial Recognition
  • Netflix recommendation
  • Robotic navigation
In the next blog post I will select a couple of methods to detect abnormal traffic. I will set up an environment and experiment with the KDD Cup 99 dataset from DARPA to see if we can identify abnormal traffic.  Types of traffic to analyze:
    –Class 0 normal
    –Class 1 probe
    –Class 2 denial of service (DOS)
    –Class 3 user-to-root (U2R)
    –Class 4 remote-to-local (R2L)
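The five classes above can be written as a lookup table. This is a sketch under assumptions: the attack names and the trailing period on each label are how I understand the public KDD Cup 99 training data, so check them against the dataset's own documentation. `label_to_class` is a hypothetical helper.

```python
# Class numbers follow the list above.
CLASS_NAMES = {0: "normal", 1: "probe", 2: "DOS", 3: "U2R", 4: "R2L"}

# Attack names are assumed from the public KDD Cup 99 training data.
ATTACK_TO_CLASS = {
    "normal": 0,
    "ipsweep": 1, "nmap": 1, "portsweep": 1, "satan": 1,
    "back": 2, "land": 2, "neptune": 2, "pod": 2, "smurf": 2, "teardrop": 2,
    "buffer_overflow": 3, "loadmodule": 3, "perl": 3, "rootkit": 3,
    "ftp_write": 4, "guess_passwd": 4, "imap": 4, "multihop": 4, "phf": 4,
    "spy": 4, "warezclient": 4, "warezmaster": 4,
}


def label_to_class(raw_label):
    """Map a raw label such as 'smurf.' to a class number, or None if unknown."""
    return ATTACK_TO_CLASS.get(raw_label.strip().rstrip("."))


print(label_to_class("smurf."), CLASS_NAMES[label_to_class("smurf.")])
```

Returning `None` for unknown labels keeps new attack types visible instead of silently filing them under an existing class.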
Learn More:
The course linked below is an introduction to machine learning, data mining, and statistical pattern recognition. Topics include: (i) Supervised learning (parametric/non-parametric algorithms, support vector machines, kernels, neural networks). (ii) Unsupervised learning (clustering, dimensionality reduction, recommender systems, deep learning). (iii) Best practices in machine learning (bias/variance theory; innovation process in machine learning and AI). (iv) Reinforcement learning. The course will also draw from numerous case studies and applications, so that you’ll also learn how to apply learning algorithms to building smart robots (perception, control), text understanding (web search, anti-spam), computer vision, medical informatics, audio, database mining, and other areas.

This list shows some of the methods you can use with PyBrain.

Supervised Learning

  • Back-Propagation
  • R-Prop
  • Support-Vector-Machines (LIBSVM interface)
  • Evolino

Unsupervised Learning

  • K-Means Clustering
  • PCA/pPCA
  • LSH for Hamming and Euclidean Spaces
  • Deep Belief Networks

Andrew Ng video explains linear regression with one variable – Intro to Machine Learning
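For readers who want to try the idea without the video, here is a small sketch of linear regression with one variable. It uses the closed-form least-squares solution, which is not necessarily the method the video walks through, and the data points are made-up toy values. `fit_line` is a hypothetical helper.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one variable)."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Toy data (assumed values) lying exactly on y = 1 + 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
intercept, slope = fit_line(xs, ys)
print(intercept, slope)
```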



Awesome Visual Exploration of Time with Timesearcher from HCIL

Lifelines is another visual analysis tool for categorized data

Creating my first algorithm from scratch – Euclidean distance and Pearson correlation

For this part of the exercise, I look at 2 IP addresses and calculate similarity using Euclidean distance and Pearson correlation. I created a small dataset that is a nested dictionary. I did manual calculations, but Python’s pandas can work the numbers easily. I calculate the distance of Lisa from Kirk by isolating two IP addresses and plotting those on a graph.  I do it for each of the combinations of people and each of the combinations of IP addresses. I even find people that are very similar and one that is not as similar.  This model can help understand clusters and identify baseline conversations between people and visited IP addresses. Somehow it all makes sense to me.

talkers={‘Lisa’: {’′: 2.5, ’′: 3.5,
’′: 3.0, ’′: 3.5, ’′: 2.5,
’′: 3.0},
‘Kirk’: {’′: 3.0, ’′: 3.5,
’′: 1.5, ’′: 5.0, ’′: 3.0,
’′: 3.5},
‘Phillip’: {’′: 2.5, ’′: 3.0,
’′: 3.5, ’′: 4.0},
‘Dan’: {’′: 3.5, ’′: 3.0,
’′: 4.5, ’′: 4.0,
’′: 2.5},
‘James’: {’′: 3.0, ’′: 4.0,
’′: 2.0, ’′: 3.0, ’′: 3.0,
’′: 2.0},
‘Britney’: {’’: 3.0, ’′: 4.0,
’′: 3.0, ’′: 5.0, ’′: 3.5},
‘Toby’: {’′:4.5,’′:1.0,’′:4.0}}

from math import sqrt

# Returns a distance-based similarity score for person1 and person2
def sim_distance(prefs, person1, person2):
  # Get the list of shared_items
  si = {}
  for item in prefs[person1]:
    if item in prefs[person2]:
      si[item] = 1
  # If they have no ratings in common, return 0
  if len(si) == 0:
    return 0
  # Add up the squares of all the differences
  sum_of_squares = sum([pow(prefs[person1][item] - prefs[person2][item], 2)
                        for item in prefs[person1] if item in prefs[person2]])
  # Float division so whole-number scores still give a fractional result
  return 1.0 / (1 + sum_of_squares)
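The post also mentions Pearson correlation, which the listing above does not show. Here is a minimal sketch of how it could be computed over the same kind of nested dictionary. The IP addresses (documentation-range placeholders) and scores are made-up toy values, and `sim_pearson` is a name chosen for illustration.

```python
from math import sqrt


def sim_pearson(prefs, person1, person2):
    """Pearson correlation of the scores two people share, or 0.0 if undefined."""
    shared = [item for item in prefs[person1] if item in prefs[person2]]
    n = len(shared)
    if n == 0:
        return 0.0
    xs = [prefs[person1][item] for item in shared]
    ys = [prefs[person2][item] for item in shared]
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    # Covariance and variances, all left unnormalized (the 1/n factors cancel).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denominator = sqrt(var_x * var_y)
    if denominator == 0:
        return 0.0  # one side never varies, so correlation is undefined
    return cov / denominator


# Toy data (assumed values); keys are placeholder IP addresses
demo = {
    "Lisa": {"192.0.2.1": 1.0, "192.0.2.2": 2.0, "192.0.2.3": 3.0},
    "Kirk": {"192.0.2.1": 3.0, "192.0.2.2": 5.0, "192.0.2.3": 7.0},
    "Toby": {"192.0.2.1": 3.0, "192.0.2.2": 2.0, "192.0.2.3": 1.0},
}
print(sim_pearson(demo, "Lisa", "Kirk"))
print(sim_pearson(demo, "Lisa", "Toby"))
```

Unlike the distance score, Pearson ignores scale: in the toy data Kirk's numbers are consistently higher than Lisa's but move in the same direction, so the two still correlate strongly, while Toby's move in the opposite direction.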

Trying to unlock Level 6 Achievement – Predictive Analytics

Level 6 Challenge – Predictive Modeling – Attack Simulation – War Games

Organizations have created thousands of models and have a solid understanding of the business and priorities. The organization is planning to use predictive analytics and statistical techniques from modeling, machine learning, data mining and game theory that analyze current and historical facts to make predictions about future events. These reports require heavy interaction of the BI, visualization, and Infosec teams to produce real validated results.

My journey into Data Mining

Following up on my recent blog post on Infosec and Big Data, I decided to write more about the journey into my thought process and investigation of how the Infosec industry is going to change. The next few posts will detail some learnings around building a variety of algorithms for a Big Data system.

I’ve learned that data mining involves choosing between 2 paths: look at data to explain the past, or use data to predict the future.  A variety of algorithms have been created by SIEM vendors to identify attacks.  Most of the algorithms are easy to reproduce for simple attacks like DOS and DDOS.  Detection for any attack that floods, brute-forces or breaks a threshold is easy to make. Twitter, Facebook, Netflix, Google and “put-your-social-here” have figured out how to data mine data/machine logs.
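As an illustration of why threshold-style detection is easy to reproduce, here is a minimal sketch that counts events per source IP within one time window and flags anything over a limit. The log entries, the threshold value, and the helper name `over_threshold` are all assumptions made for the example.

```python
from collections import Counter


def over_threshold(events, threshold):
    """Return source IPs that appear more than `threshold` times in events."""
    counts = Counter(source for source, _ in events)
    return sorted(ip for ip, n in counts.items() if n > threshold)


# Toy log for a single time window (assumed values): (source IP, event type)
events = [("192.0.2.10", "login_failed")] * 6 + [
    ("192.0.2.20", "login_failed"),
    ("192.0.2.30", "login_ok"),
]
print(over_threshold(events, threshold=5))
```

The hard part is not the counting but picking the threshold, which is one reason this kind of rule only covers the simple flood and brute-force cases.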

Data Mining —- Future —- Modelling
            \_ Past   —- Exploration

From my own experience, I have found it trivial to create a few examples to “Data Mine the Past”.  There are literally thousands of examples on how to data mine social. Everyone and their aardvarks have used a tool to “data mine” and visualized past data. I’m finding it less trivial to create “predictive models of the future”. Below is a graphic I found that explained the connection of Data Mining to other concepts.

Making your own “predictive models of the future”

Predictive models are also known as “machine learning” and “pattern recognition”.  Many models use one model and give the user one answer.  These one2one models are formula based models. The first challenge I ran into was deciding what tool to model in: R? Python? Weka? Mahout?  Next was finding a variety of data sets, cleaning the data sets, and loading them into the tool of my choice.

NOTE: All solutions must use Hadoop. Googling “predictive model marketplace” didn’t help much.  Why isn’t there a place on the web where people can freely share predictive models?

The next choice is finding or choosing the model I will experiment with. In predictive models, you have 4 choices: 1. Classification 2. Regression 3. Clustering 4. Association Rules.

Data Mining —- Future —- Modelling —- Classification
                                   \_ Regression
                                   \_ Clustering
                                   \_ Association Rules

I have identified 3 predictive algorithms to start with to discover network attacks. I will create simple algorithms in each of the categories of k-means clustering, k-nearest neighbour, and association rules.  A variety of papers can be found simply by searching Google for each type of algorithm concatenated with “network attacks”.  There is quite a bit of math and theory around these techniques that I am not familiar with.  Notwithstanding, I will try to explain and create trivial predictive algorithms in my next series of blog posts.
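As a small taste of the association-rule category, here is a sketch of the two basic measures, support and confidence, in plain Python. The session data and event names are made-up toy values, and both function names are chosen for illustration. This is not a full rule-mining algorithm such as Apriori.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / float(len(transactions))


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / base


# Toy sessions (assumed values): events observed per connection
sessions = [
    {"port_scan", "failed_login", "root_shell"},
    {"port_scan", "failed_login"},
    {"failed_login"},
    {"port_scan", "failed_login", "root_shell"},
]
rule_support = support(sessions, {"port_scan", "failed_login"})
rule_confidence = confidence(sessions, {"port_scan", "failed_login"}, {"root_shell"})
print(rule_support, rule_confidence)
```

A rule like "port scan and failed login, then root shell" is only interesting when both numbers are high enough; where to set those cut-offs is a judgment call.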

Read the second instalment of this blog series on building your own Predictive Analytics Engine on k-means clustering, k-nearest neighbour, and association rules.

Big Data Infosec – Bigsnarf Open Source Solution

To start the conversation off on my Big Data Infosec journey, I created this placemat to consider where Infosec might end up. This is an example of my first experiments with visualizing data with Hadoop and Hive. It follows the same idea as packetpig. (Link: PDF). I think that Big Data gives Information Security a second chance to “do it better”.

Last year, while building my POC, I found Wayne’s SherpaSurfing solution (Link:Slideshare) and thought that Big Data and Infosec could be better. Ben’s post (Link:Blog) is a sobering discussion of whether we are winning or losing. Scott Crawford has been discussing data-driven security in his (Link: blog series). George followed up with his post (Link:Website), which I interpreted as “I think we’re in a precarious spot”. Here’s the RSA panel’s position on the topic (Link:Website). Andrew suggests that SIEMs aren’t providing value by rebranding with Big Data “buzzword” stickers (Link:Website). Packetloop presented their interpretation of Big Data Infosec at Blackhat EU 2012 (Link:PDF). Raffael followed up recently with his post (Link:Website). Moar visual analytics! Ed in this post suggested that Infosec stop using “stoplight reports” and use different metrics to get situational awareness (Link:Website).

Level 1 – Data Collection

  • Organizations in this level have data silos spitting out a variety of log and machine data collected from various sources and “Enterprise Security” systems. Most organizations need humans to connect the silo’d data to interpret results. This is not a very good position to be in because there is a severe reliance on humans processing the data.

Level 2 – Big Data Aggregation

  • Organizations in this level have some semblance of a plan and big data strategy. The organization is focused on integration activities to get the silo’d data in some sort of Data Warehouse/Hadoop/(name your flavour of big data technology here). A POC system is providing some insight/reporting that has traditionally required an Infosec analyst to produce.

Level 3 – Basic Tools of Analysis

  • Organizations have managed to stockpile months/years worth of data for a “data scientist” and “Infosec analyst” to spend time producing standard charts and reports already produced by other silo’d systems. The difference is the data mining/pattern matching focus on the complete Infosec dataset. Ad hoc reports are pushed out of this group, but there is still quite a bit of hand holding to get a report generated. Hundreds of jobs are run nightly to produce the first Big Data Infosec KPIs and metric reports.

Level 4 – Data Enrichment / ETL / Real-time Queries

  • Organizations have managed to get several big data and stats experts on staff. A mature system is in place. Migration plans are being executed for ever-greening, along with formal training plans for users and power users of the system. In this level, it’s now time to look at another system that focuses on taking the best of the best of the Hadoop Gen1 system and creating data specifically for real time queries, visual analytics, drill down analysis, exploration and analysis, and serious BI digging. Teams are dedicated to getting every ounce of efficiency out of this system. DFIR, SOC, NOC, CISO tower, e-discovery, audit, and compliance all routinely use this system as the “private Google search engine” to answer on-demand questions and ad hoc queries. Routine answers at the touch of a submit button.

Level 5 – Business Intelligence

  • Organizations have made a decision to open the data gates and formally allow users to self-serve limited requests to the system. Real BI technologies provide historical and current views of business operations. Reporting, dynamic reporting, online analytical processing, analytics, data mining, process mining, complex event processing, business performance management, benchmarking, and text mining are typically “published” to internal company websites. Business intelligence aims to support better business decision-making.

Level 6 – Predictive Model – Attack Simulation – War Games

  • Joshua: Shall we play a game? Organizations have created thousands of models and have a solid understanding of the business and priorities. The organization is planning to use predictive analytics and statistical techniques from modeling, machine learning, data mining and game theory that analyze current and historical facts to make predictions about future events. These reports require heavy interaction of the BI, visualization, and Infosec teams to produce real validated results.

Level 7 – What-If Scenarios by CISO

  • Organizations are confident enough to let the CISO have query access and access to the system without a team being present. This really is the “dream” dashboard that every CISO wants but never gets. S/He can self serve plausible outcomes to any question they have. True value of the Information Security organization can easily be shared, like “Tweets from an iPhone.” At this level, the CISO has a cape under his/her suit.

Level 8 – Data Democracy

  • This is where you set your data free. Free as in let any user have access to query the system. Users, administrators, outside institutions, and the Interwebs are publicly able to query your Infosec Big Data. At this level your Infosec organization is at a new level of transparency. Why is Infosec always guessing or investigating what’s normal or abnormal? Why is one person trying to understand everyone else’s data and patterns? At this level every user is empowered to participate in their own security discovery. Social Collective Security Intelligence (SCSI?). Infosec could be considered “social” at this level as opposed to secretive. Users consume and routinely self-serve vanity queries to “pump their own egos” because they have access to “life statistics”.

Where does your organization fall into the Big Data Infosec Maturity scale? Wanna read more on building your own Predictive Analytics Engine blog series?