BigSnarf blog

Infosec FTW

Monthly Archives: March 2012

Using association rules to build patterns of attacks and normal traffic

In data mining, association rule learning (ARL) is a well-researched method for discovering interesting patterns between variables in large datasets. A typical use of ARL is analyzing transaction data recorded by POS systems. Look at the pattern {onions, potatoes} → {burgers}. These patterns were found in the sales data of a supermarket. The analysis indicates that when customers buy onions and potatoes together, they are likely to also buy burgers. Such information can be used for product placement; stores can also put onions on sale hoping to lift sales of potatoes and burgers.
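The two numbers behind a rule like this are support (how often the items appear together) and confidence (how often the rule holds when it applies). A toy sketch with made-up baskets, not real sales data:

```python
# Hypothetical transactions; each set is one customer's basket.
transactions = [
    {"onions", "potatoes", "burgers"},
    {"onions", "potatoes", "burgers", "beer"},
    {"onions", "potatoes"},
    {"milk", "bread"},
    {"onions", "burgers"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """How often the consequent appears when the antecedent does."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

ante, cons = {"onions", "potatoes"}, {"burgers"}
print("support:", support(ante | cons, transactions))      # 2 of 5 baskets
print("confidence:", confidence(ante, cons, transactions)) # 2 of the 3 onion+potato baskets
```

The same counting works on network flow records instead of grocery items, which is the whole point of the analogy.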

So if we can spot patterns in sales data, why don't we do it for network attacks and normal traffic? Maybe this is where we start: logging and measuring everything for security metrics and recording trends. Does someone have a compendium or list of association rules already made up for Infosec?

Creating ARL for noisy, heavy-traffic DoS and DDoS attacks is trivial. It becomes very difficult when there are many packets, many systems, many devices, many applications, and many people driving those actions. Finding an attacker on your network becomes overwhelming. I have more questions than answers at this point.

Introducing Apache Mahout – Association Rule Algorithm:

Machine learning is used everywhere from game playing to fraud detection to stock-market analysis. It's used to build systems like those at Netflix and Amazon that recommend products to us. It can also be used to categorize web pages, flag spam email messages, and detect attacks in an IDS. Association rules have been contributed as a patch to Mahout as an Apriori algorithm leveraging Hadoop and MapReduce. Apriori is a very popular algorithm for mining association rules. The Mahout ARL implementation was born out of the research in this paper:
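The Apriori idea itself is simple: an itemset can only be frequent if all of its subsets are frequent, so you build candidates level by level and prune aggressively. A minimal single-machine sketch of that loop (nothing to do with the Mahout/Hadoop implementation):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return frequent itemsets as {frozenset: count}, built level by level."""
    n_required = min_support * len(transactions)
    # Level 1: frequent single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= n_required}
    result = dict(frequent)
    k = 2
    while frequent:
        # Candidates: unions of frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune: every (k-1)-subset must itself be frequent (the Apriori property).
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= n_required}
        result.update(frequent)
        k += 1
    return result

transactions = [{"onions", "potatoes", "burgers"},
                {"onions", "potatoes"},
                {"onions", "burgers"},
                {"potatoes", "burgers"}]
freq = apriori(transactions, min_support=0.5)
```

What Mahout adds is the MapReduce decomposition so the counting passes scale past one machine's memory.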

Another possible solution for an open source reliable association rules algorithm:

RapidMiner, formerly YALE (Yet Another Learning Environment), is an environment for machine learning, data mining, text mining, predictive analytics, and business analytics. It is used for research, education, training, rapid prototyping, application development, and industrial applications. In a poll by KDnuggets, a data-mining newsletter, RapidMiner ranked second among data mining/analytics tools used for real projects in 2009 and first in 2010. It is distributed under the AGPL open-source license and has been hosted on SourceForge since 2004. Source:

Examples of data sets:

Solutions for Big Data ingest of RAM and HDD files: analyzing resident memory and processing hard-drive files with Hadoop

To process HDD images, the solution I found was the Sleuth Kit Hadoop Framework, which incorporates TSK into cloud computing for large-scale data analysis.

To image and process RAM files

I also found these papers:

Solutions for Big Data ingest of network traffic – Analyzing PCAP traffic with Hadoop

Building a system that can do full-context PCAP capture for a single machine is trivial, IMHO, compared to creating predictive algorithms for analyzing PCAP traffic. There are log-search solutions like Elasticsearch, Graylog2, ELSA, Splunk, and Logstash that can help you archive and dig through the data.

My favorite network traffic big data solution (2012) is PacketPig. In 2014 I noticed another player named PacketSled. I found this nice setup by AlienVault. Security Onion is a great network security distro that bundles Bro IDS and other tools. I have also seen one called xtractr, which does MapReduce for forensics. Several solutions exist, and PCAP files can be fed to the engines for analysis. I think Argus and Moloch (PCAP + Elasticsearch) have a place here too, but I haven't tackled them yet. There's also a DNS Hadoop presentation from Endgame, clairvoyant-squirrel.

I started with a Perl program that converts PCAP to CSV, and have written my own sniffer-to-CSV tool in Scapy. Super timelines are being done in Python too. Once I get a PCAP file converted to CSV, I load it up to HDFS via Hue. I also found this PCAP visualization blog entry by Raffael Marty.
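To illustrate the PCAP-to-CSV step without Scapy, here is a dependency-free sketch that parses a (synthetic, self-generated) libpcap file with `struct` and writes one CSV row per IPv4 packet. The offsets assume the classic libpcap format with Ethernet framing; a real capture would obviously replace the generated `demo.pcap`:

```python
import csv, socket, struct

def pcap_to_csv(pcap_path, csv_path):
    """Extract (ts, src, dst, proto, length) per IPv4 packet into a CSV."""
    with open(pcap_path, "rb") as f, open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["ts", "src", "dst", "proto", "length"])
        f.read(24)  # skip the 24-byte pcap global header
        while True:
            rec = f.read(16)  # per-packet record header
            if len(rec) < 16:
                break
            ts_sec, ts_usec, incl_len, _ = struct.unpack("<IIII", rec)
            data = f.read(incl_len)
            # Ethernet header is 14 bytes; ethertype 0x0800 means IPv4.
            if len(data) >= 34 and data[12:14] == b"\x08\x00":
                proto = data[23]                       # IP header offset 9
                src = socket.inet_ntoa(data[26:30])    # IP header offset 12
                dst = socket.inet_ntoa(data[30:34])    # IP header offset 16
                writer.writerow([ts_sec + ts_usec / 1e6, src, dst, proto, incl_len])

# Build a one-packet synthetic capture so the sketch is self-contained.
eth = b"\xaa" * 6 + b"\xbb" * 6 + b"\x08\x00"
ip = bytes([0x45, 0, 0, 28, 0, 0, 0, 0, 64, 17, 0, 0]) \
     + socket.inet_aton("10.0.0.1") + socket.inet_aton("10.0.0.2")
payload = eth + ip + b"\x00" * 8
with open("demo.pcap", "wb") as f:
    f.write(struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, 1))
    f.write(struct.pack("<IIII", 1330560000, 0, len(payload), len(payload)))
    f.write(payload)
pcap_to_csv("demo.pcap", "demo.csv")
```

The Scapy version is shorter (rdpcap plus a loop over `pkt[IP]` fields), but the resulting CSV layout is the same either way, which is what matters for the HDFS load.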

I've stored a bunch of CSV network traces and done analysis using Hive and Pig queries. It was conceptually very simple: name the columns and query each column looking for specific entries. But it was very labour intensive. Binary analysis on Hadoop.
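For a sense of what those column queries looked like, here is the same "name the columns, filter for specific entries" pattern expressed with pandas over a few fake flow records (a stand-in for illustration; the real queries were HiveQL/Pig against HDFS):

```python
import io

import pandas as pd

# A few made-up flow records in a simple CSV layout (ts, src, dst, proto, length).
csv_data = io.StringIO(
    "ts,src,dst,proto,length\n"
    "1330560000.0,10.0.0.1,10.0.0.2,17,42\n"
    "1330560001.0,192.168.1.5,10.0.0.2,6,1500\n"
    "1330560002.0,10.0.0.1,8.8.8.8,17,80\n"
)
df = pd.read_csv(csv_data)

# Filter one column at a time for specific entries: UDP traffic from one host.
udp_from_host = df[(df["proto"] == 17) & (df["src"] == "10.0.0.1")]
print(udp_from_host[["ts", "dst", "length"]])
```

The labour-intensive part is exactly this: every question means hand-writing another filter.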

I'm working on a MapReduce library that uses machine learning to classify attackers and their network patterns. As of 2013, a few commercial vendors like IBM and RSA have added Hadoop capability to their SIEM product lines. Here is Twitter's logging setup. In 2014 I loaded all the CSV attack data into a CDH4 cluster with the Impala query engine. I'm also looking at writing pandas DataFrames to Google's BigQuery. As of 2014 there are solutions on Hadoop for malware analysis, forensics, and DNS data mining.

I have recently (2014) been using Spark and PySpark on Hadoop, working against HDFS from the REPL for real-time interactive queries of datasets. I'm going to be integrating all of my IPython Notebooks and learning MLlib, the machine learning library built on Spark. It's going to be an awesome year!

There are a few examples of PCAP ingestion with open source tools like Hadoop:

The first one I found was P3:

The second presentation I found was Wayne Wheeler's SherpaSurfing, and

The third I found was

The fourth project I found was presented at BlackHat EU 2012 by PacketLoop, and


Using WEKA classification and clustering algorithms

In this post we will look at another sample dataset, predicting the price of a car using a classification tree (decision tree) algorithm. First we load the data:

In this example we are using a clustering algorithm on the data.  We load the data:

WEKA created 4 clusters:
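WEKA's SimpleKMeans does the heavy lifting here, but the core loop is small enough to sketch in plain Python. This is a toy version (with two made-up 2-D groups rather than the four clusters above), not WEKA's implementation:

```python
import random

def kmeans(points, k, iters=20, seed=42):
    """Plain k-means: assign points to nearest centroid, recompute, repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Nearest centroid by squared Euclidean distance.
            i = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[i].append(p)
        # New centroid = mean of its cluster (keep the old one if empty).
        centroids = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Two obviously separated groups; with k=2 the loop should recover them.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
          (5.0, 5.1), (5.2, 5.0), (5.1, 5.2)]
centroids, clusters = kmeans(points, k=2)
```

Picking k (why 4 and not 3 or 5?) is the part the algorithm does not answer for you.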

Read more:

Using WEKA for data mining and predictive learning

What is data mining? I didn't know, and it sounded like secret magic stuff. After digging (mind the pun), I found out it's a lot of advanced math beyond simple addition and subtraction. The math is in algorithms that do different things. I found that there is a deep science to working in the data mining and predictive learning field, and that it is different from artificial intelligence and statistics. In the past few months I have learned from online resources like the STATS202 course. I learned about machine learning from Andrew Ng's class. I learned about AI from Introduction to Artificial Intelligence. I learned about munging data with pandas in Python. I learned how to rip Twitter and weblogs with R. I learned about using Pentaho PDI and Hadoop.

I'm now learning about WEKA and data mining, and munging data into the ARFF format for predictive analysis. The secret magic is gone, and I realize that getting the data into the tool of your choice is the hard part. Getting accurate results from your tools is hard stuff. Finding case studies and examples for Infosec is also hard to do. The goal of data mining is to create a model that can help you interpret your data. Add visual analytics and you have a recipe to process your data and gain some insight to act on.
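For reference, an ARFF file is just a header that names the attributes, followed by the data rows. A minimal example in WEKA's format (the data rows here are made up):

```
@relation housing

@attribute houseSize numeric
@attribute lotSize numeric
@attribute bedrooms numeric
@attribute bathroom numeric
@attribute sellingPrice numeric

@data
3198,9669,5,3,325000
2310,6000,4,2,239500
```

Most of the munging work is getting CSV-ish data into exactly this shape, with types declared correctly in the header.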


Using ARFF sample data set for simple linear regression

We have created an algorithm for predicting housing prices based on the sample dataset:

Linear Regression Model

sellingPrice =

-26.6882 * houseSize +
7.0551 * lotSize +
43166.0767 * bedrooms +
42292.0901 * bathroom +

Basically, by plugging in the four variables houseSize, lotSize, bedrooms, and bathroom, we can calculate sellingPrice.
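Turning the model into code is just a weighted sum. Note the intercept term is cut off in the WEKA output above, so this sketch omits it too; the sample feature values below are made up:

```python
# Coefficients taken from the WEKA linear regression output above.
WEIGHTS = {
    "houseSize": -26.6882,
    "lotSize": 7.0551,
    "bedrooms": 43166.0767,
    "bathroom": 42292.0901,
}

def predict_selling_price(**features):
    """Weighted sum of the four model features (intercept omitted, as above)."""
    return sum(WEIGHTS[name] * value for name, value in features.items())

# Example house (hypothetical values).
price = predict_selling_price(houseSize=3198, lotSize=9669, bedrooms=5, bathroom=1)
print(round(price, 2))
```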

Read more:

Installing Pandas, iPython Notebook on Backtrack5

Required packages

  • curl
  • pip
  • Python 2.7.2
  • Distribute 0.6.25
  • NumPy 1.6.1
  • SciPy 0.9
  • matplotlib 1.1.0
  • IPython 0.12
  • pyzmq
  • tornado
  • pygments
  • python-dateutil 1.5
  • pandas 0.7.0

Verifying your IPython Notebook setup is working:
ipython notebook --pylab=inline
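A quick way to confirm the whole stack imports cleanly is to probe each package from Python (this reports MISSING rather than crashing if a package isn't there; the names are the ones from the list above):

```python
import importlib

packages = ["numpy", "scipy", "matplotlib", "IPython", "pandas"]
status = {}
for name in packages:
    try:
        mod = importlib.import_module(name)
        # Most of these expose __version__; fall back to a generic marker.
        status[name] = getattr(mod, "__version__", "installed")
    except ImportError:
        status[name] = "MISSING"

for name, version in status.items():
    print("%s: %s" % (name, version))
```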

Read more:

Example of OSINT analysis platform with link analysis, NLP and NER

Law Enforcement databases

  • Nexus peering
  • IBIS
  • CIC
  • NCIC
  • AFIS
  • NDex
  • CPIC
  • PIRS
  • 911 databases
Government Sources
  • National offender registries
  • State and Provincial Driver’s Licences databases
  • Birth registry databases
  • Marriage registry
  • Immigration and work permits
  • 311 databases

Social Media sources

  • Facebook – 750 million users – 40 billion photos
  • Google+ – 20 million users
  • Twitter – 100 million users – 2 million tweets per day
  • Myspace – 113 million users
  • Bebo – 12.6 million users
  • Linkedin – 70 million users
  • Friendster – 115 million users
  • YouTube – Comments
  • Forums and Boards
  • Blogs and personal websites
  • Google search results
  • Twitter archives
  • Flickr
  • Picasa
News agency sources
  • LexisNexis
  • Bloomberg
  • Reuters
Traditional news sources
  • NYT
  • BBC
  • Time Magazine
  • Globe and Mail
  • Washington Post
  • Aljazeera
Online Dictionaries and Wikis
  • Wikipedia
  • Dictionary

OSINT includes a wide variety of information and sources

  • Media: newspapers, magazines, radio, television, and computer-based information.
  • Web-based communities and user-generated content: social-networking sites, video sharing sites, wikis, blogs, and folksonomies.
  • Public data: government reports, official data such as budgets, demographics, hearings, legislative debates, press conferences, speeches, marine and aeronautical safety warnings, environmental impact statements and contract awards.
  • Observation and reporting: amateur airplane spotters, radio monitors and satellite observers among many others have provided significant information not otherwise available. The availability of worldwide satellite photography, often of high resolution, on the Web (e.g., Google Earth) has expanded open-source capabilities into areas formerly available only to major intelligence services.
  • Professional and academic: conferences, symposia, professional associations, academic papers, and subject matter experts.[1]
  • Most information has geospatial dimensions, but many often overlook the geospatial side of OSINT: not all open-source data is unstructured text. Examples of geospatial open source include hard and softcopy maps, atlases, gazetteers, port plans, gravity data, aeronautical data, navigation data, geodetic data, human terrain data (cultural and economic), environmental data, commercial imagery, LIDAR, hyper- and multi-spectral data, airborne imagery, geo-names, geo-features, urban terrain, vertical obstruction data, boundary marker data, geospatial mashups, spatial databases, and web services. Most of the geospatial data mentioned above is integrated, analyzed, and syndicated using geospatial software like a Geographic Information System (GIS), not a browser per se.
Ingest Listing for OSINT

Maltego Twitter Analysis of @DGleebits

Visualization: Social Network Analysis @dgleebits Twitter with NodeXL

This distributed forensics thing is going to change Digital Forensics and Incident Response – DFIR

Distributed forensics and incident response in the enterprise


Remote live forensics has recently been increasingly used in order to facilitate rapid remote access to enterprise machines. We present the GRR Rapid Response Framework (GRR), a new multi-platform, open source tool for enterprise forensic investigations enabling remote raw disk and memory access. GRR is designed to be scalable, opening the door for continuous enterprise wide forensic analysis. This paper describes the architecture used by GRR and illustrates how it is used routinely to expedite enterprise forensic investigations.


Installing GRR

To install GRR you’ll need to set up a server, which runs the front-end HTTP server, enroller, workers and administration UI.

For this proof-of-concept, all components are installed on a single server, but a more scalable approach would be to run them on individual servers.

Installing the GRR server

To install the GRR server, see ServerInstall.

Installing the GRR clients

The GRR clients are best deployed as stand-alone pre-packaged binaries. These are dependent on the operating system of the client system.

To create a GRR Windows client binary, see BuildingWindowsClient.

To create a GRR Mac OS X client binary, see BuildingOSXClient.

The Linux client currently is not provided as a binary, but instructions on how to run a test/development version are included in the server installation documentation.


