BigSnarf blog

Infosec FTW

Category Archives: Framework

Cloudera Impala for Real Time Queries in Hadoop

Machine Learning – LinkedIn profile matcher based on Skills tags

LinkedIn profiles 4, 2, and 1 matched to ‘jQuery’ and related tags.

LinkedIn profiles 5 and 4 matched to ‘Data Analysis’ and related tags.

Here is definitely something that will be part of the bigsnarf technology stack


iPython Notebook pandas data analysis of web logs and auth logs

Get code here:

Get sample attack data set here:

Thanks to Vincent for testing the code and helping out with the screenshots.


Using pandas to report on apache web logs

So I got this new book:

Step 1 – Start with this Forensic Challenge dataset:

Step 2 – Build program without pandas:

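#!/usr/bin/python
"""This program takes in an Apache www-media.log and provides a basic report."""
from collections import Counter

ipAddressList = []
methodList = []
requestedList = []
referalList = []

data = open('www-media.log').readlines()
for line in data:
    fields = line.split()
    if len(fields) < 11:
        continue  # skip lines too short to hold the fields used below
    # Field positions 0, 6 and 10 match the pandas version in Step 3.
    # Position 5 for the HTTP method is an assumption based on the
    # Apache combined log format.
    ipAddressList.append(fields[0])
    methodList.append(fields[5].lstrip('"'))
    requestedList.append(fields[6])
    referalList.append(fields[10])

count_ip = Counter(ipAddressList)
count_requested = Counter(requestedList)
count_method = Counter(methodList)
count_referal = Counter(referalList)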
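
Splitting on whitespace is fine for the first few fields, but quoted fields such as the user agent contain spaces. Here is a minimal sketch of a regex-based alternative, assuming the log is in Apache combined format. The pattern, the function names, and the sample lines are made up for illustration and are not part of the original program.

```python
import re
from collections import Counter

# Assumed layout: Apache "combined" log format (illustrative pattern only).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def top_values(lines, field, n=3):
    """Count one field across all parseable lines; return the n most common."""
    parsed = (parse_line(line) for line in lines)
    counts = Counter(rec[field] for rec in parsed if rec is not None)
    return counts.most_common(n)


# Made-up sample lines, for illustration only.
SAMPLE = [
    '10.0.0.1 - - [19/Apr/2010:10:00:00 -0700] "GET /index.html HTTP/1.1" '
    '200 512 "-" "Mozilla/5.0 (X11; Linux)"',
    '10.0.0.1 - - [19/Apr/2010:10:00:05 -0700] "GET /login.php HTTP/1.1" '
    '200 734 "http://example.com/" "Mozilla/5.0 (X11; Linux)"',
    '10.0.0.2 - - [19/Apr/2010:10:00:09 -0700] "POST /login.php HTTP/1.1" '
    '302 0 "http://example.com/login.php" "curl/7.19"',
]

if __name__ == "__main__":
    print(top_values(SAMPLE, "ip"))
    print(top_values(SAMPLE, "path"))
```

Lines that do not match the pattern are skipped rather than raising an error, which is usually what you want with messy attack data.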

Step 3 – Build program with pandas … code is very simple and easy once you figure out how the DataFrame works

import pandas

data = open('www-media.log').readlines()
# One row per log line, one column per whitespace-separated field
frame = pandas.DataFrame([x.split() for x in data])
countIP = frame[0].value_counts()          # client IP
countRequested = frame[6].value_counts()   # requested path
countReferal = frame[10].value_counts()    # referer
print(countIP)
print(countRequested)
print(countReferal)

Step 4 – Enjoy Responsibly

Step 5 – Get code here

Using association rules to build patterns of attacks and normal traffic

In data mining, association rule learning (ARL) is a well-researched method for discovering interesting patterns between variables in large datasets. A typical use of ARL is analyzing transaction data recorded by point-of-sale (POS) systems. Look at the rule {onions, potatoes} => {burgers}. This pattern was found in the sales data of a supermarket. It indicates that when customers buy onions and potatoes together, they are likely to also buy burgers. Such information can be used for product placement. Stores can also put onions on sale, hoping to sell more potatoes and burgers.
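
To make the onions, potatoes, and burgers pattern concrete, here is a minimal sketch of the three numbers usually reported for an association rule: support, confidence, and lift. The baskets are made up for illustration and the function names are my own, not from any particular library.

```python
# Made-up baskets, for illustration only; not real supermarket data.
BASKETS = [
    {"onions", "potatoes", "burgers"},
    {"onions", "potatoes", "burgers", "milk"},
    {"onions", "potatoes"},
    {"milk", "bread"},
    {"burgers", "bread"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / float(len(baskets))


def confidence(antecedent, consequent, baskets):
    """Estimate P(consequent | antecedent) from the baskets.

    Assumes the antecedent occurs at least once, otherwise this divides by zero.
    """
    both = support(set(antecedent) | set(consequent), baskets)
    return both / support(antecedent, baskets)


def lift(antecedent, consequent, baskets):
    """Confidence divided by the baseline support of the consequent."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)


if __name__ == "__main__":
    rule = ({"onions", "potatoes"}, {"burgers"})
    print("support    %.3f" % support(rule[0] | rule[1], BASKETS))
    print("confidence %.3f" % confidence(rule[0], rule[1], BASKETS))
    print("lift       %.3f" % lift(rule[0], rule[1], BASKETS))
```

A lift above 1 suggests the items on the left make the item on the right more likely than it is on its own; a lift near 1 suggests the rule is not telling you much.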

So if we can notice patterns in sales data, why don’t we do it for network attacks and normal traffic? Maybe this is where we start by logging and measuring everything for security metrics and recording trends? Does someone have a compendium or list of association rules already made up for Infosec?
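
One way to picture the same idea on security data is to treat each source IP as a "basket" and each observed behaviour as an "item". The sketch below groups events that way and counts which pairs of behaviours show up together. The event names, IP addresses, and function names are all hypothetical, and this is simple pair counting rather than a full association rule miner.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical, already-parsed events: (source IP, observed behaviour).
EVENTS = [
    ("10.0.0.5", "ssh_failed_login"),
    ("10.0.0.5", "port_scan"),
    ("10.0.0.5", "http_404"),
    ("10.0.0.7", "ssh_failed_login"),
    ("10.0.0.7", "port_scan"),
    ("10.0.0.9", "http_200"),
    ("10.0.0.9", "http_404"),
]


def build_transactions(events):
    """Group observed behaviours by source IP, one item set per source."""
    grouped = defaultdict(set)
    for source, item in events:
        grouped[source].add(item)
    return dict(grouped)


def frequent_pairs(transactions, min_support):
    """Return item pairs whose support is at least min_support."""
    counts = Counter()
    for items in transactions.values():
        # Sort so that each pair is always counted under the same key.
        counts.update(combinations(sorted(items), 2))
    total = float(len(transactions))
    return {pair: n / total for pair, n in counts.items() if n / total >= min_support}


if __name__ == "__main__":
    transactions = build_transactions(EVENTS)
    for pair, share in frequent_pairs(transactions, 0.5).items():
        print(pair, round(share, 3))
```

Grouping by source IP is only one choice of basket. Grouping by time window, user, or destination host would surface different patterns.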

Creating association rules for noisy, high-traffic DoS and DDoS attacks is trivial. It becomes very difficult when there are many packets, many systems, many devices, many applications, and many people driving those actions. Finding an attacker on your network becomes overwhelming. I have more questions than answers at this point.

Introducing Apache Mahout – Association Rule Algorithm:

Machine learning is used for everything from game playing to fraud detection to stock-market analysis. It’s used to build systems like those at Netflix and Amazon that recommend products to us. It can also be used to categorize web pages, flag spam email messages, and detect attacks in an IDS. Association rules have been added to Mahout through a patch that implements the Apriori algorithm on Hadoop and MapReduce. Apriori is a very popular algorithm for mining association rules. The Mahout ARL algorithm was born out of research in this paper:

Another possible solution for an open source reliable association rules algorithm:

RapidMiner, formerly YALE (Yet Another Learning Environment), is an environment for machine learning, data mining, text mining, predictive analytics, and business analytics. It is used for research, education, training, rapid prototyping, application development, and industrial applications. In a poll by KDnuggets, a data-mining newspaper, RapidMiner ranked second in data mining/analytic tools used for real projects in 2009 and was first in 2010.  It is distributed under the AGPL open source license and has been hosted by SourceForge since 2004. Source:

Examples of data sets:

Solutions for BigData ingest of RAM and HDD Files: Analyzing resident memory and processing hard drive files with Hadoop

To process HDD files, the solution I found was the Sleuth Kit Hadoop Framework, which incorporates TSK (The Sleuth Kit) into cloud computing for large-scale data analysis.

To image and process RAM files

I also found these papers:

This distributed forensics thing is going to change Digital Forensics and Incident Response – GRR DFIR

Distributed forensics and incident response in the enterprise


Remote live forensics has recently been increasingly used in order to facilitate rapid remote access to enterprise machines. We present the GRR Rapid Response Framework (GRR), a new multi-platform, open source tool for enterprise forensic investigations enabling remote raw disk and memory access. GRR is designed to be scalable, opening the door for continuous enterprise wide forensic analysis. This paper describes the architecture used by GRR and illustrates how it is used routinely to expedite enterprise forensic investigations.


Installing GRR

To install GRR you’ll need to set up a server, which runs the front-end HTTP server, enroller, workers and administration UI.

For this proof-of-concept they are installed on a single server, but a more scalable approach would be to run them on individual servers.

Installing the GRR server

To install the GRR server see ServerInstall

Installing the GRR clients

The GRR clients are best deployed as stand-alone, pre-packaged binaries. These depend on the operating system of the client system.

To create a GRR Windows client binary see BuildingWindowsClient

To create a GRR Mac OS X client binary see BuildingOSXClient

The Linux client currently is not provided as a binary, but instructions on how to run a test/development version are included in the server installation documentation.

Big Data Infosec – Bigsnarf Open Source Solution

To start the conversation off on my Big Data Infosec journey, I created this placemat to consider where Infosec might end up. This is an example of my first experiments with visualizing data with Hadoop and Hive. It follows the same idea as Packetpig. (Link: PDF). I think that Big Data gives Information Security a second chance to “do it better”.

Last year, while building my POC, I found Wayne’s SherpaSurfing solution (Link: Slideshare) and thought that Big Data and Infosec could be better. Ben’s post (Link: Blog) is a sobering discussion of whether we are winning or losing. Scott Crawford has been discussing data-driven security in his (Link: blog series). George followed up with his post (Link: Website), which I interpreted as “I think we’re in a precarious spot”. Here’s the RSA panel’s position on the topic (Link: Website). Andrew suggests that SIEM vendors aren’t providing value by rebranding with Big Data “buzzword” stickers (Link: Website). Packetloop presented their interpretation of Big Data Infosec at Blackhat EU 2012 (Link: PDF). Raffael followed up recently with his post (Link: Website). Moar visual analytics! Ed suggested in this post that Infosec stop using “stoplight reports” and start using different metrics to gain situational awareness (Link: Website).

Level 1 – Data Collection

  • Organizations in this level have data silos spitting out a variety of log and machine data collected from various sources and “Enterprise Security” systems. Most organizations need humans to connect the silo’d data to interpret results. This is not a very good position to be in because there is a severe reliance on humans processing the data.

Level 2 – Big Data Aggregation

  • Organizations in this level have some semblance of a plan and a big data strategy. The organization is focused on integration activities to get the silo’d data into some sort of Data Warehouse/Hadoop/(name your flavour of big data technology here). A POC system is providing some of the insight and reporting that has traditionally required an Infosec analyst to produce.

Level 3 – Basic Tools of Analysis

  • Organizations have managed to stockpile months or years’ worth of data for a “data scientist” and “Infosec analyst” to spend time producing standard charts and reports already produced by other silo’d systems. The difference is that the data mining and pattern matching focus on the complete Infosec dataset. Ad hoc reports are pushed out of this group, but there is still quite a bit of hand-holding to get a report generated. Hundreds of jobs are run nightly to produce the first Big Data Infosec KPIs and metric reports.

Level 4 – Data Enrichment / ETL / Real-time Queries

  • Organizations have managed to get several big data and stats experts on staff. A mature system is in place. Migration plans are being executed for evergreening, along with formal training plans for users and power users of the system. At this level, it’s time to look at another system that takes the best of the Hadoop Gen1 system and creates data specifically for real-time queries, visual analytics, drill-down analysis, exploration, and serious BI digging. Teams are dedicated to getting every ounce of efficiency out of this system. DFIR, SOC, NOC, CISO tower, e-discovery, audit, and compliance all routinely use this system as the “private Google search engine” to answer on-demand questions and ad hoc queries. Routine answers at the touch of a submit button.

Level 5 – Business Intelligence

  • Organizations have made a decision to open the data gates and formally allow users to self-serve limited requests to the system. Real BI technologies provide historical and current views of business operations. Reporting, dynamic reporting, online analytical processing, analytics, data mining, process mining, complex event processing, business performance management, benchmarking, and text mining are typically “published” to internal company websites. Business intelligence aims to support better business decision-making.

Level 6 – Predictive Model – Attack Simulation – War Games

  • Joshua: Shall we play a game? Organizations have created thousands of models and have a solid understanding of the business and its priorities. The organization is planning to use predictive analytics and statistical techniques from modeling, machine learning, data mining, and game theory that analyze current and historical facts to make predictions about future events. These reports require heavy interaction between the BI, visualization, and Infosec teams to produce real, validated results.

Level 7 – What-If Scenarios by CISO

  • Organizations are confident enough to let the CISO have query access to the system without a team being present. This really is the “dream” dashboard that every CISO wants but never gets. S/He can self-serve plausible outcomes to any question they have. The true value of the Information Security organization can easily be shared, like “Tweets from an iPhone.” At this level, the CISO has a cape under his/her suit.

Level 8 – Data Democracy

  • This is where you set your data free. Free as in letting any user query the system. Users, administrators, outside institutions, and the Interwebs are publicly able to query your Infosec Big Data. At this level your Infosec organization reaches a new level of transparency. Why is Infosec always guessing or investigating what’s normal or abnormal? Why is one person trying to understand everyone else’s data and patterns? At this level every user is empowered to participate in their own security discovery. Social Collective Security Intelligence (SCSI?). Infosec could be considered “social” at this level, as opposed to secretive. Users consume and routinely self-serve vanity queries to “pump their own egos” because they have access to “life statistics”.

Where does your organization fall on the Big Data Infosec Maturity scale? Wanna read more? See the blog series on building your own Predictive Analytics Engine.

BigSnarf Sneak Peek – Big Data Infosec Open Source Solution

Below is the brain dump, analog style

Below is a model of the brain dump

Below is an example interface that provides: Overview, Situational Data, interactivity, search and drill down capability