BigSnarf blog
Infosec FTW
Category Archives: Framework
d3.js mixedtape tutorials – creators gotta create
Posted on March 19, 2013
Bulk processing memory, network traces and HDD using fuzzy hashing and sdhash
Posted on March 12, 2013
Cloudera Impala for Real Time Queries in Hadoop
Posted on February 3, 2013
Machine Learning – LinkedIn profile matcher based on Skills tags
Posted on January 3, 2013
LinkedIn profiles 4, 2, and 1 matched to ‘jQuery’ and related tags.
LinkedIn profiles 5 and 4 matched to ‘Data Analysis’ and related tags.
https://github.com/bigsnarfdude/machineLearning/tree/master/linkedin
Here is definitely something that will be part of the bigsnarf technology stack
Posted on October 15, 2012
iPython Notebook pandas data analysis of web logs and auth logs
Posted on May 28, 2012
Get code here:
https://github.com/dgleebits/PythonSystemAdminTools/blob/master/pandasAuthLogAnalysis.ipynb
Get sample attack data set here:
http://honeynet.org/files/sanitized_log.zip
Thanks to Vincent for testing the code and helping out with the screenshots.
Influences
Using pandas to report on apache web logs
Posted on May 28, 2012
So I got this new book:
Step 1 – Start with this Forensic Challenge dataset:
http://honeynet.org/files/sanitized_log.zip
Step 2 – Build program without pandas:
    #!/usr/bin/python
    """This program takes in an Apache www-media.log and provides a basic report."""
    from collections import Counter

    ipAddressList = []
    methodList = []
    requestedList = []
    referalList = []

    # Read the whole log; each line is split on whitespace into fields
    with open('www-media.log') as f:
        data = f.readlines()

    for line in data:
        fields = line.split()
        ipAddressList.append(fields[0])    # client IP address
        requestedList.append(fields[6])    # requested path
        methodList.append(fields[5])       # HTTP method (keeps its leading quote)
        referalList.append(fields[10])     # referrer

    count_ip = Counter(ipAddressList)
    count_requested = Counter(requestedList)
    count_method = Counter(methodList)
    count_referal = Counter(referalList)

    # Print each tally, most frequent value first
    print(count_ip.most_common())
    print(count_requested.most_common())
    print(count_method.most_common())
    print(count_referal.most_common())
Step 3 – Build program with pandas … code is very simple and easy once you figure out how the DataFrame works
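    import pandas

    # Each log line becomes one row; whitespace-separated fields become numbered columns
    with open('www-media.log') as f:
        data = f.readlines()
    frame = pandas.DataFrame([x.split() for x in data])

    countIP = frame[0].value_counts()           # requests per client IP
    countRequested = frame[6].value_counts()    # requests per path
    countReferal = frame[10].value_counts()     # requests per referrer

    print(countIP)
    print(countRequested)
    print(countReferal)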
Step 4 – Enjoy Responsibly
Step 5 – Get code here
https://github.com/dgleebits/PythonSystemAdminTools/blob/master/weblogAnalysis.py
Using association rules to build patterns of attacks and normal traffic
Posted on March 30, 2012
In data mining, association rule learning (ARL) is a well-researched method for discovering interesting relationships between variables in large datasets. A typical use of ARL is analyzing transaction data recorded by point-of-sale (POS) systems. Consider the rule {onions, potatoes} => {burgers}, found in the sales data of a supermarket. It indicates that customers who buy onions and potatoes together are likely to also buy burgers. Such information can be used for product placement. A store can also put onions on sale, hoping to sell more potatoes and burgers.
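How strong a rule like the onions, potatoes, and burgers one is usually gets expressed with two numbers: support (how often the items appear together) and confidence (how often the right-hand side appears when the left-hand side does). Here is a minimal sketch of both. The transactions are made up for illustration and are not real sales data.

```python
# Toy transactions, invented for illustration only.
transactions = [
    {"onions", "potatoes", "burgers"},
    {"onions", "potatoes", "burgers", "milk"},
    {"onions", "potatoes"},
    {"milk", "bread"},
    {"potatoes", "burgers"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

rule_support = support({"onions", "potatoes", "burgers"}, transactions)
rule_confidence = confidence({"onions", "potatoes"}, {"burgers"}, transactions)

print("support:", rule_support)
print("confidence:", rule_confidence)
```

In this toy data, two of the five baskets contain all three items, and two of the three baskets with onions and potatoes also contain burgers.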
So if we can notice patterns in sales data, why don’t we do the same for network attacks and normal traffic? Maybe this is where we start: logging and measuring everything for security metrics and recording trends. Does someone have a compendium or list of association rules already made up for infosec?
Creating association rules for noisy, high-volume DoS and DDoS attacks is trivial. It becomes very difficult when there are many packets, systems, devices, applications, and people driving those actions. Finding an attacker on your network becomes overwhelming. I have more questions than answers at this point.
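One possible first step is to treat every log line as a "basket" of attribute=value items, so that the same tooling used on sales data applies. The sketch below is only an assumption about how that encoding could look. The sample lines are hand-written in Apache combined log format, and the choice of fields (IP, method, status) is arbitrary.

```python
from collections import Counter
from itertools import combinations

# Hypothetical, hand-written log lines (not taken from the honeynet data set).
sample_lines = [
    '10.0.0.1 - - [01/Jan/2012:00:00:01 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '10.0.0.2 - - [01/Jan/2012:00:00:02 +0000] "GET /admin.php HTTP/1.1" 404 128 "-" "scanner"',
    '10.0.0.2 - - [01/Jan/2012:00:00:03 +0000] "POST /admin.php HTTP/1.1" 404 128 "-" "scanner"',
    '10.0.0.2 - - [01/Jan/2012:00:00:04 +0000] "GET /setup.php HTTP/1.1" 404 128 "-" "scanner"',
]

def to_transaction(line):
    """Encode one log line as a set of attribute=value items."""
    fields = line.split()
    return frozenset([
        "ip=" + fields[0],
        "method=" + fields[5].lstrip('"'),
        "status=" + fields[8],
    ])

log_transactions = [to_transaction(line) for line in sample_lines]

# Count how often each pair of items occurs in the same log line.
pair_counts = Counter()
for t in log_transactions:
    for pair in combinations(sorted(t), 2):
        pair_counts[pair] += 1

for pair, count in pair_counts.most_common(3):
    print(pair, count)
```

Here the pair of one source IP and status 404 stands out, which is the kind of co-occurrence a rule miner would surface. Real traffic would need far more careful feature selection than this.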
Introducing Apache Mahout – Association Rule Algorithm:
Machine learning is used for everything from game playing to fraud detection to stock-market analysis. It’s used to build systems like those at Netflix and Amazon that recommend products to us. It can also be used to categorize web pages, flag spam email messages, and detect attacks in an IDS. Association rules are available for Mahout as a patch that implements the Apriori algorithm on top of Hadoop and MapReduce. Apriori is a very popular algorithm for mining association rules. The Mahout ARL algorithm was born out of research in this paper: http://infolab.stanford.edu/~echang/recsys08-69.pdf
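To give a feel for what Apriori does, here is a minimal single-machine sketch in plain Python. It is not Mahout's code and uses no Hadoop or MapReduce. The function name, the `baskets` data, and the `min_support` threshold are all made up for illustration. The idea is level-wise: find frequent single items, then build larger candidates only from itemsets that were already frequent.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: count} for itemsets found in >= min_support transactions."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        candidates = set()
        # Join step: merge pairs of frequent (k-1)-itemsets into k-itemsets.
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune step: every (k-1)-subset must itself be frequent.
            if all(frozenset(sub) in current
                   for sub in combinations(union, k - 1)):
                candidates.add(union)
        # Count step: scan the transactions for each surviving candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent

# Toy baskets, invented for illustration only.
baskets = [
    {"onions", "potatoes", "burgers"},
    {"onions", "potatoes", "burgers"},
    {"onions", "potatoes"},
    {"milk", "bread"},
]

frequent_itemsets = apriori(baskets, min_support=2)
for itemset, count in sorted(frequent_itemsets.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Rules are then derived from the frequent itemsets by checking their confidence. The repeated scans over the transactions are what make a distributed implementation attractive on large logs.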
Another possible open source solution with a reliable association rules algorithm:
RapidMiner, formerly YALE (Yet Another Learning Environment), is an environment for machine learning, data mining, text mining, predictive analytics, and business analytics. It is used for research, education, training, rapid prototyping, application development, and industrial applications. In a poll by KDnuggets, a data-mining newspaper, RapidMiner ranked second in data mining/analytic tools used for real projects in 2009 and was first in 2010. It is distributed under the AGPL open source license and has been hosted by SourceForge since 2004. Source: http://en.wikipedia.org/wiki/RapidMiner
Examples of data sets:
Solutions for BigData ingest of RAM and HDD Files: Analyzing resident memory and processing hard drive files with Hadoop
Posted on March 28, 2012
To process HDD images, the solution I found was the Sleuth Kit Hadoop Framework, which incorporates TSK into cloud computing for large-scale data analysis: https://github.com/sleuthkit/hadoop_framework
To image and process RAM files, see: https://sites.google.com/site/grrresponserig/documentation/
I also found these papers:
This distributed forensics thing is going to change Digital Forensics and Incident Response – DFIR
Posted on March 25, 2012
Distributed forensics and incident response in the enterprise
Abstract
Remote live forensics has recently been increasingly used in order to facilitate rapid remote access to enterprise machines. We present the GRR Rapid Response Framework (GRR), a new multi-platform, open source tool for enterprise forensic investigations enabling remote raw disk and memory access. GRR is designed to be scalable, opening the door for continuous enterprise wide forensic analysis. This paper describes the architecture used by GRR and illustrates how it is used routinely to expedite enterprise forensic investigations.
***********************************************************
Installing GRR
To install GRR you’ll need to set up a server, which runs the front-end HTTP server, enroller, workers and administration UI.
For this proof-of-concept they are installed on a single server, but a more scalable approach would be to run them on individual servers.
Installing the GRR server
To install the GRR server see ServerInstall
Installing the GRR clients
The GRR clients are best deployed as standalone pre-packaged binaries. These depend on the operating system of the client system.
To create a GRR Windows client binary see BuildingWindowsClient
To create a GRR Mac OS X client binary see BuildingOSXClient
The Linux client currently is not provided as a binary, but instructions on how to run a test/development version are included in the server installation documentation.



















