BigSnarf blog

Infosec FTW

Category Archives: Framework

d3.js mixedtape tutorials – creators gotta create

Bulk processing memory, network traces and HDD using fuzzy hashing and sdhash

Cloudera Impala for Real Time Queries in Hadoop

Machine Learning – LinkedIn profile matcher based on Skills tags

Screen Shot 2013-01-03 at 10.45.58 AM

Linkedin Profiles 4,2, and 1 matched to ‘jQuery’ etc. tags.

Linkedin Profiles 5 and 4 matched to ‘Data Analysis’ etc. tags

https://github.com/bigsnarfdude/machineLearning/tree/master/linkedin

Here is definitely something that will be part of the bigsnarf technology stack

 

iPython Notebook pandas data analysis of web logs and auth logs

Get code here:

https://github.com/dgleebits/PythonSystemAdminTools/blob/master/pandasAuthLogAnalysis.ipynb

Get sample attack data set here:

http://honeynet.org/files/sanitized_log.zip

Thanks to Vincent for testing the code and helping out with the screenshots.

Influences

http://pixlcloud.com/

Using pandas to report on apache web logs

So I got this new book:

Step 1 – Start with this Forensic Challenge dataset:

http://honeynet.org/files/sanitized_log.zip

Step 2 – Build program without pandas:

#! /usr/bin/python
”’
This program takes in a apache www-media.log and provides basic report
”’
for collections import Counters
ipAddressList = []
methodList = []
requestedList = []
referalList = []
mylist = []
data = open(‘www-media.log’).readlines()
for line in data:
     ipAddressList.append(line.split()[0])
     requestedList.append(line.split()[6])
    methodList.append(line.split()[5])
    referalList.append(line.split()[10])
count_ip = Counter(ipAddressList)
count_requested = Counter(requestedList)
count_method = Counter(methodList)
count_referal = Counter(referalList)
count_ip.most_common()
count_requested.most_common()
count_method.most_common()
count_referal.most_common()

Step 3 – Build program with pandas … code is very simple and easy once you figure out how the DataFrame works

import pandas
data = open(‘www-media.log’).readlines()
frame = pandas.DataFrame([x.split() for x in data])
countIP = frame[0].value_counts()
countRequested = frame[6].value_counts()
countReferal = frame[10].value_counts()
print countIP
print countRequested
print countReferal

Step 4 – Enjoy Responsibly

Step 5 – Get code here

 https://github.com/dgleebits/PythonSystemAdminTools/blob/master/weblogAnalysis.py

Using association rules to build patterns of attacks and normal traffic

In data mining, association rule learning (ARL) is a well researched method for discovering interesting patterns between variables in large datasets. Typical usages of ARL is analyzing transaction data recorded from POS systems. Look at the pattern  {onions -> potatoes => burgers}.  These items or patterns, were found in the sales data of a supermarket. Analysis indicates,  that when customers buy onions and potatoes together, then that customer is likely to also buy burgers. Such information can be used for product placement.  Stores can also put onions on sale hoping to get more sales in potatoes and burgers.

So if we can notice patterns in sales data, why don’t we do it for network attacks and normal traffic? Maybe this is where we start by logging and measuring everything for security metrics and recording trends? Does someone have a compendium or list with associations rules already made up for Infosec?

Creating ARL for noisy and heavy traffic DOS and DDOS attacks is trivial. It becomes very difficult when there are many packets, many systems, many devices, many applications, and many people driving those actions.  Finding an attacker on your network becomes overwhelming.  I have more questions than answers at this point.

Introducing Apache Mahout – Association Rule Algorithm:

Machine learning is used from game playing to fraud detection to stock-market analysis. It’s used to build systems like those at Netflix and Amazon that recommend products to us. It can also be used to categorize Web pages , SPAM email messages, and detect attacks by an IDS. Association rules have been patched for Mahout as an Apriori algorithm leveraging Hadoop and MapReduce. This Apriori algorithm is a very popular algorithm to mine association rules. The Mahout ARL algorithm was born out of research in this paper: http://infolab.stanford.edu/~echang/recsys08-69.pdf

Another possible solution for an open source reliable association rules algorithm:

RapidMiner, formerly YALE (Yet Another Learning Environment), is an environment for machine learning, data mining, text mining, predictive analytics, and business analytics. It is used for research, education, training, rapid prototyping, application development, and industrial applications. In a poll by KDnuggets, a data-mining newspaper, RapidMiner ranked second in data mining/analytic tools used for real projects in 2009 and was first in 2010.  It is distributed under the AGPL open source license and has been hosted by SourceForge since 2004. Source: http://en.wikipedia.org/wiki/RapidMiner

Examples of data sets:

Solutions for BigData ingest of RAM and HDD Files: Analyzing resident memory and processing hard drive files with Hadoop

To process HDD, the solution I found was: The Sleuth Kit Hadoop Framework is a framework that incorporates TSK into cloud computing for large scale data analysis. https://github.com/sleuthkit/hadoop_framework

To image and process RAM files https://sites.google.com/site/grrresponserig/documentation/

I also found these papers:

http://dfrws.org/2009/proceedings/p34-ayers.pdf

http://cs.uno.edu/~golden/Stuff/mmr-ifip-09.pdf

This distributed forensics thing is going to change Digital Forensics and Incident Response – DFIR

Distributed forensics and incident response in the enterprise

Abstract

Remote live forensics has recently been increasingly used in order to facilitate rapid remote access to enterprise machines. We present the GRR Rapid Response Framework (GRR), a new multi-platform, open source tool for enterprise forensic investigations enabling remote raw disk and memory access. GRR is designed to be scalable, opening the door for continuous enterprise wide forensic analysis. This paper describes the architecture used by GRR and illustrates how it is used routinely to expedite enterprise forensic investigations.

***********************************************************

Installing GRR

To install GRR you’ll need to set up a server, which runs the front-end HTTP server, enroller, workers and administration UI.

For this proof-of-concept they are installed on a single server, but a more scalable approach would be to run them on individual servers.

Installing the GRR server

To install the GRR server see ServerInstall

Installing the GRR clients

The GRR clients are best deployed as stand alone pre-packaged binaries. These are dependent on the Operating System of the client system

To create a GRR Windows client binary see BuildingWindowsClient

To create a GRR MacOs-X client binary see BuildingOSXClient

The Linux client currently is not provided as a binary, but instructions on how to run a test/development version are included in the server installation documentation.

Follow

Get every new post delivered to your Inbox.