BigSnarf blog
Infosec FTW
Monthly Archives: March 2012
Using association rules to build patterns of attacks and normal traffic
Posted on March 30, 2012
In data mining, association rule learning (ARL) is a well-researched method for discovering interesting relationships between variables in large datasets. A typical use of ARL is analyzing transaction data recorded by point-of-sale (POS) systems. Consider the rule {onions, potatoes} => {burgers}, found in the sales data of a supermarket. It indicates that a customer who buys onions and potatoes together is also likely to buy burgers. Such information can be used for product placement. Stores can also put onions on sale, hoping to sell more potatoes and burgers.
So if we can notice patterns in sales data, why don’t we do it for network attacks and normal traffic? Maybe we start by logging and measuring everything for security metrics and recording trends? Does someone have a compendium or list of association rules already made up for Infosec?
Creating association rules for noisy, heavy-traffic DoS and DDoS attacks is trivial. It becomes very difficult when there are many packets, systems, devices and applications, and many people driving those actions. Finding an attacker on your network becomes overwhelming. I have more questions than answers at this point.
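To make the onions and potatoes rule concrete, here is a minimal Python sketch of the three numbers usually quoted for an association rule: support, confidence and lift. The transactions are made up for illustration and are not real supermarket data.

```python
# Toy shopping baskets; these transactions are invented for illustration only.
TRANSACTIONS = [
    {"onions", "potatoes", "burgers"},
    {"onions", "potatoes", "burgers", "beer"},
    {"onions", "potatoes"},
    {"potatoes", "milk"},
    {"burgers", "beer"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """How often the consequent appears in baskets that contain the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Confidence relative to how common the consequent is overall (above 1 means positive association)."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)


if __name__ == "__main__":
    rule = ({"onions", "potatoes"}, {"burgers"})
    print("support   ", support(rule[0] | rule[1], TRANSACTIONS))
    print("confidence", confidence(*rule, TRANSACTIONS))
    print("lift      ", lift(*rule, TRANSACTIONS))
```

A rule is usually kept only if its support and confidence clear thresholds you choose; lift helps weed out rules whose consequent is simply common everywhere.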
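One way to start on the network side is to treat each flow as a "basket" of attribute=value items and look for item combinations that reliably predict a label. The sketch below is only an illustration under stated assumptions: the flow fields, the values and the attack/normal labels are hypothetical, and a real dataset would need a proper frequent itemset miner rather than brute-force counting of one- and two-item combinations.

```python
from collections import Counter
from itertools import combinations

# Hypothetical, hand-labelled flow records; field names and values are made up.
FLOWS = [
    {"proto": "tcp", "dport": 80, "flags": "S", "label": "attack"},
    {"proto": "tcp", "dport": 80, "flags": "S", "label": "attack"},
    {"proto": "tcp", "dport": 80, "flags": "S", "label": "attack"},
    {"proto": "tcp", "dport": 80, "flags": "PA", "label": "normal"},
    {"proto": "tcp", "dport": 443, "flags": "PA", "label": "normal"},
    {"proto": "udp", "dport": 53, "flags": "", "label": "normal"},
]


def to_items(flow):
    """Turn a flow into a set of 'field=value' items, leaving the label out."""
    return frozenset("%s=%s" % (k, v) for k, v in flow.items() if k != "label")


def label_rules(flows, min_support=0.3, min_confidence=0.9):
    """Find rules {items} => label using 1- and 2-item antecedents."""
    total = len(flows)
    antecedent_counts = Counter()
    joint_counts = Counter()
    for flow in flows:
        items = sorted(to_items(flow))
        for size in (1, 2):
            for combo in combinations(items, size):
                antecedent_counts[combo] += 1
                joint_counts[(combo, flow["label"])] += 1
    rules = []
    for (combo, label), joint in joint_counts.items():
        support = joint / total
        confidence = joint / antecedent_counts[combo]
        if support >= min_support and confidence >= min_confidence:
            rules.append((combo, label, support, confidence))
    # Strongest rules first; ties broken by the antecedent for stable output.
    return sorted(rules, key=lambda r: (-r[3], -r[2], r[0]))


if __name__ == "__main__":
    for combo, label, support, confidence in label_rules(FLOWS):
        print("%s => %s (support %.2f, confidence %.2f)" % (set(combo), label, support, confidence))
```

On this toy data the SYN-only flows produce the attack rules, which matches the point above: a noisy flood is easy to describe, while the hard part is everything that does not stand out this clearly.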
Introducing Apache Mahout – Association Rule Algorithm:
Machine learning is used for everything from game playing to fraud detection to stock-market analysis. It’s used to build systems like those at Netflix and Amazon that recommend products to us. It can also be used to categorize Web pages and spam email messages, and to detect attacks in an IDS. Mahout's association rule support is a parallel FP-Growth frequent pattern miner that runs on Hadoop and MapReduce. FP-Growth serves the same purpose as Apriori, the best-known algorithm for mining association rules, but avoids Apriori's candidate generation step. The Mahout implementation was born out of the research in this paper: http://infolab.stanford.edu/~echang/recsys08-69.pdf
Another possible open source option for a reliable association rules algorithm:
RapidMiner, formerly YALE (Yet Another Learning Environment), is an environment for machine learning, data mining, text mining, predictive analytics, and business analytics. It is used for research, education, training, rapid prototyping, application development, and industrial applications. In a poll by KDnuggets, a data-mining newspaper, RapidMiner ranked second in data mining/analytic tools used for real projects in 2009 and was first in 2010. It is distributed under the AGPL open source license and has been hosted by SourceForge since 2004. Source: http://en.wikipedia.org/wiki/RapidMiner
Examples of data sets:
Solutions for BigData ingest of RAM and HDD Files: Analyzing resident memory and processing hard drive files with Hadoop
Posted on March 28, 2012
To process hard drive images, the solution I found was the Sleuth Kit Hadoop Framework, which incorporates The Sleuth Kit (TSK) into cloud computing for large-scale data analysis: https://github.com/sleuthkit/hadoop_framework
To image and process RAM files, see https://sites.google.com/site/grrresponserig/documentation/
I also found these papers:
Solutions for Big Data ingest of network traffic – Analyzing PCAP traffic with Hadoop
Posted on March 28, 2012
Building a system that can do full-context PCAP for a single machine is trivial, IMHO, compared to creating predictive algorithms for analyzing PCAP traffic. There are log data search solutions like Elasticsearch, Graylog2, ELSA, Splunk and Logstash that can help you archive and dig through the data.
My favorite network traffic big data solution (2012) is PacketPig. I found this nice setup by AlienVault. Security Onion is a great network security and IDS distro. I have also seen one called xtractr, which uses MapReduce for forensics. Several solutions exist, and PCAP files can be fed to the engines for analysis. I think ARGUS has a place here too, but I haven’t tackled it yet.
I started out using a PCAP-to-CSV conversion Perl program, and have written my own sniffer-to-CSV tool in scapy. Super timelines are being done in Python too. Once I get a PCAP file converted to CSV, I load it into HDFS via Hue. I also found this PCAP visualization blog entry by Raffael Marty.
I’ve stored a bunch of CSV network traces and analyzed them using Hive and Pig queries. It was very simple: name the columns and query each column looking for specific entries. It is also very labour intensive. Binary analysis on Hadoop.
I’m working on a MapReduce library that uses machine learning to classify attackers and their network patterns. As of 2013, there are a few commercial vendors like IBM and RSA that have added Hadoop capability to their SIEM product lines. Here is Twitter's logging setup.
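For readers who want to see what a PCAP-to-CSV step involves, here is a minimal sketch using only the Python standard library. It is not the Perl program or the scapy sniffer mentioned above. It assumes a classic libpcap capture file (not pcapng) with Ethernet framing, only extracts IPv4 addresses plus TCP/UDP ports, and the column names are my own choice.

```python
import csv
import io
import socket
import struct

# Column names are an assumption for this sketch, not a standard.
COLUMNS = ["timestamp", "src_ip", "dst_ip", "protocol", "src_port", "dst_port"]


def pcap_to_rows(data):
    """Yield one tuple per IPv4 packet found in the bytes of a classic pcap file."""
    magic = struct.unpack("<I", data[:4])[0]
    if magic == 0xA1B2C3D4:
        endian = "<"   # file was written little-endian
    elif magic == 0xD4C3B2A1:
        endian = ">"   # file was written big-endian
    else:
        raise ValueError("not a classic pcap file")
    offset = 24        # skip the 24-byte global header
    while offset + 16 <= len(data):
        ts_sec, ts_usec, incl_len, _orig_len = struct.unpack(
            endian + "IIII", data[offset:offset + 16])
        offset += 16
        frame = data[offset:offset + incl_len]
        offset += incl_len
        # Need Ethernet (14 bytes) + minimal IPv4 header (20 bytes), EtherType 0x0800.
        if len(frame) < 34 or frame[12:14] != b"\x08\x00":
            continue
        ip = frame[14:]
        header_len = (ip[0] & 0x0F) * 4
        protocol = ip[9]
        src = socket.inet_ntoa(ip[12:16])
        dst = socket.inet_ntoa(ip[16:20])
        src_port = dst_port = ""
        if protocol in (6, 17) and len(ip) >= header_len + 4:   # TCP or UDP
            src_port, dst_port = struct.unpack("!HH", ip[header_len:header_len + 4])
        yield (ts_sec + ts_usec / 1e6, src, dst, protocol, src_port, dst_port)


def rows_to_csv(rows):
    """Render rows as CSV text with a header line."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(rows)
    return out.getvalue()
```

Typical use would be to read a capture in binary mode and write the result next to it, for example `rows_to_csv(pcap_to_rows(open("trace.pcap", "rb").read()))`, where `trace.pcap` is a placeholder filename. The resulting CSV is the kind of file that can then be loaded into HDFS.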
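The "name the columns and query each column" workflow can be tried on a small scale before involving Hive or Pig. This is a hedged Python sketch of the idea, roughly equivalent to a GROUP BY with an optional WHERE clause; the column names and sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def count_by_column(csv_text, column, where=None):
    """Count rows per value of `column`, keeping only rows that match `where`.

    `where` is an optional dict of column name to required string value.
    """
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if where is None or all(row[key] == value for key, value in where.items()):
            counts[row[column]] += 1
    return counts


# Hypothetical trace; in practice this text would come from a converted PCAP file.
SAMPLE = """src_ip,dst_ip,protocol,dst_port
10.0.0.5,10.0.0.9,6,22
10.0.0.5,10.0.0.9,6,22
10.0.0.7,10.0.0.9,6,80
10.0.0.5,10.0.0.9,17,53
"""

if __name__ == "__main__":
    # Which sources are talking to port 22 over TCP, and how often?
    print(count_by_column(SAMPLE, "src_ip", where={"protocol": "6", "dst_port": "22"}))
```

Every question needs its own query like this one, which is where the labour comes from.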
There are a few examples of PCAP ingestion with open source tools like Hadoop:
The second presentation I found was Wayne Wheeler's SherpaSurfing and https://github.com/sherpasurfing/SHERPASURFING:
The third I found was https://github.com/RIPE-NCC/hadoop-pcap:
The fourth project I found was presented at BlackHatEU 2012 by PacketLoop and https://github.com/packetloop/packetpig:
Using WEKA classification and clustering algorithms
Posted on March 28, 2012
In this post we will look at another sample dataset on predicting the pricing of a car, using a classification tree or decision tree algorithm. First we load the data:
In this example we are using a clustering algorithm on the data. We load the data:
WEKA created 4 clusters:
Read more: http://www.ibm.com/developerworks/opensource/library/os-weka2/index.html?
Using WEKA for data mining and predictive learning
Posted on March 27, 2012
What is data mining? I didn’t know, and it sounded like secret magic stuff. After digging, pardon the pun, I found out it’s a lot of advanced math beyond simple addition and subtraction. The math is built around algorithms that do different things. I found that there is a deep science to working in the data mining and predictive learning field. It is different from Artificial Intelligence and Statistics. In the past few months I have learned from online resources like the STATS202 course. I learned about machine learning from Andrew Ng's class. I learned about AI from Introduction to Artificial Intelligence. I learned about munging data with pandas in Python. I learned how to rip Twitter and weblogs with R. I learned about using Pentaho PDI and Hadoop.
I’m now learning about WEKA and data mining, and about munging data into the ARFF format for predictive analysis. The secret magic stuff is gone, and I realize that getting the data into the tool of your choice is the hard part. Getting accurate results from your tools is hard too. Finding case studies and examples for Infosec is also hard to do. The goal of data mining is to create a model that can help you interpret your data. Add visual analytics and you have the recipe to process your data and turn insight into action.
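Since getting data into ARFF is the hard part, here is a minimal sketch of a converter from rows of strings to ARFF text. It makes simplifying assumptions: a column is declared NUMERIC only if every value parses as a number, anything else becomes a nominal attribute, and values are assumed to contain no spaces, commas or quotes (real data would need quoting and handling of missing values). The function name and sample values are my own.

```python
def is_number(value):
    """True if the string parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def to_arff(relation, columns, rows):
    """Build ARFF text from a relation name, column names and rows of strings."""
    lines = ["@RELATION %s" % relation, ""]
    for index, name in enumerate(columns):
        values = [row[index] for row in rows]
        if all(is_number(v) for v in values):
            lines.append("@ATTRIBUTE %s NUMERIC" % name)
        else:
            # Nominal attribute: list the distinct values seen in the data.
            lines.append("@ATTRIBUTE %s {%s}" % (name, ",".join(sorted(set(values)))))
    lines += ["", "@DATA"]
    lines += [",".join(row) for row in rows]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up rows, for illustration only.
    print(to_arff("traffic",
                  ["bytes", "dst_port", "label"],
                  [["1500", "80", "normal"], ["40", "22", "attack"]]))
```

The output can be saved with an .arff extension and opened in the WEKA Explorer.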
Using ARFF sample data set for simple linear regression
We have created a model for predicting housing prices based on the sample dataset:
Linear Regression Model
sellingPrice =
-26.6882 * houseSize +
7.0551 * lotSize +
43166.0767 * bedrooms +
42292.0901 * bathroom +
-21661.1208
Basically, by plugging in the 4 variables (houseSize, lotSize, bedrooms and bathroom), we can calculate sellingPrice.
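The model above translates directly into a small function. The coefficients are copied from the WEKA output; the function name and the example inputs in the usage line are illustrative values of my own, not rows quoted from the dataset in this post.

```python
def predict_selling_price(house_size, lot_size, bedrooms, bathroom):
    """Apply the linear regression model reported by WEKA above."""
    return (-26.6882 * house_size
            + 7.0551 * lot_size
            + 43166.0767 * bedrooms
            + 42292.0901 * bathroom
            - 21661.1208)


if __name__ == "__main__":
    # Illustrative inputs: 3198 sq ft house, 9669 sq ft lot, 5 bedrooms, 1 bathroom.
    print(round(predict_selling_price(3198, 9669, 5, 1), 2))
```

Note the negative coefficient on houseSize: taken literally, the model says a bigger house lowers the price, which is a reminder to sanity-check a fitted model before trusting it.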
Read more: http://www.ibm.com/developerworks/opensource/library/os-weka1/index.html
Installing Pandas, iPython Notebook on Backtrack5
Posted on March 25, 2012
Required packages
- curl
- pip
- Python 2.7.2
- Distribute 0.6.25
- NumPy 1.6.1
- SciPy 0.9
- matplotlib 1.1.0
- IPython 0.12
- pyzmq
- tornado
- pygments
- python-dateutil 1.5
- pandas 0.7.0
Verifying your IPython Notebook setup is working:
ipython notebook --pylab=inline
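Before launching the notebook, it can help to confirm that the packages actually import. This is a small sketch of my own, not part of the original install notes; the import names (for example `zmq` for pyzmq and `dateutil` for python-dateutil) are the standard ones for those packages.

```python
import importlib

# Import names for the required packages listed above.
REQUIRED = ["numpy", "scipy", "matplotlib", "IPython", "zmq",
            "tornado", "pygments", "dateutil", "pandas"]


def installed_versions(module_names):
    """Map each module name to its version string, or None if it cannot be imported."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None
        else:
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


if __name__ == "__main__":
    for name, version in sorted(installed_versions(REQUIRED).items()):
        print("%-12s %s" % (name, version if version else "MISSING"))
```

Anything reported as MISSING needs to be installed before the notebook will start with inline plotting.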
Read more: http://pandas.sourceforge.net/pandas.pdf
Example of OSINT analysis platform with link analysis, NLP and NER
Posted on March 25, 2012
Law Enforcement databases
- Nexus peering
- FBI ACS
- IBIS
- CIC
- NCIC
- AFIS
- NDex
- NYPD IDS
- CPIC
- PIRS
- PRIME
- 911 databases
Government Sources
- National offender registries
- State and Provincial Driver’s Licences databases
- Birth registry databases
- Marriage registry
- Immigration and work permits
- 311 databases
Social Media sources
- Facebook – 750 million users – 40 billion photos
- Google+ – 20 million users
- Twitter – 100 million users – 2 million tweets per day
- Myspace – 113 million users
- Bebo – 12.6 million users
- Linkedin – 70 million users
- Friendster – 115 million users
- YouTube – Comments
- Forums and Boards
- Blogs and personal websites
- Google search results
- Twitter archives
- Flickr
- Picasa
News agency sources
- LexisNexis
- Bloomberg
- Reuters
Traditional news sources
- NYT
- BBC
- Time Magazine
- Globe and Mail
- Washington Post
- Aljazeera
Online Dictionaries and Wikis
- Wikipedia
- Dictionary
OSINT includes a wide variety of information and sources
- Media: newspapers, magazines, radio, television, and computer-based information.
- Web-based communities and user-generated content: social-networking sites, video sharing sites, wikis, blogs, and folksonomies.
- Public data: government reports, official data such as budgets, demographics, hearings, legislative debates, press conferences, speeches, marine and aeronautical safety warnings, environmental impact statements and contract awards.
- Observation and reporting: amateur airplane spotters, radio monitors and satellite observers among many others have provided significant information not otherwise available. The availability of worldwide satellite photography, often of high resolution, on the Web (e.g., Google Earth) has expanded open-source capabilities into areas formerly available only to major intelligence services.
- Professional and academic: conferences, symposia, professional associations, academic papers, and subject matter experts.
- Most information has geospatial dimensions, but many often overlook the geospatial side of OSINT: not all open-source data is unstructured text. Examples of geospatial open source include hard and softcopy maps, atlases, gazetteers, port plans, gravity data, aeronautical data, navigation data, geodetic data, human terrain data (cultural and economic), environmental data, commercial imagery, LIDAR, hyper and multi-spectral data, airborne imagery, geo-names, geo-features, urban terrain, vertical obstruction data, boundary marker data, geospatial mashups, spatial databases, and web services. Most of the geospatial data mentioned above is integrated, analyzed, and syndicated using geospatial software like a Geographic Information System (GIS) not a browser per se.
Ingest Listing for OSINT
This distributed forensics thing is going to change Digital Forensics and Incident Response – DFIR
Posted on March 25, 2012
Distributed forensics and incident response in the enterprise
Abstract
Remote live forensics has recently been increasingly used in order to facilitate rapid remote access to enterprise machines. We present the GRR Rapid Response Framework (GRR), a new multi-platform, open source tool for enterprise forensic investigations enabling remote raw disk and memory access. GRR is designed to be scalable, opening the door for continuous enterprise wide forensic analysis. This paper describes the architecture used by GRR and illustrates how it is used routinely to expedite enterprise forensic investigations.
***********************************************************
Installing GRR
To install GRR you’ll need to set up a server, which runs the front-end HTTP server, enroller, workers and administration UI.
For this proof-of-concept they are installed on a single server, but a more scalable approach would be to run them on individual servers.
Installing the GRR server
To install the GRR server see ServerInstall
Installing the GRR clients
The GRR clients are best deployed as standalone pre-packaged binaries. These depend on the operating system of the client system.
To create a GRR Windows client binary see BuildingWindowsClient
To create a GRR Mac OS X client binary see BuildingOSXClient
The Linux client currently is not provided as a binary, but instructions on how to run a test/development version are included in the server installation documentation.