Do you know which machines are compromised on your network? Are you missing data? How confident are you? The amount of data in Information Security is growing so fast that traditional database systems will have difficulty scaling. As of March 2012, there were only a few open source big data solutions and recipes published for the security industry. As of February 2013, there were a couple of commercial options for big data analytics geared towards the Information Security industry. As of October 2014, you can build your own system easily with PCAP -> Logs -> Kafka -> Spark SQL.
This blog is a collection of ideas, tools, and frameworks that leverage the goodness of open source solutions. The first Bigsnarf framework was built on Hadoop MapReduce/Hive. The second iteration of the Bigsnarf framework leverages the Lambda architecture and the Apache Spark platform, Amazon Kinesis, Amazon S3, IPython Notebook, and Tableau.
You will also find posts on stream processing, in-memory processing, parallel processing with clusters, and predictive analytics. I also look at instrumentation of applications and log processing.
With the information in this blog, organizations can analyze mountains of data. The ability to analyze large InfoSec datasets will become a key basis of advantage for Information Security groups. Enriching data sources with geolocation, blacklist, and whitelist data helps provide situational awareness. Bigsnarf is open source Security Investigation Analytics.
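As a rough illustration of the enrichment idea above, here is a minimal Python sketch that tags a log event with blacklist, whitelist, and geolocation fields. Everything in it is an assumption for illustration: the `enrich` function, the event field names, and the lookup tables are hypothetical, the addresses come from documentation-only ranges, and a real system would load these tables from threat feeds and a GeoIP database rather than hard-code them.

```python
import ipaddress

# Hypothetical reference data (assumptions for illustration only).
BLACKLIST = {"203.0.113.7"}                            # known-bad source IPs
WHITELIST_NETS = [ipaddress.ip_network("10.0.0.0/8")]  # trusted internal ranges
GEO_TABLE = [                                          # network -> country code
    (ipaddress.ip_network("203.0.113.0/24"), "ZZ"),    # "ZZ" is a placeholder code
]


def enrich(event):
    """Return a copy of the event with enrichment fields added."""
    ip = ipaddress.ip_address(event["src_ip"])
    enriched = dict(event)  # leave the original event untouched
    enriched["blacklisted"] = event["src_ip"] in BLACKLIST
    enriched["whitelisted"] = any(ip in net for net in WHITELIST_NETS)
    # First matching network wins; None when the address is not in the table.
    enriched["country"] = next(
        (code for net, code in GEO_TABLE if ip in net), None
    )
    return enriched


if __name__ == "__main__":
    print(enrich({"src_ip": "203.0.113.7", "dst_port": 22}))
    print(enrich({"src_ip": "10.1.2.3", "dst_port": 443}))
```

Each added field is weak evidence on its own; the point of the sketch is that the enriched record carries all of them side by side, so a later query or model can combine them.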
Closing the gap between compromise and resolution underpins new methods of innovation and investigation for DFIR and Infosec. Every organization will have to grapple with the implications of big data. BigSnarf can:
- help organizations understand steps required to set up a security data analytics team
- help organizations manage the increase in the volume and detail of data captured by enterprises
- index and store full context network PCAPs
- index and store network log data
- index and store individual log data
- index and store RAM images of resident memory snapshots on individual machines
- index and store MD5 hashes of all files on HDD of individual machines
- index and store snapshots of all running processes of individual machines
- build analytical models of trusted user traffic and behaviour
- build analytical models of untrusted user and machine traffic and behaviour
- use machine learning to detect and identify anomalous behaviour of untrusted traffic
- use machine learning to cluster malware, identify who handled stolen data, and identify the graph of connected systems
- use machine learning, fuzzy hashing, and massive datastore for “finding needles in haystacks”
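To make one item from the list above concrete, the following is a minimal sketch of indexing the MD5 hashes of all files under a directory. It is not the Bigsnarf implementation; the function names `md5_of_file` and `index_tree` are hypothetical, and the in-memory dictionary stands in for whatever datastore a real deployment would use.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=65536):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def index_tree(root):
    """Map each MD5 hex digest to the list of file paths that share it."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                index.setdefault(md5_of_file(path), []).append(path)
            except OSError:
                # Unreadable files (permissions, races) are skipped in this sketch.
                continue
    return index


if __name__ == "__main__":
    for file_hash, paths in index_tree(".").items():
        print(file_hash, len(paths))
```

Keying the index by hash means that identical files on different machines collapse to one entry, which is what makes lookups against a list of known-bad hashes cheap.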
Essentially: play, record, pause, forward, rewind and review full context history of everything going to, from, and running on individual machines in the network, and outside the network. Collection of data. Indexing, storage and processing of data. Real-time search of data. Data analytics. Predictive analytics.
The biggest advantage of all these systems will be DATA ENRICHMENT: feeding and combining data to turn a weak signal into actionable insights.
Top Read Posts:
Update 2013 http://www.networkworld.com/community/blog/big-data-security-challenges
Update 2014 Great Security Data Science Blogs and IPython Notebooks