BigSnarf blog

Infosec FTW

Solutions for Big Data ingest of network traffic – Analyzing PCAP traffic with Hadoop

Building a system that can do full-context PCAP for a single machine is trivial, IMHO, compared to creating predictive algorithms for analyzing PCAP traffic.

Interesting read in 2015: http://blogs.cisco.com/security/talos/machine-learning-detectors

There are log data search solutions like Elasticsearch, Graylog2, ELSA, Splunk, Red Lambda, and Logstash that can help you archive and dig through the data.

My favorite big data security analytics solution in 2012 was PacketPig (now Prevail). In 2014 I noticed another player named PacketSled. I found this nice setup by AlienVault. Security Onion, with Bro IDS, is a great network security IDS distro. I have seen one called xtractr, and MR for forensics. Several solutions exist, and PCAP files can be fed to the engines for analysis. I think PacketBeat, ARGUS, NTOP, MozDef and Moloch (PCAP with Elasticsearch) have a place here too, but I haven’t tackled them yet. There’s a DNS Hadoop presentation from Endgame: clairvoyant-squirrel.

I started out using a Perl program for PCAP-to-CSV conversion, and have since written my own sniffer-to-CSV in scapy. Super Timelines are being done in Python too. Once I get a PCAP file converted to CSV, I load it into HDFS via HUE. I also found this PCAP visualization blog entry by Raffael Marty.
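For illustration, here is a minimal sketch of the PCAP-to-CSV step. It is not the Perl program or the scapy sniffer mentioned above: it uses only Python's standard library, and it assumes a classic (non-pcapng) capture file with Ethernet framing and microsecond timestamps. The function names and the CSV column names are made up for this example.

```python
import csv
import socket
import struct


def pcap_to_rows(stream):
    """Yield (timestamp, src_ip, dst_ip, ip_proto, length) per IPv4 packet.

    Assumes a classic pcap file (not pcapng) captured on Ethernet.
    """
    header = stream.read(24)  # pcap global header is 24 bytes
    if len(header) < 24:
        raise ValueError("truncated pcap global header")
    magic = struct.unpack("<I", header[:4])[0]
    if magic == 0xA1B2C3D4:
        endian = "<"  # file written little-endian
    elif magic == 0xD4C3B2A1:
        endian = ">"  # file written big-endian
    else:
        raise ValueError("not a classic microsecond-resolution pcap file")
    linktype = struct.unpack(endian + "I", header[20:24])[0]
    if linktype != 1:
        raise ValueError("only Ethernet captures (linktype 1) are handled")

    while True:
        record = stream.read(16)  # per-packet record header
        if len(record) < 16:
            break
        ts_sec, ts_usec, incl_len, orig_len = struct.unpack(endian + "IIII", record)
        data = stream.read(incl_len)
        if len(data) < incl_len:
            break  # capture was cut off mid-packet
        if len(data) < 34:
            continue  # too short for Ethernet (14) + minimal IPv4 header (20)
        ethertype = struct.unpack("!H", data[12:14])[0]
        if ethertype != 0x0800:
            continue  # skip anything that is not IPv4
        proto = data[23]  # IPv4 protocol field (6 = TCP, 17 = UDP)
        src_ip = socket.inet_ntoa(data[26:30])
        dst_ip = socket.inet_ntoa(data[30:34])
        yield (f"{ts_sec}.{ts_usec:06d}", src_ip, dst_ip, proto, orig_len)


def write_csv(rows, out):
    """Write the rows with a header line so the columns are already named."""
    writer = csv.writer(out)
    writer.writerow(["timestamp", "src_ip", "dst_ip", "ip_proto", "length"])
    writer.writerows(rows)
```

To convert a capture, open the PCAP in binary mode and the CSV with `newline=""`, then call `write_csv(pcap_to_rows(pcap_file), csv_file)`. Because the rows are produced by a generator, the capture is streamed rather than loaded into memory, which matters once files reach several GB.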

I’ve stored a bunch of CSV network traces and analyzed them using Hive and Pig queries. It was very simple: name the columns and query each column looking for specific entries. Simple, but very labour intensive. There is also binary analysis on Hadoop.
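The "name the columns, then query them" workflow can be sketched without a cluster. The example below uses Python's built-in sqlite3 as a stand-in for Hive, so it shows the shape of the query rather than actual HiveQL. The table name, the column names, and the helper functions are assumptions made for this example.

```python
import csv
import sqlite3


def load_trace(conn, csv_file):
    """Create a table with named columns and load a CSV trace that has a header row."""
    conn.execute(
        "CREATE TABLE packets ("
        "ts TEXT, src_ip TEXT, dst_ip TEXT, ip_proto INTEGER, length INTEGER)"
    )
    # DictReader yields one dict per row, keyed by the CSV header names,
    # which line up with the named placeholders below.
    reader = csv.DictReader(csv_file)
    conn.executemany(
        "INSERT INTO packets VALUES (:timestamp, :src_ip, :dst_ip, :ip_proto, :length)",
        reader,
    )


def bytes_by_source(conn, dst_ip):
    """Return (src_ip, packet_count, total_bytes) for traffic sent to dst_ip."""
    cursor = conn.execute(
        "SELECT src_ip, COUNT(*) AS packets, SUM(length) AS bytes "
        "FROM packets WHERE dst_ip = ? "
        "GROUP BY src_ip ORDER BY bytes DESC, src_ip",
        (dst_ip,),
    )
    return cursor.fetchall()
```

Each new question means writing another query like `bytes_by_source` by hand, which is where the labour goes.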

I’m working on a MapReduce library that uses machine learning to classify attackers and their network patterns. As of 2013, there are a few commercial vendors like IBM and RSA which have added Hadoop capability to their SIEM product lines. Here is Twitter’s logging setup. In 2014 I loaded all the CSV attack data into a CDH4 cluster with the Impala query engine. I’m also looking at writing pandas dataframes to Google’s BigQuery. As of 2014 there are solutions on Hadoop for malware analysis, forensics, and DNS data mining.

I have recently (2014) been using Spark and PySpark on Hadoop, with the REPL running real-time interactive queries against datasets stored in HDFS. I’m going to be integrating all of my IPython Notebooks and learning the machine learning library built on Spark. It’s going to be an awesome year! https://bigsnarf.wordpress.com/2014/10/22/process-logs-with-kinesis-s3-apache-spark-on-emr-amazon-rds/

There are a few examples of PCAP ingestion with open source tools like Hadoop:

First one I found was P3:

The second presentation I found was Wayne Wheelers – SherpaSurfing and https://github.com/sherpasurfing/SHERPASURFING:

The third I found was https://github.com/RIPE-NCC/hadoop-pcap:

The fourth project was presented at BlackHatEU 2012 by PacketLoop and https://github.com/packetloop/packetpig:


5 responses to “Solutions for Big Data ingest of network traffic – Analyzing PCAP traffic with Hadoop”

  1. Pingback: Cloudera Impala for Real Time Queries in Hadoop « BigSnarf blog

  2. batata March 18, 2015 at 7:00 am

    Hi, SecurityDude. I have one question for you: how do you manage a large PCAP file using Wireshark? I have collected more than 3 GB of data into a PCAP file, and I’m only interested in some of the packets, but at 3 GB+ there are so many packets that it is very difficult to manage. Many times I have used PCAP2XML to convert all my PCAP data into XML and SQLite format and then used a browser for analysis. Have a look at http://bit.ly/1DxcncQ and please let me know if you know of any other sources.

    Thanks!

    • Security Dude March 18, 2015 at 9:20 pm

      # Dump selected fields per frame (tab-separated by default)
      $ tshark -r test.pcap -T fields -e frame.number -e eth.src -e eth.dst -e ip.src -e ip.dst -e frame.len > test1.csv
      # Same fields, with a header row and comma separator
      $ tshark -r test.pcap -T fields -e frame.number -e eth.src -e eth.dst -e ip.src -e ip.dst -e frame.len -E header=y -E separator=, > test2.csv
      # Display filters use -Y (older tshark releases used -R) and plain ASCII quotes
      $ tshark -r test.pcap -Y "frame.number>40" -T fields -e frame.number -e frame.time -e frame.time_delta -e frame.time_delta_displayed -e frame.time_relative -E header=y > test3.csv
      $ tshark -r test.pcap -Y "wlan.fc.type_subtype == 0x08" -T fields -e frame.number -e wlan.sa -e wlan.bssid > test4.csv

      $ tshark -r test.pcap -Y "ip.addr==192.168.1.6 && tcp.port==1696 && ip.addr==67.212.143.22 && tcp.port==80" -T fields -e frame.number -e tcp.analysis.ack_rtt -E header=y > test5.csv

      $ tshark -r test.pcap -T fields -e frame.number -e tcp.analysis.ack_rtt -E header=y > test6.csv

  3. Batata March 19, 2015 at 6:44 am

    Try doing this for every frame / every type and subtype / then create a DB
    schema to store in … welcome pcap2xml n sqlite 🙂
