BigSnarf blog

Infosec FTW

Category Archives: Tools

Using Spark to do real-time large scale log analysis


Spark on an IPython Notebook, used for analytic workflows over auth.log.

What is really cool about the Spark platform is that I can either batch process or data mine the whole dataset on the cluster. The workflow below is built on this idea.

In the screenshot below, you can see me using my IPython Notebook for interactive query. All the code I create to investigate the auth.log can easily be converted to Spark Streaming DStream objects in Scala. Effectively, I can build a real-time application all from the same platform. “Cool” IMHO.

[Screenshot: interactive PySpark query in the IPython Notebook (2014-03-26)]

These are some of the items I am filtering on in my PySpark interactive queries in the Notebook:

  - Successful user login: "Accepted password", "Accepted publickey", "session opened"
  - Failed user login: "authentication failure", "failed password"
  - User log-off: "session closed"
  - User account change or deletion: "password changed", "new user", "delete user"
  - Sudo actions: "sudo: … COMMAND=…"
  - Service failure: "failed" or "failure"
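The filters above can be prototyped locally in plain Python before porting to PySpark. A minimal sketch, with made-up sample lines (in PySpark the same predicate would be passed to `rdd.filter()` on the cached auth.log RDD):

```python
# Hypothetical sketch of the auth.log filters; sample lines are illustrative.
AUTH_PATTERNS = {
    "successful_login": ["Accepted password", "Accepted publickey", "session opened"],
    "failed_login": ["authentication failure", "failed password"],
    "logoff": ["session closed"],
    "account_change": ["password changed", "new user", "delete user"],
}

def matches(line, patterns):
    """True if any pattern appears in the line (case-insensitive)."""
    low = line.lower()
    return any(p.lower() in low for p in patterns)

sample = [
    "sshd[123]: Accepted password for root from 10.0.0.5 port 22 ssh2",
    "sshd[124]: Failed password for invalid user admin from 10.0.0.9",
    "sshd[125]: pam_unix(sshd:session): session closed for user root",
]

successes = [l for l in sample if matches(l, AUTH_PATTERNS["successful_login"])]
failures = [l for l in sample if matches(l, AUTH_PATTERNS["failed_login"])]
logoffs = [l for l in sample if matches(l, AUTH_PATTERNS["logoff"])]
```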




Note that one IP address is making a ton of requests.



Maybe I should correlate to web server request logs too? Things to look for:

  - Excessive access attempts to non-existent files
  - Code (SQL, HTML) seen as part of the URL
  - Access to extensions you have not implemented
  - Web service stopped/started/failed messages
  - Access to "risky" pages that accept user input
  - Look at logs on all servers in the load balancer pool
  - Error code 200 on files that are not yours
  - Failed user authentication: error codes 401, 403
  - Invalid request: error code 400
  - Internal server error: error code 500
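A couple of these checks can be sketched in a few lines of Python. The sample log lines and the "suspicious URL" regex below are illustrative assumptions, not taken from the original dataset:

```python
import re

# Crude indicator for code injected into the URL (SQL, script tags, traversal)
SUSPICIOUS_URL = re.compile(r"(union\s+select|<script|\.\./)", re.IGNORECASE)

def status_code(line):
    """Extract the HTTP status code from a combined-format access log line."""
    m = re.search(r'"\s(\d{3})\s', line)
    return int(m.group(1)) if m else None

sample = [
    '10.0.0.5 - - [28/Mar/2014:12:52:46] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.9 - - [28/Mar/2014:12:52:47] "GET /item?id=1 UNION SELECT HTTP/1.1" 404 0',
    '10.0.0.9 - - [28/Mar/2014:12:52:48] "GET /login HTTP/1.1" 401 0',
]

# failed authentication (401/403) and URLs carrying injected code
auth_failures = [l for l in sample if status_code(l) in (401, 403)]
injected = [l for l in sample if SUSPICIOUS_URL.search(l)]
```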




Here is the data correlated to successful logins, failed logins, and failed logins to an invalid user. Notice the same IP address recurring.



[Screenshot: correlated login data (2014-03-28)]




Setting up Spark on your Macbook videos

PySpark analysis of an Apache access log

[Screenshot: PySpark access log analysis (2014-03-19)]
# read in hdfs file to a Spark Context object and cache
logs = sc.textFile('hdfs:///big-access-log').cache()

# create filters (naive substring match: "500" can also appear in byte
# counts or URLs, so a real job would parse out the status field)
errors500 = logs.filter(lambda logline: "500" in logline)
errors404 = logs.filter(lambda logline: "404" in logline)
errors200 = logs.filter(lambda logline: "200" in logline)

# grab counts
e500_count = errors500.count()
e404_count = errors404.count()
e200_count = errors200.count()

# bring the results back to this box locally
local_500 = errors500.take(e500_count)
local_404 = errors404.take(e404_count)
local_200 = errors200.take(e200_count)

def make_ip_list(iterable):
    # assumes the client IP is the first whitespace-delimited field,
    # as in the Apache common/combined log format
    m = []
    for line in iterable:
        m.append(line.split()[0])
    return m

def list_count(iterable):
    d = {}
    for i in iterable:
        if i in d:
            d[i] = d[i] + 1
        else:
            d[i] = 1
    return d
# results of people making 500, 404, and 200 requests for the dataset
ip_addresses_making_500_requests = list_count(make_ip_list(local_500))
ip_addresses_making_404_requests = list_count(make_ip_list(local_404))
ip_addresses_making_200_requests = list_count(make_ip_list(local_200))
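The two helper functions above can also be collapsed into one step with the standard library's collections.Counter. A local sketch on made-up log lines, not the original dataset:

```python
from collections import Counter

# Equivalent of make_ip_list + list_count: count requests per client IP,
# where the IP is the first whitespace-delimited field of each log line.
local_500 = [
    '10.0.0.9 - - [19/Mar/2014] "GET /a HTTP/1.1" 500 0',
    '10.0.0.9 - - [19/Mar/2014] "GET /b HTTP/1.1" 500 0',
    '10.0.0.5 - - [19/Mar/2014] "GET /c HTTP/1.1" 500 0',
]

ip_counts_500 = Counter(line.split()[0] for line in local_500)
# ip_counts_500 == Counter({'10.0.0.9': 2, '10.0.0.5': 1})
```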

AOL Moloch is PCAP Elasticsearch full packet search


Moloch is an open source, large scale IPv4 packet capturing (PCAP), indexing and database system. A simple web interface is provided for PCAP browsing, searching, and exporting. APIs are exposed that allow PCAP data and JSON-formatted session data to be downloaded directly. Simple security is implemented by using HTTPS and HTTP digest password support, or by putting Apache in front. Moloch is not meant to replace IDS engines; instead it works alongside them to store and index all network traffic in standard PCAP format, providing fast access. Moloch is built to be deployed across many systems and can scale to handle multiple gigabits/sec of traffic.

Installation is pretty simple for a POC

  1. Spin up an Ubuntu box
  2. Update all the packages
  3. git clone
  4. follow tutorial if you must
  5. cd moloch
  6. ./
  7. follow prompts
  8. load sample PCAPs from
  9. Have fun with Moloch

NSRL and Mandiant MD5 in python Bloom Filters
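The idea referenced in this post title is checking MD5 hashes against a known-hash set (like the NSRL) using a Bloom filter, which trades a tiny false-positive rate for constant memory. A minimal hand-rolled sketch; the hashing scheme and parameters here are my own assumptions (a real deployment would use a library such as pybloom):

```python
import hashlib

class BloomFilter(object):
    """Minimal Bloom filter: k hash positions over an m-bit array.

    Membership tests can give false positives but never false negatives.
    """
    def __init__(self, m=1 << 20, k=5):
        self.m = m
        self.k = k
        self.bits = bytearray(m // 8)

    def _positions(self, item):
        # derive k positions by salting the item; MD5 here is just a
        # convenient mixer, not a security choice
        for i in range(self.k):
            h = hashlib.md5(("%d:%s" % (i, item)).encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

known_hashes = BloomFilter()
known_hashes.add("d41d8cd98f00b204e9800998ecf8427e")  # MD5 of the empty string
```

Lookups are then just `h in known_hashes`, with memory fixed at m/8 bytes no matter how many hashes are loaded.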

Arduino Sensor, Python, and Google Analytics


[Screenshot (2013-12-21, 8:14 PM)]

[Screenshot (2013-12-21, 8:19 PM)]

import serial
import urllib
import httplib

# the serial device name is machine-specific
ser = serial.Serial('/dev/tty.usbserial-AM01VDMD')
print("connected to: " + ser.portstr)

buf = []
while True:
    # read one byte at a time and assemble newline-terminated readings
    c = ser.read()
    if c == "\n":
        result = "".join(buf).strip()
        buf = []
        print result
        # Google Analytics Measurement Protocol endpoint; the hostname was
        # missing from the original post and is filled in here
        connection = httplib.HTTPConnection('www.google-analytics.com')
        params = urllib.urlencode({
            'v': 1,
            'tid': 'UA-46669546-1',
            'cid': '555',
            't': 'event',
            'ec': 'arduino',
            'ea': 'ldr',
            'ev': result,
        })
        connection.request('POST', '/collect', params)
        print "Posted to GA"
        print params
    else:
        buf.append(c)
const int ledPin = 13;
const int sensorPin = A0;

void setup() {
  pinMode(ledPin, OUTPUT);
  Serial.begin(9600); // needed so the Python script can read the values
}

void loop() {
  int rate = analogRead(sensorPin);
  digitalWrite(ledPin, HIGH);
  Serial.println(rate); // send the LDR reading over serial
  digitalWrite(ledPin, LOW);
  delay(500); // slow down the output for easier reading
}


Itertools Recipes – Python Docs – So helpful

# imports needed by the recipes below (Python 2)
import collections
import operator
from itertools import (chain, combinations, count, cycle, groupby, imap,
                       islice, izip, izip_longest, repeat, starmap, tee)
from operator import itemgetter

def take(n, iterable):
    "Return first n items of the iterable as a list"
    return list(islice(iterable, n))

def enumerate(iterable, start=0):
    return izip(count(start), iterable)

def tabulate(function, start=0):
    "Return function(0), function(1), ..."
    return imap(function, count(start))

def consume(iterator, n):
    "Advance the iterator n-steps ahead. If n is None, consume entirely."
    # The technique uses objects that consume iterators at C speed.
    if n is None:
        # feed the entire iterator into a zero-length deque
        collections.deque(iterator, maxlen=0)
    else:
        # advance to the empty slice starting at position n
        next(islice(iterator, n, n), None)

def nth(iterable, n, default=None):
    "Returns the nth item or a default value"
    return next(islice(iterable, n, None), default)

def quantify(iterable, pred=bool):
    "Count how many times the predicate is true"
    return sum(imap(pred, iterable))

def padnone(iterable):
    """Returns the sequence elements and then returns None indefinitely.

    Useful for emulating the behavior of the built-in map() function.
    """
    return chain(iterable, repeat(None))

def ncycles(iterable, n):
    "Returns the sequence elements n times"
    return chain.from_iterable(repeat(iterable, n))

def dotproduct(vec1, vec2):
    return sum(imap(operator.mul, vec1, vec2))

def flatten(listOfLists):
    return list(chain.from_iterable(listOfLists))

def repeatfunc(func, times=None, *args):
    """Repeat calls to func with specified arguments.

    Example:  repeatfunc(random.random)
    """
    if times is None:
        return starmap(func, repeat(args))
    return starmap(func, repeat(args, times))

def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return izip(a, b)

def grouper(n, iterable, fillvalue=None):
    "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

def roundrobin(*iterables):
    "roundrobin('ABC', 'D', 'EF') --> A D E B F C"
    # Recipe credited to George Sakkis
    pending = len(iterables)
    nexts = cycle(iter(it).next for it in iterables)
    while pending:
        try:
            for next in nexts:
                yield next()
        except StopIteration:
            pending -= 1
            nexts = cycle(islice(nexts, pending))

def compress(data, selectors):
    "compress('ABCDEF', [1,0,1,0,1,1]) --> A C E F"
    return (d for d, s in izip(data, selectors) if s)

def combinations_with_replacement(iterable, r):
    "combinations_with_replacement('ABC', 2) --> AA AB AC BB BC CC"
    # number items returned:  (n+r-1)! / r! / (n-1)!
    pool = tuple(iterable)
    n = len(pool)
    if not n and r:
        return
    indices = [0] * r
    yield tuple(pool[i] for i in indices)
    while True:
        for i in reversed(range(r)):
            if indices[i] != n - 1:
                break
        else:
            return
        indices[i:] = [indices[i] + 1] * (r - i)
        yield tuple(pool[i] for i in indices)

def powerset(iterable):
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))

def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    # unique_everseen('AAAABBBCCDAABBB') --> A B C D
    # unique_everseen('ABBCcAD', str.lower) --> A B C D
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in iterable:
            if element not in seen:
                seen_add(element)
                yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element

def unique_justseen(iterable, key=None):
    "List unique elements, preserving order. Remember only the element just seen."
    # unique_justseen('AAAABBBCCDAABBB') --> A B C D A B
    # unique_justseen('ABBCcAD', str.lower) --> A B C A D
    return imap(next, imap(itemgetter(1), groupby(iterable, key)))

Mavericks OSX you broke my pydata stack

[Screenshot (2013-10-29, 3:20 PM)]

[Screenshot (2013-10-29, 3:32 PM)]
wget -O - | sudo python
sudo python
sudo easy_install pip
sudo pip install virtualenv
sudo pip install virtualenvwrapper
curl -O
sudo python
sudo pip install scipy --upgrade
sudo pip install numpy --upgrade
sudo pip install matplotlib --upgrade
sudo pip install pyzmq --upgrade
sudo pip install tornado --upgrade
sudo pip install pygments --upgrade
sudo pip install pandas --upgrade
sudo pip install jinja2 --upgrade

Process weblogs with Hadoop and Excel

Tutorial 10: How to Visualize Website Clickstream Data

Nice graphs built on d3.js

