BigSnarf blog

Infosec FTW

Category Archives: Tools

Using Spark to do real-time large scale log analysis

 

Spark on an IPython Notebook, used for analytic workflows over the auth.log.

What is really cool about the Spark platform is that I can either batch process or data mine the whole dataset on the cluster. It is built on this idea:

http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41378.pdf

In the screenshot below, you can see me using my IPython Notebook for interactive queries. All the code I write to investigate the auth.log can easily be converted to Spark Streaming DStream objects in Scala, so effectively I can build a real-time application from the same platform. “Cool” IMHO.

[Screenshot: interactive PySpark queries against auth.log in the IPython Notebook]

http://nbviewer.ipython.org/urls/raw.githubusercontent.com/bigsnarfdude/PythonSystemAdminTools/master/auth_log_analysis_spark.ipynb?create=1

These are some of the items I am filtering on in my PySpark interactive queries in the Notebook:

- Successful user login: “Accepted password”, “Accepted publickey”, “session opened”
- Failed user login: “authentication failure”, “failed password”
- User log-off: “session closed”
- User account change or deletion: “password changed”, “new user”, “delete user”
- Sudo actions: “sudo: … COMMAND=…”, “FAILED su”
- Service failure: “failed” or “failure”
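Before lifting these patterns into Spark filters, they can be sketched as plain-Python predicates. This is just a local sketch using the substrings from the list above; the category names and the sample line are made up for illustration:

```python
# substring patterns from the list above, keyed by (hypothetical) category names
PATTERNS = {
    'successful_login': ('Accepted password', 'Accepted publickey', 'session opened'),
    'failed_login': ('authentication failure', 'failed password'),
    'logoff': ('session closed',),
    'account_change': ('password changed', 'new user', 'delete user'),
    'sudo': ('sudo:', 'FAILED su'),
}

def classify(logline):
    """Return every category whose patterns appear in an auth.log line."""
    return [cat for cat, pats in PATTERNS.items()
            if any(p in logline for p in pats)]

# a made-up auth.log line; the same predicate plugs straight into rdd.filter(...)
line = 'sshd[123]: Accepted password for bob from 10.0.0.5 port 22 ssh2'
print(classify(line))  # ['successful_login']
```

The same lambdas work unchanged as Spark `filter()` arguments, which is what makes the batch-to-streaming conversion painless.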

 

 

 

Note that IP address 219.192.113.91 is making a ton of requests.


 

Maybe I should correlate with the web server request logs too? Things to look for:

- Excessive access attempts to non-existent files
- Code (SQL, HTML) seen as part of the URL
- Access to extensions you have not implemented
- Web service stopped/started/failed messages
- Access to “risky” pages that accept user input
- Logs on all servers in the load balancer pool
- Error code 200 on files that are not yours
- Failed user authentication: error codes 401, 403
- Invalid request: error code 400
- Internal server error: error code 500
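The status-code checks at the bottom of that list can be expressed as a small triage helper. A minimal sketch, assuming the common Apache combined log format (status code follows the quoted request); the sample line is fabricated:

```python
import re

# in combined log format, the status code is the field after the quoted request
STATUS_RE = re.compile(r'"\s(\d{3})\s')

# codes from the checklist above
SUSPICIOUS = {
    '400': 'invalid request',
    '401': 'failed authentication',
    '403': 'failed authentication',
    '500': 'internal server error',
}

def triage(logline):
    """Map an access-log line to its checklist category, or None."""
    m = STATUS_RE.search(logline)
    return SUSPICIOUS.get(m.group(1)) if m else None

line = '1.2.3.4 - - [28/Mar/2014] "GET /admin HTTP/1.1" 401 0'
print(triage(line))  # failed authentication
```

The 200-on-files-that-are-not-yours check needs a whitelist of your own paths, so it is left out of this sketch.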

 

 

 

Here is the data correlated to successful logins, failed logins, and failed logins to an invalid user. Notice the “219.150.161.20” IP address.

[Screenshots: correlated login results]

Links

 

 

Setting up Spark on your MacBook (videos)

PySpark analysis of an Apache access log

# read the access log from HDFS into an RDD and cache it
logs = sc.textFile('hdfs:///big-access-log').cache()

# create filters (substring match anywhere in the line)
errors500 = logs.filter(lambda logline: "500" in logline)
errors404 = logs.filter(lambda logline: "404" in logline)
errors200 = logs.filter(lambda logline: "200" in logline)
# grab counts
e500_count = errors500.count()
e404_count = errors404.count()
e200_count = errors200.count()
# bring the matching lines back to this box locally
local_500 = errors500.collect()
local_404 = errors404.collect()
local_200 = errors200.collect()

def make_ip_list(iterable):
    "Extract the leading IP address field from each log line."
    m = []
    for line in iterable:
        m.append(line.split()[0])
    return m

def list_count(iterable):
    "Count occurrences of each item (a hand-rolled Counter)."
    d = {}
    for i in iterable:
        if i in d:
            d[i] = d[i] + 1
        else:
            d[i] = 1
    return d
# results of people making 500, 404, and 200 requests for the dataset
ip_addresses_making_500_requests = list_count(make_ip_list(local_500))
ip_addresses_making_404_requests = list_count(make_ip_list(local_404))
ip_addresses_making_200_requests = list_count(make_ip_list(local_200))
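The hand-rolled list_count above is equivalent to the standard library's collections.Counter. A quick local sanity check, with a few fabricated access-log lines standing in for the collected RDD results:

```python
from collections import Counter

# fabricated access-log lines (illustrative only)
sample = [
    '10.0.0.1 - - [19/Mar/2014] "GET /a HTTP/1.1" 200 123',
    '10.0.0.1 - - [19/Mar/2014] "GET /b HTTP/1.1" 200 99',
    '10.0.0.2 - - [19/Mar/2014] "GET /c HTTP/1.1" 200 1',
]

# first whitespace-separated field is the client IP
ips = [line.split()[0] for line in sample]

counts = Counter(ips)
print(counts['10.0.0.1'])  # 2
```

Counter also gives you most_common(n) for free, which is handy for pulling the top talkers out of the 404/500 buckets.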

AOL Moloch is PCAP + Elasticsearch full packet search


https://github.com/bigsnarfdude/moloch

Moloch is an open source, large scale IPv4 packet capturing (PCAP), indexing and database system. A simple web interface is provided for PCAP browsing, searching, and exporting. APIs are exposed that allow PCAP data and JSON-formatted session data to be downloaded directly. Simple security is implemented by using HTTPS and HTTP digest password support or by putting Apache in front. Moloch is not meant to replace IDS engines; instead it works alongside them, storing and indexing all network traffic in standard PCAP format and providing fast access. Moloch is built to be deployed across many systems and can scale to handle multiple gigabits/sec of traffic.

Installation is pretty simple for a POC:

  1. Spin up an Ubuntu box
  2. Update all the packages
  3. git clone https://github.com/bigsnarfdude/moloch
  4. follow tutorial if you must http://blog.alejandronolla.com/2013/04/06/moloch-capturing-and-indexing-network-traffic-in-realtime
  5. cd moloch
  6. ./easybutton-singlehost.sh
  7. follow prompts
  8. load sample PCAPs from http://digitalcorpora.org/corp/nps/scenarios/2009-m57-patents/net
  9. Have fun with Moloch

NSRL and Mandiant MD5 in python Bloom Filters

Arduino Sensor, Python, and Google Analytics


import serial
import urllib
import httplib

# open the Arduino's serial port
ser = serial.Serial('/dev/tty.usbserial-AM01VDMD')
print "connected to: " + ser.portstr

buf = []
while True:
    for line in ser.read():
        buf.append(line)
        if line == "\n":
            result = "".join(buf).strip()
            print result
            # post the sensor reading to Google Analytics as an event
            connection = httplib.HTTPConnection('www.google-analytics.com')
            params = urllib.urlencode({
                'v': 1,
                'tid': 'UA-46669546-1',
                'cid': '555',
                't': 'event',
                'ec': 'arduino',
                'ea': 'ldr',
                'ev': result,
            })
            connection.request('POST', '/collect', params)
            print "Posted to GA"
            print params
            buf = []
ser.close()
"""
const int ledPin = 13;
const int sensorPin = 0;
void setup() {
 pinMode(ledPin, OUTPUT);
 Serial.begin(9600);
}
void loop() {
 int rate = analogRead(A0);
 digitalWrite(ledPin, HIGH); 
 delay(rate); 

 digitalWrite(ledPin, LOW); 
 delay(rate);

 Serial.println(rate);
 delay(500); //slow down the output for easier reading
}
"""

Motivation: http://www.forbes.com/sites/ericsavitz/2013/01/14/ces-2013-the-break-out-year-for-the-internet-of-things/

Itertools Recipes – Python Docs – So helpful

def take(n, iterable):
    "Return first n items of the iterable as a list"
    return list(islice(iterable, n))

def enumerate(iterable, start=0):
    return izip(count(start), iterable)

def tabulate(function, start=0):
    "Return function(0), function(1), ..."
    return imap(function, count(start))

def consume(iterator, n):
    "Advance the iterator n-steps ahead. If n is None, consume entirely."
    # The technique uses objects that consume iterators at C speed.
    if n is None:
        # feed the entire iterator into a zero-length deque
        collections.deque(iterator, maxlen=0)
    else:
        # advance to the empty slice starting at position n
        next(islice(iterator, n, n), None)

def nth(iterable, n, default=None):
    "Returns the nth item or a default value"
    return next(islice(iterable, n, None), default)

def quantify(iterable, pred=bool):
    "Count how many times the predicate is true"
    return sum(imap(pred, iterable))

def padnone(iterable):
    """Returns the sequence elements and then returns None indefinitely.

    Useful for emulating the behavior of the built-in map() function.
    """
    return chain(iterable, repeat(None))

def ncycles(iterable, n):
    "Returns the sequence elements n times"
    return chain.from_iterable(repeat(iterable, n))

def dotproduct(vec1, vec2):
    return sum(imap(operator.mul, vec1, vec2))

def flatten(listOfLists):
    return list(chain.from_iterable(listOfLists))

def repeatfunc(func, times=None, *args):
    """Repeat calls to func with specified arguments.

    Example:  repeatfunc(random.random)
    """
    if times is None:
        return starmap(func, repeat(args))
    return starmap(func, repeat(args, times))

def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return izip(a, b)

def grouper(n, iterable, fillvalue=None):
    "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

def roundrobin(*iterables):
    "roundrobin('ABC', 'D', 'EF') --> A D E B F C"
    # Recipe credited to George Sakkis
    pending = len(iterables)
    nexts = cycle(iter(it).next for it in iterables)
    while pending:
        try:
            for next in nexts:
                yield next()
        except StopIteration:
            pending -= 1
            nexts = cycle(islice(nexts, pending))

def compress(data, selectors):
    "compress('ABCDEF', [1,0,1,0,1,1]) --> A C E F"
    return (d for d, s in izip(data, selectors) if s)

def combinations_with_replacement(iterable, r):
    "combinations_with_replacement('ABC', 2) --> AA AB AC BB BC CC"
    # number items returned:  (n+r-1)! / r! / (n-1)!
    pool = tuple(iterable)
    n = len(pool)
    if not n and r:
        return
    indices = [0] * r
    yield tuple(pool[i] for i in indices)
    while True:
        for i in reversed(range(r)):
            if indices[i] != n - 1:
                break
        else:
            return
        indices[i:] = [indices[i] + 1] * (r - i)
        yield tuple(pool[i] for i in indices)

def powerset(iterable):
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))

def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    # unique_everseen('AAAABBBCCDAABBB') --> A B C D
    # unique_everseen('ABBCcAD', str.lower) --> A B C D
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in iterable:
            if element not in seen:
                seen_add(element)
                yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element

def unique_justseen(iterable, key=None):
    "List unique elements, preserving order. Remember only the element just seen."
    # unique_justseen('AAAABBBCCDAABBB') --> A B C D A B
    # unique_justseen('ABBCcAD', str.lower) --> A B C A D
    return imap(next, imap(itemgetter(1), groupby(iterable, key)))
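These recipes are written for Python 2 (izip, imap, iterator .next); under Python 3 the i-prefixed functions are just the zip/map builtins. A quick sanity check of pairwise and grouper with the Python 3 spellings:

```python
from itertools import tee, zip_longest

def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2,s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

def grouper(n, iterable, fillvalue=None):
    "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

print(list(pairwise('ABCD')))
# [('A', 'B'), ('B', 'C'), ('C', 'D')]
print([''.join(g) for g in grouper(3, 'ABCDEFG', 'x')])
# ['ABC', 'DEF', 'Gxx']
```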

Mavericks OSX you broke my pydata stack

wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | sudo python
sudo python ez_setup.py
sudo easy_install pip
sudo pip install virtualenv
sudo pip install virtualenvwrapper
curl -O http://python-distribute.org/distribute_setup.py
sudo python distribute_setup.py
sudo pip install scipy --upgrade
sudo pip install numpy --upgrade
sudo pip install matplotlib --upgrade
sudo pip install pyzmq --upgrade
sudo pip install tornado --upgrade
sudo pip install pygments --upgrade
sudo pip install pandas --upgrade
sudo pip install jinja2 --upgrade

Process weblogs with Hadoop and Excel

Tutorial 10: How to Visualize Website Clickstream Data

http://hortonworks.com/hadoop-tutorial/how-to-visualize-website-clickstream-data/

Nice graphs built on d3.js
