BigSnarf blog

Infosec FTW




IPython Notebook, Apache Spark, Honeynet Forensic Challenge 10

Hello World – Analysis of StackOverflow dataset using Apache Spark Job Server

Screen Shot 2014-06-13 at 8.32.54 PM

curl --data-binary @target/scala-2.10/spark-jobserver-examples_2.10-1.0.0.jar localhost:8090/jars/sparking
curl 'localhost:8090/jars'
curl 'localhost:8090/contexts'
curl -X POST 'localhost:8090/contexts/users-context'
curl -X POST 'localhost:8090/jobs?appName=sparking&classPath=sparking.jobserver.GetOrCreateUsers&context=users-context'
curl 'localhost:8090/jobs/<insertJobNumberHere>'

Cisco enters the big data security analytics with OpenSOC

Screen Shot 2014-06-13 at 4.39.41 PM

Using machine learning for anomaly detection is not new but …

Probabilistic Programming with Scala – Hello World

Screen Shot 2014-05-24 at 4.06.20 PM

import com.cra.figaro.language.{Flip, Select} //#A 
import com.cra.figaro.library.compound.If //#A 
import com.cra.figaro.algorithm.factored.VariableElimination //#A

object HelloWorld {
 val sunnyToday = Flip(0.2) //#B 
 val greetingToday = If(sunnyToday, //#C
 Select(0.6 -> "Hello world!", 0.4 -> "Howdy, universe!"), //#C
 Select(0.2 -> "Hello world!", 0.8 -> "Oh no, not again")) //#C 
 val sunnyTomorrow = If(sunnyToday, Flip(0.8), Flip(0.05)) //#D 
 val greetingTomorrow = If(sunnyTomorrow, //#E
 Select(0.6 -> "Hello world!", 0.4 -> "Howdy, universe!"), //#E 
 Select(0.2 -> "Hello world!", 0.8 -> "Oh no, not again")) //#E

 def predict() {
 val algorithm = VariableElimination(greetingToday) //#F 
 algorithm.start() //#G 
 val result =
 algorithm.probability(greetingToday, "Hello world!") //#H 
 println("Tomorrow's greeting is \"Hello world!\" " +
 "with probability " + result + ".") //#I } algorithm.kill() //#J

 def infer() {
 greetingToday.observe("Hello world!") //#K 
 val algorithm = VariableElimination(sunnyToday) //#F 
 algorithm.start() //#G 
 val result = algorithm.probability(sunnyToday, true) //#H 
 println("If today's greeting is \"Hello world!\", today’s " +
 "weather is sunny with probability " + result + ".") //#I 
 algorithm.kill() //#I
 greetingToday.unobserve() //#L

 def learnAndPredict() {
 greetingToday.observe("Hello world!") //#K 
 val algorithm = VariableElimination(greetingTomorrow) //#F 
 algorithm.start() //#G 
 val result =
 algorithm.probability(greetingTomorrow, "Hello world!") //#H 
 println("If today's greeting is \"Hello world!\", " +
 "tomorrow's greeting will be \"Hello world!\" " +
 "with probability " + result + ".") //#I 
 algorithm.kill() //#J 
 greetingToday.unobserve() //#L

 def main(args: Array[String]) { //#M 
 predict() //#M 
 infer() //#M
 learnAndPredict() //#M

Apache Spark Job Server – Getting Started Hello World

Screen Shot 2014-05-23 at 7.10.27 AM

Clone Ooyala’s Spark Job Server

$ git clone
$ cd spark-jobserver

Using SBT, publish it to your local repository and run it

$ sbt publish-local
$ sbt
> re-start

WordCountExample walk-through

First, to package the test jar containing the WordCountExample:

sbt job-server-tests/package

Then go ahead and start the job server using the instructions above.

Let’s upload the jar:

curl --data-binary @job-server-tests/target/job-server-tests-0.3.1.jar localhost:8090/jars/test

The above jar is uploaded as app test. Next, let’s start an ad-hoc word count job, meaning that the job server will create its own SparkContext, and return a job ID for subsequent querying:

curl -d "input.string = a b c a b see" 'localhost:8090/jobs?appName=test&classPath=spark.jobserver.WordCountExample'
  "status": "STARTED",
  "result": {
    "jobId": "5453779a-f004-45fc-a11d-a39dae0f9bf4",
    "context": "b7ea0eb5-spark.jobserver.WordCountExample"

Algebird for Infosec Analytics

Using Spark to do real-time large scale log analysis —> unified logging

Spark on IPython Notebook used for analytic workflows of the auth.log.

What is really cool about this Spark platform is that I can either batch or data mine the whole dataset on the cluster. Built on this idea. In the screenshot below, you can see me using my IPython Notebook for interactive query. All the code I create to investigate the auth.log can easily be converted to Spark Streaming DStream objects in Scala. Effectively, I can build a real-time application all from the same platform. “Cool” IMHO. In other posts you I will document pySpark batch, and DStream algorithms in Scala processing of auth.log Screen Shot 2014-03-26 at 8.56.29 PM These are some of the items I am filtering in my PySpark interactive queries in the Notebook

Successful user login “Accepted password”, “Accepted publickey”, “session opened”
Failed user login “authentication failure”, “failed password”
User log-off “session closed”
User account change or deletion “password changed”, “new user”, “delete user”
Sudo actions “sudo: … COMMAND=…” “FAILED su”
Service failure “failed” or “failure”

Note that ip address is making a ton of requests invalid   Maybe I should correlate to web server request logs too?

Excessive access attempts to non-existent files
Code (SQL, HTML) seen as part of the URL
Access to extensions you have not implemented
Web service stopped/started/failed messages
Access to “risky” pages that accept user input
Look at logs on all servers in the load balancer pool
Error code 200 on files that are not yours
Failed user authentication Error code 401, 403
Invalid request Error code 400
Internal server error Error code 500

Here is the data correlated to Successful Logins, Failed Logins and Failed logins to an invalid user. Notice the “” ip address. suspicious


Screen Shot 2014-03-28 at 12.52.46 PM Links


Python Spark processing auth.log file

Screen Shot 2014-04-22 at 8.55.21 PM

auth log analysis with spark
import sys
from operator import add
from time import strftime
from pyspark import SparkContext

outFile = "counts" + strftime("%Y-%m-%d")
logFile = "auth.log"
destination = "local"
appName = "Simple App"

sc = SparkContext(destination, appName)
logData = sc.textFile(logFile).cache()
failedData = logData.filter(lambda x: 'Failed' in x)
rootData = failedData.filter(lambda s: 'root' in s)

def splitville(line):
 return line.split("from ")[1].split()[0]

ipAddresses =
counts = x: (x, 1)).reduceByKey(add)
output = counts.collect()
for (word, count) in output:
 print "%s: %i" % (word, count)

Get every new post delivered to your Inbox.

Join 40 other followers