BigSnarf blog

Infosec FTW

Hello World – Analysis of StackOverflow dataset using Apache Spark Job Server


# upload the example jar as app "sparking"
curl --data-binary @target/scala-2.10/spark-jobserver-examples_2.10-1.0.0.jar localhost:8090/jars/sparking
# list the uploaded jars and existing contexts
curl 'localhost:8090/jars'
curl 'localhost:8090/contexts'
# create a long-running context named "users-context"
curl -X POST 'localhost:8090/contexts/users-context'
# submit the job in that context
curl -X POST 'localhost:8090/jobs?appName=sparking&classPath=sparking.jobserver.GetOrCreateUsers&context=users-context'
# poll job status and results by job ID
curl 'localhost:8090/jobs/<insertJobNumberHere>'

Cisco enters the big data security analytics space with OpenSOC


Using machine learning for anomaly detection is not new but …

Probabilistic Programming with Scala – Hello World


import com.cra.figaro.language.{Flip, Select} //#A 
import com.cra.figaro.library.compound.If //#A 
import com.cra.figaro.algorithm.factored.VariableElimination //#A

object HelloWorld {
  val sunnyToday = Flip(0.2) //#B
  val greetingToday = If(sunnyToday, //#C
    Select(0.6 -> "Hello world!", 0.4 -> "Howdy, universe!"), //#C
    Select(0.2 -> "Hello world!", 0.8 -> "Oh no, not again")) //#C
  val sunnyTomorrow = If(sunnyToday, Flip(0.8), Flip(0.05)) //#D
  val greetingTomorrow = If(sunnyTomorrow, //#E
    Select(0.6 -> "Hello world!", 0.4 -> "Howdy, universe!"), //#E
    Select(0.2 -> "Hello world!", 0.8 -> "Oh no, not again")) //#E

  def predict() {
    val algorithm = VariableElimination(greetingToday) //#F
    algorithm.start() //#G
    val result =
      algorithm.probability(greetingToday, "Hello world!") //#H
    println("Today's greeting is \"Hello world!\" " +
      "with probability " + result + ".") //#I
    algorithm.kill() //#J
  }

  def infer() {
    greetingToday.observe("Hello world!") //#K
    val algorithm = VariableElimination(sunnyToday) //#F
    algorithm.start() //#G
    val result = algorithm.probability(sunnyToday, true) //#H
    println("If today's greeting is \"Hello world!\", today's " +
      "weather is sunny with probability " + result + ".") //#I
    algorithm.kill() //#J
    greetingToday.unobserve() //#L
  }

  def learnAndPredict() {
    greetingToday.observe("Hello world!") //#K
    val algorithm = VariableElimination(greetingTomorrow) //#F
    algorithm.start() //#G
    val result =
      algorithm.probability(greetingTomorrow, "Hello world!") //#H
    println("If today's greeting is \"Hello world!\", " +
      "tomorrow's greeting will be \"Hello world!\" " +
      "with probability " + result + ".") //#I
    algorithm.kill() //#J
    greetingToday.unobserve() //#L
  }

  def main(args: Array[String]) { //#M
    predict() //#M
    infer() //#M
    learnAndPredict() //#M
  }
}
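The three probabilities this program prints can be cross-checked by enumerating the four (sunnyToday, sunnyTomorrow) worlds directly. This is my own Python sanity check of the model above, not part of Figaro:

```python
from itertools import product

P_SUNNY_TODAY = 0.2
P_HELLO = {True: 0.6, False: 0.2}            # P(greeting = "Hello world!" | sunny)
P_SUNNY_TOMORROW = {True: 0.8, False: 0.05}  # P(sunnyTomorrow | sunnyToday)

def p_world(sunny_today, sunny_tomorrow):
    """Joint probability of one weather assignment."""
    p = P_SUNNY_TODAY if sunny_today else 1 - P_SUNNY_TODAY
    q = P_SUNNY_TOMORROW[sunny_today]
    return p * (q if sunny_tomorrow else 1 - q)

# predict: P(greetingToday = "Hello world!")
p_hello_today = sum(p_world(st, sm) * P_HELLO[st]
                    for st, sm in product([True, False], repeat=2))

# infer: P(sunnyToday = true | greetingToday = "Hello world!")
p_sunny_and_hello = sum(p_world(True, sm) * P_HELLO[True] for sm in [True, False])
p_sunny_given_hello = p_sunny_and_hello / p_hello_today

# learnAndPredict: P(greetingTomorrow = "Hello world!" | greetingToday = "Hello world!")
p_hello_both = sum(p_world(st, sm) * P_HELLO[st] * P_HELLO[sm]
                   for st, sm in product([True, False], repeat=2))
p_hello_tomorrow_given = p_hello_both / p_hello_today

print(round(p_hello_today, 4))           # 0.28
print(round(p_sunny_given_hello, 4))     # 0.4286
print(round(p_hello_tomorrow_given, 4))  # 0.3486
```

The conditional probabilities are just the joint divided by the evidence probability, which is exactly what variable elimination computes for this tiny model.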

Apache Spark Job Server – Getting Started Hello World


Clone Ooyala’s Spark Job Server

$ git clone
$ cd spark-jobserver

Using SBT, publish it to your local repository and run it

$ sbt publish-local
$ sbt
> re-start

WordCountExample walk-through

First, to package the test jar containing the WordCountExample:

sbt job-server-tests/package

Then go ahead and start the job server using the instructions above.

Let’s upload the jar:

curl --data-binary @job-server-tests/target/job-server-tests-0.3.1.jar localhost:8090/jars/test

The above jar is uploaded as app test. Next, let’s start an ad-hoc word count job, meaning that the job server will create its own SparkContext, and return a job ID for subsequent querying:

curl -d "input.string = a b c a b see" 'localhost:8090/jobs?appName=test&classPath=spark.jobserver.WordCountExample'
{
  "status": "STARTED",
  "result": {
    "jobId": "5453779a-f004-45fc-a11d-a39dae0f9bf4",
    "context": "b7ea0eb5-spark.jobserver.WordCountExample"
  }
}

Algebird for Infosec Analytics

Using Spark to do real-time, large-scale log analysis -> unified logging

Spark in an IPython Notebook, used for analytic workflows over auth.log.

What is really cool about the Spark platform is that I can batch process or data mine the whole dataset on the cluster. In the screenshot below, you can see me using my IPython Notebook for interactive queries. All the code I write to investigate auth.log can easily be converted to Spark Streaming DStream objects in Scala, so I can effectively build a real-time application on the same platform. "Cool," IMHO. In other posts I will document pySpark batch processing and DStream algorithms in Scala for auth.log.

These are some of the items I am filtering in my PySpark interactive queries in the Notebook:

Successful user login “Accepted password”, “Accepted publickey”, “session opened”
Failed user login “authentication failure”, “failed password”
User log-off “session closed”
User account change or deletion “password changed”, “new user”, “delete user”
Sudo actions “sudo: … COMMAND=…” “FAILED su”
Service failure “failed” or “failure”
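That filtering can be sketched in plain Python as a small categorizer. The pattern lists are my paraphrase of the categories above (not a complete ruleset), and the sample log lines are made up; note that the broad "service failure" category is checked last so more specific matches win:

```python
# Each category maps to substrings taken from the list above.
PATTERNS = {
    "login_success": ["Accepted password", "Accepted publickey", "session opened"],
    "login_failure": ["authentication failure", "Failed password"],
    "logoff": ["session closed"],
    "account_change": ["password changed", "new user", "delete user"],
    "sudo": ["sudo:", "FAILED su"],
    "service_failure": ["failed", "failure"],   # broadest, so checked last
}

def categorize(line):
    """Return the first category whose patterns match the line (case-insensitive)."""
    lowered = line.lower()
    for category, needles in PATTERNS.items():
        if any(needle.lower() in lowered for needle in needles):
            return category
    return None

sample = [
    "sshd[123]: Accepted password for bob from port 22 ssh2",
    "sshd[124]: Failed password for invalid user admin from",
    "sshd[125]: pam_unix(sshd:session): session closed for user bob",
]
for line in sample:
    print(categorize(line))   # login_success, login_failure, logoff
```

Each of these predicates translates directly to a `logData.filter(lambda x: ...)` call in PySpark, or to a `DStream.filter` in the streaming version.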

Note that one IP address is making a ton of invalid requests. Maybe I should correlate with the web server request logs too?

Excessive access attempts to non-existent files
Code (SQL, HTML) seen as part of the URL
Access to extensions you have not implemented
Web service stopped/started/failed messages
Access to “risky” pages that accept user input
Look at logs on all servers in the load balancer pool
Error code 200 on files that are not yours
Failed user authentication (error codes 401, 403)
Invalid requests (error code 400)
Internal server errors (error code 500)

Here is the data correlated to successful logins, failed logins, and failed logins to an invalid user. Notice the "" IP address. Suspicious.




Python Spark processing auth.log file


# auth.log analysis with Spark
import sys
from operator import add
from time import strftime
from pyspark import SparkContext

outFile = "counts" + strftime("%Y-%m-%d")
logFile = "auth.log"
destination = "local"
appName = "Simple App"

sc = SparkContext(destination, appName)
logData = sc.textFile(logFile).cache()
failedData = logData.filter(lambda x: 'Failed' in x)
rootData = failedData.filter(lambda s: 'root' in s)

def splitville(line):
 return line.split("from ")[1].split()[0]

ipAddresses =
counts = lambda x: (x, 1)).reduceByKey(add)
output = counts.collect()
for (word, count) in output:
 print "%s: %i" % (word, count)

Scala and Algebird example in REPL


//scala wordcount example
val lines = Source.fromFile("").getLines.toArray
val words = lines.flatMap(_.split(" "))
val emptyCounts = Map[String,Int]().withDefaultValue(0)
val counts = words.foldLeft(emptyCounts)({ (currentCounts: Map[String,Int], word: String) =>
  currentCounts.updated(word, currentCounts(word) + 1)
})

//algebird hyperloglog
import com.twitter.algebird._
import HyperLogLog._
val hll = new HyperLogLogMonoid(4)
val data = List(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)
val seqHll = { hll(_) }
val sumHll = hll.sum(seqHll)
val approxSizeOf = hll.sizeOf(sumHll)
val actualSize = data.toSet.size
val estimate = approxSizeOf.estimate

//algebird bloomfilter
import com.twitter.algebird._
val NUM_HASHES = 6
val WIDTH = 32
val SEED = 1
val bfMonoid = new BloomFilterMonoid(NUM_HASHES, WIDTH, SEED)
val bf = bfMonoid.create("1", "2", "3", "4", "100")
val approxBool = bf.contains("1")
val res = approxBool.isTrue

//algebird countMinSketch
import com.twitter.algebird._
val DELTA = 1E-10
val EPS = 0.001
val SEED = 1
val CMS_MONOID = new CountMinSketchMonoid(EPS, DELTA, SEED)
val data = List(1L, 1L, 3L, 4L, 5L)
val cms = CMS_MONOID.create(data)
val estimate = cms.frequency(1L).estimate

//sketch map
import com.twitter.algebird._
val DELTA = 1E-8
val EPS = 0.001
val SEED = 1

implicit def string2Bytes(i : String) = i.toCharArray.map(_.toByte)

val HEAVY_HITTERS_COUNT = 10
val PARAMS = SketchMapParams[String](SEED, EPS, DELTA, HEAVY_HITTERS_COUNT)
val MONOID = SketchMap.monoid[String, Long](PARAMS)
val data = List( ("1", 1L), ("3", 2L), ("4", 1L), ("5", 1L) )
val sm = MONOID.create(data) 
MONOID.frequency(sm, "1")
MONOID.frequency(sm, "2")
MONOID.frequency(sm, "3")

Spark Streaming Word Count from Network Socket in Scala

package com.wordpress.bigsnarf.spark.streaming.examples

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

/**
 * Counts words in UTF8-encoded text received from the network every second.
 */
object NetworkWordCount {
  def main(args: Array[String]) {

    if (args.length < 3) {
      System.err.println("Usage: NetworkWordCount <master> <hostname> <port>\n" +
        "In local mode, <master> should be 'local[n]' with n > 1")
      System.exit(1)
    }

    // Streaming context with a 1-second batch interval
    val ssc = new StreamingContext(args(0), "NetworkWordCount", Seconds(1))

    // Socket takes a text stream
    val lines = ssc.socketTextStream(args(1), args(2).toInt)

    // split lines into words
    val words = lines.flatMap(_.split(" "))

    // count the words: tuple -> reduce
    val wordCounts = => (x, 1)).reduceByKey(_ + _)

    // print to screen
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}