BigSnarf blog

Infosec FTW

Category Archives: Tools

Apache Spark 1.0.1 – SQL – Load from files, JSON and arrays

D3.js – Crossfilter – Chart Porn

IPython Notebook, Apache Spark, Honeynet Forensic Challenge 10

Hello World – Analysis of StackOverflow dataset using Apache Spark Job Server


# upload the example jar as app "sparking"
curl --data-binary @target/scala-2.10/spark-jobserver-examples_2.10-1.0.0.jar localhost:8090/jars/sparking
# list uploaded jars and existing contexts
curl 'localhost:8090/jars'
curl 'localhost:8090/contexts'
# create a long-lived context, then run the job in it
curl -X POST 'localhost:8090/contexts/users-context'
curl -X POST 'localhost:8090/jobs?appName=sparking&classPath=sparking.jobserver.GetOrCreateUsers&context=users-context'
# poll the job status/result by job ID
curl 'localhost:8090/jobs/<insertJobNumberHere>'

https://github.com/bigsnarfdude/spark-jobserver-examples

Cisco enters big data security analytics with OpenSOC


Probabilistic Programming with Scala – Hello World


import com.cra.figaro.language.{Flip, Select}
import com.cra.figaro.library.compound.If
import com.cra.figaro.algorithm.factored.VariableElimination

object HelloWorld {
  // Today is sunny with probability 0.2, and the greeting depends on the weather
  val sunnyToday = Flip(0.2)
  val greetingToday = If(sunnyToday,
    Select(0.6 -> "Hello world!", 0.4 -> "Howdy, universe!"),
    Select(0.2 -> "Hello world!", 0.8 -> "Oh no, not again"))

  // Tomorrow's weather depends on today's, and tomorrow's greeting on tomorrow's weather
  val sunnyTomorrow = If(sunnyToday, Flip(0.8), Flip(0.05))
  val greetingTomorrow = If(sunnyTomorrow,
    Select(0.6 -> "Hello world!", 0.4 -> "Howdy, universe!"),
    Select(0.2 -> "Hello world!", 0.8 -> "Oh no, not again"))

  // Predict today's greeting from the prior
  def predict() {
    val algorithm = VariableElimination(greetingToday)
    algorithm.start()
    val result = algorithm.probability(greetingToday, "Hello world!")
    println("Today's greeting is \"Hello world!\" " +
      "with probability " + result + ".")
    algorithm.kill()
  }

  // Infer today's weather from an observed greeting
  def infer() {
    greetingToday.observe("Hello world!")
    val algorithm = VariableElimination(sunnyToday)
    algorithm.start()
    val result = algorithm.probability(sunnyToday, true)
    println("If today's greeting is \"Hello world!\", today's " +
      "weather is sunny with probability " + result + ".")
    algorithm.kill()
    greetingToday.unobserve()
  }

  // Observe today's greeting and predict tomorrow's
  def learnAndPredict() {
    greetingToday.observe("Hello world!")
    val algorithm = VariableElimination(greetingTomorrow)
    algorithm.start()
    val result = algorithm.probability(greetingTomorrow, "Hello world!")
    println("If today's greeting is \"Hello world!\", " +
      "tomorrow's greeting will be \"Hello world!\" " +
      "with probability " + result + ".")
    algorithm.kill()
    greetingToday.unobserve()
  }

  def main(args: Array[String]) {
    predict()
    infer()
    learnAndPredict()
  }
}
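
A quick sanity check on the numbers above: predict() should report a probability of 0.2 × 0.6 + 0.8 × 0.2 = 0.28 for today's greeting being "Hello world!", and infer() should then report that today is sunny with probability 0.12 / 0.28, roughly 0.43.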

Apache Spark Job Server – Getting Started Hello World


Clone Ooyala’s Spark Job Server

$ git clone https://github.com/ooyala/spark-jobserver
$ cd spark-jobserver

Using SBT, publish it to your local repository and run it

$ sbt publish-local
$ sbt
> re-start

WordCountExample walk-through

First, to package the test jar containing the WordCountExample:

sbt job-server-tests/package

Then go ahead and start the job server using the instructions above.

Let’s upload the jar:

curl --data-binary @job-server-tests/target/job-server-tests-0.3.1.jar localhost:8090/jars/test
OK

The above jar is uploaded as app test. Next, let’s start an ad-hoc word count job, meaning that the job server will create its own SparkContext, and return a job ID for subsequent querying:

curl -d "input.string = a b c a b see" 'localhost:8090/jobs?appName=test&classPath=spark.jobserver.WordCountExample'
{
  "status": "STARTED",
  "result": {
    "jobId": "5453779a-f004-45fc-a11d-a39dae0f9bf4",
    "context": "b7ea0eb5-spark.jobserver.WordCountExample"
  }
}
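
For reference, a job-server job is just a Scala object implementing the SparkJob trait: validate() checks the submitted config and runJob() does the work against the server's SparkContext. Below is a minimal sketch along the lines of WordCountExample, assuming the 0.3.x API; treat it as approximate and check the job-server-tests source for the real thing.

import com.typesafe.config.Config
import org.apache.spark.SparkContext
import scala.util.Try
import spark.jobserver.{SparkJob, SparkJobValid, SparkJobInvalid, SparkJobValidation}

object WordCountSketch extends SparkJob {
  // Reject the job up front if the expected config key is missing
  def validate(sc: SparkContext, config: Config): SparkJobValidation =
    Try(config.getString("input.string"))
      .map(_ => SparkJobValid)
      .getOrElse(SparkJobInvalid("No input.string config param"))

  // The return value is serialized back to the client as the job result
  def runJob(sc: SparkContext, config: Config): Any = {
    val words = sc.parallelize(config.getString("input.string").split(" ").toSeq)
    words.countByValue
  }
}

The curl above just POSTs enough information (appName, classPath, and the input.string config) for the server to locate an object like this, validate the config, and run it.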


https://github.com/bigsnarfdude/spark-jobserver

https://github.com/fedragon/spark-jobserver-examples

Using Spark to do real-time, large-scale log analysis -> unified logging

Spark in an IPython Notebook, used for analytic workflows over the auth.log.

What is really cool about this Spark platform is that I can either batch process or data mine the whole dataset on the cluster. It is built on this idea: http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41378.pdf

I use my IPython Notebook for interactive query, and all the code I create to investigate the auth.log can easily be converted to Spark Streaming DStream objects in Scala. Effectively, I can build a real-time application from the same platform. "Cool" IMHO. In other posts I will document pySpark batch processing and Scala DStream processing of the auth.log. The notebook is here: http://nbviewer.ipython.org/urls/raw.githubusercontent.com/bigsnarfdude/PythonSystemAdminTools/master/auth_log_analysis_spark.ipynb?create=1

These are some of the items I am filtering on in my PySpark interactive queries in the Notebook (a Scala DStream sketch of the same filters follows the list):

Successful user login: “Accepted password”, “Accepted publickey”, “session opened”
Failed user login: “authentication failure”, “failed password”
User log-off: “session closed”
User account change or deletion: “password changed”, “new user”, “delete user”
Sudo actions: “sudo: … COMMAND=…”, “FAILED su”
Service failure: “failed” or “failure”
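
As a rough illustration of the DStream conversion mentioned above, here is a minimal Spark Streaming sketch in Scala. The socket source, port, and batch interval are assumptions for the example (e.g. auth.log lines piped to a socket with nc); the filters mirror the failed-login patterns in the list.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

object AuthLogStream {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("AuthLogStream").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))   // 10-second micro-batches (assumption)

    // Assume auth.log lines are forwarded to a local socket, e.g. "tail -F auth.log | nc -lk 9999"
    val lines = ssc.socketTextStream("localhost", 9999)

    // Same failed-login patterns as the interactive PySpark queries
    val failedLogins = lines.filter(l =>
      l.contains("authentication failure") || l.contains("Failed password"))

    // Count offending source IPs in each batch
    val ipCounts = failedLogins
      .filter(_.contains("from "))
      .map(line => (line.split("from ")(1).split(" ")(0), 1))
      .reduceByKey(_ + _)

    ipCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}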

Note that IP address 219.192.113.91 is making a ton of invalid requests. Maybe I should correlate with the web server request logs too? Things to look for there (a rough correlation sketch in Scala follows this list):

Excessive access attempts to non-existent files
Code (SQL, HTML) seen as part of the URL
Access to extensions you have not implemented
Web service stopped/started/failed messages
Access to “risky” pages that accept user input
Look at logs on all servers in the load balancer pool
Error code 200 on files that are not yours
Failed user authentication: error code 401, 403
Invalid request: error code 400
Internal server error: error code 500
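
If I do correlate with the web server logs, the filter would look something like the sketch below. The access.log path and the Apache combined log format are assumptions for the example; the idea is just to pull out client IP and status code and count the error responses per IP.

import scala.io.Source
import scala.util.Try

// Pull the client IP and HTTP status code out of a combined-log line (assumed format)
def ipAndStatus(line: String): Option[(String, Int)] = {
  val fields = line.split(" ")
  Try(fields(8).toInt).toOption.map(status => (fields(0), status))
}

val webLines = Source.fromFile("access.log").getLines.toList
val errorsByIp = webLines
  .flatMap(ipAndStatus)
  .filter { case (_, status) => Set(400, 401, 403, 500).contains(status) }
  .groupBy { case (ip, _) => ip }
  .mapValues(_.size)

// IPs that show up both here and in the auth.log failures deserve a closer look
errorsByIp.toSeq.sortBy(-_._2).take(10).foreach(println)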

The data here correlates successful logins, failed logins, and failed logins to an invalid user. Notice the “219.150.161.20” IP address. Suspicious.


Python Spark processing auth.log file


"""
auth log analysis with spark
SimpleApp.py
"""
import sys
from operator import add
from time import strftime
from pyspark import SparkContext

outFile = "counts" + strftime("%Y-%m-%d")   # dated output directory for saveAsTextFile
logFile = "auth.log"
destination = "local"                       # Spark master URL
appName = "Simple App"

sc = SparkContext(destination, appName)
logData = sc.textFile(logFile).cache()
failedData = logData.filter(lambda x: 'Failed' in x)   # failed login attempts
rootData = failedData.filter(lambda s: 'root' in s)    # ...targeting the root account

def splitville(line):
    """Pull the source IP out of a line like '... Failed password for root from 1.2.3.4 port 22 ssh2'"""
    return line.split("from ")[1].split()[0]

ipAddresses = rootData.map(splitville)
counts = ipAddresses.map(lambda x: (x, 1)).reduceByKey(add)
output = counts.collect()
for (word, count) in output:
    print "%s: %i" % (word, count)
counts.saveAsTextFile(outFile)

https://github.com/bigsnarfdude/PythonSystemAdminTools/blob/master/SimpleApp.py

Scala and Algebird example in REPL
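
To follow along you need algebird-core on the REPL classpath. A minimal build.sbt for sbt console might look like this; the Scala and Algebird versions are assumptions, so use whatever is current:

scalaVersion := "2.10.4"

libraryDependencies += "com.twitter" %% "algebird-core" % "0.5.0"

Run sbt console and paste the snippets below.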


// Scala word-count example
import scala.io.Source
val lines = Source.fromFile("README.md").getLines.toArray
val words = lines.flatMap(line => line.split("\\s+"))   // split each line into words
val emptyCounts = Map[String, Int]().withDefaultValue(0)
words.length
val counts = words.foldLeft(emptyCounts) {
  (currentCounts: Map[String, Int], word: String) => currentCounts.updated(word, currentCounts(word) + 1)
}



// Algebird HyperLogLog: approximate distinct count
import com.twitter.algebird._
import HyperLogLog._                  // brings in the implicit Int => Array[Byte] conversion
val hll = new HyperLogLogMonoid(4)    // 4 bits of precision: a deliberately coarse estimator
val data = List(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)
val seqHll = data.map { hll(_) }
val sumHll = hll.sum(seqHll)
val approxSizeOf = hll.sizeOf(sumHll)
val actualSize = data.toSet.size      // 5 distinct values
val estimate = approxSizeOf.estimate

//algebird bloomfilter
import com.twitter.algebird._
val NUM_HASHES = 6
val WIDTH = 32
val SEED = 1
val bfMonoid = new BloomFilterMonoid(NUM_HASHES, WIDTH, SEED)
val bf = bfMonoid.create("1", "2", "3", "4", "100")
val approxBool = bf.contains("1")
val res = approxBool.isTrue

//algebird countMinSketch
import com.twitter.algebird._
val DELTA = 1E-10
val EPS = 0.001
val SEED = 1
val CMS_MONOID = new CountMinSketchMonoid(EPS, DELTA, SEED)
val data = List(1L, 1L, 3L, 4L, 5L)
val cms = CMS_MONOID.create(data)
cms.totalCount
cms.frequency(1L).estimate
cms.frequency(2L).estimate
cms.frequency(3L).estimate
val data = List("1", "2", "3", "4", "5")
val cms = CMS_MONOID.create(data)

//sketch map
import com.twitter.algebird._
val DELTA = 1E-8
val EPS = 0.001
val SEED = 1
val HEAVY_HITTERS_COUNT = 10

implicit def string2Bytes(i : String) = i.toCharArray.map(_.toByte)


val PARAMS = SketchMapParams[String](SEED, EPS, DELTA, HEAVY_HITTERS_COUNT)
val MONOID = SketchMap.monoid[String, Long](PARAMS)
val data = List( ("1", 1L), ("3", 2L), ("4", 1L), ("5", 1L) )
val sm = MONOID.create(data) 
sm.totalValue
MONOID.frequency(sm, "1")
MONOID.frequency(sm, "2")
MONOID.frequency(sm, "3")

https://github.com/twitter/algebird/wiki/Algebird-Examples-with-REPL