BigSnarf blog

Infosec FTW

Monthly Archives: April 2014

Using Spark to do real-time large scale log analysis

Spark on IPython Notebook used for analytic workflows of the auth.log.

What is really cool about the Spark platform is that I can either batch process or data mine the whole dataset on the cluster. In the screenshot below, you can see me using my IPython Notebook for interactive queries. All the code I create to investigate auth.log can easily be converted to Spark Streaming DStream objects in Scala, so effectively I can build a real-time application on the same platform. "Cool" IMHO. In other posts I will document PySpark batch processing and DStream algorithms in Scala for auth.log. These are some of the items I am filtering for in my PySpark interactive queries in the Notebook:

Successful user login “Accepted password”, “Accepted publickey”, “session opened”
Failed user login “authentication failure”, “failed password”
User log-off “session closed”
User account change or deletion “password changed”, “new user”, “delete user”
Sudo actions “sudo: … COMMAND=…” “FAILED su”
Service failure “failed” or “failure”
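As a plain-Python sketch of those filters (the `classify_line` helper and the group names are my own, for illustration, not part of PySpark):

```python
# Pattern strings from the list above, grouped by event type.
# The group names and the helper are my own, for illustration only.
PATTERNS = {
    "successful_login": ["Accepted password", "Accepted publickey", "session opened"],
    "failed_login": ["authentication failure", "Failed password", "failed password"],
    "logoff": ["session closed"],
    "account_change": ["password changed", "new user", "delete user"],
    "sudo_action": ["COMMAND=", "FAILED su"],
    "service_failure": ["failed", "failure"],
}

def classify_line(line):
    """Return the first event type whose patterns match an auth.log line, or None."""
    for event, needles in PATTERNS.items():
        if any(needle in line for needle in needles):
            return event
    return None
```

In the Notebook these become `filter()` predicates on the RDD. Note the dictionary order matters: the generic `failed`/`failure` patterns have to come last, or they would swallow the login failures.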

Note that one IP address is making a ton of invalid requests. Maybe I should correlate with the web server request logs too? Things worth flagging there:

Excessive access attempts to non-existent files
Code (SQL, HTML) seen as part of the URL
Access to extensions you have not implemented
Web service stopped/started/failed messages
Access to “risky” pages that accept user input
Look at logs on all servers in the load balancer pool
Error code 200 on files that are not yours
Failed user authentication Error code 401, 403
Invalid request Error code 400
Internal server error Error code 500
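A minimal sketch of the status-code checks from that list, assuming combined log format where the status code is the second-to-last field (the helper name is mine):

```python
# Status codes from the checklist above, mapped to a human-readable reason.
SUSPICIOUS_STATUS = {
    "400": "invalid request",
    "401": "failed user authentication",
    "403": "failed user authentication",
    "500": "internal server error",
}

def flag_request(access_log_line):
    """Return a reason string if the web access-log line matches a flagged status code."""
    parts = access_log_line.split()
    if len(parts) < 2:
        return None
    status = parts[-2]  # combined log format: ... "GET /x HTTP/1.1" 401 234
    return SUSPICIOUS_STATUS.get(status)
```

The "error code 200 on files that are not yours" check needs a whitelist of known paths, so it is not in the sketch.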

Here is the data correlated to successful logins, failed logins, and failed logins to an invalid user. Notice the "" IP address. Suspicious.




Python Spark processing auth.log file


# auth.log analysis with Spark
import sys
from operator import add
from time import strftime
from pyspark import SparkContext

outFile = "counts" + strftime("%Y-%m-%d")
logFile = "auth.log"
destination = "local"
appName = "Simple App"

sc = SparkContext(destination, appName)
logData = sc.textFile(logFile).cache()
failedData = logData.filter(lambda x: 'Failed' in x)
rootData = failedData.filter(lambda s: 'root' in s)

def splitville(line):
 return line.split("from ")[1].split()[0]

ipAddresses =
counts = x: (x, 1)).reduceByKey(add)
output = counts.collect()
for (word, count) in output:
 print "%s: %i" % (word, count)

Scala and Algebird example in REPL

//scala wordcount example
import
val lines = Source.fromFile("").getLines.toArray
val words = lines.flatMap(_.split(" "))
val emptyCounts = Map[String,Int]().withDefaultValue(0)
val counts = words.foldLeft(emptyCounts)({(currentCounts: Map[String,Int], word: String) => currentCounts.updated(word, currentCounts(word) + 1)})

//algebird hyperloglog
import com.twitter.algebird._
import HyperLogLog._
val hll = new HyperLogLogMonoid(4)
val data = List(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)
val seqHll = { hll(_) }
val sumHll = hll.sum(seqHll)
val approxSizeOf = hll.sizeOf(sumHll)
val actualSize = data.toSet.size
val estimate = approxSizeOf.estimate
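Worth noting what that `HyperLogLogMonoid(4)` buys you: 4 bits of register addressing means 2^4 = 16 registers, and HyperLogLog's standard error is about 1.04/sqrt(m), so the estimate is only accurate to roughly 26%. A quick back-of-the-envelope check in Python:

```python
import math

bits = 4                          # the argument passed to HyperLogLogMonoid above
m = 2 ** bits                     # number of registers
std_error = 1.04 / math.sqrt(m)   # HyperLogLog's published error bound
print(std_error)                  # 0.26, i.e. ~26% relative error
```

For real log volumes you would pick more bits, e.g. 12 bits gives 4096 registers and roughly 1.6% error.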

//algebird bloomfilter
import com.twitter.algebird._
val NUM_HASHES = 6
val WIDTH = 32
val SEED = 1
val bfMonoid = new BloomFilterMonoid(NUM_HASHES, WIDTH, SEED)
val bf = bfMonoid.create("1", "2", "3", "4", "100")
val approxBool = bf.contains("1")
val res = approxBool.isTrue
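To make the Bloom filter semantics concrete, here is a toy pure-Python equivalent (my own sketch, not Algebird's implementation; it only mirrors the add/contains behaviour):

```python
import hashlib

class ToyBloomFilter:
    """Minimal Bloom filter: k hash positions per item over a bit array."""
    def __init__(self, num_hashes, width):
        self.num_hashes = num_hashes
        self.width = width
        self.bits = [False] * width

    def _positions(self, item):
        # Derive k positions by salting a cryptographic hash; real
        # implementations use cheaper pairwise-independent hashes.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.width

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def contains(self, item):
        # May return a false positive, never a false negative.
        return all(self.bits[pos] for pos in self._positions(item))

bf = ToyBloomFilter(num_hashes=6, width=32)
for item in ["1", "2", "3", "4", "100"]:
    bf.add(item)
```

With only 32 bits and 6 hashes per item this toy filter saturates almost immediately, which is exactly why sizing the width to the expected item count matters in practice.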

//algebird countMinSketch
import com.twitter.algebird._
val DELTA = 1E-10
val EPS = 0.001
val SEED = 1
val CMS_MONOID = new CountMinSketchMonoid(EPS, DELTA, SEED)
val data = List(1L, 1L, 3L, 4L, 5L)
val cms = CMS_MONOID.create(data)
val approxCount = cms.frequency(1L).estimate
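And a toy count-min sketch in the same spirit (again my own sketch, not Algebird's; in the real monoid the width and depth are derived from EPS and DELTA, roughly width ≈ e/EPS and depth ≈ ln(1/DELTA)):

```python
import hashlib

class ToyCountMinSketch:
    """Minimal count-min sketch: depth rows of width counters."""
    def __init__(self, depth, width):
        self.depth = depth
        self.width = width
        self.table = [[0] * width for _ in range(depth)]

    def _pos(self, row, item):
        digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item):
        for row in range(self.depth):
            self.table[row][self._pos(row, item)] += 1

    def frequency(self, item):
        # Overestimates on hash collisions, never underestimates.
        return min(self.table[row][self._pos(row, item)] for row in range(self.depth))

cms = ToyCountMinSketch(depth=5, width=272)
for item in [1, 1, 3, 4, 5]:
    cms.add(item)
```

Taking the minimum across rows is what bounds the overestimate: a collision has to happen in every row to inflate the answer.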

//sketch map
import com.twitter.algebird._
val DELTA = 1E-8
val EPS = 0.001
val SEED = 1

implicit def string2Bytes(i : String) = i.getBytes

val PARAMS = SketchMapParams[String](SEED, EPS, DELTA, 10)
val MONOID = SketchMap.monoid[String, Long](PARAMS)
val data = List( ("1", 1L), ("3", 2L), ("4", 1L), ("5", 1L) )
val sm = MONOID.create(data) 
MONOID.frequency(sm, "1")
MONOID.frequency(sm, "2")
MONOID.frequency(sm, "3")

Spark Streaming Word Count from Network Socket in Scala

package com.wordpress.bigsnarf.spark.streaming.examples

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

/**
 * Counts words in text encoded with UTF8 received from the network every second.
 */

object NetworkWordCount {
 def main(args: Array[String]) {

 if (args.length < 3) {
   System.err.println("Usage: NetworkWordCount <master> <hostname> <port>\n" +
   "In local mode, <master> should be 'local[n]' with n > 1")
   System.exit(1)
 }

 // Spark Context
 val ssc = new StreamingContext(args(0), "NetworkWordCount", Seconds(1))

 // Socket takes Text Stream
 val lines = ssc.socketTextStream(args(1), args(2).toInt)
 // split words
 val words = lines.flatMap(_.split(" "))

 // count the words tuple -> reduce
 val wordCounts = => (x, 1)).reduceByKey(_ + _)

 // print to screen
 wordCounts.print()

 ssc.start()
 ssc.awaitTermination()
 }
}
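Stripped of Spark, each micro-batch of this DStream pipeline is just flatMap → map → reduceByKey; a no-cluster Python sketch of a single batch (the helper name is mine):

```python
from collections import Counter

def count_batch(lines):
    """One micro-batch of the streaming word count: flatMap -> map -> reduceByKey."""
    words = (word for line in lines for word in line.split(" "))
    return Counter(words)

# One second's worth of lines off the socket, invented for illustration.
batch = ["spark streaming word count", "word count"]
counts = count_batch(batch)
```

The streaming job simply re-runs this per batch interval, which is why the same filter logic from the Notebook carries over so directly.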