BigSnarf blog

Infosec FTW

Sorry honey, not tonight I’ve gotta get this Apache Spark Fat jar compiled and shipped

Linkedin and Resume Stuff


Software Developer and Data Engineer

Set up Apache Spark SQL loading from S3, CSV, and JSON. Set up an IPython Notebook server. Set up an Elasticsearch server to quickly explore social network analysis profile data and social share data, and to identify and report user behaviors.

Implemented and shipped a streaming data pipeline from social firehoses and built an AWS data pipeline consisting of AWS Kinesis, Amazon S3, and an EC2-based Apache Spark cluster.

Implemented ETL processes for a Neo4j graph database, made the data searchable in Elasticsearch, and demonstrated it to end users in a custom web app.

Performed ad hoc data exploration and statistical analyses using IPython Notebooks and Apache Spark SQL.

All software and algorithms were developed in Python, Java, and Scala. Servers were deployed using Ansible, and code was deployed automatically to AWS via CodeShip.

I build a variety of data products that use models and machine learning in support of the group’s mission. My focus is on modelling customer behaviour using data-driven approaches.

Building Real-time Analytics


What I’m doing right now

I’m building production data pipelines to power customer-facing analytics dashboards. I built a high-performance, scalable analytics infrastructure using Amazon Kinesis, Akka (Scala), Apache Spark SQL, DStreams, Elasticsearch, and Amazon RDS/DynamoDB that can process data results in real time.
Sometimes I build dashboard prototypes using HTML5, CSS3, JavaScript, jQuery, Django, Highcharts, and D3.js.


Other places where I’ve been

I’m a gun slinger turned code slinger. Writing stuff that humans & computers can read.
Worked at RCMP, CIBC, and Deloitte. @TechStars Chicago 2013. Proud @HackerSchool alumnus Winter 2012.

Things that keep me up at night

I’m interested in identifying threats to security with data science. I’ve been using Spark DataFrames in my IPython Notebooks to interactively explore the datasets, and I’m using connected graphs to help identify bad actors.
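To illustrate the connected-graph idea, here is a minimal sketch (not the actual pipeline code): entities that share an attribute become edges, and union-find collapses them into clusters, so one confirmed bad actor can flag its whole cluster for review. The node names and edge format are hypothetical.

```python
def connected_components(edges):
    """Union-find over an edge list. Accounts or hosts that share any
    link (an IP, a device, a payment token, ...) end up in the same
    component, which is the cluster we review together."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in edges:
        union(a, b)

    components = {}
    for node in set(parent):
        components.setdefault(find(node), set()).add(node)
    return list(components.values())
```

A link analyst would feed this the edge list extracted from logs and then rank components by how many known-bad nodes they contain.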

With a background that includes fraud detection, data loss prevention, network security, and computer forensics, I apply data mining and analytics to enterprise IT operations and security.

I help process noisy signals and combine them with rich data sources … turning weak attack signals into actionable security insights using Spark.

Things that I’m passionate about

* Reading academic papers
* Reading the citations
* Building prototypes
* Testing with random data
* Verifying with production data
* Rebuilding it for production
* Tweeting about it
* Blogging about it
* Sharing code on my github or bitbucket
* Contributing to open source projects
* Speaking about it

My Code

Streaming Prototype

Apache Spark Use Cases

Our specific use case


Kinesis gets raw logs


Spark Streaming does the counting

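As a rough sketch of what the counting step does (the actual Spark Streaming job isn't shown here): each micro-batch of raw log events is reduced to per-key counts, which are merged into a running aggregate — the role `updateStateByKey` plays in a DStream job. The `page` field is a made-up example key.

```python
from collections import Counter

def count_batches(batches):
    """Simulate micro-batch counting: reduce each batch of log events
    to per-key counts and merge them into a running total."""
    totals = Counter()
    per_batch = []
    for batch in batches:
        counts = Counter(event["page"] for event in batch)  # hypothetical field
        totals.update(counts)          # merge into the running aggregate
        per_batch.append(dict(counts))
    return per_batch, dict(totals)
```

In the real pipeline the per-batch counts come out of Spark Streaming and the running totals land in DynamoDB.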

Two tables created: one for the Kinesis log position, the second for aggregates


DynamoDB stores the aggregations


Building Custom Queries, Grouping, Aggregators and Filters for Apache Spark

Query Metrics

Returns a list of metric values based on a set of criteria. Also returns a set of all tag names and values that are found across the data points.

The time range can be specified with absolute or relative time values. Absolute time values are in milliseconds. Relative time values are specified as an integer duration and a unit. Possible unit values are “milliseconds”, “seconds”, “minutes”, “hours”, “days”, “weeks”, “months”, and “years”. For example, “5 hours” means that metric values submitted within the last 5 hours will be returned. The end time is optional. If no end time is specified, the end time is assumed to be now (the current date and time).
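A minimal sketch of how a relative time value could be resolved into an absolute start time (the function name is mine, and months/years are crudely approximated as 30 and 365 days purely for illustration — a real implementation would use calendar math):

```python
from datetime import datetime, timedelta

# Units that map directly onto timedelta keyword arguments.
_UNITS = {"milliseconds", "seconds", "minutes", "hours", "days", "weeks"}

def resolve_start_time(value, unit, now=None):
    """Turn a relative time like (5, "hours") into an absolute datetime."""
    now = now or datetime.utcnow()
    if unit == "months":
        return now - timedelta(days=30 * value)   # approximation
    if unit == "years":
        return now - timedelta(days=365 * value)  # approximation
    if unit not in _UNITS:
        raise ValueError("unknown unit: %s" % unit)
    return now - timedelta(**{unit: value})
```

With no end time supplied, the query window is simply `[resolve_start_time(...), now]`.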


The results of the query can be grouped together. There are three ways to group the data: by tags, by a time range, and by value. Grouping is done with groupBy or groupByKey, which takes an array of one or more groupers.
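The tag grouper is the simplest of the three; a sketch of what it does (data-point shape and field names are assumed for illustration):

```python
from collections import defaultdict

def group_by_tag(points, tag_name):
    """Group data points by the value of one tag.

    Each point is assumed to look like
    {"timestamp": 1432400000, "value": 12, "tags": {"host": "web-1"}}.
    """
    groups = defaultdict(list)
    for point in points:
        groups[point["tags"].get(tag_name)].append(point)
    return dict(groups)
```

Grouping by time range or by value works the same way, except the bucket key is computed from the timestamp or the value instead of a tag.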


Aggregators perform an operation on data points and downsample them. For example, you could sum all data points that fall within 5-minute periods.

Aggregators can be combined together. For example, you could sum all data points in 5-minute periods and then average those sums over a week.
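The two examples above can be sketched in plain Python (function names and the `(timestamp_ms, value)` point format are mine, not the query API's):

```python
def downsample(points, period_ms, fn):
    """Bucket (timestamp_ms, value) pairs into fixed-width periods and
    aggregate each bucket with fn, returning (bucket_start, result) pairs."""
    buckets = {}
    for ts, value in points:
        start = ts - ts % period_ms          # align to the period boundary
        buckets.setdefault(start, []).append(value)
    return sorted((start, fn(vals)) for start, vals in buckets.items())

FIVE_MIN = 5 * 60 * 1000
ONE_WEEK = 7 * 24 * 60 * 60 * 1000

def sum_then_weekly_average(points):
    # First aggregator: sum within 5-minute periods.
    sums = downsample(points, FIVE_MIN, sum)
    # Second aggregator: average those 5-minute sums over week periods.
    return downsample(sums, ONE_WEEK, lambda vals: sum(vals) / len(vals))
```

Chaining works because each aggregator consumes and produces the same point shape, so its output can feed the next one.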


It is possible to filter the data returned by specifying a tag. The data returned will only contain data points associated with the specified tag. Filtering is done using the “tags” property.
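A sketch of the tag filter, under the same assumed point shape as above (the function name is mine):

```python
def filter_by_tags(points, tags):
    """Keep only the data points whose tags match every requested
    tag name/value pair -- the behaviour of the "tags" property."""
    return [
        p for p in points
        if all(p["tags"].get(name) == value for name, value in tags.items())
    ]
```

Filtering happens before grouping and aggregation, so the buckets only ever contain matching points.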


Netflix Security tool – FIDO


FIDO is an orchestration layer that automates the incident response process by evaluating, assessing and responding to malware and other detected threats.

Tracking attackers using heatmap visualization with Google Maps v3 and Heatmap Layer


Amazon introduces ML service

Face Detection Raspberry Pi 2 Day

DataFrames meet Apache Spark 1.3

Spark Scala Notebook, incubating at Apache (video)

