BigSnarf blog

Infosec FTW

Monthly Archives: March 2013

What side you on? Blue Team or Red Team? OSS Security Distros

REMnux < SIFT Kit < Security Onion < IPCop > Samurai WTF > BackTrack > Kali

Dude where’s my naive bayes?

A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes’ theorem with strong (naive) independence assumptions. A more descriptive term for the underlying probability model would be “independent feature model”. An overview of statistical classifiers is given in the Wikipedia article on pattern recognition.

http://en.wikipedia.org/wiki/Naive_Bayes_classifier

https://github.com/bigsnarfdude/machineLearning/blob/master/mason_vs_sklearn_naive_bayes.py
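
To make the independence assumption concrete, here is a minimal sketch of a multinomial naive Bayes classifier in plain Python. It is not the code from the linked repository. The log lines, the "attack" and "normal" labels, and the function names are all made up for illustration, and add-one (Laplace) smoothing is an assumption chosen to keep unseen tokens from zeroing out a score.

```python
from collections import Counter, defaultdict
from math import log


def train(samples):
    """Count labels and per-label tokens from (tokens, label) pairs."""
    label_counts = Counter()
    token_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in samples:
        label_counts[label] += 1
        token_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, token_counts, vocab


def predict(model, tokens):
    """Return the label with the highest log posterior for the tokens."""
    label_counts, token_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n in label_counts.items():
        # log prior P(label)
        score = log(n / total)
        # "naive" step: every token contributes independently,
        # with add-one smoothing over the vocabulary
        denom = sum(token_counts[label].values()) + len(vocab)
        for token in tokens:
            score += log((token_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data: tokenized auth log lines with made-up labels
samples = [
    ("failed password for root".split(), "attack"),
    ("failed password for admin".split(), "attack"),
    ("accepted password for alice".split(), "normal"),
    ("session opened for alice".split(), "normal"),
]
model = train(samples)
print(predict(model, "failed password for guest".split()))
print(predict(model, "session opened for alice".split()))
```

Working in log space avoids underflow when many small probabilities are multiplied together.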

Processing million-line logs with IPython Notebooks

Twitter Big Data Infrastructure – Redis, Lucene, Hadoop

http://engineering.twitter.com/2012/08/visualizing-hadoop-with-hdfs-du.html

http://vldb.org/pvldb/vol5/p1771_georgelee_vldb2012.pdf

http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Kolcz_SIGMOD2012.pdf

http://blogs.ischool.berkeley.edu/i290-abdt-s12/files/2012/08/UCBTwitter_Course_Intro_Aug23_20121.pdf
http://blogs.ischool.berkeley.edu/i290-abdt-s12/files/2012/08/UCB_TwitterIntro2_Aug23_2012.pdf
http://people.ischool.berkeley.edu/~hearst/talks/raffi-krikorian-uc-berkeley-2012.08.27.pdf
http://blogs.ischool.berkeley.edu/i290-abdt-s12/files/2012/08/laraki_ucb_twitter_course_aug28.pdf
http://blogs.ischool.berkeley.edu/i290-abdt-s12/files/2012/08/BillGraham_IntroToHadoop_Aug30.pdf
http://blogs.ischool.berkeley.edu/i290-abdt-s12/files/2012/08/coveney_pig_lecture.pdf
http://people.ischool.berkeley.edu/~hearst/twitter_lectures/snow_twitter_api_sept_11.pdf
http://blogs.ischool.berkeley.edu/i290-abdt-s12/files/2012/08/Kostas_Trends_Sept_13_2012.pdf
http://blogs.ischool.berkeley.edu/i290-abdt-s12/files/2012/08/Berkeley-Twitter-Class-09.25.2012-final.pdf
http://blogs.ischool.berkeley.edu/i290-abdt-s12/files/2012/08/sharma_twitter_graphs.pdf
http://blogs.ischool.berkeley.edu/i290-abdt-s12/files/2012/08/gonzalez_biglearning_with_graphs.pdf
http://blogs.ischool.berkeley.edu/i290-abdt-s12/files/2012/08/alpa_twitter_recommenders1.pdf
http://blogs.ischool.berkeley.edu/i290-abdt-s12/files/2012/10/thomas_security_twitter.pdf
http://blogs.ischool.berkeley.edu/i290-abdt-s12/files/2012/10/stan_diffusion_twitter.pdf
http://blogs.ischool.berkeley.edu/i290-abdt-s12/files/2012/11/twitter_scalding.pdf
http://blogs.ischool.berkeley.edu/i290-abdt-s12/files/2012/11/spark_twitter.pdf
http://blogs.ischool.berkeley.edu/i290-abdt-s12/files/2012/08/final_lecture.pdf
http://blogs.ischool.berkeley.edu/i290-abdt-s12/files/2012/11/A3_review.pdf

IPython Notebooks with Redis to store lists, sets, and Python objects using pickle
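
As a rough sketch of the pickle part: an object is serialized to bytes, written under a key, and unpickled on the way back out. The helper names, the key, and the sample data are hypothetical. The in-memory client is only a stand-in so the snippet runs without a server; it mimics the `set` and `get` calls that redis-py's `redis.Redis` client provides.

```python
import pickle


class InMemoryClient:
    """Stand-in exposing only set/get, so the sketch runs without a Redis server."""

    def __init__(self):
        self._store = {}

    def set(self, key, value):
        self._store[key] = value
        return True

    def get(self, key):
        return self._store.get(key)


def save_object(client, key, obj):
    """Pickle any Python object and store the resulting bytes under key."""
    client.set(key, pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


def load_object(client, key):
    """Fetch the bytes stored under key and unpickle them; None if the key is missing."""
    raw = client.get(key)
    return None if raw is None else pickle.loads(raw)


# Against a real server this would be something like
# redis.Redis(host="localhost", port=6379) from the redis-py package.
client = InMemoryClient()

# Hypothetical notebook state worth keeping between sessions
top_talkers = {"10.0.0.5": 1432, "10.0.0.9": 877}
save_object(client, "top_talkers", top_talkers)
print(load_object(client, "top_talkers"))
```

Only unpickle data you wrote yourself: `pickle.loads` can execute arbitrary code, so a Redis instance that others can write to should not be trusted as a pickle source.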

Python HyperLogLog and web-scale counters

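from mmhash import mmhash
from math import log
from zlib import compress
from base64 import b64encode


class HyperLogLog:
    def __init__(self, log2m):
        # m = 2**log2m registers, each holding one small rank value
        self.log2m = log2m
        self.m = 1 << log2m
        self.data = [0] * self.m
        # alpha_m * m^2; this form of alpha_m is meant for m >= 128
        self.alphaMM = (0.7213 / (1 + 1.079 / self.m)) * self.m * self.m

    def offer(self, o):
        # 32-bit hash: the top log2m bits pick the register,
        # the remaining bits determine the rank
        x = mmhash(str(o), 0) & 0xFFFFFFFF
        a, b = 32 - self.log2m, self.log2m
        i = x >> a
        v = self._bitscan(x << b, a)
        self.data[i] = max(self.data[i], v)

    def count(self):
        estimate = self.alphaMM / sum([2.0 ** -v for v in self.data])
        zeros = self.data.count(0)
        if estimate <= 2.5 * self.m and zeros > 0:
            # small-range correction (linear counting);
            # skipped when no register is empty, which would mean log(0)
            return round(-self.m * log(float(zeros) / self.m))
        return round(estimate)

    def _bitscan(self, x, m):
        # 1-based position of the first set bit, scanning down from bit 31;
        # capped at m + 1 when none of the m bits is set
        v = 1
        while v <= m and not x & 0x80000000:
            v += 1
            x <<= 1
        return v

    def datastr(self):
        # registers packed as bytes, zlib-compressed, then base64-encoded
        return b64encode(compress(bytes(bytearray(self.data)), 9))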
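
The register update in `offer` is easier to follow with the bit-twiddling pulled out on its own. The sketch below is a standalone restatement, not the class above: it swaps `mmhash` for the first four bytes of a SHA-256 digest (an assumption, chosen only so it runs on the standard library), and the function names and sample IP addresses are made up.

```python
import hashlib
from math import log


def hash32(value):
    """Stand-in 32-bit hash: first 4 bytes of SHA-256 (an assumption, not mmhash)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def split_hash(x, log2m):
    """Split a 32-bit hash into (register index, rank)."""
    a = 32 - log2m
    index = x >> a                    # top log2m bits
    rest = x & ((1 << a) - 1)         # remaining a bits
    rank = a - rest.bit_length() + 1  # leading zeros in those bits, plus one
    return index, rank


def estimate_distinct(values, log2m=10):
    """Estimate the number of distinct values with 2**log2m registers."""
    m = 1 << log2m
    registers = [0] * m
    for value in values:
        index, rank = split_hash(hash32(value), log2m)
        registers[index] = max(registers[index], rank)
    raw = (0.7213 / (1 + 1.079 / m)) * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros > 0:
        # small-range correction (linear counting)
        return round(-m * log(zeros / m))
    return round(raw)


# Hypothetical input: 10,000 distinct source addresses
ips = ["10.0.%d.%d" % (i // 256, i % 256) for i in range(10000)]
print(estimate_distinct(ips))
# Seeing every address twice leaves the registers, and the estimate, unchanged
print(estimate_distinct(ips + ips))
```

With 1,024 registers the expected standard error is about 1.04 / sqrt(1024), roughly 3 percent, which is the trade-off that makes the structure useful for web-scale counters: memory stays fixed no matter how many items are offered.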

Stacked bar charts work better to tell the story – barh not enough by itself

Bar chart reporting presents data, but on its own it doesn’t provide context

A stacked bar chart shows volume comparisons, but exact counts can be difficult to gauge

Switching to a side-by-side bar chart makes the individual counts easier to read

http://nbviewer.ipython.org/urls/raw.github.com/bigsnarfdude/bsides_vancouver_2013/master/05-TimeSeriesReview.ipynb
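
The difference between the two layouts comes down to where each bar is placed. The sketch below is not taken from the linked notebook. It computes the `bottom` offsets for a stacked chart and the x positions for a side-by-side chart in plain Python, then hands them to matplotlib's `Axes.bar`. The series names, counts, and helper names are invented, and matplotlib is imported only inside `plot` so the layout helpers run without it.

```python
def stacked_layout(series):
    """For each series, the height at which its bars start (sum of the series before it)."""
    n = len(next(iter(series.values())))
    running = [0] * n
    bottoms = {}
    for name, counts in series.items():
        bottoms[name] = list(running)
        running = [r + c for r, c in zip(running, counts)]
    return bottoms


def grouped_layout(names, n_categories, group_width=0.8):
    """For each series, the x centers of its bars within each category group."""
    width = group_width / len(names)
    centers = {
        name: [i - group_width / 2 + (k + 0.5) * width for i in range(n_categories)]
        for k, name in enumerate(names)
    }
    return centers, width


def plot(categories, series):
    """Draw the stacked and side-by-side versions next to each other."""
    import matplotlib.pyplot as plt  # assumed to be installed

    x = list(range(len(categories)))
    bottoms = stacked_layout(series)
    centers, width = grouped_layout(list(series), len(categories))
    fig, (ax_stacked, ax_grouped) = plt.subplots(1, 2, figsize=(10, 4))
    for name, counts in series.items():
        ax_stacked.bar(x, counts, bottom=bottoms[name], label=name)
        ax_grouped.bar(centers[name], counts, width=width, label=name)
    for ax in (ax_stacked, ax_grouped):
        ax.set_xticks(x)
        ax.set_xticklabels(categories)
        ax.legend()
    return fig


# Hypothetical firewall counts per day
categories = ["Mon", "Tue"]
series = {"allowed": [5, 3], "blocked": [2, 4]}
print(stacked_layout(series))
print(grouped_layout(list(series), len(categories)))
```

In the stacked version only the first series sits on the zero baseline, which is why the counts of the later series are hard to read off. The side-by-side version puts every bar on the baseline at the cost of showing the totals less directly.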

Consume the data you have with tools that answer most of your questions quickly

Facebook has committed to democratizing its data for users. In the blog post linked below, the author details his experience getting data to the masses.

http://vizwiz.blogspot.ca/2013/03/how-we-built-tableau-tribe-at-facebook.html

Big Data Mind Map – Interesting

d3.js mixedtape tutorials – creators gotta create