BigSnarf blog

Infosec FTW

Monthly Archives: March 2013

ipythonblocks for teaching demos and visualizations



Python implementation of Hyperloglog, redis, fuzzy hashing for malware detection


I was thinking there must be a way to use HyperLogLog with fuzzy hash sets of malware, rolling over blocks of 32 or 64 bytes, then stick that into a Redis cluster for persistence so objects can be fully analyzed against rolling hashes of known malware hash sets. There would be some error, but it would give you a quick answer as to whether a sample binary is a fuzzy match for an identified malware hash set in your datastore. You could also use this to identify copies of a “Top Secret” document on various systems. I’m having a brain block this morning on determining the inclusion and exclusion comparisons. (Either need more coffee or sleep)

Update: I guess simple lists and sets work directly out of the box in Redis, so no need for HLL yet. Plus I’m still stuck trying to figure out intersections. I’m not sure what a 1,000,000-item list looks like in memory for set comparisons, but I think they will get garbage collected.
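The inclusion/exclusion comparison I’m stuck on can be roughed out in plain Python before wiring in Redis: hash fixed-size blocks of each file into a set, then intersect the sample’s set with the known-malware set. Everything below (function names, MD5 over 32-byte blocks, the 0.5 threshold) is just my own illustration, not a real detection tool:

```python
import hashlib

def chunk_hashes(data, block_size=32):
    """Hash fixed, non-overlapping blocks of `data` (bytes) into a set."""
    return {
        hashlib.md5(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    }

def fuzzy_match(sample, known_hashes, threshold=0.5):
    """Inclusion test: fraction of the sample's block hashes already known."""
    sample_hashes = chunk_hashes(sample)
    if not sample_hashes:
        return False
    overlap = len(sample_hashes & known_hashes)
    return overlap / len(sample_hashes) >= threshold

# Toy example: the "known malware" is three 32-byte blocks; the sample
# shares two of them, so 2/3 of its blocks match and it passes the test.
known = chunk_hashes(b"A" * 32 + b"B" * 32 + b"C" * 32)
print(fuzzy_match(b"A" * 32 + b"B" * 32 + b"X" * 32, known))  # True
```

In Redis the same shape falls out of SADD per block hash and SINTER (or SINTERCARD) between the sample’s set and each stored hash set.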

Here’s my code:

Determining Inclusion and Exclusion – WIP


Fuzzy Hashing

Redis Stuff 

WIP (need to build using sliding window hash aka. “rolling hash”)

>>> import redis
>>> r = redis.Redis(...)
>>> r.set('bing', 'baz')
>>> # Use the pipeline() method to create a pipeline instance
>>> pipe = r.pipeline()
>>> # The following SET commands are buffered
>>> pipe.set('foo', 'bar')
>>> pipe.get('bing')
>>> # the EXECUTE call sends all buffered commands to the server, returning
>>> # a list of responses, one for each command.
>>> pipe.execute()
[True, 'baz']
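The sliding-window (“rolling”) hash flagged as WIP above could look roughly like this Rabin-Karp-style sketch, where each new hash is computed from the previous one by dropping the byte leaving the window and adding the incoming byte. The constants and function name are my own illustration:

```python
def rolling_hashes(data, window=4, base=256, mod=1_000_000_007):
    """Yield one hash per `window`-byte sliding window of `data` (bytes)."""
    if len(data) < window:
        return []
    high = pow(base, window - 1, mod)   # weight of the byte leaving the window
    h = 0
    for b in data[:window]:             # hash the first window directly
        h = (h * base + b) % mod
    hashes = [h]
    for i in range(window, len(data)):
        h = (h - data[i - window] * high) % mod  # drop the outgoing byte
        h = (h * base + data[i]) % mod           # shift in the incoming byte
        hashes.append(h)
    return hashes

# Identical windows hash identically wherever they occur:
hs = rolling_hashes(b"abcdabcd", window=4)
print(hs[0] == hs[4])  # True: both windows are b"abcd"
```

Each update is O(1) regardless of window size, which is what makes hashing every 32- or 64-byte window of a large binary feasible.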

Processing weblogs with iPython Notebook and pandas

Visualizing time series data and analysis with iPython Notebooks

Cross Validation


Cross-validation, sometimes called rotation estimation,[1][2][3] is a technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice. One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set). To reduce variability, multiple rounds of cross-validation are performed using different partitions, and the validation results are averaged over the rounds.
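The rounds described above can be sketched in a few lines of plain Python. This toy k-fold version uses a mean-only “model” so the whole thing is self-contained; the function name and predictor are my own illustration (in practice you’d reach for scikit-learn’s cross-validation helpers):

```python
def k_fold_cv(ys, k=5):
    """Average validation MSE over k complementary train/validate partitions."""
    n = len(ys)
    fold = n // k
    errors = []
    for i in range(k):
        lo = i * fold
        hi = (i + 1) * fold if i < k - 1 else n
        train = ys[:lo] + ys[hi:]           # training set for this round
        mean = sum(train) / len(train)      # "fit" a mean-only model
        val = ys[lo:hi]                     # held-out validation set
        errors.append(sum((y - mean) ** 2 for y in val) / len(val))
    return sum(errors) / len(errors)        # average the rounds
```

Averaging over all k rounds is what reduces the variability a single train/test split would leave you with.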

d3py takes the awesomeness of d3 and bridges it to Python bindings

Use Tableau Public to visualize your security data – SSH Passwords

d3.js for Attacker Reports

Classifying tweets with scikit-learn and nltk

Learning about Fit and Predict – Machine Learning with scikit-learn