BigSnarf blog

Infosec FTW

Monthly Archives: April 2012

Using Data Mining for Malware Analysis

I came across this tweet and followed the link.  Data mining can serve as another method for malware analysis.  A few anti-malware vendors already use data mining.  Some forensic practitioners use data mining and visual link analysis to identify systems compromised with malware.  Data mining has its place as we move to larger datasets.

Read more here:

Currently Reading:

Book on my wishlist:

Programming Collective Intelligence – still stands as king for Intro to Machine Learning

Analysis of data in movement – streams, network, memory, file writes

I found a tool that helps capture files from a network stream.  I'm also looking for other tools that examine data in real time. This tool is called Streams and was released last year.  Source can be found here:

Here is a presentation on Storm: the Hadoop of Realtime Stream Processing

Another blog article on capturing data streams on big data systems

Another tool is:

Setting up your machine for R and Machine Learning

Setting up your machine to use R packages. Source: Machine Learning for Hackers

2013: Just an update. I'm working through the book using the IPython Notebook data/machine-learning stack. There are also instructions on this blog for my OS X Python data stack build.

Here is my code:

Commercial Big Data Infosec Players

Building security rules or predictive engines is hard stuff. I think it's going to be hard for organizations to organize, collect, and analyze terabytes of security data. Good big data analytics seems to be rocket-science-type stuff. These vendors are offering big data products for Infosec to consume:

Adobe releases J48 code for malware classification – wordcloud of code Infographic

Adobe Systems has released a malware classification tool in order to help security incident first responders, malware analysts, and security researchers more easily identify malicious binary files.  So I downloaded the Python script to see what all the fuss on reddit was about.  I dragged the code into Wordle and isDirty stands out.  I wasn't entirely familiar with J48, and after some hunting I found out it is a decision tree.

Wikipedia says… a decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm. Decision trees are commonly used in operations research, specifically in decision analysis, to help identify a strategy most likely to reach a goal. If in practice decisions have to be taken online with no recall under incomplete knowledge, a decision tree should be paralleled by a probability model as a best choice model or online selection model algorithm. Another use of decision trees is as a descriptive means for calculating conditional probabilities.
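In other words, a decision tree like J48 just chains threshold tests on features. Here is a minimal pure-Python sketch of the idea; the feature names (isDirty, entropy) and the thresholds are hypothetical stand-ins for illustration, not the actual attributes or splits Adobe's classifier uses.

```python
# Minimal sketch of how a decision tree classifies a binary: a chain of
# threshold tests on features. The features below are made up for
# illustration -- not the real attributes in Adobe's J48 model.

def classify(sample):
    """Walk a tiny hand-built tree: each node tests one feature."""
    if sample["isDirty"]:          # root split on the isDirty flag
        return "malicious"
    if sample["entropy"] > 7.0:    # high entropy: packed/encrypted sections
        return "suspicious"
    return "clean"

print(classify({"isDirty": False, "entropy": 7.5}))  # suspicious
print(classify({"isDirty": True, "entropy": 3.0}))   # malicious
```

A trained J48 model is just a bigger version of this: the learning algorithm picks the features and thresholds that best separate malicious from clean samples.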

I still wasn't satisfied because I didn't really understand how the script worked.  I ended up finding this awesome book with, d'uh, a simple Google query: "data mining malware".  Suffice it to say I will have some reading to do in the next couple of days. Screenshots below are from the book: Data Mining Tools for Malware Detection.

Evidence based decision making – Instant access to diagnosis, treatment and prognosis

Wouldn't it be nice to take a snapshot of system RAM, MD5 hashes of 100k HDD files, and a 100 MB PCAP, upload them to the "cloud", and get a health status of the box? Result: System is Trusted.  Infosec doesn't have enough defense research and attack research studies to help practitioners answer questions that easily. There are severe limits to the field's knowledge within Infosec.  I think there are systemic problems with Infosec and the industry, particularly information/data sharing. There isn't a publicly available compendium of knowledge for us to query. Without these tools, Infosec is shooting from the hip, on "past intuition".  Data-driven security will have to be more than metrics.

Learning from the study of medicine

The screenshot above shows a tool doctors use called First Consult.  First Consult is an evidence-based clinical information resource for healthcare professionals. Designed for use at the point of care, it provides instant, user-friendly access to the latest information on evaluation, diagnosis, clinical management, prognosis, and prevention.

Hadoop and R Tutorials – Big Data Step-by-Step

Using R to find recommendations to what to eat

Below are some R script snippets that I put into RStudio.


Rabbit <- c(10, 7, 1, 2, NA, 1)
Cow <- c( 7, 10, NA, NA, NA, NA)
Dog <- c(NA, 1, 10, 10, NA, NA)
Pig <- c( 5, 6, 4, NA, 7, 3)
Chicken <- c( 7, 6, 2, NA, 10, NA)
Pinguin <- c( 2, 2, NA, 2, 2, 10)
Bear <- c( 2, NA, 8, 8, 2, 7)
Lion <- c(NA, NA, 9, 10, 2, NA)
Tiger <- c(NA, NA, 8, NA, NA, 5)
Antilope <- c( 6, 10, 1, 1, NA, NA)
Wolf <- c( 1, NA, NA, 8, NA, 3)
Sheep <- c(NA, 8, NA, NA, NA, 2)

# Create arrays of name labels for the matrix dimensions
animals <- c("Rabbit","Cow","Dog","Pig","Chicken","Pinguin","Bear","Lion","Tiger","Antilope","Wolf","Sheep")
foods <- c("Carrots","Grass","Pork","Beef","Corn","Fish")
matrixRowAndColNames <- list(animals, foods)

# Create matrix (one row per animal, one column per food)
animal2foodRatings <- matrix(data=c(Rabbit,Cow,Dog,Pig,Chicken,Pinguin,Bear,Lion,Tiger,Antilope,Wolf,Sheep), nrow=12, ncol=6, byrow=TRUE, dimnames=matrixRowAndColNames)

# Replace the missing (NA) ratings with the overall mean rating
animal2foodRatingsWithMean <- animal2foodRatings
animal2foodRatingsWithMean[is.na(animal2foodRatingsWithMean)] <- mean(rowMeans(animal2foodRatings, na.rm=TRUE))

FactorStructure <- svd(animal2foodRatingsWithMean)
D <- diag(FactorStructure$d)
PredictedRatings <- FactorStructure$u %*% D %*% t(FactorStructure$v)
dimnames(PredictedRatings) <- matrixRowAndColNames

PredictiveMatrix <- matrix(nrow=length(animals), ncol=length(foods))
dimnames(PredictiveMatrix) <- matrixRowAndColNames
# Sheep Carrots prediction: rebuild each rating from the first k factors
k <- 2
for(animal in 1:length(animals)) {
  for(food in 1:length(foods)) {
    PredictiveMatrix[animal,food] <- (((FactorStructure$u[animal,1:k]*sqrt(FactorStructure$d[1:k])) %*% (sqrt(FactorStructure$d[1:k])*t(FactorStructure$v)[1:k,food]))[1,1])
  }
}

library(recommenderlab)  # provides realRatingMatrix, normalize, Recommender
animal2foodRatingsRecMatrix <- as(animal2foodRatings, "realRatingMatrix")
animal2foodRatingsRecMatrix_n <- normalize(animal2foodRatingsRecMatrix)
animal2foodRatingsRecMatrix_n2 <- normalize(animal2foodRatingsRecMatrix, method="Z-score")

# Average user rating
# Average number of ratings per User
# Average number of ratings per Item
# Amount of all ratings
# Histogram of ratings
hist(getRatings(animal2foodRatingsRecMatrix), breaks=10, main="Distribution of Ratings")

image(animal2foodRatingsRecMatrix, main="Raw Data")
image(animal2foodRatingsRecMatrix_n, main="Centered")
image(animal2foodRatingsRecMatrix_n2, main="Z-Score Normalization")

rec <- Recommender(animal2foodRatingsRecMatrix[1:10,], method="IBCF")
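The same SVD trick the R script uses can be sketched in a few lines of Python/NumPy: impute the missing ratings with the overall mean, factor the matrix, and rebuild a rank-k approximation whose entries serve as predicted ratings. The small matrix below reuses a few of the animal rows above just for illustration.

```python
# Sketch of the SVD-based rating prediction: mean-impute NAs, factor,
# then reconstruct from the top k singular values/vectors.
import numpy as np

nan = np.nan
ratings = np.array([
    [10,  7,   1,   2,   nan, 1],    # Rabbit
    [7,   10,  nan, nan, nan, nan],  # Cow
    [nan, 1,   10,  10,  nan, nan],  # Dog
    [nan, 8,   nan, nan, nan, 2],    # Sheep
])

filled = ratings.copy()
filled[np.isnan(filled)] = np.nanmean(ratings)  # impute overall mean

U, s, Vt = np.linalg.svd(filled, full_matrices=False)
k = 2
# Rank-k reconstruction: every cell (including imputed ones) gets a
# predicted rating from the two strongest latent factors.
predicted = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(predicted.round(1))
```

With k equal to the full rank the reconstruction is exact; truncating to k=2 is what smooths the imputed cells into predictions.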

Running Mahout Hadoop Taste Recommender algorithm on example Grouplens dataset

Here are the links I found for my CDH3 tutorial setup. The second is the recommendation guide.

This is a completely distributed item-based recommender. It expects a .csv file with preference data as input. Here is an example of the csv file I inputted:


So at the end of the processing, I end up with more data in the file. Impressions: Mahout is for developers needing large-scale processing of data, with some limitations on algorithms. WEKA is mainly for data mining analysts and learners. The GUI and "autoAlgorithm" feature make it easier for beginners to process data. WEKA will have issues scaling to very large datasets because of memory limitations.
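The core of an item-based recommender like Mahout's can be sketched in plain Python: from (user, item, preference) triples of the same shape as the input .csv, compute item-item cosine similarity and use it to score unseen items. This is just the central computation under made-up data, not Mahout's distributed implementation.

```python
# Pure-Python sketch of the item-based idea: build per-item rating
# vectors from (user, item, preference) triples, then compare items by
# cosine similarity. Data below is invented for illustration.
import math
from collections import defaultdict

triples = [  # (user, item, preference), like the rows of the input .csv
    (1, "A", 5.0), (1, "B", 3.0),
    (2, "A", 4.0), (2, "B", 2.0), (2, "C", 5.0),
    (3, "B", 4.0), (3, "C", 4.0),
]

by_item = defaultdict(dict)          # item -> {user: preference}
for user, item, pref in triples:
    by_item[item][user] = pref

def cosine(a, b):
    """Cosine similarity between two item rating vectors (dicts by user)."""
    common = set(a) & set(b)         # users who rated both items
    if not common:
        return 0.0
    dot = sum(a[u] * b[u] for u in common)
    return dot / (math.sqrt(sum(v * v for v in a.values())) *
                  math.sqrt(sum(v * v for v in b.values())))

sim = cosine(by_item["A"], by_item["C"])
print(round(sim, 3))
```

Mahout distributes exactly this kind of pairwise computation across Hadoop so it scales to millions of triples.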