BigSnarf blog

Infosec FTW

Monthly Archives: January 2013

iPython processing Apache logs with generators and visualizing with matplotlib


  • Grab logs from multiple data-centers
  • Split out anonymized and non-anonymized data into two separate files
  • Store both sets of files in HDFS – Hadoop FTW (experiments done)
  • Create HIVE queries (experiments done)
  • Query the data
  • Yum! Stats!
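
As a rough sketch of the generator approach named in the title, the snippet below chains small generators so that log lines are parsed one at a time instead of being loaded into memory. The regex assumes Apache's Common Log Format, and the names `read_lines`, `parse_lines`, and `count_status` are made up for illustration; they are not taken from the original notebook.

```python
import re
from collections import Counter

# Common Log Format: host ident user [time] "request" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)

def read_lines(paths):
    """Yield raw lines from each log file, one file at a time."""
    for path in paths:
        with open(path) as handle:
            for line in handle:
                yield line

def parse_lines(lines):
    """Yield a dict per line that matches the pattern; skip the rest."""
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:
            yield match.groupdict()

def count_status(records):
    """Consume the record stream and tally HTTP status codes."""
    return Counter(record["status"] for record in records)

# Made-up sample lines; in practice pass read_lines([...paths...]) instead
sample = [
    '10.0.0.1 - - [20/Jan/2013:11:26:02 -0500] "GET /index.html HTTP/1.1" 200 2326',
    '10.0.0.2 - - [20/Jan/2013:11:26:05 -0500] "GET /missing HTTP/1.1" 404 512',
    '10.0.0.1 - - [20/Jan/2013:11:27:10 -0500] "POST /login HTTP/1.1" 200 128',
    'not a log line',
]
status_counts = count_status(parse_lines(sample))
print(status_counts)
```

The resulting counter can then be handed to matplotlib for a quick bar chart, for example with `plt.bar(list(status_counts.keys()), list(status_counts.values()))`.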

Refactored for easier queries


J48, J48 Graft, PART, and Ridor for classification of 100,000 malicious vs 16,000 clean programs


Perform quick, easy classification of binaries for malware analysis.

Malware Classifier is a command-line tool that lets antivirus analysts, IT administrators, and security researchers quickly and easily determine if a binary file contains malware, so they can develop malware detection signatures faster, reducing the time during which users’ systems are vulnerable.

The tool uses machine-learning algorithms to classify Win32 binaries – EXEs and DLLs – into three classes: 0 for “clean,” 1 for “malicious,” or “UNKNOWN.”

The tool extracts seven key features from an unknown binary, feeds them to one of the four classifiers or all of them, and presents its classification of the unknown binary as “clean,” “malicious,” or “unknown.”

The tool was developed using models resultant from running the J48, J48 Graft, PART, and Ridor machine-learning algorithms on a dataset of approximately 100,000 malicious programs and 16,000 clean programs.
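
The description above does not say how the tool reconciles the four classifiers when they disagree. Purely as an illustration, here is one hypothetical rule: take a strict majority vote and fall back to UNKNOWN otherwise. The function name `combine_verdicts` and the sample verdicts are assumptions, not the tool's actual behavior or output.

```python
from collections import Counter

# Class labels as described above: 0 = clean, 1 = malicious, or UNKNOWN
CLEAN, MALICIOUS, UNKNOWN = "0", "1", "UNKNOWN"

def combine_verdicts(verdicts):
    """Return the strict-majority label, or UNKNOWN if there is none.

    Hypothetical combination rule, not the tool's documented logic.
    """
    if not verdicts:
        return UNKNOWN
    label, votes = Counter(verdicts).most_common(1)[0]
    if label != UNKNOWN and votes * 2 > len(verdicts):
        return label
    return UNKNOWN

# Made-up per-classifier outputs for a single binary
verdicts = {
    "J48": MALICIOUS,
    "J48 Graft": MALICIOUS,
    "PART": MALICIOUS,
    "Ridor": CLEAN,
}
result = combine_verdicts(list(verdicts.values()))
print(result)
```

With a 2-2 split the sketch returns UNKNOWN rather than picking a side, which mirrors the three-way output described above.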

Big Data Security Challenges

I don’t usually reblog word for word, but this article sums up my intentions when I started my journey into Big Data, Visualizations, OSS, Python, etc. It was all an attempt to understand and build something to answer my own questions.

Reblogged from:

Big Data Security Challenges

Collecting massive amounts of security data is easy. Data analysis and visualization? Not so much.

By joltsik on Thu, 01/17/13 – 10:55am.

According to ESG Research, 47% of enterprise organizations collect 6TB of security data or more on a monthly basis to support their cybersecurity analysis requirements. Furthermore, 43% of enterprise organizations collect “substantially more” security data than they did 2 years ago while an additional 43% of enterprise organizations collect “somewhat more” security data than they did 2 years ago.

Just what types of data are they collecting? Everything. User activities, firewall logs, asset data, vulnerability scans, DNS logs, etc. Most enterprises aren’t collecting, storing, and analyzing large volumes of network packets (i.e., full-packet capture, or PCAP) today, but they will increasingly do so in the future. Once this happens, the volume of security data collected will take another quantum leap.

If this activity doesn’t signal the need for big data security analytics, then nothing does. Nevertheless, CISOs’ needs go beyond dumping a bunch of unstructured data in a Hadoop cluster.

So what’s required? To find out, ESG recently surveyed 257 security professionals working at North American-based enterprise organizations (i.e. more than 1,000 employees) and asked them a series of questions about security data collection, processing, and analysis. As part of this project, security professionals were asked to identify specific difficulties around security data collection and analysis. The top 2 problems revealed were:

• 62% of enterprise organizations have “significant difficulties” or “some difficulties” with security data visualization
• 53% of enterprise organizations have “significant difficulties” or “some difficulties” with security data analysis

Existing security analytics tools tend to catch obvious attacks or provide a 50,000-foot perspective of the network. Security analysts and CISOs need an atomic view of packets, protocols, payloads, and behavior over various timeframes – seconds, minutes, days, weeks, months, etc. They need visualization tools that provide context of what’s normal, what’s anomalous, and what’s extremely dangerous. Finally, they need security technology to do more of the heavy lifting in analysis. Forget big data technology buzzwords like NoSQL, Cassandra, and MapReduce. CISOs need data analysis and visualization, not just a bigger file system for unstructured data.

Lock down your network all you can, but you will still need continuous monitoring and big data tools to analyze and visualize the billions of IT activities that happen each day to attain situational awareness and make tactical security adjustments.

This is the near future of enterprise security analytics. The vendor that provides big data backend technologies along with superior analytics intelligence and visualization will win big.

Additional reading:

Which one are you in the stack?


I’m an early morning or late evening coder – what do your GitHub commits look like?

Building your own search engine in Python

  • Learn core concepts of search
  • Learn associated terminology
  • Understand that search is document-based, not relational (RDBMS)
  • The inverted index is what gets searched; it links each term back to its documents
  • Python code – inverted index class
  • Technique for stemming words
  • Understand N-grams
  • Understand tokenizers and n-gram processing
  • Understanding fields
  • Understand document handler
  • Search Engine
  • Concept of Sharding
  • Concept of Faceting
  • Concept of Boost


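min_gram = 3
max_gram = 6

tokens = ["hello", "world"]  # example input (assumed); any list of word tokens
terms = {}  # maps each front n-gram to a set of token positions

for position, token in enumerate(tokens):
  # Front n-grams: prefixes of min_gram..max_gram characters, capped at the token length
  for window_length in range(min_gram, min(max_gram, len(token)) + 1):
    gram = token[:window_length]
    terms.setdefault(gram, set()).add(position)  # assumed: store token positions per gram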
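
To show how the front n-gram loop above could feed the "inverted index class" from the outline, here is a minimal sketch. It assumes a naive lowercase/whitespace tokenizer, no stemming, and AND semantics across query tokens; the names `front_ngrams` and `InvertedIndex` are illustrative and not taken from the talk.

```python
from collections import defaultdict

def front_ngrams(token, min_gram=3, max_gram=6):
    """Return the prefixes of token with length min_gram..max_gram."""
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]

class InvertedIndex:
    """Map each front n-gram to the ids of the documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for token in text.lower().split():  # naive whitespace tokenizer
            for gram in front_ngrams(token):
                self.postings[gram].add(doc_id)

    def search(self, query):
        """Return ids of documents that match every usable query token."""
        results = None
        for token in query.lower().split():
            grams = front_ngrams(token)
            if not grams:
                continue  # token shorter than min_gram: nothing to look up
            # The longest gram is the most selective lookup for this token
            matches = set(self.postings.get(grams[-1], set()))
            results = matches if results is None else results & matches
        return results or set()

index = InvertedIndex()
index.add("doc1", "Processing Apache logs with generators")
index.add("doc2", "Building a search engine in Python")
index.add("doc3", "Searching logs with Hive")
print(sorted(index.search("sear")))
print(sorted(index.search("logs with")))
```

Because only prefixes are indexed, a partial query such as "sear" finds both "search" and "searching", which is the usual reason to index front n-grams for autocomplete-style search.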

Update link:

Machine Learning use cases

  • Churn Prediction
  • Sentiment Analysis
  • Truth and Veracity
  • Recommendation Engine
  • Online Advertisement
  • News Aggregation
  • Scalability with Big Data
  • Content Discovery – Search
  • Intelligent Learning
  • Malware Data Mining
  • SPAM detection
  • Anomaly Detection

Who doesn’t love Big Data and Comics?


Github Data Geeks Team Contributions Calendar

Plotting github contributions
