BigSnarf blog

Infosec FTW

Category Archives: Tools

Mahout Parallel Frequent Pattern Mining

Flow

AOL Moloch is PCAP Elasticsearch full packet search

https://github.com/aol/moloch

Moloch is an open source, large scale IPv4 packet capturing (PCAP), indexing and database system. A simple web interface is provided for PCAP browsing, searching, and exporting. APIs are exposed that allow PCAP data and JSON-formatted session data to be downloaded directly. Simple security is implemented by using HTTPS and HTTP digest password support or by using Apache in front. Moloch is not meant to replace IDS engines but instead to work alongside them to store and index all the network traffic in standard PCAP format, providing fast access. Moloch is built to be deployed across many systems and can scale to handle multiple gigabits/sec of traffic.

Mandelbrot Set

PyOpenCL

PyOpenCL lets you access the OpenCL parallel computation API from Python. Here’s what sets PyOpenCL apart:

  • Object cleanup tied to lifetime of objects. This idiom, often called RAII in C++, makes it much easier to write correct, leak- and crash-free code.
  • Completeness. PyOpenCL puts the full power of OpenCL’s API at your disposal, if you wish.
  • Convenience. While PyOpenCL’s primary focus is to make all of OpenCL accessible, it tries hard to make your life less complicated as it does so–without taking any shortcuts.
  • Automatic Error Checking. All OpenCL errors are automatically translated into Python exceptions.
  • Speed. PyOpenCL’s base layer is written in C++, so all the niceties above are virtually free.
  • Helpful, complete documentation and a wiki.
  • Liberal licensing (MIT).

Documentation

See the PyOpenCL Documentation.

Support

Having trouble with PyOpenCL? First, you may want to check the PyOpenCL Wiki. If that doesn’t help, maybe the nice people on the PyOpenCL mailing list can.

Download

Download PyOpenCL here.

Or get it directly from my source code repository by typing

git clone http://git.tiker.net/trees/pyopencl.git

You may also browse the source.

Prerequisites: All you need is an OpenCL implementation. And Python obviously.

Formatting code in iPython

Extracting features out of web logs to identify Human vs. Robot

Classifying traffic intensity and temporary differences in access

  1. Total page requests per IP address
  2. Percentage of images requested
  3. Percentage of binaries requested like pdf
  4. Total requests for robots.txt
  5. Percentage of HTML pages requested
  6. Percentage of text files requested
  7. Percentage of zip files requested
  8. Percentage of video files requested
  9. Bounce rate
  10. Session time
  11. Standard deviation between clicks
  12. Percentage of night time requests
  13. Percentage of errors
  14. Percentage of garbage requests
  15. Percentage of GET requests
  16. Percentage of POST requests
  17. Percentage of HEAD requests
  18. URL traversal
  19. Depth of URL traversal
  20. Path length
  21. Referrer
  22. User Agents
  23. IP Address location
  24. Known crawler IP addresses
  25. Repeated requests
  26. Average time between clicks
  27. OS badges
  28. ARIN registration
  29. ASN analysis
  30. Geolocation
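
A few of the features above can be sketched in plain Python. This is only a minimal sketch, not a finished classifier input pipeline: it assumes the logs are in Apache/NCSA Common Log Format, the image extension list and the 00:00 to 05:59 "night" window are my own guesses, and the sample log lines are made up for illustration. The function and field names (extract_features, pct_images and so on) are hypothetical.

```python
import re
from collections import defaultdict
from datetime import datetime
from statistics import mean, pstdev

# Assumption: logs are in Apache/NCSA Common Log Format.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
)
# Assumption: what counts as an "image" request.
IMAGE_EXT = (".png", ".jpg", ".jpeg", ".gif", ".ico")


def parse(line):
    """Return (ip, timestamp, method, path, status), or None if the line does not match."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    ts = datetime.strptime(m["ts"], "%d/%b/%Y:%H:%M:%S %z")
    path = m["path"].split("?")[0].lower()  # drop the query string
    return m["ip"], ts, m["method"], path, int(m["status"])


def pct(count, total):
    return 100.0 * count / total


def extract_features(lines, night_hours=range(0, 6)):
    """Group requests by IP and compute a handful of per-IP features."""
    by_ip = defaultdict(list)
    for line in lines:
        rec = parse(line)
        if rec is not None:
            by_ip[rec[0]].append(rec[1:])

    features = {}
    for ip, recs in by_ip.items():
        recs.sort(key=lambda r: r[0])  # order by timestamp
        n = len(recs)
        # Seconds between consecutive requests ("clicks") from the same IP.
        gaps = [(b[0] - a[0]).total_seconds() for a, b in zip(recs, recs[1:])]
        features[ip] = {
            "total_requests": n,
            "robots_txt_requests": sum(path == "/robots.txt" for _, _, path, _ in recs),
            "pct_images": pct(sum(path.endswith(IMAGE_EXT) for _, _, path, _ in recs), n),
            "pct_errors": pct(sum(status >= 400 for _, _, _, status in recs), n),
            "pct_get": pct(sum(method == "GET" for _, method, _, _ in recs), n),
            "pct_head": pct(sum(method == "HEAD" for _, method, _, _ in recs), n),
            "pct_night": pct(sum(ts.hour in night_hours for ts, _, _, _ in recs), n),
            "mean_gap_s": mean(gaps) if gaps else None,
            "stdev_gap_s": pstdev(gaps) if gaps else None,
        }
    return features


# Made-up sample lines, for illustration only.
SAMPLE_LOG = [
    '10.0.0.1 - - [13/May/2013:02:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 68',
    '10.0.0.1 - - [13/May/2013:02:00:01 +0000] "GET /a.html HTTP/1.1" 200 512',
    '10.0.0.1 - - [13/May/2013:02:00:02 +0000] "HEAD /b.html HTTP/1.1" 404 0',
    '10.0.0.2 - - [13/May/2013:14:00:00 +0000] "GET /index.html HTTP/1.1" 200 1024',
    '10.0.0.2 - - [13/May/2013:14:00:30 +0000] "GET /logo.png HTTP/1.1" 200 2048',
]

features = extract_features(SAMPLE_LOG)
for ip, row in features.items():
    print(ip, row)
```

The resulting per-IP dictionaries could be turned into feature vectors for whatever classifier you prefer. Features that need outside data, such as known crawler IP lists, ARIN registration, ASN and geolocation, are left out of this sketch.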

Security Data Visualization

Tableau Public is for anyone who wants to tell stories with interactive data on the web. It’s delivered as a service which allows you to be up and running overnight. With Tableau Public you can create amazing interactive visuals and publish them quickly, without the help of programmers or IT.

The Premium version of Tableau Public is for organizations that want to enhance their websites with interactive data visualizations. There are higher limits on the size of data you can work with. And among other premium features, you can keep your underlying data hidden.

Why tell stories with data? Because interactive content drives more page views and longer dwell time. Industry experts have cited figures showing that the average reading time of a web page with an interactive visual is 4, 5 or 6 times that of a static web page.

http://www.tableausoftware.com/products/public

My thoughts on building a security data analytics practice in an organization

Build People, Processes and Policies

  1. Gather all the questions that need to be answered
  2. Select team members
  3. Develop data preparation workflows
  4. Select data preparation tools (Python, Bash, Hadoop)
  5. Decide how you want to consume and present data to users and consumers
  6. Select data presentation tools (Tableau, IPython notebooks, D3.js)
  7. Develop the experimentation workflow (tools etc)
  8. Observe and analyze experiment outcomes (gotta build stuff / POC)
  9. Build data products and optimize (POC => WIP => Prod1.0 => Prod2.0)
  10. Train anyone and everyone to love your data products
  11. Build data products you love

Analysis of the current environment

  1. What questions need to be asked? What questions need to be answered? Who needs these answers? How fast?
  2. Where is your data now, how is it stored, who controls it, and how do you get access?
  3. Are you getting the right kinds of data? Is it in the format you want? Are the systems in place answering 90% of questions?
  4. Consider instrumenting everything
  5. Consider storing all the data in one place. Figure out how to protect and monitor access.
  6. You need data to feed the algorithms that answer people's questions
  7. You need to store the data first; then you can process it from unstructured to structured data
  8. Consume the data you have first before building
  9. Plan on keeping all the data forever
  10. Build data products for self service, exploration and experimentation. “Data Lovefest”
  11. Make tools for everyone, including yourself
  12. Build for analytical applications that encourage consumption

Update: Mature DS Shops

The laboratory. To succeed with the data lab, companies must create an open, questioning, collaborative environment. They must nurture a critical mass of data scientists and provide them access to lots of data, state-of-the-art tools, and time to dream up and work through hundreds of hypotheses — most of which will not yield insight.

The factory. The work of creating a product or service from an insight, figuring out how to deliver and support it, scaling up to do so, dealing with special cases and mistakes, and doing so at profit is beyond the scope of the lab. It calls for a sense of urgency; discipline and coordination; project plans and schedules; and higher levels of automation and repeatability. The work requires many more people with a wider variety of skill sets, a more rigid environment, and different sorts of metrics.

http://blogs.hbr.org/cs/2013/04/two_departments_for_data_succe.html

What side are you on? Blue Team or Red Team? OSS Security Distros

iPython Notebooks with Redis to store lists, sets, and python objects using pickle
