BigSnarf blog
Infosec FTW
Data Scientist Roles
Posted by on April 15, 2013
http://www.fastcolabs.com/3008620/lessons-crash-course-data-science
Data Science Bootcamp
The idea behind Big Dive is to boost the growth of a new generation of developers.
A street-fighting gym where high value datasets are the raw material in the hands of a bunch of ambitious smart geeks tutored and mentored by experts in three key areas: Development, Visualization and Data Science.
Extracting features out of web logs to identify Human vs. Robot
Posted by on April 9, 2013
Classifying traffic intensity and temporal differences in access
- Total page requests per IP address
- Percentage of images requested
- Percentage of binary files requested (e.g. PDF)
- Total requests for robots.txt
- Percentage of HTML pages requested
- Percentage of text files requested
- Percentage of zip files requested
- Percentage of video files requested
- Bounce rate
- Session time
- Standard deviation between clicks
- Percentage of night time requests
- Percentage of errors
- Percentage of garbage requests
- Percentage of GET requests
- Percentage of POST requests
- Percentage of HEAD requests
- URL traversal
- Depth of URL traversal
- Path length
- Referrer
- User Agents
- IP Address location
- Known crawler IP addresses
- Repeated requests
- Average time between clicks
- OS badges
- ARIN registration
- ASN analysis
- Geolocation
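Several of the features above can be computed per IP address once the log lines are parsed. The following is a minimal sketch, not a complete extractor: the record layout, the `extract_features` name, the image extension list and the night-time window (00:00 to 06:00) are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime
from statistics import pstdev

# Hypothetical parsed log records: (ip, timestamp, method, path, status)
RECORDS = [
    ("10.0.0.1", datetime(2013, 4, 9, 2, 0, 0), "GET", "/robots.txt", 200),
    ("10.0.0.1", datetime(2013, 4, 9, 2, 0, 1), "GET", "/index.html", 200),
    ("10.0.0.1", datetime(2013, 4, 9, 2, 0, 2), "HEAD", "/a/b/c.html", 404),
    ("10.0.0.1", datetime(2013, 4, 9, 2, 0, 3), "GET", "/img/logo.png", 200),
    ("10.0.0.2", datetime(2013, 4, 9, 14, 0, 0), "GET", "/index.html", 200),
    ("10.0.0.2", datetime(2013, 4, 9, 14, 0, 40), "POST", "/login", 200),
]

IMAGE_EXT = (".png", ".jpg", ".jpeg", ".gif")  # assumed list of image types


def pct(count, total):
    """Return count as a percentage of total (0.0 when total is zero)."""
    return 100.0 * count / total if total else 0.0


def extract_features(records):
    """Group records by IP and compute a handful of per-IP features."""
    by_ip = defaultdict(list)
    for rec in records:
        by_ip[rec[0]].append(rec)

    features = {}
    for ip, recs in by_ip.items():
        recs.sort(key=lambda r: r[1])  # order by timestamp
        n = len(recs)
        # Seconds between consecutive requests ("clicks") from this IP.
        gaps = [(b[1] - a[1]).total_seconds() for a, b in zip(recs, recs[1:])]
        features[ip] = {
            "total_requests": n,
            "robots_txt_requests": sum(r[3] == "/robots.txt" for r in recs),
            "pct_images": pct(sum(r[3].lower().endswith(IMAGE_EXT) for r in recs), n),
            "pct_html": pct(sum(r[3].lower().endswith(".html") for r in recs), n),
            "pct_get": pct(sum(r[2] == "GET" for r in recs), n),
            "pct_post": pct(sum(r[2] == "POST" for r in recs), n),
            "pct_head": pct(sum(r[2] == "HEAD" for r in recs), n),
            "pct_errors": pct(sum(r[4] >= 400 for r in recs), n),
            "pct_night": pct(sum(r[1].hour < 6 for r in recs), n),
            "max_url_depth": max(r[3].count("/") for r in recs),
            "avg_click_gap": sum(gaps) / len(gaps) if gaps else 0.0,
            "std_click_gap": pstdev(gaps) if len(gaps) > 1 else 0.0,
        }
    return features


if __name__ == "__main__":
    for ip, feats in extract_features(RECORDS).items():
        print(ip, feats)
```

Each per-IP dictionary can then be turned into a feature vector for a classifier. Features such as geolocation, ASN or known crawler lists need external data sources and are left out here.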
Security Data Visualization
Posted by on April 8, 2013
Tableau Public is for anyone who wants to tell stories with interactive data on the web. It’s delivered as a service which allows you to be up and running overnight. With Tableau Public you can create amazing interactive visuals and publish them quickly, without the help of programmers or IT.
The Premium version of Tableau Public is for organizations that want to enhance their websites with interactive data visualizations. There are higher limits on the size of data you can work with. And among other premium features, you can keep your underlying data hidden.
Why tell stories with data? Because interactive content drives more page views and longer dwell time. Industry experts have cited figures showing that the average reading time of a web page with an interactive visual is 4, 5 or 6 times that of a static web page.
Feature Extraction Network Packets Machine Learning
Posted by on April 5, 2013

Scikit-learn (sklearn) is an established, open-source machine learning library, written in Python with the help of NumPy, SciPy and Cython.
Scikit-learn is very user friendly, has a consistent API, and provides extensive documentation. Its implementation is high quality due to strict coding standards and high test coverage. Behind sklearn is a very active community, which is steadily improving the library.
- How to perform scalable text feature extraction with the Hashing Trick
Features extracted from each network packet:
- Ethernet Size
- Ethernet Destination
- Ethernet Source
- Ethernet Protocol
- IP header length
- IP Time To Live
- IP Protocol
- IP Length
- IP Type of Service
- IP Source
- IP Destination
- TCP Source Port
- TCP Destination Port
- UDP Source Port
- UDP Destination Port
- UDP Length
- ICMP Type
- ICMP Code
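The header fields listed above sit at fixed offsets, so they can be pulled out of a raw frame with the standard `struct` module. This is a rough sketch under stated assumptions: the frame is untagged Ethernet II carrying IPv4, it is well formed, and the `parse_packet` name and the sample frame are made up for illustration.

```python
import socket
import struct


def mac(raw):
    """Format 6 raw bytes as a colon-separated MAC address."""
    return ":".join(f"{b:02x}" for b in raw)


def parse_packet(frame):
    """Extract header features from a raw Ethernet II frame (IPv4 only)."""
    dst, src, eth_proto = struct.unpack("!6s6sH", frame[:14])
    feats = {
        "eth_size": len(frame),
        "eth_dst": mac(dst),
        "eth_src": mac(src),
        "eth_proto": eth_proto,
    }
    if eth_proto != 0x0800:  # not IPv4, stop at the Ethernet layer
        return feats

    # Fixed 20-byte part of the IPv4 header.
    (ver_ihl, tos, total_len, _ident, _frag, ttl, proto, _csum,
     src_ip, dst_ip) = struct.unpack("!BBHHHBBH4s4s", frame[14:34])
    ihl = (ver_ihl & 0x0F) * 4  # header length in bytes
    feats.update(
        ip_header_len=ihl,
        ip_tos=tos,
        ip_len=total_len,
        ip_ttl=ttl,
        ip_proto=proto,
        ip_src=socket.inet_ntoa(src_ip),
        ip_dst=socket.inet_ntoa(dst_ip),
    )

    l4 = frame[14 + ihl:]  # transport header starts after IP options
    if proto == 6:  # TCP
        feats["tcp_sport"], feats["tcp_dport"] = struct.unpack("!HH", l4[:4])
    elif proto == 17:  # UDP
        (feats["udp_sport"], feats["udp_dport"],
         feats["udp_len"]) = struct.unpack("!HHH", l4[:6])
    elif proto == 1:  # ICMP
        feats["icmp_type"], feats["icmp_code"] = struct.unpack("!BB", l4[:2])
    return feats


# Hand-built sample frame (hypothetical addresses): Ethernet + IPv4 + TCP SYN.
ETH = struct.pack("!6s6sH", bytes.fromhex("ffffffffffff"),
                  bytes.fromhex("001122334455"), 0x0800)
IP = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40, 1, 0, 64, 6, 0,
                 socket.inet_aton("192.168.1.10"), socket.inet_aton("10.0.0.5"))
TCP = struct.pack("!HHIIBBHHH", 51515, 80, 0, 0, 0x50, 0x02, 8192, 0, 0)
FRAME = ETH + IP + TCP

if __name__ == "__main__":
    print(parse_packet(FRAME))
```

In practice the frames would come from a capture file or a live interface rather than being packed by hand; a packet library can replace the manual unpacking.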
Other potential features to extract from packets:
- Duration of the connection
- Connection Starting Time
- Connection Ending Time
- Number of packets from src to dst
- Number of packets from dst to src
- Number of bytes from src to dst
- Number of bytes from dst to src
- Number of Fragmented packets
- Number of ACK packets
- Number of retransmitted packets
- Number of pushed packets
- Number of SYN packets
- Number of FIN packets
- Number of TCP header flags
- Number of Urgent packets
- Number of sequence packets
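Connection-level features like these come from grouping packets into bidirectional flows. The sketch below assumes packets have already been reduced to simple tuples; the tuple layout, the `connection_features` name and the sample data are hypothetical, and the first packet seen is taken to define the src to dst direction.

```python
# TCP flag bits as defined in the TCP header.
FIN, SYN, PSH, ACK, URG = 0x01, 0x02, 0x08, 0x10, 0x20
FLAG_NAMES = {"fin": FIN, "syn": SYN, "psh": PSH, "ack": ACK, "urg": URG}

# Hypothetical packet summaries:
# (timestamp, src, sport, dst, dport, length, tcp_flags)
PACKETS = [
    (0.00, "10.0.0.1", 40000, "10.0.0.2", 80, 60, SYN),
    (0.01, "10.0.0.2", 80, "10.0.0.1", 40000, 60, SYN | ACK),
    (0.02, "10.0.0.1", 40000, "10.0.0.2", 80, 54, ACK),
    (0.03, "10.0.0.1", 40000, "10.0.0.2", 80, 200, PSH | ACK),
    (0.10, "10.0.0.2", 80, "10.0.0.1", 40000, 1500, PSH | ACK),
    (0.20, "10.0.0.1", 40000, "10.0.0.2", 80, 54, FIN | ACK),
]


def new_flow(ts):
    """Create an empty feature record for a connection starting at ts."""
    flow = {"start": ts, "end": ts,
            "pkts_src_dst": 0, "pkts_dst_src": 0,
            "bytes_src_dst": 0, "bytes_dst_src": 0}
    flow.update({name + "_pkts": 0 for name in FLAG_NAMES})
    return flow


def connection_features(packets):
    """Aggregate packets into per-connection counters."""
    flows = {}
    for ts, src, sport, dst, dport, length, flags in sorted(packets):
        fwd = (src, sport, dst, dport)
        rev = (dst, dport, src, sport)
        # Reuse the existing flow if this packet is the reverse direction.
        key = rev if rev in flows else fwd
        flow = flows.setdefault(key, new_flow(ts))
        direction = "src_dst" if key == fwd else "dst_src"
        flow["end"] = ts
        flow["pkts_" + direction] += 1
        flow["bytes_" + direction] += length
        for name, bit in FLAG_NAMES.items():
            if flags & bit:
                flow[name + "_pkts"] += 1
    for flow in flows.values():
        flow["duration"] = flow["end"] - flow["start"]
    return flows


if __name__ == "__main__":
    for key, flow in connection_features(PACKETS).items():
        print(key, flow)
```

Counting fragments and retransmissions needs the IP fragment fields and TCP sequence numbers, which this simplified tuple does not carry.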
Network traffic type features:
- Per src IP to set(all dst IP) per minute, hour, day, month, year
- Per src IP to set(all dst same Port) per minute, hour, day, month, year
- Per src IP to set(all dst to different Ports) per minute, hour, day, month, year
- Per src IP to set(all dst per protocol or flag, e.g. SYN, FIN, ACK) per minute, hour, day, month, year
- All reverse stats from dst to src for items 1-4
- Conversations per IP per minute, hour, day, month, year
- Conversations based on protocol or flag, per minute, hour, day, month, year
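The "per src IP to set(...)" features above amount to counting distinct destinations and ports inside a time window. A minimal sketch, assuming flow records as plain tuples with epoch-second timestamps; the `fan_out` name, the record layout and the 60-second default window are illustrative choices.

```python
from collections import defaultdict

# Hypothetical flow records: (epoch_seconds, src_ip, dst_ip, dst_port)
FLOWS = [
    (0, "10.0.0.1", "10.0.0.2", 80),
    (5, "10.0.0.1", "10.0.0.3", 80),
    (10, "10.0.0.1", "10.0.0.4", 80),
    (30, "10.0.0.9", "10.0.0.2", 22),
    (35, "10.0.0.9", "10.0.0.2", 23),
    (70, "10.0.0.1", "10.0.0.2", 443),
]


def fan_out(flows, window=60):
    """Count distinct dst IPs and dst ports per (src IP, time bucket)."""
    dsts = defaultdict(set)
    ports = defaultdict(set)
    for ts, src, dst, dport in flows:
        bucket = (src, ts // window)  # window index since epoch
        dsts[bucket].add(dst)
        ports[bucket].add(dport)
    return {
        bucket: {"distinct_dsts": len(dsts[bucket]),
                 "distinct_ports": len(ports[bucket])}
        for bucket in dsts
    }


if __name__ == "__main__":
    for bucket, counts in sorted(fan_out(FLOWS).items()):
        print(bucket, counts)
```

One source touching many destinations on the same port in a window looks like a horizontal scan, while many ports on one destination looks like a vertical scan. Changing `window` to 3600 or 86400 gives the hourly and daily variants.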
Vectorizing a large text corpus with the hashing trick
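The hashing trick maps each token straight to a column index with a hash function, so no vocabulary has to be stored in memory. Here is a pure-Python sketch of the idea; the `hash_vectorize` name, the tokenizer and the vector size are assumptions for illustration. scikit-learn ships a full implementation as `HashingVectorizer`.

```python
import hashlib
import re


def hash_vectorize(text, n_features=16):
    """Map a text to a fixed-length count vector using the hashing trick."""
    vec = [0] * n_features
    for token in re.findall(r"\w+", text.lower()):
        # md5 is used only as a stable hash; Python's built-in hash() is
        # randomized per process for strings, so it is not reproducible.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % n_features
        vec[index] += 1  # colliding tokens share a column
    return vec


if __name__ == "__main__":
    print(hash_vectorize("GET /index.html HTTP/1.1 Mozilla/5.0"))
```

A small `n_features` causes many collisions; real workloads use a much larger vector and a sparse representation.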
My thoughts on building a security data analytics practice in an organization
Posted by on April 3, 2013
Build People, Processes and Policies
- Gather all the questions that need to be answered
- Select team members
- Develop data preparation workflows
- Select data preparation tools (python, bash, hadoop)
- Develop how you want to consume and present data to users and consumers
- Select data presentation tools (tableau, ipython notebooks, d3.js)
- Develop the experimentation workflow (tools etc)
- Observe and analyze experiment outcomes (gotta build stuff / POC)
- Build data products and optimize (POC => WIP => Prod1.0 => Prod2.0)
- Train anyone and everyone to love your data products
- Build data products you love
Analysis of the current environment
- What questions need to be asked? What questions need to be answered? Who needs these answers? How fast?
- Where is your data now? How is it stored? Who controls it? How do you get access?
- Are you getting the right kinds of data? Is it in the format you want? Are the systems in place answering 90% of questions?
- Consider instrumenting everything
- Consider storing all the data in one place. Figure out how to protect and monitor access.
- You need data to feed the algorithms that answer people's questions
- You need to store the data first; then you can process it from unstructured to structured form
- Consume the data you have first before building
- Plan on keeping all the data forever
- Build data products for self service, exploration and experimentation. “Data Lovefest”
- Make tools for everyone, including yourself
- Build for analytical applications that encourage consumption
Update: Mature DS Shops
The laboratory. To succeed with the data lab, companies must create an open, questioning, collaborative environment. They must nurture a critical mass of data scientists and provide them access to lots of data, state-of-the-art tools, and time to dream up and work through hundreds of hypotheses — most of which will not yield insight.
The factory. The work of creating a product or service from an insight, figuring out how to deliver and support it, scaling up to do so, dealing with special cases and mistakes, and doing so at profit is beyond the scope of the lab. It calls for a sense of urgency; discipline and coordination; project plans and schedules; and higher levels of automation and repeatability. The work requires many more people with a wider variety of skill sets, a more rigid environment, and different sorts of metrics.
http://blogs.hbr.org/cs/2013/04/two_departments_for_data_succe.html
If you torture the data long enough, it will confess!
Posted by on April 3, 2013
“If you torture the data long enough, it will confess.” – Ronald Coase
@BigsnarfDude tweet - April 3, 2013
What side you on? Blue Team or Red Team? OSS Security Distros
Posted by on March 29, 2013
REMnux < SIFT Kit < Security Onion < IPCop > Samurai WTF > BackTrack > Kali
Dude where’s my naive bayes?
Posted by on March 28, 2013
A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes’ theorem with strong (naive) independence assumptions. A more descriptive term for the underlying probability model would be “independent feature model”. An overview of statistical classifiers is given in the article on Pattern recognition.
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
https://github.com/bigsnarfdude/machineLearning/blob/master/mason_vs_sklearn_naive_bayes.py
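To make the independence assumption concrete, here is a small multinomial naive Bayes classifier in pure Python with add-one (Laplace) smoothing. It is an illustrative sketch, not the code from the linked script; the `train` and `predict` names and the toy training data are made up.

```python
import math
from collections import Counter, defaultdict

# Toy training set (hypothetical): request tokens labelled robot or human.
TRAIN = [
    (["get", "robots.txt"], "robot"),
    (["head", "index.html"], "robot"),
    (["get", "index.html", "get", "logo.png"], "human"),
    (["post", "login", "get", "style.css"], "human"),
]


def train(docs):
    """Count documents per class and tokens per class."""
    class_counts = Counter()
    token_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        token_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, token_counts, vocab


def predict(model, tokens):
    """Return the class with the highest log posterior for the tokens."""
    class_counts, token_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        # log P(class) + sum of log P(token | class); the sum is where
        # the "naive" independence assumption is applied.
        score = math.log(n_docs / total_docs)
        total_tokens = sum(token_counts[label].values())
        for token in tokens:
            count = token_counts[label][token]
            score += math.log((count + 1) / (total_tokens + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    model = train(TRAIN)
    print(predict(model, ["get", "robots.txt"]))
    print(predict(model, ["post", "login"]))
```

Working in log space avoids underflow when many small probabilities are multiplied, and the add-one smoothing keeps an unseen token from zeroing out a class.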









