BigSnarf blog

Infosec FTW

Monthly Archives: February 2013

Wakari Screenshot

Wakari is Continuum Analytics’ hosted data analysis environment.

Classifying malicious DNS with 22 features using Random Forests presentation

HMAC One Time Passwords with Twilio

Venn explanation of data scientists by Drew

drewconways

PLY parsing presentation picture

Redis, cPickle, iPython, linkedin scraper, parsing data

https://github.com/bigsnarfdude/webscraping/blob/master/linkedin_scaper.py

Building parsers and I didn’t even know it. I was reviewing the search terms that people use to find my blog. I’ve never really understood how to build a parser, but I understood taking an input and doing something with it. I guess I will formally learn what parsing is this week. http://sigusr2.net/2011/Apr/18/parser-combinators-made-simple.html http://www.mollypages.org/page/grammar/index.mp

String Patterns

Finding and specifying classes of strings using regular expressions

Lexical Analysis

Breaking strings down into important words

Grammars

Specifying and deconstructing valid sentences

Parsing

Turning sentences into trees
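
To make the four stages above concrete, here is a minimal sketch in plain Python. The tiny arithmetic grammar, the token names and the functions tokenize and parse are my own assumptions for illustration, not taken from the course or from PLY. Regular expressions describe the string patterns, tokenize does the lexical analysis, the grammar is written out in the comments, and parse turns the token list into a tree of nested tuples.

```python
import re

# String patterns: each token class is described by a regular expression.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("OP", r"[+\-*/()]"),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Lexical analysis: break the input string into (kind, value) tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        if match.lastgroup != "SKIP":  # whitespace is not an important word
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


# Grammar (hypothetical, for illustration):
#   expr   := term (('+' | '-') term)*
#   term   := factor (('*' | '/') factor)*
#   factor := NUMBER | '(' expr ')'
def parse(tokens):
    """Parsing: turn a token list into a tree of nested tuples."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else (None, None)

    def advance():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def expr():
        node = term()
        while peek()[1] in ("+", "-"):
            op = advance()[1]
            node = (op, node, term())
        return node

    def term():
        node = factor()
        while peek()[1] in ("*", "/"):
            op = advance()[1]
            node = (op, node, factor())
        return node

    def factor():
        kind, value = peek()
        if kind == "NUMBER":
            advance()
            return int(value)
        if value == "(":
            advance()
            node = expr()
            if peek()[1] != ")":
                raise SyntaxError("expected ')'")
            advance()
            return node
        raise SyntaxError(f"unexpected token {value!r}")

    tree = expr()
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return tree
```

Calling parse(tokenize("1 + 2 * 3")) gives a tree where the multiplication sits below the addition, which is the precedence the grammar encodes.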

http://www.youtube.com/watch?v=6TmNX1ZON6k&list=ECBF6FC32358457242

Parser_Flow

Other Python Parsing Tools

Check out these links for more information about parsing in Python:

Using Counts for Data Analysis – examples

Probabilistic Data Structures for Data Analytics

probabilistic-sizes

Reading this blog and about the tech behind it, it looks like my experiments with Bloom filters, murmurhash3, IPython Notebook and Redis will come together nicely.

Wikipedia says that in computer science, streaming algorithms are algorithms for processing data streams in which the input is presented as a sequence of items and can be examined in only a few passes (typically just one). These algorithms have limited memory available to them (much less than the input size) and also limited processing time per item.

These constraints may mean that an algorithm produces an approximate answer based on a summary or “sketch” of the data stream in memory.
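
As one example of such a sketch, here is a rough Count-Min sketch in Python. It keeps approximate per-item counts in a fixed-size table, so memory does not grow with the stream. The class name, the width and depth defaults, and the use of SHA-1 from hashlib as a stand-in for murmurhash3 are all assumptions made for illustration.

```python
import hashlib


class CountMinSketch:
    """Approximate counter: estimates may be too high, never too low."""

    def __init__(self, width=1000, depth=5):
        # width and depth are arbitrary illustrative sizes.
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _indexes(self, item):
        # One hash per row; the row number is mixed in as a salt.
        # SHA-1 stands in for a faster hash such as murmurhash3.
        for row in range(self.depth):
            digest = hashlib.sha1(f"{row}:{item}".encode("utf-8")).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item, count=1):
        for row, col in self._indexes(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only inflate cells, so the minimum is the best guess.
        return min(self.table[row][col] for row, col in self._indexes(item))
```

Each item is seen once and then thrown away; only the table stays in memory, which is the trade of exactness for a bounded summary described above.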

Use Cases for monitoring counts on anything and for network monitoring

  • Network Login counts
  • Failed attempts per user
  • Failed attempts per group
  • Failed attempts per role
  • Success counts for above
  • Passwords reset volumes per day, month, year
  • Counts for credentials per person
  • Password age
  • Password change day counts
  • Password lengths
  • Overall counts of user accounts issued
  • Time elapsed for provision
  • Time elapsed for decommission
  • Time elapsed for authorization for changes
  • Number of privileged accounts per person
  • Infection counts per user
  • Infection counts per machine
  • Infection counts per IP
  • New account provisioning counts per hour, day, week, month, year
  • Success and failure counts for each IP per user
  • Counts of login devices
  • Counts of unique login destinations
  • Packet Counts
  • Port Counts
  • DNS request counts per host
  • DNS requests overall
  • DNS requests to internal devices
  • DNS requests for each device
  • Per device aggregation of all types of traffic
  • Increase in DNS requests per second relative to the average number of DNS requests per second
  • DHCP request counts
  • Segment DHCP counts for lease requests
  • Availability
  • Packet Delay
  • Packet Reordering
  • Packet Loss
  • Packet Inter-arrival Jitter
  • Types of packets counters for each host
  • Bandwidth Measurements (Capacity, Achievable Throughputs)
  • Counts for twitter per user
  • Counts of tweets from user to user
  • Counts of uses of words in tweets
  • Counts of uses of hashtag in tweets
  • Counts of uses of any word or hashtag from specific locations
  • Device counts
  • Software counts
  • Application patch level counts
  • Active user counts
  • Inactive user counts
  • Remote login per country counts
  • Remote login per IP address counts
  • Website visit counts per user
  • Email counts
  • Email attachment counts
  • SPAM counts
  • Statistics for developer
  • Stats on access per application, IP address, service, user
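
Several of the counts in the list above can be tried out with exact counting before reaching for probabilistic structures. The sketch below uses collections.Counter from the standard library; the event tuples, user names and IP addresses are made-up sample data, not real logs.

```python
from collections import Counter

# Hypothetical login events: (user, source IP, outcome).
events = [
    ("alice", "10.0.0.5", "fail"),
    ("alice", "10.0.0.5", "fail"),
    ("alice", "10.0.0.5", "success"),
    ("bob", "10.0.0.9", "fail"),
    ("bob", "10.0.0.7", "success"),
]

# Failed attempts per user.
failed_per_user = Counter(user for user, ip, outcome in events if outcome == "fail")

# Success and failure counts for each IP per user.
outcome_per_ip_user = Counter((ip, user, outcome) for user, ip, outcome in events)

# Number of unique source IPs seen per user.
unique_ips_per_user = {
    user: len({ip for u, ip, _ in events if u == user})
    for user in {u for u, _, _ in events}
}
```

Exact counters like these grow with the number of distinct keys, which is the cost the probabilistic structures are meant to cap.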

Proposed Implementation

The proposed architecture of an example real-time processing and monitoring solution would consist of two modules: the online streaming module and the statistical estimation module. The online streaming module is updated upon each packet arrival. Real-time tracking of summary information in network traffic is crucial for many network functions such as network monitoring and traffic engineering.

Threshold Analysis

This proposed system will concentrate on two types of threshold analysis:

1) instant thresholds

2) time series thresholds
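
The post does not define the two threshold types further, so the sketch below is one possible reading: an instant threshold compares a single value against a fixed limit, and a time series threshold compares the newest value against the average of a recent window. The function and class names, the window size and the ratio are all assumptions.

```python
from collections import deque
from statistics import mean


def instant_threshold(value, limit):
    """Instant threshold: alert as soon as one reading exceeds a fixed limit."""
    return value > limit


class TimeSeriesThreshold:
    """Alert when a reading exceeds `ratio` times the recent moving average."""

    def __init__(self, window=5, ratio=3.0):
        self.history = deque(maxlen=window)  # only the last `window` readings
        self.ratio = ratio

    def check(self, value):
        # No alert until there is at least one earlier reading to compare to.
        alert = bool(self.history) and value > self.ratio * mean(self.history)
        self.history.append(value)
        return alert
```

For example, feeding DNS requests per second into TimeSeriesThreshold flags a sudden jump relative to the recent average even when the absolute number stays under any fixed limit.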

*****************************************************************************************

Here is the blog: http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/

Moar reading: http://www.addthis.com/blog/2012/03/26/probabilistic-counting/#.URTELlrjn_W

DNS Metrics Security: http://www.gcsec.org/sites/default/files/doc/D3%20DNS%20Metric%20Use%20Cases.pdf

Network: http://www.bell-labs.com/user/erranlli/publications/cardInfocom09.pdf

Rationale: http://www.liquidmatrix.org/blog/2012/02/21/we-are-losing/

Redis backed bitmaps: http://blog.getspool.com/2011/11/29/fast-easy-realtime-metrics-using-redis-bitmaps/

Redis Ruby HyperLogLog Github: https://github.com/aaw/hyperloglog-redis

HLL and DB: http://blog.aggregateknowledge.com/2013/02/04/open-source-release-postgresql-hll/

HLL visualization: http://www.aggregateknowledge.com/science/blog/hll.html

Clearspring: https://github.com/clearspring/stream-lib

Python Bayes and Probs: https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers

All-in-one StatsD etc: https://github.com/monigusto/vagrant-monigusto

http://nbviewer.ipython.org/github/ctb/2013-pycon-awesome-big-data-algorithms/blob/master/03-hyper-log-log-counter.ipynb

Cloudera Impala for Real Time Queries in Hadoop

Scalable Netflow Analysis with Hadoop

http://www.cert.org/flocon/2013/presentations/lee-yeonhee-scalable-netflow-analysis-hadoop.pdf

Apparently the authors of the presentation above have a patent pending?

Abstract: The present invention relates to a packet analysis system and method, which enables cluster nodes to process in parallel a large quantity of packets collected in a network in an open source distribution system called Hadoop. The packet analysis system based on a Hadoop framework includes a first module for distributing and storing packet traces in a distributed file system, a second module for distributing and processing the packet traces stored in the distributed file system in a cluster of nodes executing Hadoop using a MapReduce method, and a third module for transferring the packet traces, stored in the distributed file system, to the second module so that the packet traces can be processed using the MapReduce method and outputting a result of analysis, calculated by the second module using the MapReduce method, to the distributed file system.
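
To get a feel for the MapReduce step the abstract describes, here is a toy version in plain Python. It does not use Hadoop or any Hadoop API, and it is not the authors' method. The record layout (source, destination, bytes) and the function names are assumptions for illustration: the mapper emits one key/value pair per packet record, the pairs are grouped by key, and the reducer sums the bytes per source.

```python
from collections import defaultdict


def map_packet(record):
    """Map step: emit (source address, bytes) for one packet record."""
    src, dst, size = record
    yield src, size


def reduce_bytes(key, values):
    """Reduce step: total bytes for one source address."""
    return key, sum(values)


def run_mapreduce(records, mapper, reducer):
    """In-memory stand-in for the shuffle that groups mapper output by key."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


# Hypothetical packet trace records: (source, destination, bytes).
trace = [
    ("10.0.0.1", "10.0.0.9", 1000),
    ("10.0.0.2", "10.0.0.9", 40),
    ("10.0.0.1", "10.0.0.8", 500),
]
bytes_per_source = run_mapreduce(trace, map_packet, reduce_bytes)
```

On a cluster the mapper and reducer would run on many nodes over trace files in the distributed file system; here everything runs in one process just to show the shape of the computation.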