BigSnarf blog

Infosec FTW

Probabilistic Data Structures for Data Analytics

After reading the blog post linked below and about the tech behind it, it looks like my experiments with Bloom filters, murmurhash3, IPython Notebook and Redis will come together nicely.

Wikipedia says that in computer science, streaming algorithms are algorithms for processing data streams in which the input is presented as a sequence of items and can be examined in only a few passes (typically just one). These algorithms have limited memory available to them (much less than the input size) and also limited processing time per item.

These constraints may mean that an algorithm produces an approximate answer based on a summary or “sketch” of the data stream in memory.
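
To make the "sketch" idea concrete, here is a minimal Count-Min sketch in Python. It is only an illustration under my own assumptions: the class name, the width and depth values, and the use of salted SHA-1 from hashlib (instead of murmurhash3) are all choices made for this example, not something prescribed by the reading. The point is that memory stays fixed no matter how many items go through, and the answer is approximate: it can overcount because of collisions, but it never undercounts.

```python
import hashlib


class CountMinSketch:
    """Fixed-size frequency summary; estimates never undercount."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _indexes(self, item):
        # One column per row, derived from a salted SHA-1 digest
        # (an assumption for this sketch; murmurhash3 would be faster).
        for row in range(self.depth):
            digest = hashlib.sha1(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._indexes(item):
            self.rows[row][col] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the best guess.
        return min(self.rows[row][col] for row, col in self._indexes(item))


# Hypothetical stream of failed login events, one user name per event.
failed_logins = CountMinSketch()
for user in ["alice", "bob", "alice", "carol", "alice"]:
    failed_logins.add(user)

print(failed_logins.estimate("alice"))  # at least 3, the true count
```

The memory used is width times depth counters, whatever the number of users, which is what makes this kind of structure attractive for the per-user and per-IP counts listed further down.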

Use Cases for monitoring counts on anything and for network monitoring

  • Network Login counts
  • Failed attempts per user
  • Failed attempts per group
  • Failed attempts per role
  • Success counts for above
  • Passwords reset volumes per day, month, year
  • Counts for credentials per person
  • Password age
  • Password change day counts
  • Password lengths
  • Overall counts of user accounts issued
  • Time elapsed for provision
  • Time elapsed for decommission
  • Time elapsed for authorization for changes
  • Number of privileged accounts per person
  • Infection counts per user
  • Infection counts per machine
  • Infection counts per IP
  • New account provisioning counts per hour, day, week, month, year
  • Success and failure counts for each IP per user
  • Counts of login devices
  • Counts of unique login destinations
  • Packet Counts
  • Port Counts
  • DNS request counts per host
  • DNS request counts overall
  • DNS request to internal devices
  • DNS request for each device
  • Per device aggregation of all types of traffic
  • Comparing the increase of the number of DNS requests per second with respect to the average number of DNS requests per second
  • DHCP request counts
  • Segment DHCP counts for lease requests
  • Availability
  • Packet Delay
  • Packet Reordering
  • Packet Loss
  • Packet Inter-arrival Jitter
  • Types of packets counters for each host
  • Bandwidth Measurements (Capacity, Achievable Throughputs)
  • Counts for twitter per user
  • Counts of tweets from user to user
  • Counts of uses of words in tweets
  • Counts of uses of hashtag in tweets
  • Counts of uses of any word or hashtag from specific locations
  • Device counts
  • Software counts
  • Application patch level counts
  • Active user counts
  • Inactive user counts
  • Remote login per country counts
  • Remote login per IP address counts
  • Website visit counts per user
  • Email counts
  • Email attachment counts
  • SPAM counts
  • Statistics for developer
  • Stats on access per application, IP address, service, user
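
Several of the items above are really "how many distinct things" questions (unique login destinations, devices, active users). One way to answer those in fixed memory is linear counting over a bitmap, which is the same idea as the Redis bitmap approach in the links at the end. The sketch below is an assumption-laden illustration: the class name, the 4096-bit size and the SHA-1 hash are mine, and the host names are made up.

```python
import hashlib
import math


class LinearCounter:
    """Estimate the number of distinct items with a fixed-size bitmap."""

    def __init__(self, size_bits=4096):
        self.size_bits = size_bits
        self.bits = bytearray((size_bits + 7) // 8)

    def add(self, item):
        # Hash the item to one bit position and set that bit.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        pos = int.from_bytes(digest[:8], "big") % self.size_bits
        self.bits[pos // 8] |= 1 << (pos % 8)

    def estimate(self):
        set_bits = sum(bin(byte).count("1") for byte in self.bits)
        zero_bits = self.size_bits - set_bits
        if zero_bits == 0:
            return float("inf")  # bitmap saturated, pick a larger size
        # Linear counting estimator: n ~ -m * ln(fraction of zero bits)
        return -self.size_bits * math.log(zero_bits / self.size_bits)


# Hypothetical login stream: 5000 events hitting 1000 distinct destinations.
destinations = LinearCounter()
for i in range(5000):
    destinations.add(f"host-{i % 1000}")

print(round(destinations.estimate()))  # close to 1000, not exact
```

The bitmap here is 512 bytes. Repeated events for the same destination set the same bit, so duplicates cost nothing, and the estimate degrades only as the bitmap fills up.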

Proposed Implementation

The proposed architecture of an example real-time processing and monitoring solution would consist of two modules: the online streaming module and the statistical estimation module. The online streaming module is updated upon each packet arrival. Real-time tracking of summary information in network traffic is crucial for many network functions such as network monitoring and traffic engineering.
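
Here is a rough sketch of how the two modules could be split. Everything specific in it is an assumption for illustration: the class names, the choice of packet size as the tracked quantity, and the use of Welford's running mean and variance as the summary. The idea it shows is that the streaming side does a constant amount of work per packet and keeps constant state, while the estimation side reads that state whenever a statistic is needed.

```python
import math


class OnlineStreamingModule:
    """Updated once per packet; keeps O(1) state however long the stream."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations (Welford)

    def update(self, packet_size):
        self.count += 1
        delta = packet_size - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (packet_size - self.mean)


class StatisticalEstimationModule:
    """Reads the streaming summary on demand and derives statistics."""

    def __init__(self, stream):
        self.stream = stream

    def variance(self):
        # Sample variance; undefined for fewer than two packets, so return 0.
        if self.stream.count < 2:
            return 0.0
        return self.stream.m2 / (self.stream.count - 1)

    def stddev(self):
        return math.sqrt(self.variance())


# Hypothetical packet sizes in bytes, fed one at a time.
sizes = [60, 1500, 576, 60, 1500]
stream = OnlineStreamingModule()
for size in sizes:
    stream.update(size)

estimator = StatisticalEstimationModule(stream)
print(stream.count, stream.mean, estimator.stddev())
```

The same split would work with a probabilistic structure in place of the running mean: the streaming module owns the updates, the estimation module owns the queries.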

Threshold Analysis

This proposed system will concentrate on two types of threshold analysis:

1) instant thresholds

2) time series thresholds
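
The two threshold types could look something like the sketch below. The names, the window of 5 readings, the factor of 3 and the sample numbers are all assumptions for illustration. An instant threshold compares one reading against a fixed limit; a time series threshold compares each reading against the average of the recent readings, as in the "DNS requests per second versus the average" item in the use case list.

```python
from collections import deque


def instant_alert(value, limit):
    """Instant threshold: fire when a single reading exceeds a fixed limit."""
    return value > limit


class TimeSeriesThreshold:
    """Time series threshold: compare each reading with the recent average."""

    def __init__(self, window=5, factor=3.0):
        self.history = deque(maxlen=window)
        self.factor = factor

    def check(self, value):
        # Only judge once the window is full, so the baseline is meaningful.
        alert = False
        if len(self.history) == self.history.maxlen:
            baseline = sum(self.history) / len(self.history)
            alert = value > self.factor * baseline
        self.history.append(value)
        return alert


# Hypothetical DNS requests per second, with one spike.
dns_per_second = [100, 110, 90, 105, 95, 98, 450, 102]
detector = TimeSeriesThreshold(window=5, factor=3.0)
alerts = [detector.check(v) for v in dns_per_second]

print(alerts)                   # only the 450 reading is flagged
print(instant_alert(450, 400))  # True
```

Note that in this simple version the spike itself enters the window and raises the baseline for the next few readings; whether that is acceptable depends on how the alerts are used.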

*****************************************************************************************

Here is the blog: http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/

Moar reading: http://www.addthis.com/blog/2012/03/26/probabilistic-counting/#.URTELlrjn_W

DNS Metrics Security: http://www.gcsec.org/sites/default/files/doc/D3%20DNS%20Metric%20Use%20Cases.pdf

Network: http://www.bell-labs.com/user/erranlli/publications/cardInfocom09.pdf

Rationale: http://www.liquidmatrix.org/blog/2012/02/21/we-are-losing/

Redis backed bitmaps: http://blog.getspool.com/2011/11/29/fast-easy-realtime-metrics-using-redis-bitmaps/

Redis Ruby HyperLogLog Github: https://github.com/aaw/hyperloglog-redis

HLL and DB: http://blog.aggregateknowledge.com/2013/02/04/open-source-release-postgresql-hll/

HLL visualization: http://www.aggregateknowledge.com/science/blog/hll.html

Clearspring: https://github.com/clearspring/stream-lib

Python Bayes and Probs: https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers

All-in-one StatsD etc: https://github.com/monigusto/vagrant-monigusto

http://nbviewer.ipython.org/github/ctb/2013-pycon-awesome-big-data-algorithms/blob/master/03-hyper-log-log-counter.ipynb
