BigSnarf blog

Infosec FTW

Using association rules to build patterns of attacks and normal traffic

In data mining, association rule learning (ARL) is a well-researched method for discovering interesting relationships between variables in large datasets. A typical use of ARL is analyzing transaction data recorded by point-of-sale (POS) systems. Look at the rule {onions, potatoes} => {burgers}. A rule like this, found in the sales data of a supermarket, indicates that customers who buy onions and potatoes together are also likely to buy burgers. Such information can be used for product placement. A store could also put onions on sale in the hope of selling more potatoes and burgers.
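To make "likely" concrete, rules are usually scored with support, confidence, and lift. Here is a minimal sketch of those three measures for the onions and potatoes rule. The baskets are invented for illustration and are not real sales data, and the helper names are my own.

```python
# Toy baskets, invented for illustration (not real supermarket data).
transactions = [
    {"onions", "potatoes", "burgers"},
    {"onions", "potatoes", "burgers", "milk"},
    {"onions", "potatoes"},
    {"milk", "bread"},
    {"potatoes", "burgers"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule is undefined
    both = support(set(antecedent) | set(consequent), transactions)
    return both / base


def lift(antecedent, consequent, transactions):
    """Confidence relative to how common the consequent is on its own."""
    base = support(consequent, transactions)
    if base == 0:
        return 0.0
    return confidence(antecedent, consequent, transactions) / base


antecedent, consequent = {"onions", "potatoes"}, {"burgers"}
rule_support = support(antecedent | consequent, transactions)
rule_confidence = confidence(antecedent, consequent, transactions)
rule_lift = lift(antecedent, consequent, transactions)

print(f"support={rule_support:.2f} "
      f"confidence={rule_confidence:.2f} lift={rule_lift:.2f}")
```

In this toy data, two of the three baskets containing onions and potatoes also contain burgers, so the confidence is 2/3. A lift above 1 suggests the items appear together more often than chance alone would predict.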

So if we can notice patterns in sales data, why don’t we do the same for network attacks and normal traffic? Maybe we start by logging and measuring everything for security metrics and recording trends? Does someone have a compendium or list of association rules already made up for Infosec?

Creating association rules for noisy, high-volume DoS and DDoS attacks is trivial. The problem becomes very difficult when there are many packets, many systems, many devices, many applications, and many people driving those actions. Finding an attacker on your network becomes overwhelming. I have more questions than answers at this point.
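One possible starting point is to treat each logged connection the way a supermarket treats a basket: as a set of items. This is only a rough sketch. The flow records, field names, and flag labels below are hypothetical, and a real log source would need its own mapping.

```python
from collections import Counter
from itertools import combinations

# Hypothetical flow records; fields and values are invented for illustration.
# "S0" stands for "attempt seen, no reply" and "SF" for "normal" in this toy data.
flows = [
    {"proto": "tcp", "dst_port": 22, "flag": "S0", "bytes": 0},
    {"proto": "tcp", "dst_port": 22, "flag": "S0", "bytes": 0},
    {"proto": "tcp", "dst_port": 22, "flag": "S0", "bytes": 0},
    {"proto": "tcp", "dst_port": 80, "flag": "SF", "bytes": 5120},
    {"proto": "tcp", "dst_port": 443, "flag": "SF", "bytes": 20480},
    {"proto": "udp", "dst_port": 53, "flag": "SF", "bytes": 120},
]


def to_items(flow):
    """Turn one flow record into a set of 'attribute=value' items."""
    size = "bytes=zero" if flow["bytes"] == 0 else "bytes=nonzero"
    return frozenset({
        f"proto={flow['proto']}",
        f"dst_port={flow['dst_port']}",
        f"flag={flow['flag']}",
        size,
    })


# Each flow becomes one "basket" of items.
transactions = [to_items(flow) for flow in flows]

# Count how often each pair of items appears in the same flow.
pair_counts = Counter(
    pair
    for items in transactions
    for pair in combinations(sorted(items), 2)
)

for pair, count in pair_counts.most_common(3):
    print(pair, count)
```

Numeric fields such as byte counts have to be bucketed into a few categories first, because association rule mining works on discrete items. How to choose those buckets is one of the open questions.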

Introducing Apache Mahout – Association Rule Algorithm:

Machine learning is used in everything from game playing to fraud detection to stock-market analysis. It’s used to build systems like those at Netflix and Amazon that recommend products to us. It can also be used to categorize Web pages and spam email messages, and to detect attacks in an IDS. Association rules have been patched into Mahout as an Apriori algorithm leveraging Hadoop and MapReduce. Apriori is a very popular algorithm for mining association rules. The Mahout ARL algorithm was born out of research in this paper:
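Separately from Mahout, the core idea behind Apriori fits in a few lines: an itemset can only be frequent if all of its subsets are frequent, so candidates are built level by level and pruned early. The sketch below is a single-machine, pure-Python illustration of that idea, not the Mahout patch or its MapReduce implementation. The toy baskets and the `min_count` threshold are assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset: count} for every itemset seen in at least min_count transactions."""
    transactions = [frozenset(t) for t in transactions]

    def count_frequent(candidates):
        # Keep only the candidates that meet the minimum count.
        result = {}
        for candidate in candidates:
            count = sum(candidate <= t for t in transactions)
            if count >= min_count:
                result[candidate] = count
        return result

    # Level 1: frequent single items.
    singles = {frozenset([item]) for t in transactions for item in t}
    level = count_frequent(singles)
    all_frequent = dict(level)

    k = 2
    while level:
        previous = set(level)
        # Join step: combine frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in previous for b in previous if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in previous for s in combinations(c, k - 1))
        }
        level = count_frequent(candidates)
        all_frequent.update(level)
        k += 1
    return all_frequent


# Toy baskets, invented for illustration.
baskets = [
    {"onions", "potatoes", "burgers"},
    {"onions", "potatoes", "burgers", "milk"},
    {"onions", "potatoes"},
    {"milk", "bread"},
    {"potatoes", "burgers"},
]

result = apriori(baskets, min_count=2)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Rules are then read off the frequent itemsets by splitting each one into an antecedent and a consequent and keeping the splits whose confidence is high enough.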

Another possible solution for an open source reliable association rules algorithm:

RapidMiner, formerly YALE (Yet Another Learning Environment), is an environment for machine learning, data mining, text mining, predictive analytics, and business analytics. It is used for research, education, training, rapid prototyping, application development, and industrial applications. In a poll by KDnuggets, a data-mining newspaper, RapidMiner ranked second in data mining/analytic tools used for real projects in 2009 and was first in 2010.  It is distributed under the AGPL open source license and has been hosted by SourceForge since 2004. Source:

Examples of data sets:
