BigSnarf blog

Infosec FTW

Python implementation of Hyperloglog, redis, fuzzy hashing for malware detection


I was thinking there must be a way to use Hyperloglog with fuzzy hash sets of malware rolling over blocks of 32 or 64 bytes. Stick that into a Redis cluster for persistence of objects for full analysis against rolling hashes of sample malware hash sets.  There would be some error but it would give you a quick answer if a sample binary was a fuzzy match for an identified malware hash set in your datastore.  You could also use this to identify copies of a “Top Secret” document on various systems. I’m having a brain block this morning on determining the inclusion and exclusion comparisons. (Either need more coffee or sleep)

Update: I guess simple lists and sets work directly out of the box for redis. No need for HLL yet. Plus I’m still stuck with trying to figure out intersections. I’m not sure what 1,000,000 item list looks like in memory for set comparisons, but I think they will get garbage collected.

Here’s my code:

Determining Inclusion and Exclusion – WIP


Fuzzy Hashing

Redis Stuff 

WIP (need to build using sliding window hash aka. “rolling hash”)

>>> r = redis.Redis(...)
>>> r.set('bing', 'baz')
>>> # Use the pipeline() method to create a pipeline instance
>>> pipe = r.pipeline()
>>> # The following SET commands are buffered
>>> pipe.set('foo', 'bar')
>>> pipe.get('bing')
>>> # the EXECUTE call sends all buffered commands to the server, returning
>>> # a list of responses, one for each command.
>>> pipe.execute()
[True, 'baz']

One response to “Python implementation of Hyperloglog, redis, fuzzy hashing for malware detection

  1. timonk March 21, 2013 at 6:30 pm

    Thanks for the link to the AK blog! We have a few more things that we’ve written about that might be useful. I cover HLL intersections (but the principle works for any sketch that supports a “lossless union” like HLL does), here and we’ve also built an HLL implementation in Postgres if you wanted to use it in a more feature-rich environment for querying, here

    In general, HLL could be really useful for this, since I imagine the documents you’re comparing are of similar size (within an order of magnitude). Aaron Windsor’s HLL implementation in Redis is great and the README has an interesting section about intersection error, which in practice is correct. I’m working on a more rigorous theoretical explanation for why it is correct. If you just want decent heuristics, my post has those.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: