BigSnarf blog

Infosec FTW

RateMyView



Clustering photos for labels

  1. Histograms of RGB
  2. KMeans Histograms
  3. Autoencoder KMeans
  4. Unsupervised Deep Embedding for Clustering Analysis (DEC)

I have 50,000 photos that I would like to label for training a classifier. I guess I could represent each image by raw pixels or RGB values, but how do I divide them into K groups in terms of inherent latent semantics? Solutions 1, 2 and 3 above cover the usual approaches.

The traditional way is to first extract feature vectors according to domain-specific knowledge and then run a clustering algorithm on the extracted features.
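For example, a minimal sketch of that traditional route (solutions 1 and 2 above), assuming scikit-learn and Pillow are available; photo_paths, the bin count and the cluster count are placeholder choices for illustration:

```python
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

def rgb_histogram(path, bins=8):
    # concatenated per-channel histograms, each normalized to sum to 1
    img = np.asarray(Image.open(path).convert("RGB"))
    feats = []
    for channel in range(3):
        hist, _ = np.histogram(img[..., channel], bins=bins, range=(0, 256))
        feats.append(hist / max(hist.sum(), 1))
    return np.concatenate(feats)

# photo_paths = [...]  # the 50,000 image file paths
# features = np.stack([rgb_histogram(p) for p in photo_paths])
# labels = KMeans(n_clusters=10, n_init=10).fit_predict(features)
```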

My colleague said I should use deep learning, so I researched it and found DEC, a unified framework that can directly cluster images with linear performance. This new category of clustering algorithms that use deep learning is typically called Deep Clustering.

From the paper:

Clustering is central to many data-driven application domains and has been studied extensively in terms of distance functions and grouping algorithms. Relatively little work has focused on learning representations for clustering. In this paper, we propose Deep Embedded Clustering (DEC), a method that simultaneously learns feature representations and cluster assignments using deep neural networks. DEC learns a mapping from the data space to a lower-dimensional feature space in which it iteratively optimizes a clustering objective. Our experimental evaluations on image and text corpora show significant improvement over state-of-the-art methods.

https://arxiv.org/pdf/1511.06335.pdf
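The core iteration in the paper alternates between a Student's-t soft assignment and a sharpened target distribution, minimized under a KL loss. A minimal NumPy sketch of those two quantities, assuming the encoder output z and the cluster centres mu are already available:

```python
import numpy as np

def soft_assignments(z, mu, alpha=1.0):
    # q_ij: Student's t similarity between embedded point z_i and centroid mu_j
    d2 = np.sum((z[:, None, :] - mu[None, :, :]) ** 2, axis=2)   # (n, k) squared distances
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    # p_ij: square q, normalize by cluster frequency, then renormalize per sample
    weight = q ** 2 / q.sum(axis=0)
    return weight / weight.sum(axis=1, keepdims=True)

# DEC then minimizes KL(P || Q) while backpropagating through the encoder and the centroids.
```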


https://github.com/fferroni/DEC-Keras/blob/master/keras_dec.py

https://github.com/XifengGuo/DEC-keras/blob/master/DEC.py

https://xifengguo.github.io/papers/ICONIP17-DCEC.pdf

https://arxiv.org/abs/1709.08374

https://github.com/panji1990/Deep-subspace-clustering-networks

Visual Vocabulary

Designing with data

There are so many ways to visualize data – how do we know which one to pick? Use the categories across the top to decide which data relationship is most important in your story, then look at the different types of chart within the category to form some initial ideas about what might work best. This list is not meant to be exhaustive, nor a wizard, but is a useful starting point for making informative and meaningful data visualizations.

 

https://github.com/ft-interactive/chart-doctor/tree/master/visual-vocabulary

Kaggle Vanity

Blade Runner Principle


Malware ↔ Detector

Generator ↔ Discriminator

One network generates candidates and the other evaluates them. Typically, the generative network learns to map from a latent space to a particular data distribution of interest (benignware), while the discriminative network discriminates between instances from the true data distribution and candidates produced by the generator. The generative network’s training objective is to increase the error rate of the discriminative network (i.e., “fool” the discriminator network by producing novel synthesized instances that appear to have come from the true data distribution).
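A toy PyTorch sketch of that two-player game, with a 1-D Gaussian standing in for the "true" distribution (benignware in the analogy above); the architecture and hyperparameters are arbitrary illustrations, not anything from a paper:

```python
import torch
import torch.nn as nn

latent_dim = 8
G = nn.Sequential(nn.Linear(latent_dim, 16), nn.ReLU(), nn.Linear(16, 1))
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    real = torch.randn(64, 1) * 1.25 + 4.0          # samples from the "true" distribution
    fake = G(torch.randn(64, latent_dim))           # generator maps latent noise to candidates

    # Discriminator: push D(real) toward 1 and D(fake) toward 0
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator: make D label the fakes as real, i.e. raise the discriminator's error rate
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```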

 

Adversarial stuff

malware2vec experiments query and answer


The proposed method converts the strings and opcode sequences extracted from the malware into vectors and calculates the similarities between those vectors. In addition, we apply the proposed method to the execution traces extracted through dynamic analysis, so that malware employing detection-avoidance techniques such as obfuscation and packing can be analyzed. Instructions and instruction frequencies can be modeled as vectors. Call sequences can be modeled. PE sections, DLLs and opcode statistics can be modeled as bag-of-words vectors. File names, system calls and APIs can be vectorized.
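A rough sketch of one such vectorization, using gensim's word2vec (the motivation link below) over toy opcode sequences and averaging the opcode embeddings into one vector per sample; the sequences, dimensions and hyperparameters are made up for illustration:

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.metrics.pairwise import cosine_similarity

# toy corpus: each "sentence" is the opcode sequence of one sample
opcode_sequences = [
    ["push", "mov", "call", "test", "jz", "ret"],
    ["push", "mov", "xor", "call", "ret"],
    ["nop", "nop", "jmp", "int3"],
]

# skip-gram embeddings over opcodes (gensim >= 4 API: vector_size/epochs)
model = Word2Vec(opcode_sequences, vector_size=32, window=3, min_count=1, sg=1, epochs=50)

def sample_vector(seq):
    # represent a sample as the mean of its opcode embeddings
    return np.mean([model.wv[op] for op in seq], axis=0)

vectors = np.stack([sample_vector(s) for s in opcode_sequences])
print(cosine_similarity(vectors))  # pairwise sample similarities
```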

Motivation: https://code.google.com/archive/p/word2vec/

https://arxiv.org/pdf/1801.02950.pdf

http://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html

https://cs224d.stanford.edu/lecture_notes/notes1.pdf

https://arxiv.org/pdf/1709.07470.pdf

Entropy-based analysis and testing of malware

HMM-based analysis and testing for malware detection

http://www.mecs-press.org/ijisa/ijisa-v8-n4/IJISA-V8-N4-2.pdf

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.645.9508&rep=rep1&type=pdf

http://ieeexplore.ieee.org/document/7275913/
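On the entropy-based analysis noted above: the usual starting point is Shannon entropy over raw bytes, often in sliding windows so that packed or encrypted regions stand out. A minimal sketch; the window and step sizes are arbitrary choices:

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0.0 = constant, 8.0 = uniform random)."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def sliding_window_entropy(data: bytes, window=256, step=128):
    """Entropy profile over overlapping windows; packed/encrypted regions tend toward 8."""
    return [shannon_entropy(data[i:i + window])
            for i in range(0, max(1, len(data) - window + 1), step)]

# with open("sample.exe", "rb") as f:      # hypothetical input file
#     profile = sliding_window_entropy(f.read())
```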

 

  • Static Malware Analysis
  • Dynamic Malware Analysis

https://dl.acm.org/citation.cfm?id=1007518

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.103.29&rep=rep1&type=pdf

https://jon.thackray.org/biochem/dna.html

https://arxiv.org/pdf/1104.3229.pdf

We present a novel system for automatically discovering and interactively visualizing shared system call sequence relationships within large malware datasets. Our system’s pipeline begins with the application of a novel heuristic algorithm for extracting variable length, semantically meaningful system call sequences from malware system call behavior logs. Then, based on the occurrence of these semantic sequences, we construct a Boolean vector representation of the malware sample corpus. Finally we compute Jaccard indices pairwise over sample vectors to obtain a sample similarity matrix.
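The last step of that pipeline is straightforward to prototype. A minimal SciPy sketch, using a tiny made-up Boolean occurrence matrix in place of the real sample corpus:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# rows are malware samples, columns are extracted semantic system-call sequences;
# True means the sequence occurs in that sample (toy data for illustration)
occurrence = np.array([
    [True,  True,  False, True ],
    [True,  False, False, True ],
    [False, True,  True,  False],
], dtype=bool)

# pdist with the 'jaccard' metric returns Jaccard distance; similarity = 1 - distance
similarity = 1.0 - squareform(pdist(occurrence, metric="jaccard"))
print(similarity)  # sample-by-sample similarity matrix
```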

Stacked DAE for malware https://arxiv.org/abs/1711.08336

https://github.com/jivoi/awesome-ml-for-cybersecurity

Automatic malware signature generation and classification. The method uses a deep stack of denoising autoencoders to generate an invariant, compact representation of the malware behavior. While conventional signature- and token-based methods for malware detection do not detect a majority of new variants of existing malware, the results presented in this paper show that signatures generated by the DBN allow for an accurate classification of new malware variants.

https://github.com/yuvalapidot/DeepSign—Deep-Learning-algorithm/tree/master/dl
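A minimal Keras sketch in the spirit of that approach: a denoising autoencoder over binary behaviour vectors, with dropout as the input corruption and a small code layer reused as the "signature". The input size, code size and layer widths here are placeholder choices, not the paper's configuration:

```python
from tensorflow import keras
from tensorflow.keras import layers

input_dim, code_dim = 5000, 30            # assumed sizes for illustration

inputs = keras.Input(shape=(input_dim,))
corrupted = layers.Dropout(0.2)(inputs)   # denoising: randomly drop input bits during training
encoded = layers.Dense(512, activation="relu")(corrupted)
encoded = layers.Dense(code_dim, activation="relu", name="code")(encoded)
decoded = layers.Dense(512, activation="relu")(encoded)
outputs = layers.Dense(input_dim, activation="sigmoid")(decoded)

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")

# x = ...  # behaviour vectors, shape (n_samples, input_dim)
# autoencoder.fit(x, x, epochs=10, batch_size=128)          # reconstruct the clean input
# encoder = keras.Model(inputs, autoencoder.get_layer("code").output)
# signatures = encoder.predict(x)                            # compact per-sample "signatures"
```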

Dataset

Virtualized dynamic analysis is used to yield program run-time traces of both benign and malicious files.


https://github.com/wapiflapi/veles

http://ieeexplore.ieee.org/document/8027024/

class imbalance

http://www.edwardraff.com/publications/raff_shwel.pdf

GAN idea – Generative adversarial network for opcodes

Generative adversarial network for opcodes – altering the malware code to resemble benignware by injecting subroutines from normal files to cause a rise in misdetection

Kaggle Malware Classification Challenge 2015

https://www.kaggle.com/c/malware-classification/

simhash http://www.wwwconference.org/www2007/papers/paper215.pdf
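A small sketch of the simhash fingerprint from that paper, which sums signed per-bit votes over hashed features and compares fingerprints by Hamming distance; the md5-based feature hash here is just a convenient stand-in:

```python
import hashlib

def simhash(features, bits=64):
    votes = [0] * bits
    for feat in features:
        # hash each feature (e.g. a token or shingle) to a stable integer
        h = int(hashlib.md5(feat.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    # final fingerprint: bit i is set if its vote is positive
    fingerprint = 0
    for i in range(bits):
        if votes[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    return bin(a ^ b).count("1")

print(hamming_distance(simhash(["push", "mov", "call"]), simhash(["push", "mov", "ret"])))
```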

Machine learning is a popular approach to signatureless malware detection because it can generalize to never-before-seen malware families and polymorphic strains. This has resulted in its practical use for either primary detection engines or supplementary heuristic detections by anti-malware vendors. Recent work in adversarial machine learning has shown that models are susceptible to gradient-based and other attacks. In this whitepaper, we summarize the various attacks that have been proposed for machine learning models in information security, each of which requires the adversary to have some degree of knowledge about the model under attack. Importantly, when applied to attacking a machine learning malware classifier based on static features for Windows portable executable (PE) files, these previous attack methodologies may break the format or functionality of the malware. We investigate a more general framework for attacking static PE anti-malware engines based on reinforcement learning, which models more realistic attacker conditions and consequently yields much more modest evasion rates. A reinforcement learning (RL) agent is equipped with a set of functionality-preserving operations that it may perform on the PE file. It learns, through a series of games played against the anti-malware engine, which sequence of operations is most likely to result in evasion for a given malware sample. Given the general framework, it is not surprising that the evasion rates are modest. However, the resulting RL agent can succinctly summarize blind spots of the anti-malware model. Additionally, evasive variants generated by the agent may be used to harden the machine learning anti-malware engine via adversarial training.

https://arxiv.org/abs/1702.05983
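A deliberately simplified, bandit-style sketch of the reinforcement learning loop described above. The action names, the mutate stub and the toy detected rule are all hypothetical placeholders; the real framework uses a PE feature vector as state and a deep RL agent playing against an actual static engine:

```python
import random
from collections import defaultdict

# hypothetical functionality-preserving PE manipulations
ACTIONS = ["append_overlay", "add_import", "rename_section", "add_section", "pack"]

def mutate(sample, action):
    """Placeholder: apply one manipulation and return the modified sample."""
    return sample + [action]

def detected(sample):
    """Placeholder for the black-box anti-malware verdict (True = detected)."""
    return "pack" not in sample  # toy rule, only for the sake of the sketch

Q = defaultdict(float)     # per-action value; state is ignored in this toy version
epsilon, lr = 0.2, 0.1

for episode in range(500):
    sample = ["original"]
    for step in range(5):
        explore = random.random() < epsilon
        action = random.choice(ACTIONS) if explore else max(ACTIONS, key=lambda a: Q[a])
        sample = mutate(sample, action)
        reward = 1.0 if not detected(sample) else 0.0
        Q[action] += lr * (reward - Q[action])
        if reward:
            break  # evasion achieved for this episode
```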


https://github.com/wapiflapi/binglide

http://www.capstone-engine.org/

https://github.com/radare/radare2

https://github.com/vivisect/vivisect

https://cuckoosandbox.org/

https://github.com/programa-stic/barf-project



SICK LiDAR for xmas fun

Denoising AutoEncoder


 

DEMO: http://vecg.cs.ucl.ac.uk/Projects/projects_fonts/projects_fonts.html#interactive_demo

https://github.com/ramarlina/DenoisingAutoEncoder

https://github.com/Mctigger/KagglePlanetPytorch

https://github.com/fducau/AAE_pytorch

https://blog.paperspace.com/adversarial-autoencoders-with-pytorch/

http://pytorch.org/docs/master/torchvision/transforms.html

https://arxiv.org/abs/1612.04642

  • model that uses an autoencoder as a feature generator
  • model that uses the incidence angle as a feature generator (a combined sketch follows)
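A minimal Keras sketch of those two ideas combined: an encoder (assumed to come from an already-trained autoencoder) generates image features that are concatenated with the incidence angle before a small classifier head. The input shape and layer sizes are assumptions for illustration:

```python
from tensorflow import keras
from tensorflow.keras import layers

img_shape = (75, 75, 2)   # assumed two-band image input

# stand-in for the encoder half of an already-trained autoencoder
encoder = keras.Sequential([
    layers.Input(shape=img_shape),
    layers.Conv2D(16, 3, strides=2, activation="relu"),
    layers.Conv2D(32, 3, strides=2, activation="relu"),
    layers.GlobalAveragePooling2D(),
])

img_in = keras.Input(shape=img_shape)
angle_in = keras.Input(shape=(1,))
features = layers.Concatenate()([encoder(img_in), angle_in])  # autoencoder code + angle
x = layers.Dense(64, activation="relu")(features)
out = layers.Dense(1, activation="sigmoid")(x)                # binary classifier head

clf = keras.Model([img_in, angle_in], out)
clf.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# clf.fit([images, angles], labels, epochs=10, batch_size=32)
```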


Lists and Dicts to Pandas DF