BigSnarf blog

Infosec FTW

Feature Extraction from Network Packets for Machine Learning

Scikit-learn (sklearn) is an established, open-source machine learning library, written in Python with the help of NumPy, SciPy and Cython.

Scikit-learn is user friendly, has a consistent API, and provides extensive documentation. Its implementation quality rests on strict coding standards and high test coverage. Behind sklearn is a very active community that steadily improves the library.

  • How to perform scalable text feature extraction with the Hashing Trick

The following features can be extracted from each network packet:

  1. Ethernet Size
  2. Ethernet Destination
  3. Ethernet Source
  4. Ethernet Protocol
  5. IP header length
  6. IP Time To Live
  7. IP Protocol
  8. IP Length
  9. IP Type of Service
  10. IP Source
  11. IP Destination
  12. TCP Source Port
  13. TCP Destination Port
  14. UDP Source Port
  15. UDP Destination Port
  16. UDP Length
  17. ICMP Type
  18. ICMP Code
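
As a rough illustration of the list above, here is a minimal sketch that pulls those fields out of one raw frame using only the Python standard library. It assumes an untagged Ethernet II frame carrying IPv4; the function name extract_packet_features, the feature key names, and the hand-built sample frame are all made up for this example and are not taken from the linked notebook.

```python
import socket
import struct


def mac(raw: bytes) -> str:
    """Format 6 raw bytes as a colon-separated MAC address."""
    return ":".join(f"{b:02x}" for b in raw)


def extract_packet_features(frame: bytes) -> dict:
    """Parse one Ethernet II frame (no VLAN tag) into a flat feature dict."""
    features = {"eth_size": len(frame)}
    eth_dst, eth_src, eth_proto = struct.unpack("!6s6sH", frame[:14])
    features.update(eth_dst=mac(eth_dst), eth_src=mac(eth_src), eth_proto=eth_proto)
    if eth_proto != 0x0800:  # this sketch only handles IPv4
        return features

    ip = frame[14:]
    (ver_ihl, tos, total_len, _ident, _frag, ttl, proto, _csum,
     ip_src, ip_dst) = struct.unpack("!BBHHHBBH4s4s", ip[:20])
    header_len = (ver_ihl & 0x0F) * 4  # IHL is counted in 32-bit words
    features.update(
        ip_header_len=header_len, ip_ttl=ttl, ip_proto=proto,
        ip_len=total_len, ip_tos=tos,
        ip_src=socket.inet_ntoa(ip_src), ip_dst=socket.inet_ntoa(ip_dst),
    )

    payload = ip[header_len:]
    if proto == 6 and len(payload) >= 4:  # TCP
        sport, dport = struct.unpack("!HH", payload[:4])
        features.update(tcp_sport=sport, tcp_dport=dport)
    elif proto == 17 and len(payload) >= 6:  # UDP
        sport, dport, length = struct.unpack("!HHH", payload[:6])
        features.update(udp_sport=sport, udp_dport=dport, udp_len=length)
    elif proto == 1 and len(payload) >= 2:  # ICMP
        icmp_type, icmp_code = struct.unpack("!BB", payload[:2])
        features.update(icmp_type=icmp_type, icmp_code=icmp_code)
    return features


# Hand-built sample: a UDP datagram with an empty payload (checksums left at 0).
eth = bytes.fromhex("ffffffffffff" "001122334455" "0800")
ip_header = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 28, 0, 0, 64, 17, 0,
                        socket.inet_aton("10.0.0.1"), socket.inet_aton("10.0.0.2"))
udp_header = struct.pack("!HHHH", 5353, 53, 8, 0)
sample_features = extract_packet_features(eth + ip_header + udp_header)
print(sample_features)
```

In practice the frames would come from a capture file or a live interface; a packet library would handle VLAN tags, IPv6 and truncated captures, which this sketch ignores.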

Other features that could be extracted, this time per connection rather than per packet:

  1. Duration of the connection
  2. Connection Starting Time
  3. Connection Ending Time
  4. Number of packets from src to dst
  5. Number of packets from dst to src
  6. Number of bytes from src to dst
  7. Number of bytes from dst to src
  8. Number of Fragmented packets
  9. Number of ACK packets
  10. Number of retransmitted packets
  11. Number of pushed packets
  12. Number of SYN packets
  13. Number of FIN packets
  14. Number of TCP header flags
  15. Number of Urgent packets
  16. Number of sequence packets

Network traffic type features:

  1. Per src IP to set(all dst IP) per minute, hour, day, month, year
  2. Per src IP to set(all dst same Port) per minute, hour, day, month, year
  3. Per src IP to set(all dst to different Ports) per minute, hour, day, month, year
  4. Per src IP to set(all dst per protocol or TCP flag, like SYN, FIN, ACK) per minute, hour, day, month, year
  5. All reverse stats from dst to src for items 1-4
  6. Conversations per IP per minute, hour, day, month, year
  7. Conversations based on protocol or flag, per minute, hour, day, month, year
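
Items 1 to 3 in the list above boil down to grouping packets by source IP and time window, then counting distinct destinations. A minimal sketch follows, assuming each record is (timestamp in seconds, src IP, dst IP, dst port); the function name fanout_per_window and the sample data are invented for illustration.

```python
from collections import defaultdict


def fanout_per_window(packets, window_seconds=60):
    """Map (src_ip, window_index) to distinct-destination counts for that window."""
    seen = defaultdict(lambda: {"dst_ips": set(), "dst_ports": set(), "packets": 0})
    for ts, src, dst, dport in packets:
        window = int(ts // window_seconds)  # 60 -> per minute, 3600 -> per hour
        entry = seen[(src, window)]
        entry["dst_ips"].add(dst)
        entry["dst_ports"].add(dport)
        entry["packets"] += 1
    return {
        key: {
            "n_dst_ips": len(entry["dst_ips"]),
            "n_dst_ports": len(entry["dst_ports"]),
            "packets": entry["packets"],
        }
        for key, entry in seen.items()
    }


sample_traffic = [
    (1.0, "10.0.0.5", "10.0.0.9", 22),
    (2.0, "10.0.0.5", "10.0.0.9", 23),
    (3.0, "10.0.0.5", "10.0.0.10", 80),
    (61.0, "10.0.0.5", "10.0.0.9", 22),
]
sample_fanout = fanout_per_window(sample_traffic)
print(sample_fanout)
```

Swapping src and dst in the grouping key gives the reverse statistics from item 5, and a larger window_seconds gives the hourly or daily variants.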


Vectorizing a large text corpus with the hashing trick


http://nbviewer.ipython.org/urls/raw.github.com/bigsnarfdude/machineLearning/master/Vectorizing.ipynb
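
To show what the hashing trick does to packet features, here is a small hand-rolled sketch. It is not scikit-learn's implementation (scikit-learn ships FeatureHasher and HashingVectorizer, which are based on MurmurHash3) and it is not the code from the notebook linked above; the function name hash_features, the MD5-based hash, and the bucket count are assumptions chosen so the example runs with the standard library alone.

```python
import hashlib


def hash_features(features: dict, n_buckets: int = 32) -> list:
    """Project a feature dict into a fixed-length vector with the hashing trick."""
    vec = [0.0] * n_buckets
    for name, value in features.items():
        if isinstance(value, str):
            # Categorical value: hash "name=value" and add 1, like one-hot encoding.
            token, weight = f"{name}={value}", 1.0
        else:
            # Numeric value: hash the name and add the value itself.
            token, weight = name, float(value)
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h % n_buckets                  # which bucket the feature lands in
        sign = 1.0 if (h >> 63) == 0 else -1.0  # signed hashing softens collisions
        vec[index] += sign * weight
    return vec


packet = {"ip_proto": 17, "ip_src": "10.0.0.1"}
hashed = hash_features(packet)
print(hashed)
```

The vector length is fixed up front, so no vocabulary of IP addresses or ports has to be kept in memory; the price is that distinct features can collide in the same bucket.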
