BigSnarf blog

Infosec FTW

Monthly Archives: April 2013

Google Analytics Report Metrics

 Dimension  Group  Subgroup  API Name  Description  Available In (Custom Reports, Advanced Segments, or Both)
 Browser  Visitors  Browser / Device  ga:browser The browsers used by visitors to your website. Both
 Browser Version  Visitors  Browser / Device  ga:browserVersion The browser versions used by visitors to your website. Both
 City  Visitors  Geo / Network  ga:city The cities from which visits originated, based on IP address. Both
 Continent  Visitors  Geo / Network  ga:continent The continents from which visits originated, based on IP address. Both
 Count of Visits  Visitors  Visitor  ga:visitCount The total count of visits to your site. If you’re using this dimension to create a visitor segment (e.g., for a remarketing list), then you’re identifying all visitors over all sessions who match the criteria you apply; for example, all visitors who have more than 5 visits. Both
 Country / Territory  Visitors  Geo / Network  ga:country The countries from which visits originated, based on IP address. Both
 Days Since Last Visit  Visitors  Visitor  ga:daysSinceLastVisit The number of days elapsed since visitors last visited the site. Both
 Domain  Visitors  Geo / Network  ga:networkDomain The fully qualified domain names of your visitors’ Internet service providers (ISPs). Both
 Flash Version  Visitors  Browser / Device  ga:flashVersion The versions of Flash supported by visitors’ browsers, including minor versions. Both
 Java Support  Visitors  Browser / Device  ga:javaEnabled Differentiates visits from browsers with and without (Yes or No) Java enabled. Both
 Language  Visitors  Browser / Device  ga:language The language settings of visitors’ browsers. Both
 Metro  Visitors  Geo / Network  ga:metro The Designated Market Area (DMA) from where traffic arrived on your site. Both
 Mobile (Including Tablet)  Visitors  Browser / Device  ga:isMobile Indicates whether visits were from mobile and tablet devices (Yes) or not (No). Both
 Mobile Device Branding  Visitors  Browser / Device  ga:mobileDeviceBranding Manufacturer or branded name (examples: Samsung, HTC, Verizon, T-Mobile). Both
 Mobile Device Info  Visitors  Browser / Device  ga:mobileDeviceInfo The branding, model, and marketing name used to identify the device. Custom reports
 Mobile Device Marketing Name  Visitors  Browser / Device Marketing name used for device (example: Pearl (Blackberry)) Both
 Mobile Device Model  Visitors  Browser / Device  ga:mobileDeviceModel Device model (example: Nexus S) Both
 Mobile Input Selector  Visitors  Browser / Device  ga:mobileInputSelector Selector used on device (examples: touchscreen, joystick, clickwheel, stylus) Both
 Operating System  Visitors  Browser / Device  ga:operatingSystem The operating systems used by visitors to your website. Includes mobile operating systems such as Android. Both
 Operating System Version  Visitors  Browser / Device  ga:operatingSystemVersion The operating system versions used by visitors to your website. Both
 Region  Visitors  Geo / Network  ga:region The geographic regions from which visits originated, based on IP address. Both
 Screen Colors  Visitors  Browser / Device  ga:screenColors The screen color depths of visitors’ monitors. Both
 Screen Resolution  Visitors  Browser / Device  ga:screenResolution The screen resolutions of visitors’ monitors. Both
 Service Provider  Visitors  Geo / Network  ga:networkLocation The names of the Internet service providers (ISPs) used by visitors to your site. Both
 Sub Continent Region  Visitors  Geo / Network  ga:subContinent The sub-continents from which visits originated, based on IP address. Both
 Tablet  Visitors  Browser / Device The tablets used by visitors to your website. Advanced Segments
 Visitor Type  Visitors  Visitor  ga:visitorType New Visitor (first-time visit) and Returning Visitor. Both
 Medium  Traffic Sources  Traffic Sources  ga:medium The mediums which referred traffic. Includes mediums identified via utm_medium. Both
 Referral Path  Traffic Sources  Traffic Sources  ga:referralPath The URIs that referred traffic. Both
 Source  Traffic Sources  Traffic Sources  ga:source The sources which referred traffic. Includes sources identified via utm_source. Both
 Source / Medium  Traffic Sources  Traffic Sources The source-combinations which referred traffic. Includes sources and mediums identified via utm_source and utm_medium. Custom reports
 Traffic Type  Traffic Sources  Traffic Sources The types of traffic to your site: search, referral, direct, and other. Custom reports
 Display Name  Social  Social Activities  ga:socialActivityDisplayName Social activity display name. Custom reports
 Endorsing URL  Social  Social Activities  ga:socialActivityEndorsingUrl For a social data hub activity, this value represents the URL of the social activity (e.g. the Google+ post URL, the blog comment URL, etc.). Custom reports
 Originating Social Action  Social  Social Activities Originating Social Action — The social action associated with the activity (e.g. vote, comment) Custom reports
 Shared URL  Social  Social Activities  ga:socialActivityContentUrl Social Content URL — The URL/content that was talked about in the social activity. Custom reports
 Social Action  Social  Social Activities  ga:socialActivityAction The social action that occurred (e.g. +1, Like, Share) Custom reports
 Social Activity Post  Social  Social Activities  ga:socialActivityPost Social Activity Post — The content of the activity shared by the user. Custom reports
 Social Activity Timestamp  Social  Social Activities  ga:socialActivityTimestamp The timestamp of when the social activity occurred. Custom reports
 Social Entity  Social  Social Interactions  ga:socialInteractionTarget The page (i.e. URL) or entity that was shared. Custom reports
 Social Network  Social  Traffic Sources  ga:socialNetwork The social network where the visit came from and/or the social activity occurred. Custom reports
 Social Network and Action  Social  Social Activities  ga:socialActivityNetworkAction Originating Social Network/Action: The social network where the activity originated and the type of action taken. Custom reports
 Social Source  Social  Social Interactions  ga:socialInteractionNetwork The social source or network on which the activity occurred (e.g. Facebook, Twitter, Google). Custom reports
 Social Source and Action  Social  Social Interactions  ga:socialInteractionNetworkAction The social source/network and action that occurred (e.g. Facebook-Like). Custom reports
 Social Source Referral  Social  Traffic Sources  ga:hasSocialSourceReferral Whether or not this activity resulted from a social source. Custom reports
 Social Tags Summary  Social  Social Activities  ga:socialActivityTagsSummary For a social data hub activity, this is a comma-separated set of tags associated with the social activity. Custom reports
 Social Type  Social  Social Interactions Either “Socially Engaged” or “Not Socially Engaged”. Custom reports
 Social User Handle  Social  Social Activities  ga:socialActivityUserHandle Social User Handle — The handle of the user who initiated the social activity. Custom reports
 User Photo URL  Social  Social Activities  ga:socialActivityUserPhotoUrl URL for the profile photo of the user who performed a social action. Custom reports
 User Profile URL  Social  Social Activities  ga:socialActivityUserProfileUrl URL for the profile of the user who performed a social action. Custom reports
 Connection Speed  Other  Site Speed The network connection speeds of visitors to your website. Both
 Date*  Other  Time  ga:date The dates of the active date range.* (Same as “Visit Date (YYYYMMDD)” in Advanced Segments) Custom reports
 Day of Week  Other  Time  ga:dayOfWeek The day of the week. A one-digit number from 0 (Sunday) to 6 (Saturday). Custom reports
 Hour  Other  Time  ga:hour A two-digit hour of the day ranging from 00-23 in the timezone configured for the account. This value is also corrected for daylight saving time, adhering to all local rules for daylight saving time. If your timezone follows daylight saving time, there will be an apparent bump in the number of visits during the change-over hour (e.g. between 1:00 and 2:00) on the one day per year when that hour repeats. A corresponding hour with zero visits will occur at the opposite changeover. (Google Analytics does not track visitor time more precisely than hours.) Both
 Hour of Day  Other  Time Date and hour. Custom reports
 Month of Year  Other  Time  ga:month The month of the visit. A two digit integer from 01 to 12. Custom reports
 Visit Date (YYYYMMDD)  Other  Time  ga:date Visit date in yyyymmdd format.* (Same as “Date” in Custom Reports) Advanced Segments
 Week of Year  Other  Time  ga:week The week of the visit. A two-digit number from 01 to 53. Each week starts on Sunday. Custom reports
 Custom Variable (Key 1…n)  Custom Variables  Custom Variables  ga:customVarName(n) The key name of the custom variable for that slot. Both
 Custom Variable (Value 1…n)  Custom Variables  Custom Variables  ga:customVarValue(n) The value of the custom variable for that slot. Both
 Affiliation  Conversions  Ecommerce  ga:affiliation The affiliations assigned to ecommerce transactions. Both
 Days to Transaction  Conversions  Ecommerce  ga:daysToTransaction The number of days between users’ purchases and the campaign referral. Both
 Goal Completion Location  Conversions  Goal Conversions Goal Request URI Custom reports
 Goal Previous Step – 3/2/1  Conversions  Goal Conversions The URI that was loaded 1, 2, or 3 steps prior to the goal completion location. Custom reports
 Product  Conversions  Ecommerce  ga:productName The product names of items sold. Both
 Product Category  Conversions  Ecommerce  ga:productCategory The categories of products sold. Both
 Product SKU  Conversions  Ecommerce  ga:productSku The product codes of items sold. Custom reports
 Transaction  Conversions  Ecommerce  ga:transactionId The transaction IDs of the ecommerce transactions. Custom reports
 Visits to Transaction  Conversions  Ecommerce  ga:visitsToTransaction The number of visits from referral to purchase. Both
 App ID  Content  App Tracking The individual app ID designated by a specific app marketplace, like GooglePlay or another AppStore. Both
 App Installer ID  Content  App Tracking The name or package name for a specific app marketplace, like GooglePlay or another AppStore. Both
 App Name  Content  App Tracking The official name of your app as it’s designated by the developer in your account (e.g., MyApp: Special Edition). Both
 App Version  Content  App Tracking The version number of an app (e.g., 1.5). Both
 Destination Page  Content  Internal Search  ga:searchDestinationPage A page that the user visited after performing an internal website search. Custom reports
 Event Action  Content  Event Tracking  ga:eventAction The actions that were assigned to triggered events. Both
 Event Category  Content  Event Tracking  ga:eventCategory The categories that were assigned to triggered events. Both
 Event Label  Content  Event Tracking  ga:eventLabel The optional labels used to describe triggered events. Both
 Exception Description  Content  App Tracking The description of an exception, or technical error, as defined by your developer in your App Tracking code. Both
 Exit Page  Content  Page Tracking  ga:exitPagePath The pages visitors viewed last on your site. Both
 Exit Screen  Content  App Tracking A screen from which visitors exit an app. Custom reports
 Experiment ID  Content  Content Experiments Visits by people who saw the experiment page. Advanced Segments
 Full Referrer  Content  Traffic Sources The URLs that referred traffic. Custom reports
 Hostname  Content  Page Tracking  ga:hostname The hostnames visitors used to reach your site. Typically, your site’s URL. Both
 Landing Page  Content  Page Tracking  ga:landingPagePath The pages through which visitors entered your site. Both
 Landing Screen  Content  App Tracking A screen through which visitors enter an app. Custom reports
 Page  Content  Page Tracking  ga:pagePath The pages visited, listed by path and/or query parameters. Both
 Page Depth  Content  Page Tracking  ga:pageDepth The number of pages viewed by visitors in a session. Both
 Page path level 4/3/2/1  Content  Page Tracking  ga:pagePathLevel1 Page Path Level 1, 2, 3 or 4. Custom reports
 Page Title  Content  Page Tracking  ga:pageTitle The page titles used on your site. Both
 Refined Keyword  Content  Internal Search  ga:searchKeywordRefinement The search terms used to refine internal searches. Both
 Screen Depth  Content  App Tracking The number of screens viewed in a session. Both
 Screen Name  Content  App Tracking The name of a specific app screen. Custom reports
 Search Term  Content  Internal Search  ga:searchKeyword The search terms used by visitors to search your site. Both
 Site Search Category  Content  Internal Search  ga:searchCategory The categories searched by visitors searching your site. Both
 Site Search Status  Content  Internal Search  ga:searchUsed Distinguishes visits that included an internal site search and those that did not. Both
 Start Page  Content  Internal Search  ga:searchStartPage The pages from which visitors searched your site. Custom reports
 Timing Category  Content  User Timings  ga:userTimingCategory User specified category for user timing. Custom reports
 Timing Label  Content  User Timings  ga:userTimingLabel User specified label for user timing. Custom reports
 Timing Variable  Content  User Timings  ga:userTimingVariable User timing variable Custom reports
 User Defined Value  Content  Visitor  ga:userDefinedValue The value provided when you define custom visitor segments for your website. Both
 Variation  Content  Content Experiments Visits for a specific combination of pages in an experiment; for example: Variation Page A and Goal A; Variation Page B and Goal A. Advanced Segments
 Ad Content  Advertising  AdWords  ga:adContent The first line of each AdWords ad and the utm_content tags that were used in tagged campaigns. Both
 Ad Distribution Network  Advertising  AdWords  ga:adDistributionNetwork The location where your ad was shown (google.com, search partners, content network). Custom reports
 Ad Group  Advertising  AdWords  ga:adGroup The names of your AdWords ad groups. Both
 Ad Slot  Advertising  AdWords  ga:adSlot The location of the advertisement on the hosting page (Top, RHS, or not set). Both
 Ad Slot Position  Advertising  AdWords  ga:adSlotPosition The ad slot positions in which your AdWords ads appeared (1-8). Both
 Campaign  Advertising  AdWords  TV campaign Custom reports
 Campaign  Advertising  AdWords  ga:campaign The names of your AdWords campaigns and the utm_campaign tags that were used in tagged campaigns. Both
 Destination URL  Advertising  AdWords  ga:adDestinationUrl The URLs to which your AdWords ads referred traffic. Custom reports
 Keyword  Advertising  Traffic Sources  ga:keyword All keywords, both paid and unpaid, used by users to reach your site. Both
 Match Type  Advertising  AdWords  ga:adMatchType How the keyword was matched to the search query (i.e. exact, broad, or phrase). Custom reports
 Matched Search Query  Advertising  AdWords  ga:adMatchedQuery The actual search queries that triggered impressions of your AdWords ads. Custom reports
 Placement Domain  Advertising  AdWords  ga:adPlacementDomain The domains where your ads on the content network were placed. Custom reports
 Placement Type  Advertising  AdWords  ga:adTargetingOption Automatic placement or managed placement. Custom reports
 Placement URL  Advertising  AdWords  ga:adPlacementUrl The URLs where your ads on the content network were placed. Custom reports
 Social Annotation Type  Advertising  AdWords The type of +1 annotation made to your ads: None, Basic, or Personal. Custom reports
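
Each “ga:” name above is a dimension accepted by the Google Analytics Core Reporting API (v3). As a minimal sketch, assuming you have already built an authorized Analytics service object with google-api-python-client and substituting your own view (profile) ID, a visits-by-browser query might look like:

    # Hedged sketch: `service` is an assumed, already-authorized Analytics v3
    # service object (OAuth setup omitted); PROFILE_ID is a placeholder view ID.
    PROFILE_ID = '12345678'

    result = service.data().ga().get(
        ids='ga:' + PROFILE_ID,
        start_date='2013-04-01',
        end_date='2013-04-30',
        metrics='ga:visits',
        dimensions='ga:browser,ga:operatingSystem',  # any dimensions above
        sort='-ga:visits',
        max_results=10).execute()

    for row in result.get('rows', []):
        print(row)  # e.g. ['Chrome', 'Windows', '1234']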

Cloudera Data Science Essentials Training

Photo http://wikibon.org/blog/data-visualization/

 

Data Science Essentials Exam (DS-200) Preparation

 

Online Data Science Resources


Books


Blogs/misc.


Exam Sections

These are the current DS-200 Data Science Essentials beta exam sections:

  1. Data Acquisition
  2. Data Evaluation
  3. Data Transformation
  4. Machine Learning Basics
  5. Clustering
  6. Classification
  7. Collaborative Filtering
  8. Model/Feature Selection
  9. Probability
  10. Visualization
  11. Optimization

Data Acquisition

Objectives

  • Access and load data from a variety of sources into a Hadoop cluster, including from databases and systems such as OLTP and OLAP as well as log files and documents.
  • Deploy a variety of acquisition techniques for acquiring data, including database integration, working with APIs
  • Use command-line tools such as wget and curl (driven from Python in the sketch after this list)
  • Use Hadoop tools such as Sqoop and Flume
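
One low-tech path through these objectives, sketched in Python with hypothetical placeholder URL and paths: fetch a log with curl, then land it in HDFS with hadoop fs -put.

    # Hedged sketch: acquire a log over HTTP and load it into HDFS.
    import subprocess

    LOG_URL = 'http://example.com/logs/access.log'   # placeholder source
    LOCAL_PATH = '/tmp/access.log'
    HDFS_PATH = '/user/hadoop/raw/access.log'        # placeholder target

    subprocess.check_call(['curl', '-sSf', '-o', LOCAL_PATH, LOG_URL])
    subprocess.check_call(['hadoop', 'fs', '-put', LOCAL_PATH, HDFS_PATH])

For database sources, Sqoop's import plays the role curl plays here; for streaming sources such as syslog, a Flume agent replaces the one-shot copy.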

Section Study Resources


Data Evaluation

Objectives

  • Knowledge of the file types commonly used for input and output and the advantages and disadvantages of each
  • Methods for working with various file formats including binary files, JSON, XML, and .csv
  • Tools, techniques, and utilities for evaluating data from the command line and at scale
  • An understanding of sampling and filtering techniques (see the sketch after this list)
  • A familiarity with Hadoop SequenceFiles and serialization using Avro
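
Reservoir sampling is a good concrete example of the sampling techniques mentioned above: it draws a uniform random sample of k records from a file or stream of unknown length in a single pass. A minimal sketch:

    # Algorithm R: keep the first k records, then replace survivors with
    # decreasing probability so every record ends up equally likely.
    import random

    def reservoir_sample(records, k):
        sample = []
        for i, rec in enumerate(records):
            if i < k:
                sample.append(rec)
            else:
                j = random.randint(0, i)
                if j < k:
                    sample[j] = rec
        return sample

    with open('access.log') as f:       # any large input file
        for line in reservoir_sample(f, 100):
            print(line.rstrip())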

Section Study Resources


Data Transformation

Objectives

  • Write a map-only Hadoop Streaming job
  • Write a script that receives records on stdin and writes them to stdout
  • Invoke Unix tools to convert file formats
  • Join data sets
  • Write scripts to anonymize data sets
  • Write a Mapper using Python and invoke via Hadoop streaming (see the sketch after this list)
  • Write a custom subclass of FileOutputFormat
  • Write records into a new format such as AvroOutputFormat or SequenceFileOutputFormat
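
Several of these objectives collapse into one small program. A minimal map-only streaming mapper, reading records on stdin and writing tab-separated records to stdout (the access-log input and IP-in-first-field layout are assumptions for illustration):

    #!/usr/bin/env python
    # Map-only Hadoop Streaming mapper: emit (ip, 1) per log line.
    import sys

    for line in sys.stdin:
        fields = line.split()
        if fields:                             # skip blank lines
            print('%s\t%s' % (fields[0], 1))   # assumes IP is field 0

Invoke it with something like hadoop jar hadoop-streaming.jar -input logs -output counts -mapper mapper.py -numReduceTasks 0 to keep the job map-only.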

Section Study Resources


Machine Learning Basics

Objectives

  • Understand how to use Mappers and Reducers to create predictive models
  • Understand the different kinds of machine learning, including supervised and unsupervised learning (contrasted in the sketch after this list)
  • Recognize appropriate uses of the following: parametric/non-parametric algorithms, support vector machines, kernels, neural networks, clustering, dimensionality reduction, and recommender systems
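
A toy scikit-learn sketch makes the supervised/unsupervised split concrete: the SVM below learns from labels, while k-means groups the same points without ever seeing them (data and labels are made up):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.svm import SVC

    X = np.array([[0, 0], [0, 1], [5, 5], [5, 6]])  # four toy points
    y = np.array([0, 0, 1, 1])                      # labels (supervised only)

    clf = SVC(kernel='linear').fit(X, y)   # supervised: trains on y
    print(clf.predict([[4, 5]]))           # -> [1]

    km = KMeans(n_clusters=2).fit(X)       # unsupervised: never sees y
    print(km.labels_)                      # two discovered groups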

Section Study Resources

  • Apache Mahout. Check out the Mahout wiki
  • Cloudera’s blog category on Mahout
  • Hadoop In Practice: Chapter 9
  • Hadoop: The Definitive Guide, 3rd Edition: Chapter 16 – Case Studies
  • Algorithms of the Intelligent Web: Chapter 7 – (Use Cases)
  • A Programmer’s Guide to Data Mining

Clustering

Objectives

  • Define clustering and identify appropriate use cases
  • Identify appropriate uses of various models including centroid, distribution, density, group, and graph
  • Describe the value and use of similarity metrics including Pearson correlation, Euclidean distance, and block distance (each sketched after this list)
  • Identify the algorithms applicable to each model (k-means, SVD/PCA, etc.)
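
The three similarity metrics named above are each a few lines of Python; minimal sketches with tiny worked examples:

    import math

    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def block(a, b):   # a.k.a. Manhattan / city-block distance
        return sum(abs(x - y) for x, y in zip(a, b))

    def pearson(a, b):
        n = float(len(a))
        ma, mb = sum(a) / n, sum(b) / n
        cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
        sa = math.sqrt(sum((x - ma) ** 2 for x in a))
        sb = math.sqrt(sum((y - mb) ** 2 for y in b))
        return cov / (sa * sb)

    print(euclidean([1, 2], [4, 6]))      # 5.0
    print(block([1, 2], [4, 6]))          # 7
    print(pearson([1, 2, 3], [2, 4, 6]))  # 1.0 (perfectly correlated)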

Section Study Resources

  • Programming Collective Intelligence: Chapter 3
  • Algorithms of the Intelligent Web: Chapter 4
  • Mahout In Action: Part 2

Classification

Objectives

  • Describe the steps for training a set of data in order to identify new data based on known data
  • Identify the use cases for logistic regression and Bayes’ theorem (a worked example follows this list)
  • Define classification techniques and formulas
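
A worked Bayes’ theorem example of the kind these objectives point at, with made-up illustration probabilities: how likely is a message to be spam given that it contains the word “free”?

    # P(spam | word) = P(word | spam) * P(spam) / P(word)
    p_spam = 0.2                 # prior: 20% of mail is spam (made up)
    p_word_given_spam = 0.5      # "free" appears in half of spam
    p_word_given_ham = 0.05      # ...and in 5% of legitimate mail

    p_word = (p_word_given_spam * p_spam +
              p_word_given_ham * (1 - p_spam))      # = 0.14
    print(p_word_given_spam * p_spam / p_word)      # ~0.714

Seeing the word lifts the spam probability from 20% to roughly 71%, which is the whole mechanism behind naive Bayes classifiers.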

Section Study Resources

  • Programming Collective Intelligence: Chapters 6, 7, 8, 9, 12
  • Algorithms of the Intelligent Web: Chapters 5, 6
  • Mahout In Action: Part 3

Collaborative Filtering

Objectives

  • Identify the use of user-based and item-based collaborative filtering techniques (a sketch follows this list)
  • Describe the limitations and strengths of collaborative filtering techniques
  • Given a scenario, determine the appropriate collaborative filtering implementation
  • Given a scenario, determine the metrics one should use to evaluate the accuracy of a recommender system
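
A minimal user-based sketch with a hypothetical ratings dictionary: predict an unseen item’s rating as the similarity-weighted average of other users’ ratings.

    ratings = {
        'alice': {'item1': 5.0, 'item2': 3.0, 'item3': 4.0},
        'bob':   {'item1': 4.0, 'item2': 3.5, 'item3': 5.0, 'item4': 2.0},
        'carol': {'item1': 1.0, 'item2': 5.0, 'item4': 5.0},
    }

    def similarity(u, v):
        # Cosine similarity over the items both users have rated.
        common = set(ratings[u]) & set(ratings[v])
        if not common:
            return 0.0
        num = sum(ratings[u][i] * ratings[v][i] for i in common)
        du = sum(ratings[u][i] ** 2 for i in common) ** 0.5
        dv = sum(ratings[v][i] ** 2 for i in common) ** 0.5
        return num / (du * dv)

    def predict(user, item):
        pairs = [(similarity(user, v), ratings[v][item])
                 for v in ratings if v != user and item in ratings[v]]
        total = sum(s for s, _ in pairs)
        return sum(s * r for s, r in pairs) / total if total else None

    print(predict('alice', 'item4'))   # alice's predicted rating for item4

For the evaluation objective, root-mean-squared error between predicted and held-out ratings is the usual starting metric for a recommender.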

Section Study Resources


Model/Feature Selection

Objectives

  • Describe the role and function of feature selection (a sketch follows this list)
  • Analyze a scenario and determine the appropriate features and attributes to select
  • Analyze a scenario and determine the methods to deploy for optimal feature selection
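
One simple, concrete method is univariate selection: score each feature against the target and keep the best k. A hedged scikit-learn sketch with toy data:

    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif

    X = np.array([[1.0, 0.1, 9.0],
                  [2.0, 0.2, 1.0],
                  [3.0, 0.1, 8.0],
                  [4.0, 0.3, 2.0]])    # three candidate features (made up)
    y = np.array([0, 0, 1, 1])

    selector = SelectKBest(f_classif, k=1).fit(X, y)
    print(selector.get_support())      # boolean mask: which feature survived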

Section Study Resources

  • Programming Collective Intelligence: Chapter 10
  • Pattern Recognition and Machine Learning: Chapter 1.3

Probability

Objectives

  • Analyze a scenario and determine the likelihood of a particular outcome
  • Determine sample percentiles
  • Determine a range of items based on a sample probability density function
  • Summarize a distribution of sample numbers (as in the sketch after this list)
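
Most of these objectives reduce to a few numpy calls on a sample; a small sketch:

    import numpy as np

    sample = np.array([2, 3, 3, 4, 5, 5, 5, 6, 7, 9])

    print(sample.mean(), sample.std())      # central tendency and spread
    print(np.percentile(sample, 50))        # median
    print(np.percentile(sample, 90))        # 90th percentile
    print(np.percentile(sample, [25, 75]))  # interquartile range endpoints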

Section Study Resources

  • Programming Collective Intelligence: Chapter 8 (Estimating Probability Density)
  • Pattern Recognition and Machine Learning: Chapter 2
  • Probability, Statistics, Bayes Theorem at better explained

Visualization

Objectives

  • Determine the most effective visualization for a given problem (see the sketch after this list)
  • Analyze a data visualization and interpret its meaning
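
A minimal matplotlib sketch of the “which visualization?” decision: a histogram answers “how is one variable distributed?”, while a scatter plot answers “how do two variables relate?” (the data here is synthetic).

    import matplotlib.pyplot as plt
    import numpy as np

    x = np.random.randn(500)
    y = 2 * x + np.random.randn(500) * 0.5

    fig, (ax1, ax2) = plt.subplots(1, 2)
    ax1.hist(x, bins=30)       # distribution of one variable
    ax2.scatter(x, y, s=5)     # relationship between two variables
    plt.show()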

Section Study Resources


Optimization

Objectives

  • Understand optimization methods
  • Identify 1st order and 2nd order optimization techniques
  • Determine the learning rate for a particular algorithm (illustrated after this list)
  • Determine the sources of errors in a model
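
A first-order example on a toy quadratic shows how the learning rate drives convergence or divergence:

    # Gradient descent on f(x) = (x - 3)^2, whose gradient is 2(x - 3).
    def gradient_descent(lr, steps=50, x=0.0):
        for _ in range(steps):
            x -= lr * 2 * (x - 3)   # x <- x - lr * f'(x)
        return x

    print(gradient_descent(lr=0.1))   # converges toward the minimum at 3
    print(gradient_descent(lr=1.1))   # overshoots and diverges

For contrast, a second-order Newton step, x <- x - f'(x)/f''(x), lands on this quadratic’s minimum in a single iteration, which is the distinction the first/second-order objective is after.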

Section Study Resources

http://cloudera.com/content/cloudera/en/training/certification/ccp-ds/essentials/prep.html

“Start with a small data project” – Alex Hutton

I recently watched a webinar by Alex Hutton on Security Data Analytics. It reminded me of this HBR post.

To Succeed with Big Data, Start Small

http://blogs.hbr.org/cs/2012/10/to_succeed_with_big_data_start.html

While it isn’t hard to argue the value of analyzing big data, it is intimidating to figure out what to do first. There are many unknowns when working with data that your organization has never used before — the streams of unstructured information from the web, for example. Which elements of the data hold value? What are the most important metrics the data can generate? What quality issues exist? As a result of these unknowns, the costs and time required to achieve success can be hard to estimate.

As an organization gains experience with specific types of data, certain issues will fade, but there will always be another new data source with the same unknowns waiting in the wings. The key to success is to start small. It’s a lower-risk way to see what big data can do for your firm and to test your firm’s readiness to use it.

The Traditional Way

In most organizations, big data projects get their start when an executive becomes convinced that the company is missing out on opportunities in data. Perhaps it’s the CMO looking to glean new insight into customer behavior from web data, for example. That conviction leads to an exhaustive and time-consuming process by which the CMO’s team might work with the CIO’s team to specify and scope the precise insights to be pursued and the associated analytics to get them.

Next, the organization launches a major IT project. The CIO’s team designs and implements complex processes to capture all the raw web data needed and transform it into usable (structured) information that can then be analyzed.

Once analytic professionals start using the data, they’ll find problems with the approach. This triggers another iteration of the IT project. Repeat a few times and everyone will be pulling their hair out and questioning why they ever decided to try to analyze the web data in the first place. This is a scenario I have seen play out many times in many organizations.

A Better Approach

The process I just described doesn’t work for big data initiatives because it’s designed for cases where all the facts are known, all the risks are identified, and all steps are clear — exactly what you won’t find with a big data initiative. After all, you’re applying a new data source to new problems in a new way.

Again, my best advice is to start small. First, define a few relatively simple analytics that won’t take much time or data to run. For example, an online retailer might start by identifying what products each customer viewed so that the company can send a follow-up offer if they don’t purchase. A few intuitive examples like this allow the organization to see what the data can do. More importantly, this approach yields results that are easy to test to see what type of lift the analytics provide.

Next, instead of setting up formal processes to capture, process, and analyze all of the data all of the time, capture some of the data in a one-off fashion. Perhaps a month’s worth for one division for a certain subset of products. If you capture only the data you need to perform the test, you’ll find the initial data volume easier to manage and you won’t muddy the water with a bunch of other data — a problem that plagues many big data initiatives.

At this point, it is time to turn analytic professionals loose on the data. Remember: they’re used to dealing with raw data in an unfriendly format. They can zero in on what they need and ignore the rest. They can create test and control groups to whom they can send the follow-up offers, and then they can help analyze the results. During this process, they’ll also learn an awful lot about the data and how to make use of it. This kind of targeted prototyping is invaluable when it comes to identifying trouble and firming up a broader effort.

Successful prototypes also make it far easier to get the support required for the larger effort. Best of all, the full effort will now be less risky because the data is better understood and the value is already partially proven. It’s also worthwhile to learn that the initial analytics aren’t as valuable as hoped. It tells you to focus effort elsewhere before you’ve wasted many months and a lot of money.

Pursuing big data with small, targeted steps can actually be the fastest, least expensive, and most effective way to go. It enables an organization to prove there’s value in major investment before making it and to understand better how to make a big data program pay off for the long term.

PCAP analysis via VirusTotal

VirusTotal += PCAP Analyzer

VirusTotal is a greedy creature; one of its gluttonous wishes is to be able to understand and characterize all the races it encounters. It already understood the insurgent collective of Portable Executables, the greenish creatures known as Android APKs, the talkative PDF civilization, etc. As of today it also figures out PCAPs, a rare group of individuals obsessed with recording everything they see.

PCAP files contain network packet data created during a live network capture, often used for packet sniffing and analyzing data network characteristics. In the malware research field PCAPs are often used to:

  • Record malware network communication when executed in sandboxed environments.
  • Record honeyclient browser exploitation traces.
  • Log network activity seen by network appliances and IDS.
  • etc.

http://blog.virustotal.com/2013/04/virustotal-pcap-analyzer.html

Mandelbrot Set

PyOpenCL

PyOpenCL lets you access the OpenCL parallel computation API from Python. Here’s what sets PyOpenCL apart:

  • Object cleanup tied to lifetime of objects. This idiom, often called RAII in C++, makes it much easier to write correct, leak- and crash-free code.
  • Completeness. PyOpenCL puts the full power of OpenCL’s API at your disposal, if you wish.
  • Convenience. While PyOpenCL’s primary focus is to make all of OpenCL accessible, it tries hard to make your life less complicated as it does so, without taking any shortcuts.
  • Automatic Error Checking. All OpenCL errors are automatically translated into Python exceptions.
  • Speed. PyOpenCL’s base layer is written in C++, so all the niceties above are virtually free.
  • Helpful, complete documentation and a wiki.
  • Liberal licensing (MIT).

Documentation

See the PyOpenCL Documentation.

Support

Having trouble with PyOpenCL? First, you may want to check the PyOpenCL Wiki. If that doesn’t help, maybe the nice people on the PyOpenCL mailing list can.

Download

Download PyOpenCL here.

Or get it directly from my source code repository by typing

git clone http://git.tiker.net/trees/pyopencl.git

You may also browse the source.

Prerequisites: All you need is an OpenCL implementation. And Python obviously.
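
To give a feel for the API, here is a vector-add sketch closely following PyOpenCL’s canonical demo:

    import numpy as np
    import pyopencl as cl

    a = np.random.rand(50000).astype(np.float32)
    b = np.random.rand(50000).astype(np.float32)

    ctx = cl.create_some_context()          # pick any available device
    queue = cl.CommandQueue(ctx)

    mf = cl.mem_flags
    a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
    b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
    dest_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

    prg = cl.Program(ctx, """
        __kernel void sum(__global const float *a,
                          __global const float *b,
                          __global float *c) {
            int gid = get_global_id(0);
            c[gid] = a[gid] + b[gid];
        }
    """).build()

    prg.sum(queue, a.shape, None, a_buf, b_buf, dest_buf)

    result = np.empty_like(a)
    cl.enqueue_copy(queue, result, dest_buf)
    print(np.linalg.norm(result - (a + b)))   # should print ~0.0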

Data Scientist Roles

http://www.fastcolabs.com/3008620/lessons-crash-course-data-science

Data Science Bootcamp

http://www.bigdive.eu/

THE IDEA BEHIND BIG DIVE IS TO BOOST THE GROWTH OF A NEW GENERATION OF DEVELOPERS.

A street-fighting gym where high value datasets are the raw material in the hands of a bunch of ambitious smart geeks tutored and mentored by experts in three key areas: Development, Visualization and Data Science.

Formatting code in iPython

Extracting features out of web logs to identify Human vs. Robot

Classifying traffic intensity and temporal differences in access

  1. Total page requests per IP address
  2. Percentage of images requested
  3. Percentage of binaries requested (e.g., PDF)
  4. Total requests for robots.txt
  5. Percentage of HTML pages requested
  6. Percentage of text files requested
  7. Percentage of zip files requested
  8. Percentage of video files requested
  9. Bounce rate
  10. Session time
  11. Standard deviation between clicks
  12. Percentage of night time requests
  13. Percentage of errors
  14. Percentage of garbage requests
  15. Percentage of GET requests
  16. Percentage of POST requests
  17. Percentage of HEAD requests
  18. URL traversal
  19. Depth of URL traversal
  20. Pathlength
  21. Referrer
  22. User Agents
  23. IP Address location
  24. Known crawler IP addresses
  25. Repeated requests
  26. Average time between clicks
  27. OS badges
  28. ARIN registration
  29. ASN analysis
  30. Geolocation
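
A hedged Python sketch computing a few of the features above (items 1, 4, 15, and 16) from a combined-format access log; the field positions are assumptions about that format:

    from collections import defaultdict

    stats = defaultdict(lambda: {'total': 0, 'robots': 0, 'get': 0, 'post': 0})

    with open('access.log') as f:
        for line in f:
            parts = line.split('"')
            if len(parts) < 2:
                continue
            ip = parts[0].split()[0]     # client IP: first field
            request = parts[1].split()   # e.g. ['GET', '/robots.txt', 'HTTP/1.1']
            if len(request) < 2:
                continue
            method, path = request[0], request[1]
            s = stats[ip]
            s['total'] += 1
            s['robots'] += 1 if path == '/robots.txt' else 0
            s['get'] += 1 if method == 'GET' else 0
            s['post'] += 1 if method == 'POST' else 0

    for ip, s in stats.items():
        t = float(s['total'])
        print(ip, s['total'], s['robots'] / t, s['get'] / t, s['post'] / t)

A heavy robots.txt share combined with a near-100% GET ratio is a classic crawler signature.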

Chart SQL joins


Security Data Visualization

Tableau Public is for anyone who wants to tell stories with interactive data on the web. It’s delivered as a service which allows you to be up and running overnight. With Tableau Public you can create amazing interactive visuals and publish them quickly, without the help of programmers or IT.

The Premium version of Tableau Public is for organizations that want to enhance their websites with interactive data visualizations. There are higher limits on the size of data you can work with. And among other premium features, you can keep your underlying data hidden.

Why tell stories with data? Because interactive content drives more page views and longer dwell time. Industry experts have cited figures showing that the average reading time of a web page with an interactive visual is 4, 5 or 6 times that of a static web page.

http://www.tableausoftware.com/products/public