BigSnarf blog
Infosec FTW
Category Archives: Thoughts
Visualize data to spot the errors
Posted by on May 25, 2013
In the first chart, I plotted data from an aggregation report I built. The visualization revealed gaps in the reporting data.
In the second chart, I changed the resolution from hourly to every three hours, and the gap was still there.
The third plot shows each data source plotted separately; notice there are no gaps in the original source data.
What I discovered was either a built-in flaw in the library I was using to aggregate data or a flaw in my own poorly implemented method. Either way, I re-implemented a custom aggregator to fix the problem.
I found a related post http://gigaom.com/2013/05/24/steering-clear-of-the-iceberg-three-ways-we-can-fix-the-data-credibilty-crisis-in-science
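Here is a minimal sketch of the kind of check that would have caught this without a chart: re-aggregate the source events independently and compare the result with the report. The function names, the `(timestamp, count)` event shape, and the sample data are all hypothetical; this is not the library or the custom aggregator mentioned above.

```python
from collections import Counter
from datetime import datetime, timedelta


def hourly_buckets(events):
    """Sum event counts into buckets keyed by the start of each hour."""
    buckets = Counter()
    for ts, count in events:
        buckets[ts.replace(minute=0, second=0, microsecond=0)] += count
    return buckets


def report_mismatches(report, source_events):
    """Return the hours where the report disagrees with the source data."""
    expected = hourly_buckets(source_events)
    return sorted(h for h in expected if report.get(h, 0) != expected[h])


# Hypothetical source data: ten events in each of six consecutive hours.
start = datetime(2013, 5, 25, 0)
source = [(start + timedelta(hours=h, minutes=15), 10) for h in range(6)]

# Hypothetical report produced by a faulty aggregator that dropped 03:00.
report = {start + timedelta(hours=h): 10 for h in (0, 1, 2, 4, 5)}

gaps = report_mismatches(report, source)
print(gaps)
```

Plotting is still the faster way to notice that something is wrong; a comparison like this is mostly useful as a regression check once the gap has been spotted.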
Hadoop MapReduce Redis Cluster
Posted by on May 11, 2013
Google Analytics Report Metrics
Posted by on April 30, 2013
| Dimension | Category | Group | API name | Description | Available in |
| --- | --- | --- | --- | --- | --- |
| Browser | Visitors | Browser / Device | ga:browser | The browsers used by visitors to your website. | Both |
| Browser Version | Visitors | Browser / Device | ga:browserVersion | The browser versions used by visitors to your website. | Both |
| City | Visitors | Geo / Network | ga:city | The cities from which visits originated, based on IP address. | Both |
| Continent | Visitors | Geo / Network | ga:continent | The continents from which visits originated, based on IP address. | Both |
| Count of Visits | Visitors | Visitor | ga:visitCount | The total count of visits to your site. If you’re using this dimension to create a visitor segment (e.g., for a remarketing list), then you’re identifying all visitors over all sessions who match the criteria you apply; for example, all visitors who have more than 5 visits. | Both |
| Country / Territory | Visitors | Geo / Network | ga:country | The countries from which visits originated, based on IP address. | Both |
| Days Since Last Visit | Visitors | Visitor | ga:daysSinceLastVisit | The number of days elapsed since visitors last visited the site. | Both |
| Domain | Visitors | Geo / Network | ga:networkDomain | The fully qualified domain names of your visitors’ Internet service providers (ISPs). | Both |
| Flash Version | Visitors | Browser / Device | ga:flashVersion | The versions of Flash supported by visitors’ browsers, including minor versions. | Both |
| Java Support | Visitors | Browser / Device | ga:javaEnabled | Differentiates visits from browsers with and without (Yes or No) Java enabled. | Both |
| Language | Visitors | Browser / Device | ga:language | The language settings of visitors’ browsers. | Both |
| Metro | Visitors | Geo / Network | ga:metro | The Designated Market Area (DMA) from where traffic arrived on your site. | Both |
| Mobile (Including Tablet) | Visitors | Browser / Device | ga:isMobile | Indicates whether visits were from mobile and tablet devices (Yes) or not (No). | Both |
| Mobile Device Branding | Visitors | Browser / Device | ga:mobileDeviceBranding | Manufacturer or branded name (examples: Samsung, HTC, Verizon, T-Mobile). | Both |
| Mobile Device Info | Visitors | Browser / Device | ga:mobileDeviceInfo | The branding, model, and marketing name used to identify the device. | Custom reports |
| Mobile Device Marketing Name | Visitors | Browser / Device | | Marketing name used for device (example: Pearl (Blackberry)) | Both |
| Mobile Device Model | Visitors | Browser / Device | ga:mobileDeviceModel | Device model (example: Nexus S) | Both |
| Mobile Input Selector | Visitors | Browser / Device | ga:mobileInputSelector | Selector used on device (examples: touchscreen, joystick, clickwheel, stylus) | Both |
| Operating System | Visitors | Browser / Device | ga:operatingSystem | The operating systems used by visitors to your website. Includes mobile operating systems such as Android. | Both |
| Operating System Version | Visitors | Browser / Device | ga:operatingSystemVersion | The operating system versions used by visitors to your website. | Both |
| Region | Visitors | Geo / Network | ga:region | The geographic regions from which visits originated, based on IP address. | Both |
| Screen Colors | Visitors | Browser / Device | ga:screenColors | The screen color depths of visitors’ monitors. | Both |
| Screen Resolution | Visitors | Browser / Device | ga:screenResolution | The screen resolutions of visitors’ monitors. | Both |
| Service Provider | Visitors | Geo / Network | ga:networkLocation | The names of the Internet service providers (ISPs) used by visitors to your site. | Both |
| Sub Continent Region | Visitors | Geo / Network | ga:subContinent | The sub-continents from which visits originated, based on IP address. | Both |
| Tablet | Visitors | Browser / Device | | The tablets used by visitors to your website. | Advanced Segments |
| Visitor Type | Visitors | Visitor | ga:visitorType | New Visitor (first-time visit) and Returning Visitor. | Both |
| Medium | Traffic Sources | Traffic Sources | ga:medium | The mediums which referred traffic. Includes mediums identified via utm_medium. | Both |
| Referral Path | Traffic Sources | Traffic Sources | ga:referralPath | The URIs that referred traffic. | Both |
| Source | Traffic Sources | Traffic Sources | ga:source | The sources which referred traffic. Includes sources identified via utm_source. | Both |
| Source / Medium | Traffic Sources | Traffic Sources | | The source/medium combinations which referred traffic. Includes sources and mediums identified via utm_source and utm_medium. | Custom reports |
| Traffic Type | Traffic Sources | Traffic Sources | | The types of traffic to your site: search, referral, direct, and other. | Custom reports |
| Display Name | Social | Social Activities | ga:socialActivityDisplayName | Social activity display name. | Custom reports |
| Endorsing URL | Social | Social Activities | ga:socialActivityEndorsingUrl | For a social data hub activity, this value represents the URL of the social activity (e.g. the Google+ post URL, the blog comment URL, etc.). | Custom reports |
| Originating Social Action | Social | Social Activities | | Originating Social Action: The social action associated with the activity (e.g. vote, comment) | Custom reports |
| Shared URL | Social | Social Activities | ga:socialActivityContentUrl | Social Content URL — The URL/content that was talked about in the social activity. | Custom reports |
| Social Action | Social | Social Activities | ga:socialActivityAction | The social action that occurred (e.g. +1, Like, Share) | Custom reports |
| Social Activity Post | Social | Social Activities | ga:socialActivityPost | Social Activity Post — The content of the activity shared by the user. | Custom reports |
| Social Activity Timestamp | Social | Social Activities | ga:socialActivityTimestamp | The timestamp of when the social activity occurred. | Custom reports |
| Social Entity | Social | Social Interactions | ga:socialInteractionTarget | The page (i.e. URL) or entity that was shared. | Custom reports |
| Social Network | Social | Traffic Sources | ga:socialNetwork | The social network where the visit came from and/or the social activity occurred. | Custom reports |
| Social Network and Action | Social | Social Activities | ga:socialActivityNetworkAction | Originating Social Network/Action: The social network where the activity originated and the type of action taken. | Custom reports |
| Social Source | Social | Social Interactions | ga:socialInteractionNetwork | The social source or network on which the activity occurred (e.g. Facebook, Twitter, Google). | Custom reports |
| Social Source and Action | Social | Social Interactions | ga:socialInteractionNetworkAction | The social source/network and action that occurred (e.g. Facebook-Like). | Custom reports |
| Social Source Referral | Social | Traffic Sources | ga:hasSocialSourceReferral | Whether or not this activity resulted from a social source. | Custom reports |
| Social Tags Summary | Social | Social Activities | ga:socialActivityTagsSummary | For a social data hub activity, this is a comma-separated set of tags associated with the social activity. | Custom reports |
| Social Type | Social | Social Interactions | | Either “Socially Engaged” or “Not Socially Engaged”. | Custom reports |
| Social User Handle | Social | Social Activities | ga:socialActivityUserHandle | Social User Handle — The handle of the user who initiated the social activity. | Custom reports |
| User Photo URL | Social | Social Activities | ga:socialActivityUserPhotoUrl | URL for the profile photo of the user who performed a social action. | Custom reports |
| User Profile URL | Social | Social Activities | ga:socialActivityUserProfileUrl | URL for the profile of the user who performed a social action. | Custom reports |
| Connection Speed | Other | Site Speed | | The network connection speeds of visitors to your website. | Both |
| Date* | Other | Time | ga:date | The dates of the active date range.* (Same as “Visit Date (YYYYMMDD)” in Advanced Segments) | Custom reports |
| Day of Week | Other | Time | ga:dayOfWeek | The day of the week. A one-digit number from 0 (Sunday) to 6 (Saturday). | Custom reports |
| Hour | Other | Time | ga:hour | A two-digit hour of the day ranging from 00-23 in the timezone configured for the account. This value is also corrected for daylight savings time, adhering to all local rules for daylight savings time. If your timezone follows daylight savings time, there will be an apparent bump in the number of visits during the change-over hour (e.g. between 1:00 and 2:00) for the day per year when that hour repeats. A corresponding hour with zero visits will occur at the opposite changeover. (Google Analytics does not track visitor time more precisely than hours.) | Both |
| Hour of Day | Other | Time | | Date and hour. | Custom reports |
| Month of Year | Other | Time | ga:month | The month of the visit. A two digit integer from 01 to 12. | Custom reports |
| Visit Date (YYYYMMDD) | Other | Time | ga:date | Visit date in yyyymmdd format.*(Same as “Date” in Custom Reports) | Advanced Segments |
| Week of Year | Other | Time | ga:week | The week of the visit. A two-digit number from 01 to 53. Each week starts on Sunday. | Custom reports |
| Custom Variable (Key 1…n) | Custom Variables | Custom Variables | ga:customVarName(n) | The key name of the custom variable for that slot. | Both |
| Custom Variable (Value 1…n) | Custom Variables | Custom Variables | ga:customVarValue(n) | The value of the custom variable for that slot. | Both |
| Affiliation | Conversions | Ecommerce | ga:affiliation | The affiliations assigned to ecommerce transactions. | Both |
| Days to Transaction | Conversions | Ecommerce | ga:daysToTransaction | The number of days between users’ purchases and the campaign referral. | Both |
| Goal Completion Location | Conversions | Goal Conversions | | Goal Request URI | Custom reports |
| Goal Previous Step – 3/2/1 | Conversions | Goal Conversions | | The URI that was loaded 1, 2, or 3 steps prior to the goal completion location. | Custom reports |
| Product | Conversions | Ecommerce | ga:productName | The product names of items sold. | Both |
| Product Category | Conversions | Ecommerce | ga:productCategory | The categories of products sold. | Both |
| Product SKU | Conversions | Ecommerce | ga:productSku | The product codes of items sold. | Custom reports |
| Transaction | Conversions | Ecommerce | ga:transactionId | The transaction IDs of the ecommerce transactions. | Custom reports |
| Visits to Transaction | Conversions | Ecommerce | ga:visitsToTransaction | The number of visits from referral to purchase. | Both |
| App ID | Content | App Tracking | | The individual app ID designated by a specific app marketplace, like GooglePlay or another AppStore. | Both |
| App Installer ID | Content | App Tracking | | The name or package name for a specific app marketplace, like GooglePlay or another AppStore. | Both |
| App Name | Content | App Tracking | | The official name of your app as it’s designated by the developer in your account (e.g., MyApp: Special Edition). | Both |
| App Version | Content | App Tracking | | The version number of an app (e.g., 1.5). | Both |
| Destination Page | Content | Internal Search | ga:searchDestinationPage | A page that the user visited after performing an internal website search. | Custom reports |
| Event Action | Content | Event Tracking | ga:eventAction | The actions that were assigned to triggered events. | Both |
| Event Category | Content | Event Tracking | ga:eventCategory | The categories that were assigned to triggered events. | Both |
| Event Label | Content | Event Tracking | ga:eventLabel | The optional labels used to describe triggered events. | Both |
| Exception Description | Content | App Tracking | | The description of an exception, or technical error, as defined by your developer in your App Tracking code. | Both |
| Exit Page | Content | Page Tracking | ga:exitPagePath | The pages visitors viewed last on your site. | Both |
| Exit Screen | Content | App Tracking | | A screen from which visitors exit an app. | Custom reports |
| Experiment ID | Content | Content Experiments | | Visits by people who saw the experiment page. | Advanced Segments |
| Full Referrer | Content | Traffic Sources | | The URLs that referred traffic. | Custom reports |
| Hostname | Content | Page Tracking | ga:hostname | The hostnames visitors used to reach your site. Typically, your site’s URL. | Both |
| Landing Page | Content | Page Tracking | ga:landingPagePath | The pages through which visitors entered your site. | Both |
| Landing Screen | Content | App Tracking | | A screen through which visitors enter an app. | Custom reports |
| Page | Content | Page Tracking | ga:pagePath | The pages visited, listed by path and/or query parameters. | Both |
| Page Depth | Content | Page Tracking | ga:pageDepth | The number of pages viewed by visitors in a session. | Both |
| Page path level 4/3/2/1 | Content | Page Tracking | ga:pagePathLevel1 | Page Path Level 1, 2, 3 or 4. | Custom reports |
| Page Title | Content | Page Tracking | ga:pageTitle | The page titles used on your site. | Both |
| Refined Keyword | Content | Internal Search | ga:searchKeywordRefinement | The search terms used to refine internal searches. | Both |
| Screen Depth | Content | Internal Search | | The number of screens viewed in a session. | Both |
| Screen Name | Content | App Tracking | | The name of a specific app screen. | Custom reports |
| Search Term | Content | Internal Search | ga:searchKeyword | The search terms used by visitors to search your site. | Both |
| Site Search Category | Content | Internal Search | ga:searchCategory | The categories searched by visitors searching your site. | Both |
| Site Search Status | Content | Internal Search | ga:searchUsed | Distinguishes visits that included an internal site search and those that did not. | Both |
| Start Page | Content | Internal Search | ga:searchStartPage | The pages from which visitors searched your site. | Custom reports |
| Timing Category | Content | User Timings | ga:userTimingCategory | User specified category for user timing. | Custom reports |
| Timing Label | Content | User Timings | ga:userTimingLabel | User specified label for user timing. | Custom reports |
| Timing Variable | Content | User Timings | ga:userTimingVariable | User timing variable | Custom reports |
| User Defined Value | Content | Visitor | ga:userDefinedValue | The value provided when you define custom visitor segments for your website. | Both |
| Variation | Content | Content Experiments | | Visits for a specific combination of pages in an experiment; for example: Variation Page A and Goal A; Variation Page B and Goal A. | Advanced Segments |
| Ad Content | Advertising | AdWords | ga:adContent | The first line of each AdWords ad and the utm_content tags that were used in tagged campaigns. | Both |
| Ad Distribution Network | Advertising | AdWords | ga:adDistributionNetwork | The location where your ad was shown (google.com, search partners, content network). | Custom reports |
| Ad Group | Advertising | AdWords | ga:adGroup | The names of your AdWords ad groups. | Both |
| Ad Slot | Advertising | AdWords | ga:adSlot | The location of the advertisement on the hosting page (Top, RHS, or not set). | Both |
| Ad Slot Position | Advertising | AdWords | ga:adSlotPosition | The ad slot positions in which your AdWords ads appeared (1-8). | Both |
| Campaign | Advertising | AdWords | | TV campaign | Custom reports |
| Campaign | Advertising | AdWords | ga:campaign | The names of your AdWords campaigns and the utm_campaign tags that were used in tagged campaigns. | Both |
| Destination URL | Advertising | AdWords | ga:adDestinationUrl | The URLs to which your AdWords ads referred traffic. | Custom reports |
| Keyword | Advertising | Traffic Sources | ga:keyword | All keywords, both paid and unpaid, used by users to reach your site. | Both |
| Match Type | Advertising | AdWords | ga:adMatchType | How the keyword was matched to the query (i.e. exact, broad, phrase). | Custom reports |
| Matched Search Query | Advertising | AdWords | ga:adMatchedQuery | The actual search queries that triggered impressions of your AdWords ads. | Custom reports |
| Placement Domain | Advertising | AdWords | ga:adPlacementDomain | The domains where your ads on the content network were placed. | Custom reports |
| Placement Type | Advertising | AdWords | ga:adTargetingOption | Automatic placement or managed placement. | Custom reports |
| Placement URL | Advertising | AdWords | ga:adPlacementUrl | The URLs where your ads on the content network were placed. | Custom reports |
| Social Annotation Type | Advertising | AdWords | | The type of +1 annotation made to your ads: None, Basic, or Personal. | Custom reports |
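The `ga:` names in the table above are the ones you pass when querying report data programmatically. The sketch below only assembles a query string; it does not call any service. The parameter names (`ids`, `start-date`, `end-date`, `metrics`, `dimensions`), the `ga:visits` metric, the seven-dimension limit, and the profile ID are assumptions based on my recollection of the Core Reporting API of the time, so check them against the official documentation before relying on them.

```python
from urllib.parse import urlencode

# Assumed per-query limit on dimensions; verify against the API docs.
MAX_DIMENSIONS = 7


def build_query(profile_id, start_date, end_date, metrics, dimensions):
    """Assemble a URL-encoded query string from ga: metric and dimension names."""
    if len(dimensions) > MAX_DIMENSIONS:
        raise ValueError("too many dimensions for one query")
    params = [
        ("ids", "ga:" + profile_id),        # hypothetical profile ID
        ("start-date", start_date),
        ("end-date", end_date),
        ("metrics", ",".join(metrics)),
        ("dimensions", ",".join(dimensions)),
    ]
    return urlencode(params)


query = build_query(
    "12345",
    "2013-04-01",
    "2013-04-30",
    metrics=["ga:visits"],
    dimensions=["ga:browser", "ga:country"],
)
print(query)
```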
Cloudera Data Science Essentials Training
Posted by on April 25, 2013
Photo http://wikibon.org/blog/data-visualization/
Data Science Essentials Exam (DS-200) Preparation
Recommended Cloudera Training Courses
- Cloudera Developer Training for Apache Hadoop
- Introduction to Data Science – Building Recommender Systems
Online Data Science Resources
- New to Data Science: Tutorials, papers, background, meetups, a list of books, and links to our Data Science blog post from Cloudera Developer Resources.
- Data Processing & Analytics: Hadoop resources and materials listed by function.
- New to Hadoop: Introductory topics from Cloudera’s developer resources.
- http://www.quora.com/Data-Science
Books
- Hadoop: The Definitive Guide 3e by Tom White (Chapters 4, 7, 12, 15, 16)
- Hadoop In Practice by Alex Holmes (Chapters 2, 3, 8, 9, 10)
- Programming Collective Intelligence by Toby Segaran
- Algorithms of the Intelligent Web by Haralambos Marmanis and Dmitry Babenko
- Mahout In Action by Sean Owen, et al.
- Data-Intensive Text Processing with MapReduce by Jimmy Lin, et al. (PDF download) (Chapter 6)
- Beautiful Data by Toby Segaran, Jeff Hammerbacher (Chapter 5)
- Hadoop In Action by Chuck Lam (Chapter 12 – Case Studies)
- Introduction to Data Science online textbook (PDF download or interactive .epub)
- Pattern Recognition and Machine Learning
- A Programmer’s Guide to Data Mining (Free PDF download)
Blogs/misc.
- Cloudera’s blogs on data science
- Hilary Mason’s blog
- Math Babe
- FiveThirtyEight
- Kaggle Competitions blog
- Alex Holmes’s blog
- O’Reilly Radar
- Flowing Data
- Gapminder
- MLcomp
Exam Sections
These are the current DS-200 Data Science Essentials beta exam sections:
- Data Acquisition
- Data Evaluation
- Data Transformation
- Machine Learning Basics
- Clustering
- Classification
- Collaborative Filtering
- Model/Feature Selection
- Probability
- Visualization
- Optimization
Data Acquisition
Objectives
- Access and load data from a variety of sources into a Hadoop cluster, including from databases and systems such as OLTP and OLAP as well as log files and documents.
- Deploy a variety of techniques for acquiring data, including database integration and working with APIs
- Use command line tools such as wget and curl
- Use Hadoop tools such as Sqoop and Flume
Section Study Resources
- Apache Sqoop is a tool for acquiring data from structured datastores. Cloudera’s blogs on Apache Sqoop. Aaron Kimball on Sqoop.
- Apache Flume, built for ingesting streaming data into HDFS. Cloudera’s blogs on Apache Flume. Cloudera’s blogs on data collection.
- HDFS File System Shell Guide
- Hadoop: The Definitive Guide, 3rd Edition: Chapter 15.
- Hadoop In Practice: Chapter 2.
Data Evaluation
Objectives
- Knowledge of the file types commonly used for input and output and the advantages and disadvantages of each
- Methods for working with various file formats including binary files, JSON, XML, and .csv
- Tools, techniques, and utilities for evaluating data from the command line and at scale
- An understanding of sampling and filtering techniques
- A familiarity with Hadoop SequenceFiles and serialization using Avro
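As one illustration of the sampling and filtering objective, here is a small sketch of reservoir sampling (Algorithm R), which draws a fixed-size uniform sample from a stream without knowing its length in advance. The `keep` filter and the sample records are hypothetical, and this is only one of several sampling techniques the exam could cover.

```python
import random


def reservoir_sample(records, k, seed=None):
    """Return k records chosen uniformly at random from an iterable."""
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = record
    return sample


def keep(record):
    """Hypothetical filter: drop blank lines and comment lines."""
    return bool(record.strip()) and not record.startswith("#")


lines = ["# header", "a,1", "", "b,2", "c,3", "d,4"]
sample = reservoir_sample((line for line in lines if keep(line)), k=2, seed=0)
print(sample)
```

Because it makes a single pass and holds only `k` records in memory, the same function works on a file handle or on `sys.stdin`.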
Section Study Resources
- Hadoop: The Definitive Guide, 3rd Edition: Chapter 4.
- Hadoop In Practice: Chapter 3.
- Learn more about Apache Avro. Cloudera’s blogs on Apache Avro.
Data Transformation
Objectives
- Write a map-only Hadoop Streaming job
- Write a script that receives records on stdin and writes them to stdout
- Invoke Unix tools to convert file formats
- Join data sets
- Write scripts to anonymize data sets
- Write a Mapper using Python and invoke via Hadoop streaming
- Write a custom subclass of FileOutputFormat
- Write records into a new format such as AvroOutputFormat or SequenceFileOutputFormat
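Several of these objectives fit in one small script. The sketch below is a map-only style mapper that reads tab-separated records, replaces the first field with a salted hash to anonymize it, and writes the result back out. The record layout, the `SALT` value, and the function names are hypothetical, and truncated salted hashing is only a simple illustration of anonymization, not a complete privacy technique.

```python
import hashlib
import io

SALT = "change-me"  # hypothetical salt; keep a real one secret


def anonymize(value):
    """Replace an identifier with a truncated salted SHA-256 digest."""
    digest = hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()
    return digest[:16]


def run_mapper(stdin, stdout):
    """Read tab-separated records and emit them with the first field hashed."""
    for line in stdin:
        fields = line.rstrip("\n").split("\t")
        if not fields[0]:
            continue  # skip records without an identifier
        fields[0] = anonymize(fields[0])
        stdout.write("\t".join(fields) + "\n")


# Local demonstration with in-memory streams instead of real stdin/stdout.
demo_in = io.StringIO(
    "user42\t/index.html\nuser42\t/about\n\t/skipped\nuser7\t/index.html\n"
)
demo_out = io.StringIO()
run_mapper(demo_in, demo_out)
rows = [row.split("\t") for row in demo_out.getvalue().splitlines()]
print(rows)
```

To use it as a real mapper, the script would end with `run_mapper(sys.stdin, sys.stdout)` (after `import sys`) and be passed to Hadoop Streaming as the mapper, with the number of reduce tasks set to zero for a map-only job.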
Section Study Resources
- Read up on Hadoop Streaming
- Hadoop Streaming wiki
- Apache Hive facilitates easy analysis of large datasets stored in HDFS by providing a SQL-like query language called HiveQL. Hive Tutorial and Language Manual. Hive Joins documentation.
- Apache Pig facilitates analysis of large datasets stored in HDFS providing a high-level language called Pig Latin. Pig’s Relational Operators
- Cloudera blog post: A guide to Python Frameworks for Hadoop by data scientist Uri Laserson
- Hadoop: The Definitive Guide, 3rd Edition: Chapters 7, 12
- Hadoop In Practice: Chapters 8, 10
Machine Learning Basics
Objectives
- Understand how to use Mappers and Reducers to create predictive models
- Understand the different kinds of machine learning, including supervised and unsupervised learning
- Recognize appropriate uses of the following: parametric/non-parametric algorithms, support vector machines, kernels, neural networks, clustering, dimensionality reduction, and recommender systems
Section Study Resources
- Apache Mahout. Check out the Mahout wiki
- Cloudera’s blog category on Mahout
- Hadoop In Practice: Chapter 9
- Hadoop: The Definitive Guide, 3rd Edition: Chapter 16 – Case Studies
- Algorithms of the Intelligent Web: Chapter 7 – (Use Cases)
- A Programmer’s Guide to Data Mining
Clustering
Objectives
- Define clustering and identify appropriate use cases
- Identify appropriate uses of various models including centroid, distribution, density, group, and graph
- Describe the value and use of similarity metrics including Pearson correlation, Euclidean distance, and block distance
- Identify the algorithms applicable to each model (k-means, SVD/PCA, etc.)
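The three similarity metrics named above are short enough to write out by hand. This is a plain-Python sketch using the standard definitions (block distance is also called Manhattan distance); the function names are my own, and libraries such as Mahout ship their own implementations.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def block_distance(a, b):
    """Block (Manhattan) distance: the sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def pearson_correlation(a, b):
    """Pearson correlation in [-1, 1]; returns 0.0 if either vector is constant."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


print(euclidean_distance([0, 0], [3, 4]))
print(block_distance([0, 0], [3, 4]))
print(pearson_correlation([1, 2, 3], [2, 4, 6]))
```

Note that the two distances shrink as items become more alike, while Pearson correlation grows, so a clustering routine has to know which kind of score it is being handed.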
Section Study Resources
- Programming Collective Intelligence: Chapter 3
- Algorithms of the Intelligent Web: Chapter 4
- Mahout In Action: Part 2
Classification
Objectives
- Describe the steps for training a set of data in order to identify new data based on known data
- Identify the use cases for logistic regression and Bayes’ theorem
- Define classification techniques and formulas
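To make the Bayes' theorem objective concrete, here is a worked sketch of a single-feature classifier: given how often a word appears in spam and in non-spam, compute the probability that a message containing it is spam. The probabilities and the 0.5 decision threshold are made-up illustration values.

```python
def posterior(prior, likelihood, likelihood_given_not):
    """P(class | evidence) by Bayes' theorem for a two-class problem."""
    evidence = likelihood * prior + likelihood_given_not * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical training statistics:
#   20% of messages are spam,
#   the word appears in 60% of spam and in 5% of non-spam.
p_spam = posterior(prior=0.2, likelihood=0.6, likelihood_given_not=0.05)
label = "spam" if p_spam > 0.5 else "not spam"
print(round(p_spam, 2), label)
```

Here the numerator is 0.6 × 0.2 = 0.12 and the evidence is 0.12 + 0.05 × 0.8 = 0.16, so the posterior is 0.75.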
Section Study Resources
- Programming Collective Intelligence: Chapters 6, 7, 8, 9, 12
- Algorithms of the Intelligent Web: Chapters 5, 6
- Mahout In Action: Part 3
Collaborative Filtering
Objectives
- Identify the use of user-based and item-based collaborative filtering techniques
- Describe the limitations and strengths of collaborative filtering techniques
- Given a scenario, determine the appropriate collaborative filtering implementation
- Given a scenario, determine the metrics one should use to evaluate the accuracy of a recommender system
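A rough sketch of user-based collaborative filtering, plus RMSE as one common accuracy metric. The ratings, the choice of cosine similarity over shared items, and the function names are all illustrative assumptions; Mahout and the books listed below use their own data structures and offer several similarity measures.

```python
import math

# Hypothetical ratings: user -> {item: rating}
ratings = {
    "alice": {"A": 5.0, "B": 3.0, "C": 4.0},
    "bob": {"A": 4.0, "B": 2.0, "C": 5.0, "D": 4.0},
    "carol": {"A": 1.0, "B": 5.0, "D": 2.0},
}


def cosine_similarity(u, v):
    """Cosine similarity computed over the items both users rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)


def predict(user, item, all_ratings):
    """Similarity-weighted average of other users' ratings for the item."""
    num = den = 0.0
    for other, theirs in all_ratings.items():
        if other == user or item not in theirs:
            continue
        sim = cosine_similarity(all_ratings[user], theirs)
        num += sim * theirs[item]
        den += sim
    return num / den if den else None


def rmse(predicted, actual):
    """Root mean squared error between predicted and held-out ratings."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


guess = predict("alice", "D", ratings)
print(guess)
```

An item-based variant would compute similarities between item columns instead of user rows, which tends to be more stable when there are far more users than items.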
Section Study Resources
- Recommendation engines with Apache Mahout
- Programming Collective Intelligence: Chapter 2
- Algorithms of the Intelligent Web: Chapter 3
- Mahout In Action: Part 1
Model/Feature Selection
Objectives
- Describe the role and function of feature selection
- Analyze a scenario and determine the appropriate features and attributes to select
- Analyze a scenario and determine the methods to deploy for optimal feature selection
Section Study Resources
- Programming Collective Intelligence: Chapter 10
- Pattern Recognition and Machine Learning: Chapter 1.3
Probability
Objectives
- Analyze a scenario and determine the likelihood of a particular outcome
- Determine sample percentiles
- Determine a range of items based on a sample probability density function
- Summarize a distribution of sample numbers
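For the percentile and summary objectives, a short sketch using only the standard library. The nearest-rank definition of a percentile is one of several in common use, so other tools may return slightly different values; the sample numbers are made up.

```python
import math
import statistics


def percentile(sample, p):
    """Nearest-rank percentile for 0 < p <= 100."""
    ordered = sorted(sample)
    rank = max(math.ceil(p / 100 * len(ordered)), 1)
    return ordered[rank - 1]


sample = [2, 4, 4, 4, 5, 5, 7, 9]

summary = {
    "mean": statistics.mean(sample),
    "median": statistics.median(sample),
    "stdev": statistics.pstdev(sample),  # population standard deviation
    "p25": percentile(sample, 25),
    "p90": percentile(sample, 90),
}
print(summary)
```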
Section Study Resources
- Programming Collective Intelligence: Chapter 8 (Estimating Probability Density)
- Pattern Recognition and Machine Learning: Chapter 2
- Probability, Statistics, Bayes’ Theorem at BetterExplained
Visualization
Objectives
- Determine the most effective visualization for a given problem
- Analyze a data visualization and interpret its meaning
Section Study Resources
- Data Visualization: Modern Approaches
- Data Visualization basics
- Sample Visualizations
- datavisualization.ch
- Data Visualization for Human Perception
Optimization
Objectives
- Understand optimization methods
- Identify 1st order and 2nd order optimization techniques
- Determine the learning rate for a particular algorithm
- Determine the sources of errors in a model
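To contrast first-order and second-order methods, this sketch minimizes the toy function f(x) = (x - 3)^2 with gradient descent (which uses only the first derivative and a learning rate) and with Newton's method (which also uses the second derivative). The function and the learning rates are illustrative choices, not exam material.

```python
def grad(x):
    """First derivative of f(x) = (x - 3)^2."""
    return 2.0 * (x - 3.0)


def hess(x):
    """Second derivative of f(x) = (x - 3)^2, constant for a quadratic."""
    return 2.0


def gradient_descent(x, learning_rate, steps):
    """First-order update: step against the gradient, scaled by the rate."""
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x


def newton(x, steps):
    """Second-order update: scale the gradient by the inverse curvature."""
    for _ in range(steps):
        x = x - grad(x) / hess(x)
    return x


good = gradient_descent(0.0, learning_rate=0.1, steps=100)
bad = gradient_descent(0.0, learning_rate=1.1, steps=50)
exact = newton(0.0, steps=1)
print(good, bad, exact)
```

With a rate of 0.1 the error shrinks by a factor of 0.8 per step; with 1.1 each step overshoots and the error grows by a factor of 1.2, so the iterate diverges. Newton's method lands on the minimum of a quadratic in one step, at the cost of needing the second derivative.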
Section Study Resources
- Leon Bottou on Stochastic Learning from Advanced Lectures on Machine Learning
- Leon Bottou on Online Algorithms and Stochastic Approximations
- Programming Collective Intelligence: Chapter 5
- Data-Intensive Text Processing with MapReduce: Chapter 6
http://university.cloudera.com/certification/prep/datascience.html
“Start with a small data project” – Alex Hutton
Posted by on April 24, 2013
I recently watched a webinar by Alex Hutton on Security Data Analytics. It reminded me of this HBR post.
To Succeed with Big Data, Start Small
http://blogs.hbr.org/cs/2012/10/to_succeed_with_big_data_start.html
While it isn’t hard to argue the value of analyzing big data, it is intimidating to figure out what to do first. There are many unknowns when working with data that your organization has never used before — the streams of unstructured information from the web, for example. Which elements of the data hold value? What are the most important metrics the data can generate? What quality issues exist? As a result of these unknowns, the costs and time required to achieve success can be hard to estimate.
As an organization gains experience with specific types of data, certain issues will fade, but there will always be another new data source with the same unknowns waiting in the wings. The key to success is to start small. It’s a lower-risk way to see what big data can do for your firm and to test your firm’s readiness to use it.
The Traditional Way
In most organizations, big data projects get their start when an executive becomes convinced that the company is missing out on opportunities in data. Perhaps it’s the CMO looking to glean new insight into customer behavior from web data, for example. That conviction leads to an exhaustive and time-consuming process by which the CMO’s team might work with the CIO’s team to specify and scope the precise insights to be pursued and the associated analytics to get them.
Next, the organization launches a major IT project. The CIO’s team designs and implements complex processes to capture all the raw web data needed and transform it into usable (structured) information that can then be analyzed.
Once analytic professionals start using the data, they’ll find problems with the approach. This triggers another iteration of the IT project. Repeat a few times and everyone will be pulling their hair out and questioning why they ever decided to try to analyze the web data in the first place. This is a scenario I have seen play out many times in many organizations.
A Better Approach
The process I just described doesn’t work for big data initiatives because it’s designed for cases where all the facts are known, all the risks are identified, and all steps are clear: exactly what you won’t find with a big data initiative. After all, you’re applying a new data source to new problems in a new way.
Again, my best advice is to start small. First, define a few relatively simple analytics that won’t take much time or data to run. For example, an online retailer might start by identifying what products each customer viewed so that the company can send a follow-up offer if they don’t purchase. A few intuitive examples like this allow the organization to see what the data can do. More importantly, this approach yields results that are easy to test to see what type of lift the analytics provide.
Next, instead of setting up formal processes to capture, process, and analyze all of the data all of the time, capture some of the data in a one-off fashion. Perhaps a month’s worth for one division for a certain subset of products. If you capture only the data you need to perform the test, you’ll find the initial data volume easier to manage and you won’t muddy the water with a bunch of other data — a problem that plagues many big data initiatives.
At this point, it is time to turn analytic professionals loose on the data. Remember: they’re used to dealing with raw data in an unfriendly format. They can zero in on what they need and ignore the rest. They can create test and control groups to whom they can send the follow-up offers, and then they can help analyze the results. During this process, they’ll also learn an awful lot about the data and how to make use of it. This kind of targeted prototyping is invaluable when it comes to identifying trouble and firming up a broader effort.
Successful prototypes also make it far easier to get the support required for the larger effort. Best of all, the full effort will now be less risky because the data is better understood and the value is already partially proven. It’s also worthwhile to learn that the initial analytics aren’t as valuable as hoped. It tells you to focus effort elsewhere before you’ve wasted many months and a lot of money.
Pursuing big data with small, targeted steps can actually be the fastest, least expensive, and most effective way to go. It enables an organization to prove there’s value in major investment before making it and to understand better how to make a big data program pay off for the long term.
PCAP analysis via VirusTotal
Posted by on April 22, 2013
VirusTotal += PCAP Analyzer
VirusTotal is a greedy creature. One of its gluttonous wishes is to understand and characterize all the races it encounters. It already understood the insurgent collective of Portable Executables, the greenish creatures known as Android APKs, the talkative PDF civilization, and so on. As of today it also figures out PCAPs, a rare group of individuals obsessed with recording everything they see.
PCAP files contain network packet data created during a live network capture, often used for packet sniffing and analyzing data network characteristics. In the malware research field PCAPs are often used to:
- Record malware network communication when executed in sandboxed environments.
- Record honeyclient browser exploitation traces.
- Log network activity seen by network appliances and IDS.
- etc.
http://blog.virustotal.com/2013/04/virustotal-pcap-analyzer.html
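Before uploading a capture anywhere, it can be handy to confirm that the file really is a PCAP. The sketch below parses the 24-byte global header of the classic libpcap file format using only the standard library. It has nothing to do with how VirusTotal's analyzer works internally, it does not handle the pcapng format or the nanosecond-resolution magic number, and the sample bytes are generated in memory rather than read from a real capture.

```python
import struct

PCAP_MAGIC = 0xA1B2C3D4  # classic libpcap, microsecond timestamps


def parse_global_header(data):
    """Parse the 24-byte global header at the start of a classic pcap file."""
    if len(data) < 24:
        raise ValueError("need at least 24 bytes")
    # The magic number tells us which byte order the writer used.
    if struct.unpack("<I", data[:4])[0] == PCAP_MAGIC:
        endian = "<"
    elif struct.unpack(">I", data[:4])[0] == PCAP_MAGIC:
        endian = ">"
    else:
        raise ValueError("not a classic pcap file")
    major, minor, thiszone, sigfigs, snaplen, linktype = struct.unpack(
        endian + "HHiIII", data[4:24]
    )
    return {
        "byte_order": "little" if endian == "<" else "big",
        "version": (major, minor),
        "snaplen": snaplen,
        "linktype": linktype,  # 1 means Ethernet
    }


# Build a sample header in memory: version 2.4, snaplen 65535, Ethernet.
sample = struct.pack("<IHHiIII", PCAP_MAGIC, 2, 4, 0, 0, 65535, 1)
header = parse_global_header(sample)
print(header)
```

For a file on disk, read the first 24 bytes in binary mode, for example `open(path, "rb").read(24)`, and pass them to `parse_global_header`.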
Data Scientist Roles
Posted by on April 15, 2013
http://www.fastcolabs.com/3008620/lessons-crash-course-data-science
Data Science Bootcamp
THE IDEA BEHIND BIG DIVE IS TO BOOST THE GROWTH OF A NEW GENERATION OF DEVELOPERS.
A street-fighting gym where high value datasets are the raw material in the hands of a bunch of ambitious smart geeks tutored and mentored by experts in three key areas: Development, Visualization and Data Science.