BigSnarf blog

Infosec FTW

Cloudera Data Science Essentials Training

Posted by Security Dude on April 25, 2013

Photo http://wikibon.org/blog/data-visualization/

Data Science Essentials Exam (DS-200) Preparation

Online Data Science Resources

New to Data Science: Tutorials, papers, background, meetups, a list of books, and links to our Data Science blog post from Cloudera Developer Resources.
Data Processing & Analytics: Hadoop resources and materials listed by function.
New to Hadoop: Introductory topics from Cloudera’s developer resources.
http://www.quora.com/Data-Science

Books

Blogs/misc.

Exam Sections

These are the current sections of the DS-200 Data Science Essentials beta exam:

Data Acquisition

Objectives

Access and load data from a variety of sources into a Hadoop cluster, including from databases and systems such as OLTP and OLAP as well as log files and documents.
Deploy a variety of techniques for acquiring data, including database integration and working with APIs
Use command-line tools such as wget and curl
Use Hadoop tools such as Sqoop and Flume

Section Study Resources

Apache Sqoop is a tool for acquiring data from structured datastores. Cloudera’s blogs on Apache Sqoop. Aaron Kimball on Sqoop.
Apache Flume, built for ingesting streaming data into HDFS. Cloudera’s blogs on Apache Flume. Cloudera’s blogs on data collection.
HDFS File System Shell Guide
Hadoop: The Definitive Guide, 3rd Edition: Chapter 15.
Hadoop In Practice: Chapter 2.

Data Evaluation

Objectives

Knowledge of the file types commonly used for input and output and the advantages and disadvantages of each
Methods for working with various file formats including binary files, JSON, XML, and .csv
Tools, techniques, and utilities for evaluating data from the command line and at scale
An understanding of sampling and filtering techniques
A familiarity with Hadoop SequenceFiles and serialization using Avro
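
One sampling technique that fits the objectives above is reservoir sampling, which draws a uniform random sample from a stream whose length is not known in advance (for example, a large log file piped through a script). The sketch below is only an illustration; the function name and the `seed` parameter are assumptions made for this example and do not come from the exam guide.

```python
import random


def reservoir_sample(records, k, seed=None):
    """Return up to k records drawn uniformly at random from an iterable.

    The iterable is read once, so this works on streams too large to hold
    in memory. `seed` is only there to make runs repeatable.
    """
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                sample[j] = record
    return sample
```

Calling `reservoir_sample(open("access.log"), 100)` would return 100 lines from a hypothetical log file without loading the whole file into memory.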

Section Study Resources

Hadoop: The Definitive Guide, 3rd Edition: Chapter 4.
Hadoop In Practice: Chapter 3.
Learn more about Apache Avro. Cloudera’s blogs on Apache Avro.

Data Transformation

Objectives

Write a map-only Hadoop Streaming job
Write a script that receives records on stdin and writes them to stdout
Invoke Unix tools to convert file formats
Join data sets
Write scripts to anonymize data sets
Write a Mapper using Python and invoke via Hadoop streaming
Write a custom subclass of FileOutputFormat
Write records into a new format such as AvroOutputFormat or SequenceFileOutputFormat
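
Several of the objectives above (a map-only streaming job, a script that reads stdin and writes stdout, a Python Mapper, and anonymizing a data set) can be practiced with one small script. The sketch below assumes tab-separated records whose first field is a user identifier; that layout, the `SALT` value, and the function names are all hypothetical. A salted hash is strictly pseudonymization, so treat it as a starting point rather than a complete anonymization scheme.

```python
#!/usr/bin/env python
import hashlib
import sys

# Hypothetical salt; in practice it would be kept out of the source file.
SALT = "change-me"


def anonymize(value, salt=SALT):
    """Replace an identifier with a truncated salted SHA-256 digest."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:16]


def map_lines(lines):
    """Yield each tab-separated record with its first field anonymized."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if not fields[0]:
            # Skip records that have no identifier in the first field.
            continue
        fields[0] = anonymize(fields[0])
        yield "\t".join(fields)


if __name__ == "__main__":
    # Hadoop Streaming feeds input records on stdin and collects stdout.
    for record in map_lines(sys.stdin):
        print(record)
```

The script can be tested locally with a pipe such as `cat sample.tsv | python mapper.py` before it is submitted through Hadoop Streaming with the number of reduce tasks set to zero (the streaming option `-numReduceTasks 0`), which makes the job map-only.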

Section Study Resources

Read up on Hadoop Streaming
Hadoop Streaming wiki
Apache Hive facilitates easy analysis of large datasets stored in HDFS by providing a SQL-like query language called HiveQL. Hive Tutorial and Language Manual. Hive Joins documentation.
Apache Pig facilitates analysis of large datasets stored in HDFS providing a high-level language called Pig Latin. Pig’s Relational Operators
Cloudera blog post: A guide to Python Frameworks for Hadoop by data scientist Uri Laserson
Hadoop: The Definitive Guide, 3rd Edition: Chapters 7, 12
Hadoop In Practice: Chapter 8, 10

Machine Learning Basics

Objectives

Understand how to use Mappers and Reducers to create predictive models
Understand the different kinds of machine learning, including supervised and unsupervised learning
Recognize appropriate uses of the following: parametric/non-parametric algorithms, support vector machines, kernels, neural networks, clustering, dimensionality reduction, and recommender systems

Section Study Resources

Apache Mahout. Check out the Mahout wiki
Cloudera’s blog category on Mahout
Hadoop In Practice: Chapter 9
Hadoop: The Definitive Guide, 3rd Edition: Chapter 16 – Case Studies
Algorithms of the Intelligent Web: Chapter 7 – (Use Cases)
A Programmer’s Guide to Data Mining

Clustering

Objectives

Define clustering and identify appropriate use cases
Identify appropriate uses of various models including centroid, distribution, density, group, and graph
Describe the value and use of similarity metrics including Pearson correlation, Euclidean distance, and block distance
Identify the algorithms applicable to each model (k-means, SVD/PCA, etc.)
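
The three similarity metrics named above are short enough to write out by hand, which helps when comparing how they behave. The sketch below uses plain Python lists of equal length; the function names are chosen for this example, and "block distance" is taken to mean the city-block (Manhattan) distance.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def block_distance(a, b):
    """City-block (Manhattan) distance: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def pearson(a, b):
    """Pearson correlation coefficient, in the range [-1, 1]."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # Correlation is undefined for a constant vector; return 0 here.
        return 0.0
    return cov / math.sqrt(var_a * var_b)
```

Note that the two distances grow as vectors become less alike, while Pearson correlation grows as they become more alike and ignores differences in scale, so `pearson([1, 2, 3], [2, 4, 6])` is 1 even though the Euclidean distance between those vectors is not zero.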

Section Study Resources

Programming Collective Intelligence: Chapter 3
Algorithms of the Intelligent Web: Chapter 4
Mahout In Action: Part 2

Classification

Objectives

Describe the steps for training a set of data in order to identify new data based on known data
Identify the use cases for logistic regression and Bayes’ theorem
Define classification techniques and formulas
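
As a worked example of Bayes' theorem in a two-class setting, the sketch below computes the posterior probability of a class given one observed feature. The function name and the spam-filter numbers in the usage note are made up for illustration.

```python
def posterior(prior, likelihood, likelihood_given_not):
    """Bayes' theorem for two classes.

    prior                -- P(class)
    likelihood           -- P(feature | class)
    likelihood_given_not -- P(feature | not class)
    Returns P(class | feature).
    """
    # Total probability of seeing the feature at all.
    evidence = likelihood * prior + likelihood_given_not * (1 - prior)
    if evidence == 0:
        raise ValueError("the feature has zero probability under both classes")
    return likelihood * prior / evidence
```

Suppose 20% of messages are spam, a given word appears in 60% of spam and in 5% of non-spam. Then `posterior(0.2, 0.6, 0.05)` is 0.12 / (0.12 + 0.04) = 0.75, so a message containing the word is spam with probability 0.75.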

Section Study Resources

Programming Collective Intelligence: Chapters 6, 7, 8, 9, 12
Algorithms of the Intelligent Web: Chapters 5, 6
Mahout In Action: Part 3

Collaborative Filtering

Objectives

Identify the use of user-based and item-based collaborative filtering techniques
Describe the limitations and strengths of collaborative filtering techniques
Given a scenario, determine the appropriate collaborative filtering implementation
Given a scenario, determine the metrics one should use to evaluate the accuracy of a recommender system
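
For the last objective, commonly used accuracy metrics include root mean squared error (RMSE) and mean absolute error (MAE) for predicted ratings, and precision at k for ranked recommendation lists. The sketch below is a minimal version of each, assuming Python 3 and plain lists; which metric is appropriate depends on the scenario, and the function names are chosen for this example.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def mae(predicted, actual):
    """Mean absolute error between predicted and actual ratings."""
    errors = [abs(p - a) for p, a in zip(predicted, actual)]
    return sum(errors) / len(errors)


def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that the user found relevant."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k
```

RMSE penalizes large rating errors more heavily than MAE, while precision at k ignores rating values and only asks whether the top of the list contains items the user actually liked.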

Section Study Resources

Recommendation engines with Apache Mahout
Programming Collective Intelligence: Chapter 2
Algorithms of the Intelligent Web: Chapter 3
Mahout In Action: Part 1

Model/Feature Selection

Objectives

Describe the role and function of feature selection
Analyze a scenario and determine the appropriate features and attributes to select
Analyze a scenario and determine the methods to deploy for optimal feature selection

Section Study Resources

Programming Collective Intelligence: Chapter 10
Pattern Recognition and Machine Learning: Chapter 1.3

Probability

Objectives

Analyze a scenario and determine the likelihood of a particular outcome
Determine sample percentiles
Determine a range of items based on a sample probability density function
Summarize a distribution of sample numbers
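
Sample percentiles and summary statistics can be practiced with the Python standard library. The sketch below uses the nearest-rank definition of a percentile, which is only one of several conventions, so other tools may give slightly different answers on small samples. The function names are chosen for this example.

```python
import math
import statistics


def percentile(data, p):
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    if not data:
        raise ValueError("data must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(data)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def summarize(data):
    """Return a few common summary statistics for a sample."""
    return {
        "n": len(data),
        "min": min(data),
        "max": max(data),
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "stdev": statistics.stdev(data),  # sample standard deviation
    }
```

For the sample `[15, 20, 35, 40, 50]`, the 40th percentile under this definition is 20 and the 50th is 35, which matches the median.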

Section Study Resources

Programming Collective Intelligence: Chapter 8 (Estimating Probability Density)
Pattern Recognition and Machine Learning: Chapter 2
Probability, Statistics, and Bayes’ Theorem at BetterExplained

Visualization

Objectives

Determine the most effective visualization for a given problem
Analyze a data visualization and interpret its meaning

Section Study Resources

Optimization

Objectives

Understand optimization methods
Identify 1st order and 2nd order optimization techniques
Determine the learning rate for a particular algorithm
Determine the sources of errors in a model
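
The difference between first-order and second-order methods, and the effect of the learning rate, shows up even on a one-dimensional toy problem. The sketch below minimizes f(x) = (x - 3)^2 with gradient descent (first order, uses only the gradient) and with Newton's method (second order, also uses the second derivative). The function names and the toy objective are assumptions made for this example.

```python
def gradient_descent(grad, x0, learning_rate, steps):
    """First-order method: step against the gradient, scaled by the learning rate."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x


def newton(grad, hess, x0, steps):
    """Second-order method: scale each step by the inverse second derivative."""
    x = x0
    for _ in range(steps):
        x = x - grad(x) / hess(x)
    return x


# Toy objective f(x) = (x - 3)^2, with its first and second derivatives.
def grad_f(x):
    return 2 * (x - 3)


def hess_f(x):
    return 2.0
```

With a learning rate of 0.1, `gradient_descent(grad_f, 0.0, 0.1, 100)` approaches the minimum at x = 3, shrinking the error by a factor of 0.8 per step. With a learning rate of 1.1 each step overshoots and the iterates diverge. `newton(grad_f, hess_f, 0.0, 1)` reaches x = 3 in a single step because the objective is exactly quadratic.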

Section Study Resources

Leon Bottou on Stochastic Learning from Advanced Lectures on Machine Learning
Leon Bottou on Online Algorithms and Stochastic Approximations
Programming Collective Intelligence: Chapter 5
Data-Intensive Text Processing with MapReduce: Chapter 6

http://cloudera.com/content/cloudera/en/training/certification/ccp-ds/essentials/prep.html
