Data Science Essentials Exam (DS-200) Preparation
Online Data Science Resources
Books
- Hadoop: The Definitive Guide 3e by Tom White (Chapters 4, 7, 12, 15, 16)
- Hadoop In Practice by Alex Holmes (Chapters 2, 3, 8, 9, 10)
- Programming Collective Intelligence by Toby Segaran
- Algorithms of the Intelligent Web by Haralambos Marmanis and Dmitry Babenko
- Mahout In Action by Sean Owen, et al.
- Data-Intensive Text Processing with MapReduce by Jimmy Lin, et al. (PDF download) (Chapter 6)
- Beautiful Data by Toby Segaran, Jeff Hammerbacher (Chapter 5)
- Hadoop In Action by Chuck Lam (Chapter 12 – Case Studies)
- Introduction to Data Science online textbook (PDF download or interactive .epub)
- Pattern Recognition and Machine Learning
- A Programmer's Guide to Data Mining (Free PDF download)
Blogs/misc.
Exam Sections
These are the current DS-200 Data Science Essentials beta exam sections:
- Data Acquisition
- Data Evaluation
- Data Transformation
- Machine Learning Basics
- Clustering
- Classification
- Collaborative Filtering
- Model/Feature Selection
- Probability
- Visualization
- Optimization
Data Acquisition
Objectives
- Access and load data from a variety of sources into a Hadoop cluster, including from databases and systems such as OLTP and OLAP as well as log files and documents.
- Deploy a variety of techniques for acquiring data, including database integration and working with APIs
- Use command line tools such as wget and curl
- Use Hadoop tools such as Sqoop and Flume
Section Study Resources
Data Evaluation
Objectives
- Knowledge of the file types commonly used for input and output and the advantages and disadvantages of each
- Methods for working with various file formats including binary files, JSON, XML, and .csv
- Tools, techniques, and utilities for evaluating data from the command line and at scale
- An understanding of sampling and filtering techniques
- A familiarity with Hadoop SequenceFiles and serialization using Avro
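
As a rough illustration of the sampling and filtering objective above, here is a minimal Python sketch of reservoir sampling, one common way to draw a fixed-size uniform sample from a stream whose length is not known in advance. The function names and the blank-line filter are assumptions made for this example, not part of the exam material.

```python
import random


def keep_non_empty(records):
    """Filter step: drop records that are empty or only whitespace."""
    return (r for r in records if r.strip())


def reservoir_sample(records, k, rng=None):
    """Return up to k records chosen uniformly at random from an iterable."""
    rng = rng or random.Random()
    sample = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = record
    return sample
```

A typical use would be something like `reservoir_sample(keep_non_empty(open("access.log")), 1000)`, where the file name is hypothetical. Only k records are held in memory at any time, which is why the approach suits large log files.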
Section Study Resources
Data Transformation
Objectives
- Write a map-only Hadoop Streaming job
- Write a script that receives records on stdin and writes them to stdout
- Invoke Unix tools to convert file formats
- Join data sets
- Write scripts to anonymize data sets
- Write a Mapper using Python and invoke via Hadoop streaming
- Write a custom subclass of FileOutputFormat
- Write records into a new format such as AvroOutputFormat or SequenceFileOutputFormat
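
Several of the objectives above (a stdin-to-stdout script, a Python mapper, anonymizing a data set) can be practiced with one small script. The sketch below assumes tab-separated records whose first field is a user identifier; the field position, the salt value, and the choice of SHA-256 are all assumptions made for illustration.

```python
#!/usr/bin/env python
import hashlib
import sys

SALT = "example-salt"  # hypothetical value; a real salt should be kept secret
ID_FIELD = 0           # assumption: the identifier is the first tab-separated field


def anonymize(line, salt=SALT):
    """Replace the identifier field with a salted SHA-256 digest."""
    fields = line.rstrip("\n").split("\t")
    digest = hashlib.sha256((salt + fields[ID_FIELD]).encode("utf-8")).hexdigest()
    fields[ID_FIELD] = digest
    return "\t".join(fields)


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read records from stdin, skip blank lines, write anonymized records to stdout."""
    for line in stdin:
        if line.strip():
            stdout.write(anonymize(line) + "\n")


if __name__ == "__main__":
    main()
```

The script can be tried locally with a Unix pipe (for example `cat input.tsv | python mapper.py`, with hypothetical file names) before being passed to Hadoop Streaming as the `-mapper`. A streaming job becomes map-only when the number of reduce tasks is set to zero, for instance with `-numReduceTasks 0`; the location of the streaming jar depends on the installation.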
Section Study Resources
Machine Learning Basics
Objectives
- Understand how to use Mappers and Reducers to create predictive models
- Understand the different kinds of machine learning, including supervised and unsupervised learning
- Recognize appropriate uses of the following: parametric/non-parametric algorithms, support vector machines, kernels, neural networks, clustering, dimensionality reduction, and recommender systems
Section Study Resources
- Apache Mahout. Check out the Mahout wiki
- Cloudera’s blog category on Mahout
- Hadoop In Practice: Chapter 9
- Hadoop: The Definitive Guide, 3rd Edition: Chapter 16 – Case Studies
- Algorithms of the Intelligent Web: Chapter 7 – (Use Cases)
- A Programmer's Guide to Data Mining
Clustering
Objectives
- Define clustering and identify appropriate use cases
- Identify appropriate uses of various models including centroid, distribution, density, group, and graph
- Describe the value and use of similarity metrics including Pearson correlation, Euclidean distance, and block distance
- Identify the algorithms applicable to each model (k-means, SVD/PCA, etc.)
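
The three similarity metrics named above can be written in a few lines of plain Python. This is a minimal sketch for equal-length numeric vectors; "block distance" is taken here to mean Manhattan (city-block) distance, and returning 0.0 for a constant vector in the Pearson function is a convention chosen for the example, since the correlation is undefined in that case.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def block_distance(a, b):
    """Manhattan (city-block) distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def pearson_correlation(a, b):
    """Pearson correlation coefficient, in the range [-1, 1]."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # convention for this sketch; undefined for constant vectors
    return cov / math.sqrt(var_a * var_b)
```

Note the difference in interpretation: the two distances shrink toward 0 as vectors become more alike, while Pearson correlation grows toward 1 and ignores differences in scale, so two users who rate consistently higher or lower than each other can still correlate strongly.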
Section Study Resources
- Programming Collective Intelligence: Chapter 3
- Algorithms of the Intelligent Web: Chapter 4
- Mahout In Action: Part 2
Classification
Objectives
- Describe the steps for training a set of data in order to identify new data based on known data
- Identify the use cases for logistic regression and Bayes' theorem
- Define classification techniques and formulas
Section Study Resources
- Programming Collective Intelligence: Chapters 6, 7, 8, 9, 12
- Algorithms of the Intelligent Web: Chapters 5, 6
- Mahout In Action: Part 3
Collaborative Filtering
Objectives
- Identify the use of user-based and item-based collaborative filtering techniques
- Describe the limitations and strengths of collaborative filtering techniques
- Given a scenario, determine the appropriate collaborative filtering implementation
- Given a scenario, determine the metrics one should use to evaluate the accuracy of a recommender system
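
For the last objective, two metrics commonly used to score predicted ratings against held-out actual ratings are mean absolute error (MAE) and root mean squared error (RMSE). The sketch below is one plain-Python way to compute them; whether the exam expects these particular metrics is an assumption.

```python
import math


def mean_absolute_error(predicted, actual):
    """Average size of the rating errors, ignoring their sign."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def root_mean_squared_error(predicted, actual):
    """Like MAE, but squaring penalizes large individual errors more heavily."""
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))
```

Both are computed on ratings withheld from training. RMSE is never smaller than MAE on the same data, and a wide gap between the two suggests a few predictions are far off.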
Section Study Resources
Model/Feature Selection
Objectives
- Describe the role and function of feature selection
- Analyze a scenario and determine the appropriate features and attributes to select
- Analyze a scenario and determine the methods to deploy for optimal feature selection
Section Study Resources
- Programming Collective Intelligence: Chapter 10
- Pattern Recognition and Machine Learning: Chapter 1.3
Probability
Objectives
- Analyze a scenario and determine the likelihood of a particular outcome
- Determine sample percentiles
- Determine a range of items based on a sample probability density function
- Summarize a distribution of sample numbers
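
The percentile and summary objectives can be practiced with Python's standard `statistics` module. Percentiles have several competing definitions; the sketch below uses linear interpolation between the closest ranks, which is an assumption and may differ from the convention a given tool or exam question uses.

```python
import statistics


def percentile(values, p):
    """p-th percentile (0-100) by linear interpolation between closest ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile() requires at least one value")
    rank = (p / 100) * (len(ordered) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def summarize(values):
    """Basic summary of a sample: center, spread, and quartiles."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation (n - 1)
        "p25": percentile(values, 25),
        "p75": percentile(values, 75),
    }
```

With this definition the 50th percentile equals the median, which is a quick sanity check on any implementation.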
Section Study Resources
- Programming Collective Intelligence: Chapter 8 (Estimating Probability Density)
- Pattern Recognition and Machine Learning: Chapter 2
- Probability, Statistics, and Bayes' Theorem at BetterExplained
Visualization
Objectives
- Determine the most effective visualization for a given problem
- Analyze a data visualization and interpret its meaning
Section Study Resources
Optimization
Objectives
- Understand optimization methods
- Identify 1st order and 2nd order optimization techniques
- Determine the learning rate for a particular algorithm
- Determine the sources of errors in a model
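
To make the first-order versus second-order distinction concrete, the sketch below minimizes the toy function f(x) = (x - 3)^2 + 1 in two ways: gradient descent, which uses only the first derivative and needs a learning rate, and Newton's method, which also uses the second derivative. The function and all names are chosen for illustration only.

```python
def gradient(x):
    """First derivative of f(x) = (x - 3)^2 + 1."""
    return 2.0 * (x - 3.0)


def second_derivative(x):
    """Second derivative of f; constant because f is quadratic."""
    return 2.0


def gradient_descent(x0, learning_rate, steps):
    """First-order method: step against the gradient, scaled by the learning rate."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x


def newton(x0, steps):
    """Second-order method: scale each step by the inverse of the curvature."""
    x = x0
    for _ in range(steps):
        x -= gradient(x) / second_derivative(x)
    return x
```

On this function, `gradient_descent(0.0, 0.1, 100)` approaches the minimum at x = 3 gradually, while a learning rate above 1.0 makes the iterates diverge, which shows why the learning rate matters. `newton(0.0, 1)` lands on the minimum in a single step because the function is exactly quadratic; on general functions Newton's method is not guaranteed to behave this well.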
Section Study Resources
http://cloudera.com/content/cloudera/en/training/certification/ccp-ds/essentials/prep.html