BigSnarf blog

Infosec FTW

Cloudera Data Science Essentials Training




Data Science Essentials Exam (DS-200) Preparation


Online Data Science Resources



Exam Sections

These are the current sections of the DS-200 Data Science Essentials beta exam:

  1. Data Acquisition
  2. Data Evaluation
  3. Data Transformation
  4. Machine Learning Basics
  5. Clustering
  6. Classification
  7. Collaborative Filtering
  8. Model/Feature Selection
  9. Probability
  10. Visualization
  11. Optimization

Data Acquisition


  • Access and load data from a variety of sources into a Hadoop cluster, including from databases and systems such as OLTP and OLAP as well as log files and documents.
  • Deploy a variety of techniques for acquiring data, including database integration and working with APIs
  • Use command line tools such as wget and curl
  • Use Hadoop tools such as Sqoop and Flume

Section Study Resources

Data Evaluation


  • Knowledge of the file types commonly used for input and output and the advantages and disadvantages of each
  • Methods for working with various file formats including binary files, JSON, XML, and .csv
  • Tools, techniques, and utilities for evaluating data from the command line and at scale
  • An understanding of sampling and filtering techniques
  • A familiarity with Hadoop SequenceFiles and serialization using Avro
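
One sampling technique that fits the "at scale" objective above is reservoir sampling, which draws a fixed-size uniform sample from a stream of unknown length in a single pass. The following is a minimal sketch and not part of the exam material; the function name `reservoir_sample` and its parameters are illustrative assumptions.

```python
import random


def reservoir_sample(records, k, seed=None):
    """Return k records chosen uniformly at random from an iterable.

    Only k records are held in memory at once, so the input can be
    a large file or stream that is read a single time.
    """
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = record
    return sample
```

For example, `reservoir_sample(open("access.log"), 1000)` would keep 1000 lines of a hypothetical log file without loading the whole file.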

Section Study Resources

Data Transformation


  • Write a map-only Hadoop Streaming job
  • Write a script that receives records on stdin and writes them to stdout
  • Invoke Unix tools to convert file formats
  • Join data sets
  • Write scripts to anonymize data sets
  • Write a Mapper using Python and invoke via Hadoop streaming
  • Write a custom subclass of FileOutputFormat
  • Write records into a new format such as AvroOutputFormat or SequenceFileOutputFormat
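
Several of the objectives above (a Python mapper, stdin to stdout, anonymizing a data set) can be combined in one small script. This is a sketch under stated assumptions: the input is tab-separated, the identifier to hide sits in the first column, and `SALT`, `ID_COLUMN`, `anonymize`, and `main` are hypothetical names chosen for illustration.

```python
import hashlib
import sys

SALT = "change-me"  # hypothetical salt; keep the real one secret
ID_COLUMN = 0       # assumption: the identifier is the first field


def anonymize(line, salt=SALT, column=ID_COLUMN):
    """Replace one tab-separated field with a truncated salted SHA-256 hash."""
    fields = line.rstrip("\n").split("\t")
    if column < len(fields):
        digest = hashlib.sha256((salt + fields[column]).encode("utf-8")).hexdigest()
        fields[column] = digest[:16]
    return "\t".join(fields)


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Mapper loop: read records on stdin, write transformed records to stdout."""
    for line in stdin:
        stdout.write(anonymize(line) + "\n")
```

To use this as a Hadoop Streaming mapper, the script would end with a call to `main()` and be passed to the streaming jar with the `-mapper` option. Setting the number of reducers to zero (for example with `-D mapreduce.job.reduces=0`) makes the job map-only, so the mapper output is written directly to the output directory.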

Section Study Resources

Machine Learning Basics


  • Understand how to use Mappers and Reducers to create predictive models
  • Understand the different kinds of machine learning, including supervised and unsupervised learning
  • Recognize appropriate uses of the following: parametric/non-parametric algorithms, support vector machines, kernels, neural networks, clustering, dimensionality reduction, and recommender systems

Section Study Resources

  • Apache Mahout. Check out the Mahout wiki
  • Cloudera’s blog category on Mahout
  • Hadoop In Practice: Chapter 9
  • Hadoop: The Definitive Guide, 3rd Edition: Chapter 16 – Case Studies
  • Algorithms of the Intelligent Web: Chapter 7 – (Use Cases)
  • A Programmer's Guide to Data Mining

Clustering


  • Define clustering and identify appropriate use cases
  • Identify appropriate uses of various models including centroid, distribution, density, group, and graph
  • Describe the value and use of similarity metrics including Pearson correlation, Euclidean distance, and block distance
  • Identify the algorithms applicable to each model (k-means, SVD/PCA, etc.)
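
The three similarity metrics named above are short enough to write out by hand, which helps when comparing how they behave. This is a minimal sketch in plain Python; "block distance" is taken here to mean city block (Manhattan) distance, and the function names are illustrative.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def block_distance(a, b):
    """City block (Manhattan) distance: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def pearson(a, b):
    """Pearson correlation in [-1, 1]; returns 0.0 if either vector is constant."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)
```

Note the difference in kind: the two distances shrink toward 0 as items become more alike, while Pearson correlation grows toward 1 and ignores differences in scale between the two vectors.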

Section Study Resources

  • Programming Collective Intelligence: Chapter 3
  • Algorithms of the Intelligent Web: Chapter 4
  • Mahout in Action: Part 2

Classification


  • Describe the steps for training a model on known (labeled) data in order to classify new data
  • Identify the use cases for logistic regression and Bayes' theorem
  • Define classification techniques and formulas
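
As a worked instance of Bayes' theorem in a classification setting, the sketch below computes the probability that a message is spam given that it contains a particular word. The function name and the numbers are made up for illustration.

```python
def posterior(prior, likelihood, false_positive_rate):
    """Bayes' theorem for a binary class.

    prior               -- P(class)
    likelihood          -- P(evidence | class)
    false_positive_rate -- P(evidence | not class)
    Returns P(class | evidence).
    """
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers: 1% of messages are spam, the word appears in
# 90% of spam and in 5% of non-spam messages.
p_spam_given_word = posterior(prior=0.01, likelihood=0.9, false_positive_rate=0.05)
```

With these numbers the posterior is about 0.154: even a strong signal leaves the message more likely to be legitimate than spam when the prior is low.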

Section Study Resources

  • Programming Collective Intelligence: Chapters 6, 7, 8, 9, 12
  • Algorithms of the Intelligent Web: Chapters 5, 6
  • Mahout in Action: Part 3

Collaborative Filtering


  • Identify the use of user-based and item-based collaborative filtering techniques
  • Describe the limitations and strengths of collaborative filtering techniques
  • Given a scenario, determine the appropriate collaborative filtering implementation
  • Given a scenario, determine the metrics one should use to evaluate the accuracy of a recommender system
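
Two common accuracy metrics for a recommender that predicts ratings are root mean squared error (RMSE) and mean absolute error (MAE), both computed on held-out ratings. This is a minimal sketch with illustrative function names; other metrics, such as precision and recall on top-N lists, may suit a given scenario better.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error; penalizes large misses more heavily."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def mae(predicted, actual):
    """Mean absolute error; every unit of error counts equally."""
    absolute = [abs(p - a) for p, a in zip(predicted, actual)]
    return sum(absolute) / len(absolute)
```

Both functions assume the two lists are aligned, so that `predicted[i]` and `actual[i]` refer to the same user and item pair.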

Section Study Resources

Model/Feature Selection


  • Describe the role and function of feature selection
  • Analyze a scenario and determine the appropriate features and attributes to select
  • Analyze a scenario and determine the methods to deploy for optimal feature selection
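
One simple family of feature selection methods is the filter approach, which scores each feature on its own and keeps the highest scorers. The sketch below ranks features by the absolute value of their Pearson correlation with the target. It is only one of several possible methods, and the names `abs_correlation` and `rank_features` are illustrative.

```python
import math


def abs_correlation(feature, target):
    """Absolute Pearson correlation; 0.0 if either series is constant."""
    n = len(target)
    mean_f = sum(feature) / n
    mean_t = sum(target) / n
    cov = sum((f - mean_f) * (t - mean_t) for f, t in zip(feature, target))
    var_f = sum((f - mean_f) ** 2 for f in feature)
    var_t = sum((t - mean_t) ** 2 for t in target)
    if var_f == 0 or var_t == 0:
        return 0.0
    return abs(cov) / math.sqrt(var_f * var_t)


def rank_features(features, target):
    """Return feature names ordered from most to least correlated with target."""
    scores = {name: abs_correlation(values, target)
              for name, values in features.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

A filter like this is cheap, but it only sees linear relationships and ignores interactions between features, so it is a starting point rather than a complete method.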

Section Study Resources

  • Programming Collective Intelligence: Chapter 10
  • Pattern Recognition and Machine Learning: Section 1.3

Probability


  • Analyze a scenario and determine the likelihood of a particular outcome
  • Determine sample percentiles
  • Determine a range of items based on a sample probability density function
  • Summarize a distribution of sample numbers
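
Percentiles and summary statistics for a sample can be computed with a few lines of Python. This is a sketch, assuming linear interpolation between the two closest ranks for percentiles (other conventions exist and give slightly different answers); the function names are illustrative.

```python
import math
import statistics


def percentile(values, p):
    """Return the p-th percentile (0-100) using linear interpolation."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile() needs at least one value")
    rank = (p / 100.0) * (len(ordered) - 1)
    lower = math.floor(rank)
    upper = math.ceil(rank)
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


def summarize(values):
    """Summarize a sample: center, spread, and quartiles."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "q1": percentile(values, 25),
        "q3": percentile(values, 75),
    }
```

Calling `summarize` on a list of at least two numbers returns a dictionary that covers the usual first questions about a distribution: where it is centered and how widely it is spread.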

Section Study Resources

  • Programming Collective Intelligence: Chapter 8 (Estimating Probability Density)
  • Pattern Recognition and Machine Learning: Chapter 2
  • Probability, statistics, and Bayes' theorem at BetterExplained

Visualization


  • Determine the most effective visualization for a given problem
  • Analyze a data visualization and interpret its meaning

Section Study Resources

Optimization


  • Understand optimization methods
  • Identify 1st order and 2nd order optimization techniques
  • Determine the learning rate for a particular algorithm
  • Determine the sources of errors in a model
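
The difference between 1st order and 2nd order techniques, and the role of the learning rate, can be seen on a one-dimensional example. The sketch below minimizes the made-up function f(x) = (x - 3)^2 + 1 with gradient descent (1st order, uses only the gradient) and Newton's method (2nd order, also uses the second derivative). All names are illustrative.

```python
def gradient_descent(grad, x0, learning_rate, steps):
    """1st order: step against the gradient, scaled by the learning rate."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x


def newton(grad, hess, x0, steps):
    """2nd order: scale the step by the inverse of the second derivative."""
    x = x0
    for _ in range(steps):
        x = x - grad(x) / hess(x)
    return x


def grad_f(x):
    """First derivative of f(x) = (x - 3)^2 + 1."""
    return 2.0 * (x - 3.0)


def hess_f(x):
    """Second derivative of f, constant for a quadratic."""
    return 2.0


slow_but_steady = gradient_descent(grad_f, x0=0.0, learning_rate=0.1, steps=100)
diverging = gradient_descent(grad_f, x0=0.0, learning_rate=1.1, steps=50)
one_step = newton(grad_f, hess_f, x0=0.0, steps=1)
```

On this function a learning rate of 0.1 converges toward the minimum at x = 3, a learning rate of 1.1 overshoots further on every step and diverges, and Newton's method lands on the minimum in a single step because the function is exactly quadratic.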

Section Study Resources

