BigSnarf blog

Infosec FTW

Cloudera Data Science Essentials Training

Photo http://wikibon.org/blog/data-visualization/

 

Data Science Essentials Exam (DS-200) Preparation

 

Online Data Science Resources


Books


Blogs/misc.


Exam Sections

These are the current DS-200 Data Science Essentials beta exam sections:

  1. Data Acquisition
  2. Data Evaluation
  3. Data Transformation
  4. Machine Learning Basics
  5. Clustering
  6. Classification
  7. Collaborative Filtering
  8. Model/Feature Selection
  9. Probability
  10. Visualization
  11. Optimization

Data Acquisition

Objectives

  • Access and load data from a variety of sources into a Hadoop cluster, including from databases and systems such as OLTP and OLAP as well as log files and documents.
  • Deploy a variety of techniques for acquiring data, including database integration and working with APIs
  • Use command line tools such as wget and curl
  • Use Hadoop tools such as Sqoop and Flume

Section Study Resources


Data Evaluation

Objectives

  • Knowledge of the file types commonly used for input and output and the advantages and disadvantages of each
  • Methods for working with various file formats including binary files, JSON, XML, and .csv
  • Tools, techniques, and utilities for evaluating data from the command line and at scale
  • An understanding of sampling and filtering techniques
  • A familiarity with Hadoop SequenceFiles and serialization using Avro
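
The sampling objective above can be illustrated with reservoir sampling, which draws a fixed-size uniform sample from a stream whose length is not known in advance (for example, lines piped from a large log file). This is a minimal sketch in Python and is not taken from the exam material; the function name `reservoir_sample` and its parameters are hypothetical.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(records: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k records chosen uniformly at random from a stream."""
    rng = random.Random(seed)  # a seed makes the sample reproducible
    reservoir: List[T] = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Keep the new record with probability k / (i + 1),
            # overwriting a uniformly chosen slot.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = record
    return reservoir


if __name__ == "__main__":
    # Hypothetical usage: sample 5 values from a stream of 1000.
    print(reservoir_sample(range(1000), 5, seed=42))
```

Because the function only iterates once and holds k records in memory, it can be fed from `sys.stdin` when evaluating data from the command line.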

Section Study Resources


Data Transformation

Objectives

  • Write a map-only Hadoop Streaming job
  • Write a script that receives records on stdin and writes them to stdout
  • Invoke Unix tools to convert file formats
  • Join data sets
  • Write scripts to anonymize data sets
  • Write a Mapper using Python and invoke via Hadoop streaming
  • Write a custom subclass of FileOutputFormat
  • Write records into a new format such as AvroOutputFormat or SequenceFileOutputFormat
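
Several of the objectives above (a script that reads stdin and writes stdout, a Python Mapper, anonymizing a data set) can be combined in one small example. The sketch below assumes tab-delimited records with an identifier in the first column; the column index, the salt value, and the 16-character truncation are all hypothetical choices. Note that a salted hash is pseudonymization rather than full anonymization.

```python
#!/usr/bin/env python
import hashlib
import sys

SALT = "change-me"  # hypothetical salt; keep the real one secret
ID_COLUMN = 0       # assumption: the identifier is in the first column


def anonymize_line(line, column=ID_COLUMN, salt=SALT, delimiter="\t"):
    """Replace one field of a delimited record with a salted SHA-256 digest."""
    fields = line.rstrip("\n").split(delimiter)
    if column < len(fields):
        digest = hashlib.sha256((salt + fields[column]).encode("utf-8")).hexdigest()
        fields[column] = digest[:16]  # shortened for readability
    return delimiter.join(fields)


def run(stdin, stdout):
    """Mapper loop: one record in on stdin, one record out on stdout."""
    for line in stdin:
        stdout.write(anonymize_line(line) + "\n")


if __name__ == "__main__":
    run(sys.stdin, sys.stdout)
```

The script can be checked locally with `cat input.tsv | python mapper.py` before it is submitted through Hadoop Streaming. A map-only job is one that runs with zero reducers (for example by passing `-numReduceTasks 0`); the streaming jar location and the input and output paths depend on the cluster.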

Section Study Resources


Machine Learning Basics

Objectives

  • Understand how to use Mappers and Reducers to create predictive models
  • Understand the different kinds of machine learning, including supervised and unsupervised learning
  • Recognize appropriate uses of the following: parametric/non-parametric algorithms, support vector machines, kernels, neural networks, clustering, dimensionality reduction, and recommender systems

Section Study Resources

  • Apache Mahout. Check out the Mahout wiki
  • Cloudera’s blog category on Mahout
  • Hadoop In Practice: Chapter 9
  • Hadoop: The Definitive Guide, 3rd Edition: Chapter 16 – Case Studies
  • Algorithms of the Intelligent Web: Chapter 7 – (Use Cases)
  • A Programmer's Guide to Data Mining

Clustering

Objectives

  • Define clustering and identify appropriate use cases
  • Identify appropriate uses of various models including centroid, distribution, density, group, and graph
  • Describe the value and use of similarity metrics including Pearson correlation, Euclidean distance, and city block (Manhattan) distance
  • Identify the algorithms applicable to each model (k-means, SVD/PCA, etc.)
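
The three similarity metrics named in the objectives can be written in a few lines of plain Python. This is an illustrative sketch, not exam material; the function names are hypothetical, "block distance" is read here as the city block (Manhattan) distance, and returning 0.0 for a constant vector in `pearson_correlation` is a convention chosen for this example (the correlation is undefined in that case).

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def block_distance(a, b):
    """City block (Manhattan) distance: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def pearson_correlation(a, b):
    """Pearson correlation coefficient, in the range [-1, 1]."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # undefined for a constant vector; 0.0 by convention here
    return cov / math.sqrt(var_a * var_b)
```

The two distances grow as points move apart, while Pearson correlation measures how closely two vectors move together regardless of scale, so two users who rate on different scales but agree on ordering still score as similar.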

Section Study Resources

  • Programming Collective Intelligence: Chapter 3
  • Algorithms of the Intelligent Web: Chapter 4
  • Mahout In Action: Part 2

Classification

Objectives

  • Describe the steps for training a model on a set of data in order to classify new data based on known data
  • Identify the use cases for logistic regression and Bayes' theorem
  • Define classification techniques and formulas
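
As a worked illustration of Bayes' theorem in a classification setting, the sketch below computes the probability that a message is spam given that it contains a keyword. The function name and all of the numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def bayes_posterior(prior, likelihood, false_positive_rate):
    """Return P(H | E) given P(H), P(E | H), and P(E | not H)."""
    # Total probability of seeing the evidence at all.
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    if evidence == 0:
        raise ValueError("evidence has zero probability")
    return likelihood * prior / evidence


# Hypothetical numbers: 1% of messages are spam; the keyword appears
# in 90% of spam and in 5% of non-spam messages.
posterior = bayes_posterior(prior=0.01, likelihood=0.90, false_positive_rate=0.05)
print(round(posterior, 3))  # prints 0.154
```

Even with a keyword that is strongly associated with spam, the low prior keeps the posterior near 15%, which is why a naive Bayes classifier combines many features rather than relying on one.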

Section Study Resources

  • Programming Collective Intelligence: Chapters 6, 7, 8, 9, 12
  • Algorithms of the Intelligent Web: Chapters 5, 6
  • Mahout In Action: Part 3

Collaborative Filtering

Objectives

  • Identify the use of user-based and item-based collaborative filtering techniques
  • Describe the limitations and strengths of collaborative filtering techniques
  • Given a scenario, determine the appropriate collaborative filtering implementation
  • Given a scenario, determine the metrics one should use to evaluate the accuracy of a recommender system
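
Two commonly used accuracy metrics for rating predictions are root mean squared error (RMSE) and mean absolute error (MAE). The sketch below is one possible way to compute them over held-out ratings; the function names are hypothetical, and other metrics such as precision and recall may be more appropriate when the recommender produces ranked lists rather than ratings.

```python
import math


def mae(predicted, actual):
    """Mean absolute error between predicted and actual ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def rmse(predicted, actual):
    """Root mean squared error; penalizes large misses more than MAE."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))
```

For predicted ratings `[1, 2, 3]` against actual ratings `[2, 2, 5]`, the MAE is 1.0 while the RMSE is about 1.29, because the single two-point miss is squared before averaging.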

Section Study Resources


Model/Feature Selection

Objectives

  • Describe the role and function of feature selection
  • Analyze a scenario and determine the appropriate features and attributes to select
  • Analyze a scenario and determine the methods to deploy for optimal feature selection

Section Study Resources

  • Programming Collective Intelligence: Chapter 10
  • Pattern Recognition and Machine Learning: Chapter 1.3

Probability

Objectives

  • Analyze a scenario and determine the likelihood of a particular outcome
  • Determine sample percentiles
  • Determine a range of items based on a sample probability density function
  • Summarize a distribution of sample numbers
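
Sample percentiles and a summary of a distribution can be computed with the Python standard library. The sketch below uses linear interpolation between the closest ranks, which is one of several percentile definitions in use; the function names `percentile` and `summarize` are hypothetical.

```python
import math
import statistics


def percentile(sample, p):
    """Return the p-th percentile (0 <= p <= 100) by linear interpolation."""
    data = sorted(sample)
    if not data:
        raise ValueError("sample is empty")
    rank = (p / 100.0) * (len(data) - 1)  # fractional index into sorted data
    lower = int(math.floor(rank))
    upper = min(lower + 1, len(data) - 1)
    weight = rank - lower
    return data[lower] * (1 - weight) + data[upper] * weight


def summarize(sample):
    """Summarize a sample of at least two numbers."""
    return {
        "n": len(sample),
        "mean": statistics.mean(sample),
        "median": statistics.median(sample),
        "stdev": statistics.stdev(sample),  # sample standard deviation
        "p25": percentile(sample, 25),
        "p75": percentile(sample, 75),
    }
```

The difference between `p75` and `p25` is the interquartile range, a measure of spread that is less sensitive to outliers than the standard deviation.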

Section Study Resources

  • Programming Collective Intelligence: Chapter 8 (Estimating Probability Density)
  • Pattern Recognition and Machine Learning: Chapter 2
  • Probability, Statistics, Bayes' Theorem at BetterExplained

Visualization

Objectives

  • Determine the most effective visualization for a given problem
  • Analyze a data visualization and interpret its meaning

Section Study Resources


Optimization

Objectives

  • Understand optimization methods
  • Identify first-order and second-order optimization techniques
  • Determine the learning rate for a particular algorithm
  • Determine the sources of errors in a model
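
The difference between first-order and second-order methods, and the role of the learning rate, can be seen on a one-dimensional toy problem. The sketch below minimizes the hypothetical function f(x) = (x - 3)^2 with gradient descent (first order, uses only the derivative) and with Newton's method (second order, also uses the second derivative). The function names and the test function are assumptions made for illustration.

```python
def gradient_descent(grad, x0, learning_rate, steps):
    """First-order method: step against the gradient, scaled by the learning rate."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x


def newton(grad, hess, x0, steps):
    """Second-order method: scale the step by the inverse second derivative."""
    x = x0
    for _ in range(steps):
        x = x - grad(x) / hess(x)
    return x


# Toy objective f(x) = (x - 3)^2, with its minimum at x = 3.
def grad_f(x):
    return 2.0 * (x - 3.0)


def hess_f(x):
    return 2.0


if __name__ == "__main__":
    print(gradient_descent(grad_f, 0.0, 0.1, 100))  # approaches 3 gradually
    print(newton(grad_f, hess_f, 0.0, 1))           # reaches 3 in one step
    print(gradient_descent(grad_f, 0.0, 1.1, 50))   # learning rate too large: diverges
```

On this quadratic, each gradient descent step multiplies the error by (1 - 2 × learning rate), so any learning rate above 1.0 makes the iterates move away from the minimum, while Newton's method needs no learning rate because the curvature sets the step size.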

Section Study Resources

http://cloudera.com/content/cloudera/en/training/certification/ccp-ds/essentials/prep.html
