
Cloudera Data Scientist Training

This workshop covers enterprise data science and machine learning using Apache Spark in Cloudera Data Science Workbench (CDSW). Participants use Spark SQL to load, explore, cleanse, join, and analyze data and Spark MLlib to specify, train, evaluate, tune, and deploy machine learning pipelines. They dive into the foundations of the Spark architecture and execution model necessary to effectively configure, monitor, and tune their Spark applications. Participants also learn how Spark integrates with key components of the Cloudera platform such as HDFS, YARN, Hive, Impala, and Hue as well as their favorite Python or R packages.
 
The Apache Spark demonstrations and exercises are conducted in Python (with PySpark) and R (with sparklyr) using the Cloudera Data Science Workbench (CDSW) environment.

Course Contents

  • Data Science Overview
  • Cloudera Data Science Workbench (CDSW)
  • Case Study
  • Apache Spark
  • Summarizing and Grouping DataFrames
  • Window Functions
  • Exploring DataFrames
  • Apache Spark Job Execution
  • Processing Text and Training and Evaluating Topic Models
  • Training and Evaluating Recommender Models
  • Running a Spark Application from CDSW
  • Columns of a DataFrame
  • Inspecting a Spark SQL DataFrame
  • Transforming DataFrames
  • Monitoring, Tuning, and Configuring Spark Applications
  • Machine Learning Overview
  • Training and Evaluating Regression Models
  • Working with Machine Learning Pipelines
  • Deploying Machine Learning Pipelines
  • Transforming DataFrame Columns
  • Complex Types
  • User-Defined Functions
  • Reading and Writing Data
  • Combining and Splitting DataFrames
  • Training and Evaluating Classification Models
  • Tuning Algorithm Hyperparameters Using Grid Search
  • Training and Evaluating Clustering Models
  • Overview of sparklyr
  • Introduction to Additional CDSW Features

You will receive the original Cloudera course documentation in English as an e-book (PDF).

Target Group

The workshop is designed for data scientists who currently use Python or R to work with smaller datasets on a single machine and who need to scale up their analyses and machine learning models to large datasets on distributed clusters. Data engineers and developers with some knowledge of data science and machine learning may also find this workshop useful.

Knowledge Prerequisites

Participants should have a basic understanding of Python or R and some experience exploring and analyzing data and developing statistical or machine learning models. Knowledge of Hadoop or Spark is not required.

Classroom training

Do you prefer the classic training format? A course at one of our training centers, with an experienced trainer and direct exchange among all participants? Then book one of our classroom training dates!

Online training

Would you prefer to attend this course online? We also offer online dates for this course topic. To participate, you need a PC with Internet access (a data rate of at least 1 Mbps), a headset when working via VoIP, and optionally a camera.

Tailor-made courses

Do you need a special course for your team? In addition to our standard offering, we can also create customized courses that precisely meet your individual requirements. We will be glad to advise you and prepare an individual offer.
Request for customized courses
You can find the complete description of this course, with dates and prices, available for download as a PDF.
