Cloudera Data Analyst Training: Using Pig, Hive and Impala with Hadoop (CDAPHIH) – Outline

Detailed Course Outline

Introduction
Apache Hadoop Fundamentals
  • The Motivation for Hadoop
  • Hadoop Overview
  • Data Storage: HDFS
  • Distributed Data Processing: YARN, MapReduce, and Spark
  • Data Processing and Analysis: Pig, Hive, and Impala
  • Database Integration: Sqoop
  • Other Hadoop Data Tools
  • Exercise Scenarios
Introduction to Apache Pig
  • What is Pig?
  • Pig’s Features
  • Pig Use Cases
  • Interacting with Pig
Basic Data Analysis with Apache Pig
  • Pig Latin Syntax
  • Loading Data
  • Simple Data Types
  • Field Definitions
  • Data Output
  • Viewing the Schema
  • Filtering and Sorting Data
  • Commonly Used Functions
Processing Complex Data with Apache Pig
  • Storage Formats
  • Complex/Nested Data Types
  • Grouping
  • Built-In Functions for Complex Data
  • Iterating Grouped Data
Multi-Dataset Operations with Apache Pig
  • Techniques for Combining Datasets
  • Joining Datasets in Pig
  • Set Operations
  • Splitting Datasets
Apache Pig Troubleshooting and Optimization
  • Troubleshooting Pig
  • Logging
  • Using Hadoop’s Web UI
  • Data Sampling and Debugging
  • Performance Overview
  • Understanding the Execution Plan
  • Tips for Improving the Performance of Pig Jobs
Introduction to Apache Hive and Impala
  • What is Hive?
  • What is Impala?
  • Why Use Hive and Impala?
  • Schema and Data Storage
  • Comparing Hive and Impala to Traditional Databases
  • Use Cases
Querying with Apache Hive and Impala
  • Databases and Tables
  • Basic Hive and Impala Query Language Syntax
  • Data Types
  • Using Hue to Execute Queries
  • Using Beeline (Hive’s Shell)
  • Using the Impala Shell
Apache Hive and Impala Data Management
  • Data Storage
  • Creating Databases and Tables
  • Loading Data
  • Altering Databases and Tables
  • Simplifying Queries with Views
  • Storing Query Results
Data Storage and Performance
  • Partitioning Tables
  • Loading Data into Partitioned Tables
  • When to Use Partitioning
  • Choosing a File Format
  • Using Avro and Parquet File Formats
Relational Data Analysis with Apache Hive and Impala
  • Joining Datasets
  • Common Built-In Functions
  • Aggregation and Windowing
Complex Data with Apache Hive and Impala
  • Complex Data with Hive
  • Complex Data with Impala
Analyzing Text with Apache Hive and Impala
  • Using Regular Expressions with Hive and Impala
  • Processing Text Data with SerDes in Hive
  • Sentiment Analysis and n-grams in Hive
Apache Hive Optimization
  • Understanding Query Performance
  • Bucketing
  • Indexing Data
  • Hive on Spark
Apache Impala Optimization
  • How Impala Executes Queries
  • Improving Impala Performance
Extending Apache Hive and Impala
  • Custom SerDes and File Formats in Hive
  • Data Transformation with Custom Scripts in Hive
  • User-Defined Functions
  • Parameterized Queries
Choosing the Best Tool for the Job
  • Comparing Pig, Hive, Impala, and Relational Databases
  • Which to Choose?
Conclusion