About the Course

The Apache Hadoop & Big Data course teaches learners how to store Big Data in Hadoop HDFS and how to process and analyze it using Map-Reduce programming or other tools in the Hadoop ecosystem.


Learn from industry experts with live instructor-led training

Projects & Lab

Apply the skills you learn to solve real-world problems.


Highlight your new skills on your resume or LinkedIn.

1:1 Mentoring

Get guidance from industry leaders and professionals.

Best-in-class Support

24×7 support and forum access to answer all your queries throughout your learning journey.


Compatible with Hortonworks Certified Developer (HDPCD)



17 Sept 2018
Online Instructor Based Training
30 days
371 36,000


Mon to Fri (4 weeks)
10 AM - 12 PM
30 days
514 46,000

Mon to Fri (4 weeks)
10 AM - 12 PM
30 days
514 46,000

Learning Path


About the Course

This course is a part of the Specialization Course in Big Data with Hadoop.

What is Big Data

Big Data Opportunities and Challenges

Characteristics of Big Data

Hadoop Distributed File System

Comparing Hadoop & SQL

Industries using Hadoop

Data Locality

Hadoop Architecture

Map Reduce & HDFS

Using the Hadoop single node image (Clone)

HDFS Design & Concepts

Blocks, NameNodes and DataNodes

HDFS High-Availability and HDFS Federation

Hadoop DFS: The Command-Line Interface

Basic File System Operations

Anatomy of a File Read and a File Write

Block Placement Policy and Modes

Configuration files in detail

Metadata, FsImage, Edit Log, Secondary NameNode and Safe Mode

How to add a new DataNode and decommission a DataNode dynamically (without stopping the cluster)

FSCK Utility (Block report)

How to override the default configuration at system level and programming level

HDFS Federation

ZooKeeper Leader Election Algorithm

Exercise and small use case on HDFS

Map Reduce Functional Programming Basics

Map and Reduce Basics

How Map Reduce Works

Anatomy of a Map Reduce Job Run

Legacy Architecture -> Job Submission, Job Initialization, Task Assignment, Task Execution, Progress and Status Updates

Job Completion, Failures

Shuffling and Sorting

Splits, Record reader, Partition, Types of partitions & Combiner

Optimization Techniques -> Speculative Execution, JVM Reuse and Number of Slots

Types of Schedulers and Counters

Comparisons between Old and New API at code and Architecture Level

Getting the data from RDBMS into HDFS using Custom data types

Distributed Cache and Hadoop Streaming (Python, Ruby and R)


Sequence Files and Map Files

Enabling Compression Codecs

Map-side Join with Distributed Cache

Types of I/O Formats: MultipleOutputs, NLineInputFormat

Handling small files using CombineFileInputFormat

Hands-on “Word Count” in Map Reduce in Standalone and Pseudo-Distributed Mode

Sorting files using Hadoop Configuration API discussion

Emulating “grep” for searching inside a file in Hadoop

DBInputFormat

Job Dependency API discussion

Input Format API discussion, Split API discussion

Custom Data type creation in Hadoop
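
As an illustration of the Word Count exercise and the map, shuffle/sort and reduce phases listed in this module, here is a minimal sketch in plain Python. It runs locally without a Hadoop cluster. The function names mapper, reducer and word_count are hypothetical, and the sorted() call only stands in for Hadoop's shuffle and sort step. It is not the course's lab code.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.strip().lower().split():
            yield word, 1


def reducer(sorted_pairs):
    """Reduce phase: sum the counts of each word.

    Expects pairs sorted by key, as the shuffle-and-sort step would deliver them.
    """
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def word_count(lines):
    """Run map, a local stand-in for shuffle and sort, then reduce."""
    mapped = mapper(lines)
    shuffled = sorted(mapped, key=itemgetter(0))  # assumption: replaces Hadoop's shuffle/sort
    return dict(reducer(shuffled))


if __name__ == "__main__":
    sample = ["Hadoop stores big data", "Spark and Hadoop process big data"]
    for word, count in sorted(word_count(sample).items()):
        print(f"{word}\t{count}")
```

In a Hadoop Streaming job the mapper and reducer would typically be separate scripts reading from standard input and writing tab-separated pairs to standard output; here they are plain functions so the flow can be followed in one file.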


CAP Theorem and Types of Consistency

Types of NoSQL Databases in detail

Columnar Databases in Detail (HBase and Cassandra)

TTL, Bloom Filters and Compaction

HBase Installation, Concepts

HBase Data Model and Comparison between RDBMS and NoSQL

Master & Region Servers

HBase Operations (DDL and DML) through Shell and Programming and HBase Architecture

Catalog Tables

Block Cache and sharding


Data Modeling (Sequential, Salted, Promoted and Random Keys)

Java APIs and REST Interface

Client-Side Buffering and processing 1 million records using Client-Side Buffering

HBase Counters

Enabling Replication and HBase RAW Scans

HBase Filters

Bulk Loading and Coprocessors (Endpoints and Observers with programs)

Real-world use case consisting of HDFS, MR and HBase

Hive Installation, Introduction and Architecture

Hive Services, Hive Shell, Hive Server and Hive Web Interface (HWI)

Metastore, HiveQL


Working with Tables

Primitive data types and complex data types

Working with Partitions

User Defined Functions

Hive Bucketed Tables and Sampling

External partitioned tables, Map the data to the partition in the table, Writing the output of one query to another table, Multiple inserts

Dynamic Partition

Differences between ORDER BY, DISTRIBUTE BY and SORT BY

Bucketing and Sorted Bucketing with Dynamic partition

RCFile



Compression on hive tables and Migrating Hive tables

Dynamic substitution in Hive and different ways of running Hive

How to enable Update in Hive

Log Analysis on Hive

Access HBase tables using Hive

Hands on Exercises

Pig Installation

Execution Types

Grunt Shell

Pig Latin

Data Processing

Schema on read

Primitive data types and complex data types

Tuple Schema, Bag Schema and Map Schema

Loading and Storing

Filtering, Grouping and Joining

Debugging commands (Illustrate and Explain)

Validations, Type casting in Pig

Working with Functions

User Defined Functions

Types of JOINs in Pig and Replicated Join in detail

SPLIT and Multiquery execution

Error Handling, FLATTEN and ORDER BY

Parameter Substitution

Nested FOREACH

User Defined Functions, Dynamic Invokers and Macros

How to access HBase using Pig; load and write JSON data using Pig Piggy Bank

Hands on Exercises

Sqoop Installation

Import Data (full table, subset only, target directory, protecting the password, file formats other than CSV, compression, controlling parallelism, importing all tables)

Incremental Import (importing only new data, last imported data, storing the password in the Metastore, sharing the Metastore between Sqoop clients)

Free Form Query Import

Export data to RDBMS, Hive and HBase

Hands on Exercises

HCatalog Installation

Introduction to HCatalog

HCatalog with Pig, Hive and MR

Hands on Exercises

Flume Installation

Introduction to Flume

Flume Agents: Sources, Channels and Sinks

Log user information from a Java program into HDFS using Log4j with Avro Source and Tail Source

Log user information from a Java program into HBase using Log4j with Avro Source and Tail Source

Flume Commands

Flume use case: stream data from Twitter into HDFS and HBase, then analyze it using Hive and Pig

Hortonworks and Cloudera

Workflow (Start, Action, End, Kill, Join and Fork), Schedulers, Coordinators and Bundles, showing how to schedule Sqoop, Hive, MR and Pig jobs

Real-world use case that finds the top websites used by users of certain ages, scheduled to run every hour

ZooKeeper

HBase Integration with Hive and Pig


Proof of concept (POC)

Spark Overview

Linking with Spark, Initializing Spark

Using the Shell

Resilient Distributed Datasets (RDDs)

Parallelized Collections

External Datasets

RDD Operations

Basics, Passing Functions to Spark

Working with Key-Value Pairs



RDD Persistence

Which Storage Level to Choose?

Removing Data

Shared Variables

Broadcast Variables


Deploying to a Cluster

Unit Testing

Migrating from pre-1.0 Versions of Spark

Where to Go from Here



Big Data Applications for the Healthcare Industry with Apache Sqoop and Apache Solr


1. The certificate we award is proof that you have taken a big leap in the Big Data domain.

2. Our Specialization is exhaustive, and the certificate we award is proof that you have taken a big leap in the Big Data domain.

3. Differentiate yourself: The knowledge you gain from working on projects, videos, quizzes, hands-on assessments and case studies gives you a competitive edge.

4. Share your achievement: Highlight your new skills on your resume, LinkedIn, Facebook and Twitter. Tell your friends and colleagues about it.
 Course Certificate Sample

Course Creators


Created by a team of industry and academic experts with 20+ years of R&D experience


3 reviews
(4.9 out of 5)


In online training, you will get

  • Access to live instructor-led training as per your enrolled batch
  • Sessions with industry experts over online meeting tools like Zoom
  • 24x7 support from the trainers.

In classroom training, you will get

  • Intensive one-to-one classroom training from experts with real-world experience, as per your enrolled batch
  • Sessions with industry experts who have 20+ years of experience in R&D.
  • 24x7 support from the trainers.

Top industry experts with 20+ years of R&D experience who mentor students across the world.

A soft copy of the course material will be mailed to you.

In online instructor-led training, a team of experts will train you alongside a group of other course learners for 25+ hours over online conferencing software such as Zoom or a webinar platform. Online classes take place every day from Monday to Friday.

At the end of the course, you will work on a real-time project. Once you complete the project and an expert has reviewed it, you will be awarded a certificate that you can share on LinkedIn.

Enrollment in the course includes 30 days of free access to the labs, depending on your date of enrollment. Access can be extended with permission.

Yes, you can renew your subscription anytime. Please choose your desired plan for the lab and make the payment to renew your subscription.

Email our director at director@vaidehisoftware.com

Have more questions? Please contact us at director@vaidehisoftware.com