The amount of data generated by enterprises every day is mind-boggling: roughly 2.5 exabytes (2.5 x 10^18 bytes) of data are created daily. With so much data, the obvious questions concern how to store, transfer, and analyze it. Big Data refers to data sets so large that traditional methods are inadequate to handle them, and there is a growing demand for individuals with Big Data skills in the industry.

 

Hadoop is an open-source framework for storing and processing Big Data, designed to scale from a single machine to thousands of machines. You will learn about the components of the Hadoop ecosystem and its execution environment. By the end of this course, you will be comfortable with the techniques involved in data analysis and the technical jargon of Big Data analytics.

 


What you'll learn

Why Hadoop?

This section will cover the basics of Big Data and the installation of Hadoop.
 
  • What is Big Data?
  • Processing Big Data in a non-Hadoop way
  • Disadvantages of the non-Hadoop way of processing Big Data
  • Advantages of using Hadoop to process Big Data
  • Installation of Hadoop in Standalone mode

MapReduce

You will learn to process and generate large data sets with a parallel, distributed algorithm on a cluster.
 
  • Mapper class
  • Reducer class
  • Project : Weather Sensor Data Analysis
  • Process Big data in Hadoop using MapReduce
  • Need for the Combiner class
  • Project : Distinct elements in a stream and the count-distinct problem, solved using MapReduce
  • ControlledJob and JobControl for multiple MapReduce stages
  • Creating and using Custom Writables in MapReduce jobs
  • SequenceFile input and output format
  • Project : Counting how many times distinct elements repeat in a stream, and the website user logs problem, solved using MapReduce
  • Project : Blog Post Data processing
  • Joining pattern in MapReduce
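
The pipeline these topics describe can be sketched without Hadoop at all. Below is a plain-Python simulation of the map → shuffle → reduce stages for a word count; the function names are ours, not classes from the Hadoop MapReduce API, so treat this as a conceptual sketch only:

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Shuffle/sort phase: group all values by key, as Hadoop does
    # between the map and reduce stages.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Reduce phase: sum the counts for one word.
    return key, sum(values)

def run_job(lines):
    pairs = (pair for line in lines for pair in mapper(line))
    return dict(reducer(k, v) for k, v in shuffle(pairs).items())

counts = run_job(["the quick brown fox", "the lazy dog"])
# counts["the"] == 2
```

A Combiner (covered above) would run the same summing logic on each mapper's local output before the shuffle, cutting the data moved across the network.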

Hadoop Distributed File System (HDFS)

You will learn about HDFS, a scalable and reliable way of storing Big Data.
 
  • Need for a distributed file system
  • Difference between HDFS and other file systems
  • NameNodes and DataNodes
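
As a rough illustration of what a distributed file system does differently, here is a toy Python sketch of splitting a file into fixed-size blocks and replicating each block across DataNodes. The round-robin placement is a made-up simplification; the real NameNode uses a rack-aware placement policy:

```python
def split_into_blocks(data: bytes, block_size: int):
    # HDFS stores a file as fixed-size blocks (128 MB by default);
    # a tiny block size keeps this example readable.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, datanodes, replication=3):
    # Toy placement: round-robin each block's replicas over the nodes.
    placement = {}
    for b in range(num_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

blocks = split_into_blocks(b"0123456789", block_size=4)
plan = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
# Each of the 3 blocks lives on 3 different DataNodes, so losing
# one machine loses no data -- the reliability HDFS is built around.
```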

Yet Another Resource Negotiator (YARN)

YARN manages the allocation of resources in a clustered environment. This section covers its architecture and functions in more detail.
 
  • Basic YARN architecture
  • Resource Manager and Node Manager
  • MapReduce on YARN
  • Setting up Hadoop in Pseudo-Distributed Mode
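
What a resource negotiator does can be sketched with a simple first-fit policy in plain Python. This is a toy model only; real YARN schedulers (Capacity, Fair) are far more sophisticated:

```python
def allocate(requests, nodes):
    # Toy ResourceManager: grant each container request on the first
    # NodeManager with enough free memory (first-fit).
    free = dict(nodes)          # node name -> free memory in MB
    grants = {}
    for app, mem in requests:
        for node, avail in free.items():
            if avail >= mem:
                free[node] -= mem
                grants[app] = node
                break
        else:
            grants[app] = None  # no capacity: the request must wait
    return grants

grants = allocate([("app1", 2048), ("app2", 4096), ("app3", 2048)],
                  {"nm1": 4096, "nm2": 4096})
```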

Apache Pig

Learn the high-level platform that makes programming Hadoop much easier, and its language, Pig Latin, whose operations resemble SQL.
 
  • Need for Pig
  • Installation of Pig
  • Pig Latin and the Grunt shell
  • Pig Data types - Scalar and Complex types
  • Pig Operators
  • Pig Functions
  • Running Pig programs from .pig files
  • Running Pig in standalone and clustered (MapReduce) mode
  • Project : Matrix Multiplication using Pig on Clustered mode of Hadoop
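
To give a flavour of the Pig Latin data flow covered above, here is a typical GROUP / FOREACH pipeline expressed in plain Python. The Pig statements in the comments are illustrative sketches, not scripts from the course:

```python
from collections import defaultdict

# Input relation: (city, temperature) tuples.
readings = [("pune", 31), ("delhi", 40), ("pune", 29), ("delhi", 42)]

# Pig:  grouped = GROUP readings BY city;
grouped = defaultdict(list)
for city, temp in readings:
    grouped[city].append(temp)

# Pig:  result = FOREACH grouped GENERATE group, MAX(readings.temp);
result = {city: max(temps) for city, temps in grouped.items()}
```

Under the hood, Pig compiles exactly this kind of data flow into MapReduce jobs, which is why it runs unchanged in clustered (MapReduce) mode.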

 

Apache Hive

This section will cover the Apache Hive data warehouse, which provides the HiveQL query language.
 
  • Need for Hive
  • Installation of Hive
  • The Hive metastore
  • Hive Data types - Simple and Complex
  • Hive Tables - Managed and External
  • Hive Query Language (HiveQL)
  • The Hive warehouse
  • Loading data into Hive from the local file system or HDFS
  • Hive Scripts and running Hive Scripts
  • Hive Configuration file (hive-site.xml)
  • Hive Partitions : Static and Dynamic Partitioning
  • Setting up Hive in a Hadoop cluster
  • Sharing the Hive metastore using a MySQL database
  • Project : Weather Sensor Data Analysis using Hive
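
The partitioning idea from the list above can be sketched in plain Python: Hive lays out each partition of a table as its own directory under the warehouse path, and dynamic partitioning routes each row to a partition from a column value. The paths and sample data here are made up for illustration:

```python
from collections import defaultdict

# Hive stores each partition as its own directory, e.g.
#   /user/hive/warehouse/weather/year=2023/part-00000
rows = [("2023", "pune", 31), ("2024", "pune", 29), ("2023", "delhi", 40)]

partitions = defaultdict(list)
for year, city, temp in rows:
    # The partition column ("year") becomes part of the storage path,
    # so queries filtered on it scan only the matching directories.
    partitions[f"weather/year={year}"].append((city, temp))
```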

Setting up Hadoop Cluster

  • Configuring Hadoop in Fully Distributed mode
  • Setting up a cluster of machines
  • Project : Running Weather Sensor Data Analysis and Count distinct elements in a stream, on Hadoop Cluster
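
The count-distinct project mentioned above boils down to one grouping pass: once every occurrence of an element collapses onto a single key, counting keys gives the number of distinct elements. A minimal Python sketch of the idea (not the course's actual MapReduce solution):

```python
from collections import defaultdict

def count_distinct(stream):
    # Group phase: collapse every occurrence of an element onto one key,
    # counting occurrences along the way (the repeat counts solve the
    # related "how many times has each element repeated" problem).
    groups = defaultdict(int)
    for item in stream:
        groups[item] += 1
    # Count phase: one key per distinct element.
    return len(groups), dict(groups)

distinct, repeats = count_distinct(["a", "b", "a", "c", "b", "a"])
```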

NoSQL

  • Problems with existing RDBMS solutions for managing data
  • Introduction to NoSQL
  • Difference between NoSQL and RDBMS solutions
  • Column Store based NoSQL solution
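
The row-store vs. column-store distinction can be shown in a few lines of plain Python (the sample data is made up):

```python
# Row store: each record kept together (the typical RDBMS layout).
row_store = [
    {"id": 1, "name": "asha", "city": "pune"},
    {"id": 2, "name": "ravi", "city": "delhi"},
]

# Column store: each column kept together, so a scan that needs only
# one column (e.g. "city") reads far less data from disk.
col_store = {
    "id":   [1, 2],
    "name": ["asha", "ravi"],
    "city": ["pune", "delhi"],
}

cities = col_store["city"]             # touches one column only
same = [r["city"] for r in row_store]  # must touch every full row
```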

Apache HBase

HBase is a non-relational, distributed database that can be used for real-time data reads and writes.
 
  • Need for HBase
  • A Column Store NoSQL solution
  • HBase Data Model
  • Installing HBase
  • HBase Shell Commands
  • HBase Java client API
  • Intelligent generation of rowkeys in HBase tables
  • Install HBase in Clustered Mode
  • HBase Architecture
  • Project : User and Blog data real time read/write using HBase
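
A rough sketch of the HBase data model and of salted rowkeys, in plain Python. The nested-dict structure and the salt function are our simplifications for illustration, not the HBase client API:

```python
import time

# HBase table ~ sorted map:
#   rowkey -> column family -> qualifier -> {timestamp: value}
table = {}

def put(rowkey, family, qualifier, value, ts=None):
    ts = ts if ts is not None else int(time.time() * 1000)
    table.setdefault(rowkey, {}) \
         .setdefault(family, {}) \
         .setdefault(qualifier, {})[ts] = value

def salted_rowkey(user_id, num_buckets=4):
    # "Intelligent" rowkey: a deterministic salt prefix spreads
    # sequential keys over regions instead of hotspotting one server.
    salt = sum(user_id.encode()) % num_buckets
    return f"{salt:02d}-{user_id}"

put(salted_rowkey("user42"), "info", "name", "Asha", ts=1)
```

Because HBase keeps rows sorted by rowkey, a well-chosen key design decides both write distribution and which scans are cheap.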
