Monday, 25 September 2017

Introduction To Hadoop

What is Hadoop?

Hadoop is open-source software for reliable, scalable, distributed computing.

Hadoop is an open source, Java-based programming framework that supports the processing and storage of very large data sets in a distributed computing environment.

It is part of the Apache project sponsored by the Apache Software Foundation.
Hadoop is dedicated to storing and analyzing large sets of unstructured data.
The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part based on the MapReduce programming model.

Hadoop splits files into large blocks and distributes them across nodes in a cluster. It then transfers packaged code to the nodes so they can process the data in parallel.

This approach takes advantage of data locality, where nodes process the data they already have access to.

This allows the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system, where computation and data are distributed via high-speed networking.
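The split-and-process-locally idea can be sketched in plain Python. This is a toy simulation, not the Hadoop API; the node names, block size, and task are made up for illustration:

```python
# Toy simulation of Hadoop-style data locality: split a "file" into
# blocks, distribute the blocks across nodes, then ship the processing
# code to each node so the data never has to move.

BLOCK_SIZE = 16  # bytes per block (real HDFS defaults to 128 MB)

def split_into_blocks(data, block_size=BLOCK_SIZE):
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def distribute(blocks, nodes):
    """Round-robin each block onto a node, like a very simplified NameNode."""
    placement = {node: [] for node in nodes}
    for i, block in enumerate(blocks):
        placement[nodes[i % len(nodes)]].append(block)
    return placement

def run_locally(placement, task):
    """Run the task on every node, each over only its own local blocks."""
    return {node: [task(b) for b in blocks] for node, blocks in placement.items()}

data = b"the quick brown fox jumps over the lazy dog" * 4
placement = distribute(split_into_blocks(data), ["node1", "node2", "node3"])
partial = run_locally(placement, lambda block: block.count(b"o"))
total = sum(sum(counts) for counts in partial.values())
print(total)  # same answer as scanning the whole file in one place
```

Each node produces a partial result from its local blocks, and only those small partial results travel over the network to be combined.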

The 4 Modules of Hadoop
  1. Hadoop Distributed File-System (HDFS)
  2. MapReduce
  3. Hadoop Common
  4. YARN

Hadoop Distributed File-System (HDFS) :

A "file system" is the method a computer uses to store data so it can be found and used. Usually this is determined by the computer's operating system; however, a Hadoop system uses its own file system, which sits "above" the file system of the host computer, meaning it can be accessed from any computer running any supported OS.
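A toy sketch of what HDFS-style storage looks like from above the host file system: files are split into blocks and each block is copied onto several nodes. The class, node names, and sizes here are invented for illustration (real HDFS defaults to 128 MB blocks and a replication factor of 3):

```python
# Toy HDFS-style block store: a file is split into fixed-size blocks,
# and each block is replicated onto several "nodes" (plain dicts here),
# so losing one node does not lose the data.

BLOCK_SIZE = 8
REPLICATION = 2

class ToyDFS:
    def __init__(self, node_names):
        self.nodes = {name: {} for name in node_names}  # node -> {block_id: bytes}
        self.namespace = {}                             # filename -> [block_ids]

    def put(self, filename, data):
        node_list = list(self.nodes)
        block_ids = []
        for i in range(0, len(data), BLOCK_SIZE):
            block_id = f"{filename}#{i // BLOCK_SIZE}"
            # place REPLICATION copies on consecutive nodes, round-robin
            for r in range(REPLICATION):
                node = node_list[(i // BLOCK_SIZE + r) % len(node_list)]
                self.nodes[node][block_id] = data[i:i + BLOCK_SIZE]
            block_ids.append(block_id)
        self.namespace[filename] = block_ids

    def get(self, filename):
        out = b""
        for block_id in self.namespace[filename]:
            # read each block from any node that still holds a copy
            for store in self.nodes.values():
                if block_id in store:
                    out += store[block_id]
                    break
        return out

dfs = ToyDFS(["node1", "node2", "node3"])
dfs.put("demo.txt", b"hello distributed world")
dfs.nodes["node2"].clear()   # simulate one node failing
print(dfs.get("demo.txt"))   # the file is still fully readable
```

Because every block has a surviving replica on another node, the read succeeds even after a node is wiped.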

MapReduce :

An implementation of the MapReduce programming model for large-scale data processing.

MapReduce has two operations, as its name suggests: reading data from the source and putting it into a format suitable for analysis (map), and performing summary operations, e.g. counting the number of males aged 29+ in a customer database (reduce).
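The article's own counting example can be simulated in plain Python. This mimics the map and reduce steps of the model, not the actual Hadoop API, and the customer records are made up:

```python
# Simulate MapReduce's two steps on a made-up customer list:
# map emits a (key, value) pair for each matching record,
# reduce sums the values for the key.

from functools import reduce

customers = [
    {"name": "Arun",   "gender": "M", "age": 31},
    {"name": "Priya",  "gender": "F", "age": 29},
    {"name": "Vikram", "gender": "M", "age": 24},
    {"name": "Ravi",   "gender": "M", "age": 45},
]

# map: turn each qualifying record into a (key, 1) pair
mapped = [("male_29_plus", 1) for c in customers
          if c["gender"] == "M" and c["age"] >= 29]

# reduce: aggregate the values for the key
count = reduce(lambda acc, kv: acc + kv[1], mapped, 0)
print(count)  # 2  (Arun and Ravi)
```

In real Hadoop the mapped pairs would be shuffled across the cluster and reduced per key; here the whole list fits in one process, so a single reduce suffices.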

Hadoop Common :

contains libraries and utilities needed by other Hadoop modules.

Hadoop Common provides the tools (written in Java) needed for the user's computer systems (Windows, Unix, or whatever) to read data stored under the Hadoop file system.


YARN :

A platform responsible for managing computing resources in clusters and scheduling users' applications. It manages the resources of the systems storing the data and running the analysis.
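Resource scheduling in the spirit of YARN can be sketched as follows; this is purely illustrative (the node names, capacities, and first-fit policy are invented, and real YARN schedulers are far more sophisticated):

```python
# Toy YARN-style scheduler: nodes advertise memory capacity,
# applications request containers, and the scheduler grants each
# request on the first node with enough free memory.

def schedule(requests, capacities):
    """Assign each (app, mem) request to the first node with room."""
    free = dict(capacities)                  # node -> free memory (GB)
    assignments = []
    for app, mem in requests:
        for node in free:
            if free[node] >= mem:
                free[node] -= mem
                assignments.append((app, node))
                break
        else:
            assignments.append((app, None))  # no capacity: request waits
    return assignments

caps = {"node1": 8, "node2": 8}
reqs = [("job-a", 6), ("job-b", 4), ("job-c", 4), ("job-d", 4)]
result = schedule(reqs, caps)
print(result)
# [('job-a', 'node1'), ('job-b', 'node2'), ('job-c', 'node2'), ('job-d', None)]
```

Note that job-d gets no container: both nodes are too full, so in a real cluster the request would queue until resources are released.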


