Friday, 29 September 2017

Hadoop - HDFS Overview & Architecture

What is HDFS?

A "file system" is the method a computer uses to store data so that it can be found and used later. Usually this is determined by the computer's operating system; however, Hadoop uses its own file system, HDFS (Hadoop Distributed File System), which sits "above" the file system of the host computer. Because it is layered on top of the native file system, it can be accessed from any machine running a supported operating system.

Hadoop Architecture

The main components of the Hadoop File System architecture are described below.

NameNode :

The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.

The HDFS namespace is a hierarchy of files and directories, which are represented on the NameNode by inodes.

Inodes record attributes such as permissions, modification and access times, and namespace and disk space quotas.

The file content is divided into large blocks (typically 128 megabytes, but the block size is selectable file by file), and each block of the file is independently replicated at multiple DataNodes (typically three, but the replication factor is also selectable file by file).
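The split described above is purely size-based: a file is cut into fixed-size chunks, with only the last block allowed to be smaller. The following sketch (hypothetical helper, not part of Hadoop) shows how a file's byte range maps onto blocks:

```python
def split_into_blocks(file_size, block_size=128 * 1024 * 1024):
    """Return (offset, length) pairs for each HDFS-style block of a file.

    Every block is block_size bytes except possibly the last, which
    holds whatever remains of the file.
    """
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 300 MB file with the default 128 MB block size yields three blocks:
# two full 128 MB blocks and one 44 MB remainder.
blocks = split_into_blocks(300 * 1024 * 1024)
print(len(blocks))                      # 3
print(blocks[-1][1] // (1024 * 1024))   # 44
```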

The NameNode maintains the namespace tree and the mapping of blocks to DataNodes. The current design has a single NameNode for each cluster. 

The cluster can have thousands of DataNodes and tens of thousands of HDFS clients, since each DataNode may serve multiple application tasks concurrently.
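The NameNode's block map can be pictured as a dictionary from block IDs to lists of DataNodes. The toy class below (a hypothetical in-memory sketch, not Hadoop code) illustrates the idea; real HDFS uses rack-aware replica placement rather than the simple round-robin shown here:

```python
import itertools

class ToyNameNode:
    """Minimal sketch of the NameNode's block-to-DataNode mapping."""

    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes
        self.replication = replication
        self.block_map = {}              # block id -> list of DataNode names
        self._rr = itertools.cycle(datanodes)

    def allocate_block(self, block_id):
        # Pick `replication` distinct DataNodes round-robin. Real HDFS
        # applies a rack-aware placement policy instead.
        targets = []
        while len(targets) < min(self.replication, len(self.datanodes)):
            dn = next(self._rr)
            if dn not in targets:
                targets.append(dn)
        self.block_map[block_id] = targets
        return targets

nn = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
print(nn.allocate_block("blk_001"))   # ['dn1', 'dn2', 'dn3']
```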


DataNode :

A DataNode stores data in the Hadoop file system. A functional filesystem has more than one DataNode, with data replicated across them.

On startup, a DataNode connects to the NameNode, retrying until that service comes up. It then responds to requests from the NameNode for filesystem operations.

DataNode is also the name of the daemon (background process) that runs on each slave node. It is responsible for storing and managing the actual data on that node.

The client writes data to one slave node, and it is then the responsibility of that DataNode to replicate the data to other slave nodes according to the replication factor.

Block :

Generally, user data is stored in the files of HDFS. A file is divided into one or more segments, which are stored on individual DataNodes.

These file segments are called blocks. In other words, a block is the minimum amount of data that HDFS can read or write. The default block size is 128 MB (64 MB in older Hadoop 1.x releases), and it can be changed as needed in the HDFS configuration.
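The cluster-wide default is set with the `dfs.blocksize` property in `hdfs-site.xml`; the fragment below is a sketch (value shown is 128 MB in bytes):

```xml
<!-- hdfs-site.xml: cluster-wide default block size. -->
<property>
  <name>dfs.blocksize</name>
  <value>134217728</value> <!-- 128 MB -->
</property>
```

It can also be overridden per file at write time, e.g. `hdfs dfs -D dfs.blocksize=268435456 -put local.dat /data/` to store one file with 256 MB blocks.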



