Hadoop uses HDFS to store files efficiently across the cluster. When a file is placed in HDFS it is broken down into blocks of 64 MB (the default block size). These blocks are then replicated across the different nodes (DataNodes) in the cluster. The default replication factor is 3, i.e. there will be 3 copies of the same block in the cluster. We will see later on why we maintain replicas of the blocks in the cluster.
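Both defaults can be overridden per file. A minimal sketch (the file and path names here are hypothetical; dfs.block.size and dfs.replication are the Hadoop 1.x property names):

```
# Upload a file with a 128 MB block size and only 2 replicas,
# overriding the cluster defaults for this file only
hadoop fs -D dfs.block.size=134217728 -D dfs.replication=2 -put bigfile.log /user/hadoop/bigfile.log

# Raise the replication factor of an existing file back to 3
hadoop fs -setrep 3 /user/hadoop/bigfile.log
```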
A Hadoop cluster can consist of a single node (single-node cluster) or thousands of nodes.
Once you have installed Hadoop, you can try out the following basic commands to work with HDFS (a sample session follows the list):
- hadoop fs -ls – lists the contents of a directory in HDFS
- hadoop fs -put <path_of_local> <path_in_hdfs> – copies a local file into HDFS
- hadoop fs -get <path_in_hdfs> <path_of_local> – copies a file from HDFS to the local filesystem
- hadoop fs -cat <path_of_file_in_hdfs> – prints the contents of a file in HDFS
- hadoop fs -rmr <path_in_hdfs> – recursively deletes a file or directory in HDFS
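A minimal sample session tying these together (the file and directory names are hypothetical):

```
# Copy a local file into HDFS
hadoop fs -put sales.csv /user/hadoop/sales.csv

# Verify it arrived and print its contents
hadoop fs -ls /user/hadoop
hadoop fs -cat /user/hadoop/sales.csv

# Copy it back to the local filesystem, then delete the HDFS copy
hadoop fs -get /user/hadoop/sales.csv sales_copy.csv
hadoop fs -rmr /user/hadoop/sales.csv
```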
With the help of the following diagram, let us try to understand the different components of a Hadoop cluster:
The above diagram depicts a 6-node Hadoop cluster:
NameNode (Master) – NameNode, Secondary NameNode, JobTracker
DataNode 1 (Slave) – TaskTracker, DataNode
DataNode 2 (Slave) – TaskTracker, DataNode
DataNode 3 (Slave) – TaskTracker, DataNode
DataNode 4 (Slave) – TaskTracker, DataNode
DataNode 5 (Slave) – TaskTracker, DataNode
In the diagram you see that the NameNode, Secondary NameNode and the JobTracker are running on a single machine. Usually in production clusters having more than 20-30 nodes, the daemons run on separate nodes.
Hadoop follows a Master-Slave architecture. As mentioned earlier, a file in HDFS is split into blocks and replicated across DataNodes in a Hadoop cluster. You can see that the three files A, B and C have been split into blocks and replicated with a replication factor of 3 across the different DataNodes.
Now let us go through each node and daemon:
The NameNode in Hadoop is the node where Hadoop stores all the location information of the files in HDFS. In other words, it holds the metadata for HDFS. Whenever a file is placed in the cluster, a corresponding entry of its location is maintained by the NameNode. So, for the files A, B and C we would have something like the following in the NameNode:
File A – DataNode1, DataNode2, DataNode4
File B – DataNode1, DataNode3, DataNode4
File C – DataNode2, DataNode3, DataNode4
This information is required when retrieving data from the cluster as the data is spread across multiple machines. The NameNode is a Single Point of Failure for the Hadoop Cluster.
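You can inspect this block-location metadata yourself with the fsck tool. A minimal sketch (the path is hypothetical):

```
# Show the blocks of a file and the DataNodes holding each replica
hadoop fsck /user/hadoop/fileA -files -blocks -locations
```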
IMPORTANT – The Secondary NameNode is not a failover node for the NameNode.
The Secondary NameNode is responsible for performing periodic housekeeping functions for the NameNode: it merges the NameNode's edit log into the filesystem image, creating checkpoints of the HDFS metadata (by default once an hour in Hadoop 1.x, controlled by the fs.checkpoint.period property).
The DataNode is responsible for storing the files in HDFS. It manages the file blocks within the node, periodically reports the blocks it stores to the NameNode, and carries out block creation, deletion and replication as instructed by the NameNode.
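You can see the DataNodes the NameNode currently knows about, along with their capacity and usage, using the dfsadmin tool:

```
# Print a per-DataNode report of capacity, usage and last contact time
hadoop dfsadmin -report
```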
The JobTracker is responsible for taking in requests from a client and assigning TaskTrackers the tasks to be performed. The JobTracker tries to assign tasks to the TaskTracker on the DataNode where the data is locally present (data locality). If that is not possible, it will at least try to assign tasks to TaskTrackers within the same rack. If a node fails, the JobTracker assigns the task to another TaskTracker where a replica of the data exists; since the data blocks are replicated across the DataNodes, the job does not fail even if a node fails within the cluster.
The TaskTracker is a daemon that accepts tasks (Map, Reduce and Shuffle) from the JobTracker. The TaskTracker keeps sending a heartbeat message to the JobTracker to notify it that it is alive. Along with the heartbeat, it also reports the number of free slots it has available to process tasks. The TaskTracker starts and monitors the Map and Reduce tasks and sends progress/status information back to the JobTracker.
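On a running cluster you can list the TaskTrackers that are currently heartbeating to the JobTracker (this assumes the Hadoop 1.x job CLI):

```
# List the TaskTrackers currently reporting to the JobTracker
hadoop job -list-active-trackers
```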
All the above daemons run within their own JVMs.
A typical (simplified) flow in Hadoop is as follows (a job-submission sketch follows the list):
- A client (usually a MapReduce program) submits a job to the JobTracker.
- The JobTracker gets information from the NameNode on the location of the data within the DataNodes. The JobTracker places the client program (usually a jar file along with the configuration file) in HDFS. Once placed, the JobTracker tries to assign tasks to TaskTrackers on the DataNodes based on data locality.
- The TaskTracker takes care of starting the Map tasks on the DataNodes by picking up the client program from the shared location on HDFS.
- The progress of the operation is relayed back to the JobTracker by the TaskTracker.
- On completion of the Map task an intermediate file is created on the local filesystem of the TaskTracker.
- Results from the Map tasks are then passed on to the Reduce tasks.
- The Reduce tasks work on all the data received from the Map tasks and write the final output to HDFS.
- After the tasks complete, the intermediate data generated by the TaskTracker is deleted.
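From the command line, this whole flow is kicked off with a single job submission. A minimal sketch, assuming the examples jar that ships with Hadoop 1.x (the jar name and HDFS paths here are hypothetical):

```
# Submit the word-count example; input and output are HDFS paths
hadoop jar hadoop-examples.jar wordcount /user/hadoop/input /user/hadoop/output

# Monitor running jobs and check the status of a specific job
hadoop job -list
hadoop job -status <job_id>
```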
A very important feature of Hadoop to note here is that the program goes to where the data is and not the other way around, resulting in efficient processing of data.
We will see the detailed operations of a Mapper and a Reducer in the next post, where we walk through a MapReduce program.
End of Part II