Introducing Hadoop – Part II

Hadoop uses HDFS to store files efficiently in the cluster. When a file is placed in HDFS it is broken down into blocks, with a default block size of 64 MB. These blocks are then replicated across the different nodes (DataNodes) in the cluster. The default replication value is 3, i.e. there will be 3 copies of the same block in the cluster. We will see later on why we maintain replicas of the blocks in the cluster.
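For example, once a cluster is running, you can check the block size and replication factor of a file already stored in HDFS, and change its replication, from the command line. This is a hedged sketch: the path is just a placeholder, and the exact -stat format specifiers may differ slightly between Hadoop versions.

  # Print the block size (%o) and replication factor (%r) of a file in HDFS
  hadoop fs -stat "%o %r" /user/rohit/sample.txt

  # Raise the replication factor of that file to 4 and wait until it completes
  hadoop fs -setrep -w 4 /user/rohit/sample.txt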

A Hadoop cluster can comprise a single node (single-node cluster) or thousands of nodes.

Once you have installed Hadoop you can try out the following few basic commands to work with HDFS (an example session follows the list):

  • hadoop fs -ls
  • hadoop fs -put <path_of_local> <path_in_hdfs>
  • hadoop fs -get <path_in_hdfs> <path_of_local>
  • hadoop fs -cat <path_of_file_in_hdfs>
  • hadoop fs -rmr <path_in_hdfs>
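As a quick example session (the local file and HDFS paths below are only placeholders, so adjust them to your setup), the commands above can be combined as follows:

  # Copy a local file into HDFS and list the target directory
  hadoop fs -put /home/rohit/movies.csv /user/rohit/movies.csv
  hadoop fs -ls /user/rohit

  # View the file's contents and copy it back to the local filesystem
  hadoop fs -cat /user/rohit/movies.csv
  hadoop fs -get /user/rohit/movies.csv /tmp/movies_copy.csv

  # Remove the file (-rmr also removes directories recursively)
  hadoop fs -rmr /user/rohit/movies.csv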

With the help of the following diagram, let us try and understand the different components of a Hadoop Cluster:

[Diagram: a 6-node Hadoop cluster showing the daemons running on each node]

The above diagram depicts a 6-node Hadoop cluster:

NameNode (Master) – NameNode, Secondary NameNode, JobTracker

DataNode 1 (Slave) – TaskTracker, DataNode

DataNode 2 (Slave) – TaskTracker, DataNode

DataNode 3 (Slave) – TaskTracker, DataNode

DataNode 4 (Slave) – TaskTracker, DataNode

DataNode 5 (Slave) – TaskTracker, DataNode

In the diagram you see that the NameNode, Secondary NameNode and the JobTracker are running on a single machine. Usually, in production clusters having more than 20-30 nodes, these daemons run on separate nodes.

Hadoop follows a Master-Slave architecture. As mentioned earlier, a file in HDFS is split into blocks and replicated across DataNodes in a Hadoop cluster. You can see that the three files A, B and C have been split into blocks and replicated with a replication factor of 3 across the different DataNodes.

Now let us go through each node and daemon:

NameNode

The NameNode in Hadoop is the node where Hadoop stores all the location information of the files in HDFS. In other words, it holds the metadata for HDFS. Whenever a file is placed in the cluster, a corresponding entry of its location is maintained by the NameNode. So, for the files A, B and C we would have something as follows in the NameNode:

File A – DataNode1, DataNode2, DataNode4

File B – DataNode1, DataNode3, DataNode4

File C – DataNode2, DataNode3, DataNode4

This information is required when retrieving data from the cluster as the data is spread across multiple machines. The NameNode is a Single Point of Failure for the Hadoop Cluster.
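On a real cluster you can ask the NameNode for exactly this block-to-DataNode mapping using the hadoop fsck utility. A minimal sketch, with a placeholder path (the report it prints is not reproduced here):

  # Show the blocks of a file and the DataNodes holding each replica
  hadoop fsck /user/rohit/fileA.txt -files -blocks -locations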

Secondary NameNode

IMPORTANT – The Secondary NameNode is not a failover node for the NameNode.

The Secondary NameNode is responsible for performing periodic housekeeping functions for the NameNode. It only creates checkpoints of the filesystem metadata (the namespace image and edit log) held by the NameNode.
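For reference, Hadoop 1.x exposes the checkpointing operation through the hadoop secondarynamenode command. A hedged sketch; normally checkpoints are triggered automatically on a schedule, so you rarely need to run these by hand:

  # Report the current size of the NameNode's edit log
  hadoop secondarynamenode -geteditsize

  # Force a checkpoint immediately, regardless of the edit log size
  hadoop secondarynamenode -checkpoint force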

DataNode

The DataNode is responsible for storing the file blocks in HDFS. It manages the blocks within the node, sends regular heartbeats and block reports to the NameNode describing what it stores, and carries out block operations (creation, deletion and replication) as instructed by the NameNode.
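To see which DataNodes are alive and how much of each node's capacity is in use, you can ask HDFS for a report. A minimal sketch, assuming you run it as the HDFS superuser on a configured cluster:

  # Print per-DataNode capacity, usage and last-contact information
  hadoop dfsadmin -report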

JobTracker

The JobTracker is responsible for taking in requests from a client and assigning TaskTrackers the tasks to be performed. The JobTracker tries to assign tasks to a TaskTracker on the DataNode where the data is locally present (data locality). If that is not possible, it will at least try to assign tasks to TaskTrackers within the same rack. If for some reason a node fails, the JobTracker assigns the task to another TaskTracker where a replica of the data exists, since the data blocks are replicated across the DataNodes. This ensures that the job does not fail even if a node fails within the cluster.
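Once a cluster is up, you can query the JobTracker from the command line. A small, hedged example; <job_id> is a placeholder for an actual job ID printed by the first command:

  # List the jobs the JobTracker is currently running
  hadoop job -list

  # Check the state and map/reduce completion percentage of a specific job
  hadoop job -status <job_id>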

TaskTracker

The TaskTracker is a daemon that accepts tasks (Map, Reduce and Shuffle) from the JobTracker. The TaskTracker keeps sending a heartbeat message to the JobTracker to notify it that it is alive. Along with the heartbeat it also reports the number of free slots it has available to process tasks. The TaskTracker starts and monitors the Map and Reduce tasks and sends progress/status information back to the JobTracker.
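The JobTracker's view of which TaskTrackers are currently sending heartbeats can also be listed from the command line. A hedged sketch; the option names below are as in Hadoop 1.x:

  # List the TaskTrackers that are alive and heartbeating into the JobTracker
  hadoop job -list-active-trackers

  # List the TaskTrackers the JobTracker has blacklisted
  hadoop job -list-blacklisted-trackers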

Each of the above daemons runs in its own JVM.

A typical (simplified) flow in Hadoop is as follows (a command-line sketch of the same flow appears after the list):

  1. A Client (usually a MapReduce program) submits a job to the JobTracker.
  2. The JobTracker gets information from the NameNode on the location of the data within the DataNodes. The JobTracker places the client program (usually a jar file along with the configuration file) in HDFS. Once placed, the JobTracker tries to assign tasks to TaskTrackers on the DataNodes based on data locality.
  3. The TaskTracker takes care of starting the Map tasks on the DataNodes by picking up the client program from the shared location in HDFS.
  4. The progress of the operation is relayed back to the JobTracker by the TaskTracker.
  5. On completion of a Map task, an intermediate file is created on the local filesystem of the TaskTracker.
  6. Results from the Map tasks are then passed on to the Reduce tasks.
  7. The Reduce tasks work on all data received from the Map tasks and write the final output to HDFS.
  8. After the job completes, the intermediate data generated by the TaskTrackers is deleted.
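The same flow, seen from the command line, might look roughly like the sketch below. This is only an illustration: the examples jar name and the input/output paths depend on your distribution and setup, and the bundled WordCount example stands in for "a MapReduce program".

  # 1. Put the input data into HDFS
  hadoop fs -mkdir /user/rohit/wordcount/input
  hadoop fs -put /home/rohit/input.txt /user/rohit/wordcount/input/

  # 2. Submit the job to the JobTracker (jar name varies by distribution)
  hadoop jar hadoop-examples.jar wordcount /user/rohit/wordcount/input /user/rohit/wordcount/output

  # 3. Monitor the job while the TaskTrackers run the Map and Reduce tasks
  hadoop job -list

  # 4. Read the final output written to HDFS by the Reduce tasks
  hadoop fs -cat /user/rohit/wordcount/output/part-*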

A very important feature of Hadoop to note here is that the program goes to where the data is and not the other way around, thus resulting in efficient processing of data.

We will be seeing the detailed operations of a Mapper and Reducer in the next post, where we walk through a MapReduce program.

End of Part II

Comments on Introducing Hadoop – Part II

  1. Sheetal Biradar

    Hi Rohit, Thanks a lot for the information on Hadoop. We are entering the world of Hadoop now. Can you please suggest the best versions and best configuration settings for installing Hadoop?
    Also I see that the steps for installing Hive are given. Could you also please provide the same for HIVE, PIG, ZOOKEEPER, SQOOP, HBASE and other components?

    1. Rohit Menon

      Hi Sheetal,

      Selecting a version or distribution of Hadoop depends on the kind of cluster you would like to build and the kind of processing you would be performing on it.
      Regarding instructions on installation of the other components, I plan to start http://www.hadoopscreencasts.com sometime in June that will demonstrate all the steps via video tutorials and screen recordings.
      It would be nice if you visit the site and provide your feedback and requests which will help me build better quality tutorials.

  2. Kadambari

    Hi Rohit,

    I have a question..
    What is ‘speculative execution’ in Hadoop?
    a. Restarting a task requires communication with nodes working on other portions of the data
    b. If a failed node restarts, it is automatically added back to the system and assigned new tasks
    c. If a node appears to be running slowly, the master can redundantly execute another instance of the same task
    d. If a node fails, the master will detect that failure and re-assign the work to a different node on the system

    1. Rohit Menon

      The answer to this question is :
      c. If a node appears to be running slowly, the master can redundantly execute another instance of the same task.

  3. Rajan

    Hello Rohit, thanks for simplifying the Hadoop concepts here. Your article is the first article I ever read on hadoop.

  4. Manohar

    Dear Rohit, although the arrows are not shown for the communication from DataNode to NameNode in the topology, there is communication from DataNode to NameNode while a new output file is being created by a Reducer task. Does this communication take place via the JobTracker or directly from DataNode to NameNode?

  5. Dhiru

    Hi Rohit,

    I am working on Oracle PL/SQL and have 10 years of experience. I am focusing more on Hive. Could you please suggest any other databases related to Hadoop technology which I can also cover, like HBase, Cassandra, etc.?

    Thanks in advance
    Dhiru

  6. Sujatha

    This is a nice narrative introduction to Hadoop. I have been researching the web on how a single node cluster works. The diagram here describes a cluster that has multiple nodes. By nodes, I mean servers. In a single node cluster, the DataNode, NameNode, JobTracker, etc. run on the same machine. How is that possible? If a node is not a server, what does node refer to here? Does a single node cluster replicate the data on multiple nodes? If yes, how does it replicate?

    1. Rohit Menon

      Hi Sujatha,

      A node is a server. They are just two ways of saying the same thing.
      In a single node cluster there is no replication. All services run on just one node/server.
      Do let me know if you need any information.

  7. Rajesh

    Hi,
    I have just started with Hadoop.
    I want to get the data from one map/reduce output and feed it to another.
    Can you give some example of how to use a two-stage map-reduce?
    I have googled it but Hadoop is too new for me to understand it (I have to apply it in my mini project).

      1. Rajesh

        Thanks for the fast reply.
        I have already searched for that too, but didn't get a good tutorial to understand the basics behind it.
        I think I should read more.

        Thanks again.:)

  8. Rohit

    Hi Rohit,

    Thanks. The blog is very informative.

    I just have a quick question on how the name node splits data and sends it to the data nodes.

    I read from other forums that, name node splits data into 64mb blocks and stores in different data nodes.

    If my file were to have data of about 1000 rows and about 50 columns, and if the name node were to split it into blocks of 64mb for the sake of argument.

    Then, will the file be split logically, meaning 400 rows for the first block, 400 rows for the second block and 200 rows for the last block?

    Just curious to know this, because, if the files were split on converting to machine language internally as 0’s and 1’s etc, then we will not have the entire row information on one block.

    In which case, the data processing can only happen after retrieving information from all the blocks.

    Please do let me know your thoughts. Thanks.

    – Rohit B

    PS: I am new user to hadoop, but have prior working knowledge in databases and data transformation.

    1. Rohit Menon

      Hi Rohit,

      This is a very good question.

      Whenever a block is created, the file is not split based on rows. A block boundary can fall right in between a record too.
      But this is handled very well at the time of reading the blocks.

      When a row/line is being read in Hadoop, a line is said to be complete only if the reader reads from the previous line delimiter to the current line delimiter.

      Let us consider the following rows:
      1,The Nightmare Before Christmas,1993,3.9,4568
      2,The Mummy,1932,3.5,4388

      If the above rows were to be split during block creation, the end of the first block would have something as follows:
      1,The Nightmare Before Christmas,1993,3.9,4568
      2,The Mummy,

      The second block would have:
      1932,3.5,4388

      In this case when these records are being read, the operation will happen as follows:
      The first block will be read as it is, but it knows that it has not reached a line delimiter for the 2nd row.

      1,The Nightmare Before Christmas,1993,3.9,4568
      2,The Mummy,

      And when the next block is being read, the read pointer goes back to the previous line delimiter in the previous block. In this case it will go to the ‘\n’ after record 1 of the first block and read until the ‘\n’ of the first row of the second block.

      So it will be able to create the following record:
      2,The Mummy,1932,3.5,4388

      This is how records that are split across blocks are read.
      I hope I was able to explain this in text.

      Do let me know if you need any information.

      1. Kayal

        Hi,

        I found this “The RecordReader associated with TextInputFormat must be robust enough to handle the fact that the splits do not necessarily correspond neatly to line-ending boundaries. In fact, the RecordReader will read past the theoretical end of a split to the end of a line in one record. The reader associated with the next split in the file will scan for the first full line in the split to begin processing that fragment. All RecordReader implementations must use some similar logic to ensure that they do not miss records that span InputSplit boundaries.” in https://developer.yahoo.com/hadoop/tutorial/module5.html

        Please could you let me know which is the correct method of record reading?

        Thanks,
        Kayal.

  9. sonu

    hi
    Rohit
    I am learning Hadoop. As shown in the diagram, the NameNode, Secondary NameNode and JobTracker look like they physically run on the same computer, but you divided them into separate blocks. Do they run on different machines, or does everything run on one machine?

  10. Dharmesh Patel

    Hello sir,
    Thanks for sharing information about Hadoop in a simple manner.

    I want to ask one question.

    How does the TaskTracker send a heartbeat message to the JobTracker?
    I want to change the heartbeat method manually and update it with one written by me.

    Will you please elaborate on it?

