Introducing MapReduce – Part I

MapReduce is the programming model used to process data stored in HDFS. MapReduce programs are written in Java, although Hadoop also provides Hadoop Streaming, which lets you write the Mapper and Reducer in other languages. All data that flows through a MapReduce program is in the form of <Key,Value> pairs. For example, in a word count program the Mapper emits a <word, 1> pair for every word it reads, and the Reducer emits a <word, total count> pair for every distinct word.

In the previous post we saw the typical flow of a Hadoop job. Here we will break a MapReduce program down into its parts and try to understand each part in detail.

A MapReduce program consists of the following three parts:

  1. Driver
  2. Mapper
  3. Reducer

Driver

The Driver code runs on the client machine and is responsible for building the job configuration and submitting it to the Hadoop cluster. The Driver contains the main() method that accepts arguments from the command line.

Some of the common imports for the Driver class:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

In most cases, the command-line parameters passed to the Driver program are the path to the directory containing the input files and the path to the output directory. Both paths refer to locations in HDFS. The output location must not exist before running the program, as it is created during execution; if the output location already exists, the program will exit with an error.
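If you want the job to fail fast with a clearer message, one option is to check the output path yourself before submitting the job. The snippet below is a hedged sketch meant to sit inside the Driver's main() method after the JobConf object (shown in the sample further down) has been created; it assumes additional imports of org.apache.hadoop.fs.FileSystem and java.io.IOException.

// Optional pre-flight check: fail fast if the output directory already exists
FileSystem fs = FileSystem.get(conf);   // conf is the JobConf created in main()
Path outputDir = new Path(args[1]);
if (fs.exists(outputDir)) {
    throw new IOException("Output directory " + outputDir + " already exists");
}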

Next, the Driver program configures the job that will be submitted to the cluster. To do this we create an object of type JobConf and pass it the Driver class. The JobConf class allows you to configure the different properties for the Mapper, Combiner, Partitioner, Reducer, InputFormat and OutputFormat.

Sample

public class MyDriver {

    public static void main(String[] args) throws Exception {

        // Create the JobConf object
        JobConf conf = new JobConf(MyDriver.class);

        // Set the name of the Job
        conf.setJobName("SampleJobName");

        // Set the output Key type for the Mapper
        conf.setMapOutputKeyClass(Text.class);

        // Set the output Value type for the Mapper
        conf.setMapOutputValueClass(IntWritable.class);

        // Set the output Key type for the Reducer
        conf.setOutputKeyClass(Text.class);

        // Set the output Value type for the Reducer
        conf.setOutputValueClass(IntWritable.class);

        // Set the Mapper Class
        conf.setMapperClass(MyMapper.class);

        // Set the Reducer Class
        conf.setReducerClass(MyReducer.class);

        // Set the format of the input that will be provided to the program
        conf.setInputFormat(TextInputFormat.class);

        // Set the format of the output for the program
        conf.setOutputFormat(TextOutputFormat.class);

        // Set the location from where the Mapper will read the input
        FileInputFormat.setInputPaths(conf, new Path(args[0]));

        // Set the location where the Reducer will write the output
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Run the job on the cluster
        JobClient.runJob(conf);
    }
}
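The sample above only wires up the Mapper and the Reducer. The JobConf class can also configure the Combiner and Partitioner mentioned earlier; the lines below are a hedged sketch of how that might look for this job. Reusing MyReducer as the Combiner is only safe because summing counts is associative and commutative, and HashPartitioner (an additional import from org.apache.hadoop.mapred.lib) is in fact already the default partitioner.

// Optional: run the Reducer as a Combiner to pre-aggregate map output locally
conf.setCombinerClass(MyReducer.class);

// Optional: set the Partitioner that decides which Reducer receives each key
// (HashPartitioner is the default; it is shown here only for illustration)
conf.setPartitionerClass(HashPartitioner.class);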

Mapper

The Mapper code reads its input as <Key,Value> pairs and emits intermediate <Key,Value> pairs. The Mapper class extends MapReduceBase and implements the Mapper interface. The Mapper interface expects four type parameters, which define the types of the input and output key/value pairs. The first two parameters define the input key and value types, the last two define the output key and value types.

Some of the common imports for the Mapper class:

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;


Sample

public class MyMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output,
            Reporter reporter) throws IOException {
        // Split the input line into words and emit a <word, 1> pair for each word
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            output.collect(new Text(tokenizer.nextToken()), new IntWritable(1));
        }
    }
}

The map() method accepts a key, a value, an OutputCollector and a Reporter object. The OutputCollector is responsible for writing the intermediate data generated by the Mapper; in the sample above it writes a <word, 1> pair for every word in the input line.

Reducer

The Reducer code reads the outputs generated by the different Mappers as <Key,Value> pairs and emits <Key,Value> pairs. The Reducer class extends MapReduceBase and implements the Reducer interface. The Reducer interface expects four type parameters, which define the types of the input and output key/value pairs. The first two parameters define the intermediate key and value types, the last two define the final output key and value types. The keys are WritableComparables, the values are Writables.

Some of the common imports for the Reducer class:

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

Sample

public class MyReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        // Sum all the values received for this key and emit <key, sum>
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}

The reduce() method accepts a key, an Iterator over the values for that key, an OutputCollector and a Reporter object. The OutputCollector is responsible for writing the final output; in the sample above it writes a <word, total count> pair for every distinct word.
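To make the whole flow concrete, here is an illustrative trace of the sample Mapper and Reducer above on a single made-up input line (the 0 in the Mapper input is the byte offset key supplied by TextInputFormat):

Mapper input   : <0, "the cat sat on the mat">
Mapper output  : <"the", 1> <"cat", 1> <"sat", 1> <"on", 1> <"the", 1> <"mat", 1>
Reducer input  : <"cat", [1]> <"mat", [1]> <"on", [1]> <"sat", [1]> <"the", [1, 1]>
Reducer output : <"cat", 1> <"mat", 1> <"on", 1> <"sat", 1> <"the", 2>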

Next, we shall see how to write a complete MapReduce application.

Comments

  1. Hadoop Fresher

    Hello Arun,
    The articles you have written are very precise and good. I do have a few technical questions and I would appreciate it if you could clarify them.

    1) For clearing the Cloudera Developer Certification (CCD410), do we need to know the old API, or is the new API sufficient?

    2) I have a good understanding of the Yahoo Hadoop Tutorial, so do you think it is sufficient to crack the exam, or do other books need to be read?

    1. Rohit Menon

      Hi Hadoop Fresher,

      I am not sure who Arun is, but if it is the author of this blog that you wanted to contact, then that is me, Rohit.

      1) When I appeared for the test (Dec 2012) I did not have to learn much about the new API. The questions are more on the concepts of Hadoop, and not the API.
      2) The Yahoo Tutorial is a really good source, but I would also recommend reading Hadoop in Action, by Chuck Lam.

      Do let me know if you need any information.

  2. Hadoop Fresher

    Hello Rohit,
    Sorry for the typo regarding your name and thanks for the clarification. Your tutorials are excellent.

  3. Gangadhar

    Hello,
    Actually I am new to Hadoop, but after reading this blog post I am confident I can learn Hadoop.
    Thanks.

