Introducing MapReduce – Part II (Code Listing)

In this post we shall go through a full working example for MapReduce. If you have not read Introducing MapReduce – Part I, please read it before reading ahead.

Word counting using MapReduce

Word counting is often called the "Hello, World" program of MapReduce. It helps us understand the flow of data as <key, value> pairs very clearly.

As mentioned in Introducing MapReduce – Part I, a typical MapReduce program is made up of 3 parts:

  1. Driver
  2. Mapper
  3. Reducer

For this example let us consider the following input:

she sells sea shells on the sea shore

The desired output:

she 1
sells 1
sea 2
shells 1
on 1
the 1
shore 1

Mapper for Word Count

package wordcount;

import java.io.*;
import java.util.*;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>{
	public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
		StringTokenizer st = new StringTokenizer(value.toString().toLowerCase());
		while(st.hasMoreTokens()) {
			output.collect(new Text(st.nextToken()), new IntWritable(1));
		}
	}
}

When the Mapper receives the input line, it converts the line to lowercase, splits it into words, and emits a <key, value> pair of the form <word, 1> for every word.

So, for the input line above, the following <key, value> pairs will be generated:

<she,1>
<sells,1>
<sea,1>
<shells,1>
<on,1>
<the,1>
<sea,1>
<shore,1>
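
To see the map step in isolation, here is a minimal sketch in plain Java with no Hadoop dependency. The class name MapStepSketch and its static map method are made up for illustration and are not part of the wordcount project; the method only imitates what WordCountMapper does to a single line.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical stand-in for WordCountMapper, runnable without a Hadoop cluster.
public class MapStepSketch {

    // Lowercases the line, splits it on whitespace, and emits one
    // <word, 1> pair per token, in the order the words appear.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line.toLowerCase());
        while (st.hasMoreTokens()) {
            pairs.add(new SimpleEntry<>(st.nextToken(), 1));
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Prints one pair per line, e.g. <she,1>
        for (Map.Entry<String, Integer> pair : map("she sells sea shells on the sea shore")) {
            System.out.println("<" + pair.getKey() + "," + pair.getValue() + ">");
        }
    }
}
```

Note that "sea" is emitted twice with a value of 1 each time; the mapper does no counting of its own.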

Reducer for Word Count

package wordcount;

import java.io.*;
import java.util.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCountReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable>{
	public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text,IntWritable> output, Reporter reporter) throws IOException {
		int count = 0;
		while(values.hasNext()) {
			count += values.next().get();
		}
		output.collect(key, new IntWritable(count));
	}
}

The reducer receives the <key, value> pairs sorted by key, with all the values for a given key grouped together. As you can see above, the reduce method is called once per key; it adds up the values for that key and emits the key with its total count. So the output of the reducer will be as follows:

<on, 1>
<sea, 2>
<sells, 1>
<she, 1>
<shells, 1>
<shore, 1>
<the, 1>
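
The grouping and summing can also be tried out in plain Java. The sketch below is only a rough imitation: the class ReduceStepSketch and its shuffle and reduce methods are hypothetical names, and a TreeMap is used as an assumed stand-in for the sort-and-group work that Hadoop does between the map and reduce phases.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the shuffle/sort phase plus WordCountReducer.
public class ReduceStepSketch {

    // Groups a 1 under each word; the TreeMap keeps the keys in sorted order,
    // roughly as the framework presents them to the reducer.
    static Map<String, List<Integer>> shuffle(List<String> words) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String word : words) {
            grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }
        return grouped;
    }

    // Sums all the values collected for one key, like WordCountReducer.reduce().
    static int reduce(String key, List<Integer> values) {
        int count = 0;
        for (int value : values) {
            count += value;
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("she sells sea shells on the sea shore".split(" "));
        // One reduce call per distinct key, in sorted key order.
        for (Map.Entry<String, List<Integer>> entry : shuffle(words).entrySet()) {
            System.out.println("<" + entry.getKey() + ", " + reduce(entry.getKey(), entry.getValue()) + ">");
        }
    }
}
```

Here "sea" reaches reduce with the value list [1, 1], which is why its count comes out as 2.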

Driver for Word Count

package wordcount;

import java.io.*;

import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
import org.apache.hadoop.conf.*;

public class WordCount extends Configured implements Tool{

	public int run(String[] args) throws IOException{
		// Start from the Configuration that ToolRunner populated (e.g. -D options)
		JobConf conf = new JobConf(getConf(), WordCount.class);
		conf.setJobName("wordcount");

		conf.setOutputKeyClass(Text.class);
		conf.setOutputValueClass(IntWritable.class);

		conf.setMapperClass(WordCountMapper.class);
		conf.setReducerClass(WordCountReducer.class);

		conf.setInputFormat(TextInputFormat.class);
		conf.setOutputFormat(TextOutputFormat.class);

		FileInputFormat.setInputPaths(conf, new Path(args[0]));
		FileOutputFormat.setOutputPath(conf, new Path(args[1]));

		JobClient.runJob(conf);

		return 0;
	}

	public static void main(String[] args) throws Exception {
		int exitCode = ToolRunner.run(new WordCount(), args);
		System.exit(exitCode);
	}

}

The driver defines the configuration for the job that is submitted to the Hadoop cluster. All of the above configurations have been explained in Introducing MapReduce – Part I.

Source Code on GitHub : https://github.com/rohitsden/wordcount.git

 
