Apache Pig Tutorial – Part 1

Apache Pig is a tool for analyzing large amounts of data by representing the analysis as data flows. Using the Pig Latin scripting language, operations like ETL (Extract, Transform and Load), ad hoc data analysis and iterative processing can be easily achieved.

Pig is an abstraction over MapReduce. In other words, Pig scripts are internally converted into Map and Reduce tasks to get the work done. Pig was built to make programming MapReduce applications easier. Before Pig, processing the data stored on HDFS mostly meant writing MapReduce jobs in Java.

Pig was first built at Yahoo! and later became a top-level Apache project. In this series of posts we will walk through the different features of Pig using a sample dataset.

Dataset

The dataset that we are using here is from one of my projects called Flicksery. Flicksery is a Netflix search engine. The dataset is a simple text file (movies_data.csv) that lists movie names along with details like release year, rating and runtime. Each line holds an id, the movie name, the release year, the rating and the runtime, separated by commas.

A sample of the dataset is as follows:

1,The Nightmare Before Christmas,1993,3.9,4568
2,The Mummy,1932,3.5,4388
3,Orphans of the Storm,1921,3.2,9062
4,The Object of Beauty,1991,2.8,6150
5,Night Tide,1963,2.8,5126
6,One Magic Christmas,1985,3.8,5333
7,Muriel's Wedding,1994,3.5,6323
8,Mother's Boys,1994,3.4,5733
9,Nosferatu: Original Version,1929,3.5,5651
10,Nick of Time,1995,3.4,5333

All code and data for this post can be downloaded from GitHub. The file has a total of 49590 records.
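
Before loading a file like this into Pig, it can help to sanity-check it from the shell. The sketch below is only an illustration: instead of touching the real movies_data.csv, it writes four rows copied from this post into a scratch file, then counts the records and the rows whose rating field is empty. The variable names (sample, record_count, missing_rating) are my own and not part of the project.

```shell
# Scratch file holding four rows copied from this post (two have a missing field).
sample=$(mktemp)
cat > "$sample" <<'EOF'
1,The Nightmare Before Christmas,1993,3.9,4568
2,The Mummy,1932,3.5,4388
49587,Top Gear: Series 19: Africa Special,2013,,6822
49589,Kate Plus Ei8ht,2010,2.7,
EOF

# One line per movie, so the line count is the record count.
record_count=$(wc -l < "$sample" | tr -d ' ')

# Count rows whose rating (field 4) is empty.
missing_rating=$(awk -F',' '$4 == "" { n++ } END { print n + 0 }' "$sample")

echo "records=$record_count missing_rating=$missing_rating"
rm -f "$sample"
```

Running the same wc -l against the downloaded movies_data.csv should report the record count mentioned above, assuming the file ends with a newline.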

Installing Pig

Download Pig

$ wget http://mirror.symnds.com/software/Apache/pig/pig-0.12.0/pig-0.12.0.tar.gz

Untar the archive:

$ tar xvzf pig-0.12.0.tar.gz

Rename the folder for easier access:

$ mv pig-0.12.0 pig

Update .bashrc to add the following:

export PATH=$PATH:/home/hduser/pig/bin
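
If you re-source .bashrc often, a plain export like the one above appends the same directory again each time. A small variation, sketched here under the assumption that Pig lives in /home/hduser/pig as in this post, appends the directory only when it is missing. PIG_BIN is a name I made up for this example.

```shell
# Directory from the export line above; adjust if Pig was extracted elsewhere.
PIG_BIN=/home/hduser/pig/bin

# Append only when the directory is not already on PATH, so that
# re-sourcing .bashrc does not keep adding duplicate entries.
case ":$PATH:" in
  *":$PIG_BIN:"*) ;;
  *) PATH="$PATH:$PIG_BIN" ;;
esac
export PATH
```

After editing .bashrc, running "source ~/.bashrc" in the current shell picks up the change without opening a new terminal.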

Pig can be started in one of the following two modes:

  1. Local Mode
  2. Cluster Mode

Using the '-x local' option starts Pig in local mode, whereas executing the pig command without any options starts Pig in cluster (MapReduce) mode. In local mode, Pig accesses files on the local file system. In cluster mode, Pig accesses files on HDFS.

Restart your terminal and execute the pig command as follows:

To start in Local Mode:

$ pig -x local
2013-12-25 20:16:26,258 [main] INFO org.apache.pig.Main - Apache Pig version 0.12.0 (r1529718) compiled Oct 07 2013, 12:20:14
2013-12-25 20:16:26,259 [main] INFO org.apache.pig.Main - Logging error messages to: /home/hduser/pig/myscripts/pig_1388027786256.log
2013-12-25 20:16:26,281 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/hduser/.pigbootup not found
2013-12-25 20:16:26,381 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
grunt>

To start in Cluster Mode:

$ pig
2013-12-25 20:19:42,274 [main] INFO org.apache.pig.Main - Apache Pig version 0.12.0 (r1529718) compiled Oct 07 2013, 12:20:14
2013-12-25 20:19:42,274 [main] INFO org.apache.pig.Main - Logging error messages to: /home/hduser/pig/myscripts/pig_1388027982272.log
2013-12-25 20:19:42,300 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/hduser/.pigbootup not found
2013-12-25 20:19:42,463 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:54310
2013-12-25 20:19:42,672 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: hdfs://localhost:9001
grunt>

Either command presents you with a grunt shell. The grunt shell allows you to execute Pig Latin statements one at a time, so you can test data flows on your data step by step without having to run complete scripts. Pig is now installed and we can go ahead and start using it to play with data.

Pig Latin

To learn Pig Latin, let’s question the data. Before we start asking questions, we need the data to be accessible in Pig.

Use the following command to load the data:

grunt> movies = LOAD '/home/hduser/pig/myscripts/movies_data.csv' USING PigStorage(',') as (id,name,year,rating,duration);

The above statement is made up of two parts. The part to the left of "=" is called the relation or alias. It looks like a variable, but note that it is not one: it is a name for a step in the data flow. The part to the right is the LOAD operation that defines that step. When this statement is executed, no MapReduce task is run; Pig only records the step in its plan.

Since our dataset has records with fields separated by a comma, we use the clause USING PigStorage(',').
We also give names to the fields using the 'as' keyword. No types are declared here, so Pig treats every field as a bytearray by default.

Now, let’s test to see if the alias has the data we loaded.

grunt> DUMP movies;

Once you execute the above statement, you should see a lot of text on the screen (partial text shown below).

2013-12-25 23:03:04,550 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN
2013-12-25 23:03:04,633 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, DuplicateForEachColumnRewrite, GroupByConstParallelSetter, ImplicitSplitInserter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, NewPartitionFilterOptimizer, PartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter], RULES_DISABLED=[FilterLogicExpressionSimplifier]}
2013-12-25 23:03:04,748 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2013-12-25 23:03:04,805 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2013-12-25 23:03:04,805 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2013-12-25 23:03:04,853 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job

................

HadoopVersion PigVersion UserId StartedAt FinishedAt Features
1.1.2 0.12.0 hduser 2013-12-25 23:03:04 2013-12-25 23:03:05 UNKNOWN

Success!

Job Stats (time in seconds):
JobId Alias Feature Outputs
job_local_0001 movies MAP_ONLY file:/tmp/temp-1685410826/tmp1113990343,

Input(s):
Successfully read records from: "/home/hduser/pig/myscripts/movies_data.csv"

Output(s):
Successfully stored records in: "file:/tmp/temp-1685410826/tmp1113990343"

Job DAG:
job_local_0001

................

(49586,Winter Wonderland,2013,2.8,1812)
(49587,Top Gear: Series 19: Africa Special,2013,,6822)
(49588,Fireplace For Your Home: Crackling Fireplace with Music,2010,,3610)
(49589,Kate Plus Ei8ht,2010,2.7,)
(49590,Kate Plus Ei8ht: Season 1,2010,2.7,)

It is only after the DUMP statement that a MapReduce job is initiated. Since we can see our data in the output, we can confirm that the data has been loaded successfully. Note that some records have an empty rating or duration.

Now that we have the data in Pig, let’s start with the questions.

List the movies that have a rating greater than 4

grunt> movies_greater_than_four = FILTER movies BY (float)rating>4.0;
grunt> DUMP movies_greater_than_four;

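The above statements filter the alias movies and store the results in a new alias, movies_greater_than_four. The (float) cast converts the rating, which was loaded without a declared type, into a number for the comparison. The movies_greater_than_four alias will have only the records of movies where the rating is greater than 4.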
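
One detail worth knowing is what happens to movies with no rating. As far as I understand Pig's null handling, an empty field is loaded as null, a comparison against null is neither true nor false, and FILTER keeps only the rows where the condition is true, so unrated movies are dropped. The awk sketch below mirrors that behaviour outside Pig as a cross-check. It is not Pig code, and the four rows are hypothetical ones I made up for the example (the sample shown earlier has no rating above 4).

```shell
# Hypothetical rows for illustration only; they are not from movies_data.csv.
sample=$(mktemp)
cat > "$sample" <<'EOF'
101,Hypothetical High Rated,2001,4.5,5400
102,Hypothetical Low Rated,2002,3.1,6000
103,Hypothetical Unrated,2003,,4800
104,Hypothetical Exactly Four,2004,4.0,5100
EOF

# Keep a row only when the rating is present and strictly greater than 4.0,
# then print the movie name (field 2).
high_rated=$(awk -F',' '$4 != "" && $4 + 0 > 4.0 { print $2 }' "$sample")

echo "$high_rated"
rm -f "$sample"
```

Only the first row passes: the unrated row is skipped, and a rating of exactly 4.0 is not greater than 4. Keep in mind that this simple comma split, like PigStorage(','), would misread a movie name that itself contains a comma.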

The DUMP command only displays the results on standard output. If you need to store the data, you can use the following command, which writes the results as part files inside the given output directory:

grunt> store movies_greater_than_four into '/user/hduser/movies_greater_than_four';

In this post we got a good feel for Apache Pig. We loaded some data and executed some basic commands to query it. The next post will dive deeper into Pig Latin, where we will learn some more advanced techniques for data analysis.
