[HOW TO] Install Hadoop on Ubuntu/Linux Mint

Note: For a more complete tutorial, you can watch the screencast at Hadoop Screencasts – Installing Apache Hadoop

The following are the steps for installing Hadoop, listed with only brief explanations in places. These are more or less the reference notes I made when I was installing Hadoop on my system for the very first time.

Please let me know if you need any specific details.

Installing HDFS (Hadoop Distributed File System)
OS: Linux Mint (Ubuntu)

Installing Oracle Java on Linux (Mint/Ubuntu)

sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java7-installer
sudo update-java-alternatives -s java-7-oracle
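
To confirm the installation, check the Java version and note the install path (the webupd8team installer typically places Java under /usr/lib/jvm/java-7-oracle, which is the JAVA_HOME path used later in this post):

$ java -version
$ ls /usr/lib/jvm/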

Create hadoop user

$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser

Install an SSH server if one is not already present. This is needed because Hadoop does an ssh into localhost for execution.

$ sudo apt-get install openssh-server
$ su - hduser
$ ssh-keygen -t rsa -P ""
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
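
To verify that passwordless ssh works, try logging in to localhost (accept the host key fingerprint on the first connection); you should not be prompted for a password:

$ ssh localhost
$ exit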

Disable IPv6

$sudo gedit /etc/sysctl.conf

This command opens sysctl.conf in a text editor; copy the following lines to the end of the file:

#disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

Then reload the settings so they take effect:

$ sudo sysctl -p

To make sure that IPv6 is disabled, you can run the following command:

$cat /proc/sys/net/ipv6/conf/all/disable_ipv6
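
If IPv6 is disabled, this command prints 1. A value of 0 means the settings have not taken effect yet; re-run sudo sysctl -p or reboot.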

Installing Hadoop

Download Hadoop from the Apache downloads page.

$ wget http://www.eng.lsu.edu/mirrors/apache/hadoop/core/hadoop-0.22.0/hadoop-0.22.0.tar.gz
$ cd /home/hduser
$ tar xzf hadoop-0.22.0.tar.gz
$ mv hadoop-0.22.0 hadoop
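
If you downloaded or extracted the archive as a different user (for example via sudo), make sure hduser owns the directory; this step can be skipped if you ran the commands above as hduser:

$ sudo chown -R hduser:hadoop /home/hduser/hadoop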

Edit .bashrc

# Set Hadoop-related environment variables
export HADOOP_HOME=/home/hduser/hadoop
# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
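
Reload the file so the variables take effect in the current session (or simply open a new terminal):

$ source ~/.bashrc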

Update hadoop-env.sh

We only need to update the JAVA_HOME variable in this file. Open it in a text editor using the following command:

$vi /home/hduser/hadoop/conf/hadoop-env.sh

Add/update the following

export JAVA_HOME=/usr/lib/jvm/java-7-oracle

Temp directory for hadoop

$mkdir /home/hduser/tmp
$vi /home/hduser/hadoop/conf/core-site.xml

Then add the following configuration between the <configuration> .. </configuration> XML elements:

<!-- In: conf/core-site.xml -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hduser/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system. A URI whose
  scheme and authority determine the FileSystem implementation. The
  URI's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class. The URI's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

mapred-site.xml

Open hadoop/conf/mapred-site.xml in a text editor and add the following configuration values (as with core-site.xml):

<!-- In: conf/mapred-site.xml -->
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at. If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>

hdfs-site.xml

Open hadoop/conf/hdfs-site.xml using a text editor and add the following configurations:

<!-- In: conf/hdfs-site.xml -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified at create time.
  </description>
</property>

Formatting NameNode

You must format the NameNode of your HDFS before first use. Do not do this while the cluster is running; it is usually done only once, at installation time.

Run the following command

$/home/hduser/hadoop/bin/hadoop namenode -format
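
If the format succeeds, a fresh name directory should appear under the hadoop.tmp.dir configured earlier (the NameNode's storage directory defaults to ${hadoop.tmp.dir}/dfs/name), which you can confirm with:

$ ls /home/hduser/tmp/dfs/name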

Starting Hadoop Cluster

From hadoop/bin

./start-dfs.sh
./start-mapred.sh

Stopping Hadoop Cluster

From hadoop/bin

./stop-dfs.sh
./stop-mapred.sh

To check which processes are running, use:

$jps

or

$ ps -eaf | grep "java"

The following processes should be running (you can also check each daemon's web interface, listed after this):

NameNode
DataNode
SecondaryNameNode
JobTracker
TaskTracker
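
Each daemon serves a status page over HTTP. Assuming the default ports for this generation of Hadoop, you can check the following in a browser:

NameNode : http://localhost:50070
JobTracker : http://localhost:50030
TaskTracker : http://localhost:50060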

Example application to test that Hadoop works:

From hadoop/bin

$hadoop jar ../hadoop-mapred-examples-0.22.0.jar pi 3 10

This should complete successfully, printing job details and an estimated value of pi.
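
As a further sanity check, you can run the wordcount example from the same jar. This is a sketch assuming the HDFS paths /input and /output do not already exist; it uses Hadoop's own configuration files as sample input:

$ hadoop fs -mkdir /input
$ hadoop fs -put /home/hduser/hadoop/conf/*.xml /input
$ hadoop jar ../hadoop-mapred-examples-0.22.0.jar wordcount /input /output
$ hadoop fs -cat /output/part-r-00000

The last command prints each word from the input files along with its count.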

References:

http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/

http://mysolvedproblem.blogspot.com/2012/05/installing-hadoop-on-ubuntu-linux-on.html

Done !!

Comments

  1. fasholaide

    This article is quite detailed. I love it. Nonetheless, I’ve got a problem. Each time I start hadoop with the ./start-dfs.sh command, I’m prompted to enter the root password, which I do not have. Have you ever encountered this? Have you any solution to this? Thank you.

    1. Rohit Menon

      Hi,

      You need to add your ssh key to the authorized keys.
      After you install the openssh-server you can perform the following steps:

      $ su - hduser
      $ ssh-keygen -t rsa -P ""
      $ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

      Once you do this you should not be prompted for a password.
      I have added this to the instructions too.
      Thanks for bringing this up and making this article more useful.

  2. Rushabh

    Hi Rohit,

    Great effort. But I have one question: I need to set up a cluster, so there will be a master and slaves. The explanation you gave is probably for pseudo-distributed mode; can you tell me what changes I have to make to run this across multiple machines in a cluster?

  3. Sunny

    Hi Rohit,

    I need instructions for installing a single node cluster on Ubuntu. Which steps can I follow?

    1. Rohit Menon

      Hi Sunny,

      When all the Apache Hadoop services run on a single node (single computer) it is called a single node cluster.
      The instructions that you see on this post are the steps to install a single node cluster.

      Do let me know if you need any information.

  4. Rajesh

    Hi Rohit,

    thanks for sharing this knowledge.

    I have an installation issue. I successfully installed Hadoop as per your instructions. I ran "hadoop version" and it showed me the version and other info, as it should have.

    The issue happens when I open a new terminal window or log out and log back in.
    When I run "hadoop version" it says hadoop is not installed. It looks like I am missing something in the settings.

    I am using Ubuntu 10.10.

    Any help is appreciated.
    Rajesh

  5. Michal

    Thanks for this tutorial.

    However, for the sake of completeness, I have to add that I also needed to apt-get java-6-jdk (not only the JRE). Otherwise update-java-alternatives threw a lot of warnings 🙂

  6. Gabriela

    This is a really good tutorial; however, I’m having a hard time installing Hadoop. I followed all the instructions, and when I was done I tried to check the version of Hadoop and it gave me an error saying that hadoop is not a command. Do you have any ideas on how to fix this? During the installation I ran into some problems, like the .xml files weren’t in exactly the place you said they would be, and I also had some problems with the update alternatives. Let me know if you can help me 🙂

    1. Rohit Menon

      Hi Gabriela,

      What version of Apache Hadoop are you installing? In this tutorial I am using 0.22.0; if it is a newer version, the location of the files may have changed.
      I missed adding one line in the bashrc file.

      export PATH=$PATH:$HADOOP_HOME/bin

      Add this line to your bashrc file, and restart your terminal. This adds Hadoop to the path, and you should now be able to run the hadoop command.
      I have updated the blog post accordingly.

      What are the issues you are facing with update-alternatives?

      Thanks for trying out the steps and providing feedback to make this tutorial better.
      Do let me know if you need any information.

  7. Kos

    May I know whether you teach how to install Hadoop on Ubuntu Server 10.04? I am a student and this is my final year project. I hope you are able to help me. Thanks a lot.

      1. Kos

        Hi Rohit

        After I downloaded Hadoop 0.20.203.0rc1.tar.gz, I tried to untar it but it can’t be extracted. Please help.

        Thanks a lot

  8. vineet

    Hi Rohit,
    I am new to Hadoop. I tried to install it and it is done, but the issue is that I am able to see only the 50070 port from the browser; I am not able to reach 50060 and 50030, which means I am not able to run my JobTracker.

  9. mallik

    Hi Rohit,
    I am an intermediate learner of Hadoop; actually I am a research guy in the area of cloud computing, and meanwhile I got addicted to Hadoop due to its tremendous features. I am planning to take the Cloudera 470 certification, please help me; I am preparing for it now.
    Can you send me your personal mail and phone number if possible?
    Thanks Rohit

  10. JP

    hey…

    I’ve followed your tutorial but replaced the hadoop version with 1.1.2 and the java version with openjdk6.

    I have just two issues throughout all of this…

    When I call './start-dfs.sh', I have to call it as sudo (presumably because it creates dirs and doesn’t have the permissions otherwise); is this correct?

    Secondly, when I actually call this command I have to enter the root@localhost password. I haven’t set this password, and it doesn’t seem to be the password for the account I’m executing this from or for the hduser account. Is there a default password for this?

    Thanks in advance, total hadoop noob atm.

  11. Stefan

    Hey, the version you used is not available anymore, so I had to get version 0.23.9. After extracting, there is no conf directory and no conf/hadoop-env.sh. Where do I have to change the JAVA_HOME value?

      1. Stefan

        Thank you for your screencast. It doesn’t fix my problem. If I download and untar the archive hadoop-0.23.9.tar.gz,
        there is no conf directory and no hadoop-env.sh. There is no hadoop-env.sh in the etc directory either. If I search the untarred folder for hadoop-env.sh there are just two results:
        /hadoop/sbin/update-hadoop-env.sh
        /hadoop/share/hadoop/common/templates/conf/hadoop-env.sh
        Changing the JAVA_HOME variable in the template has no effect.
        Thanks for your support

  12. Pradeep

    Sorry Arun. Another user has asked the same question, so you can delete mine. By the way, do you have any guess where this file would be located in version 0.23.9?

  13. Ravi Khatana

    Hi Rohit, thanks for the very useful info. There is a lot of change between Hadoop 1.x and the upcoming Hadoop 2.x. Though I am a beginner, I tried to install the 2.0 beta version as per your instructions. I can install it fine, but in the end there are lots of changes. By any chance are you looking into the 2.0 version? Your instructions here were very detailed; do you have any study material?

  14. morteza

    Hi Rohit
    After this command I receive an error.
    hduser@ubuntu:~$ /home/hduser/hadoop/bin/hadoop namenode -format

    THIS ERROR:
    /home/hduser/hadoop/bin/hadoop: line 350: /usr/lib/jvm/java-6-sun/bin/java: No such file or directory
    /home/hduser/hadoop/bin/hadoop: line 434: /usr/lib/jvm/java-6-sun/bin/java: No such file or directory

    Can you help me please?
    Also, I saw in your video about the installation that when you edit .bashrc your file is empty, but my file is not empty.
    Is that a problem?

  15. Rohan Frederick

    Hi Rohit,

    I am getting the following while testing the pi example:

    java.lang.RuntimeException: java.net.ConnectException: Call to localhost/127.0.0.1:54310 failed on connection exception: java.net.ConnectException: Connection refused
    at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:567)
    at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:318)
    at org.apache.hadoop.examples.PiEstimator.estimate(PiEstimator.java:265)
    at org.apache.hadoop.examples.PiEstimator.run(PiEstimator.java:342)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.examples.PiEstimator.main(PiEstimator.java:351)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:64)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

    Not able to find anything on net. Would you be able to help.

    Thanks & Regards
    Rohan

  16. Pooja

    Hi,
    I am getting the following error every time I try:
    sudo tar xzf /home/dell/Desktop/hadoop-2.2.0.tar.gz

    gzip: stdin: unexpected end of file
    tar: Unexpected EOF in archive
    tar: Unexpected EOF in archive
    tar: Error is not recoverable: exiting now

    The link provided above returns a 404 error, so I tried using a modified link, which worked.
    Can you please guide me through this error?

  17. Stanley

    Hi Rohit, great tutorial. I have not had any issues with the deployment, except for running the start and stop dfs scripts; on my Hadoop install, these scripts are in the sbin subdirectory. I have one question though. I am not able to find start-mapred.sh and stop-mapred.sh in my Hadoop install. I am using the following version of Hadoop: hadoop-2.2.0.tar.gz. Instead, it has other shell scripts, start-yarn.sh and stop-yarn.sh. Not sure if this is correct.

    Since I cannot run start-mapred.sh, I am not able to see the JobTracker and TaskTracker running when I run the ps… command. Can you please help?

    1. Rohit Menon

      Hi Stanley,

      You are running Hadoop 2, where the JobTracker has been replaced by the ResourceManager and NodeManager. I don’t have a blog post on installing Hadoop 2 yet, but you should be able to Google it.

  18. bnsk

    Very nice article, Rohit. But can you please tell me how to run the libhdfs test programs after installing Hadoop? I keep getting a segmentation fault when I try to run these programs. Thank you.

  19. Surya

    Hi,
    This is Surya, new to Fedora but someone who loves LINUX.
    I use yum in my Linux to manage software, but I am not able to upgrade or install anything via yum because I always get the error message “Could not resolve base URL”. So I decided to configure yum and opened “/etc/yum.conf”, but even while logged in as the root user I get the error “Permission denied”. What am I to do? I run Fedora 17, and wherever I post this they keep telling me to upgrade my OS, but being a beginner I don’t know how to do that, nor am I interested as of now. All I want is to make my yum work better and slowly upgrade my OS later. I have had this problem for so long, so please look into this and give me an easy-to-understand and useful answer. I have referred to lots of other websites, help forums, and books, but all of those were in vain.
    So please help me out and help me learn more about LINUX.

  20. Deepa Deshpande

    Thanks Rohit. With your clear instructions I was able to put Hadoop on an Ubuntu 10 server very easily.

  21. Meenakshi Sundaram

    HI Rohit,

    Thanks a lot !

    As per your guidance, I have set up the single node Hadoop
    cluster and the services are up and running 🙂
    hduser@hadoop:~/hadoop/bin$ jps
    8533 SecondaryNameNode
    8330 DataNode
    8102 NameNode
    8967 Jps
    8654 JobTracker
    8871 TaskTracker

    I need to understand a few things… please help me here.

    1. Disable IPv6 – What’s the reason behind disabling IPv6? Should it be re-enabled after the services are up?

    2. While setting up mapred-site.xml, how did we know the port is 54311?

    mapred.job.tracker
    localhost:54311

    3. Similarly, how did we know about the 54310 port number? I could not understand this.

    fs.default.name
    hdfs://localhost:54310

  22. chandan Gautam

    Fantastic job… I installed Hadoop with few hurdles thanks to such an elaborate tutorial. Really good work. The tutorial on this page is very nice, but I was confused at one point by your video: without going to bin you formatted the namenode in the video, and I don’t know how that worked.

    But anyway.. you saved my day..

  23. Shubham

    Hi Rohit ,

    I followed your nice steps to install Hadoop, but with two changes: I used Java 7 and Hadoop 2.6.0.

    When I try to run any example, I get the error “not a valid JAR”. Can you please help/guide me?

    1. Rohit Menon

      Hi Shubham,

      Sorry for the late response. You would have to use the exact same steps and versions mentioned in the post. I have not tested other versions and there could be several reasons for your failure. So I am really not sure.
