Apache Sqoop

Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.- from http://sqoop.apache.org/

In this post we will get hands on with Sqoop and perform imports and exports of data to and from HDFS repectively.┬áSo let’s get started.

Installation

Download Apache Sqoop using the following command:

$ wget http://mirrors.gigenet.com/apache/sqoop/1.4.4/sqoop-1.4.4.bin__hadoop-1.0.0.tar.gz

Untar the file:

$ tar xvzf sqoop-1.4.4.bin__hadoop-1.0.0.tar.gz

Rename the folder to something simpler:

 mv sqoop-1.4.4.bin__hadoop-1.0.0 sqoop

Make sure the following variables are set for the session. You can add them to your .bashrc file

export HADOOP_COMMON_HOME=/home/hduser/hadoop
export HADOOP_MAPRED_HOME=/home/hduser/hadoop

I just pointed them to the Hadoop home directory location.

Importing data from MySql

For this post I have a database by the name, sqoop_test. The database has a table by the name, movies which contains two columns : movie and rating.
I would like to import the data to HDFS using Sqoop.

mysql> select * from movies;
+--------------------------------+--------+
| movie | rating |
+--------------------------------+--------+
| The Nightmare Before Christmas | 3.4 |
| The Mummy | 2.6 |
| Orphans of the Storm | 4.5 |
| The Object of Beauty | 3 |
| Night Tide | 2 |
| One Magic Christmas | 1.2 |
| Nosferatu: Original Version | 2.3 |
| Nick of Time | 3.6 |
+--------------------------------+--------+
8 rows in set (0.00 sec)

Before we import we will need to get the Mysql JDBC driver from here.

I downloaded mysql-connector-java-5.1.28-bin.jar and placed it under /home/hduser/sqoop/lib/

To import the data use the following command:

$ sqoop import --connect jdbc:mysql://localhost/sqoop_test --table movies -m 1

Output:

13/12/24 23:21:22 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
13/12/24 23:21:22 INFO tool.CodeGenTool: Beginning code generation
13/12/24 23:21:22 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `movies` AS t LIMIT 1
13/12/24 23:21:22 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `movies` AS t LIMIT 1
13/12/24 23:21:22 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /home/hduser/hadoop
Note: /tmp/sqoop-hduser/compile/1120535a7891cfa30210e48b3ed06237/movies.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
13/12/24 23:21:23 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hduser/compile/1120535a7891cfa30210e48b3ed06237/movies.jar
13/12/24 23:21:23 WARN manager.MySQLManager: It looks like you are importing from mysql.
13/12/24 23:21:23 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
13/12/24 23:21:23 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
13/12/24 23:21:23 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
13/12/24 23:21:23 INFO mapreduce.ImportJobBase: Beginning import of movies
13/12/24 23:21:25 INFO mapred.JobClient: Running job: job_201312241604_0001
13/12/24 23:21:26 INFO mapred.JobClient: map 0% reduce 0%
13/12/24 23:21:32 INFO mapred.JobClient: map 100% reduce 0%
13/12/24 23:21:32 INFO mapred.JobClient: Job complete: job_201312241604_0001
13/12/24 23:21:32 INFO mapred.JobClient: Counters: 18
13/12/24 23:21:32 INFO mapred.JobClient: Job Counters
13/12/24 23:21:32 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=4705
13/12/24 23:21:32 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/12/24 23:21:32 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/12/24 23:21:32 INFO mapred.JobClient: Launched map tasks=1
13/12/24 23:21:32 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
13/12/24 23:21:32 INFO mapred.JobClient: File Output Format Counters
13/12/24 23:21:32 INFO mapred.JobClient: Bytes Written=183
13/12/24 23:21:32 INFO mapred.JobClient: FileSystemCounters
13/12/24 23:21:32 INFO mapred.JobClient: HDFS_BYTES_READ=87
13/12/24 23:21:32 INFO mapred.JobClient: FILE_BYTES_WRITTEN=60921
13/12/24 23:21:32 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=183
13/12/24 23:21:32 INFO mapred.JobClient: File Input Format Counters
13/12/24 23:21:32 INFO mapred.JobClient: Bytes Read=0
13/12/24 23:21:32 INFO mapred.JobClient: Map-Reduce Framework
13/12/24 23:21:32 INFO mapred.JobClient: Map input records=8
13/12/24 23:21:32 INFO mapred.JobClient: Physical memory (bytes) snapshot=35909632
13/12/24 23:21:32 INFO mapred.JobClient: Spilled Records=0
13/12/24 23:21:32 INFO mapred.JobClient: CPU time spent (ms)=330
13/12/24 23:21:32 INFO mapred.JobClient: Total committed heap usage (bytes)=16252928
13/12/24 23:21:32 INFO mapred.JobClient: Virtual memory (bytes) snapshot=345686016
13/12/24 23:21:32 INFO mapred.JobClient: Map output records=8
13/12/24 23:21:32 INFO mapred.JobClient: SPLIT_RAW_BYTES=87
13/12/24 23:21:32 INFO mapreduce.ImportJobBase: Transferred 183 bytes in 8.3529 seconds (21.9085 bytes/sec)
13/12/24 23:21:32 INFO mapreduce.ImportJobBase: Retrieved 8 records.

As per the logs all 8 records form the table have been retrieved. Let’s look at it from HDFS:

$ hadoop fs -cat /user/hduser/movies/

The Nightmare Before Christmas,3.4
The Mummy,2.6
Orphans of the Storm,4.5
The Object of Beauty,3
Night Tide,2
One Magic Christmas,1.2
Nosferatu: Original Version,2.3
Nick of Time,3.6

We have successfully imported data from MySql to HDFS.

Exporting data

Let’s use Sqoop to export data to MySql.

The data I would like to import looks as follows:

The Shawshank Redemption,3.4
Rockstar,2.6
The Rear Window,4.5
Beauty and the Beast,3
Galdiator,3
Nowhere to Run,3.2
Fargo,3.3
Next,3.6

The filename is movies_export.csv and is under the location /user/hduser/exports/

I have a table movies_export with as follows:

mysql> desc movies_export;
+--------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------+--------------+------+-----+---------+-------+
| movie | varchar(255) | YES | | NULL | |
| rating | varchar(255) | YES | | NULL | |
+--------+--------------+------+-----+---------+-------+
2 rows in set (0.00 sec)

As of now there are no rows in the table:

mysql> select * from movies_export;
Empty set (0.00 sec)

Use the following command to export the data to MySql:

$ sqoop export --connect jdbc:mysql://localhost/sqoop_test --table movies_export --export-dir '/user/hduser/exports' -m 1;

Output:

13/12/24 23:38:50 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
13/12/24 23:38:50 INFO tool.CodeGenTool: Beginning code generation
13/12/24 23:38:50 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `movies_export` AS t LIMIT 1
13/12/24 23:38:50 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `movies_export` AS t LIMIT 1
13/12/24 23:38:50 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /home/hduser/hadoop
Note: /tmp/sqoop-hduser/compile/2b3381307e097e38b60b0e204c1b7a68/movies_export.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
13/12/24 23:38:51 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hduser/compile/2b3381307e097e38b60b0e204c1b7a68/movies_export.jar
13/12/24 23:38:51 INFO mapreduce.ExportJobBase: Beginning export of movies_export
13/12/24 23:38:52 INFO input.FileInputFormat: Total input paths to process : 1
13/12/24 23:38:52 INFO input.FileInputFormat: Total input paths to process : 1
13/12/24 23:38:52 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/12/24 23:38:52 WARN snappy.LoadSnappy: Snappy native library not loaded
13/12/24 23:38:52 INFO mapred.JobClient: Running job: job_201312241604_0002
13/12/24 23:38:53 INFO mapred.JobClient: map 0% reduce 0%
13/12/24 23:38:57 INFO mapred.JobClient: map 100% reduce 0%
13/12/24 23:38:57 INFO mapred.JobClient: Job complete: job_201312241604_0002
13/12/24 23:38:57 INFO mapred.JobClient: Counters: 18
13/12/24 23:38:57 INFO mapred.JobClient: Job Counters
13/12/24 23:38:57 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=3536
13/12/24 23:38:57 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/12/24 23:38:57 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/12/24 23:38:57 INFO mapred.JobClient: Launched map tasks=1
13/12/24 23:38:57 INFO mapred.JobClient: Data-local map tasks=1
13/12/24 23:38:57 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
13/12/24 23:38:57 INFO mapred.JobClient: File Output Format Counters
13/12/24 23:38:57 INFO mapred.JobClient: Bytes Written=0
13/12/24 23:38:57 INFO mapred.JobClient: FileSystemCounters
13/12/24 23:38:57 INFO mapred.JobClient: HDFS_BYTES_READ=279
13/12/24 23:38:57 INFO mapred.JobClient: FILE_BYTES_WRITTEN=60753
13/12/24 23:38:57 INFO mapred.JobClient: File Input Format Counters
13/12/24 23:38:57 INFO mapred.JobClient: Bytes Read=0
13/12/24 23:38:57 INFO mapred.JobClient: Map-Reduce Framework
13/12/24 23:38:57 INFO mapred.JobClient: Map input records=8
13/12/24 23:38:57 INFO mapred.JobClient: Physical memory (bytes) snapshot=35356672
13/12/24 23:38:57 INFO mapred.JobClient: Spilled Records=0
13/12/24 23:38:57 INFO mapred.JobClient: CPU time spent (ms)=300
13/12/24 23:38:57 INFO mapred.JobClient: Total committed heap usage (bytes)=16252928
13/12/24 23:38:57 INFO mapred.JobClient: Virtual memory (bytes) snapshot=345022464
13/12/24 23:38:57 INFO mapred.JobClient: Map output records=8
13/12/24 23:38:57 INFO mapred.JobClient: SPLIT_RAW_BYTES=141
13/12/24 23:38:57 INFO mapreduce.ExportJobBase: Transferred 279 bytes in 5.7845 seconds (48.2327 bytes/sec)
13/12/24 23:38:57 INFO mapreduce.ExportJobBase: Exported 8 records.

As per the logs, 8 records from the file have been exported to MySql. Let’s verify it by querying the table:

mysql> select * from movies_export;
+--------------------------+--------+
| movie | rating |
+--------------------------+--------+
| The Shawshank Redemption | 3.4 |
| Rockstar | 2.6 |
| The Rear Window | 4.5 |
| Beauty and the Beast | 3 |
| Galdiator | 3 |
| Nowhere to Run | 3.2 |
| Fargo | 3.3 |
| Next | 3.6 |
+--------------------------+--------+
8 rows in set (0.00 sec)

We have successfully exported data from HDFS to MySql using Sqoop.

This was just an introduction. Sqoop sports tons of features and to read more about it you can read the official Sqoop documentation at : http://sqoop.apache.org/

Enjoy !

3 Comments Apache Sqoop

  1. Sarab Polakam

    Excellent tutorials. All most everything worked for me and learned in no time. Thank you very much for putting these together.

    Reply
  2. raijv

    HI Rohit ,

    Your tutorials are simply awesome , keep doing good job .

    I have got sense of all these frame works in just a couple of hours

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>