I just completed the Cloudera Developer Training for Apache Hadoop course, which was held in Denver, CO, USA, from 4th Dec 2012 to 7th Dec 2012.
This program prepares you for the following certification:
The course is spread over 4 days and gives you a complete understanding of the Hadoop system (HDFS & MapReduce), along with a few tools that are part of the Hadoop ecosystem.
- The Motivation for Hadoop
- Hadoop: Basic Concepts
This was a good start: we were given a detailed walkthrough of the HDFS architecture and the reasons why the system was built this way. We also looked at a few practical, real-world problems and how Hadoop comes to the rescue. A lot of emphasis was placed on each of the basic daemons that run within Hadoop and their respective responsibilities. The pace on the first day was very slow, which helped everyone who was completely new to Hadoop get a good grasp of the basic concepts and left time for all doubts to be clarified in detail.
- Writing a MapReduce Program
- Unit Testing MapReduce Programs
- Delving Deeper into the Hadoop API
This day introduced us to how a MapReduce program runs and how such programs can be coded and executed within the Hadoop environment. All MapReduce examples were in Java. However, there was also a brief discussion of how other languages can be used for MapReduce development (streaming). The language for the streaming example was Python. A hands-on session on using MRUnit (which works much like JUnit) was very useful in understanding how simple it is to unit test MapReduce code. All MapReduce code was written in the Eclipse IDE, which I personally found very useful.
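To give a feel for the streaming idea, here is a minimal word count sketch in Python. It is not the code from the course exercise. The function names (`mapper`, `reducer`, `run_local`) and the sample input are my own assumptions, and the sort step only imitates the shuffle that Hadoop performs between the map and reduce phases. In a real Hadoop Streaming job, the mapper and reducer would be separate scripts that read lines from standard input and write tab-separated key/value lines to standard output.

```python
from itertools import groupby


def mapper(lines):
    """Emit a (word, 1) pair for every word in the input lines."""
    for line in lines:
        for word in line.strip().lower().split():
            yield word, 1


def reducer(pairs):
    """Sum the counts per word; pairs must arrive sorted by key."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)


def run_local(lines):
    """Stand-in for the framework: map, sort by key (the shuffle), reduce."""
    shuffled = sorted(mapper(lines), key=lambda kv: kv[0])
    return dict(reducer(shuffled))


# Hypothetical sample input, only for illustration.
sample = ["the quick brown fox", "the lazy dog", "The end"]
counts = run_local(sample)

for word, total in sorted(counts.items()):
    print(f"{word}\t{total}")
```

The reducer relies on its input being grouped by key, which is exactly the guarantee the framework gives you after the shuffle.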
- Practical Development Tips and Techniques
- Data Input and Output
- Common MapReduce Algorithms
A simple demo showed how to use Eclipse to run and debug MapReduce programs in LocalJobRunner mode. The exercise on using counters in MapReduce also gave good insight into how counters are handled within Hadoop. The most useful exercise of the day was using SequenceFiles along with compression, which is a handy way to merge a large number of small files. A complete module on some common MapReduce algorithms was also covered. I felt this part of the course could have gone into a bit more detail.
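The counters idea can be sketched in a few lines of Python. This is my own illustration and not the course exercise (which was in Java): the group name `RecordQuality`, the counter names, and the helper `increment_counter` are all hypothetical. The sketch assumes a streaming-style mapper, where a counter is updated by writing a line of the form `reporter:counter:<group>,<counter>,<amount>` to standard error. A local tally is kept as well so the result can be inspected without a cluster.

```python
import sys
from collections import Counter

# Local tally so the sketch can be checked without running on Hadoop.
local_counters = Counter()


def increment_counter(group, name, amount=1, stream=sys.stderr):
    """Record a counter update locally and in the streaming reporter format."""
    local_counters[(group, name)] += amount
    stream.write(f"reporter:counter:{group},{name},{amount}\n")


def mapper(lines):
    """Pass through well-formed key<TAB>value records and count the rest."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            increment_counter("RecordQuality", "MALFORMED")
            continue
        increment_counter("RecordQuality", "VALID")
        yield fields[0], fields[1]


# Hypothetical sample input with one bad record.
sample = ["a\t1", "broken line", "b\t2"]
pairs = list(mapper(sample))
```

Counting malformed records this way is a cheap health check on the input data, since the totals show up alongside the job's built-in counters once the job finishes.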
- Joining Data Sets in MapReduce
- Integrating Hadoop into the Enterprise Workflow
- Machine Learning and Mahout
- An Introduction to Hive and Pig
- An Introduction to Oozie
The day started with a detailed discussion on performing joins using MapReduce. I am not sure if anyone would really do joins in MapReduce, but it was good to understand the complexity involved. The rest of the day was devoted to some of the tools that are part of the Hadoop ecosystem, such as Sqoop, Mahout, Hive and Pig. There were hands-on exercises for these tools; however, the last tool, Oozie, was left as a homework exercise. The exercises on Hive and Pig make it clear that one should use these tools to perform joins rather than raw MapReduce.
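To show where the complexity comes from, here is a rough sketch of a reduce-side inner join, written in plain Python rather than against the Hadoop API. The data sets (`customers`, `orders`), their fields, and all function names are assumptions of mine, not the course material. Each mapper tags its records with their source, the grouping step stands in for the shuffle, and the reducer pairs up the two sides for each key.

```python
from collections import defaultdict


def map_customers(records):
    """Tag each customer record with its source, keyed by customer id."""
    for cust_id, name in records:
        yield cust_id, ("customer", name)


def map_orders(records):
    """Tag each order record with its source, keyed by customer id."""
    for order_id, cust_id, amount in records:
        yield cust_id, ("order", (order_id, amount))


def reduce_join(key, tagged_values):
    """Pair every customer with every order that shares the same key."""
    names = [v for tag, v in tagged_values if tag == "customer"]
    orders = [v for tag, v in tagged_values if tag == "order"]
    for name in names:
        for order_id, amount in orders:
            yield key, name, order_id, amount


def run_join(customers, orders):
    """Stand-in for the framework: map both inputs, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in list(map_customers(customers)) + list(map_orders(orders)):
        groups[key].append(value)
    results = []
    for key in sorted(groups):
        results.extend(reduce_join(key, groups[key]))
    return results


# Hypothetical sample data: customer 2 has no orders, order 102 has no customer.
customers = [(1, "Ana"), (2, "Raj")]
orders = [(100, 1, 25.0), (101, 1, 40.0), (102, 3, 10.0)]
joined = run_join(customers, orders)
```

Even in this toy form you have to handle tagging, grouping and pairing yourself, whereas in Hive the same result is a single query with a `JOIN` clause.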
After the completion of the course, Jesse (the instructor) spoke about his Million Monkeys project. This was very interesting. You should definitely check it out.
About the Instructor
Jesse Anderson is an Instructor and Curriculum Developer at Cloudera.
He was very clear almost all the time and kept both the people from a programming background and the non-programmers engaged. He often reiterated important points and made sure to brush up on a few concepts time and again. Overall, a fantastic instructor.
A useful tip that he gave was to get a good grip on Regular Expressions.
Do take a look at his website; he has tons of information there:
Who should take the course?
1. Anyone who would like to get into Hadoop, either as a developer or as a solutions designer. This course is definitely a starting point to explore Hadoop’s possibilities.
2. Anyone who is evaluating Hadoop to see if it really fits their organization’s needs and whether it fits into their ecosystem.
Would I recommend taking this course?
Yes. Before attending the course, I too had my doubts, especially considering the course fee ($2995 at the time I took it), which I paid for myself. However, this course has definitely brought to light a lot of things that I would not have picked up just by reading books and the internet. The course also gives you a voucher code to appear for the certification exam.
Before attending the course, I had gone through the Hadoop Tutorial from Yahoo Developer Network, which allowed me to grasp the concepts with ease. I highly recommend it.
Do let me know if you need any information.