Friday, February 24, 2012

Setup Cloudera Hadoop in combination with Oracle virtualization

One of the things Cloudera promotes is that it offers a very easy-to-use and easy-to-start distribution of Apache Hadoop. The Cloudera website has a download section where you can download CDH3.

"CDH consists of 100% open source Apache Hadoop plus nine other open source projects from the Hadoop ecosystem. CDH is thoroughly tested and certified to integrate with the widest range of operating systems and hardware, databases and data warehouses, and business intelligence and ETL systems."

You can deploy it in several ways, and the easiest one for people who are just starting to test Cloudera and Apache Hadoop is to use one of the pre-built virtual machines. Currently they are available for KVM, VMware and Oracle VirtualBox. Below is a very quick step-by-step guide on how you can start using the downloaded Cloudera distribution within Oracle VirtualBox. The reason for writing it: the guides that exist cover "old" versions of VirtualBox, and when I refer someone to a step-by-step guide I would like that guide to be accurate.

When you have downloaded the Cloudera distribution you will need to unpack the downloaded .tar.gz file as you would normally do and store the resulting .vmdk file (probably named cloudera-demo-vm.vmdk) at the location where you normally save your virtual machines.
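
If you prefer to script the unpacking, here is a minimal sketch using Python's standard tarfile module. The function name extract_vmdk and the paths in the usage note are my own placeholders, not something Cloudera ships.

```python
import tarfile
from pathlib import Path


def extract_vmdk(archive_path, dest_dir):
    """Unpack a .tar.gz archive and return the .vmdk files found in it.

    Hypothetical helper: archive_path and dest_dir are whatever you chose
    for the download and for your virtual machine folder.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(dest)
    # The disk image may sit in a subdirectory, so search recursively.
    return sorted(dest.rglob("*.vmdk"))
```

Calling extract_vmdk("downloaded-archive.tar.gz", "my-vms") (both names are placeholders) should return a list with the path of the .vmdk file you need in Step 4.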

Step 1:
Start VirtualBox and click the "New" button to start creating a new virtual machine.

Step 2:
Give your new virtual machine a name. In our case this was Cloudera_0. You also have to select an operating system and a version. In the screenshot below you can see I selected Debian 64-bit; this is wrong. It works, but the distribution officially used by Cloudera in this release is CentOS 5.7 64-bit with kernel version 2.6.18-274.17.1.el5.
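
The same step can also be done from the command line with VBoxManage, the CLI that ships with VirtualBox. Below is a rough sketch that only builds and prints the command. The OS type ID RedHat_64 is my assumption for a CentOS guest; run "VBoxManage list ostypes" to see the IDs your version knows.

```python
import shlex


def create_vm_command(name, os_type="RedHat_64"):
    """Build the VBoxManage call that creates and registers an empty VM.

    os_type is an assumption for a 64-bit CentOS guest; verify it with
    `VBoxManage list ostypes` on your own installation.
    """
    return ["VBoxManage", "createvm", "--name", name,
            "--ostype", os_type, "--register"]


# Print the command instead of running it, so it can be reviewed first.
print(shlex.join(create_vm_command("Cloudera_0")))
```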

Step 3:
You have to set the amount of memory. Cloudera claims you can run the system with 1 GB but recommends at least 2 GB to be able to start everything properly. In the screenshot below you can see I am using 2048 MB; however, I doubled that after playing with the system for some time, as more memory is quite convenient.
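
For those who script their setup, the memory setting maps to "VBoxManage modifyvm ... --memory". A small sketch follows, with the 1 GB minimum and 2 GB recommendation from above as constants; the function and constant names are my own.

```python
MIN_MEMORY_MB = 1024          # Cloudera says the VM can run with this
RECOMMENDED_MEMORY_MB = 2048  # recommended so everything starts properly


def memory_command(name, memory_mb=RECOMMENDED_MEMORY_MB):
    """Build the VBoxManage call that sets the VM's memory in MB."""
    if memory_mb < MIN_MEMORY_MB:
        raise ValueError(
            f"{memory_mb} MB is below the {MIN_MEMORY_MB} MB minimum")
    return ["VBoxManage", "modifyvm", name, "--memory", str(memory_mb)]


# 2048 MB as in the screenshot, then the doubled value I ended up using.
print(" ".join(memory_command("Cloudera_0")))
print(" ".join(memory_command("Cloudera_0", 4096)))
```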

Step 4:
Now it is time to select your hard disk. For this you have to select the .vmdk file, which contains the complete Cloudera distribution with Apache Hadoop. There is no need to create a new disk.
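
On the command line, attaching an existing disk takes two VBoxManage calls: one to add a storage controller and one to attach the .vmdk to it. The sketch below only builds the commands. The IDE controller type and the controller name are my assumptions, so use whatever bus the image boots from on your machine.

```python
def attach_disk_commands(name, vmdk_path, controller="IDE Controller"):
    """Build the two VBoxManage calls that attach an existing .vmdk.

    The controller name and the IDE bus type are assumptions, not
    something prescribed by Cloudera.
    """
    add_controller = ["VBoxManage", "storagectl", name,
                      "--name", controller, "--add", "ide"]
    attach = ["VBoxManage", "storageattach", name,
              "--storagectl", controller,
              "--port", "0", "--device", "0",
              "--type", "hdd", "--medium", str(vmdk_path)]
    return [add_controller, attach]


for command in attach_disk_commands("Cloudera_0", "cloudera-demo-vm.vmdk"):
    print(" ".join(command))
```

No new disk is created here; the existing .vmdk is simply used as the medium.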

Step 5:
Now you will see a summary of your choices, and when you select "Create" your virtual machine will be created.

Step 6:
Your virtual machine is now created. When you select the newly created Cloudera virtual machine and start it, you will see the system boot, and in no time you will have your first Cloudera instance up and running.
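
Starting the machine can be scripted too, with "VBoxManage startvm". Here is a minimal sketch; the helper run_if_available is my own addition and only executes a command when its program is actually found on the PATH.

```python
import shutil
import subprocess


def start_vm_command(name, headless=False):
    """Build the VBoxManage call that boots the VM, with or without a window."""
    frontend = "headless" if headless else "gui"
    return ["VBoxManage", "startvm", name, "--type", frontend]


def run_if_available(command):
    """Run the command only if its program is on PATH; return True if it ran."""
    if shutil.which(command[0]) is None:
        print("Not found on PATH, would have run:", " ".join(command))
        return False
    subprocess.run(command, check=True)
    return True


# Print the command for review; pass it to run_if_available() to execute it.
print(" ".join(start_vm_command("Cloudera_0")))
```

Calling run_if_available(start_vm_command("Cloudera_0")) boots the machine in a normal window; pass headless=True if you only want to reach it over the network.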
