Here I test-drive the process of automatically configuring a Hadoop cluster of virtual machines running in fully distributed mode.
First of all, make sure you have Ruby installed; I'm testing with Ruby 1.9.3. You should also have VirtualBox installed; I have version 4.1.
Then from the command line install the vagrant gem:
gem install vagrant
Vagrant is a great tool that allows us to manage our VirtualBox machines from the command line using simple configuration files.
First we will install an Ubuntu Linux virtual machine (or "box", as it is called in Vagrant):
vagrant box add base-hadoop http://files.vagrantup.com/lucid64.box
Then we go to the directory we want to use as our workspace (this is also where the Vagrant configuration for our new box will live) and execute the following. This creates a Vagrantfile containing the Vagrant configuration.
vagrant init base-hadoop
The virtual machine is now ready to be started. You can start it by doing:

vagrant up
The virtual machine is now running. You can connect to it over SSH by typing:

vagrant ssh
The next step is to install Puppet. You can download it from http://puppetlabs.com/misc/download-options/
Puppet is a tool that allows us to automate the provisioning of servers. We will use it to manage our virtual machines, install the required software on them, and run the required services.
So we create a directory where we are going to put our manifests (Puppet configuration files); a directory named manifests is what Vagrant looks for by default. In that new directory we create a file called base-hadoop.pp with the following content:
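A minimal starting point for base-hadoop.pp could look like the following (a sketch, not necessarily the exact original: the puppet group resource is a common workaround for the stock lucid64 box, and the exec refreshes the apt package index):

```puppet
# Ensure the "puppet" group exists (the stock lucid64 box can miss it).
group { "puppet":
  ensure => "present",
}

# Refresh the package index so later package resources can resolve.
exec { "apt-get update":
  command => "/usr/bin/apt-get update",
}
```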
In the Vagrantfile that was created previously, we uncomment the lines that look like this:
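In a Vagrant 1.x Vagrantfile, the commented-out Puppet provisioner section, once uncommented and pointed at our manifest, looks roughly like this:

```ruby
config.vm.provision :puppet do |puppet|
  puppet.manifests_path = "manifests"
  puppet.manifest_file  = "base-hadoop.pp"
end
```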
The next thing we need to do is tell Puppet to install Java on our servers. For that we open the base-hadoop.pp file and add the following:
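On the lucid64 (Ubuntu 10.04) box a simple option is OpenJDK 6; a sketch of the addition (the exact package name is an assumption):

```puppet
# Install a JDK; Hadoop needs Java to run.
package { "openjdk-6-jdk":
  ensure => "installed",
}
```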
Next we need to install Hadoop. For this we will create a new Puppet module. A Puppet module encapsulates resources that belong to the same component:
mkdir -p modules/hadoop/manifests
Then we create an init.pp in this new manifests directory with the following content:
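Based on the description below, a sketch of this file (the Hadoop version 1.0.3 and /opt install path come from later in this post; the download URL is an assumption) could be:

```puppet
class hadoop {
  $hadoop_home = "/opt/hadoop-1.0.3"

  # Download the Hadoop binaries from an Apache mirror, unless already installed.
  exec { "download_hadoop":
    command => "wget -O /tmp/hadoop.tar.gz http://archive.apache.org/dist/hadoop/core/hadoop-1.0.3/hadoop-1.0.3.tar.gz",
    path    => "/usr/bin:/bin",
    unless  => "test -d ${hadoop_home}",
  }

  # Extract the tarball into /opt, creating $hadoop_home.
  exec { "unpack_hadoop":
    command => "tar -zxf /tmp/hadoop.tar.gz -C /opt",
    path    => "/usr/bin:/bin",
    creates => $hadoop_home,
    require => Exec["download_hadoop"],
  }
}
```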
We have done a few things here, and they are almost self-explanatory: we set a variable pointing to our Hadoop installation directory, download the Hadoop binaries from their Apache location, and extract them into the specified hadoop_home directory.
We need to add our new module to the main Puppet configuration file, so we add the following line at the top of the base-hadoop.pp file:
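Presumably this is the standard include of the module's main class:

```puppet
include hadoop
```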
Then we add this new modules path to our Vagrantfile, so that the puppet section now looks like this:
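With the module path added, the provisioner block would look something like:

```ruby
config.vm.provision :puppet do |puppet|
  puppet.manifests_path = "manifests"
  puppet.manifest_file  = "base-hadoop.pp"
  puppet.module_path    = "modules"
end
```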
We execute the following to reload the Vagrant machine:

vagrant reload

That command reloads the virtual machine and runs the Puppet manifests, installing the required software.
We will need a cluster of virtual machines, and Vagrant supports that. We open our Vagrantfile and replace its contents with the following:
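A multi-machine Vagrantfile along these lines (the machine names other than master are assumptions; the IPs are the ones used later in this post) might be:

```ruby
Vagrant::Config.run do |config|
  config.vm.box = "base-hadoop"

  # One master and three slaves, each on a host-only network IP,
  # all provisioned with the same Puppet manifest.
  { :master  => "192.168.1.10",
    :hadoop1 => "192.168.1.12",
    :hadoop2 => "192.168.1.13",
    :hadoop3 => "192.168.1.14" }.each do |name, ip|
    config.vm.define name do |node|
      node.vm.network :hostonly, ip
      node.vm.provision :puppet do |puppet|
        puppet.manifests_path = "manifests"
        puppet.manifest_file  = "base-hadoop.pp"
        puppet.module_path    = "modules"
      end
    end
  end
end
```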
After this we execute:

vagrant up

That will start and provision all the servers; it will take a while.
But we are not done yet: we still need to configure the Hadoop cluster. In the directory modules/hadoop we create another directory called files, where we will put the configuration files needed by our Hadoop cluster.
We create the following files. First, the slaves file, listing the slave nodes one per line:

192.168.1.12
192.168.1.13
192.168.1.14
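For a Hadoop 1.x cluster the conf directory would also need, at a minimum, a masters file listing the master node (192.168.1.10), plus core-site.xml and mapred-site.xml pointing the filesystem and job tracker at the master. A sketch of those two XML files (the port numbers are conventional Hadoop 1.x defaults, not taken from the original):

```xml
<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.1.10:9000</value>
  </property>
</configuration>

<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>192.168.1.10:9001</value>
  </property>
</configuration>
```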
We then need to tell Puppet to copy these files to our cluster, so we modify the init.pp file in the hadoop Puppet module to contain the following:
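Puppet serves files placed under a module's files directory at puppet:///modules/&lt;module&gt;/&lt;file&gt;; a sketch of the added resources inside the hadoop class (the exact file list is an assumption):

```puppet
# Copy the cluster configuration files into Hadoop's conf directory.
file { "${hadoop_home}/conf/slaves":
  source => "puppet:///modules/hadoop/slaves",
  mode   => "644",
  owner  => "root",
  group  => "root",
}

file { "${hadoop_home}/conf/masters":
  source => "puppet:///modules/hadoop/masters",
  mode   => "644",
  owner  => "root",
  group  => "root",
}
```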
We then execute vagrant provision to re-run the provisioners, and the files get copied to all our servers.
We need to set up passwordless SSH between our servers. We modify our base-hadoop.pp and leave it like this:
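One common approach is to generate a key pair once on the host machine, drop it into the module's files directory, and have Puppet install the same key on every node; a sketch (the file names and the vagrant user are assumptions):

```puppet
# Give every node the same private/public key pair...
file { "/home/vagrant/.ssh/id_rsa":
  source => "puppet:///modules/hadoop/id_rsa",
  mode   => "600",
  owner  => "vagrant",
  group  => "vagrant",
}

file { "/home/vagrant/.ssh/id_rsa.pub":
  source => "puppet:///modules/hadoop/id_rsa.pub",
  mode   => "644",
  owner  => "vagrant",
  group  => "vagrant",
}

# ...and authorize that key, so every node can ssh into every other one.
ssh_authorized_key { "hadoop_ssh_key":
  ensure => "present",
  user   => "vagrant",
  type   => "ssh-rsa",
  key    => "AAAA...",  # the public key body from id_rsa.pub (elided here)
}
```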
We are now ready to run our Hadoop cluster. For that, once again we modify the init.pp file in the hadoop Puppet module, adding the following at the end, just before closing the hadoop class:
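The addition is presumably a file resource that ships our edited hadoop-env.sh to each node:

```puppet
# Install our hadoop-env.sh (with JAVA_HOME set) into the conf directory.
file { "${hadoop_home}/conf/hadoop-env.sh":
  source => "puppet:///modules/hadoop/hadoop-env.sh",
  mode   => "644",
  owner  => "root",
  group  => "root",
}
```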
The hadoop-env.sh file is the original one shipped with Hadoop, but with the JAVA_HOME setting uncommented and pointed at the correct Java installation.
We can give a different host name to each machine in the Vagrantfile. For that we replace its contents with the following:
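The same multi-machine configuration, now with a host name set for each node, would look roughly like this (again, the machine names other than master are assumptions):

```ruby
Vagrant::Config.run do |config|
  config.vm.box = "base-hadoop"

  { :master  => "192.168.1.10",
    :hadoop1 => "192.168.1.12",
    :hadoop2 => "192.168.1.13",
    :hadoop3 => "192.168.1.14" }.each do |name, ip|
    config.vm.define name do |node|
      node.vm.host_name = name.to_s   # sets the hostname inside the guest
      node.vm.network :hostonly, ip
      node.vm.provision :puppet do |puppet|
        puppet.manifests_path = "manifests"
        puppet.manifest_file  = "base-hadoop.pp"
        puppet.module_path    = "modules"
      end
    end
  end
end
```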
Let’s do “vagrant reload” and wait for all systems to reload.
Our systems are now provisioned. Let's go to our master node and start everything:
vagrant ssh master
Then, once we are logged in, we go to /opt/hadoop-1.0.3/bin and format the namenode:
sudo ./hadoop namenode -format
After formatting the namenode we start all the Hadoop daemons (start-all.sh lives in the same bin directory):

sudo ./start-all.sh

Our hadoop cluster is now running. We can visit http://192.168.1.10:50070/ to reach the NameNode web interface on our master node and see that our hadoop cluster is indeed running.
All the files for this example (except for the box itself) are available for free use at git@github.com:calo81/vagrant-hadoop-cluster.git