carlo scarioni: Setting up a Hadoop virtual cluster with Vagrant

Usually, when I want to test something in a virtual machine, I go online, download the ISO image of the OS I want to install, start VirtualBox, tell it to boot from the ISO, install the OS manually, and then install the applications I want to use. It is a boring and tedious process, but I never really cared very much about it. Recently, however, I discovered the power of Vagrant and Puppet. They allow me to automate all the steps I used to do by hand.

Here I test drive the process of automatically configuring a Hadoop cluster in virtual machines, in fully distributed mode.

First of all, make sure you have Ruby installed. I’m testing with Ruby 1.9.3. You should also have VirtualBox installed; I have version 4.1.

Then from the command line install the vagrant gem:

gem install vagrant

Vagrant is a great tool that allows us to manage our VirtualBox machines using the command line and simple configuration files.

First we will install an Ubuntu Linux virtual machine (or a box, as it is called in Vagrant):

vagrant box add base-hadoop http://files.vagrantup.com/lucid64.box

Then we go to the directory we want to use as our “workspace”, which will also hold the Vagrant configuration file for our new box, and execute the following command. It creates a file called Vagrantfile with the Vagrant configuration.

vagrant init base-hadoop

The virtual machine is ready to be started up now. You can start it by doing:

vagrant up

The virtual machine is now running. You can connect to it with ssh. Type:

vagrant ssh

The next step is to download Puppet. Do that by going to the URL http://puppetlabs.com/misc/download-options/

Puppet is a tool that allows us to automate the process of provisioning servers. We will use it to manage our virtual machines, installing the required software on them and starting the required services.

So we create a directory for our manifests (Puppet configuration files):

mkdir manifests

In that new directory we create a file called base-hadoop.pp with the following content:

group { "puppet":
  ensure => "present",
}

In the Vagrantfile file that got created previously we uncomment the lines that look like:

config.vm.provision :puppet do |puppet|
  puppet.manifests_path = "manifests"
  puppet.manifest_file  = "base-hadoop.pp"
end

The next thing we need to do is tell Puppet to install Java on our servers. For that we open the base-hadoop.pp file and add the following:

# Refresh the package index first. The command is given with its full path
# because this exec has no path attribute.
exec { 'apt-get update':
  command => '/usr/bin/apt-get update',
}

# Install the JDK only after the package index has been refreshed.
package { "openjdk-6-jdk":
  ensure  => present,
  require => Exec['apt-get update'],
}

Next we need to install Hadoop. For this we will create a new Puppet module. A Puppet module encapsulates resources that belong to the same component.

We execute

mkdir -p modules/hadoop/manifests

Then we create an init.pp in this new manifests directory with the following content:

class hadoop {
  $hadoop_home = "/opt/hadoop"

  # Download the tarball unless Hadoop is already present under /opt.
  exec { "download_hadoop":
    command => "wget -O /tmp/hadoop.tar.gz http://apache.mirrors.timporter.net/hadoop/common/hadoop-1.0.3/hadoop-1.0.3.tar.gz",
    path    => $path,
    unless  => "ls /opt | grep hadoop-1.0.3",
    require => Package["openjdk-6-jdk"]
  }

  # Unpack into /opt; skipped once /opt/hadoop-1.0.3 exists.
  exec { "unpack_hadoop":
    command => "tar -zxf /tmp/hadoop.tar.gz -C /opt",
    path    => $path,
    creates => "${hadoop_home}-1.0.3",
    require => Exec["download_hadoop"]
  }
}

We have done a few things here, and they are almost self-explanatory. We set a variable, hadoop_home, as the base path of our Hadoop installation. We download the Hadoop binaries from an Apache mirror and extract them under /opt, which creates the directory /opt/hadoop-1.0.3 (that is, ${hadoop_home}-1.0.3).

We need to add our new module to the main puppet configuration file. We add the following line at the top of the base-hadoop.pp file:

include hadoop

Then we add this new modules path to our Vagrantfile. So now our puppet section looks like:

config.vm.provision :puppet do |puppet|
  puppet.manifests_path = "manifests"
  puppet.manifest_file  = "base-hadoop.pp"
  puppet.module_path    = "modules"
end

We execute the following to reload the vagrant machine:

vagrant reload

That command restarts the Vagrant machine and runs the Puppet manifests, which install the required software.

We will need a cluster of virtual machines, and Vagrant supports that. We open our Vagrantfile and replace its content with the following:

Vagrant::Config.run do |config|
  config.vm.box = "base-hadoop"

  config.vm.provision :puppet do |puppet|
    puppet.manifests_path = "manifests"
    puppet.manifest_file  = "base-hadoop.pp"
    puppet.module_path    = "modules"
  end

  config.vm.define :master do |master_config|
    master_config.vm.network :hostonly, "192.168.1.10"
  end

  config.vm.define :backup do |backup_config|
    backup_config.vm.network :hostonly, "192.168.1.11"
  end

  config.vm.define :hadoop1 do |hadoop1_config|
    hadoop1_config.vm.network :hostonly, "192.168.1.12"
  end

  config.vm.define :hadoop2 do |hadoop2_config|
    hadoop2_config.vm.network :hostonly, "192.168.1.13"
  end

  config.vm.define :hadoop3 do |hadoop3_config|
    hadoop3_config.vm.network :hostonly, "192.168.1.14"
  end
end
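The five define blocks differ only in machine name and address, so the same cluster could also be described with one table and a loop. The following is a minimal sketch of that idea, not the Vagrantfile used in this post. The cluster_nodes helper is a hypothetical name of my own, and the defined?(Vagrant) guard is there only so that plain Ruby can load the file as well.

```ruby
# Hypothetical helper (not part of Vagrant): machine name => host-only IP,
# using the same names and addresses as the Vagrantfile above.
def cluster_nodes
  {
    :master  => "192.168.1.10",
    :backup  => "192.168.1.11",
    :hadoop1 => "192.168.1.12",
    :hadoop2 => "192.168.1.13",
    :hadoop3 => "192.168.1.14"
  }
end

# Only configure machines when Vagrant itself is loading this file.
if defined?(Vagrant)
  Vagrant::Config.run do |config|
    config.vm.box = "base-hadoop"

    config.vm.provision :puppet do |puppet|
      puppet.manifests_path = "manifests"
      puppet.manifest_file  = "base-hadoop.pp"
      puppet.module_path    = "modules"
    end

    # One define block per entry in the table.
    cluster_nodes.each do |name, ip|
      config.vm.define name do |node_config|
        node_config.vm.network :hostonly, ip
      end
    end
  end
end
```

Adding a fourth worker would then be a one-line change to the table.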

After this we execute:

vagrant up

That will start and provision all the servers. It will take a while.

But we are not done yet. Next we need to configure the Hadoop cluster. In the directory modules/hadoop we create another directory called files. Here we will create the configuration files our Hadoop cluster needs.

We create the following files:

core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
    <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation.</description>
  </property>
</configuration>

hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
    <description>The actual number of replications can be specified when the file is created.</description>
  </property>
</configuration>

mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:9001</value>
    <description>The host and port that the MapReduce job tracker runs at.</description>
  </property>
</configuration>
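The three site files share one shape: a configuration element holding name/value properties. As a rough illustration of that shape, here is a small Ruby sketch that renders the same three settings from a table. The hadoop_site_xml helper and the SITE_FILES constant are hypothetical names, descriptions are left out, and values are not XML-escaped, which is only safe because these particular values contain no special characters.

```ruby
# Hypothetical helper: renders a Hadoop *-site.xml document from name/value
# pairs. Values are inserted as-is (no XML escaping).
def hadoop_site_xml(properties)
  body = properties.map do |name, value|
    "  <property>\n" \
    "    <name>#{name}</name>\n" \
    "    <value>#{value}</value>\n" \
    "  </property>\n"
  end.join

  "<?xml version=\"1.0\"?>\n" \
  "<?xml-stylesheet type=\"text/xsl\" href=\"configuration.xsl\"?>\n" \
  "<configuration>\n#{body}</configuration>\n"
end

# The settings used in this post, one entry per configuration file.
SITE_FILES = {
  "core-site.xml"   => { "fs.default.name"    => "hdfs://master:9000" },
  "hdfs-site.xml"   => { "dfs.replication"    => "3" },
  "mapred-site.xml" => { "mapred.job.tracker" => "master:9001" }
}

SITE_FILES.each do |file_name, properties|
  puts "== #{file_name}"
  puts hadoop_site_xml(properties)
end
```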

masters

192.168.1.11

slaves

192.168.1.12
192.168.1.13
192.168.1.14
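In Hadoop 1.x the masters file names the host that runs the secondary namenode (our backup machine) and the slaves file names the worker hosts. Both are conventionally written with one host per line. A minimal Ruby sketch that renders them from lists, assuming a hypothetical host_list helper:

```ruby
# Hypothetical helper: renders a Hadoop host list file, one host per line.
def host_list(hosts)
  hosts.map { |host| "#{host}\n" }.join
end

masters = ["192.168.1.11"]                                  # the backup machine
slaves  = ["192.168.1.12", "192.168.1.13", "192.168.1.14"]  # hadoop1 to hadoop3

puts "masters:"
puts host_list(masters)
puts "slaves:"
puts host_list(slaves)
```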

We then need to tell puppet to copy these files to our cluster. So we modify our init.pp file in the hadoop puppet module to contain the following:

class hadoop {
  $hadoop_home = "/opt/hadoop"

  exec { "download_hadoop":
    command => "wget -O /tmp/hadoop.tar.gz http://apache.mirrors.timporter.net/hadoop/common/hadoop-1.0.3/hadoop-1.0.3.tar.gz",
    path    => $path,
    unless  => "ls /opt | grep hadoop-1.0.3",
    require => Package["openjdk-6-jdk"]
  }

  exec { "unpack_hadoop":
    command => "tar -zxf /tmp/hadoop.tar.gz -C /opt",
    path    => $path,
    creates => "${hadoop_home}-1.0.3",
    require => Exec["download_hadoop"]
  }

  file { "${hadoop_home}-1.0.3/conf/slaves":
    source  => "puppet:///modules/hadoop/slaves",
    mode    => 644,
    owner   => root,
    group   => root,
    require => Exec["unpack_hadoop"]
  }

  file { "${hadoop_home}-1.0.3/conf/masters":
    source  => "puppet:///modules/hadoop/masters",
    mode    => 644,
    owner   => root,
    group   => root,
    require => Exec["unpack_hadoop"]
  }

  file { "${hadoop_home}-1.0.3/conf/core-site.xml":
    source  => "puppet:///modules/hadoop/core-site.xml",
    mode    => 644,
    owner   => root,
    group   => root,
    require => Exec["unpack_hadoop"]
  }

  file { "${hadoop_home}-1.0.3/conf/mapred-site.xml":
    source  => "puppet:///modules/hadoop/mapred-site.xml",
    mode    => 644,
    owner   => root,
    group   => root,
    require => Exec["unpack_hadoop"]
  }

  file { "${hadoop_home}-1.0.3/conf/hdfs-site.xml":
    source  => "puppet:///modules/hadoop/hdfs-site.xml",
    mode    => 644,
    owner   => root,
    group   => root,
    require => Exec["unpack_hadoop"]
  }
}

We then execute:

vagrant provision

And we get these files copied to all our servers.

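We need to set up password-less ssh communication between our servers. We modify our base-hadoop.pp, adding the following resources. The id_rsa and id_rsa.pub files they refer to go in modules/hadoop/files, next to the Hadoop configuration files: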

file { "/root/.ssh/id_rsa":
  source  => "puppet:///modules/hadoop/id_rsa",
  mode    => 600,
  owner   => root,
  group   => root,
  require => Exec['apt-get update']
}

file { "/root/.ssh/id_rsa.pub":
  source  => "puppet:///modules/hadoop/id_rsa.pub",
  mode    => 644,
  owner   => root,
  group   => root,
  require => Exec['apt-get update']
}

ssh_authorized_key { "ssh_key":
  ensure  => "present",
  key     => "AAAAB3NzaC1yc2EAAAADAQABAAABAQCeHdBPVGuSPVOO+n94j/Y5f8VKGIAzjaDe30hu9BPetA+CGFpszw4nDkhyRtW5J9zhGKuzmcCqITTuM6BGpHax9ZKP7lRRjG8Lh380sCGA/691EjSVmR8krLvGZIQxeyHKpDBLEmcpJBB5yoSyuFpK+4RhmJLf7ImZA7mtxhgdPGhe6crUYRbLukNgv61utB/hbre9tgNX2giEurBsj9CI5yhPPNgq6iP8ZBOyCXgUNf37bAe7AjQUMV5G6JMZ1clEeNPN+Uy5Yrfojrx3wHfG40NuxuMrFIQo5qCYa3q9/SVOxsJILWt+hZ2bbxdGcQOd9AXYFNNowPayY0BdAkSr",
  type    => "ssh-rsa",
  user    => "root",
  require => File['/root/.ssh/id_rsa.pub']
}
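Puppet's ssh_authorized_key takes only the base64 body of the public key in its key attribute, without the leading type or the trailing comment. Assuming the value above is the body of the id_rsa.pub file that gets copied to the servers (the post does not say so explicitly), a small Ruby sketch like this one could pull the pieces out of such a file. The parse_public_key helper is a hypothetical name.

```ruby
# Hypothetical helper: splits one line of an OpenSSH public key file
# ("<type> <base64 key> [comment]") into its parts.
def parse_public_key(line)
  type, key, comment = line.strip.split(/\s+/, 3)
  raise ArgumentError, "not an OpenSSH public key line" if type.nil? || key.nil?
  { :type => type, :key => key, :comment => comment }
end

# Example with a shortened, made-up key body:
parts = parse_public_key("ssh-rsa AAAAB3NzaC1yc2E root@master\n")
puts "type => #{parts[:type]}"
puts "key  => #{parts[:key]}"
```

With the real file, the input would be the contents of modules/hadoop/files/id_rsa.pub, and the two printed values would go into the type and key attributes of the resource.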

We are ready to run our Hadoop cluster now. For that, once again we modify the init.pp file in the hadoop Puppet module, adding the following at the end, just before the closing brace of the hadoop class:

  file { "${hadoop_home}-1.0.3/conf/hadoop-env.sh":
    source  => "puppet:///modules/hadoop/hadoop-env.sh",
    mode    => 644,
    owner   => root,
    group   => root,
    require => Exec["unpack_hadoop"]
  }

The hadoop-env.sh file is the original one, but with the JAVA_HOME setting uncommented and pointed at the correct Java installation.
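The edit to hadoop-env.sh is a one-line change, and it can be sketched in Ruby as a text substitution. The enable_java_home helper is a hypothetical name, the sample text only resembles the stock file, and the path /usr/lib/jvm/java-6-openjdk is an assumption about where the openjdk-6-jdk package installs Java on this box. Check the actual location on your virtual machine before using it.

```ruby
# Hypothetical helper: returns the hadoop-env.sh text with the first
# commented-out JAVA_HOME line replaced by an active export.
def enable_java_home(env_sh_text, java_home)
  # Block form, so that java_home is inserted literally.
  env_sh_text.sub(/^#\s*export JAVA_HOME=.*$/) { "export JAVA_HOME=#{java_home}" }
end

# Sample text resembling the relevant lines of the stock file.
stock = "# The java implementation to use.  Required.\n" \
        "# export JAVA_HOME=/usr/lib/j2sdk1.5-sun\n"

# Assumed JDK location; verify it on the virtual machine.
puts enable_java_home(stock, "/usr/lib/jvm/java-6-openjdk")
```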

We can give different names to each host in the Vagrantfile. For that we replace its contents with the following:

Vagrant::Config.run do |config|
  config.vm.box = "base-hadoop"

  config.vm.provision :puppet do |puppet|
    puppet.manifests_path = "manifests"
    puppet.manifest_file  = "base-hadoop.pp"
    puppet.module_path    = "modules"
  end

  config.vm.define :backup do |backup_config|
    backup_config.vm.network :hostonly, "192.168.1.11"
    backup_config.vm.host_name = "backup"
  end

  config.vm.define :hadoop1 do |hadoop1_config|
    hadoop1_config.vm.network :hostonly, "192.168.1.12"
    hadoop1_config.vm.host_name = "hadoop1"
  end

  config.vm.define :hadoop2 do |hadoop2_config|
    hadoop2_config.vm.network :hostonly, "192.168.1.13"
    hadoop2_config.vm.host_name = "hadoop2"
  end

  config.vm.define :hadoop3 do |hadoop3_config|
    hadoop3_config.vm.network :hostonly, "192.168.1.14"
    hadoop3_config.vm.host_name = "hadoop3"
  end

  config.vm.define :master do |master_config|
    master_config.vm.network :hostonly, "192.168.1.10"
    master_config.vm.host_name = "master"
  end
end

Let’s do “vagrant reload” and wait for all systems to reload.

We have provisioned our systems. Let’s go to our master node and start everything:

vagrant ssh master

Once we are logged in, we go to /opt/hadoop-1.0.3/bin and run:

sudo ./hadoop namenode -format

sudo ./start-all.sh

Our Hadoop cluster is now started. We can visit http://192.168.1.10:50070/ to reach the web interface on our master node and see that the cluster is indeed running.

All the files for this example (except for the box itself) are available at git@github.com:calo81/vagrant-hadoop-cluster.git for free use.
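The same check can be scripted from the host machine. The sketch below uses only Ruby's standard net/http library. The helper names namenode_ui_url and web_ui_up? are hypothetical, and treating any success or redirect response as "up" is my own simplification.

```ruby
require "net/http"
require "uri"

# Hypothetical helper: URL of the web interface on the master node.
# Port 50070 is the one used in this post.
def namenode_ui_url(host, port = 50070)
  "http://#{host}:#{port}/"
end

# Hypothetical helper: true if the URL answers with a success or a redirect,
# false on any error (connection refused, timeout, unknown host, ...).
def web_ui_up?(url, timeout_seconds = 5)
  uri = URI.parse(url)
  http = Net::HTTP.new(uri.host, uri.port)
  http.open_timeout = timeout_seconds
  http.read_timeout = timeout_seconds
  response = http.get(uri.path.empty? ? "/" : uri.path)
  response.is_a?(Net::HTTPSuccess) || response.is_a?(Net::HTTPRedirection)
rescue StandardError
  false
end

# Usage, once the cluster has been started:
#   puts web_ui_up?(namenode_ui_url("192.168.1.10"))
```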

2 comments:

Anonymous said...: Great tutorial. Had a few problems that I had to work around though:

Change base-hadoop.pp to be:

exec { 'apt-get update':
command => '/usr/bin/apt-get update',
}

package { "openjdk-6-jdk" :
ensure => "present",
require => Exec['apt-get update']
}

and change init.pp to be :

class hadoop {
$hadoop_home = "/opt/hadoop"

exec { "download_hadoop":
command => "wget -O /tmp/hadoop.tar.gz http://apache.mirrors.timporter.net/hadoop/common/hadoop-1.1.0/hadoop-1.1.0.tar.gz",
path => $path,
unless => "ls /opt | grep hadoop-1.1.0",
require => Package["openjdk-6-jdk"],
timeout => '0'
}
November 8, 2012 at 12:23 PM

Anonymous said...: base-hadoop.pp should be consistent throughout the post. October 30, 2014 at 3:41 AM


Friday, September 7, 2012
