Sunday, July 3, 2011

Understanding the event driven model for concurrency

For some time now there’s been this idea that the normal concurrency model for web applications, the one many of us are used to, is not the best way to serve large numbers of concurrent clients.

The model I’m talking about is of course the One Request - One Thread model. I started to read about it, and the main concern raised about this model is that it can’t scale very well to many requests (really, really many) because of the cost of threads, which shows up mainly as memory consumption and context switching. Another concern is the complexity associated with thread programming.

I then started to look at the alternative that was being talked about all over the web (mainly because of Node.js): the Evented, or Event Driven, model.

So here I will try to introduce how this model works in an easy way.

The general idea is the utilization of the Reactor Pattern.

The Reactor Pattern normally works as a single-threaded event loop which, in every iteration, checks for I/O events on its registered handles and dispatches those events to suitable handlers.

The loop must keep running without blocking operations. The only moment the loop blocks is when no I/O events are pending; the moment one arrives, the loop must handle it quickly and continue.
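To make the shape of that loop concrete before we get to EventMachine, here is a toy, single-threaded echo server in plain Ruby built around IO.select. This is my own minimal sketch of the pattern, not EventMachine code, and error handling is reduced to the bare minimum:

require 'socket'

# A toy reactor: one thread, one loop, and only non-blocking work inside it.
server = TCPServer.new(3001)
handlers = { server => :accept }        # registered descriptors => pending action

loop do
  # Block only while nothing is ready; wake up on the first I/O event.
  ready, _, _ = IO.select(handlers.keys)
  ready.each do |io|
    if handlers[io] == :accept
      client = io.accept_nonblock       # quick, non-blocking work only
      handlers[client] = :read
    else
      begin
        data = io.read_nonblock(4096)   # dispatch the "read ready" event
        io.write("echo: #{data}")
      rescue EOFError
        handlers.delete(io)             # peer closed: unregister the descriptor
        io.close
      end
    end
  end
end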

Let’s now look at an example using EventMachine (a Ruby implementation of the Reactor Pattern), and then at what the Reactor itself looks like.

The example is a server that, on receiving data from a connection, makes a query call to MongoDB and sends the query result back to the calling client:


module Server
  def receive_data(received)
    db = EM::Mongo::Connection.new.db('db')
    collection = db.collection('test')

    collection.find('_id' => 123) do |res|
      send_data res.inspect
    end
  end
end

EM.run do
  EM.start_server '0.0.0.0', 3001, Server
end

Ok, so what does this code do? Well, it starts the Reactor with the EM.run call. Then it initializes an evented server on port 3001 (the creation and registration of the server happen before the loop starts iterating) and adds it to the Reactor’s list of descriptors.

Now the Reactor starts looping: it goes through its descriptors and waits until one of them changes state. Right now there is only the server socket descriptor, so for now the event loop is in a waiting state.

If we now connect, for example with Telnet, to port 3001, the Reactor will detect this, check which kind of I/O-ready event just happened on the server socket (in this case an “accept ready” or similar), and register a new descriptor for the new socket connection. Then the Reactor waits for events again.

The moment the connection from the telnet session is received, EventMachine, in our code example, will create a new instance of the Server module (more exactly, an instance of an anonymous class that includes the module).
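Conceptually, what EventMachine does with the module we pass to start_server is roughly the following (an illustration of the mechanism only, simplified from the real implementation):

# Roughly what happens for each accepted connection (simplified sketch,
# not EventMachine's actual internals):
handler_class = Class.new do
  include Server                # the module we passed to EM.start_server
end
connection = handler_class.new
# From here on, "read ready" events on this socket are dispatched as
# connection.receive_data(data) calls.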

If we now send some text from our telnet session, the reactor will detect that our last descriptor is ready for an I/O read operation and process it accordingly. In our code example, the reactor calls the handler, which in our case is the receive_data method of our new Server instance.

Next, when the data is received in the receive_data method, we make a call to MongoDB: a simple query. But again, this operation is non-blocking: internally, the find method creates an EventMachine connection that registers another descriptor in the reactor, so now we have three descriptors registered. When the query has been processed and the data is available, the descriptor signals the event, the reactor notices it and calls the specific handler for the event, in this case the code block passed to find. In this callback we send the data returned from Mongo to our connected socket with the send_data method.

To better understand how the Reactor works, I include next some extracts from the EventMachine Java implementation.

The main class of the EventMachine Java implementation (the implementation used when working with JRuby, and the one I understand better than the C++ version used with standard Ruby) is the EmReactor class, and the main method on it is run, which runs the event loop:
public void run() {
    try {
        mySelector = Selector.open();
        bRunReactor = true;
    } catch (IOException e) {
        throw new RuntimeException("Could not open selector", e);
    }

    while (bRunReactor) {
        runLoopbreaks();
        if (!bRunReactor) break;

        runTimers();
        if (!bRunReactor) break;

        removeUnboundConnections();
        checkIO();
        addNewConnections();
        processIO();
    }

    close();
}

In this code we can see the main parts of the Reactor Pattern:

First, the infinite loop (or rather, a loop that runs until bRunReactor becomes false).

And if we look at the last three calls inside the while loop, we can see the steps we have been talking about.

checkIO(): This method “listens” for I/O events on the registered descriptors, blocking until one I/O channel is ready for something. In Java it is implemented with Selector.select(timeout).

addNewConnections(): This method checks whether a new connection is registering with the loop, as happens, for example, in our Mongo find call.

processIO(): This method loops over the descriptors that are in a ready state, checks which state they are in (write ready, read ready, etc.) and dispatches to the handlers that deal with the event.

So that’s it. Those are the basics of how the Reactor Pattern works.

Thursday, May 26, 2011

MongoDB Tutorial/Reference Essentials

The following is a quick tutorial / reference for starting to use the MongoDB database.

We’ll start with a quick definition and then go into the following topics, in a fast, example-based approach:

Installing
Inserting
Querying
Updating
Deleting
Indexes and Explain
Map Reducing
Drivers
Distributing


MongoDB is a document-oriented database built with the intention of handling very large amounts of data with good performance, to match the ever-increasing data that modern applications have to deal with, particularly when “in the cloud”. MongoDB also aims to keep some of the great characteristics of relational databases, like the ability to execute dynamic queries, so you don’t lose that flexibility.

Installing:


Installing MongoDB for our testing is really easy: just go to http://www.mongodb.org/downloads and download the pertinent version (this reference uses Linux and Mac OS X).

After downloading, gunzip and untar the file, and that’s it: it is installed.
To start the Mongo server, go to the untarred directory, enter the bin directory and execute ./mongod (this assumes you have a /data/db directory on your system; if you don’t, create one).


Inserting:


MongoDB works with documents. In Mongo, a document is simply a binary representation of a JSON object, called BSON. You can simply think of a document as a JSON object; the binary part is just the way Mongo represents this object internally.

In Mongo, every document must belong to a collection (a collection can be thought of as a table in an RDBMS, but only as an aid to understanding, because they are different things), and every collection belongs to a database.

So let’s say we want to insert a car document into a cars collection that belongs to the concesionary database. We would do the following:

- From the bin directory, and with the server started, we execute ./mongo to open the interactive shell. The interactive shell of Mongo allows us to interact with the database server using JavaScript.

- Next, we switch to our concesionary database (even though the database doesn’t exist yet, this command will work):

use concesionary

- Next, we insert our new car into the cars collection (again, the collection doesn’t exist yet, but it, and the concesionary database, will get created when inserting the first element):

db.cars.insert({maker:'ferrari',model:'f50',acceleration:{speed100:3,speed200:9},colors:['white','black']});

As you can see, we are inserting a new car that is basically a JSON object (including simple types, subdocument types and arrays).

Let’s insert another car to use in the next section on querying:

db.cars.insert({maker:'fiat',model:'500',acceleration:{speed100:10,speed200:'NEVER'},colors:['blue','red']});

Querying:


MongoDB gives you a lot of flexibility in querying, very close to what you can do with SQL: you can use lots of filters, comparisons, etc. We’ll just do a couple of basic queries here to get your feet wet.

In general, you query MongoDB by calling the find method on the collection and passing a JSON document with the selection you want to query on:

db.cars.find()

{ "_id" : ObjectId("4dde4d1c6eb878af72075592"), "maker" : "ferrari", "model" : "f50", "acceleration" : { "speed100" : 3, "speed200" : 9 }, "colors" : [ "white", "black" ] }

{ "_id" : ObjectId("4dde4f3d6eb878af72075593"), "maker" : "fiat", "model" : "500", "acceleration" : { "speed100" : 10, "speed200" : "NEVER" }, "colors" : [ "blue", "red" ] }



db.cars.find({maker:'ferrari'});


{ "_id" : ObjectId("4dde4d1c6eb878af72075592"), "maker" : "ferrari", "model" : "f50", "acceleration" : { "speed100" : 3, "speed200" : 9 }, "colors" : [ "white", "black" ] }

db.cars.find({'acceleration.speed200':'NEVER'});

{ "_id" : ObjectId("4dde4f3d6eb878af72075593"), "maker" : "fiat", "model" : "500", "acceleration" : { "speed100" : 10, "speed200" : "NEVER" }, "colors" : [ "blue", "red" ] }

db.cars.find({'colors':'white'});

{ "_id" : ObjectId("4dde4d1c6eb878af72075592"), "maker" : "ferrari", "model" : "f50", "acceleration" : { "speed100" : 3, "speed200" : 9 }, "colors" : [ "white", "black" ] }

Updating:


Basic updating is pretty straightforward: it needs a filter document, like find, and a parameter indicating how to modify the matched document:

db.cars.update({model:'f50'},{$set:{model:'f40'}});

db.cars.find({maker:'ferrari'});

{ "_id" : ObjectId("4dde54b56eb878af72075594"), "maker" : "ferrari", "model" : "f40", "acceleration" : { "speed100" : 3, "speed200" : 9 }, "colors" : [ "white", "black" ] }

Deleting


Even more straightforward than updating, requiring just the filter document (or nothing at all, if you want to delete all the documents in the collection):

db.cars.remove({maker:'ferrari'});

db.cars.find()
{ "_id" : ObjectId("4dde4f3d6eb878af72075593"), "maker" : "fiat", "model" : "500", "acceleration" : { "speed100" : 10, "speed200" : "NEVER" }, "colors" : [ "blue", "red" ] }
db.cars.remove()
db.cars.find()
>

Creating indexes, and query explain:


Indexes, as in any other database, are extremely important in MongoDB, and it’s extremely important to get them right. They work as you might expect and, applied correctly, can dramatically improve the performance of your queries. You can create compound indexes as well, as shown just below. Here we will touch the basics once again.
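A compound index, for example on maker and model together, would be created like this (shown only as an illustration; we won’t use it in what follows):

db.cars.ensureIndex({maker: 1, model: 1})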

Let’s insert our two cars again:

db.cars.insert({maker:'fiat',model:'500',acceleration:{speed100:10,speed200:'NEVER'},colors:['blue','red']});

db.cars.insert({maker:'ferrari',model:'f50',acceleration:{speed100:3,speed200:9},colors:['white','black']});

MongoDB automatically creates an index on the _id property of its documents. We can query the existing indexes like this:

db.system.indexes.find()

{ "name" : "_id_", "ns" : "concesionary.cars", "key" : { "_id" : 1 }, "v" : 0 }

Our application will probably run a lot of queries by car maker, so we add an index on the maker property like this:

db.cars.ensureIndex({maker: 1})

Now when we query for the existing indexes, we get our new index too:

{ "name" : "_id_", "ns" : "concesionary.cars", "key" : { "_id" : 1 }, "v" : 0 }
{ "_id" : ObjectId("4dde57fe6eb878af72075597"), "ns" : "concesionary.cars", "key" : { "maker" : 1 }, "name" : "maker_1", "v" : 0 }


So how can we see whether a query is using our index? Simple enough: we use the explain method. But before doing that, let’s remove the index and run explain without it.

db.runCommand({deleteIndexes: "cars", index: "maker_1"})

db.system.indexes.find()
{ "name" : "_id_", "ns" : "concesionary.cars", "key" : { "_id" : 1 }, "v" : 0 }


We removed the index; now let’s see explain in action:

db.cars.find({maker:'ferrari'}).explain()

{
    "cursor" : "BasicCursor",
    "nscanned" : 2,
    "nscannedObjects" : 2,
    "n" : 1,
    "millis" : 0,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "isMultiKey" : false,
    "indexOnly" : false,
    "indexBounds" : {

    }
}


The main things to look at when running explain (for the purposes of our discussion) are the cursor type and the nscanned and n attributes.

A “BasicCursor” is simply a cursor that scans the whole collection to get the query results; nscanned is the total number of documents scanned, and n is the total number of documents returned. In an ideal world, n and nscanned should be the same.

Now let’s create the index again and rerun the explain for the query:

db.cars.ensureIndex({maker: 1})
db.cars.find({maker:'ferrari'}).explain()

{
    "cursor" : "BtreeCursor maker_1",
    "nscanned" : 1,
    "nscannedObjects" : 1,
    "n" : 1,
    "millis" : 0,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "isMultiKey" : false,
    "indexOnly" : false,
    "indexBounds" : {
        "maker" : [
            [
                "ferrari",
                "ferrari"
            ]
        ]
    }
}



We can see the different results: we are using the index (indicated by the cursor property), and the nscanned and n properties have the same value. We are scanning only the documents we return.


Map Reducing:


Apart from the common grouping operations MongoDB offers (like sum, max, etc.), we can use map reduce, which is built into MongoDB, for finer-grained and more customized grouping requirements. (For an explanation of map reduce see: http://cscarioni.blogspot.com/2010/11/hadoop-basics.html). Here we show an extremely simple map reduce:

We want to simply count the cars per maker. First we insert a new fiat into the collection (do that yourself). Then we define our map reduce like this:

db.cars.mapReduce(
    function(){
        emit(this.maker, {number: 1});
    },
    function(key, values){
        var total = 0;
        values.forEach(function(value){
            total += value.number;
        });
        // Return the same shape we emit ({number: ...}), so that re-reducing
        // partial results still works and the output is consistent per key.
        return {number: total};
    },
    "result"
)


When run, we get this:

{
    "result" : "result",
    "timeMillis" : 3,
    "counts" : {
        "input" : 3,
        "emit" : 3,
        "output" : 2
    },
    "ok" : 1
}



and the result of the counting is in the new result collection:

db.result.find()
{ "_id" : "ferrari", "value" : { "number" : 1 } }
{ "_id" : "fiat", "value" : { "number" : 2 } }



As we can see, the mapReduce method receives a map function, a reduce function and, normally, the name of the collection in which to store the results.


Drivers:


In this section we aren’t going to say a lot: simply that MongoDB drivers already exist for the most common programming languages out there, they all work roughly the same (taking into account the advantages and limitations of each language), and they are pretty easy to start experimenting with.
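For instance, with the Ruby driver, the shell examples above look roughly like this (a sketch based on the mongo gem of this era; method names may vary slightly between driver versions):

require 'rubygems'
require 'mongo'

# Connect to the same database used in the shell examples
db = Mongo::Connection.new('localhost', 27017).db('concesionary')
cars = db.collection('cars')

cars.insert('maker' => 'ferrari', 'model' => 'f50')
cars.find('maker' => 'ferrari').each do |doc|
  puts doc.inspect
end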



Distributing:


One of the most important characteristics of MongoDB is its support for distribution, from creating replica sets to sharding. I’ll cover that in an article I plan to write soon. For now, suffice it to say that the sharding model is really powerful and allows for transparent failover and for transparent sharding and distribution of data chunks across the sharded cluster.

Friday, April 29, 2011

Ruby Dynamics

When we talk about dynamic languages, it means different things to different people. Is a language dynamic because of dynamic typing? Because it can execute scripts of arbitrary code dynamically? Because of duck typing? Because we can extend it at runtime? Or because of all of the previous reasons together?

I think each of these features (and more) adds to the level of dynamism of a language.

In this article I’ll explore some of the characteristics that make Ruby an incredibly dynamic language. I’ll name some of these features first and then go briefly into what they are and how they work.



Dynamic Typing:

This characteristic is easy: it simply means that variables don’t have a type at compile time, only at runtime, and that the type can change during the variable’s lifecycle, as this irb session shows:

irb(main):001:0> a="dsd"
=> "dsd"
irb(main):002:0> a.class
=> String
irb(main):003:0> a=5
=> 5
irb(main):004:0> a.class
=> Fixnum
irb(main):008:0> local_variables
=> [:a, :_]



Duck Typing

This is another common characteristic of dynamic languages. It means that, regardless of the type of a variable, if it responds to a particular message then it can receive that message and act accordingly.

For example, the message “fly” can be sent to an Airplane or a Bird. In a language like Java you might need to define an interface like “FlyCapable” or similar that declares the fly method; in Ruby (and other languages with duck typing) you can just do:

class Bird
  def fly
    puts "I am a flying bird"
  end
end

class Airplane
  def fly
    puts "I need a motor, but also fly"
  end
end

def make_entity_fly(entity)
  entity.fly
end

bird = Bird.new
plane = Airplane.new

make_entity_fly bird
make_entity_fly plane

and when you run it, you get

I am a flying bird
I need a motor, but also fly




Extending existing classes at runtime.

At this point I am not referring to extending in the sense of inheritance, but to actually extending the functionality of a class that already exists, for example adding a new method to the String class. There are a few ways to do this: you can reopen the class, or you can use the class_eval method.

Let’s see both. Suppose we want to add a method to the String class; we can just do:

class String
  def my_method
    self + " and some extra "
  end
end

string = "Original String"
puts string.my_method
# And this is still a normal String with the common String methods, of course
puts string.capitalize


Using class_eval:

String.class_eval do
  define_method :my_method do
    self + " and some extra "
  end
end

string = "Original String"
puts string.my_method
# And this is still a normal String with the common String methods, of course
puts string.capitalize


Modules

Modules are one of the nicest things in Ruby: easy to understand and very powerful. Modules allow you to isolate behaviour in one place and use this behaviour (methods) from classes that include the module, as if the methods were defined in the class itself, allowing its instances to use them. An example illustrates this better.

module StartAndStop
  def start
    @state = "started"
  end

  def stop
    @state = "stopped"
  end
end

class Car
  include StartAndStop

  def state
    puts @state
  end
end

car = Car.new
car.start
car.state
car.stop
car.state


And the output

ruby Ruby-1.rb
started
stopped




Ruby is object oriented, not class oriented

I didn’t realize this until I started working with Ruby, but Java is really more class oriented than object oriented. In Java, an object’s features are forever bound to its class (and superclass) definitions. In Ruby, by contrast, objects can have a life of their own, independent of their class. Ruby classes work as templates for creating objects, but from that moment on each object lives on its own:

class Person
  def talk
    puts "hola"
  end
end

p1 = Person.new
p2 = Person.new

def p1.scream
  puts "HOLA!!"
end

p2.instance_eval do
  undef :talk
end

p1.talk
p1.scream

p2.talk


And the output

ruby Ruby-1.rb
hola
HOLA!!
Ruby-1.rb:20:in `<main>': undefined method `talk' for #<Person:0x...> (NoMethodError)





Extending the singleton class of an object.


In Ruby, every object, apart from the classes it extends explicitly, has a class associated with it alone. These classes are known as eigenclasses, or anonymous classes. We already saw a little of how to extend these classes with behaviour when we defined the scream method directly on an object. However, if you want to open the eigenclass of an object explicitly and add something to it, you can do it like this:

a = String.new

class << a
  def append_salute
    self << "Hola"
  end
end

a.append_salute
puts a



Extending using modules dynamically


We already saw that you can add behaviour to a class by including modules. You can also include modules directly in an object (actually, in its eigenclass). You can do this in two ways:

1)
module DummyModule
  def method_dummy
    puts "inside dummy method"
  end
end

a = String.new
a.extend DummyModule
a.method_dummy


2)
module DummyModule
  def method_dummy
    puts "inside dummy method"
  end
end

a = String.new

class << a
  include DummyModule
end

a.method_dummy




Using eval and similar methods

We already saw class_eval and instance_eval; the other method to know about is eval. It basically allows you to execute arbitrary strings of code as Ruby code:
string_of_code = "class ClaseDummy
  def method_dummy
    puts 'dummy'
  end
end"

eval(string_of_code)
a = ClaseDummy.new
a.method_dummy

Creating a new type at runtime.

In Ruby, a class is an object of type Class, normally identified by a name that is a constant. So it is easy to create a new type that doesn’t yet exist:

Person = Class.new do
  def talk
    puts "hola"
  end
end

p = Person.new
p.talk

And the output

ruby Ruby-1.rb
hola





Dynamic Dispatch


Dynamic dispatch means sending messages (calling methods) to objects where the message is determined at runtime: you don’t know which method you want to call until runtime:

class Person
  def talk
    puts "hola :)"
  end
end

p = Person.new
metodo = "talk"

p.send metodo

And the output:

ruby Ruby-1.rb
hola :)


Defining methods dynamically

This means creating, at runtime, methods that don’t exist at compile time. There are a couple of ways to do this. One is:
class Person
  def talk
    puts "hola :)"
  end
end

new_method = "scream"

Person.class_eval do
  define_method new_method do
    puts "HOLA!"
  end
end

p = Person.new
p.scream
puts Person.instance_methods false


And the output

ruby Ruby-1.rb
HOLA!
talk
scream



Ghost Methods


Ghost methods are methods that don’t really exist but that you can still call. How cool is that? They leverage a special method present on every Ruby object, named method_missing, which gets called when you invoke a method that doesn’t exist anywhere in the object’s lookup path.

class Person
  def method_missing(name, *args, &block)
    if name == :talk
      puts "hola " + args[0]
    else
      super
    end
  end
end

p = Person.new
p.talk "carlo"
p.scream


And the output

ruby Ruby-1.rb
hola carlo
Ruby-1.rb:6:in `method_missing': undefined method `scream' for #<Person:0x...> (NoMethodError)
	from Ruby-1.rb:13:in `<main>'

This was just a small introduction: all these techniques, and many more, can be combined and extended to achieve really powerful results.
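Just as a taste of combining them, here is a toy example of mine that mixes ghost methods with dynamic method definition: the first call to any say_* method goes through method_missing, which then defines the method for real, so later calls are normal dispatch:

class Greeter
  def method_missing(name, *args, &block)
    if name.to_s =~ /\Asay_(\w+)\z/
      word = $1
      # Define the ghost method for real on the first call
      self.class.class_eval do
        define_method(name) { puts word }
      end
      send(name)
    else
      super
    end
  end
end

g = Greeter.new
g.say_hola   # goes through method_missing, defines say_hola, prints "hola"
g.say_hola   # now a plain method call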


Monday, March 21, 2011

Making rich domain models in Ruby is nicer than Java

Being a Java developer starting to do some programming in Ruby, I have found many nice things in the language and in the way things are done.

One of the nicest things I found is the notion of module mixins, and although you can artificially do the same with Java and Spring (google for @DeclareParents), it’s not as neat and clean as the way Ruby does it.

Module mixins basically allow us to add behaviour to our objects without inheritance. It’s as if in Java we could use an interface with default behaviour attached to it, so we didn’t need to actually implement the methods of the interface.

An object that has a module mixed in can receive calls to the module’s methods as if they belonged to the object itself. This allows a nice approach to rich domain programming while maintaining a great separation of concerns and avoiding filling classes and objects with lots of unrelated functionality.

To explain the last paragraph better, let’s take a simple example: a Person class.

In our example, a person can have emotions (happy, sad, etc.), can do actions (play, sleep) and has measure attributes and queries (height, weight). It can also be used in database operations (if we want an ActiveRecord-like approach): save, update, etc.

In a model-centric approach (instead of using services, DAOs, etc.) we would like Person objects that allow us to do things like:

if (person.sad?)

person.playWith(toy)

if (person.higher_than?(person2))

person.store
We can consider each of these methods as addressing a different concern, but all of them are valid things for a Person to know about. If we were to program this in Java, we would end up with a Person class holding all these methods:

public class Person{
    public boolean isSad(){...}
    public void playWith(Toy toy){...}
    public int store(){...}
    public boolean isHigherThan(Person person2){...}
}
We could argue that a better approach would be to have “behavioural” modules and plug them individually into our class. For example, in Ruby we could have something like:

class Person
  include Sentiments
  include Actions
  include Persistence
  include Measures
end
We can then have each module dealing with its own concern, making all this knowledge and these capacities available to Person objects as if they were defined on the Person class.

So for example our “Measures” module could have something like:

module Measures
  attr_accessor :height

  def higher_than?(entity)
    self.height > entity.height
  end
end
Then we can do something like:

a = Person.new
a.height = 150
b = Person.new
b.height = 200

b.higher_than? a
This same approach can be used for the other modules, like Persistence or Sentiments.
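For instance, a Sentiments module (hypothetical, just following the same pattern as Measures) could look like:

module Sentiments
  attr_accessor :mood

  def sad?
    mood == :sad
  end

  def happy?
    mood == :happy
  end
end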

Thursday, February 17, 2011

Is it harder to Unit Test in Grails than in Java? Or am I doing it wrong?

I want to develop a service (a class method) that does the following: receive some text, make a POST call to some HTTP URL with that text as the body, and retrieve the value of a cookie (the SESSION_ID cookie) from the response.

If I were solving this in Java, I would start with a test like this (this code doesn't compile; I just wrote it in Notepad):

private CommunicationService testObj;
private static String REQUEST = "ANY_REQUEST_WILL_DO";
private static String URL_CONST = "http://www.google.com";
private static String SESSION_ID = "aassdd";
private Configuration conf;
private CookieValueExtractor cookieValueExtractor;
private HttpClient httpClient;
private HttpResponse response;

@Before
public void setup(){
    conf = new MockConfiguration();
    cookieValueExtractor = new MockCookieValueExtractor();
    httpClient = new MockHttpClient();
    response = new MockHttpResponse();
    testObj.setConfiguration(conf);
    testObj.setCookieValueExtractor(cookieValueExtractor);
    testObj.setHttpClient(httpClient);
}

@Test
public void testSendRequestToHttpAndReturnSessionId(){
    Capture<HttpPost> httpPost = new Capture<HttpPost>();
    expect(conf.getServerUrl()).andReturn(URL_CONST);
    expect(httpClient.execute(capture(httpPost))).andReturn(response);
    expect(cookieValueExtractor.extractCookieByName(response, "SESSION_ID")).andReturn(SESSION_ID);
    String cookieValue = testObj.sendRequest(REQUEST);
    assertEquals(URL_CONST, httpPost.getValue().getUri().toString());
    assertEquals(SESSION_ID, cookieValue);
}



Taking all the setup code out and focusing on the test method, we can see what we expect our method to do:

1. It will retrieve a URL from a Configuration facility.
2. It will use HttpClient to send a POST to that URL.
3. It will call a CookieExtractor facility to extract the cookie value we are interested in.
4. It will return that cookie value to the caller.


It is pretty easy to understand.

I now have the same requirement in a Grails project. Of course I want to use the power of Groovy and Grails, so I check out the HttpBuilder library, read a little bit about it and start to work.

HttpBuilder basically works by passing two parameters: a map with all the options (including body, path, etc.) and a closure that will get the response from the invocation.
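In outline, a call from the service would look something like this (a sketch assuming a url config value and the cookie layout my mock below mimics; this is not the final service code):

def http = new HTTPBuilder(url)
def sessionId = http.post(path: 'test_payment.pl', body: requestText) { resp ->
    // dig the SESSION_ID value out of the Set-Cookie header
    def cookie = resp.headers['Set-Cookie'].elements.find { it.name == 'SESSION_ID' }
    return cookie?.value
}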

I also need to obtain some values, like the URL, from the Grails configuration files.

In the end, after going back and forth between my implementation class and my test many times while developing (which I usually don't need to do in Java), I end up with the following:


void testMakeRequest() {
    configureConfigurationHolder()
    moneybookerService = new MoneybookerService()
    def httpBuilderMock = new MockFor(HTTPBuilder.class)
    httpBuilderMock.demand.post { a, b ->
        assertEquals("test_payment.pl", a["path"])

        def resp = new DummyResponseWithCookie()
        resp.init()
        def sessionId = b.call(resp)

        assertEquals("123123", sessionId)
    }
    def mockService = httpBuilderMock.proxyInstance()
    moneybookerService.httpBuilder = mockService
    moneybookerService.makeRequest()
    httpBuilderMock.verify mockService
}

private void configureConfigurationHolder() {
    def expando = new Expando()
    expando.url = ""
    expando.path = "test_payment.pl"
    ConfigurationHolder.config = ["moneybooker": expando]
}

class DummyResponseWithCookie {
    def headers

    def init() {
        def expando = new Expando()
        expando.name = "SESSION_ID"
        expando.value = "123123"
        def outerExpando = new Expando()
        outerExpando.elements = [expando]
        headers = ["Set-Cookie": outerExpando]
    }
}
A lot of work. The main problems I find are:

First, and most important: how do I test closures? In my case I had to mock a response object (DummyResponseWithCookie) and, because I knew what I wanted the closure to do, call the closure with that mock. But this doesn't look good: I'm calling the closure inside the demand of the httpBuilder mock, so this test is testing more than it should. But I don't know how to test the closure on its own.

Second, the way to mock the configuration holder: just setting static variables doesn't seem very right.

Third (and this is probably lack of practice): it took me longer to write this test, and it was really difficult to think through exactly what I wanted to achieve in my service, what needed to be mocked and, in general, how to approach it in a TDD way.

I really like Groovy and Grails. I just don't know how to approach TDD, and unit testing in general, in a way that feels as comfortable as it does with Java.

Wednesday, January 26, 2011

Distributing Hadoop

As mentioned in the previous article, Hadoop Basics, the value of Hadoop is in running it distributed across many machines.

In this article I will introduce how to configure Hadoop for distributed processing.
I'll show how to do it with just two machines, but it is the same for more, as one of the main strengths of Hadoop is its ability to scale easily.

1. We download Hadoop 0.21.0 from http://mirror.lividpenguin.com/pub/apache//hadoop/core/hadoop-0.21.0/hadoop-0.21.0.tar.gz on both machines and uncompress the file.

2. We have two independent Hadoop installations right now, but we want them to run as a cluster, so we have to do some configuration.
Distributed Hadoop works with five different daemons that communicate with each other. The daemons are:

NameNode: The main controller of HDFS. It takes care of how files are broken into blocks, which nodes contain each block, and the general tracking of the distributed filesystem.

DataNode: This daemon serves the HDFS requirements of an individual slave node, communicating and coordinating with the NameNode.

Secondary NameNode: Takes snapshots of the NameNode for possible recoveries.

JobTracker: In charge of coordinating task submissions to the different nodes.

TaskTracker: Present on each processing node; in charge of executing the tasks submitted by the JobTracker, communicating with it constantly.

All communication between the Hadoop nodes is done through SSH. We will designate a master node (which will contain the NameNode and JobTracker) and two slave nodes. The master node must be able to communicate with the slave nodes through SSH using the same username (I'm using my username, cscarioni, communicating without a passphrase using private/public key authentication).

So, as we are using two machines, our architecture will be like this:

Machine 1 (Master): NameNode, JobTracker, Secondary NameNode, TaskTracker, DataNode
Machine 2 (Slave): TaskTracker, DataNode

We go to our master installation of Hadoop and enter its conf directory.

In core-site.xml we specify the NameNode information; we put the following:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://master-hadoop:9000</value>
</property>
</configuration>


In mapred-site.xml we specify where the JobTracker daemon is:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>master-hadoop:9001</value>
</property>
</configuration>
In hdfs-site.xml we specify the replication factor of the cluster, in our case 2:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
</configuration>

The masters and slaves files, as their names say, contain the names of the master and slave nodes. We have to modify them to include our own master and slave nodes. (I defined the following host names in the hosts file of both machines.)

So in the masters file we put

master-hadoop


And in the slaves file we put

master-hadoop
carlo-netbook

We now change hadoop-env.sh, uncommenting the JAVA_HOME line and pointing it to our JAVA_HOME.


Ok, these are all the files we need; we now distribute (copy) these files to both machines.


We now go to the bin directory on the master node and execute ./hadoop namenode -format to format the HDFS.

We now execute, in the same directory: ./start-all.sh.

That’s it, we have Hadoop running. We now need to put some files into HDFS and submit a map reduce job to it.

For this example I’ll use a custom-made file in which each line contains either the word GOD or the word Devil. I created the file with the following Groovy script:

def a  = new File("/tmp/biblia.txt")
random = new Random()
a.withWriter{
    for (i in (0..5000000)){
        if(random.nextInt(2)){
            it << "GOD\n"
            }else{
            it << "Devil\n"
        }
    }
}

From the master’s Hadoop bin directory, we copy the file from the local filesystem into HDFS with:

./hadoop fs -put /home/cscarioni/downloads/bible.txt bible.txt

To see that the file has been created, do:

./hadoop fs -ls

I get the following output:

-rw-r--r-- 2 cscarioni supergroup 4445256 2011-01-24 18:25 /user/cscarioni/bible.txt

Now we create our MapReduce program (it just counts how many times the words GOD and Devil appear in the file):



import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class GodVsDevils
{
    public static class WordMapper extends Mapper<LongWritable, Text, Text, LongWritable>
    {
        private LongWritable word = new LongWritable();
        private Text theKey = new Text();
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
        {
            String who =value.toString();
            word.set(1);
            if(who.equals("GOD"))
            {
                theKey.set("God");
                context.write(theKey, word);
            }
            else if(who.equals("Devil"))
            {
                theKey.set("Devil");
                context.write(theKey, word);
            }
        }
    }
    public static class AllTranslationsReducer
    extends Reducer<Text,LongWritable,Text,LongWritable>
    {
        private LongWritable result = new LongWritable();
        public void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException
        {
            long count = 0;
            for (LongWritable val : values)
            {
                count += val.get();
            }
            result.set(count);
            context.write(key, result);
        }
    }
    public static void main(String[] args) throws Exception
    {
        Configuration conf = new Configuration();
        Job job = new Job(conf,"GodDevils");
        job.setJarByClass(GodVsDevils.class);
        job.setMapperClass(WordMapper.class);
        job.setReducerClass(AllTranslationsReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/cscarioni"));
        FileOutputFormat.setOutputPath(job, new Path("output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

We compile it, jar it, and then execute the following on the master node:

./hadoop jar god.jar GodVsDevils -fs master-hadoop:9000 -jt master-hadoop:9001

This will run our map reduce job on the Hadoop cluster.
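When the job finishes, the counts are left in the output directory in HDFS; something like ./hadoop fs -cat output/part-r-00000 should print the two totals (the exact part file name may vary).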