Saturday, December 18, 2010

Understanding how Java Debug works

I find it surprising that many people who work with Java every day don’t know that there are debugging options beyond clicking the “debug” button in their IDE.

You can attach your IDE to an already running application (one that has been started in debug mode, as we’ll see later), or you can even debug it from the command line. And the application you debug can even be on a different machine.

The magic lies in where the debug information actually resides. People often think that it is the IDE that knows how to debug your programs, but the truth is that it is the program that knows how to debug itself, and it makes that information available to whoever wants to use it.

It works basically as follows. When you compile a program, the .class files get debug information embedded in them, such as line numbers or local variable names, which is made accessible to whoever wants to use it. You can then run the program in debug mode by passing the following options to your java execution (you can of course run any Java program like this, including mvn goals, application servers, etc.):

-Xdebug -Xrunjdwp:transport=dt_socket,address=4000,server=y,suspend=y

(you can also use -agentlib:jdwp instead of -Xrunjdwp in latest Java versions)

This line basically says: run this program in debug mode, use the JDWP protocol over a socket listening on port 4000, and suspend the application until a debugger connects.

JDWP (Java Debug Wire Protocol) is the communication protocol the Java application uses to receive commands, reply to them, and issue events of its own.

For example, you can connect to this port while the application is running and issue commands like “print variablex” to get the value of a variable, or “stop at x” to set a breakpoint. The application in turn issues notification events like “breakpoint reached”.
The truth is that the protocol is a little more complex than this, but this is enough to illustrate the point.

With that said, we can see that it would even be possible to debug an application with Telnet! (we’ll see this later)
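To make this concrete, here is a minimal sketch in Java of the very first thing any hand-rolled debugger client has to do: open a socket to the debuggee and exchange the JDWP handshake, which is the 14-byte ASCII string "JDWP-Handshake" sent by each side. The class and method names here are mine, not part of any API.

```java
import java.io.*;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class JdwpHandshake {
    static final byte[] HANDSHAKE = "JDWP-Handshake".getBytes(StandardCharsets.US_ASCII);

    // Writes the handshake string and checks that the peer echoes it back.
    // Returns true if the peer replied with the same 14 bytes.
    static boolean handshake(InputStream in, OutputStream out) throws IOException {
        out.write(HANDSHAKE);
        out.flush();
        byte[] reply = new byte[HANDSHAKE.length];
        int read = 0;
        while (read < reply.length) {
            int n = in.read(reply, read, reply.length - read);
            if (n < 0) return false; // peer closed the connection
            read += n;
        }
        return java.util.Arrays.equals(reply, HANDSHAKE);
    }

    public static void main(String[] args) throws IOException {
        // Connect to a JVM started with the debug options on port 4000
        try (Socket s = new Socket("localhost", 4000)) {
            System.out.println("Handshake ok: " + handshake(s.getInputStream(), s.getOutputStream()));
        }
    }
}
```

After the handshake succeeds, real commands flow as binary JDWP packets; that part is what jdb and the IDEs implement for us.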

Well, enough theory. Let’s see an example; any simple one will do. We’ll make a small program that takes two parameters from the command line and prints their sum. The program won’t be well designed (it includes some useless temp variables, no validations, etc.) but it will do to illustrate the point.

class Sum{
    public static void main(String[] args){
        int sum1 = Integer.parseInt(args[0]);
        int sum2 = Integer.parseInt(args[1]);
        int suma = sum1 + sum2;
        System.out.println("La suma es " + suma);
    }
}
So we compile it: javac -g Sum.java (the -g option adds extra debug info to the class, like local variable names).
And we run it in debug mode: java -Xdebug -agentlib:jdwp=transport=dt_socket,address=4000,server=y,suspend=y Sum 3 4

Now we have the application listening on port 4000, waiting for connections.

We will use jdb, the command line debugger that comes with Java. But first let’s try this. Run the following (you must type the second line quickly after the telnet session starts):

telnet localhost 4000
JDWP-Handshake

That “JDWP-Handshake” string is the handshake that initiates the communication. You now have a debugging session over Telnet!

OK, that was only for show; you (or I) would have to know the details of the JDWP protocol to actually use it. Let’s use jdb instead to debug our application. Execute the following:

jdb -attach 4000

You’ll get some output like:
Initializing jdb ...
VM Started: No frames on the current call stack


That’s it, you have a debug session started. Now the interesting part. Execute the following in your jdb session:

stop at Sum:6

You now have a breakpoint on line 6. Execute run in the session, and the program will run until that breakpoint. You’ll get the output:

Breakpoint hit: "thread=main", Sum.main(), line=6 bci=18
6 System.out.println("La suma es "+suma);

Now let’s see the value of our variables: run the following commands (one at a time) on the jdb session and see the results.

print sum1
print sum2
print suma
set suma = 10

This is pretty cool stuff: you can debug your program from the command line.

Of course, if you have the opportunity to use an IDE like Eclipse, you should take advantage of it while still applying what you’ve learnt. So let’s do this.

For this you need to have the source code of the running application open in Eclipse as a project.
Go back to the step where we ran the program in debug mode, and run it again.

Now go to Eclipse and select Run -> Debug Configurations from the menu.

In the left panel go to Remote Java Application, and click New there.

Then select your project, write 4000 in the Port field, and click Debug.

That’s it. You have attached Eclipse to the program being debugged; now you can set breakpoints, watch variables, and evaluate expressions from Eclipse.

That’s it. I hope this small article has helped you understand a little better how debugging an application works, and how you can debug an application that runs somewhere else.



Sunday, December 12, 2010

Intro to Groovy closures for Java developers

People starting to work with Groovy usually come from a Java background, and so they have to deal with the new (sometimes very different) features the language provides.

One of those new and different features is closures. Closures are a very powerful feature of Groovy, and one of the most heavily used, so they must definitely be understood to take full advantage of the language.

Sometimes people have trouble understanding closures because there is no such construct in Java. However, they are not really hard to understand, taking into account that Groovy is a 100% object-oriented language and that it compiles to normal Java classes.

I will explain the basics of closures by comparing code that uses closures with (rather roughly) how it would translate into Java.

First a fast introduction.

A simple definition of closures is to say that they are language constructs that encapsulate behaviour, like functions, and can be referenced and passed around the code. In Java terms, they are like a method without an associated class (we’ll see later that this is not exactly true) that can be referenced and passed around as an object.

The construct of a closure in Groovy looks like this:

{}

Yes, that’s it: an opening curly brace and the closing curly brace. That is the simplest possible closure, and it actually doesn’t do anything.

Closures can take parameters, like this: {arg1 ->}. That closure receives one parameter and doesn’t do anything with it.

A closure can, for example, return the double of its argument: {arg -> arg*2}

You can pass a closure to a method as the last argument without parentheses. For example, we can have a method somewhere that receives a closure, def methodFoo(closureArg), and it can be called like:
methodFoo{arg ->}

From this point, for the purpose of the explanation, we will use the each method on a Groovy List, which receives a closure as parameter and passes every element of the list to this closure.

Let's suppose we want to print the double of each element of an integer list.

[1,2,3,4].each{element -> print element*2}

That's it, we printed the double of each element. To achieve the same in Java, supposing that there existed an each method on java.util.List and that we had already built our list with the four elements, it would be something like:

interface MethodObject{
    void execute(Integer element);
}

list.each(new MethodObject(){
    public void execute(Integer element){
        System.out.println(element * 2);
    }
});

As we can see, closures are a really nice way of implementing callbacks at the language level.

In the first case, the each method will be something like:

def each(closure){
    for(element in this){
        closure(element)
    }
}

The second case will look almost the same:

void each(MethodObject callback){
    for(Integer element : this){
        callback.execute(element);
    }
}

Closures in Groovy are actually objects of type groovy.lang.Closure. So the declaration def a = {arg ->} actually creates an object of type Closure which, as we can see, has the method call.

A common source of confusion with closures is the scope of the variables around them. For me, the best way to understand how this works is to think that at the time of the closure’s declaration, all visible variables are passed to the closure. For example, taking the previous example and this time multiplying each element by a number taken from the outer scope, we get the following.

def multiplier = 3
[1,2,3,4].each{element -> print element*multiplier}

We can see that the closure can access the reference multiplier and use it.

As objects, I think that what happens is something like the following (of course this is not at all the way it is implemented; it’s just a way to understand that closures have access to the local variables in the scope where they are declared):

Integer multiplier = 3;

interface MethodObject{
    void execute(Integer element);
}

class MethodObjectImpl implements MethodObject{
    private Integer multiplier;
    public MethodObjectImpl(Integer multiplier){
        this.multiplier = multiplier;
    }
    public void execute(Integer element){
        System.out.println(element * multiplier);
    }
}

list.each(new MethodObjectImpl(multiplier));

As we can see, the local variable "multiplier" would be passed to the declaration of the closure object, so that it "remembers" the variable when it is executed.

So, I think the most important things to remember about closures are that they are a very convenient way of expressing callbacks, that they are objects (even though syntactically they don’t look like it), and that they remember the variables in their outer local scope.

Saturday, November 13, 2010

Using tcpdump and wireshark for debugging

In some of the projects I have worked on, I have found myself needing to integrate with external systems, in most cases via web services, REST or SOAP.

The integration with these external services is usually done with the help of a framework. For example, in a couple of projects I used WebServiceTemplate. In another project I had Mule ESB calling a web service endpoint, etc.

In most cases I pass a POJO to a method and the framework takes care of everything: marshalling, adding headers, sending. And in some cases the debug info that the framework outputs is not enough to see what is going on on the wire. I need to see the actual message I’m sending to or receiving from the remote host.

For these cases I use tcpdump for capturing traffic, and Wireshark for displaying it. Here is how:

Let’s suppose we are connecting to the Calculator Web Service in

We want to see the exact contents of our request as it leaves our network interface.

We go to a shell window and execute the following as root:

tcpdump -i wlan0 -w /tmp/xxx.dmp -s 1500 dst

where wlan0 is the network interface in use, /tmp/xxx.dmp the file where we store the output, 1500 the capture packet size, and dst takes the destination address for filtering (only packets with this destination will be captured by tcpdump).

We then execute the code that makes the request (in this example I’m making the request with soapUI, calling the add operation on the web service).

After doing the operation, we terminate the running tcpdump, go to Wireshark, and open the file /tmp/xxx.dmp. We see the following:

We can now click on the line with the protocol HTTP/XML and see the contents of the SOAP XML message, the HTTP headers, etc.:

This way we have complete knowledge of what is going on on the wire, and that knowledge can help us debug problems that would be much harder to track down otherwise.


Thursday, November 11, 2010

Hadoop Basics

Hadoop is an open source project for processing large datasets in parallel using low-cost commodity machines.

Hadoop is built on two main parts: a special file system called the Hadoop Distributed File System (HDFS) and the MapReduce framework.

HDFS is a file system optimized for the distributed processing of very large datasets on commodity hardware.

The MapReduce framework processes the data in two main phases: the Map phase and the Reduce phase.

To explain this let's create a sample Hadoop application.

This application will take different English-to-other-language dictionaries (English-Spanish, English-Italian, English-French) and create a dictionary file that has each English word followed by all its translations, pipe-separated.

- The first thing is of course downloading Hadoop. We go to the directory where we want to install Hadoop and download it with wget.

Then we unzip it: tar zxvf hadoop-0.21.0.tar.gz.

- Now we get our dictionary files. I downloaded them from

- The next thing is to put our files in HDFS (this example doesn’t really need this, but I’m doing it just to show how). For this we first need to format a filesystem as HDFS. This is done in the following way:

- We go to the bin directory of Hadoop and execute ./hadoop namenode -format. By default this formats the directory /tmp/hadoop-username/dfs/name.

- After the filesystem is formatted, we need to put our dictionary files into it. Hadoop works better with one large file than with many small ones, so we'll merge the files into one before putting them there.

- Although this would be better done while writing to the Hadoop file system using a PutMerge operation, we are merging the files first and then copying them to HDFS, which is easier, and our example files are small.

1. cat French.txt >> fulldictionary.txt

2. cat Italian.txt >> fulldictionary.txt

3. cat Spanish.txt >> fulldictionary.txt

- To copy the file to hdfs we execute the following command:
./hadoop fs -put /home/cscarioni/Documentos/hadooparticlestuff/fulldictionary.txt /tmp/hadoop-cscarioni/dfs/name/file

- We will now create the actual MapReduce program to process the data. The program will be completely contained in one single Java file, holding both the Map and the Reduce algorithms. Let's see the code and then explain how the MapReduce framework works.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Dictionary{
    public static class WordMapper extends Mapper<Text, Text, Text, Text>{
        private Text word = new Text();
        public void map(Text key, Text value, Context context) throws IOException, InterruptedException{
            StringTokenizer itr = new StringTokenizer(value.toString(), ",");
            while (itr.hasMoreTokens()){
                word.set(itr.nextToken());
                context.write(key, word);
            }
        }
    }

    public static class AllTranslationsReducer extends Reducer<Text, Text, Text, Text>{
        private Text result = new Text();
        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException{
            String translations = "";
            for (Text val : values){
                translations += "|" + val.toString();
            }
            result.set(translations);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception{
        Configuration conf = new Configuration();
        Job job = new Job(conf, "dictionary");
        job.setJarByClass(Dictionary.class);
        job.setMapperClass(WordMapper.class);
        job.setReducerClass(AllTranslationsReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/tmp/hadoop-cscarioni/dfs/name/file"));
        FileOutputFormat.setOutputPath(job, new Path("output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Looking at the code we can see that our class is built of basically three parts: a static class holds the mapper, another static class holds the reducer, and the main method works as the driver of our application. Follow along with the code as you read the next few paragraphs.

First let’s talk about the mapper:

Our mapper is a very standard mapper. A mapper’s main job is to produce a list of key-value pairs to be processed later. Ideally, the keys in this list will be repeated across many elements (produced by this same mapper or by another one whose results are combined with this one’s), so that the next phases of the MapReduce algorithm can make use of them. A mapper receives a key-value pair as a parameter and, as said, produces a list of new key-value pairs.

The key-value pair received by the mapper depends on the InputFormat implementation used. In our example we are using KeyValueTextInputFormat. For each line of the input file, this implementation takes everything up to the first space as the key, and the rest of the line as the value. So if a line contains aaa bbb,ccc,ddd, we get aaa as the key and bbb,ccc,ddd as the value.

From each input to the mapper, the generated list of key-value pairs is the key combined with each of the comma-separated values. That is, for the input aaa bbb,ccc,ddd the output will be the list (aaa,bbb), (aaa,ccc), (aaa,ddd), and so on for each input to the mapper.

The reducer

After the mapper, and before the reducer, the shuffle and sort phases take place. The shuffle phase ensures that every key-value pair with the same key goes to the same reducer, and the grouping step converts all the key-value pairs for the same key into the grouped form key, list(values), which is what the reducer ultimately receives.

A standard reducer’s job is to take the key, list(values) pair, operate on the grouped values, and store the result somewhere. That is exactly what our reducer does: it takes the pair, loops through the values concatenating them into a pipe-separated string, and sends the new key-value pair to the output, so the pair aaa, list(bbb,ccc) is converted to aaa |bbb|ccc and stored out.
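To check this logic outside Hadoop, the two steps can be simulated in plain Java. The helper names below are mine; this is only the string manipulation, not the framework:

```java
import java.util.*;

public class MiniMapReduce {
    // Map step: split a "key value1,value2,..." input into (key, value) pairs.
    static List<Map.Entry<String, String>> map(String key, String csvValues) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        for (String token : csvValues.split(",")) {
            pairs.add(new AbstractMap.SimpleEntry<>(key, token));
        }
        return pairs;
    }

    // Reduce step: concatenate all grouped values into a pipe-separated string.
    static String reduce(String key, List<String> values) {
        StringBuilder translations = new StringBuilder();
        for (String val : values) {
            translations.append("|").append(val);
        }
        return key + " " + translations;
    }

    public static void main(String[] args) {
        // For input "aaa bbb,ccc,ddd" the mapper emits (aaa,bbb), (aaa,ccc), (aaa,ddd)
        System.out.println(map("aaa", "bbb,ccc,ddd"));
        // The reducer turns aaa, list(bbb,ccc) into "aaa |bbb|ccc"
        System.out.println(reduce("aaa", Arrays.asList("bbb", "ccc")));
    }
}
```

Note the leading pipe: it matches what the reducer in the real job produces, since every value is prefixed with "|".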

To run the program, simply run it as a normal Java main class with the Hadoop libs on the classpath (all the jars in the Hadoop home directory and all the jars in the Hadoop lib directory; you can also run the hadoop command with the classpath option to get the full classpath needed). For this first test I used the DrJava IDE.

Running the program in my case generated a file called part-r-00000 with the expected result.

Distributing it:

The MapReduce framework’s main reason for existence is to run the processing of large amounts of data in a distributed manner, on commodity machines. In fact, running it on only one machine doesn’t have much more utility than teaching us how it works.
Distributing the application can be the subject of another, more advanced post.


Sunday, August 22, 2010

Introduction to Semantic Web (Part 2)

In the first part I talked about the characteristics on which the Semantic Web is built. Now I'll introduce a couple of standards and tools that support those characteristics.

The standards and tools I’ll cover are:

RDF
SPARQL
RDFa
RDF (Resource Description Framework): the de facto standard way of representing graph data for the Semantic Web, in a way that is understandable by both humans and machines. It is a language in which everything is represented as a triple of subject, predicate and object, where each part is a resource (or, for objects, possibly a literal value). RDF can be represented in many ways; I’ll use the XML representation. An example RDF document would be:

<rdf:rdf owl=""

dc="" fb=""
xhtml="" about="">

<dc:description>The Other Side of Midnight is a novel .....</dc:description>
<xhtml:license resource="">
<fb:media_common.adapted_work.adaptations resource="">

<dc:creator resource="">
< resource="">

In this RDF we can see the three parts of every triple. The subject in our case is the resource in rdf:about after the namespace declarations. Objects can be literals or resources, and predicates are resources. So in the previous example, one triple would be:


In this case subject, predicate and object are all resources. As we can see this forms a directed graph.

Namespaces can be mixed as we see, so for example we can create our own custom namespace and mix it with the previous graph definition, extending the knowledge modeled.

SPARQL: the language that allows us to query RDF-modeled data. From my point of view it’s a lot like SQL, but specifically for querying data in triple form. To understand it, let’s see an example based on the previous RDF.

PREFIX fb:<”">
PREFIX dc:<””>
PREFIX rdf:<””>
select ?book
where ?book dc:creator

The previous SPARQL query would return all the books created by Sidney Sheldon. Notice that all the where clauses are in the form of triples. Also, binding variables (?book in our case) can be used in more than one triple clause to narrow the results further.

RDFa: a set of additional constructs that allow us to embed RDF in XHTML, in the form of attributes on normal XHTML tags. For example, if we want to expose our knowledge about “The Other Side of Midnight” to the web, allowing potential crawlers to see this information, we can do something like:

<div xmlns:dc="">
<span property="dc:description">The Other Side of Midnight is a novel... </span>
<span rel="dc:creator" resource="">Sidney Sheldon</span>
</div>

These are perhaps the main standards used in the Semantic Web. If you are interested in the subject, you should also take a look at the OWL language and ontologies, and at tools like Jena and Sesame for graph data repositories and servers.

Tuesday, August 17, 2010

Execute an X application remotely.

Today I bought a netbook, and I wanted to do a little programming on a project. I have the project configured on my main computer, where I usually develop.

I didn’t want to create a full environment for the project on the netbook, because of its limited capacity and because I won’t really program from it very often.

So I decided to use the SSH X forwarding facility. This way I could run my main computer’s Eclipse on my netbook.

The steps I followed were very simple:

First, on the main computer:

- install openssh
- edit file /etc/ssh/sshd_config and make sure there is a non-commented line like X11Forwarding yes

That’s it for the main computer (where Eclipse really resides).

Then on my netbook I did:

- Open a terminal and execute:

ssh -X cscarioni@ /home/cscarioni/programas/eclipse/eclipse

Where /home/cscarioni/programas/eclipse/eclipse is where eclipse is installed on the main computer.

And that was it! I’m working in my home environment from my netbook.

Edit: by the way, xauth must be installed on the home desktop computer.

Wednesday, August 4, 2010

Simple introduction to the OAuth protocol at the development level (with Twitter and Foursquare examples)

OAuth is basically a protocol for inter-application exchange of tokens, allowing one application to use the resources and APIs of another application by authenticating between them with these tokens. One application is the OAuth server, and the other one is the OAuth client.

The first step in configuring an OAuth “contract” between the two applications is the configuration of the server application, by registering the client application.

The server creates an “ID” for the client application, by which it will identify it. This ID is formed by two strings known as the “key” and the “secret”. These ID strings are immediately made available to the client application.

The server must also provide three URLs to the client:
Request Token URL
Authorize URL
Access Token URL

It must also register a client callback URL.

How it works

Request Token URL: the URL where the client requests a first token, identifying a “user” within the client application. When this URL is accessed with the key and secret of the client application, it returns a token (oauth_token, basically a string) which will then be used to authorize the client, with that particular token, against the server application. It returns another token (oauth_token_secret) that must be used later, so it has to be stored somehow (usually in the session). The first token will then be used to call the Authorize URL.

Authorize URL: this is the URL the client calls to get authorization for a specific account on the server application. By accessing this URL, normally the authentication mechanism of the server application will pop up and, after logging in, a dialog will ask the user to accept or deny the authorization requested by the client application. After the permission is granted, the server application calls a callback URL on the client application (defined in the server configuration at the beginning of the process), sending it a new token (oauth_token). The final step will be calling the Access Token URL and...

Access Token URL: when the client application receives the callback from the server, it takes the new oauth_token from the request, combines it with the saved oauth_token_secret from the first request, and calls the Access Token URL with both. The response is yet another two tokens, known as the access tokens. These tokens must be stored persistently for future use, and will usually be associated with some user within the client application.

Now users within the client application can access their resources on the server application, sending their two tokens in every request to identify themselves, and of course also sending the key and secret that identify the application.
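All of these requests are signed: the oauth_signature we’ll see in step 1 below is an HMAC-SHA1 over a “signature base string” defined by the OAuth 1.0 spec (method, URL and sorted parameters, percent-encoded and joined with &). As a sketch of just the base-string construction (the class and method names are mine, not a real library API):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class OauthBaseString {

    // OAuth percent-encoding: like URL encoding, but spaces become %20
    // and '~' is left unescaped, '*' is escaped.
    static String encode(String s) throws UnsupportedEncodingException {
        return URLEncoder.encode(s, "UTF-8")
                .replace("+", "%20").replace("*", "%2A").replace("%7E", "~");
    }

    // Base string: HTTP method, encoded URL and encoded sorted parameter
    // string, joined with '&'. This is what gets signed with HMAC-SHA1.
    static String baseString(String method, String url, SortedMap<String, String> params)
            throws UnsupportedEncodingException {
        StringBuilder normalized = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (normalized.length() > 0) normalized.append('&');
            normalized.append(encode(e.getKey())).append('=').append(encode(e.getValue()));
        }
        return method.toUpperCase() + "&" + encode(url) + "&" + encode(normalized.toString());
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        SortedMap<String, String> params = new TreeMap<>();
        params.put("oauth_consumer_key", "key");
        params.put("oauth_nonce", "62603210");
        System.out.println(baseString("GET", "http://example.com/oauth/request_token", params));
    }
}
```

In practice the OAuth libraries (like the python-oauth2 client used later in this post) build this string and compute the signature for you.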

Seeing it on the wire

This is the exchange, using Foursquare as an example of an OAuth server:

1. First the client requests a request token, giving the key of the application and the request signature:

GET /oauth/request_token?oauth_nonce=62603210&oauth_timestamp=1280930229&oauth_consumer_key=NO2ML1KLHEEPWMUQZPZSIOIGWX3LEBOTBYXTSIHECZ02WVL1&oauth_signature_method=HMAC-SHA1&oauth_version=1.0&oauth_signature=PRFQvAJngqdrrj3RIM%

2. The server answers OK and returns the tokens. Session cookies like XSESSIONID are also returned (omitted).

Expert Info (Chat/Sequence): HTTP/1.1 200 OK\r\n

3. The client sends the authorize request with the oauth_token:


4. The server redirects the client to the server’s authentication page, still referencing the oauth_token.

Request URI: /login?continue=%2Foauth%2Fauthorize%3Foauth_token%3D2BTJEXPZHWRY2CJD5YVHOQUEMD0WVPNKASLP2WDCLMGYN2ZB

5. After the user logs into the server application, they are redirected to the allow/deny page.

6. When the user grants the permission, the server calls the callback URL on the client, sending it the new oauth_token.

7. With this new token, and with the secret token from step 2, a call is made to the Access Token URL on the server.


8. The server returns the last two tokens, which must be saved in the client for future use.

The following is an example in Python of OAuth against Foursquare and Twitter.

First, a couple of classes with the details:

__author__ = 'cscarioni'

import oauth2 as oauth
import cgi as urlparse
from django.utils import simplejson
from StringIO import StringIO

class BaseOauth(object):

    def __init__(self, oauthKey, oauthSecret, requestURL, accessURL, authorizeURL):
        self.oauthKey = oauthKey
        self.oauthSecret = oauthSecret
        self.requestURL = requestURL
        self.accessURL = accessURL
        self.authorizeURL = authorizeURL

    def requestAuthorization(self, dictionary):
        consumer = oauth.Consumer(self.oauthKey, self.oauthSecret)
        client = oauth.Client(consumer)
        resp, content = client.request(self.requestURL, "GET")

        if resp['status'] != '200':
            raise Exception("Invalid response %s." % resp['status'])

        request_token = urlparse.parse_qs(content)
        # Keep the token secret around (e.g. in the session) for the access token step
        dictionary['oauth_token_secret'] = request_token['oauth_token_secret'][0]

        return "%s?oauth_token=%s" % (self.authorizeURL, request_token['oauth_token'][0])

    def getAccessTokens(self, oauthtoken, dictionary):
        consumer = oauth.Consumer(self.oauthKey, self.oauthSecret)
        token = oauth.Token(oauthtoken, dictionary['oauth_token_secret'])
        client = oauth.Client(consumer, token)

        resp, content = client.request(self.accessURL, "POST")

        if resp['status'] != '200':
            raise Exception("Invalid response %s." % resp['status'])

        access_token = urlparse.parse_qs(content)
        return access_token["oauth_token"][0], access_token["oauth_token_secret"][0]

    def requestWithOauthJSON(self, url, key, secret, http_method="GET", post_body=None, http_headers=None):
        consumer = oauth.Consumer(self.oauthKey, self.oauthSecret)
        token = oauth.Token(key, secret)
        client = oauth.Client(consumer, token)

        resp, content = client.request(url, http_method)

        return simplejson.load(StringIO(content))

class FoursquareOauth(BaseOauth):
    def __init__(self):
        # Fill in the Foursquare application key/secret and the three endpoint URLs
        BaseOauth.__init__(self, 'key', 'secret',
                           'request_token_url', 'access_token_url', 'authorize_url')

class TwitterOauth(BaseOauth):
    def __init__(self):
        # Fill in the Twitter application key/secret and the three endpoint URLs
        BaseOauth.__init__(self, 'key', 'secret',
                           'request_token_url', 'access_token_url', 'authorize_url')

These can be used from client code like:

def associateFoursquareAccount(request):
return HttpResponseRedirect(foursquareOauth.requestAuthorization(request.session))

def foursquareCallback(request):

user=User.gql("WHERE login = :login",login=request.session["userlogin"])


return HttpResponseRedirect("/")

Saturday, July 10, 2010

An Introduction to Semantic Web (I)

Introduction to Semantic Web (Part I)

(Based on the book Programming the Semantic Web. from O'Reilly)

What is the Semantic Web?

Semantics refers basically to the meaning of whatever is under evaluation. In a few words, something has semantic content if it has a meaning within the context in which it is being evaluated.

For example: “34” doesn’t have any real meaning at all; it’s just a number.
On the other hand, “34 degrees Celsius” has a meaning that anybody can understand (it’s hot).

The Semantic Web tries to integrate semantic expressions into the Web, giving meaning to the data on it instead of just showing meaningless raw data. In this way, knowledge instead of data can be exposed and shared by machines with humans or with other machines.

Knowledge representation.

To represent knowledge instead of data, the data must be delivered with its meaning. The meaning of the data is actually data about the data, which is known as metadata.
So to represent knowledge on the Web we must provide the descriptive metadata along with the actual data. This is important to understand: unlike a database (for example), where the meaning of the data comes somehow from the table and column names, in our representations the metadata goes hand in hand with the actual data.

The basic construct for the representation of semantic data is the Triple. A Triple is nothing more than a language construction of the form (subject, predicate, object), where “subject” is normally an entity, and “predicate” is an attribute or characteristic applied to the subject over the object, which can be an entity with further relations, or a literal final value, like a String.

For example: Aragorn son_of Arathorn

With this simple representation we can describe infinite relations and meaningful semantic attributes on infinite entities, originating a complex directed graph of semantic relationships between nodes.

The idea is to represent everything we can as Triples, and save them to a Triple Store. There are many of these triple stores online, but we’ll see this in the next article.

Example of a mini triple store for our Lord of the Rings Triples:

As you can see, the idea is to define a standard language for a particular domain and make it easy to extend and share. The structure itself of the Triple Store makes it easy to extend (by adding new Triples), and the sharing comes from standardizing the language used for the particular domain (which we’ll see in the next article). This sharing and extending is fundamental to the Semantic Web ideas.

As we mentioned before, to represent semantic data and relationships we use directed graphs. For the Triple Store from before, the graph would be something like:

This kind of structure allows us to answer questions like the following, in natural language and in Triple form.

Natural language: Who takes care of Frodo?
Triple: ? takes_care_of Frodo
Meaning: Returns all Triples that make the sentence true. In our case:
Aragorn takes_care_of Frodo

Natural language: Who is a Hobbit and a ring bearer?
Triples: ?a is_a Hobbit
?a bearer_of Ring
Meaning: Like the previous case, but this time with a binding variable which must make both sentences true. In our case this is true just for Frodo:
Frodo is_a Hobbit
Frodo bearer_of Ring

Natural language: Who is Sam a friend of, who is also a Hobbit?
Triples: Sam friend_of ?a
?a is_a Hobbit
Meaning: Like the previous case, but with the binding variable in different positions:
Sam friend_of Frodo
Frodo is_a Hobbit

Natural language: Who is a friend of Frodo?
Triple: ? friend_of Frodo
Meaning: A Triple with multiple results:
Sam friend_of Frodo
Pippin friend_of Frodo
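These query patterns can be sketched in a few lines of Python. This is a hypothetical toy implementation, not a standard API; the store contents and the matching logic are my own:

```python
# The Triples used in the questions above.
triples = [
    ("Aragorn", "takes_care_of", "Frodo"),
    ("Frodo", "is_a", "Hobbit"),
    ("Frodo", "bearer_of", "Ring"),
    ("Sam", "friend_of", "Frodo"),
    ("Pippin", "friend_of", "Frodo"),
]

def match(pattern, triple, bindings):
    """Unify one (s, p, o) pattern against a triple. Terms starting
    with '?' are binding variables; returns the extended bindings on
    success, or None on mismatch."""
    result = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if term in result and result[term] != value:
                return None
            result[term] = value
        elif term != value:
            return None
    return result

def query(patterns):
    """Return every set of bindings that makes all patterns true."""
    solutions = [{}]
    for pattern in patterns:
        solutions = [b for bindings in solutions
                     for t in triples
                     if (b := match(pattern, t, bindings)) is not None]
    return solutions

# "Who takes care of Frodo?"  ->  [{'?who': 'Aragorn'}]
print(query([("?who", "takes_care_of", "Frodo")]))
# "Who is a Hobbit and a ring bearer?" The binding variable must make
# both sentences true, so only Frodo qualifies  ->  [{'?a': 'Frodo'}]
print(query([("?a", "is_a", "Hobbit"), ("?a", "bearer_of", "Ring")]))
```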

Inferring new Triples

From the data and assertions existing in our triple store, as well as from our knowledge of the specific domain, we can build rules that help us create new knowledge in our store.

For example, we can define a rule saying "If X is a friend of Frodo, then X is a friend of Sam". By implementing this inference rule, we will generate new Triples of knowledge in our store.
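That rule can be sketched as a small function over a set of Triples; this is hypothetical illustration code, not the article's Groovy implementation:

```python
triples = {
    ("Sam", "friend_of", "Frodo"),
    ("Pippin", "friend_of", "Frodo"),
}

def friends_of_frodo_rule(store):
    """Rule: if X is a friend of Frodo, then X is a friend of Sam.
    Returns only the Triples not already in the store (Sam himself
    is excluded)."""
    inferred = {(s, "friend_of", "Sam")
                for (s, p, o) in store
                if p == "friend_of" and o == "Frodo" and s != "Sam"}
    return inferred - store

# Running the rule adds the new knowledge to the store:
# ('Pippin', 'friend_of', 'Sam') is inferred.
triples |= friends_of_frodo_rule(triples)
```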

Graph merging and linked data

Another option all this allows is graph merging. For example, let's suppose we find somewhere a Triple Store referring to the novel "The Hobbit". It could be merged with our store to create more knowledge:

For example, our Triple Store:

Frodo nephew_of Bilbo

The Hobbit:

Bilbo saves dwarves

By relating both Bilbos, we can add this new knowledge to our Triple Store.
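Because both stores use the same identifier for Bilbo, the merge itself is trivial; a sketch, assuming simple string identifiers with no naming conflicts to reconcile:

```python
# Our Triple Store and one found for "The Hobbit" novel.
ours = {("Frodo", "nephew_of", "Bilbo")}
the_hobbit = {("Bilbo", "saves", "dwarves")}

# Both graphs name the same node "Bilbo", so a set union links them.
merged = ours | the_hobbit

# The merged graph can now answer questions neither original graph
# could, e.g. by following the path Frodo -> Bilbo -> dwarves.
```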

This introduction focused on describing how the essentials of the Semantic Web work. In the next article we'll see the standard languages and tools that use these concepts to deliver and share content on the Web.

Sharing will be machine-to-human and, most importantly, machine-to-machine.

We’ll see public Triplestores and how to merge them, query them, make inferences out of them and more.


Monday, July 5, 2010

Introduction to the Semantic Web from a programmer's point of view (Part I)

(Based on the O'Reilly book Programming the Semantic Web)

What is the Semantic Web?

Semantics basically refers to the meaning of any given element. In short, something has semantic content if it has meaning within the specific context in which it is being evaluated.

For example, "34" has no meaning beyond being a number.

However, "34 degrees Celsius of temperature" has a meaning that anyone can easily understand and interpret (it's hot).

The Semantic Web tries to integrate semantic expressions into the Web, giving meaning to the data found on it instead of merely presenting that data without meaning. This allows knowledge, rather than raw data, to be represented and shared between machines.

Knowledge representation

To represent knowledge rather than data, we need to give meaning to that data. Information about data is what we know as metadata. Therefore, to represent knowledge on the Web, we must provide both the data that forms the knowledge and the metadata that describes that data. It is important to understand that the metadata explicitly accompanies the data; that is, the semantics travel with the data.

The basic building block for representing knowledge is the Triple. A Triple is nothing more than a construction of the form (subject, predicate, object), where the subject is normally an entity, the predicate is an attribute or characteristic of that entity, and the object can be either an entity, which may relate to others, or a final literal value, such as a string or a number.

For example: Aragorn son_of Arathorn

With this simple representation we can describe infinite relations and semantic attributes over different entities, originating a complex directed graph of semantic relations. All the Triples that are generated are saved in a structure known as the Triple Store.



As you can see in the previous example, the idea is to define a standard language for a particular domain which, as the example itself shows, is very easy to extend and share, and that is the essence of the Semantic Web.
Easy to extend, because adding new relations only requires adding one more Triple to the store.
Easy to share, because it is a simple language, standardized around a domain, that everyone can use to refer to that domain.

As we mentioned before, semantic data is represented using directed graphs. The graph for the previous example would look like this:

This structure would allow us to answer the following questions, posed in natural language and as Triples:

Natural language: Who takes care of Frodo?
Triple: ? takes_care_of Frodo
Meaning: Returns all the Triples that match the predicate. In this case:
Aragorn takes_care_of Frodo

Natural language: Who is a Hobbit and bears the Ring?
Triples: ?a is_a Hobbit
?a bearer_of Ring
Meaning: Like the previous case, but this time with a binding variable that must satisfy both predicates. In this case only Frodo satisfies them:
Frodo is_a Hobbit
Frodo bearer_of Ring

Natural language: Who is Sam a friend of, who is also a Hobbit?
Triples: Sam friend_of ?a
?a is_a Hobbit
Meaning: Like the previous case, but with the binding variables in different positions:
Sam friend_of Frodo
Frodo is_a Hobbit

Natural language: Who is a friend of Frodo?
Triple: ? friend_of Frodo
Meaning: A Triple with multiple possible results that satisfy the "equation":
Sam friend_of Frodo
Pippin friend_of Frodo

Inferring new Triples

From the data and assertions in our store, as well as from our knowledge of the world and of the particular domain, we can create rules that help us generate new knowledge in our store.

For example, we can define a rule saying "If X is a friend of Frodo, then X is a friend of Sam". By implementing this inference rule, new Triples will be generated for each Triple about Frodo's friends.

Graph merging

Another option this approach allows is graph merging. For example, suppose we receive a set of Triples about the novel "The Hobbit". It could easily be merged with our Triple Store, making our knowledge base bigger and more complete.

For example, in our graph we might have:

Frodo nephew_of Bilbo

In The Hobbit:

Bilbo saves dwarves

Since the identifier "Bilbo" relates them, we can make this last Triple part of our knowledge base, thereby merging the graphs.

We will develop our own Triple Store in Groovy to see how this knowledge is stored and accessed. In this implementation the Triples will be kept in memory. We will create the system for querying the data, the system for creating the bindings between variables, and the mechanism for defining inference rules.

Triple store API

//Add a Triple to the Triple Store

void add(subject,predicate,object)

// Runs a query, passing the method a series of clauses of the form "?someone son_of Arathorn and ?someone friend_of Frodo", and returns the Triples that make all the predicates true together.

def query(clauses)

// Add a class implementing an inference rule.

void add(inferenceRule)

//Execute the inference rules and load the new Triples:

void infereTriples()
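As an illustration only (the real implementation is the Groovy one described above, whose source is not included here), a minimal Python sketch of those four operations; the method names mirror the post, and the matching logic is my own assumption:

```python
class TripleStore:
    """Minimal in-memory Triple Store mirroring the API described above."""

    def __init__(self):
        self.triples = set()
        self.rules = []

    def add(self, subject, predicate, obj):
        """Add a Triple to the store."""
        self.triples.add((subject, predicate, obj))

    def add_rule(self, rule):
        """Register an inference rule: a function from the current set
        of triples to a set of new triples."""
        self.rules.append(rule)

    def query(self, clauses):
        """Return every variable binding ('?x' terms) that makes all
        the clauses true at the same time."""
        solutions = [{}]
        for clause in clauses:
            next_solutions = []
            for bindings in solutions:
                for triple in self.triples:
                    b = dict(bindings)
                    for term, value in zip(clause, triple):
                        if term.startswith("?"):
                            if b.get(term, value) != value:
                                break  # variable already bound to another value
                            b[term] = value
                        elif term != value:
                            break  # constant term does not match
                    else:
                        next_solutions.append(b)
            solutions = next_solutions
        return solutions

    def infere_triples(self):
        """Run every inference rule until no new Triple is produced."""
        while True:
            new = set()
            for rule in self.rules:
                new |= rule(self.triples) - self.triples
            if not new:
                return
            self.triples |= new
```

Usage would look like: `store.add("Aragorn", "son_of", "Arathorn")` followed by `store.query([("?x", "son_of", "Arathorn")])`, which returns the binding for `?x`.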

The Groovy source code will be available at

We have seen a rudimentary, hand-made implementation of a Triple Store in Groovy, which shows the essence of how the Semantic Web works. However, it is a completely personal, non-distributable solution. In the next installment we will see how using standards lets us apply this knowledge to build semantic data stores that we can share with other people and, more importantly, with machines.

The next installment will therefore focus on standardization and on making our knowledge of a domain shareable over the Internet. We will see how to merge existing graphs, along with many other things related to the tools and technologies available for handling semantic data.

Tuesday, April 27, 2010

TDD Presentation


Last Friday I gave an introductory presentation on TDD with my company Paradigma Tecnologico.

The recording of the presentation is at the following URL

Part 1:

You can find the presentation with the source code used at this URL:

Regards

Wednesday, February 10, 2010

Useful commands

In this post I'll collect all the super useful commands I use in my daily work. This is an incremental post: whenever I find myself using a useful command, I'll put it here.

Delete the carriage returns from CRLF (Windows carriage return and line feed) line endings so a file is usable on Linux

tr -d '\r' < in_file > out_file

Command to delete all .svn directories recursively

find . -name .svn | gawk '{print "rm -rf " $0}' | bash

SQL command to disable all FK constraints

select 'alter table ' || table_name ||' disable constraint ' || constraint_name || ';' from (select constraint_name,table_name from user_constraints where constraint_name like 'FK_%')

Capturing Network Traffic

tcpdump -i eth0 -w /tmp/xxx.dmp -s 1500

The file xxx.dmp can then be opened with Wireshark.

The following ones are not mine; I found them on the Web.

Adding svn files recursively

svn status | grep "^\?" | awk '{print $2}' | xargs svn add

Disable constraints that reference a specific table or tables

select 'alter table '||a.owner||'.'||a.table_name||
' disable constraint '||a.constraint_name||';'
from user_constraints a, user_constraints b
where a.constraint_type = 'R'
and a.r_constraint_name = b.constraint_name
and a.r_owner = b.owner
and b.table_name like 'DEYDE%';

Get all open ports and the processes associated with them:

sudo lsof -i -P

Get the certificate from an SSL HTTPS site

openssl s_client -showcerts -connect HOST:443 >/tmp/ukash.cert

Importing the certificate into the Java cacerts keystore

keytool -import -alias joe -file server.crt -keystore /home/user/jdk1.5.0_06/jre/lib/security/cacerts

Remember the keystore password is by default "changeit"

To see which process is using a particular port, just run the following

lsof -i tcp:8443

To set up a git daemon on your local repo, execute this from the root of your project repo (or a folder above it)

git daemon --reuseaddr --base-path=. --export-all --verbose --enable=receive-pack

Then, from another machine or the same one, you can clone it:

git clone git://localhost:9418/ something

Obtain a access token for Github API Oauth (using client_id and client_secret of a registered application): curl -i -X POST -d '{"client_id":"xxxxxx", "client_secret":"xxxxxx"}' -u calo81 Git grep and replace words in your git repo: git grep -l 'original_text' | xargs sed -i '' 's/original_text/new_text/g' do a find excluding some file extensions find /logs/xxx*/home/some/app/shared/log/log/yyy*log* ! -iname "*.gz" Remove a file from the whole git history of your repo: git filter-branch --index-filter 'git update-index --remove postcodes.csv'