The file looked something like this:
movie-a Comedy
movie-b Comedy
movie-a Romance
I wanted to produce a file of the form
movie-a [Comedy,Romance]
ignoring the movies that don't include both genres.
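To make the target concrete, here is a minimal in-memory sketch of the same transformation in plain Ruby. It assumes the input lines are tab-separated (as the mapper further down expects), and the method name `movies_with_both_genres` is only illustrative. It is practical only while the whole file fits in memory.

```ruby
# Illustrative helper (hypothetical name): returns "movie\t[genres]" lines
# for the movies tagged with both Comedy and Romance.
def movies_with_both_genres(lines)
  # Collect every genre seen for each movie.
  genres_by_movie = Hash.new { |hash, movie| hash[movie] = [] }
  lines.each do |line|
    movie, genre = line.chomp.split("\t", 2)
    genres_by_movie[movie] << genre
  end

  # Keep only the movies that have both genres and format them.
  genres_by_movie
    .select { |_, genres| genres.include?("Comedy") && genres.include?("Romance") }
    .map { |movie, genres| "#{movie}\t[#{genres.join(',')}]" }
end

puts movies_with_both_genres(["movie-a\tComedy", "movie-b\tComedy", "movie-a\tRomance"])
```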
I immediately thought of using Hadoop: even though the file was not huge, the map reduce model seemed a good fit for the problem.
I had done some small work with Hadoop in Java, but this project was Ruby based and I wanted to keep using Ruby for the Hadoop job too, so I used Hadoop's streaming API to solve the problem.
I needed to develop a fast and easy solution. After consulting some books and the Web I came to a very simple one. What follows is the code:
map.rb
    # Emit "movie<TAB>genre" for every input line.
    ARGF.each do |line|
      begin
        parts = line.split("\t")
        puts parts.first + "\t" + parts.last
      rescue
        puts 'error'
      end
    end
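reduce.rb
    current_key = nil
    current_key_values = []

    # Print the movie only when it has both genres.
    def emit(key, values)
      if values.include?("Comedy") && values.include?("Romance")
        puts key + "\t[" + values.join(",") + "]"
      end
    end

    ARGF.each do |line|
      key, value = line.chomp.split("\t")
      current_key = key if current_key.nil?
      if current_key == key
        current_key_values << value
      else
        # The key changed, so the previous movie is complete.
        emit(current_key, current_key_values)
        current_key = key
        current_key_values = [value]
      end
    end

    # Flush the last movie, which has no following key to trigger it.
    emit(current_key, current_key_values) unless current_key.nil?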
And the script that launches the job:
    #!/bin/bash
    HADOOP_HOME=/home/cscarioni/programs/hadoop-0.22.0
    JAR=contrib/streaming/hadoop-0.22.0-streaming.jar
    STREAMCOMMAND="$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/$JAR"
    $STREAMCOMMAND \
      -mapper 'ruby map.rb' \
      -reducer 'ruby reduce.rb' \
      -file map.rb \
      -file reduce.rb \
      -input '/home/cscarioni/Downloads/comedy_romance_movies' \
      -output /home/cscarioni/Downloads/comedy_romance_movies_results
The main differences I see between the streaming API and the Java API are:
1. In the streaming API everything goes through the stdin and stdout of the scripts (take a look at the use of ARGF and puts in both map.rb and reduce.rb), just as when we pipe commands together on the command line.
2. In the Java API the results of the map phase arrive already grouped by key; in our case the reduce phase would directly receive something like (movie-a [Romance,Comedy]). In the streaming API, on the other hand, the grouping has to be done manually: what we get is a list of lines sorted by key, so all the movie-a lines are next to each other.
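The second difference can be sketched in plain Ruby, without Hadoop. This is only an illustration under some assumptions: `reduce_grouped` and `reduce_streaming` are hypothetical names, and `Array#sort` stands in for the sort that Hadoop performs between the map and reduce phases. The first method receives one key with its values already grouped (the Java API style); the second receives flat sorted lines and has to find the key boundaries itself (the streaming style).

```ruby
REQUIRED_GENRES = ["Comedy", "Romance"]

# Java-API style: called once per key, with all its values already grouped.
def reduce_grouped(key, values)
  return nil unless REQUIRED_GENRES.all? { |genre| values.include?(genre) }
  "#{key}\t[#{values.join(',')}]"
end

# Streaming style: gets "key\tvalue" lines sorted by key and groups them itself.
def reduce_streaming(sorted_lines)
  pairs = sorted_lines.map { |line| line.chomp.split("\t", 2) }
  pairs
    .chunk_while { |a, b| a[0] == b[0] }   # start a new group when the key changes
    .map { |group| reduce_grouped(group.first[0], group.map { |_, value| value }) }
    .compact                               # drop movies missing a genre
end

mapped = ["movie-a\tComedy", "movie-b\tComedy", "movie-a\tRomance"]
puts reduce_streaming(mapped.sort)  # sort simulates Hadoop's ordering by key
```

`chunk_while` needs Ruby 2.3 or newer; on older versions the explicit current-key bookkeeping of reduce.rb does the same job.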
So there we have a small and functional map reduce job in Ruby with Hadoop.
1 comment:
Have a look at mandy: https://github.com/forward/mandy
It seems to make Hadoop very easy to use from Ruby.