Saturday, February 18, 2012

Basic searching in Ruby with Solr

Solr is a server application built on top of the Apache Lucene search engine. It offers an HTTP interface for storing and querying data.

Internally, Solr (like Lucene, the engine that powers it) works roughly by indexing Documents for later searching and retrieval. A Document is described by a collection of Fields; each of these fields can be individually indexed and/or stored in the index.

The index can be built in different ways, determined mainly by the analyzers used for each field: an analyzer defines how a particular field will be indexed.
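To make the idea of an analyzer and an index a bit more concrete, here is a toy sketch in plain Ruby. It is not how Solr or Lucene are implemented; the `analyze` and `build_index` names are made up for this illustration, and the "analyzer" simply lowercases the text and splits it into words.

```ruby
# Toy model only: this is NOT how Solr or Lucene work internally.
# A hypothetical "analyzer" that lowercases text and splits it into word tokens.
def analyze(text)
  text.downcase.scan(/[a-z0-9]+/)
end

# Build a tiny inverted index: token => ids of the documents containing it.
def build_index(docs, field)
  index = Hash.new { |hash, token| hash[token] = [] }
  docs.each do |doc|
    analyze(doc[field]).uniq.each { |token| index[token] << doc[:id] }
  end
  index
end

docs = [
  { :id => '1', :title => 'Die Hard 3' },
  { :id => '2', :title => 'Lethal Weapon' }
]

index = build_index(docs, :title)
p index['hard']   # ids of the documents whose title contains "hard"
p index['weapon'] # ids of the documents whose title contains "weapon"
```

A different analyzer (for example one that does not lowercase) would produce different tokens and therefore a different index, which is the point made above.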

Of course there is a lot of complexity involved in all this, but this is a basic tutorial, and a basic but functional search solution can be built using the defaults for most options.

This tutorial builds a search of movies by title and/or actor using Ruby and Solr. I will assume you already have Ruby installed, and the gem tool as well.

1. Download and install Solr:

2. Decompress it:
 tar zxvf apache-solr-3.5.0-src.tgz

3. Modify the index to accept the kind of documents we want (movies).
In our example we will be able to query movies by title and actors. The index will also store a summary of the movie, although it won't be searchable by it. So the Document representing a movie will have three Fields describing it, plus an id. To reflect this, go to the directory:

cd apache-solr-3.5.0/solr/example/solr/conf/

then open the file schema.xml with your favorite editor, go down to the <fields> section and replace all the definitions that are there with the following ones:

<fields>
  <field name="id" type="string" indexed="true" stored="true" required="true" />
  <field name="title" type="text_general" indexed="true" stored="true"/>
  <field name="actor" type="text_general" indexed="true" stored="true" multiValued="true"/>
  <field name="summary" type="text_general" indexed="false" stored="true"/>
</fields>

Here we are specifying that our movie Documents will have these four fields. We can see that the type we are using for all of them, except "id" (which is a plain "string"), is "text_general". Going up in the schema.xml file we can find a description of what "text_general" means.
This is extracted directly from that description:

So this is a default analyzer, provided with Solr, that will be good enough for our purposes (and for many purposes).

Two other things are worth mentioning in our field definitions. First, the "actor" field is multivalued, meaning that we can associate more than one actor with the field. Second, the "summary" field is stored but not indexed. This means that the content of the field will be stored (so it can be returned when documents are retrieved) but not indexed (we can't search on this field).
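As a quick illustration of the indexed/stored distinction, the four field definitions can be mirrored in a Ruby hash. This is only a sketch to reason about the schema: the `SCHEMA_FIELDS` constant and the two helper methods are hypothetical and have nothing to do with Solr's own API.

```ruby
# Hypothetical Ruby mirror of the four field definitions in schema.xml.
SCHEMA_FIELDS = {
  'id'      => { :indexed => true,  :stored => true },
  'title'   => { :indexed => true,  :stored => true },
  'actor'   => { :indexed => true,  :stored => true },
  'summary' => { :indexed => false, :stored => true }
}

# Fields we can search on: the ones flagged as indexed.
def searchable_fields(fields)
  fields.select { |_name, opts| opts[:indexed] }.keys
end

# Fields that can come back in a search result: the ones flagged as stored.
def retrievable_fields(fields)
  fields.select { |_name, opts| opts[:stored] }.keys
end

p searchable_fields(SCHEMA_FIELDS)
p retrievable_fields(SCHEMA_FIELDS)
```

With these definitions "summary" shows up only in the second list: it can be returned with a document but cannot be used to find one.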

OK, so this is all the configuration we need in Solr. Let's start the server now.
From the directory apache-solr-3.5.0/example, execute:
 java -jar start.jar

That will start the server, which listens on port 8983 by default.
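Since Solr is queried over HTTP, a search is ultimately just a URL. The following sketch builds such a URL using only Ruby's standard library. The base URL is an assumption taken from the connection string used in this tutorial's Ruby code, `wt=json` asks Solr for a JSON response, and the `select_url` helper is a made-up name.

```ruby
require 'uri'

# Build the URL of a select (query) request.
# The default base URL is an assumption: adjust host, port and path
# to match your own Solr installation.
def select_url(query, base = 'http://localhost:8983/solr/collection1/')
  uri = URI.join(base, 'select')
  uri.query = URI.encode_www_form(:q => query, :wt => 'json')
  uri.to_s
end

puts select_url('title:die*')
```

Pasting the printed URL into a browser is a simple way to check that the server is up and answering queries.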

OK, so let's move to the Ruby side now. We will create a little program that indexes a couple of movies and then searches for them. First install the needed gem:

gem install rsolr

Then let's create a Movie class in a file named "moviesearch.rb":

class Movie
  attr_accessor :id, :title, :actors, :summary

  def initialize
    @actors = []
  end
end

And now let’s create the indexer and searcher classes in the same file:


require 'rsolr'

class Indexer
  def initialize
    @solr = RSolr.connect :url => 'http://localhost:8983/solr/collection1/'
  end

  # Add every movie to the index, then commit so the changes become searchable.
  def index(movies)
    movies.each do |movie|
      @solr.add :id => movie.id, :title => movie.title, :actor => movie.actors
    end
    @solr.update :data => '<commit/>'
  end
end
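The hash passed to `add` uses the field names from schema.xml ("actor"), not the Ruby attribute names ("actors"). One way to keep that mapping in a single place is a small conversion helper. This is only a sketch: `movie_to_doc` is a hypothetical name, and unlike the Indexer above it also sends the summary when one is set.

```ruby
# Same Movie class as in the tutorial, repeated so the sketch is self-contained.
class Movie
  attr_accessor :id, :title, :actors, :summary

  def initialize
    @actors = []
  end
end

# Hypothetical helper: map a Movie onto the field names declared in schema.xml.
def movie_to_doc(movie)
  doc = { :id => movie.id, :title => movie.title, :actor => movie.actors }
  doc[:summary] = movie.summary if movie.summary # stored, but not searchable
  doc
end

movie = Movie.new
movie.id = '1'
movie.title = 'Die Hard 3'
movie.actors << 'Bruce Willis'
p movie_to_doc(movie)
```

If I remember the rsolr README correctly, `add` also accepts an array of such hashes, so `movies.map { |m| movie_to_doc(m) }` could be sent in one call; treat that as something to verify against the version of the gem you have installed.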


class Searcher
  def initialize
    @solr = RSolr.connect :url => 'http://localhost:8983/solr/collection1/'
  end

  # Prefix search on title or actor. The boolean operator must be upper case (OR).
  def search(term)
    term = term.downcase
    response = @solr.get 'select', :params => {:q => "title:#{term}* OR actor:#{term}*"}
    response["response"]["docs"]
  end
end
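One thing to keep in mind is that the search term is interpolated straight into the query string, so a term containing query syntax characters (a colon, a quote, parentheses and so on) can break the query. A minimal sketch of a safer query builder follows. Both helper names are hypothetical, the character list follows the Lucene query syntax documentation as far as I know it, and a term containing spaces would still need extra handling.

```ruby
# Characters with a special meaning in the Lucene query syntax.
# This list is an assumption based on the query parser documentation.
QUERY_SPECIAL_CHARS = /[+\-&|!(){}\[\]^"~*?:\\\/]/

# Prefix every special character with a backslash.
def escape_term(term)
  term.gsub(QUERY_SPECIAL_CHARS) { |char| "\\" + char }
end

# Build the same "title or actor" prefix query used by the Searcher,
# but with the term escaped first.
def build_prefix_query(term)
  escaped = escape_term(term.downcase)
  "title:#{escaped}* OR actor:#{escaped}*"
end

puts build_prefix_query('Die')
puts build_prefix_query('x-men:2')
```

The resulting string could be passed as the `:q` parameter instead of the hand-built one.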

That’s it.

Let’s test it on irb:

1.9.2-p290 :001 > require './moviesearch'
=> true
1.9.2-p290 :013 >   movie_1 = Movie.new
=> #<Movie:0x... @actors=[]>
1.9.2-p290 :014 > movie_1.actors << 'Bruce Willis'
=> ["Bruce Willis"]
1.9.2-p290 :015 > movie_1.actors << "Samuel Jackson"
=> ["Bruce Willis", "Samuel Jackson"]
1.9.2-p290 :016 > movie_1.id = '1'
=> "1"
1.9.2-p290 :017 > movie_1.title='Die Hard 3'
=> "Die Hard 3"
1.9.2-p290 :018 > movie_2 = Movie.new
=> #<Movie:0x... @actors=[]>
1.9.2-p290 :019 > movie_2.actors << 'Mel Gibson'
=> ["Mel Gibson"]
1.9.2-p290 :020 > movie_2.actors << 'Danny Glover'
=> ["Mel Gibson", "Danny Glover"]
1.9.2-p290 :021 > movie_2.id = '2'
=> "2"
1.9.2-p290 :022 > movie_2.title = 'Lethal Weapon'
=> "Lethal Weapon"
1.9.2-p290 :041 >   movie_1.summary = "Great movie"
=> "Great movie"
1.9.2-p290 :042 > movie_2.summary = 'Another great movie'
=> "Another great movie"


1.9.2-p290 :061 > idxr = Indexer.new
1.9.2-p290 :080 >   idxr.index [movie_1,movie_2]
=> {"responseHeader"=>{"status"=>0, "QTime"=>50}}


1.9.2-p290 :085 >   searcher = Searcher.new
1.9.2-p290 :086 > searcher.search 'Die'
=> [{"id"=>"1", "title"=>"Die Hard 3", "actor"=>["Bruce Willis", "Samuel Jackson"]}]
1.9.2-p290 :090 > searcher.search 'Bru'
=> [{"id"=>"1", "title"=>"Die Hard 3", "actor"=>["Bruce Willis", "Samuel Jackson"]}]

1.9.2-p290 :091 > searcher.search 'Glo'
=> [{"id"=>"2", "title"=>"Lethal Weapon", "actor"=>["Mel Gibson", "Danny Glover"]}]
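The search returns plain arrays of hashes, so presenting the results is ordinary Ruby. Here is a small sketch that turns the documents into one line per movie. The `format_results` name is made up, and the sample data is copied from the session above so the snippet runs without a Solr server.

```ruby
# Turn the documents returned by a search into one display line per movie.
def format_results(docs)
  docs.map do |doc|
    actors = Array(doc['actor']).join(', ') # "actor" is multivalued
    "#{doc['title']} (#{actors})"
  end
end

# Sample result copied from the irb session, so no server is needed here.
docs = [
  { 'id' => '1', 'title' => 'Die Hard 3', 'actor' => ['Bruce Willis', 'Samuel Jackson'] }
]

puts format_results(docs)
```

In a real session the argument would be the value returned by `searcher.search`.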
