Full text search in Ruby on Rails

These days, each web application needs a full text search. Fortunately, there is several handy technologies one can use – native MySQL full text index, Google search, Ferret (ruby port of Lucene) and probably lot more I did not investigate.

We have decided to use ferret as the full text search engine. The decision was not straight forward and it taken some time. Well, in fact in our agile approach we started with the most simple option and it was MySQL full text index.

After some time, we came to technical difficulties and limitations that lead us to ferret. Following chapters summarize our decision process.

General description

MySQL Ferret
MySQL contains a full text engine that could be used for indexing database records.
It works only for MyISAM tables.
Ferret is a ruby port of Lucene engine. It could be used to index any type of document – file, web page, database record etc.

Operational characteristics

Performance

Since I was selecting the engine, I was not able to test it in production. So, the following characteristics were taken from blog of Jayant Kumar.

MySQL Ferret
Query – The full text index is stored in memory. So, it is very fast with small number of documents. Once the index grows beyond the physical memory, the performance goes down. Query – Since it is a port of Lucene with compatibility of indexes, we I expect the performance to be similar – see the blog of Jayant Kumar.
Nevertheless, in case of problems it is possible to use native libraries.

Inserts – are slowed down due to the fact that the new record need to be processed and the index needs to be rebuild in the memory Insert – documents can be added on the fly. Insert does not affect the search
Sorting, total number of documents – extremely slow if the number of resulting documents is high.
131,072 of 327,680 records matched the query – count (size of the result set) taken 1.25 minute.
Sorting, total number of documents – fast

Stability

Here is experience from our development. We had use both technologies and there were no major problems with MySQL and a incompatibility problem with ferret.

MySQL Ferret
Stable, no problems were observed during development. There were some difficulties due to the fact, that ferret was in alpha release when we started to use it. Now its beta release is much more stable.
(Beware of version 0.9.3 on Windows XP in combination with ruby 1.8.4. I went through terrible performance problems.
Version 1.8.2 of ruby works OK.
I did not observe similar problem on Linux)

Development

APIs

Both MySQL and Ferret are easy to use. For fast startup it is better to use MySQL. You can just start working, no additional installation and configuration is needed.

MySQL Ferret
Use of existing MySQL APIs. Necessary to use Ferret APIs. Plugin acts_as_ferret does all the work for you.

Document structure

MySQL Ferret
MySQL stores the full text index as a flat structure. So, it is not possible to address indexed columns in the query.
If there is a need for structured document, there is a possibility to create several full text indexes on the data or to use regular SQL.
It is possible to combine full text index and attributes of the record in the where clause of SQL query.
It is possible to address and query document attributes in the index.

Document size

MySQL Ferret
It is not possible to index varchar fields bigger than 1000 bytes. So, if you want to full text index on two varchar(512) fields, the index creation will fail with error message

“ERROR 1071 (42000): Specified key was too long; max key length is 1000 bytes”

Nevertheless it is possible to combine varchar fields with text fields. For example it is possible to create varchar(512) and text fields in the same index.

We did not found the document size limits.

Multi entity indexing

Sometimes it is necessary to index order and order items as one document. In ferret it is easy, in MySQL two indexes needs to be created and used.

MySQL Ferret
It is possible to index only one entity represented by one database record.
Nevertheless, there is an option to create two full text indexed and combine them in SQL query.
Multi entity indexing is possible and easy. In combination with acts_as_ferret plug-in it is extremely simple. Just write into your active record object:

acts_as_ferret :fields => [ ‘order_description’, ‘order_items’]

where order_description and order_items are methods of the Order class.

Query syntax

MySQL Ferret
Querying is built into the SQL engine and is compatible with SQL. Almost no new syntax. New syntax is necessary. Fortunately it is a port of Lucene, so if you know Lucene query syntax, nothing new is requested.

Query types

Again taken from the blog of Jayant Kumar. Writing it on my own would be reinventing a wheel.

MySQL Ferret
  • Boolean – simple boolean query where you can specify if the keyword should be in the document “+keyword” or may not be there “-keyword”.
  • With query expansion – Full-text search supports query expansion (and in particular, its variant “blind query expansion”). This is generally useful when a search phrase is too short, which often means that the user is relying on implied knowledge that the full-text search engine lacks. For example, a user searching for “database” may really mean that “MySQL”, “Oracle”, “DB2”, and “RDBMS” all are phrases that should match “databases” and should be returned, too. This is implied knowledge.
    Blind query expansion (also known as automatic relevance feedback) is enabled by adding WITH QUERY EXPANSION following the search phrase. It works by performing the search twice, where the search phrase for the second search is the original search phrase concatenated with the few most highly relevant documents from the first search. Thus, if one of these documents contains the word “databases” and the word “MySQL”, the second search finds the documents that contain the word “MySQL” even if they do not contain the word “database”.

  • Proximity search – find documents where there is one word between searchword1 and searchword2
  • Wild card search – find documents which have word like searchword* or maybe search?word etc etc…
  • Fuzzy/similarity searches – find documents with words sounding similar to roam~ (will look for roam, foam etc…)
  • Term boosting – you can boost a term to move relevant documents to the top. So for example, you can say that you want documents with word “lucene” to be more relevant than those with word “mysql”. Then you can do something like -> lucene^4 mysql .

Pagination

Hard to avoid, especially if you plan to have your web site full of documents…

MySQL Ferret
In MySQL the pagination of full text search is for free. It is the same as any other Active record query.
But in fact, there is a problem with performance if the result is too big. 131,072 of 327,680 records – count (size of the result set) takes 1.25 minute.

Pagination needs to be built. Fortunately there is several HOW-TOs on the web.

Using two different indexes for search

We do not use this feature, but it is fair to say MySQL supports it easily, ferret does not at all.

MySQL Ferret
Possible as any other condition in SQL query. Not possible. One index needs to be created from all of the documents.

Maintenance

Backup and restore

MySQL Ferret
The index is part of the database. So, the backup and restore is part of the database backup. A backup/restore procedure needs to be developed. But in fact, it could be the same as the MyISAM tables backup – using lock tables and copy the files.

Time to recreate index from database data

Sometimes it is necessary to rebuild the index from raw documents (e.g. Adding new field).

MySQL Ferret
Fast, since all data are stored in the DB server Medium to slow. If the data are stored in one table only, the index recreation is medium. If the indexed data belongs to more DB tables, the index recreation is slow and could be very slow if the structure is complex

Miscellaneous

And the rest of the observations that does not fit into other categories.

MySQL Ferret
MySQL indexes by default words longer than 3 letters. This could be changed, but the index will grow.
MySQL comes with pre-defined set of stop words.
If a word is too frequent, it has zero weight and query asking for this word returns empty result set. This is strange and gave me hard time to understand why my query is not working. Especially, working with testing data is tricky.

Ferret index is DB independent.

I hope this will help you to select the right technology for your web.

15 comments

  1. Thanks for a great comparison. I had to go through making the same decision a while ago and I’m sure a lot of people will have a better time figuring out which way to go now. Thanks!

  2. Solr is another Lucene-based option that is becoming popular with some Ruby folks. We even have Ruby output support now (think JSON specifically for Ruby).

  3. This is a great comparison. I have been researching the different types of Ruby search engines to integrate with Rails.

    One thing I am curious about lucene — can you define different weights to the field the terms are found in? It would make sense to weight the title of a post higher then a term being found just in the body of a post, for instance.

  4. Nice comparison !

    But one point I don’t understand: What do you mean by “Using two different indexes for search” ?

    You can specify which fields to search in in your queries, i.e. the query “title:test” will only match documents having ‘test’ in their title. Acts_as_ferret will even let you strip down your search results further with the usual ActiveRecord #find-options hash.

    Besides that, you can create as many Ferret indexes as you like for any special purposes that may occur, but most of the time one index per model (or even one index for all models) will do.

  5. 2 Jens Krämer: Yes, you are right. I can create one index for the whole application.
    Just wanted to say that one should consider very well 2 indexes in ferret, because they cannot be combined.

  6. Hi, inspired by acts_as_ferrit and acts_as_solr, I’m implementing an acts_as_zoom plugin to talk to Z39.50 standard based search databases for full-text search.

    I’ll be using the ruby-zoom (http://ruby-zoom.rubyforge.org/) stuff along with a decent amount of custom code.

    The big difference in my approach is that I’ll not only be allowing for searching of your own site’s Z39.50 search database, but an arbitrary number of databases that are setup as ZoomDb models. In other words, federated search.

    Currently, I have a blog post about setting up a Zebra server and installing ruby-zoom on Debian testing (etch) up at http://blog.katipo.co.nz/?p=26 , but also stay tuned on that blog for more details about acts_as_zoom.

    This for an opensource project called Kete. You can find out more about that here:

    http://blog.kete.net.nz/

    Thanks for the comparison.

    Cheers,
    Walter McGinnis

  7. thanks a lot – this document was *exactly* what I was looking for. Thank you so much for your effort in creating it.

  8. hi, i got a problem in mysql phrase search. i got two hundrad thousand row with text column in my database. its fulltext indexing structure, but when i going to search by phrase search in that by this query :
    “select file_name from database_table where match(field1) against(‘”phrase search word”‘ in boolean mode)”
    it take a too much time almost 1 and half minute. its same time in word search. so what could i do to search fast. plz give m replay

  9. 2 Ashish Dave

    It is hardly to say what is the reason. One of them might be the amount of memory you have on your machine.

    Here is what Jayant Kumar writes on his blog: “Mysql uses RAM to cache the index and use it during serving a query. So, if the size of your fulltext index exceeds the RAM, you will experience a major fall in the search performance. “

    It is also connected with minimum length of word to be indexed. Default is 4. Smaller words are not indexed. For more details see http://jayant7k.blogspot.com/2006/05/mysql-fulltext-search-versus-lucene.html

  10. You should consider PostGres full text search – multiple fields are easy to index, results for multi-word searches can be prioritized by proximity, conditionals are easy, it has base word dictionaries (so it can match runner and run), etc.

  11. Please take a look at this

    http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes

    You can create a full-text database search service, return results as HTML/XML/JSON. It uses the Lucene directly in java, but can be easily used with Ruby, PHP, or any existing database web applicatoins.

    You can easily index, re-index, incremental-index. It’s also highly scalable.

    The best thing is, it’s super easy. You can create a production-level search in 3 minutes.

Comments are closed.