These days, each web application needs a full text search. Fortunately, there is several handy technologies one can use – native MySQL full text index, Google search, Ferret (ruby port of Lucene) and probably lot more I did not investigate.
We have decided to use ferret as the full text search engine. The decision was not straight forward and it taken some time. Well, in fact in our agile approach we started with the most simple option and it was MySQL full text index.
After some time, we came to technical difficulties and limitations that lead us to ferret. Following chapters summarize our decision process.
General description
MySQL | Ferret |
---|---|
MySQL contains a full text engine that could be used for indexing database records. It works only for MyISAM tables. |
Ferret is a ruby port of Lucene engine. It could be used to index any type of document – file, web page, database record etc. |
Operational characteristics
Performance
Since I was selecting the engine, I was not able to test it in production. So, the following characteristics were taken from blog of Jayant Kumar.
MySQL | Ferret |
---|---|
Query – The full text index is stored in memory. So, it is very fast with small number of documents. Once the index grows beyond the physical memory, the performance goes down. |
Query – Since it is a port of Lucene with compatibility of indexes, we I expect the performance to be similar – see the blog of Jayant Kumar. Nevertheless, in case of problems it is possible to use native libraries. |
Inserts – are slowed down due to the fact that the new record need to be processed and the index needs to be rebuild in the memory | Insert – documents can be added on the fly. Insert does not affect the search |
Sorting, total number of documents – extremely slow if the number of resulting documents is high. 131,072 of 327,680 records matched the query – count (size of the result set) taken 1.25 minute. |
Sorting, total number of documents – fast |
Stability
Here is experience from our development. We had use both technologies and there were no major problems with MySQL and a incompatibility problem with ferret.
MySQL | Ferret |
---|---|
Stable, no problems were observed during development. |
There were some difficulties due to the fact, that ferret was in alpha release when we started to use it. Now its beta release is much more stable. (Beware of version 0.9.3 on Windows XP in combination with ruby 1.8.4. I went through terrible performance problems. Version 1.8.2 of ruby works OK. I did not observe similar problem on Linux) |
Development
APIs
Both MySQL and Ferret are easy to use. For fast startup it is better to use MySQL. You can just start working, no additional installation and configuration is needed.
MySQL | Ferret |
---|---|
Use of existing MySQL APIs. | Necessary to use Ferret APIs. Plugin acts_as_ferret does all the work for you. |
Document structure
MySQL | Ferret |
---|---|
MySQL stores the full text index as a flat structure. So, it is not possible to address indexed columns in the query. If there is a need for structured document, there is a possibility to create several full text indexes on the data or to use regular SQL. It is possible to combine full text index and attributes of the record in the where clause of SQL query. |
It is possible to address and query document attributes in the index. |
Document size
MySQL | Ferret |
---|---|
It is not possible to index varchar fields bigger than 1000 bytes. So, if you want to full text index on two varchar(512) fields, the index creation will fail with error message
“ERROR 1071 (42000): Specified key was too long; max key length is 1000 bytes” Nevertheless it is possible to combine varchar fields with text fields. For example it is possible to create varchar(512) and text fields in the same index. |
We did not found the document size limits. |
Multi entity indexing
Sometimes it is necessary to index order and order items as one document. In ferret it is easy, in MySQL two indexes needs to be created and used.
MySQL | Ferret |
---|---|
It is possible to index only one entity represented by one database record. Nevertheless, there is an option to create two full text indexed and combine them in SQL query. |
Multi entity indexing is possible and easy. In combination with acts_as_ferret plug-in it is extremely simple. Just write into your active record object:
acts_as_ferret :fields => [ ‘order_description’, ‘order_items’] where order_description and order_items are methods of the Order class. |
Query syntax
MySQL | Ferret |
---|---|
Querying is built into the SQL engine and is compatible with SQL. Almost no new syntax. | New syntax is necessary. Fortunately it is a port of Lucene, so if you know Lucene query syntax, nothing new is requested. |
Query types
Again taken from the blog of Jayant Kumar. Writing it on my own would be reinventing a wheel.
MySQL | Ferret |
---|---|
|
|
Pagination
Hard to avoid, especially if you plan to have your web site full of documents…
MySQL | Ferret |
---|---|
In MySQL the pagination of full text search is for free. It is the same as any other Active record query. But in fact, there is a problem with performance if the result is too big. 131,072 of 327,680 records – count (size of the result set) takes 1.25 minute. |
Pagination needs to be built. Fortunately there is several HOW-TOs on the web. |
Using two different indexes for search
We do not use this feature, but it is fair to say MySQL supports it easily, ferret does not at all.
MySQL | Ferret |
---|---|
Possible as any other condition in SQL query. | Not possible. One index needs to be created from all of the documents. |
Maintenance
Backup and restore
MySQL | Ferret |
---|---|
The index is part of the database. So, the backup and restore is part of the database backup. | A backup/restore procedure needs to be developed. But in fact, it could be the same as the MyISAM tables backup – using lock tables and copy the files. |
Time to recreate index from database data
Sometimes it is necessary to rebuild the index from raw documents (e.g. Adding new field).
MySQL | Ferret |
---|---|
Fast, since all data are stored in the DB server | Medium to slow. If the data are stored in one table only, the index recreation is medium. If the indexed data belongs to more DB tables, the index recreation is slow and could be very slow if the structure is complex |
Miscellaneous
And the rest of the observations that does not fit into other categories.
MySQL | Ferret |
---|---|
MySQL indexes by default words longer than 3 letters. This could be changed, but the index will grow. MySQL comes with pre-defined set of stop words. If a word is too frequent, it has zero weight and query asking for this word returns empty result set. This is strange and gave me hard time to understand why my query is not working. Especially, working with testing data is tricky. |
Ferret index is DB independent. |
I hope this will help you to select the right technology for your web.
Thanks for a great comparison. I had to go through making the same decision a while ago and I’m sure a lot of people will have a better time figuring out which way to go now. Thanks!
Solr is another Lucene-based option that is becoming popular with some Ruby folks. We even have Ruby output support now (think JSON specifically for Ruby).
This is a great comparison. I have been researching the different types of Ruby search engines to integrate with Rails.
One thing I am curious about lucene — can you define different weights to the field the terms are found in? It would make sense to weight the title of a post higher then a term being found just in the body of a post, for instance.
Great and interessting comparison. A good information source when you have to decide what you want to use.
Thanks for that!
Excellent comparison with lucense. I think it helps people understand better the technology they are getting themselves into.
Nice comparison !
But one point I don’t understand: What do you mean by “Using two different indexes for search” ?
You can specify which fields to search in in your queries, i.e. the query “title:test” will only match documents having ‘test’ in their title. Acts_as_ferret will even let you strip down your search results further with the usual ActiveRecord #find-options hash.
Besides that, you can create as many Ferret indexes as you like for any special purposes that may occur, but most of the time one index per model (or even one index for all models) will do.
2 Michael @ SEOG:
It seems, that there is some kind of field boosting, we currently does not use it. See http://www.mail-archive.com/ferret-talk@rubyforge.org/msg00638.html
2 Jens Krämer: Yes, you are right. I can create one index for the whole application.
Just wanted to say that one should consider very well 2 indexes in ferret, because they cannot be combined.
Hi, inspired by acts_as_ferrit and acts_as_solr, I’m implementing an acts_as_zoom plugin to talk to Z39.50 standard based search databases for full-text search.
I’ll be using the ruby-zoom (http://ruby-zoom.rubyforge.org/) stuff along with a decent amount of custom code.
The big difference in my approach is that I’ll not only be allowing for searching of your own site’s Z39.50 search database, but an arbitrary number of databases that are setup as ZoomDb models. In other words, federated search.
Currently, I have a blog post about setting up a Zebra server and installing ruby-zoom on Debian testing (etch) up at http://blog.katipo.co.nz/?p=26 , but also stay tuned on that blog for more details about acts_as_zoom.
This for an opensource project called Kete. You can find out more about that here:
http://blog.kete.net.nz/
Thanks for the comparison.
Cheers,
Walter McGinnis
thanks a lot – this document was *exactly* what I was looking for. Thank you so much for your effort in creating it.
hi, i got a problem in mysql phrase search. i got two hundrad thousand row with text column in my database. its fulltext indexing structure, but when i going to search by phrase search in that by this query :
“select file_name from database_table where match(field1) against(‘”phrase search word”‘ in boolean mode)”
it take a too much time almost 1 and half minute. its same time in word search. so what could i do to search fast. plz give m replay
2 Ashish Dave
It is hardly to say what is the reason. One of them might be the amount of memory you have on your machine.
Here is what Jayant Kumar writes on his blog: “Mysql uses RAM to cache the index and use it during serving a query. So, if the size of your fulltext index exceeds the RAM, you will experience a major fall in the search performance. “
It is also connected with minimum length of word to be indexed. Default is 4. Smaller words are not indexed. For more details see http://jayant7k.blogspot.com/2006/05/mysql-fulltext-search-versus-lucene.html“
You should consider PostGres full text search – multiple fields are easy to index, results for multi-word searches can be prioritized by proximity, conditionals are easy, it has base word dictionaries (so it can match runner and run), etc.
Please take a look at this
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
You can create a full-text database search service, return results as HTML/XML/JSON. It uses the Lucene directly in java, but can be easily used with Ruby, PHP, or any existing database web applicatoins.
You can easily index, re-index, incremental-index. It’s also highly scalable.
The best thing is, it’s super easy. You can create a production-level search in 3 minutes.
I am a newbie here, just kidding around:-) anyone from Manhattan?