Archive for 2008

Ignore files in subversion

This is a simple procedure for telling Subversion to ignore files or directories.

cd parent_directory
 
# Check the current setup
svn proplist -v .
 
# set the editor to edit the properties
export EDITOR=vi
 
# open up editor with the properties
svn propedit svn:ignore .

A text editor opens (vi in my case). Here you specify the files and directories to be ignored, one pattern per line, e.g.:

docs
*.log

And last, but not least, commit:

svn commit
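
To preview which names such patterns would catch, here is a rough sketch in Ruby. It assumes that svn:ignore patterns behave like ordinary shell globs matched against the names directly inside the directory, which is my understanding of Subversion but is not taken from its source. The helper `ignored?` and the sample names are made up for illustration.

```ruby
# Patterns as they were entered into svn:ignore above.
IGNORE_PATTERNS = ['docs', '*.log'].freeze

# Hypothetical helper: true if a name in the directory matches any pattern.
# Assumption: glob semantics similar to File.fnmatch.
def ignored?(name, patterns = IGNORE_PATTERNS)
  patterns.any? { |pattern| File.fnmatch(pattern, name) }
end

%w[docs server.log README app.log.bak].each do |name|
  puts "#{name}: #{ignored?(name) ? 'ignored' : 'not ignored'}"
end
```

Note that `*.log` only matches names that end in `.log`, so a file like app.log.bak would need its own pattern.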

Converting family to Linux

I managed to convert my family to Linux. Not intentionally, it just happened. The final confirmation came on Saturday. My older son (5) was “coaching” the younger one (2): “Do not boot to Windows, there are no games there!”

And how did I do it?

  • Do not install games for kids on Windows
  • Do not install MS Office on Windows
  • Set Linux as the default operating system

As I said, it was not intentional. It was just my laziness. :o)

Improve performance of MySQL driver for RoR

Last week I was working on the performance tuning of a Rails application. I ran a profiler and found something very interesting.

One method was called very often and took a lot of time:

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 14.99    18.13     18.13    43430     0.42     0.85  Mysql#get_length
  8.84    28.82     10.69   172751     0.06     0.09  Kernel.===
  8.40    38.98     10.16   306964     0.03     0.04  Fixnum#==
  7.59    48.16      9.18     5566     1.65    30.60  Integer#times
  6.42    55.92      7.76     7582     1.02     1.63  Mysql::Net#read
...
  0.00   120.92      0.00        1     0.00 120920.00  #toplevel

The method was Mysql#get_length:

# File src/rails-1.2.3/activerecord/lib/active_record/vendor/mysql.rb
  def get_length(data, longlong=nil)
    return if data.length == 0
    c = data.slice!(0)
    case c
    when 251
      return nil
    when 252
      a = data.slice!(0,2)
      return a[0]+a[1]*256
    when 253
      a = data.slice!(0,3)
      return a[0]+a[1]*256+a[2]*256**2
    when 254
      a = data.slice!(0,8)
      if longlong then
        return a[0]+a[1]*256+a[2]*256**2+a[3]*256**3+
          a[4]*256**4+a[5]*256**5+a[6]*256**6+a[7]*256**7
      else
        return a[0]+a[1]*256+a[2]*256**2+a[3]*256**3
      end
    else
      c
    end
  end

There is obviously room for improvement! Replace the multiplications with shifts:

class Mysql
  def get_length(data, longlong=nil)
    return if data.length == 0
    c = data.slice!(0)
 
    case c
    when 251
      return nil
    when 252
      a = data.slice!(0,2)
      return a[0]+(a[1]<<8)
    when 253
      a = data.slice!(0,3)
      return a[0]+(a[1]<<8)+(a[2]<<16)
    when 254
      a = data.slice!(0,8)
      if longlong then
        return a[0]+(a[1]<<8)+(a[2]<<16) +(a[3]<<24)+(a[4]<<32)+(a[5]<<40)+(a[6]<<48)+(a[7]<<56)
      else
        return a[0]+(a[1]<<8)+(a[2]<<16)+(a[3]<<24)
      end
    else
      c
    end
  end
end
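
A quick way to convince yourself that the rewrite computes the same values is to compare both forms on sample bytes. The sketch below is not the driver code: it works on a plain array of integers (the driver slices a string, which yields integers only on Ruby 1.8), and the names `decode_mul` and `decode_shift` are invented here.

```ruby
# Little-endian decoding with multiplications, as in the original driver.
def decode_mul(bytes)
  bytes.each_with_index.map { |byte, i| byte * 256**i }.sum
end

# The same decoding with shifts, as in the rewritten version.
def decode_shift(bytes)
  bytes.each_with_index.map { |byte, i| byte << (8 * i) }.sum
end

sample = [0x34, 0x12, 0x01] # e.g. the three bytes that follow a 253 marker
puts decode_mul(sample)
puts decode_shift(sample)
puts decode_mul(sample) == decode_shift(sample)
```

Both forms give the same number, so any speed-up would have to come from the arithmetic itself.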

I ran the profiler again and, well… a piece of wisdom from my university days came to mind: “There is an elegant, simple, nice and obvious solution for each problem. Unfortunately, it is wrong!”

Performance remained the same, so a deeper investigation was needed. It was not difficult to find out that most of the time the “else” branch is executed. I tried something like this:

  def get_length(data, longlong=nil)
    return if data.length == 0
    c = data.slice!(0)
 
    return c if c<251
 
    case c
    when 251
      return nil
    when 252
      a = data.slice!(0,2)
      return a[0]+(a[1]<<8)
    when 253
      a = data.slice!(0,3)
      return a[0]+(a[1]<<8)+(a[2]<<16)
    when 254
      a = data.slice!(0,8)
      if longlong then
        return a[0]+(a[1]<<8)+(a[2]<<16) +(a[3]<<24)+(a[4]<<32)+(a[5]<<40)+(a[6]<<48)+(a[7]<<56)
      else
        return a[0]+(a[1]<<8)+(a[2]<<16)+(a[3]<<24)
      end
    else
      c
    end
  end
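
Why would an early return help? Ruby evaluates a case statement by testing the when clauses in order with ===, so a value below 251 goes through four failed comparisons before it reaches else. That would be consistent with the Kernel.=== and Fixnum#== rows near the top of the first profile. Below is a small measuring sketch with the standard Benchmark module. The method names are invented here, and newer Ruby versions may optimize a case over integer literals, so treat it as a way to measure on your own setup rather than as a result.

```ruby
require 'benchmark'

# Mirrors the original control flow: common values fall through to else.
def with_case_only(c)
  case c
  when 251 then nil
  when 252 then :two_bytes
  when 253 then :three_bytes
  when 254 then :eight_bytes
  else c
  end
end

# Mirrors the tuned control flow: common values return before the case.
def with_early_return(c)
  return c if c < 251

  with_case_only(c)
end

ITERATIONS = 200_000
Benchmark.bm(14) do |bm|
  bm.report('case only:')    { ITERATIONS.times { with_case_only(42) } }
  bm.report('early return:') { ITERATIONS.times { with_early_return(42) } }
end
```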

And the performance?

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 11.26    11.25     11.25    43430     0.26     0.38  Mysql#get_length
  8.73    19.97      8.72     5566     1.57    23.25  Integer#times
  7.77    27.73      7.76     7582     1.02     1.63  Mysql::Net#read
  7.27    34.99      7.26     5140     1.41     3.69  Array#each
  5.52    40.50      5.51   139526     0.04     0.05  Fixnum#==
...
  0.00    99.88      0.00        1     0.00 99880.00  #toplevel

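Improved! The toplevel cumulative time dropped from 120.92 to 99.88 seconds, a saving of about 21 seconds, and the self time of Mysql#get_length went from 18.13 to 11.25 seconds.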
Now, here’s how you can embed this hack into your application:

  • Create a file called e.g. mysql_fix.rb
  • Add the following to it:
require 'active_record/vendor/mysql'
 
class Mysql
  def get_length(data, longlong=nil)
    return if data.length == 0
    c = data.slice!(0)
 
    return c if c < 251
 
    case c
    when 251
      return nil
    when 252
      a = data.slice!(0,2)
      return a[0]+(a[1]<<8)
    when 253
      a = data.slice!(0,3)
      return a[0]+(a[1]<<8)+(a[2]<<16)
    when 254
      a = data.slice!(0,8)
      if longlong then
        return a[0]+(a[1]<<8)+(a[2]<<16) +(a[3]<<24)+(a[4]<<32)+(a[5]<<40)+(a[6]<<48)+(a[7]<<56)
      else
        return a[0]+(a[1]<<8)+(a[2]<<16)+(a[3]<<24)
      end
    else
      c
    end
  end
end
  • Put it into a directory such as lib/zmok
  • Add the following line to your environment.rb:
require File.join(File.dirname(__FILE__), '../lib/zmok/mysql_fix')
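
This works because Ruby classes are open: a second `class Mysql` block does not replace the class, it reopens it, and a method defined there overrides the earlier definition with the same name. That is also why mysql_fix.rb requires the driver first. A minimal sketch, with a made-up `Greeter` class standing in for the driver:

```ruby
# Stand-in for the library class (hypothetical, not part of any gem).
class Greeter
  def greet(name)
    "Hello, #{name}"
  end

  def farewell(name)
    "Bye, #{name}"
  end
end

# Stand-in for mysql_fix.rb: reopen the class and redefine one method.
class Greeter
  def greet(name)
    "Hi, #{name}!"
  end
end

g = Greeter.new
puts g.greet('Rails')    # uses the redefined method
puts g.farewell('Rails') # methods that were not redefined keep working
```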

It is funny how my performance tuning session ended. Instead of changing my Rails application, I ended up improving the MySQL driver.

Bayes classification in Ruby made easy

Recently I was experimenting with Bayes classification in Ruby. At first sight it looks like a difficult topic, but with the right libraries it is interesting and fun.

Before you start experimenting, you have to install three gems: classifier, madeleine and stemmer. The first two are installed explicitly:

gem install classifier
gem install madeleine

The stemmer gem is pulled in as a dependency; confirm its installation when asked.

To begin, let’s experiment with the plain Bayes classifier.

require 'classifier'
 
bayes = Classifier::Bayes.new 'funny', 'sad', 'neutral'
 
# Train it slightly...
bayes.train 'funny', 'Finally all of them were smiling'
bayes.train :sad, 'Little ill puppy'
bayes.train :neutral, 'Tax declaration'

The classifier is “trained”, so let’s ask it something interesting…

bayes.classify 'Everybody have to pay taxes'
=> "Neutral"

Hmmm… this does not look like the expected answer :o). We have probably trained it incorrectly. So, let’s undo it:

# Remove the incorrect statement
bayes.untrain :neutral, 'Tax declaration'
 
# Train it right
bayes.train :sad, 'Tax declaration'
 
# And provide something neutral (if a category has no training statement, the classifier does not work as expected)
bayes.train :neutral, 'Rainbow is full of colors'

So, how does the classifier see it now?

bayes.classify 'Everybody have to pay taxes'
=> "Sad"

Yes, this is how people feel about it :o). For those who do not agree (and also for debugging purposes) it is possible to see the score for each category.

bayes.classifications 'Everybody have to pay taxes'
=> {"Sad"=>-9.43348392329039, "Neutral"=>-10.2035921449865, "Funny"=>-10.2035921449865}
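
The scores are negative, and the category with the highest score (the one closest to zero) wins. To get a feeling for where such numbers come from, here is a toy word-count classifier in plain Ruby. It is only a sketch of the general naive Bayes idea, not the classifier gem's code: `ToyBayes`, the pseudo-count of 0.1 for unseen words and the simple tokenizer are all my assumptions, and there is no stemming, which is why the query below is 'pay tax' rather than the original sentence.

```ruby
# Toy word-count classifier; a sketch, NOT the classifier gem's code.
class ToyBayes
  UNSEEN = 0.1 # assumed pseudo-count for words a category has never seen

  def initialize(*categories)
    @counts = {}
    categories.each { |c| @counts[c] = Hash.new(0) }
  end

  def train(category, text)
    tokenize(text).each { |w| @counts[category][w] += 1 }
  end

  def untrain(category, text)
    tokenize(text).each do |w|
      @counts[category][w] -= 1 if @counts[category][w] > 0
    end
  end

  # One score per category: the sum of log(word count / total words).
  def classifications(text)
    words = tokenize(text)
    @counts.each_with_object({}) do |(category, word_counts), scores|
      total = [word_counts.values.sum, 1].max.to_f
      scores[category] = words.sum(0.0) do |w|
        seen = word_counts[w]
        Math.log((seen > 0 ? seen : UNSEEN) / total)
      end
    end
  end

  # The winning category is the one with the highest score.
  def classify(text)
    classifications(text).max_by { |_, score| score }.first
  end

  private

  def tokenize(text)
    text.downcase.scan(/[a-z]+/)
  end
end

toy = ToyBayes.new(:funny, :sad, :neutral)
toy.train(:funny, 'Finally all of them were smiling')
toy.train(:sad, 'Little ill puppy')
toy.train(:neutral, 'Tax declaration')
p toy.classify('pay tax') # first attempt

toy.untrain(:neutral, 'Tax declaration')
toy.train(:sad, 'Tax declaration')
toy.train(:neutral, 'Rainbow is full of colors')
p toy.classifications('pay tax') # log scores per category
p toy.classify('pay tax')        # after retraining
```

Each score is a sum of logarithms of values below one, which is why every score is negative and why a word the category has seen before pulls the score towards zero.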

The classifier that was created and trained is nice, but it disappears as soon as you quit your Ruby console. To make it persistent, you can use the Madeleine library.
“Madeleine is a Ruby implementation of Object Prevalence, that is, transparent persistence of business objects using command logging and complete system snapshots.”
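
The “complete system snapshots” part of that description can be pictured with Ruby's built-in Marshal module. This is only an illustration of the idea, not Madeleine's code (which also logs commands); `Model`, `save_snapshot` and `load_snapshot` are names invented for the sketch.

```ruby
require 'tmpdir'

# Hypothetical stand-in for a trained classifier.
Model = Struct.new(:word_counts)

# "Take a snapshot": serialize the whole object to a file.
def save_snapshot(object, path)
  File.binwrite(path, Marshal.dump(object))
end

# "Next start": rebuild the object from the file.
def load_snapshot(path)
  Marshal.load(File.binread(path))
end

model = Model.new({ 'tax' => 1, 'declaration' => 1 })

Dir.mktmpdir do |dir|
  path = File.join(dir, 'snapshot.bin')
  save_snapshot(model, path)
  restored = load_snapshot(path)
  puts restored.word_counts.inspect
end
```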

require 'madeleine'

# Store the data in the bayes-dir directory
madeleine = SnapshotMadeleine.new("bayes-dir") { bayes }
madeleine.take_snapshot

Next time, load the classifier with:

madeleine = SnapshotMadeleine.new("bayes-dir")

# Perform more training
madeleine.system.train "sad", "Many people were injured by the earthquake"

# And test it once more
madeleine.system.classify 'smiling face'
=> "Funny"
madeleine.system.classify 'strong earthquake'
=> "Sad"

The classifier is a nice piece of code. I did enjoy it, and hope you will enjoy it too.

Ad-hoc fulltext search in RoR ActiveRecord

I came to a situation where I needed to search my ActiveRecord models, but I did not know which field contained the information. The solution with Ferret was just three steps away…

Let’s say you want to search Stories for the keyword ‘Giant’. You have to create a Ferret index in memory (the ferret gem needs to be installed), index all records and gather the IDs of those matching the keyword.

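require 'ferret'

# Step 1: create an in-memory index
index = Ferret::I.new

# Step 2: index every record; inspect puts all fields into one string
Story.find(:all).each { |s| index << {:id => s.id, :content => s.inspect} }

# Step 3: print the record ID and score of up to 100 matches
index.search_each('Giant', :limit => 100) do |id, score|
  puts "Active record ID: #{index[id][:id]} with score #{score}"
end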

… now you have the full power of the Ferret engine in your hands.
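
For a handful of records, the same idea (flatten each record into one string, then match the keyword against it) can be sketched in plain Ruby without Ferret. This is not a replacement for a real index: there is no scoring, and the `Story` struct, its sample data and `search_ids` are hypothetical stand-ins for the ActiveRecord model.

```ruby
# Hypothetical record type standing in for an ActiveRecord model.
Story = Struct.new(:id, :title, :body)

stories = [
  Story.new(1, 'Jack and the Giant', 'A boy climbs a beanstalk.'),
  Story.new(2, 'The Tortoise', 'Slow and steady wins.'),
  Story.new(3, 'Bedtime', 'A friendly giant reads a book.')
]

# Flatten every record into one string via inspect, then keep the IDs
# of the records whose text mentions the keyword (case-insensitive).
def search_ids(records, keyword)
  pattern = /#{Regexp.escape(keyword)}/i
  records.select { |r| r.inspect =~ pattern }.map(&:id)
end

p search_ids(stories, 'Giant') # => [1, 3]
```

One caveat applies to any inspect-based search: the inspected string also contains the class and field names, so a keyword such as 'title' would match every record.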