Bayes classification in Ruby made easy

Recently I was experimenting with Bayes classification in Ruby. At first sight it looks like a difficult topic, but with the right libraries it is interesting and fun.

Before you start experimenting, you have to install 3 gems.


When the installer asks, confirm the required stemmer gem.

To begin, let's experiment with the plain Bayes classifier.

require 'classifier'
 
bayes = Classifier::Bayes.new 'funny', 'sad', 'neutral'
 
# Train it slightly...
bayes.train 'funny', 'Finally all of them were smiling'
bayes.train :sad, 'Little ill puppy'
bayes.train :neutral, 'Tax declaration'

The classifier is “trained”, so let's ask it something interesting…

bayes.classify 'Everybody have to pay taxes'
=> "Neutral"

Hmmm… this does not look like the expected answer :o). We have probably trained it incorrectly. So, let’s undo it:

# Remove the incorrect statement
bayes.untrain :neutral, 'Tax declaration'
 
# Train it right
bayes.train :sad, 'Tax declaration'
 
# And provide something neutral (if a category has no statements, the classifier does not work as expected)
bayes.train :neutral, 'Rainbow is full of colors'

So, how does the classifier see it now?

bayes.classify 'Everybody have to pay taxes'
=>'Sad'

Yes, this is how people feel about it :o). For those who do not agree (and also for debugging purposes), it is possible to see the score for each category.

bayes.classifications 'Everybody have to pay taxes'
=> {"Sad"=>-9.43348392329039, "Neutral"=>-10.2035921449865, "Funny"=>-10.2035921449865}

The classifier that was created and trained is nice, but it disappears as soon as you close your Ruby console. To make it persistent, you can use the Madeleine library.
“Madeleine is a Ruby implementation of Object Prevalence, that is, transparent persistence of business objects using command logging and complete system snapshots.”

require 'madeleine'

# Store the data into bayes-dir directory
madeleine = SnapshotMadeleine.new("bayes-dir") { bayes }
madeleine.take_snapshot

Next time, load the classifier with the following commands:

madeleine = SnapshotMadeleine.new("bayes-dir")

# Perform more training
madeleine.system.train "sad", "Many people were injured by the earthquake"

# And test it once more
madeleine.system.classify 'smiling face'
=>'funny'
madeleine.system.classify 'strong earthquake'
=>'sad'
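The snapshot half of the idea can be illustrated with nothing but Ruby's standard library. The sketch below is not Madeleine's API: the helper names take_snapshot and load_snapshot, the file name snapshot.bin and the Model struct are all hypothetical, and Madeleine's command logging is left out entirely.

```ruby
require 'tmpdir'

# Hypothetical stand-in for a trained classifier; any Marshal-able object works.
Model = Struct.new(:word_counts)

# Write the whole object graph to disk in one go.
def take_snapshot(dir, object)
  Dir.mkdir(dir) unless Dir.exist?(dir)
  File.binwrite(File.join(dir, 'snapshot.bin'), Marshal.dump(object))
end

# Load the stored object, or build a fresh one from the block on the first run.
def load_snapshot(dir)
  path = File.join(dir, 'snapshot.bin')
  return yield unless File.exist?(path)

  Marshal.load(File.binread(path))
end

restored = Dir.mktmpdir do |tmp|
  dir = File.join(tmp, 'bayes-dir')

  # First run: nothing on disk yet, so the block creates the object.
  model = load_snapshot(dir) { Model.new({ 'sad' => { 'tax' => 1 } }) }
  take_snapshot(dir, model)

  # "Next time": the object comes back from the snapshot, the block is not used.
  load_snapshot(dir) { raise 'a snapshot was expected' }
end
```

The block passed to load_snapshot plays the same role as the block given to SnapshotMadeleine.new above: it is only needed when there is nothing to restore.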

The classifier is a nice piece of code. I enjoyed it and hope you will enjoy it too.

Utilizing Caches.rb with Ferret

We needed to cache a Ruby class method calling the Ferret indexing engine.
Yurii Rashkovskii developed a great library called Caches.rb.

When I googled it, it seemed very simple to use and promised to do EXACTLY what I needed (even the default timeout was JUST IT). I especially liked the very Rails-like tutorial Don’t tell, show me!. However, it required quite some effort to make it work, mainly because of the rather sparse documentation. Still, in the end the usage is very elegant, the solution is simple and it does what it promises. Thank you, Yurii!

To help our esteemed readers get faster over that less agreeable middle phase, here are a few tips:

  • Downloading it: I tried gems but the gem list server seemed to be overloaded, and when it worked at last, I just got an older version (0.2.0). When checking out (or exporting) version 0.4.0 from SVN trunk directly, the trick was in finding out the latest working SVN URL:
    ruby script/plugin install http://svn.verbdev.com/rb/caches.rb/trunk

  • With all the typical Rails mixin stuff petrified in my mind, it took me a while to notice that caching should be declared AFTER the definition of the method to be cached, and not at the beginning of the class definition. The example in the documentation shows it, but it’s easily overlooked.

  • We do use Rails, so I included class_cache_storage Caches::Storage::Global as suggested here
  • For some reason (I suspect my shallow knowledge of Ruby ;-), I did not manage to successfully require ‘caches.rb’ from the plugin installation dir _#{RAILS_ROOT}/vendor/plugins/caches.rb/lib_, so I copied it to _#{RAILS_ROOT}/lib_, which helped.
  • For similar reasons, I had to use a workaround to extend the class definition with caching, instead of the recommended way of adding ActiveRecord::Base.extend Caches::ClassMethods to config/environment.rb.

So our class looks as follows:

require 'ferret'
include Ferret
require 'caches.rb'
 
class FIndex
  extend Caches
  ...
  def self.search(user=nil, results_per_page=10)
    ...
  end
  class_cache_storage Caches::Storage::Global
  class_caches :search
end

N.B.: The class to be cached works with Ferret and not the DB, so it did not inherit from ActiveRecord.

  • For a pure RoR developer, the terminology may be a little confusing (which indicates that Yurii can deal with more programming languages than just Ruby):
    • class methods mentioned in the library description are obviously just a shorthand for “the methods of a class”
    • in Yurii’s documentation, Ruby class methods are called static methods (as known in C/C++ or Java), and their caching is supported by class_caches (a feature not available in caches.rb 0.2.0 from gems or RubyForge, but included in the more recent versions from SVN, like the 0.4.0 we are using)
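The tip about declaring the cache AFTER the method definition becomes clearer with a toy version of such a macro. The sketch below is not how Caches.rb is implemented; the module name TinyClassCache, the timeout keyword and its 60-second default are hypothetical. It only shows one plausible reason for the ordering: a wrapper has to grab the existing method before it can replace it.

```ruby
# Hypothetical mini-version of a class-method cache (not the Caches.rb code).
module TinyClassCache
  def class_caches(name, timeout: 60)
    # Raises NameError when the method has not been defined yet, which is
    # why the declaration has to follow the method definition.
    original = method(name)
    cache = {}

    define_singleton_method(name) do |*args|
      entry = cache[args]
      if entry.nil? || Time.now - entry[:at] > timeout
        entry = cache[args] = { value: original.call(*args), at: Time.now }
      end
      entry[:value]
    end
  end
end

class SlowIndex
  extend TinyClassCache

  @calls = 0
  class << self
    attr_reader :calls
  end

  # Stands in for an expensive index lookup.
  def self.search(query)
    @calls += 1
    "results for #{query}"
  end

  class_caches :search # must come after def self.search
end

first  = SlowIndex.search('ruby')
second = SlowIndex.search('ruby') # served from the cache
```

After the two calls, SlowIndex.calls is 1, because the second call never reaches the original method.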

How to show performance statistics in a chart

Working on performance tests and tuning is an interesting and challenging job. One has to prepare tests, execute them and also evaluate them. The preparation and execution are not always easy, but they are manageable. The evaluation of the tests comes with an interesting problem: data visualization.

I was testing a system spread across several servers, and several technologies (Microsoft perfmon, Apache JMeter) were involved in the testing and monitoring. Therefore, the visualization was a challenge.

It is not a problem to put all the statistics into one spreadsheet and try to create a chart. BUT there are usually a lot of tests and the preparation of each chart takes ages. Secondly, the traditional spreadsheets are good at showing sales for four regions, but they have problems with hundreds or thousands of data samples. The chart looks like a saw (in MS Excel) or is very CPU intensive (OpenOffice.org). So, automation and a good visualization technique are needed.

I selected gnuplot for the visualization. It is fast, it creates loads of different charts and it is managed from a script console… so, automation is possible. In my approach, the preparation of the data files and the generation of the gnuplot script are done in the Ruby scripting language. The resulting chart is nice and its generation took me just a few seconds, including deciding what should be displayed.

Chart

For those with no patience, the script is available for download here.

For the others, here is a short summary of what the script is doing for you. It …

  • Takes several CSV input files as parameters. Thus you can have a chart with data from different servers.
  • Lists all the columns that are in the given CSV file.
  • Gives you the possibility to select the columns of interest.
  • Allows you to specify the time format of the timestamp in the file. Default is set to the perfmon format.
  • Scales the data to show all of the data sets in the same chart.
  • Generates a gnuplot script for you. You can then customize the script to your needs.
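Two of the steps in this list, reading the time out of a timestamp and scaling a data set, can be sketched in a few lines of Ruby. This is not the code of the script offered for download: the function names are hypothetical, the format handling assumes that every letter in the format stands for one digit and that the part in round brackets is the time of day, and the scaling rule (divide by a power of ten until the largest sample fits into 0..100) is only one plausible choice.

```ruby
# Turn a format such as 'DD/MM/YYYY (HH:mm:SS).XXX' into a regexp:
# letters become single digits, the round brackets become a capture group.
def format_to_regexp(format)
  pattern = format.chars.map do |ch|
    case ch
    when '(', ')' then ch
    when /[A-Za-z]/ then '\d'
    else Regexp.escape(ch)
    end
  end.join
  Regexp.new("\\A#{pattern}\\z")
end

# Returns the bracketed part of the timestamp, or nil if it does not match.
def extract_time(timestamp, format = 'DD/MM/YYYY (HH:mm:SS).XXX')
  match = format_to_regexp(format).match(timestamp)
  match && match[1]
end

# Pick a power-of-ten factor so that the largest sample fits into 0..limit.
def scale_factor(values, limit = 100.0)
  max = values.map(&:abs).max
  return 1.0 if max.nil? || max <= limit

  exponent = Math.log10(max / limit).ceil
  1.0 / (10**exponent)
end

time = extract_time('02/14/2008 10:15:30.123') # "10:15:30"

page_faults = [120.0, 4500.0, 830.0]
factor = scale_factor(page_faults)             # 0.01
cpu_factor = scale_factor([12.5, 98.0, 40.0])  # 1.0, already within 0..100
```

With these factors a page-fault series and a CPU percentage can share one y axis from 0 to 100.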

How it works

0. Create a “temp” directory in the directory where you run the script.

1. Run

ruby process.rb  SERVER3.csv

2. The script then lists all the columns from the CSV file

Select fields (time first, separated by space) from file SERVER3.csv
0 : "(PDH-CSV 4.0) (Central Europe Standard Time)(-60)"
1 : "\\SERVER3\LogicalDisk(C:)\Disk Bytes/sec"
2 : "\\SERVER3\LogicalDisk(D:)\Disk Bytes/sec"
3 : "\\SERVER3\Memory\Available MBytes"
4 : "\\SERVER3\Memory\Page Faults/sec"
5 : "\\SERVER3\Network Interface(HP NC7782 Gigabit Server Adapter)\Bytes Total/sec"
6 : "\\SERVER3\Network Interface(MS TCP Loopback interface)\Bytes Total/sec"
7 : "\\SERVER3\PhysicalDisk(0 C:)\Disk Bytes/sec"
8 : "\\SERVER3\PhysicalDisk(1 D:)\Disk Bytes/sec"
9 : "\\SERVER3\Process(java)\Page Faults/sec"
10 : "\\SERVER3\Processor(0)\% Processor Time"
11 : "\\SERVER3\Processor(1)\% Processor Time"
12 : "\\SERVER3\Processor(2)\% Processor Time"
13 : "\\SERVER3\Processor(3)\% Processor Time"
14 : "\\SERVER3\Processor(4)\% Processor Time"
15 : "\\SERVER3\Processor(5)\% Processor Time"
16 : "\\SERVER3\Processor(6)\% Processor Time"
17 : "\\SERVER3\Processor(7)\% Processor Time"
18 : "\\SERVER3\Processor(_Total)\% Processor Time"
...
36 : "Server monitoring"
Enter your options:

3. Enter the fields you want in the chart. The timestamp field needs to be the first one; fields are separated by spaces.

0 4 9 18

4. Specify the time format. The default is set to perfmon (US English Windows). Press Enter if you want to use the default.

Specify time format. Put HH:mm:SS into round brackets. Default is: DD/MM/YYYY (HH:mm:SS).XXX
DD/MM/YYYY (HH:mm:SS).XXX

5. The script is generated.

gnuplot script created. Generate a picture with the following command
pgnuplot script.plt

6. Edit the script if necessary. If you do not edit the script, you will get a chart like the one above.

set terminal png
set output "chart.png"
set xdata time
set timefmt "%H:%M:%S"
set format x "%H:%M"
set yrange [0:100]
cd "temp"
plot "SERVER3.csv.tmp" using 1:(0.01*$2) with lines title "0.01 * \\\\SERVER3\\Memory\\Page Faults/sec" smooth bezier, \
"SERVER3.csv.tmp" using 1:(0.01*$3) with lines title "0.01 * \\\\SERVER3\\Process(java)\\Page Faults/sec" smooth bezier, \
"SERVER3.csv.tmp" using 1:(1*$4) with lines title "1 * \\\\SERVER3\\Processor(_Total)\\% Processor Time" smooth bezier

7. Generate the chart with the following command

pgnuplot script.plt

8. Check file chart.png
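The plot command in step 6 is plain text, so generating it from Ruby is mostly string formatting. The sketch below rebuilds the first two plot entries of that script; it is not taken from process.rb, and the Series struct and both helper names are hypothetical. The only subtle part is that gnuplot treats the backslash as an escape character inside double quotes, so every backslash of a perfmon counter name has to be doubled.

```ruby
# One plotted data set: column in the data file, scale factor, counter name.
Series = Struct.new(:column, :factor, :title)

# Double every backslash so gnuplot prints it literally inside double quotes.
def gnuplot_escape(text)
  text.gsub('\\') { '\\\\' }
end

# Build a multi-line gnuplot "plot" command for all series of one data file.
def plot_command(data_file, series)
  parts = series.map do |s|
    format('"%s" using 1:(%s*$%d) with lines title "%s * %s" smooth bezier',
           data_file, s.factor, s.column, s.factor, gnuplot_escape(s.title))
  end
  'plot ' + parts.join(", \\\n")
end

series = [
  Series.new(2, 0.01, '\\\\SERVER3\\Memory\\Page Faults/sec'),
  Series.new(3, 0.01, '\\\\SERVER3\\Process(java)\\Page Faults/sec')
]

script = plot_command('SERVER3.csv.tmp', series)
puts script
```

The counter names are written with Ruby single quotes, where two backslashes stand for one, so the strings hold the names exactly as perfmon lists them.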

Limitations

  • The script is very simple and definitely will not work for all data files. Nevertheless, an experienced Ruby programmer should have no problem modifying it.
  • Hopefully there are not many bugs in there. If you find some, drop me a note (a comment under this article is appreciated).

I hope you will find the script useful.

Syntax error without obvious reason

I just had a bad experience with Ruby that I would like to share.

I wanted to upgrade my SuSE Linux 9.1 to version 10. Failed… during the upgrade process my machine ended up in a strange state. I was able to boot neither from the hard drive nor from the Windows XP installation disk. I had to use my old Win 2000 installation CD to get it running.

As a consequence, I had to recreate the whole development environment. After a few hours of work I started my application…

Before the installation it was running without any problem. After the environment was recreated, I was not able to render some views. It showed me strange syntax errors:

ActionView::TemplateError (compile error
./script/../config/../app/views/story/_showing_tags.rhtml:4: syntax error
_erbout.concat "	"
                	  ^) on line #4 of app/views/story/_showing_tags.rhtml:
…
…
…
_erbout.concat "                        ";  end ; _erbout.concat "\n"
_erbout.concat "</div>\n"
_erbout
end
Backtrace: ./script/../config/../app/views/story/_showing_tags.rhtml:4:in `compile_template'
c:/win32app/ruby/lib/ruby/gems/1.8/gems/actionpack-1.12.5/lib/action_view/base.rb:307:in `compile_and_render_template'
c:/win32app/ruby/lib/ruby/gems/1.8/gems/actionpack-1.12.5/lib/action_view/base.rb:292:in `render_template'
c:/win32app/ruby/lib/ruby/gems/1.8/gems/actionpack-1.12.5/lib/action_view/base.rb:251:in `render_file'
c:/win32app/ruby/lib/ruby/gems/1.8/gems/actionpack-1.12.5/lib/action_view/base.rb:266:in `render'
c:/win32app/ruby/lib/ruby/gems/1.8/gems/actionpack-1.12.5/lib/action_view/partials.rb:59:in `render_partial'
c:/win32app/ruby/lib/ruby/gems/1.8/gems/actionpack-1.12.5/lib/action_controller/benchmarking.rb:29:in `benchmark'
c:/win32app/ruby/lib/ruby/1.8/benchmark.rb:293:in `measure'
…
…
...

In fact, Rails was complaining about the tabs before the embedded Ruby commands.

After changing the tabs to 2 spaces the application started… complaining about other characters.

I did some research and was pointed to the Ruby version. Before the reinstallation I was running 1.8.2. After the reinstallation I took the latest one. Bad decision: the errors were caused by the Ruby interpreter version.

So, if you face weird syntax errors with Ruby 1.8.4 and you are sure about the syntax, try changing the version. I downgraded to 1.8.2 and it started to work again.