Doing LSI at scale in Ruby

Chris Kottom

[Note: parts of this message were removed to make it a legal post.]

Hi all,

I'm looking to find out whether anyone is doing latent semantic indexing
(LSI) in Ruby at any kind of web scale, and if so, what tools and techniques
you're using?

Just for context, I've been working on this problem for a few days now.
I've tried the Classifier gem via "gem install" and compiled from source
and at least two other forks. I've tried compiling various versions of the
GSL library, most of which would not allow the gsl gem to compile, and it
seems that in the combinations where I can actually get the full set of
libraries to install, I receive an error like the following:

/home/ck1/.rvm/gems/ruby-1.9.2-p180@classifier_test/gems/kitop-classifier-1.4.4/lib/classifier/lsi.rb:316:in
`SV_decomp': Ruby/GSL error code 24, svd of MxN matrix, M<N, is not
implemented (file svd.c, line 61), the requested feature is not (yet)
implemented (GSL::ERROR::EUNIMPL)
from /home/ck1/.rvm/gems/ruby-1.9.2-p180@classifier_test/gems/kitop-classifier-1.4.4/lib/classifier/lsi.rb:316:in
`build_reduced_matrix'
from /home/ck1/.rvm/gems/ruby-1.9.2-p180@classifier_test/gems/kitop-classifier-1.4.4/lib/classifier/lsi.rb:128:in
`build_index'
from /home/ck1/.rvm/gems/ruby-1.9.2-p180@classifier_test/gems/kitop-classifier-1.4.4/lib/classifier/lsi.rb:66:in
`add_item'
from lsi_test.rb:18:in `block in <main>'
from lsi_test.rb:18:in `each'
from lsi_test.rb:18:in `<main>'

This particular stack trace was when running with a fork of Classifier, but
the result is essentially the same with the original gem with the exception
of the line numbers, and it looks as though the error is unrelated to
Classifier but rather the gsl gem or the underlying GSL library.

Any help or shared experiences will be appreciated. Thanks in advance.
 
Ryan Davis

Just for context, I've been working on this problem for a few days now.
I've tried the Classifier gem via "gem install" and compiled from source
and at least two other forks. I've tried compiling various versions of the
GSL library, most of which would not allow the gsl gem to compile, and it
seems that in the combinations where I can actually get the full set of
libraries to install, I receive an error like the following:

/home/ck1/.rvm/gems/ruby-1.9.2-p180@classifier_test/gems/kitop-classifier-1.4.4/lib/classifier/lsi.rb:316:in
`SV_decomp': Ruby/GSL error code 24, svd of MxN matrix, M<N, is not
implemented (file svd.c, line 61), the requested feature is not (yet)
implemented (GSL::ERROR::EUNIMPL)
from /home/ck1/.rvm/gems/ruby-1.9.2-p180@classifier_test/gems/kitop-classifier-1.4.4/lib/classifier/lsi.rb:316:in
`build_reduced_matrix'
from /home/ck1/.rvm/gems/ruby-1.9.2-p180@classifier_test/gems/kitop-classifier-1.4.4/lib/classifier/lsi.rb:128:in
`build_index'
from /home/ck1/.rvm/gems/ruby-1.9.2-p180@classifier_test/gems/kitop-classifier-1.4.4/lib/classifier/lsi.rb:66:in
`add_item'
from lsi_test.rb:18:in `block in <main>'
from lsi_test.rb:18:in `each'
from lsi_test.rb:18:in `<main>'

This particular stack trace was when running with a fork of Classifier, but
the result is essentially the same with the original gem with the exception
of the line numbers, and it looks as though the error is unrelated to
Classifier but rather the gsl gem or the underlying GSL library.

You'd be better off contacting the author. There is no guarantee that
they read this list.
 
Clifford Heath

I'm looking to find out whether anyone is doing latent semantic indexing
(LSI) in Ruby at any kind of web scale, and if so, what tools and techniques
you're using?

The author of Picky <http://florianhanke.com/picky/> presented it last night
at the Melbourne Ruby group. Not sure if it's interesting to you, but it looks
like a different kind of search engine to Sphinx, etc.

Clifford Heath.
 
Karl Smith

Starting about a week ago, ruby is crashing fairly often during rails
development: rails server, console, and during spec runs. But it's not
consistent.

$ ruby -v
ruby 1.9.2p180 (2011-02-18 revision 30909) [x86_64-darwin10.7.0]
$ rvm -v
rvm 1.6.10 by Wayne E. Seguin ([email protected]) [https://rvm.beginrescueend.com/]

Please take a look at some of the crash logs: https://gist.github.com/982118

I have tried the following, but none of it helped:
- remove all gems and re-bundle
- uninstall ruby 1.9.2-p180 and re-install
- use ruby 1.9.2-p136 with new gem re-bundle

At first I was startled because I have never seen ruby crash before on
this machine. Now that the novelty has worn thin, it's becoming quite a
distraction.

Since everything was working just a few days ago, I'm stumped as to what
may be causing this. I need help tracking down the cause.
 
Ryan Davis

Starting about a week ago, ruby is crashing fairly often during rails
development: rails server, console, and during spec runs. But it's not
consistent.

Please don't thread hijack. Start a new thread properly.
 
Phillip Gawlowski

At first I was startled because I have never seen ruby crash before on this machine.
Now that the novelty has worn thin, it's becoming quite a distraction.

Since everything was working just a few days ago, I'm stumped as to what
may be causing this. I need help tracking down the cause.

Well, what did change in your system environment in the last few days?

Looking at the crash log, and considering the crash happens after an
SQL statement: What database are you using, and what is its version?
And what's your Rails/ActiveRecord version?

If possible, build an application with the *minimum* set of external
libraries that still produces a crash.
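
One way to act on that suggestion might be to load each suspect library in a fresh interpreter several times and count the failures, since the crashes are reported as intermittent. The sketch below is only an illustration: `failure_count` is a hypothetical helper, the candidate list is a placeholder to be replaced with the libraries the application actually loads, and `RbConfig.ruby` assumes a Ruby recent enough to provide it.

```ruby
require 'rbconfig'

# Hypothetical helper: runs a Ruby snippet in a fresh interpreter `runs`
# times and returns how many of those runs did not exit cleanly.
# Kernel#system returns true only for a zero exit status, so a crash or a
# failed require both count as failures.
def failure_count(snippet, runs = 5)
  runs.times.count { !system(RbConfig.ruby, '-e', snippet) }
end

# Placeholder candidates (assumption): substitute the libraries your own
# application loads, one at a time, to narrow down which one crashes.
candidates = ['set', 'json']

candidates.each do |lib|
  failures = failure_count("require #{lib.inspect}")
  puts "#{lib}: #{failures} of 5 runs failed"
end
```

A library that fails in some runs but not others when loaded on its own would be a much smaller reproduction than the full Rails application.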

-- 
Phillip Gawlowski

A method of solution is perfect if we can foresee from the start,
and even prove, that following that method we shall attain our aim.
                -- Leibnitz
 
Ryan Davis

what may be causing this. I need help tracking down the cause.

Well, what did change in your system environment in the last few days?

Looking at the crash log, and considering the crash happens after an
SQL statement: What database are you using, and what is its version?
And what's your Rails/ActiveRecord version?

If possible, build an application with the *minimum* set of external
libraries that still produces a crash.

This seems a lot more relevant:

c:0042 p:---- s:0136 b:0136 l:000135 d:000135 CFUNC :require
c:0041 p:0012 s:0132 b:0132 l:000116 d:000131 BLOCK blah/gems/activesupport-3.0.7/lib/active_support/dependencies.rb:239
c:0040 p:0005 s:0130 b:0130 l:000121 d:000129 BLOCK blah/gems/activesupport-3.0.7/lib/active_support/dependencies.rb:225
c:0039 p:0045 s:0128 b:0128 l:000127 d:000127 METHOD blah/gems/activesupport-3.0.7/lib/active_support/dependencies.rb:596
c:0038 p:0041 s:0122 b:0122 l:000121 d:000121 METHOD blah/gems/activesupport-3.0.7/lib/active_support/dependencies.rb:225
c:0037 p:0013 s:0117 b:0117 l:000116 d:000116 METHOD blah/gems/activesupport-3.0.7/lib/active_support/dependencies.rb:239
c:0036 p:0011 s:0112 b:0112 l:000111 d:000111 TOP blah/gems/ruby_parser-2.0.6/lib/ruby_parser.rb:7
c:0035 p:---- s:0110 b:0110 l:000109 d:000109 FINISH
c:0034 p:---- s:0108 b:0108 l:000107 d:000107 CFUNC :require

Which looks to be:

require 'racc/parser.rb'

Which ships with ruby... Something is brokey with racc itself? I
dunno...
 
Chris Kottom

[Note: parts of this message were removed to make it a legal post.]

Thanks, Ryan. I will do this too, was just looking to see what the current
de facto standard method for this is. Digging a little deeper on both
GitHub and RubyForge, it seems that the gem has been pretty much dormant for
several years, so I'm looking to see whether people have moved on to another
fork or another lib. Will post any findings.

Thanks, Clifford, for the tip. It's not exactly what I need for this
particular part of the application, as I'm using the Classifier LSI feature
to index documents and detect similar records, but it might be worth
investigating as a replacement for Sphinx in other places in this app and
others I'm working on.
 
Karl Smith

what may be causing this. I need help tracking down the cause.

Well, what did change in your system environment in the last few days?

Looking at the crash log, and considering the crash happens after an
SQL statement: What database are you using, and what is its version?
And what's your Rails/ActiveRecord version?

If possible, build an application with the *minimum* set of external
libraries that still produces a crash.

Since I am not doing anything unusual (using common gems and typical
methods for ruby/gem installation), I would expect to see others report
the same issue. I have deleted and re-installed 1.9.2-p180 several
times, tried reverting to 1.9.2-p136, and erased and re-installed all
gems. It still keeps on crashing.

Not 100% sure what has changed. I did update Postgres to 9.0.4 via brew,
so the pg gem would have been compiled against the new version. But
again, I would expect others who have done the same to report issues.

The crashing is common, but not consistent. For example, it took 4 times
running 'rake -T' before it would finally work. But eventually it did
work.

Because of its inconsistency, could this be a threading or timing issue
with the pg gem?
 
Chris Kottom

[Note: parts of this message were removed to make it a legal post.]

So for what it's worth, the particular issue I was running into was not
caused by Classifier at all but rather by the test data I was using. The
application this is for is still under development, so the texts that I'm
indexing are multiple-paragraph blocks being generated using Faker::Lorem.
The problem here is that this library has a limited vocabulary of fewer than
200 words, and Classifier::LSI requires that the number of unique words
indexed across all texts be greater than or equal to the number of texts
indexed. (It seems like it was also filtering out some number of
words -- probably one- and two-character words which might be considered
stop words.) So as soon as the number of records indexed exceeded the
number of unique words, the underlying library (GSL, the GNU Scientific
Library) raised an error, which the gsl gem surfaced as the exception above.
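
A quick sanity check along these lines could have caught the problem before indexing. This is only a sketch: `unique_word_count` and `enough_vocabulary?` are hypothetical helpers, not part of Classifier, and the tokenization (lowercased runs of three or more letters) is an assumption that merely approximates whatever filtering the gem really applies.

```ruby
require 'set'

# Hypothetical helper: counts distinct words across all texts.
# Assumption: words shorter than three letters are ignored, to roughly
# mimic the filtering of very short words described above.
def unique_word_count(texts)
  texts.each_with_object(Set.new) do |text, vocab|
    text.downcase.scan(/[a-z]{3,}/) { |word| vocab << word }
  end.size
end

# True when there are at least as many unique words as texts, i.e. the
# condition described above for the decomposition to be possible.
def enough_vocabulary?(texts)
  unique_word_count(texts) >= texts.size
end

small = ["lorem ipsum dolor", "ipsum dolor lorem",
         "dolor lorem ipsum", "lorem dolor ipsum"]
rich  = ["ruby parses text", "matrices need rows", "columns hold documents"]

puts enough_vocabulary?(small) # 3 unique words for 4 texts
puts enough_vocabulary?(rich)  # 9 unique words for 3 texts
```

With generated filler text, the first case becomes more likely as the number of records grows, because the vocabulary stays fixed while the text count keeps rising.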

I've now tested this against a set of strings utilizing a richer vocabulary,
and even though indexing slows down sharply as the number of records grows,
it completes successfully. Hope this description helps someone
else out.
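
The stack trace earlier in the thread shows `add_item` calling `build_index`, so each added record appears to trigger a full rebuild, which may explain part of the slowdown. A possible mitigation, sketched below, is to add every text first and build the index once. `index_in_batch` is a hypothetical helper, and the `:auto_rebuild => false` option in the commented usage is an assumption about the Classifier version in use, so check it against your fork before relying on it.

```ruby
# Hypothetical helper: adds every text first, then builds the index once.
# `lsi` is any object that responds to add_item and build_index.
def index_in_batch(lsi, texts)
  texts.each { |text| lsi.add_item(text) }
  lsi.build_index
  lsi
end

# Assumed usage with the classifier gem (not run here, needs the gem and
# a working GSL or its pure-Ruby fallback):
#
#   require 'classifier'
#   lsi = Classifier::LSI.new(:auto_rebuild => false)
#   index_in_batch(lsi, texts)
#   lsi.find_related(texts.first, 3)
```

The helper only takes the index object as an argument, so it can be tried against a stand-in object before wiring it to the real gem.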
 
Ryan Davis

Since I am not doing anything unusual (using common gems and typical
methods for ruby/gem installation)

Well... you're calling into ruby_parser in a rails app, which I think is
a tad unusual... Still looking to get real information as to why.
 
