[QUIZ] Index and Query (#54)

D

David Balmain

Hey Guys,
I've written up the benchmark results. Don't take them too seriously.
I didn't exactly run them very scientifically. It just gives you a
ball park figure for how well each indexer performs against another.
Hopefully you'll be able to use it to help improve the performance of
your ruby code in future.

http://www.davebalmain.com/articles/2005/11/15/ruby-quiz-54-results

There were certainly a few surprises in there for me. Like how did Bob
and Dale manage to index so quickly without using the StringScanner
class. And why is Bob's search so fast, even though he didn't use an
inverted index. Even for much larger document sets his search runs
almost the same speed as Simple Ferret. He used exactly the same
algorithm as is used in Simple Search, yet his searches are much
quicker. I was also amazed at how well these simple solutions compared
to Lucene, although I realize I'm comparing apples with oranges.

Cheers,
Dave
 
D

Dale Martenson

David,

After reading your results I thought I would try and make a couple of
simple changes. I attempted to cleanup the 'insert' routine since that
is where most of the processing time seemed to be spent. I also added
the ability to perform multi-term searching (individual terms or single
string). This will worsen the look-up times, but it might be a good
change.

If possible, could you run this version through your test to see how it
does?

class IndexHash
def initialize( documents=nil )
@index = Hash.new( [] )
input( documents ) if documents
end

def input( documents )
documents.each_pair do |symbol, contents|
contents.split.each { |word| insert( symbol, word) }
end
end

def insert( document_symbol, word )
w = word.downcase
@index[w] += [ document_symbol ] unless @index[w].include?(
document_symbol )
end

def find( *strings )
result = []
strings.each do |string|
string.split.each do |word|
result += @index[ word.downcase ]
end
end
result.uniq
end

def words
@index.keys.sort
end
end

class IndexBitmap
def initialize( documents=nil )
@index = []
@documents = Hash.new( 0 )
input( documents ) if documents
end

def input( documents )
documents.each_pair do |symbol, contents|
contents.split.each { |word| insert( symbol, word) }
end
end

def insert( document_symbol, word )
w = word.downcase
@index.push( w ) unless @index.include?( w )
@documents[ document_symbol ] |= (1<<@index.index( w ))
end

def find( *strings )
result = []
mask = 0

strings.each do |string|
string.split.each do |word|
w = word.downcase
mask |= (1<<@index.index(w)) if @index.index(w)
end
end

@documents.each_pair do |symbol, value|
result.push( symbol ) if value & mask
end
result
end

def words
@index.sort
end
end
 
D

David Balmain

David,

After reading your results I thought I would try and make a couple of
simple changes. I attempted to cleanup the 'insert' routine since that
is where most of the processing time seemed to be spent. I also added
the ability to perform multi-term searching (individual terms or single
string). This will worsen the look-up times, but it might be a good
change.

If possible, could you run this version through your test to see how it
does?

Hi Dale,
I've updated the results. You may notice a number of the indexing
times have changed. I modified my test to make it a little fairer.
Anyway, you earned yourself some green realestate on the bottom chart.
Somethings up with the search results from your first bitmap index.
I'm just running that one again. By the way, I had to change the
second last line of your find method to;

result.push( symbol ) if value & mask > 0

Cheers,
Dave
class IndexHash
def initialize( documents=3Dnil )
@index =3D Hash.new( [] )
input( documents ) if documents
end

def input( documents )
documents.each_pair do |symbol, contents|
contents.split.each { |word| insert( symbol, word= ) }
end
end

def insert( document_symbol, word )
w =3D word.downcase
@index[w] +=3D [ document_symbol ] unless @index[w].inclu= de?(
document_symbol )
end

def find( *strings )
result =3D []
strings.each do |string|
string.split.each do |word|
result +=3D @index[ word.downcase ]
end
end
result.uniq
end

def words
@index.keys.sort
end
end

class IndexBitmap
def initialize( documents=3Dnil )
@index =3D []
@documents =3D Hash.new( 0 )
input( documents ) if documents
end

def input( documents )
documents.each_pair do |symbol, contents|
contents.split.each { |word| insert( symbol, word= ) }
end
end

def insert( document_symbol, word )
w =3D word.downcase
@index.push( w ) unless @index.include?( w )
@documents[ document_symbol ] |=3D (1<<@index.index( w ))
end

def find( *strings )
result =3D []
mask =3D 0

strings.each do |string|
string.split.each do |word|
w =3D word.downcase
mask |=3D (1<<@index.index(w)) if @index.= index(w)
end
end

@documents.each_pair do |symbol, value|
result.push( symbol ) if value & mask
end
result
end

def words
@index.sort
end
end
 
D

Dale Martenson

Hi Dave,

Thanks for updating the results.

My first version of the IndexBitmap class has a bug in the 'insert'
routine. It added values rather than ORed values together. Oops. None
of my tests originally covered this case so it slipped by me.

@documents[ document_symbol ] += (1<<@index.index( word.downcase ))

should be

@documents[ document_symbol ] |= (1<<@index.index( word.downcase ))

Thanks for catching my error where I wasn't checking the mask correctly
in all cases. Again, I lacked a test for this case. Shame. Shame.

Thank you very much for your testing and analysis.

--Dale
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,197
Messages
2,571,038
Members
47,633
Latest member
BriannaLyk

Latest Threads

Top