building ruby for speed: wise or otherwise?

zdennis

Hugh said:
At the moment my script to populate the tables is taking about an
hour. Anyway it's mostly ruby I think, because it spends most of
the time setting up the arrays before it populates the db with them.

I had similar problems with ActiveRecord and large datasets. It's slow. I wrote an ActiveRecord
extension (not released yet; I'm still working out how best to release it, as a plugin or as a
patch to the Ruby on Rails dev team). It makes large dataset entry 10 times faster. If you like, I can
email you the code privately so you can see if it helps.

Zach
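
A rough sketch of the general idea behind this kind of speed-up: send one multi-row INSERT per batch instead of one INSERT per record. This is not Zach's extension (its code is not shown in the thread). The `bulk_insert_sql` helper, the table and column names, and the naive quoting are all assumptions made for illustration; a real script should use the database adapter's own quoting.

```ruby
# Hypothetical helper: builds multi-row INSERT statements in batches.
# Quoting is deliberately naive (it only doubles single quotes).
def bulk_insert_sql(table, columns, rows, batch_size = 500)
  quote = lambda do |value|
    value.nil? ? "NULL" : "'" + value.to_s.gsub("'", "''") + "'"
  end
  rows.each_slice(batch_size).map do |batch|
    tuples = batch.map { |row| "(" + row.map(&quote).join(", ") + ")" }
    "INSERT INTO #{table} (#{columns.join(', ')}) VALUES #{tuples.join(', ')}"
  end
end

# Made-up sample rows; a batch size of 2 yields two statements.
rows = [["Ann", "O'Neil"], ["Bob", nil], ["Cy", "Young"]]
statements = bulk_insert_sql("students", %w[forename surname], rows, 2)
statements.each { |sql| puts sql }
```

Each statement could then be passed to the connection in a single round trip, which is where most of the saving would come from.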
 
Hugh Sasse

Zach said:
I had similar problems with ActiveRecord and large datasets. It's slow. I wrote
an ActiveRecord extension (not released yet; I'm still working out how best to
release it, as a plugin or as a patch to the Ruby on Rails dev team). It
makes large dataset entry 10 times faster. If you like, I can email you
the code privately so you can see if it helps.

How big is it? I'm wondering how much there is to learn, since I'm
still getting to grips with all this Rails stuff :). I am
interested though. Thank you.
Hugh
 
Eric Christensen

vjoel said:
I'm embarrassed to say it was 6.0. I have 8.0 (express) but can't get
past the "MSVCR80.DLL missing" problem, at least with the
mkmf.rb-generated Makefile. (For regular projects in MSVC 8.0, you can
get around this problem by deleting the foobar.exe.embed.manifest.res
file from the Debug dir of project foobar.) Anyone have any ideas?

I'm using the single-click installer ruby, which IIRC is compiled with
7.1. Maybe it's not a fair comparison with gcc-built ruby, since that
will take advantage of i686 vs. i386. So, not a very scientific
comparison at all--it would be best to use the latest MS compiler, build
ruby from scratch, and make sure to use the same arch settings as for
gcc.

I'm just glad to see that gcc is so much better than it was.

I'd be very interested in seeing it compiled with the just-released VC
using LTCG (link-time code generation). It can make inter-module
optimizations and adjust calling conventions on a case-by-case basis.
 
Joel VanderWerf

Eric said:
I'd be very interested in seeing it compiled with the just-released VC
using LTCG (link-time code generation). It can make inter-module
optimizations and adjust calling conventions on a case-by-case basis.

I'd love to try it. Any idea how to hack the Makefiles generated by
mkmf.rb to fix the "MSVCR80.DLL missing" problem? Can you compile any
ruby extension successfully with 8.0?

Do you know that LTCG is included in the Express version? ISTR that some
optimization features (maybe profile guided optimization) were not in
the express edition. However, the Property Page for Linker/Optimization
seems to allow setting LTCG and PGO.
 
Hugh Sasse

Well that was an interesting estimate. So far it has been 29 hours
to profile it...

Hugh
 
Eric Christensen

vjoel said:
I'd love to try it. Any idea how to hack the Makefiles generated by
mkmf.rb to fix the "MSVCR80.DLL missing" problem? Can you compile any
ruby extension successfully with 8.0?

Do you know that LTCG is included in the Express version? ISTR that some
optimization features (maybe profile guided optimization) were not in
the express edition. However, the Property Page for Linker/Optimization
seems to allow setting LTCG and PGO.

I'm *very* new to Ruby, so I haven't really tried anything yet. But I
just got a copy of VS 2005 Pro, so I'm itchin' to try. Now if I could
just find some spare time...

Do you know if there is a benchmark that would give an idea of the
relative speed of the interpreter over a representative workload, or
could perhaps be used as the scenario for the profile-guided
optimization?

One more question: is there a 64-bit Ruby?
 
Kaspar Schiess

(In response to a post by Robert Klemme)
Unfortunately we're close to release and I don't really have much time to
look into this deeper. If anyone else volunteers...

Please Hugh, post the script or at least parts of it. I have been doing
lots of database filling recently and might be able to give you a few
pointers. Here's the general ones:

Profiling is often overkill; benchmarking may well be enough. Roughly
knowing where the time goes is a big help in guiding optimization.

Be sure to try running it with a small data set, to speed up the test-
change-test cycle (or whatever the cycle is called in languages that you
don't compile). Maybe even profile that way.
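
A minimal sketch of that small-data-set cycle, assuming a line-oriented, pipe-separated input like the one discussed in this thread. The `sample_file` helper and the generated stand-in data are hypothetical, there only so the example runs on its own; a real run would pass the path of the real input file.

```ruby
require 'benchmark'
require 'tempfile'

# Hypothetical helper: copy the first `limit` lines of `input` into a
# temporary file and return the Tempfile, so a trial run stays short.
def sample_file(input, limit = 1000)
  sample = Tempfile.new('sample')
  File.open(input, 'r') do |infp|
    infp.each_line.first(limit).each { |line| sample.write(line) }
  end
  sample.flush
  sample
end

# Stand-in data so the sketch is self-contained.
full = Tempfile.new('full')
5000.times { |i| full.puts "Forename#{i} | Surname#{i} | P#{i}" }
full.flush

sample = sample_file(full.path, 100)
record_count = 0
elapsed = Benchmark.realtime do
  File.foreach(sample.path) do |record|
    record.split(/\s*\|\s*/)
    record_count += 1
  end
end
puts "processed #{record_count} records in #{elapsed} s"
```

Timing the sample first gives a quick feel for how the run time scales before committing to the full file.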

Do you have a generation stage followed by a fill stage? Or is the
computation intermingled with database accesses?

hope to be of help
k
 
Hugh Sasse

(In response to a post by Robert Klemme)

Please Hugh, post the script or at least parts of it. I have been doing
lots of database filling recently and might be able to give you a few
pointers. Here's the general ones:

I don't think it is particularly pretty. TableMaker used to
generate SQL directly, now it uses AR instead, so the output file is
unused. I've tried to clear out the other unused stuff that is of no
use to you, though I may still need it.

Thank you
Hugh

#!/usr/local/bin/ruby -w

$: << '/home/hgs/aeg_intranet/csestore/app/models'

# require 'csv'
require 'set'
require 'open-uri'
require 'net/http'
require 'date'
require 'time' # for Time.parse, used in get_picture
require 'md5'
# require 'hashattr'
require 'fasthashequals'

require "rubygems"
require_gem "activerecord" # for the ORM.

# Makes no sense to include these before active_record
# These are just (almost empty) models from rails. There are some
# relationship definitions (has_a, etc) but that's about it.
require 'student'
require 'cse_module'
require 'device'

$debug = false


# Class for creating the database tables from the supplied input
class TableMaker
attr_accessor :students, :cse_modules
INPUT = "hugh.csv"
OUTPUT = "populate_tables.sql"

ACCEPTED_MODULES = /^\"TECH(100[1-7]|200\d|201[01]|300\d|301[0-2])|MUST100[28]/

STRFTIME_FORMAT = "%a, %d %b %Y %H:%M:%S GMT"

PATH_TO_IMAGES = 'Z:\\new\\jpegs\\'

# Read in the database and populate the tables.
def initialize(input=INPUT, output=OUTPUT)
begin
puts "TableMaker.initialize (input=#{input.inspect}, output=#{output.inspect})"
# check these agree
# Struct.new( "Student", :forename, :surname, :birth_dt,
# :picture, :coll_status)
# Struct.new("Ident", :student, :pnumber)

# Struct.new("CourseModule", :aos_code, :dept_code,
# :aos_type, :full_desc)

# Struct.new("StudentModule", :student_id, :course_module)

@students = Set.new()
@cse_modules = Set.new()
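# The block must store the new Set; otherwise each lookup returns a throwaway.
@student_modules = Hash.new { |hash, key| hash[key] = Set.new }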
# Most images will be written in bulk so cache them
@web_timestamps = Hash.new()
# Initialize variables
forename, surname, birth_dt, pnumber, aos_code,
acad_period, stage_ind, dept_code, stage_code, aos_type,
picture, coll_status, full_desc = [nil] * 13

student, cse_module, ident = nil, nil, nil
record = nil

last_pnumber, last_aos_code = nil, nil
last_student, last_cse_module = nil, nil

open(input, 'r') do |infp|
while record = infp.gets
# record.strip!
puts "record is #{record}" if $debug
# Don't split off the rest till we need it.
# Hopefully splitting on strings is faster.
forename, surname, birth_dt,
pnumber, aos_code, the_rest = record.split(/\s*\|\s*/,6)

next unless aos_code =~ ACCEPTED_MODULES

forename, surname, birth_dt, pnumber, aos_code,
acad_period, stage_ind, dept_code, stage_code, aos_type,
picture, coll_status, full_desc = record.split(/\s*\|\s*/)


puts "from record, picture is [#{picture.inspect}]." if $debug

if pnumber == last_pnumber
student = last_student
puts "pnumber set to last_pnumber" if $debug
else
# Structures for student
student = Student.new(
:forename => forename,
:surname => surname,
:birth_dt => birth_dt,
:pnumber => pnumber,
:picture => picture,
:coll_status => coll_status
)

# Avoid duplicates
# unless @students.include? student
@students.add student
# else
# puts "Already seen #{student}" if $debug
# end
last_pnumber = pnumber
last_student = student

end


# Structures for module data.
if aos_code == last_aos_code
this_cse_module = last_cse_module
else
this_cse_module = CseModule.new(
:aos_code => aos_code,
:dept_code => dept_code,
:aos_type => aos_type,
:full_desc => full_desc
)
end

# Avoid duplicates
@cse_modules.add this_cse_module
last_cse_module = this_cse_module
@student_modules[student].add this_cse_module

puts "cse_module is #{this_cse_module}" if $debug

end
end
rescue
puts "\n"
puts $!
puts $!.backtrace.join("\n")
end
end

def has_student?(given_student)
result = @students.member?(given_student)
puts "has_student?: @students.size is #{@students.size}, result is #{result}"
return result
end

def diff_students(other_table)
diff_students = @students - other_table.students
return Set.new(diff_students)
end

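# The pnumber is a barcode that uniquely identifies a student.
def has_pnumber?(apnumber)
# @students holds Student objects, so compare their pnumbers.
return @students.any? { |student| student.pnumber == apnumber }
end

def new_pnumber(old_table)
# @pnumbers was never assigned; derive the pnumbers from @students.
pnumbers = @students.collect { |student| student.pnumber }
new_pnumbers = pnumbers.reject do |pn|
old_table.has_pnumber?(pn)
end
return Set.new(new_pnumbers)
end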

# Convert the picture to a URI and get it, if necessary.
# moved out of make_cards to shorten that function.
def get_picture(pic_name)
pic = "#{pic_name}"
pic.gsub!(/\"/,'')
pic.gsub!(/ /, "%20")
url = pic.dup
puts "pic is #{pic.inspect}\nurl is #{url.inspect}" # if $debug
pic.sub!(/^.*\//,'')
puts "pic is now #{pic.inspect}" # if $debug
if pic.empty?
puts "No such picture " if $debug
elsif pic =~ /^Z:\\/i
puts "Already got this " if $debug
else
Dir.chdir("./images") do
begin
grab = true
url =~ /^http:\/\/([^:\/]+):?([^\/]*?)(.*)/
host, port, path = $1, $2, $3
port = 80 if port.nil? or port.empty?
puts "pic #{pic}:- host #{host} port #{port} path #{path} " #if $debug
Net::HTTP.start(host, port) do |http|
header = http.head(path)
lastmod = header['last-modified']
# timestamp = DateTime.strptime(lastmod, STRFTIME_FORMAT)
# timestamp = Time.new(DateTime.strptime(lastmod, STRFTIME_FORMAT))
lastmod ||=Time.now.to_s
timestamp = (@web_timestamps[lastmod] ||= Time.parse(lastmod))

if File.exist?(pic)
mtime = File.mtime(pic)
puts "mtime #{mtime} timestamp #{timestamp}" if $debug
if mtime > timestamp
puts "file is newer, skip." if $debug
grab = false
end
end
if grab
open(pic, "wb") do |image|
image.print http.get(path).body
end
end
end
rescue => e
puts e.inspect
puts "\n"
puts "#{$!}, #{e}"
puts $!.backtrace().join("\n")
end
end
end
return PATH_TO_IMAGES + pic + "\r\n"
end


# Output all the data necessary to create the id cards.
def make_cards(output,the_students = @students)
personal_fields = [:forename, :surname, :birth_dt, :pnumber]
open(output, "w") do |outf|
the_students.each do |student|
puts "student:- #{student} :" if $debug
outstring = personal_fields.collect do |message|
# Remove unwanted quotation marks
"#{student.send(message)}, ".gsub(/"/,'')
end.join('')
# We need to iterate in case a student has two ids
# Not any more -- we know they will look like two students.
# It doesn't matter.

outstring += get_picture(student.picture)
outf.print outstring
end
end
end

# Cannot update the database til the comparison is complete, so
# this code must be moved into here
def update_database
@students.each do |student|
puts "update_database(): pnumber is #{student.pnumber}"
begin
orig_student = Student.find(:first, :conditions => ["pnumber = ?", student.pnumber])
puts "update_database(): orig_student.pnumber is #{orig_student.pnumber}"
rescue Exception => e
puts "update_database(): exception is #{e}"
puts "\n"
puts $!
puts $!.backtrace.join("\n")
puts "\n"
orig_student = nil
end
if orig_student.nil? # i.e. nothing found
student.save!
else
orig_student.update_attributes(
:surname => student.surname,
:birth_dt => student.birth_dt,
:picture => student.picture,
:coll_status => student.coll_status
)
end
end
@cse_modules.each do |cse_module|
orig_cse_module = CseModule.find(:first, :conditions => ['aos_code = ?', cse_module.aos_code]) rescue nil
if orig_cse_module.nil?
cse_module.save!
else
orig_cse_module.update_attributes(
:dept_code => cse_module.dept_code,
:aos_type => cse_module.aos_type,
:full_desc => cse_module.full_desc
)
end
end
# This next line should sort out the join table.
@student_modules.each do |student, modules|
the_student = Student.find(:first, :conditions => ['pnumber = ?', student.pnumber])
modules.each do |cse_module|
the_cse_module = CseModule.find(:first, :conditions => ['aos_code = ?', cse_module.aos_code])
puts "update_database(): updating #{the_cse_module} with #{the_student}"
the_cse_module.students << the_student
end
end
end
end

class KitTableMaker

def initialize(input)
# create outside the block for speed.
name, serialno, barcode = [nil]*3
@kit = Set.new()
barcodes = Set.new()

open(input, 'r') do |infp|
while record = infp.gets
name, serialno, barcode = record.split(/\s*,\s*/,3)
if barcodes.member?(barcode)
puts "Duplicate barcode #{barcode}"
else
device = Device.new(:description => name,
:serialno => serialno,
:barcode => barcode)
barcodes.add(barcode)
@kit.add device
end
end
end
end


def update_database
@kit.each do |device|
begin
orig_kit = Device.find(:first, :conditions => ["barcode = ?", device.barcode])
rescue Exception => e
puts "Device::update_database: exception is #{e}"
puts "\n", $!, $!.backtrace.join("\n"), "\n"
end
if orig_kit.nil?
device.save!
else
begin
# Device is created with :description above, so read it back by that name.
orig_kit.update_attributes(:description => device.description,
:serialno => device.serialno,
:barcode => device.barcode)
rescue Exception => e
puts "Device::update_database: exception is #{e}"
puts "\n", $!, $!.backtrace.join("\n"), "\n"
end
end
end
end
end


if __FILE__ == $0
begin
ActiveRecord::Base.establish_connection(
:adapter => 'mysql',
:host => 'localhost',
:port => 3608,
:database => 'csestore_development',
:username => 'hgs',
:password => 'post-it-to-ruby-talk?'
)
new_table = TableMaker.new("hugh.csv", "update_tables.sql")
new_table.update_database()
old_table = TableMaker.new("hugh.csv.old")

new_table.make_cards("cards.out")
new_table.make_cards("new_cards.out", new_table.diff_students(old_table))
rescue Exception => e
puts "\n"
puts "#{$!}, #{e}"
puts $!.backtrace().join("\n")
end

device_table = KitTableMaker.new("stock1.csv")
device_table.update_database()
end
 
Kaspar Schiess

Hello Hugh,

I'd propose modifying your main logic as follows:
require 'benchmark'
include Benchmark

# Declare these first: a variable first assigned inside a block is local to it.
new_table = old_table = nil

puts measure { new_table = TableMaker.new("hugh.csv",
"update_tables.sql") }
puts measure { new_table.update_database() }
puts measure { old_table = TableMaker.new("hugh.csv.old") }

puts measure { new_table.make_cards("cards.out") }
puts measure { new_table.make_cards("new_cards.out",
new_table.diff_students(old_table)) }

and then running it with a reduced test set. That should give you a hint
as to where time is spent. I have read the code you posted, but cannot
find a performance hog in it. Perhaps you meant to say 'huge.csv' instead
of 'hugh.csv'? How many students are there? How many courses? How many
courses per student on average?

Also, I assume you know that fetching the image files from http can
potentially be very slow. To speed that up, you could parallelize the
process by using a queue, a few workers and a stub image that you can
return.
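
One possible reading of that suggestion, as a hedged sketch: a `Queue` of URLs, a few worker threads, and a stub file name used whenever a fetch fails. The names `fetch_all`, `STUB_IMAGE`, `WORKERS`, and the fake fetcher are hypothetical. The fetcher is injected so the example runs without a network; in the real script it would wrap the `Net::HTTP` code in `get_picture`.

```ruby
require 'thread'

STUB_IMAGE = "stub.jpg"   # hypothetical placeholder file name
WORKERS = 4               # assumed worker count

# `fetcher` is any callable that downloads one picture and returns the
# local file name. Returns a hash of url => file name (or the stub).
def fetch_all(urls, fetcher, workers = WORKERS)
  jobs = Queue.new
  urls.each { |url| jobs << url }
  workers.times { jobs << :done }   # one stop marker per worker
  results = {}
  lock = Mutex.new

  threads = Array.new(workers) do
    Thread.new do
      while (url = jobs.pop) != :done
        name = begin
          fetcher.call(url)
        rescue StandardError
          STUB_IMAGE                # fall back to the stub on failure
        end
        lock.synchronize { results[url] = name }
      end
    end
  end
  threads.each(&:join)
  results
end

# Fake fetcher standing in for the real HTTP download.
fake_fetcher = lambda do |url|
  raise IOError, "unreachable" if url.include?("broken")
  File.basename(url)
end

urls = %w[http://example.invalid/a.jpg
          http://example.invalid/broken.jpg
          http://example.invalid/c.jpg]
pictures = fetch_all(urls, fake_fetcher)
pictures.each { |url, name| puts "#{url} -> #{name}" }
```

Because the workers spend most of their time waiting on the network, even a handful of threads can overlap many downloads.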

Or you can of course just wait for the machine ;) .. Too bad Moore's law
doesn't say that you actually get a new machine every 18 months, only
that it is available.

best greetings,
kaspar
 
Eric Christensen

Eric said:
I'm *very* new to Ruby, so I haven't really tried anything yet. But I
just got a copy of VS 2005 Pro, so I'm itchin' to try. Now if I could
just find some spare time...

I finally got miniruby built with VC 2005 & /GL: it seems to run about
10% faster than the 1.8.3 Windows drop.
 
Joel VanderWerf

Eric said:
I finally got miniruby built with VC 2005 & /GL: it seems to run about
10% faster than the 1.8.3 Windows drop.

What did you do to get miniruby to build?

What's preventing a full ruby build?
 
