R
Ruby Quiz
This is a cool little data mining problem. Unfortunately, it is a fair bit of
work to get a good solution. Justin Either sent in the only full solution and
even he chose to cut a few corners. His code does produce almost 200 lines of
interesting feedback when run on the sample data set though, so it's fun to play
around with. The downside is that it's close to 300 lines of code. Because of
that, I'm going to try giving a pretty high-level overview of how it works.
Let's work backwards this time and just follow the code:
# ...
if ARGV.size < 1
puts "Usage: vehicle_counter.rb datafile [-avg]"
else
vc = VehicleCounter.new
vc.process(ARGV[0], ARGV[1] != nil)
end
This is the code that actually gets executed when you run the program. I always
call that the application code.
Here we see a simple branch that either presents a usage message or builds a
VehicleCounter and calls process() on it. This could actually be just a module,
since it doesn't maintain any state. It may help you to think of the following
code that way:
# ...
class VehicleCounter
# ...
def process(file, report_averages)
raw_data = parse(file)
for sensor in 0..1
sensor_report = DirectionDataReport.new(raw_data[sensor])
sensor_report.report(sensor, report_averages)
end
end
# ...
end
# ...
This defines the overall process of this program which is just two steps: read
in the data and build a report. The parse() method manages the reading step
while reporting is encapsulated in DirectionDataReport objects. We will begin
with the reading:
# ...
Dir_A = 0
Dir_B = 1
def parse(file)
times = []
dirs = [[], []]
f = File.open(file)
f.each_line do |line|
sensor, time = parse_record(line)
times << time
if (times.size % 2) == 0
if (times.size == 2 and sensor == Dir_A) or
(times.size == 4 and sensor == Dir_B)
# Remove "B" records from second direction
times = [times[0], times[2]] if sensor == Dir_B
dirs[sensor] << times
times = []
elsif (times.size == 4 and sensor != Dir_B)
puts "Parse error"
times = []
end
elsif (times.size % 2) == 1 and sensor == Dir_B
puts "Parse error - Unexpected B record"
end
end
f.close
dirs
end
def parse_record(data)
unpacked = data.unpack("a1a*")
return unpacked[0] == "A" ? Dir_A : Dir_B, unpacked[1].to_i
end
# ...
This parser is what I meant by cutting corners earlier. The data provided is
quite simple and you don't have to worry about cars going both directions at the
same time when working with it. Given that, a pair of A times is a northbound
car and a set of A, B, A, B times is a southbound car. This parse just hunts
for those sets.
Both sets of times are maintained in the Array of Arrays called dirs. The
constants are used to select the right group based on the indicator found in
unpack()ing the last record of the group. These two Arrays are the ultimate
results of this method.
Note that the file reading loop could be reduced to a File.foreach() call.
The rest of the code is reporting. We saw a DirectionDataReport initialized in
the process() method and then witnessed a call to report() on that object.
These two steps are the primary interface:
MSecPerMin = 1000 * 60
MSecPerHour = MSecPerMin * 60
InchesPerMile = 63360
class DirectionDataReport
Verbose = false
Sensors = %w(Northbound Southbound)
Days = %w(Mon Tue Wed Thu Fri Sat Sun)
def initialize(raw_data)
@raw_data = raw_data
end
def report(sensor, report_averages)
puts "Direction: #{Sensors[sensor]}"
puts "Total Cars: #{self.total_count}"
report_time_periods(sensor, report_averages, MSecPerHour * 12)
report_time_periods(sensor, report_averages, MSecPerHour)
report_time_periods(sensor, report_averages, MSecPerHour / 2)
report_time_periods(sensor, report_averages, MSecPerHour / 4)
puts ""
end
def total_count()
@raw_data.size
end
# ...
You can see here that the constructor just squirrels away the data from the
parse. The report() method, on the other, hand drives the output process.
After reporting the direction and a total car count, it displays time oriented
breakdowns for various scales: 12 hours, one hour, 30 minutes, and 15 minutes.
Here is the report_time_periods() method and three of the methods it relies on:
# ...
def report_time_periods(sensor, report_averages, time_period_length)
days = create_time_periods(time_period_length)
num_time_periods = (MSecPerHour * 24) / time_period_length
counts = count_per_time_period(days)
avg_speeds = speed_per_time_period(days)
avg_dists = dist_per_time_period(days)
puts("\nTime Interval: #{time_period_length/MSecPerMin} Minutes")
if (num_time_periods > 2)
peaks = find_peak_times(days)
puts("Peak Times")
for i in 0...peaks.size
printf("#{Days}:")
peaks.size.times {|p|
printf(format_time_interval_index(peaks[p][1],
time_period_length))
}
puts ""
end
end
puts("Statistics")
printf("Data ")
printf("\tDay") if not report_averages
num_time_periods.times{|i|
printf(format_time_interval_index(i, time_period_length))
}
report_column_data(days, num_time_periods, report_averages, counts,
report_averages ? "Avg Count" : "Count ", "% 5d")
report_column_data(days, num_time_periods, report_averages, avg_speeds,
"Avg Speed", "%02.02f")
report_column_data(days, num_time_periods, report_averages, avg_dists,
"Avg Dist ", "%02.02f")
puts ""
end
def create_time_periods(time_period_length = MSecPerHour)
days = []
time_periods = nil
prev_start = @raw_data[0][0] + 1
for data in @raw_data
if prev_start > data[0]
days << time_periods if time_periods != nil
cur_time_period = 0
time_periods = [[]]
puts "New day: data=#{data[0]}" if Verbose
elsif data[0] > ((cur_time_period + 1) * time_period_length)
while data[0] > ((cur_time_period + 1) * time_period_length)
cur_time_period += 1
time_periods[cur_time_period] = []
end
puts "New time period: data=#{data[0]}" if Verbose
end
time_periods[cur_time_period] << data
prev_start = data[0]
end
days << time_periods
days
end
def format_time_interval_index(index, time_period_length)
sprintf("\t%02d:%02d",
index * time_period_length / MSecPerHour,
(index * time_period_length / MSecPerMin) % 60)
end
def report_column_data(days, num_time_periods, report_averages, data,
data_label, format_string)
if report_averages
printf("\n#{data_label}")
for time in 0...num_time_periods
avg = 0
days.size.times {|day| avg += data[day][time] }
printf("\t#{format_string}", avg / days.size)
end
else
for day in 0...days.size
printf("\n#{data_label}\t#{Days[day]}")
for time in 0...num_time_periods
printf("\t#{format_string}", data[day][time])
end
end
end
end
# ...
There is a lot of code here obviously, but none of it is very complex. First,
report_time_periods() breaks the data down into time intervals using
create_time_periods(). That method is just a partition tool for the data. Once
it's divided out, a few utility method we will examine next are used to separate
the interesting information out.
After that, the entire rest of the of the method is printing code for what it
found. (The peak data section actually calculates, again using another helper,
and prints for appropriate data ranges.) Most of the print work is delegated to
the bottom two methods and much or their magic is for printing columnar data.
Here are the helpers I glossed over and the helpers they rely on:
# ...
def find_peak_times(days, num_peaks=4)
days.map do |day|
find_daily_peak_times(day, num_peaks)
end
end
def find_daily_peak_times(daily_time_periods, num_peaks)
peaks = []
daily_time_periods.size.times {|i|
peaks << [daily_time_periods.size, i]
}
peaks.sort.reverse.slice(0, num_peaks).sort {|a,b| a[1]<=>b[1]}
end
def count_per_time_period(days)
days.map do |day|
day.map {|time_period| time_period.size}
end
end
def speed_per_time_period(days)
days.map do |day|
day.map {|time_period| calc_average_speed(time_period) }
end
end
def dist_per_time_period(days)
days.map do |day|
day.map {|time_period| calc_average_distance(time_period) }
end
end
def calc_average_speed(time_period)
return 0 if time_period.size == 0
sum = 0
for time in time_period
sum += calc_speed(time[0], time[1])
end
sum / (time_period.size)
end
def calc_speed(start_time, end_time)
return (100.0 / (end_time - start_time)) *
MSecPerHour / (InchesPerMile * 1.0)
end
def calc_average_distance(time_period)
return 0 if time_period.size <= 1 # Need at least 2 cars
sum = 0
for i in 0...(time_period.size - 1)
sum += calc_distance(time_period[0], time_period[1])
end
return sum / (time_period.size - 1)
end
def calc_distance(leader_time, follower_time)
follower_speed = calc_speed(follower_time[0], follower_time[1])
dist = follower_speed * ((follower_time[0] - leader_time[0]) /
(MSecPerHour * 1.0))
return dist
end
end
# ...
There are three kinds of helpers here. The peak methods just sort the data and
take so many entries off the top. The *_per_time_period() method are wrappers
that apply the calc_*() methods over periods of data. Which means the calc_*()
methods are where the action is. They figure out car speeds, distances between
cars, and the averages for both using simple math. This is the primary data
that we see in the report output.
Again it's a lot of code to cover, but Justin has produced a great end result.
I'm not going to include the output here because it too is large, but go run the
program and take a look at what it builds.
My thanks to Justin for the heroic effort and Gavin for the interesting problem.
The Ruby Quiz began with an encoding problem that came out of a book and
tomorrow we have another one for you...
work to get a good solution. Justin Either sent in the only full solution and
even he chose to cut a few corners. His code does produce almost 200 lines of
interesting feedback when run on the sample data set though, so it's fun to play
around with. The downside is that it's close to 300 lines of code. Because of
that, I'm going to try giving a pretty high-level overview of how it works.
Let's work backwards this time and just follow the code:
# ...
if ARGV.size < 1
puts "Usage: vehicle_counter.rb datafile [-avg]"
else
vc = VehicleCounter.new
vc.process(ARGV[0], ARGV[1] != nil)
end
This is the code that actually gets executed when you run the program. I always
call that the application code.
Here we see a simple branch that either presents a usage message or builds a
VehicleCounter and calls process() on it. This could actually be just a module,
since it doesn't maintain any state. It may help you to think of the following
code that way:
# ...
class VehicleCounter
# ...
def process(file, report_averages)
raw_data = parse(file)
for sensor in 0..1
sensor_report = DirectionDataReport.new(raw_data[sensor])
sensor_report.report(sensor, report_averages)
end
end
# ...
end
# ...
This defines the overall process of this program which is just two steps: read
in the data and build a report. The parse() method manages the reading step
while reporting is encapsulated in DirectionDataReport objects. We will begin
with the reading:
# ...
Dir_A = 0
Dir_B = 1
def parse(file)
times = []
dirs = [[], []]
f = File.open(file)
f.each_line do |line|
sensor, time = parse_record(line)
times << time
if (times.size % 2) == 0
if (times.size == 2 and sensor == Dir_A) or
(times.size == 4 and sensor == Dir_B)
# Remove "B" records from second direction
times = [times[0], times[2]] if sensor == Dir_B
dirs[sensor] << times
times = []
elsif (times.size == 4 and sensor != Dir_B)
puts "Parse error"
times = []
end
elsif (times.size % 2) == 1 and sensor == Dir_B
puts "Parse error - Unexpected B record"
end
end
f.close
dirs
end
def parse_record(data)
unpacked = data.unpack("a1a*")
return unpacked[0] == "A" ? Dir_A : Dir_B, unpacked[1].to_i
end
# ...
This parser is what I meant by cutting corners earlier. The data provided is
quite simple and you don't have to worry about cars going both directions at the
same time when working with it. Given that, a pair of A times is a northbound
car and a set of A, B, A, B times is a southbound car. This parse just hunts
for those sets.
Both sets of times are maintained in the Array of Arrays called dirs. The
constants are used to select the right group based on the indicator found in
unpack()ing the last record of the group. These two Arrays are the ultimate
results of this method.
Note that the file reading loop could be reduced to a File.foreach() call.
The rest of the code is reporting. We saw a DirectionDataReport initialized in
the process() method and then witnessed a call to report() on that object.
These two steps are the primary interface:
MSecPerMin = 1000 * 60
MSecPerHour = MSecPerMin * 60
InchesPerMile = 63360
class DirectionDataReport
Verbose = false
Sensors = %w(Northbound Southbound)
Days = %w(Mon Tue Wed Thu Fri Sat Sun)
def initialize(raw_data)
@raw_data = raw_data
end
def report(sensor, report_averages)
puts "Direction: #{Sensors[sensor]}"
puts "Total Cars: #{self.total_count}"
report_time_periods(sensor, report_averages, MSecPerHour * 12)
report_time_periods(sensor, report_averages, MSecPerHour)
report_time_periods(sensor, report_averages, MSecPerHour / 2)
report_time_periods(sensor, report_averages, MSecPerHour / 4)
puts ""
end
def total_count()
@raw_data.size
end
# ...
You can see here that the constructor just squirrels away the data from the
parse. The report() method, on the other, hand drives the output process.
After reporting the direction and a total car count, it displays time oriented
breakdowns for various scales: 12 hours, one hour, 30 minutes, and 15 minutes.
Here is the report_time_periods() method and three of the methods it relies on:
# ...
def report_time_periods(sensor, report_averages, time_period_length)
days = create_time_periods(time_period_length)
num_time_periods = (MSecPerHour * 24) / time_period_length
counts = count_per_time_period(days)
avg_speeds = speed_per_time_period(days)
avg_dists = dist_per_time_period(days)
puts("\nTime Interval: #{time_period_length/MSecPerMin} Minutes")
if (num_time_periods > 2)
peaks = find_peak_times(days)
puts("Peak Times")
for i in 0...peaks.size
printf("#{Days}:")
peaks.size.times {|p|
printf(format_time_interval_index(peaks[p][1],
time_period_length))
}
puts ""
end
end
puts("Statistics")
printf("Data ")
printf("\tDay") if not report_averages
num_time_periods.times{|i|
printf(format_time_interval_index(i, time_period_length))
}
report_column_data(days, num_time_periods, report_averages, counts,
report_averages ? "Avg Count" : "Count ", "% 5d")
report_column_data(days, num_time_periods, report_averages, avg_speeds,
"Avg Speed", "%02.02f")
report_column_data(days, num_time_periods, report_averages, avg_dists,
"Avg Dist ", "%02.02f")
puts ""
end
def create_time_periods(time_period_length = MSecPerHour)
days = []
time_periods = nil
prev_start = @raw_data[0][0] + 1
for data in @raw_data
if prev_start > data[0]
days << time_periods if time_periods != nil
cur_time_period = 0
time_periods = [[]]
puts "New day: data=#{data[0]}" if Verbose
elsif data[0] > ((cur_time_period + 1) * time_period_length)
while data[0] > ((cur_time_period + 1) * time_period_length)
cur_time_period += 1
time_periods[cur_time_period] = []
end
puts "New time period: data=#{data[0]}" if Verbose
end
time_periods[cur_time_period] << data
prev_start = data[0]
end
days << time_periods
days
end
def format_time_interval_index(index, time_period_length)
sprintf("\t%02d:%02d",
index * time_period_length / MSecPerHour,
(index * time_period_length / MSecPerMin) % 60)
end
def report_column_data(days, num_time_periods, report_averages, data,
data_label, format_string)
if report_averages
printf("\n#{data_label}")
for time in 0...num_time_periods
avg = 0
days.size.times {|day| avg += data[day][time] }
printf("\t#{format_string}", avg / days.size)
end
else
for day in 0...days.size
printf("\n#{data_label}\t#{Days[day]}")
for time in 0...num_time_periods
printf("\t#{format_string}", data[day][time])
end
end
end
end
# ...
There is a lot of code here obviously, but none of it is very complex. First,
report_time_periods() breaks the data down into time intervals using
create_time_periods(). That method is just a partition tool for the data. Once
it's divided out, a few utility method we will examine next are used to separate
the interesting information out.
After that, the entire rest of the of the method is printing code for what it
found. (The peak data section actually calculates, again using another helper,
and prints for appropriate data ranges.) Most of the print work is delegated to
the bottom two methods and much or their magic is for printing columnar data.
Here are the helpers I glossed over and the helpers they rely on:
# ...
def find_peak_times(days, num_peaks=4)
days.map do |day|
find_daily_peak_times(day, num_peaks)
end
end
def find_daily_peak_times(daily_time_periods, num_peaks)
peaks = []
daily_time_periods.size.times {|i|
peaks << [daily_time_periods.size, i]
}
peaks.sort.reverse.slice(0, num_peaks).sort {|a,b| a[1]<=>b[1]}
end
def count_per_time_period(days)
days.map do |day|
day.map {|time_period| time_period.size}
end
end
def speed_per_time_period(days)
days.map do |day|
day.map {|time_period| calc_average_speed(time_period) }
end
end
def dist_per_time_period(days)
days.map do |day|
day.map {|time_period| calc_average_distance(time_period) }
end
end
def calc_average_speed(time_period)
return 0 if time_period.size == 0
sum = 0
for time in time_period
sum += calc_speed(time[0], time[1])
end
sum / (time_period.size)
end
def calc_speed(start_time, end_time)
return (100.0 / (end_time - start_time)) *
MSecPerHour / (InchesPerMile * 1.0)
end
def calc_average_distance(time_period)
return 0 if time_period.size <= 1 # Need at least 2 cars
sum = 0
for i in 0...(time_period.size - 1)
sum += calc_distance(time_period[0], time_period[1])
end
return sum / (time_period.size - 1)
end
def calc_distance(leader_time, follower_time)
follower_speed = calc_speed(follower_time[0], follower_time[1])
dist = follower_speed * ((follower_time[0] - leader_time[0]) /
(MSecPerHour * 1.0))
return dist
end
end
# ...
There are three kinds of helpers here. The peak methods just sort the data and
take so many entries off the top. The *_per_time_period() method are wrappers
that apply the calc_*() methods over periods of data. Which means the calc_*()
methods are where the action is. They figure out car speeds, distances between
cars, and the averages for both using simple math. This is the primary data
that we see in the report output.
Again it's a lot of code to cover, but Justin has produced a great end result.
I'm not going to include the output here because it too is large, but go run the
program and take a look at what it builds.
My thanks to Justin for the heroic effort and Gavin for the interesting problem.
The Ruby Quiz began with an encoding problem that came out of a book and
tomorrow we have another one for you...