R
Ruby Quiz
The first thing to consider in this quiz is what does a Regexp to match a number
look like? Here's the most basic answer to match 1..12:
1|2|3|4|5|6|7|8|9|10|11|12
Note: You might want to reverse the order of that, unless you can count on your
anchoring to match the right thing.
Obviously, the above works and is dirt simple to implement. Here's a submitted
solution by Tanaka Akira that does pretty much that:
def Regexp.build(*args)
args = args.map {|arg| Array(arg) }.flatten.uniq.sort
neg, pos = args.partition {|arg| arg < 0 }
/ \A (?: -0*#{Regexp.union(*neg.map {|arg| (-arg).to_s })} |
0*#{Regexp.union(*pos.map {|arg| arg.to_s })} ) \z /x
end
The first line of that method is pretty clever, calling Array() on all the
passed arguments. That turns Range objects into the Array equivalent and wraps
simple Integers in an Array of their own. Following that up with flatten()
yields a single Array of all the elements we're trying to match.
The second line just separates the arguments into positive and negative groups.
Finally the third line builds a Regexp object from the created groups using the
nifty Regexp.union() that I wasn't even aware of when I made this quiz.
(This solution handles negative numbers and allows for arbitrary leading zeros.)
Is this quiz really this easy to solve? Obviously it can be, for simple data
sets. However, Tanaka's solution has limits. On my box, it only takes:
Regexp.build(1..10_000)
to get a "...regular expression too big..." error. Clearly, if your data set is
big you'll need to dig a little deeper.
That get's us back to our original question, but now with a qualification: What
is a short way to match a number with a Regexp? The most obvious optimization
to apply to our patterns is the use of character classes. Going back to our
1..12 example that might give us something like:
\d|1[0-2]
That's getting a lot more reasonable. Going to a serious example, even
1..1_000_000 is only:
[1-9]|[1-9]\d|[1-9]\d\d|[1-9]\d\d\d|[1-9]\d\d\d\d|[1-9]\d\d\d\d\d|1000000
Technically, we could keep going and get to something like:
[1-9]\d{0,5}|1000000
But that's a little trickier to build algorithmically and none of the submitted
solutions went quite that far.
The character class approach, on the other hand, was very popular.
The main trick to building character classes is to break down the passed Range
objects. You could also lump in the individual Integer arguments, but these are
pretty insignificant. Several solutions solved this by adding a method to the
Range class to convert them into Regexps.
Adding a Regexp.build() over that is trivial. Here's a nice example from Mark
Hubbart's second submission:
def Regexp.build(*args)
ranges, numbers = args.partition{|item| Range === item}
re = ranges.map{|r| r.to_re } + numbers.map{|n| /0*#{n}/ }
/^#{Regexp.union(*re)}$/
end
class Range
def to_re
# normalize the range format; we want end inclusive,
# integer ranges this part passes the load off to a
# newly built range if needed.
if exclude_end?
return( (first.to_i..last.to_i - 1).to_re )
elsif ! (first + last).kind_of?(Integer)
return( (first.to_i .. last.to_i).to_re )
end
# Deal with ranges that are wholly or partially negative.
# If range is only partially negative, split in half and
# get two regexen. join them together for the finish.
# If the range is wholly negative, make it positive, then
# add a negative sign to the regexp
if first < 0 and last < 0
# return a negatized version of the regexp
return /-#{(-last..-first).to_re}/
elsif first < 0
neg = (first..-1).to_re
pos = (0..last).to_re
return /(?:#{neg}|#{pos})/
end
### First, create an array of new ranges that are more
### suited to regex conversion.
# a and z will be the remainders of the endpoints
# of the range as we slice it
a, z = first, last
# build the first part of the list of new ranges.
list1 = []
num = first
until num > z
a = num # recycle the value
# get the first power of ten that leaves a remainder
v = 10
v *= 10 while num % v == 0 and num != 0
# compute the next value up
num += v - num % v
# store the value unless it's too high
list1 << (a..num-1) unless num > z
end
# build the second part of the list; counting down.
list2 = []
num = last + 1
until num < a
z = num - 1 # recycle the value
# slice to the nearest power of ten
v = 10
v *= 10 while num % v == 0 and num != 0
# compute the next value down
num -= num % v
# store the value if it fits
list2 << (num..z) unless num < a
end
# get the chewey center part, if needed
center = a < z ? [a..z] : []
# our new list
list = list1 + center + list2.reverse
### Next, convert each range to a regexp.
list.map! do |rng|
a, z = rng.first.to_s, rng.last.to_s
a.split(//).zip(z.split(//)).map do |(f,l)|
case
when f == l then f
when f.to_i + 1 == l.to_i then "[%s%s]" % [f,l]
when f+l == "09" then "\\d"
else
"[%s-%s]" % [f,l]
end
end.join # returns the regexp for *that* range
end
### Last, return the final regexp
/0*#{ list.join("|") }/
end
end
The first third of the to_re() method just deals with normalizing Ranges and is
very well commented.
The middle third divides the Range into Regexp friendly chunks which are groups
that share the same number of digits. For example, here is what to_re() builds
into the local "list" variable for 1..1_000:
1..9
10..99
100..999
1000..1000
The final third of to_re() builds character classes from these grouped Ranges.
The code inside "list.map! do ... end" is pretty clever and I recommend working
through it until you can follow how it works.
A big part of using these solutions is a question of how long you'll have to
wait for a Regexp object to be built and how quickly the result can find a
match. Here are some benchmarks, first for build times:
user system total real
James Edward Gray II 63.490000 0.020000 63.510000 ( 63.512904)
James Edward Gray II (2) 20.270000 0.000000 20.270000 ( 20.307565)
Jamis Buck 0.490000 0.000000 0.490000 ( 0.488664)
Mark Hubbart 0.290000 0.010000 0.300000 ( 0.297225)
Mark Hubbart (2) 0.030000 0.000000 0.030000 ( 0.028206)
Tanaka Akira 0.440000 0.000000 0.440000 ( 0.442430)
Thomas Leitner 0.020000 0.000000 0.020000 ( 0.013544)
Warren Brown 0.020000 0.000000 0.020000 ( 0.015618)
Or focusing in on the faster solutions over a bigger test:
user system total real
Mark Hubbart (2) 3.390000 0.000000 3.390000 ( 3.415515)
Thomas Leitner 2.120000 0.000000 2.120000 ( 2.138968)
Warren Brown 2.290000 0.000000 2.290000 ( 2.280555)
As you can see, Thomas Leitner and Warren Brown's solutions are also worth a
look, if you haven't checked them out already. Warren's even has a clever
feature to tell you which parameter of build() caused a match.
And here are some matching benchmarks (build times excluded from results):
user system total real
James Edward Gray II 0.070000 0.000000 0.070000 ( 0.079497)
James Edward Gray II (2) 0.100000 0.000000 0.100000 ( 0.115083)
Jamis Buck 5.480000 0.000000 5.480000 ( 5.502583)
Mark Hubbart 4.590000 0.000000 4.590000 ( 4.617739)
Mark Hubbart (2) 0.100000 0.010000 0.110000 ( 0.124735)
Tanaka Akira 4.570000 0.010000 4.580000 ( 4.741717)
Thomas Leitner 0.100000 0.000000 0.100000 ( 0.121431)
Warren Brown 0.130000 0.000000 0.130000 ( 0.123779)
As usual, my thanks go out to all who participated as well as to those who just
silently followed the discussion.
I've been unexpectedly called out of town this weekend, so we'll take a break
from Ruby Quiz. Sink all the effort you intended to give next week's quiz into
filling my inbox with quiz suggestions.
I'll send out a new quiz on the 29th that's all fun and games...
look like? Here's the most basic answer to match 1..12:
1|2|3|4|5|6|7|8|9|10|11|12
Note: You might want to reverse the order of that, unless you can count on your
anchoring to match the right thing.
Obviously, the above works and is dirt simple to implement. Here's a submitted
solution by Tanaka Akira that does pretty much that:
def Regexp.build(*args)
args = args.map {|arg| Array(arg) }.flatten.uniq.sort
neg, pos = args.partition {|arg| arg < 0 }
/ \A (?: -0*#{Regexp.union(*neg.map {|arg| (-arg).to_s })} |
0*#{Regexp.union(*pos.map {|arg| arg.to_s })} ) \z /x
end
The first line of that method is pretty clever, calling Array() on all the
passed arguments. That turns Range objects into the Array equivalent and wraps
simple Integers in an Array of their own. Following that up with flatten()
yields a single Array of all the elements we're trying to match.
The second line just separates the arguments into positive and negative groups.
Finally the third line builds a Regexp object from the created groups using the
nifty Regexp.union() that I wasn't even aware of when I made this quiz.
(This solution handles negative numbers and allows for arbitrary leading zeros.)
Is this quiz really this easy to solve? Obviously it can be, for simple data
sets. However, Tanaka's solution has limits. On my box, it only takes:
Regexp.build(1..10_000)
to get a "...regular expression too big..." error. Clearly, if your data set is
big you'll need to dig a little deeper.
That get's us back to our original question, but now with a qualification: What
is a short way to match a number with a Regexp? The most obvious optimization
to apply to our patterns is the use of character classes. Going back to our
1..12 example that might give us something like:
\d|1[0-2]
That's getting a lot more reasonable. Going to a serious example, even
1..1_000_000 is only:
[1-9]|[1-9]\d|[1-9]\d\d|[1-9]\d\d\d|[1-9]\d\d\d\d|[1-9]\d\d\d\d\d|1000000
Technically, we could keep going and get to something like:
[1-9]\d{0,5}|1000000
But that's a little trickier to build algorithmically and none of the submitted
solutions went quite that far.
The character class approach, on the other hand, was very popular.
The main trick to building character classes is to break down the passed Range
objects. You could also lump in the individual Integer arguments, but these are
pretty insignificant. Several solutions solved this by adding a method to the
Range class to convert them into Regexps.
Adding a Regexp.build() over that is trivial. Here's a nice example from Mark
Hubbart's second submission:
def Regexp.build(*args)
ranges, numbers = args.partition{|item| Range === item}
re = ranges.map{|r| r.to_re } + numbers.map{|n| /0*#{n}/ }
/^#{Regexp.union(*re)}$/
end
class Range
def to_re
# normalize the range format; we want end inclusive,
# integer ranges this part passes the load off to a
# newly built range if needed.
if exclude_end?
return( (first.to_i..last.to_i - 1).to_re )
elsif ! (first + last).kind_of?(Integer)
return( (first.to_i .. last.to_i).to_re )
end
# Deal with ranges that are wholly or partially negative.
# If range is only partially negative, split in half and
# get two regexen. join them together for the finish.
# If the range is wholly negative, make it positive, then
# add a negative sign to the regexp
if first < 0 and last < 0
# return a negatized version of the regexp
return /-#{(-last..-first).to_re}/
elsif first < 0
neg = (first..-1).to_re
pos = (0..last).to_re
return /(?:#{neg}|#{pos})/
end
### First, create an array of new ranges that are more
### suited to regex conversion.
# a and z will be the remainders of the endpoints
# of the range as we slice it
a, z = first, last
# build the first part of the list of new ranges.
list1 = []
num = first
until num > z
a = num # recycle the value
# get the first power of ten that leaves a remainder
v = 10
v *= 10 while num % v == 0 and num != 0
# compute the next value up
num += v - num % v
# store the value unless it's too high
list1 << (a..num-1) unless num > z
end
# build the second part of the list; counting down.
list2 = []
num = last + 1
until num < a
z = num - 1 # recycle the value
# slice to the nearest power of ten
v = 10
v *= 10 while num % v == 0 and num != 0
# compute the next value down
num -= num % v
# store the value if it fits
list2 << (num..z) unless num < a
end
# get the chewey center part, if needed
center = a < z ? [a..z] : []
# our new list
list = list1 + center + list2.reverse
### Next, convert each range to a regexp.
list.map! do |rng|
a, z = rng.first.to_s, rng.last.to_s
a.split(//).zip(z.split(//)).map do |(f,l)|
case
when f == l then f
when f.to_i + 1 == l.to_i then "[%s%s]" % [f,l]
when f+l == "09" then "\\d"
else
"[%s-%s]" % [f,l]
end
end.join # returns the regexp for *that* range
end
### Last, return the final regexp
/0*#{ list.join("|") }/
end
end
The first third of the to_re() method just deals with normalizing Ranges and is
very well commented.
The middle third divides the Range into Regexp friendly chunks which are groups
that share the same number of digits. For example, here is what to_re() builds
into the local "list" variable for 1..1_000:
1..9
10..99
100..999
1000..1000
The final third of to_re() builds character classes from these grouped Ranges.
The code inside "list.map! do ... end" is pretty clever and I recommend working
through it until you can follow how it works.
A big part of using these solutions is a question of how long you'll have to
wait for a Regexp object to be built and how quickly the result can find a
match. Here are some benchmarks, first for build times:
user system total real
James Edward Gray II 63.490000 0.020000 63.510000 ( 63.512904)
James Edward Gray II (2) 20.270000 0.000000 20.270000 ( 20.307565)
Jamis Buck 0.490000 0.000000 0.490000 ( 0.488664)
Mark Hubbart 0.290000 0.010000 0.300000 ( 0.297225)
Mark Hubbart (2) 0.030000 0.000000 0.030000 ( 0.028206)
Tanaka Akira 0.440000 0.000000 0.440000 ( 0.442430)
Thomas Leitner 0.020000 0.000000 0.020000 ( 0.013544)
Warren Brown 0.020000 0.000000 0.020000 ( 0.015618)
Or focusing in on the faster solutions over a bigger test:
user system total real
Mark Hubbart (2) 3.390000 0.000000 3.390000 ( 3.415515)
Thomas Leitner 2.120000 0.000000 2.120000 ( 2.138968)
Warren Brown 2.290000 0.000000 2.290000 ( 2.280555)
As you can see, Thomas Leitner and Warren Brown's solutions are also worth a
look, if you haven't checked them out already. Warren's even has a clever
feature to tell you which parameter of build() caused a match.
And here are some matching benchmarks (build times excluded from results):
user system total real
James Edward Gray II 0.070000 0.000000 0.070000 ( 0.079497)
James Edward Gray II (2) 0.100000 0.000000 0.100000 ( 0.115083)
Jamis Buck 5.480000 0.000000 5.480000 ( 5.502583)
Mark Hubbart 4.590000 0.000000 4.590000 ( 4.617739)
Mark Hubbart (2) 0.100000 0.010000 0.110000 ( 0.124735)
Tanaka Akira 4.570000 0.010000 4.580000 ( 4.741717)
Thomas Leitner 0.100000 0.000000 0.100000 ( 0.121431)
Warren Brown 0.130000 0.000000 0.130000 ( 0.123779)
As usual, my thanks go out to all who participated as well as to those who just
silently followed the discussion.
I've been unexpectedly called out of town this weekend, so we'll take a break
from Ruby Quiz. Sink all the effort you intended to give next week's quiz into
filling my inbox with quiz suggestions.
I'll send out a new quiz on the 29th that's all fun and games...