TumbleDRYer (#53) - my solution

D

Dale Martenson

I am not sure I understood the quiz this week. When I first read it, I
thought it might be that one should analyze the input, build a
dictionary, determine the repetition automatically and generate the
output using a simplified form. My first thought was to huffman encode
the dictionary based on entity frequency. The result was great and all
:) ... but not very readable.

After rereading the quiz, I took it to mean how would you apply the DRY
principle to generating this repetitive output. No programmatic analysis
required.

Here is a solution, using a 'little language' to represent another. My
idea was to set the defaults so that they reflected the most common
representation. That way only a few special cases were required.

Sorry, if I misunderstood the quiz.

--Dale

module SqlDsl
CT_DEFAULTS = {
:type => 'MyISAM',
:auto_increment => true,
:auto_increment_value => 3,
:auto_id => true
}

def ct( name, attributes={} )
a = CT_DEFAULTS.merge(attributes)
"CREATE TABLE `#{name}` (\n" +
"#{" `id` int(11) NOT NULL auto_increment,\n" if a[:auto_id]}" +
(( a[:auto_id] ) ? yield : yield.sub( /,\n$/, "\n" ) ) +
"#{" PRIMARY KEY (`id`)\n" if a[:auto_id]}" +
") TYPE=#{a[:type]}
#{"AUTO_INCREMENT=#{a[:auto_increment_value]}" if a[:auto_increment]} ;\n\n"
end

VC_DEFAULTS = {
:size => 50,
:null_allowed => false,
:default => true,
:default_value => ''
}

def vc( name, attributes={} )
a = VC_DEFAULTS.merge(attributes)
" `#{name}` varchar(#{a[:size]}) " + not_null(a) + default(a) +
",\n"
end

TEXT_DEFAULTS = {
:null_allowed => false
}

def text( name, attributes={} )
a = TEXT_DEFAULTS.merge(attributes)
" `#{name}` text " + not_null(a) + ",\n"
end

ID_DEFAULTS = {
:size => 11,
:null_allowed => false,
:default => true,
:default_value => '0'
}

def id( name, attributes={} )
a = ID_DEFAULTS.merge(attributes)
" `#{name}` int(#{a[:size]}) " + not_null(a) + default(a) + ",\n"
end

DATE_DEFAULTS = {
:null_allowed => false,
:default => true,
:default_value => '0000-00-00'
}

def date( name, attributes={} )
a = DATE_DEFAULTS.merge(attributes)
" `#{name}` date " + not_null(a) + default(a) + ",\n"
end

def pk( name )
" PRIMARY KEY (`#{name}`),\n"
end

def key( name, value )
" KEY `#{name}` (`#{value}`),\n"
end

def not_null( a )
"#{" NOT NULL" unless a[:null_allowed]}"
end

def default( a )
"#{" default '#{a[:default_value]}'" if a[:default]}"
end

def auto_increment( a )
"#{" auto_increment" if a[:auto_increment]}"
end
end

include SqlDsl

print ct( 'authors' ) {
vc( 'firstname' ) +
vc( 'name' ) +
vc( 'nickname' ) +
vc( 'contact' ) +
vc( 'password' ) +
text( 'description' )
} +
ct( 'categories' ) {
vc( 'name', :size=>20 ) +
vc( 'description', :size=>70 )
} +
ct( 'categories_documents', :auto_id=>false, :primary_key=>false,
:auto_increment=>false ) {
id( 'category_id' ) +
id( 'document_id' )
} +
ct( 'documents', :auto_id=>false, :auto_increment_value=>14 ) {
id( 'id', :auto_increment=>true, :default=>false ) +
vc( 'title' ) +
text( 'description' ) +
id( 'author_id' ) +
date( 'date' ) +
vc( 'filename' ) +
pk( 'id' ) +
key( 'document', 'title' )
}
 
H

Hugh Sasse

I am not sure I understood the quiz this week. When I first read it, I thought
it might be that one should analyze the input, build a dictionary, determine
the repetition automatically and generate the output using a simplified form.
My first thought was to huffman encode the dictionary based on entity
frequency. The result was great and all :) ... but not very readable.

After rereading the quiz, I took it to mean how would you apply the DRY
principle to generating this repetitive output. No programmatic analysis
required.

You were right the first time. I was wondering how much of this
could be automated, and how good the results could be. My solution
looked as below. Note: I know the user dialogue should not appear
in the output but this was just to get it working, no time to tidy
it up as I'd wish. So I'd be interested in seeing your Huffman
approach.

Hugh

#!/usr/local/bin/ruby -w --
#
# TumbleDRYer - a program to remove repetitions from a file
# Hugh Sasse
#
#


class TumbleDRYer
BUFFERSIZE = 1024
SHORTEST = 4
FEWEST = 2
def initialize(input = STDIN)
@suffices = Array.new
@suffices_size = nil
# Suffix array due to Jon Bentley, Programming Pearls.
@ziv_lempel_dict = Hash.new(0)
case input
when String
File.open(input) do |inf|
get_input(inf)
end
build_suffices()
when IO
get_input(input)
build_suffices()
else
raise "unknown input #{input}"
end
end

def get_input(source)
string = source.read()
# string = Regexp.quote(string)
@buffer = string.split(/\b/)
# so buffer is an array of strings.
@buffer.freeze
end

def build_suffices()
0.upto(@buffer.size) {|i| @suffices << i}
# so suffices in a collection of indices into buffer.
@suffices = @suffices.sort_by do |v|
@buffer[v..-1].join()
# i.e an alphanumeric case sensitive search.
end
@suffices.freeze
@suffices_size = @suffices.size
# puts "Suffices is #{@suffices.inspect}"
end

def substring(an_index)
# puts "in substring index is #{index} " # #{caller[0]}"
result = @buffer[@suffices[an_index]..-1].join('')
# puts "substring(#{an_index}) is #{result}"
return result
end

def buf_string(start, chars)
elems = 1
origin = @suffices[start]
result = @buffer[origin]
until result.length >= chars do
result += @buffer[origin+elems]
elems += 1
end
return result[0, chars]
end

def build_ziv_lempel()
0.upto(@suffices_size - 2) do |ind|
# puts "in build_ziv_lempel index is #{index}"
sb1=substring(ind)
sb2=substring(ind+1)
len=sb1.common_length(sb2)
next if len.zero?
# puts "len is now #{len} for '#{sb1[0..20]}...','#{sb2[0..20]}...'in build_ziv_lempel"
count = 1
ic1 = ind + count + 1
while (ic1 < @suffices_size) and (sb1.common_length(substring(ic1)) == len)
count += 1
ic1 += 1
end
@ziv_lempel_dict[ind] = [len, count]
end
# puts "ziv_lempel_dict is #{@ziv_lempel_dict.inspect}"
end

def ziv_lempel_ordered
# remove things that only happen once.
@ziv_lempel_dict.delete_if do |k,v|
v[0] < SHORTEST or v[1] < FEWEST
# @buffer[@suffices[k],v[0]] =~ /^\s+/ or
# @buffer[@suffices[k],v[0]] =~ /\s+$/
# don't substitute for whitespace chunks, they improve
# readability
end
# Sort by product of lenght * occurrences, then length, then where
# it occurred.
results = @ziv_lempel_dict.sort_by{|k,v| [v[0]*v[1],v,k]}.reverse
# puts "results is #{results.size} elements"
# puts results.inspect
# results.each do |ary|
# puts "%Q{#{buf_string(ary[0],ary[1][0])}} #{ary.inspect}"
# end
return results
end

# Produce the code to regenerate the input.
def output
results = ziv_lempel_ordered
# results.each do |ary|
# puts "now: %Q{#{buf_string(ary[0],ary[1][0])}} #{ary.inspect}"
# end
the_output = @buffer.join('')
variable = "@a"
variables = Array.new
results.each do |ary|
string = buf_string(ary[0],ary[1][0])
count = 0
re = Regexp.new(Regexp.quote(string))
the_output.gsub(re) do |s|
count += 1
s
end
if count >= 2
variables << [variable.dup, string.dup]
the_output.gsub!(re, "\#\{#{variable}\}")
variable.succ!
else
# puts "string #{string} /#{re}/ not found"
end
end
print <<EOF
#!/usr/local/bin/ruby -w --

# Takes DRY code and soaks it... undoes TumbleDRYer.
class WashingMachine
def initialize
EOF
variables.each do |ary|
puts " #{ary[0]} = %q{#{ary[1]}}"
end
print <<EOF
end
def output
# We need a string unlikely to be in the input as a terminator.
print <<BLOBULARIZATION
EOF
puts the_output
print <<EOF
BLOBULARIZATION
end
end

WashingMachine.new.output
EOF
end
end

module Enumerable
def common_length(other)
c = 0
len = self.size > other.size ? other.size : self.size ;
0.upto(len-1) do |i|
# puts "self is #{self.inspect}"
# puts "other is #{other.inspect}"
if self == other
c += 1
else
break
end
end
# l = self.length < 20 ? self.length : 20 ;
# m = other.length < 20 ? other.length : 20 ;

# puts " common_length(): #{self[0..l]}, #{other[0..m]} => #{c}"

return c
end

end

print "Enter Filename: "
name = STDIN.gets.chomp
td = TumbleDRYer.new(name)
puts
puts
td.build_ziv_lempel
puts
puts
td.output
 
D

Dale Martenson

Hi,

---- Original message from Hugh Sasse on 10/31/2005 11:41 AM:
You were right the first time. I was wondering how much of this
could be automated, and how good the results could be. My solution
looked as below. Note: I know the user dialogue should not appear
in the output but this was just to get it working, no time to tidy
it up as I'd wish. So I'd be interested in seeing your Huffman
approach.
Below is my Huffman coding version. A sample of the output in included
at the end to give you an idea of what gets generated.

require 'rubygems'
require_gem "usage"

class Huffman
class LeafNode
attr_accessor :type, :parent, :weight, :symbol

def initialize( symbol, weight )
@type = :leaf
@parent = nil
@symbol = symbol
@weight = weight
end
end

class InternalNode
attr_accessor :type,:parent, :left_child, :right_child, :weight

def initialize( node_left, node_right )
@type = :internal
@parent = nil
@left_child = node_left
@right_child = node_right

left_weight = (node_left != nil) ? node_left.weight : 0
right_weight = (node_right != nil) ? node_right.weight : 0

@weight = left_weight + right_weight

@left_child.parent = self if @left_child != nil
@right_child.parent = self if @right_child != nil
end
end

attr_reader :root, :encode_table, :decode_table

def initialize
@root = nil
@encode_table = {}
@decode_table = {}
end

def encode( string )
original_words = string.split(/ /)
all_words = original_words.sort

dictionary = Hash.new(0)
all_words.each do |word|
dictionary[ word ] += 1
end
sorted_dictionary = dictionary.sort {|a,b| a[1]<=>b[1]}

generate_huffman_tree( sorted_dictionary )
walk

encoding = ""
original_words.each do |word|
encoding += @encode_table[ word ]
end

encoding
end

def generate_huffman_tree( dictionary )
heap = []
dictionary.each do |item|
heap << LeafNode.new( item[0], item[1] )
end

heap.sort {|a,b| a.weight<=>b.weight}

while heap.size > 1
n1 = heap.shift
n2 = heap.shift
heap << InternalNode.new( n1, n2 )
heap.sort {|a,b| a.weight<=>b.weight}
end
@root = heap.shift
end

def walk(node=@root, encoding='', level=0)
if node != nil then
if node.type == :leaf then
# visit leaf
# print "symbol:#{node.symbol} encoding:#{encoding}
level:#{level}\n"
@encode_table[ node.symbol ] = encoding
@decode_table[ encoding ] = node.symbol
else
# must be internal
walk( node.left_child, encoding+'0',level+1 )
walk( node.right_child, encoding+'1',level+1 )
end
end
end

def decode( string )
output = ''
lookup = ''
string.each_byte do |b|
lookup << b.chr
if @decode_table.has_key?( lookup ) then
output += decode_table[ lookup ] + ' '
lookup = ''
end
end
output
end
end

h = Huffman.new

Usage.new "<infile" do |usage|
print <<EOT
# Undoes the work of TumbleDRYer.
class WashingMachine
def initialize
EOT

puts " @encoding = '#{h.encode(usage.infile.read)}'"
puts " @decode_table = {"
h.decode_table.each_pair { |k,v| puts " '#{k}' => %q{#{v}}," }
puts " }"

print <<EOT
end

def output
lookup = ''
@encoding.each_byte do |b|
lookup << b.chr
if @decode_table.has_key?( lookup ) then
print @decode_table[ lookup ] + ' '
lookup = ''
end
end
end
end

WashingMachine.new.output
EOT
end

************************************************
Example output:

# Undoes the work of TumbleDRYer.
class WashingMachine
def initialize
@encoding =
'1001000010111011000110011110000001001011100110100010011111011010101001110011010110001011011111111010101001110011010110001011011111011110101001110011010110001011011111000010101001110011010110001011011111100010101001110011010110001011011110001111110001110111001011110000101000011111110110011111111000100011111111110010111000000110011110000001001011100110100010011111111011011100111001101011000101101111000111100110111001101011000101101111000010100001111111011001111111100010001111111111001011000100011001111111000010010111001101011001110100111110011101001011100110101100101010001110010001111111111001011001100011001111000000100101110011010001001111100101010100111001101011000101101111000111111000111011100101111101011010010111001101011001110100111110000010110001110011010110011001001111100011010100111001101011000101101111000010100001111110100011110100011010110100100111101000110111'
@decode_table = {
'111111' => %q{
CREATE},
'111100' => %q{text},
'111001' => %q{NULL,
},
'110110' => %q{`authors`},
'110011' => %q{varchar(70)},
'110000' => %q{`categories`},
'101010' => %q{'0',
)},
'00101' => %q{TABLE},
'111101' => %q{`name`},
'110111' => %q{;

},
'110100' => %q{(`id`),
},
'110001' => %q{`password`},
'101110' => %q{varchar(20)},
'101011' => %q{`author_id`},
'101000' => %q{AUTO_INCREMENT=14},
'100010' => %q{`categories_documents`},
'110101' => %q{`document`},
'101111' => %q{`nickname`},
'101100' => %q{date},
'101001' => %q{(`title`)
)},
'100110' => %q{`documents`},
'100011' => %q{`filename`},
'100000' => %q{`date`},
'101101' => %q{`firstname`},
'100111' => %q{`document_id`},
'100100' => %q{ CREATE},
'100001' => %q{`contact`},
'100101' => %q{`title`},
'01010' => %q{varchar(50)},
'111010' => %q{'0',
},
'01110' => %q{NOT},
'01011' => %q{'',
},
'01000' => %q{KEY},
'00010' => %q{auto_increment,
},
'111110' => %q{AUTO_INCREMENT=3},
'111011' => %q{(`id`)
)},
'111000' => %q{`category_id`},
'110010' => %q{'0000-00-00',
},
'01111' => %q{},
'01100' => %q{default},
'01001' => %q{int(11)},
'00110' => %q{(
},
'00011' => %q{`description`},
'00000' => %q{`id`},
'01101' => %q{NULL},
'00111' => %q{TYPE=MyISAM},
'00100' => %q{;
},
'00001' => %q{PRIMARY},
}
end

def output
lookup = ''
@encoding.each_byte do |b|
lookup << b.chr
if @decode_table.has_key?( lookup ) then
print @decode_table[ lookup ] + ' '
lookup = ''
end
end
end
end

WashingMachine.new.output
 
D

Dale Martenson

For readability, I left the binary encoding as strings rather than
converting them to integers. This could easily be accomplished by
prefixing the encoding with '0b' and using String::to_i(0) to do the
conversion to an integer. This would be important if you wished to use
this as a compression technique.
 
H

Hugh Sasse

For readability, I left the binary encoding as strings rather than
converting them to integers. This could easily be accomplished by
prefixing the encoding with '0b' and using String::to_i(0) to do the
conversion to an integer. This would be important if you wished to use
this as a compression technique.

Yes, you could assemble the whole lot and use pack for that. It was
a nice approach. I think the problem with coding methods like the
one I used is getting the code words mapped to meaningful names so
one can still read the code. That would be beyond the scope of a
quiz though, I expect, because it would involve advanced natural
language programming to come up with the short concept names for the
code words.

I liked the Huffman coder. I have usually ducked out of using
Huffman coders because of managing all the tree rearrangement and
construction, but your code is very clear. I've usually used
Shannon-Fano code for such things in the past, though it is a little
less efficient (in terms of the codes it produces), because you only
sort the list of code words once.

Since the quiz is about reducing repetition in code: I think you
could probably use a superclass for the Internal and Leaf nodes to
hold common instance vars.

Thank you. I wonder if anyone else will have a go?
Hugh
 
D

Dominik Bathon

Here is my solution.

I found after some experimenting that it is surprisingly easy to get quite
good results. My solution outputs a stand alone ruby program that prints
the input. Here is a result:

$ ruby tumbledryer.rb tsql
r = {
"PKi" => "PRIMARY KEY (`id`)",
"d" => "`description`",
"NNd" => "NOT NULL default",
"v5NNd" => "varchar(50) NOT NULL default '',",
"ii1NNa" => "`id` int(11) NOT NULL auto_increment,",
"CT" => "CREATE TABLE",
"TMA" => "TYPE=MyISAM AUTO_INCREMENT=",
"i1NNd0" => "int(11) NOT NULL default '0',",
}

puts <<'EOT'.gsub(/\#(\w*)\#/){$1==""?"#":(r[$1]||"##$1#")}.chop
#CT# `authors` (
#ii1NNa#
`firstname` #v5NNd#
`name` #v5NNd#
`nickname` #v5NNd#
`contact` #v5NNd#
`password` #v5NNd#
#d# text NOT NULL,
#PKi#
) #TMA#3 ;

#CT# `categories` (
#ii1NNa#
`name` varchar(20) #NNd# '',
#d# varchar(70) #NNd# '',
#PKi#
) #TMA#3 ;

#CT# `categories_documents` (
`category_id` #i1NNd0#
`document_id` #i1NNd0#
) TYPE=MyISAM ;

#CT# `documents` (
#ii1NNa#
`title` #v5NNd#
#d# text NOT NULL,
`author_id` #i1NNd0#
`date` date #NNd# '0000-00-00',
`filename` #v5NNd#
#PKi#,
KEY `document` (`title`)
) #TMA#14 ;

EOT

It works as follows: First you need a good way to get all replacement
candidates, this is what my get_chunks method does:

# returns all substrings of line, that might be replacement candidates
# if candidates appear multiple times, they are returned multiple times
def get_chunks(line)
# split the line
parts = line.strip.scan /\w+|\W/
# build all combinations
(1..(parts.size)).collect do |len|
(0..(parts.size - len)).collect do |idx|
parts[idx, len].join ""
end
end.flatten.reject do |el|
el.length < @min_len || # to short
# reject (nonalpha separated by) whitespace at beginning and end
(el =~ /(^\W?\s)|(\s\W?$)/)
end
end

Then the rest is easy ;-).
Build a frequency hash of all those chunks from the entire text.
Then discard all chunks that don't appear often enough.
Then sort the remaining chunks by their length (longest first).
Then for each chunk (in sort order), scan the text and check if it appears
often enough, if yes replace it, if no discard it.
That's all.

My TumbleDRYer has three parameters:
min_len: is the minimum length of replacement chunks (>= 2)
min_freq: determines how often a chunk has to appear to be replaced
escape_char: may not be alphanumeric and not "\0", "\\", "\n", ...

The default values are 8, 3 and "#", they can be changed on the command
line:

$ ruby tumbledryer.rb '7-2-$' tsql

tumpledryer.rb reads ARGF, so this is also possible:

$ cat tsql | ruby tumbledryer.rb

The complete code is attached and below.

Thanks for this very interesting quiz.

Dominik


tumbledryer.rb:

class TumbleDRYer

# min_len is the minimum length of replacement chunks (>= 2)
# min_freq determines how often a chunk has to appear to be replaced
# escape_char may not be alphanumeric and not "\0", "\\", "\n", ...
def initialize(min_len = 8, min_freq = 3, escape_char = "#")
@min_len = [min_len.to_i, 2].max
@min_freq = min_freq.to_i
@escape_char = escape_char.to_s[0, 1]
@escape_char = "#" if @escape_char =~ /\w|\\|\0|\n/
end

# returns all substrings of line, that might be replacement candidates
# if candidates appear multiple times, they are returned multiple times
def get_chunks(line)
# split the line
parts = line.strip.scan /\w+|\W/
# build all combinations
(1..(parts.size)).collect do |len|
(0..(parts.size - len)).collect do |idx|
parts[idx, len].join ""
end
end.flatten.reject do |el|
el.length < @min_len || # to short
# reject (nonalpha separated by) whitespace at beginning and end
(el =~ /(^\W?\s)|(\s\W?$)/)
end
end

# get all replacement candidates (long enough chunks, that appear often
# enough) for text
def get_dry_candidates(text)
freq = Hash.new { |h, k| h[k] = 0 }
text.each do |line|
get_chunks(line).each do |chunk|
freq[chunk] += 1
end
end
freq.keys.reject { |chunk| freq[chunk] < @min_freq }
end

# generates a short alphanumeric name for str
def short_name(str)
res = str.scan(/\w+/).map { |w| w[0,1] }.join("")
res.empty? ? "a" : res
end

# returns a ruby program that prints text
# the representation of text in that program will be dried
# text may not contain "\0"
def dry(text)
text = text.dup
repl = get_dry_candidates(text).sort_by { |c| -c.size }
snames, used_repl = {}, {}
# do replacements
repl.each do |r|
if text.scan(r).size >= @min_freq # then use it
sn = short_name(r)
sn.succ! while snames.has_key? sn
# extra escape the short names, to avoid resubstitution by a
# shorter replacement ;-)
snames[sn] = sn.split(//).join("\0")
text.gsub!(r, "\0\0#{snames[sn]}\0\0")
used_repl[sn] = r
end
end
# escape the escape char
text.gsub!(@escape_char, @escape_char * 2)
# undo the extra short name escaping
snames.each do |sn, sn_esc|
text.gsub!("\0\0#{sn_esc}\0\0", @escape_char + sn + @escape_char)
end
# build result
res = ["r = {"]
used_repl.each do |s, r|
res << " #{s.inspect} => #{r.inspect},"
end
res << "}" << ""
# the "unDRYer"
esc = "\\#{@escape_char}"
res << "puts <<'EOT'.gsub(/#{esc}(\\w*)#{esc}/)" +
%Q!{$1==""?"#{@escape_char}":(r[$1]||! +
%Q!"#{@escape_char}\#$1#{@escape_char}")}.chop!
res << text << "EOT"
res.join "\n"
end
end

if $0 == __FILE__
args = []
if ARGV.first =~ /^(\d+)-(\d+)-(\W)$/
args = [$1.to_i, $2.to_i, $3]
ARGV.shift
end
puts TumbleDRYer.new(*args).dry(ARGF.read)
end
 
H

Hugh Sasse

A very nice solution. Much simpler than the coding theory approach,
and the results are very readable.

It might be interesting to extend it so that chunks can span
multiple lines, to ease the spotting of boilerplate code used in
many places.

Thank you
Hugh
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,968
Messages
2,570,153
Members
46,699
Latest member
AnneRosen

Latest Threads

Top