Michael Ulm wrote:
--snip--
I assume that I know the dictionary size
and the probability with which each word is chosen for
the banned word list. Here is the part of ruby code I
use to create the banned word list
def initialize(size, prob)
@size = size
if prob <= 0 or prob >= 1 # sanity check
throw "illegal probability in LanguageFilter::initialize"
end
@badWords = []
@size.times {|k| @badWords.push(k) if rand < prob}
end
Note, that I use not words but an index of the words (so I
can use numbers 1...n instead of a word list). My object is
to minimize the expected number of email calls in this
scenario. Now, if I have such a word list with n words, I
can send an email of size k. If it does not get blocked, my
problem reduces to the same minimization problem with a\
smaller size n - k. If the email does get blocked, my
problem reduces to two smaller problems, one of size k
and one of size n - k. In the word list of size k, I do
have slightly more information then before: I know that
there is at least one blocked word in it (this does not
make much of a difference for large k, but a _lot_ of
difference for small k). Now I can find recursive formulas
for the two kinds of expected values (available upon
request).
I coded most of this, but unfortunately RL caught up with
me, and I will need another day or so until I can present a
full solution.
Just used my lunchbreak to implement the last details. I am
afraid the code is not very cleanly written, I totally ignored
speed so far, and I totally cheated in so far that I didn't use
words at all but numbers. Here is the output of two test runs.
6 words of 1000 found after 142 calls (expected 196.20329267521)
11 words of 1000 found after 208 calls (expected 274.947142087926)
27 words of 1000 found after 337 calls (expected 333.688742114315)
23 words of 1000 found after 287 calls (expected 382.652436479999)
51 words of 1000 found after 442 calls (expected 425.491312500002)
64 words of 1000 found after 494 calls (expected 462.720759999999)
73 words of 1000 found after 490 calls (expected 495.858497499998)
64 words of 1000 found after 489 calls (expected 528.357760000002)
104 words of 1000 found after 567 calls (expected 560.076936390001)
101 words of 1000 found after 562 calls (expected 583.697899999996)
11 words of 1000 found after 208 calls (expected 196.20329267521)
24 words of 1000 found after 304 calls (expected 274.947142087926)
23 words of 1000 found after 282 calls (expected 333.688742114315)
38 words of 1000 found after 374 calls (expected 382.652436479999)
48 words of 1000 found after 425 calls (expected 425.491312500002)
63 words of 1000 found after 468 calls (expected 462.720759999999)
76 words of 1000 found after 531 calls (expected 495.858497499998)
87 words of 1000 found after 543 calls (expected 528.357760000002)
94 words of 1000 found after 580 calls (expected 560.076936390001)
91 words of 1000 found after 542 calls (expected 583.697899999996)
================ here is the code, version 0.1 ================
# this holds classes and methods to "solve"
# ruby-talk quiz #9: Banned Words
#
#
# creates the language filter
class LanguageFilter
def initialize(size, prob)
@size = size
if prob <= 0 or prob >= 1 # sanity check
throw "illegal probability in LanguageFilter::initialize"
end
@badWords = []
@size.times {|k| @badWords.push(k) if rand < prob}
@Count = 0
end
def countReset
@Count = 0
end
def getBad
@badWords
end
# checks if one of the numbers in the array testText is on the
# badWords list
def clean?(testText)
sortedClean?(testText.sort)
end
# checks if one of the numbers in the array testText is on the
# badWords list testText is assumed to be sorted
def sortedClean?(testText)
@Count += 1
binStart, binEnd = 0, @badWords.size
testText.each do |current|
binStart = binarySearch(current, binStart, binEnd)
return false if @badWords[binStart] == current
end
return true
end
def test(testWords)
[@count, testWords.sort == @badWords]
end
private
def binarySearch(what, searchStart, searchEnd)
return searchEnd if what > @badWords[searchEnd - 1]
while searchEnd - searchStart > 1
testSearch = (searchStart + searchEnd) / 2
testWord = @badWords[testSearch]
return testSearch if testWord == what
if testWord < what
searchStart = testSearch
else
searchEnd = testSearch
end
end
return searchStart
end
end
class QueryStrategy
def initialize(prob)
@prob = prob
@qrob = 1.0 - prob
@eNone = [0, 1] # expected value for interval without additional info
@eOne = [0, 0] # expected values for interval with at least one bad
word
@kNone = [0, 1] # optimal interval query length in interval without
info
@kOne = [0, 0] # optimal interval query length in interval with one
bad word
@size = 2
end
def getStrategy(size, noInfo = true, quick = false)
if size < @size
return noInfo ? [@kNone[size], @eNone[size]] : [@kOne[size],
@eOne[size]]
end
qToN = Math::exp(@size * Math::log(@qrob))
@size.upto(size) do |n|
# compute F_p(n)
minExp = n.to_f
minK = 1
qToK = 1.0
1.upto(n - 1) do |k|
qToK *= @qrob
thisExp = qToK * @eOne[n - k] + (1 - qToK) * (@eOne[k] +
@eNone[n - k])
if thisExp < minExp
minK, minExp = k, thisExp
end
end
@kOne[n] = minK
@eOne[n] = 1.0 + minExp / (1 - qToN)
# compute E_p(n)
minExp = n.to_f
minK = 1
qToK = 1.0
1.upto(n) do |k|
qToK *= @qrob
thisExp = @eNone[n - k] + (1 - qToK) * @eOne[k]
if thisExp < minExp
minK, minExp = k, thisExp
end
end
@kNone[n] = minK
@eNone[n] = 1 + minExp
qToN *= @qrob
end
@size = size + 1;
return noInfo ? [@kNone[size], @eNone[size]] : [@kOne[size],
@eOne[size]]
end
def findWords(filter, size)
@myWords = []
getWordsRecursively(filter, 0...size, true)
@myWords
end
private
def getWordsRecursively(filter, range, noInfo)
rangesize = range.end - range.begin
return if rangesize == 0
if rangesize == 1
if noInfo
@myWords.push(range.begin) unless
filter.sortedClean?([range.begin])
else
@myWords.push(range.begin)
end
else
thisStrat = getStrategy(rangesize, noInfo)
testRange = range.begin...(range.begin + thisStrat[0])
testArray = []
testRange.each {|k| testArray.push(k)}
if filter.sortedClean?(testArray)
getWordsRecursively(filter, (range.begin +
thisStrat[0])...range.end, noInfo)
else
getWordsRecursively(filter, testRange, false)
getWordsRecursively(filter, (range.begin +
thisStrat[0])...range.end, true)
end
end
end
end
# test
testsize = 1000
10.times do |level|
testprob = 0.01 * (level + 1)
myfilt = LanguageFilter.new(testsize, testprob)
strat = QueryStrategy.new(testprob)
testWords = strat.findWords(myfilt, testsize)
number, found = myfilt.test(testWords)
if found
puts "#{testWords.size} words of #{testsize} found after #{number}
calls (expected #{strat.getStrategy(testsize)[1]})"
else
puts "word list not found after #{number} calls"
end
end
--
Michael Ulm
R&D Team
ISIS Information Systems Austria
tel: +43 2236 27551-219, fax: +43 2236 21081
e-mail: (e-mail address removed)
Visit our Website:
www.isis-papyrus.com