[QUIZ] Banned Words (#9)


Wayne Vucenic

I removed the code to break the array into chunks whose size is a
power of two. This didn't make any significant difference in the
speed of the algorithm, but it does considerably simplify the code.
The method 'findBannedWords' goes away, and 'run' becomes

def run()
  if @words.empty?
    []
  else
    findBanned(@words)
  end
end

(And the comments on the other two methods about the size of their
input being a power of two also go away.)

Wayne
 

Wayne Vucenic

Hi James,

This "Divide and Conquer" approach is very effective when only a few
words are banned, but I believe there are better approaches for many
banned words... Anyone try anything closer to the suggested
10% blocked from the quiz?

I agree that if we know that 10% of the words are banned, we can do
better than this. If we have 1000 words with 10% banned, there's little
point in calling clean? on all 1000, then the first 500, the last 500,
250, 250, 250, 250, 125, ... since there's a high probability that clean?
will return false on almost all of these. Since 10% is 1 in 10, it makes
more sense to skip these larger chunks and start by calling clean? on
about 10 words: a chunk of that size has a roughly equal chance of
coming back clean or dirty, so each call gives us the most information.
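
As a rough numerical check of that chunk-size intuition (my own addition,
assuming each word is banned independently with probability 0.1):

# Probability that a chunk of k words contains no banned word, assuming
# each word is banned independently with probability p.
p = 0.10
[1, 5, 7, 10, 20].each do |k|
  printf("chunk of %2d words: P(clean) = %.2f\n", k, (1.0 - p)**k)
end
# A 7-word chunk is clean about 48% of the time; a 10-word chunk about 35%.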

I've changed my original algorithm to start with chunks of 10, and then proceed
as before: if clean? on a chunk of 10 returns false, subdivide into 5 and 5,
etc. (The one changed routine is copied below.) This results in a moderate
improvement:

Original algorithm:
  Runs: 10
  Words: 1000
  Banned Words: 100
  Average:
    Seconds: 51.9328
    Mails: 596
  Verified: 10/10
  Rebuilt Banned Words

Start with chunks of 10 words:
  Runs: 10
  Words: 1000
  Banned Words: 100
  Average:
    Seconds: 50.29
    Mails: 514
  Verified: 10/10
  Rebuilt Banned Words

Of course, this requires knowing the percentage of banned words in advance.
If we didn't know this, we could try an adaptive algorithm that figures out
the percentage as it runs. Perhaps first make one pass dividing the chunk
size by two each time: 1000, 500, 250, 125, 63, 32, 16, 8, 4, 2, 1.
(Starting each chunk at 0: [0...1000], [0...500], etc.)
If there really are 10% banned words, then perhaps clean? on 16 words would
be false, but clean? on 8 words would be true. From this, we _guess_ that the
fraction of banned words is between 1/16 and 1/8. We average these to get
0.09375, or 1 in every 10.67. Rounding this to 1 in 11, we pick an initial
chunk size of 11 and proceed as in my algorithm above, but with chunks of 11.
But we also keep track of the fraction of banned words we've seen so far, and
if this rate moves outside the range of roughly 1 in 10.5 to 1 in 11.5, we
adjust the chunk size down or up accordingly.

So, we quickly make a guess at the percentage, then as the algorithm runs we
improve our guess, and pick the corresponding chunk size.

Adaptive algorithms like this sometimes need damping, so that the chunk size
doesn't change wildly if we happen to hit a run of banned words, followed by
a run of non-banned words. But I'm not sure how much damping is needed in
this particular case, as the continually increasing denominator in our
percentage calculations will provide some damping.
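
A minimal sketch of that initial estimate (the bracketing sizes 16 and 8
here are just the example values from above):

# clean? failed on a chunk of 16 but passed on a chunk of 8, so guess that
# the banned rate lies between 1/16 and 1/8 and take the midpoint.
low, high  = 1.0 / 16, 1.0 / 8
rate_guess = (low + high) / 2.0        # => 0.09375
chunk_size = (1.0 / rate_guess).round  # => 11, i.e. roughly 1 in 11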

Wayne

============================================

def run()
  if @words.empty?
    []
  else
    biggestChunkSize = 10 # Use this for 10%
    aBanned = []
    iStart = 0
    iEnd = iStart + biggestChunkSize
    while iEnd <= @words.size
      aBanned += findBanned(@words[iStart...iEnd])
      iStart = iEnd
      iEnd += biggestChunkSize
    end
    if iStart < @words.size
      aBanned += findBanned(@words[iStart..-1])
    end
    aBanned
  end
end
 

Michael Ulm

Some comments on the quiz:

If we have no information about the banned word list at all,
so that every subset of the dictionary is equally likely,
then we could not do better than brute force, i.e. testing
every word. This is because every query gives us one bit
of information, and for n words in the dictionary, there
are 2^n possible subsets, which have to be distinguished
by n bits of information.
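
A tiny check to make the counting explicit (my own illustration, not from
Michael's post):

# q yes/no answers can distinguish at most 2**q banned lists; with 2**n
# possible lists over an n-word dictionary, we need q >= n queries.
n = 20
puts Math.log2(2**n).ceil   # => 20, i.e. at least n clean? calls in the worst case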

So we will have to use some kind of additional knowledge
about the banned word list. Several variants are possible.
We could have some info on the probability distribution of
the list, or we may have an upper (and maybe a lower) limit
on the number of words in it. It is also not clear
what "as few emails as possible" means: it could mean
minimizing the maximal number of emails, the expected number
of emails, or some other cost functional.

So there is a choice to be made. Here is the choice I made
and analyzed: I assume that I know the dictionary size
and the probability with which each word is chosen for
the banned word list. Here is the part of the Ruby code I
use to create the banned word list:

def initialize(size, prob)
  @size = size
  if prob <= 0 or prob >= 1 # sanity check
    raise "illegal probability in LanguageFilter::initialize"
  end

  @badWords = []
  @size.times {|k| @badWords.push(k) if rand < prob}
end

Note that I use not the words themselves but their indices (so I
can use the numbers 1...n instead of a word list). My objective is
to minimize the expected number of email calls in this
scenario. Now, if I have such a word list with n words, I
can send an email of size k. If it does not get blocked, my
problem reduces to the same minimization problem with a
smaller size n - k. If the email does get blocked, my
problem reduces to two smaller problems, one of size k
and one of size n - k. In the word list of size k, I do
have slightly more information than before: I know that
there is at least one blocked word in it (this does not
make much of a difference for large k, but a _lot_ of
difference for small k). Now I can find recursive formulas
for the two kinds of expected values (available upon
request).

I coded most of this, but unfortunately RL caught up with
me, and I will need another day or so until I can present a
full solution.

Michael


--
Michael Ulm
R&D Team
ISIS Information Systems Austria
tel: +43 2236 27551-219, fax: +43 2236 21081
e-mail: (e-mail address removed)
Visit our Website: www.isis-papyrus.com
 

Brian Schröder

Hi James,

I agree that if we know that 10% of the words are banned, we can do
better than this.
[snip]

I tried to implement an algorithm that picks the optimal chunk size given
the percentage information. This is described in

http://ruby.brian-schroeder.de/quiz/detect_words/content/content.html
or prettier
http://ruby.brian-schroeder.de/quiz/detect_words/content.ps

The problem is that this algorithm was not significantly better than always
dividing into three parts. (Surprisingly, it was even slower.) Maybe my math
is wrong there; if you detect the mind-bug, it would be very interesting.
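
For reference, a minimal sketch of the "always divide into three parts"
baseline mentioned above (my own reading, not Brian's actual code): if a
chunk tests dirty, split it into three roughly equal pieces and recurse.

# Toy filter for the sketch: clean? is true iff the text has no banned word.
class ToyFilter
  def initialize(banned)
    @banned = banned
  end

  def clean?(text)
    (text.split & @banned).empty?
  end
end

# Naive three-way splitting (my sketch, not Brian's code).
def find_banned_three_way(filter, words)
  return [] if words.empty?
  return [] if filter.clean?(words.join(' '))
  return words if words.size == 1          # a dirty single word is banned
  third = (words.size / 3.0).ceil
  words.each_slice(third).flat_map { |part| find_banned_three_way(filter, part) }
end

words = %w[alpha bravo charlie delta echo foxtrot golf hotel india]
p find_banned_three_way(ToyFilter.new(%w[charlie golf]), words)
# => ["charlie", "golf"]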

Regards,

Brian
 

Michael Ulm

Michael Ulm wrote:
--snip--
I coded most of this, but unfortunately RL caught up with
me, and I will need another day or so until I can present a
full solution.
Just used my lunchbreak to implement the last details. I am
afraid the code is not very cleanly written, I totally ignored
speed so far, and I totally cheated insofar as I didn't use
words at all, only numbers. Here is the output of two test runs.

6 words of 1000 found after 142 calls (expected 196.20329267521)
11 words of 1000 found after 208 calls (expected 274.947142087926)
27 words of 1000 found after 337 calls (expected 333.688742114315)
23 words of 1000 found after 287 calls (expected 382.652436479999)
51 words of 1000 found after 442 calls (expected 425.491312500002)
64 words of 1000 found after 494 calls (expected 462.720759999999)
73 words of 1000 found after 490 calls (expected 495.858497499998)
64 words of 1000 found after 489 calls (expected 528.357760000002)
104 words of 1000 found after 567 calls (expected 560.076936390001)
101 words of 1000 found after 562 calls (expected 583.697899999996)

11 words of 1000 found after 208 calls (expected 196.20329267521)
24 words of 1000 found after 304 calls (expected 274.947142087926)
23 words of 1000 found after 282 calls (expected 333.688742114315)
38 words of 1000 found after 374 calls (expected 382.652436479999)
48 words of 1000 found after 425 calls (expected 425.491312500002)
63 words of 1000 found after 468 calls (expected 462.720759999999)
76 words of 1000 found after 531 calls (expected 495.858497499998)
87 words of 1000 found after 543 calls (expected 528.357760000002)
94 words of 1000 found after 580 calls (expected 560.076936390001)
91 words of 1000 found after 542 calls (expected 583.697899999996)

================ here is the code, version 0.1 ================
# this holds classes and methods to "solve"
# ruby-talk quiz #9: Banned Words
#
#
# creates the language filter
class LanguageFilter
  def initialize(size, prob)
    @size = size
    if prob <= 0 or prob >= 1 # sanity check
      raise "illegal probability in LanguageFilter::initialize"
    end

    @badWords = []
    @size.times {|k| @badWords.push(k) if rand < prob}
    @count = 0
  end

  def countReset
    @count = 0
  end

  def getBad
    @badWords
  end

  # checks if one of the numbers in the array testText is on the
  # badWords list
  def clean?(testText)
    sortedClean?(testText.sort)
  end

  # checks if one of the numbers in the array testText is on the
  # badWords list; testText is assumed to be sorted
  def sortedClean?(testText)
    @count += 1
    return true if @badWords.empty? # nothing is banned
    binStart, binEnd = 0, @badWords.size
    testText.each do |current|
      binStart = binarySearch(current, binStart, binEnd)
      return false if @badWords[binStart] == current
    end
    return true
  end

  # returns the number of clean? calls made so far and whether the
  # reconstructed word list matches the real one
  def test(testWords)
    [@count, testWords.sort == @badWords]
  end

  private

  def binarySearch(what, searchStart, searchEnd)
    return searchEnd if what > @badWords[searchEnd - 1]
    while searchEnd - searchStart > 1
      testSearch = (searchStart + searchEnd) / 2
      testWord = @badWords[testSearch]
      return testSearch if testWord == what
      if testWord < what
        searchStart = testSearch
      else
        searchEnd = testSearch
      end
    end
    return searchStart
  end
end

class QueryStrategy
  def initialize(prob)
    @prob = prob
    @qrob = 1.0 - prob
    @eNone = [0, 1] # expected value for interval without additional info
    @eOne  = [0, 0] # expected values for interval with at least one bad word
    @kNone = [0, 1] # optimal interval query length in interval without info
    @kOne  = [0, 0] # optimal interval query length in interval with one bad word
    @size = 2
  end

  # Extends the expected-value tables up to 'size' and returns
  # [optimal query length, expected number of calls].
  # With q = 1 - p, the tables are filled via
  #   F(n) = 1 + min_k [ q^k F(n-k) + (1 - q^k)(F(k) + E(n-k)) ] / (1 - q^n)
  #   E(n) = 1 + min_k [ E(n-k) + (1 - q^k) F(k) ]
  # where E(n) (@eNone) assumes no extra information and F(n) (@eOne)
  # assumes at least one bad word is known to be in the interval.
  def getStrategy(size, noInfo = true, quick = false)
    if size < @size
      return noInfo ? [@kNone[size], @eNone[size]] : [@kOne[size], @eOne[size]]
    end

    qToN = Math::exp(@size * Math::log(@qrob))
    @size.upto(size) do |n|
      # compute F_p(n)
      minExp = n.to_f
      minK = 1
      qToK = 1.0
      1.upto(n - 1) do |k|
        qToK *= @qrob
        thisExp = qToK * @eOne[n - k] + (1 - qToK) * (@eOne[k] + @eNone[n - k])
        if thisExp < minExp
          minK, minExp = k, thisExp
        end
      end
      @kOne[n] = minK
      @eOne[n] = 1.0 + minExp / (1 - qToN)

      # compute E_p(n)
      minExp = n.to_f
      minK = 1
      qToK = 1.0
      1.upto(n) do |k|
        qToK *= @qrob
        thisExp = @eNone[n - k] + (1 - qToK) * @eOne[k]
        if thisExp < minExp
          minK, minExp = k, thisExp
        end
      end
      @kNone[n] = minK
      @eNone[n] = 1 + minExp

      qToN *= @qrob
    end

    @size = size + 1
    return noInfo ? [@kNone[size], @eNone[size]] : [@kOne[size], @eOne[size]]
  end

  def findWords(filter, size)
    @myWords = []
    getWordsRecursively(filter, 0...size, true)
    @myWords
  end

  private

  def getWordsRecursively(filter, range, noInfo)
    rangesize = range.end - range.begin
    return if rangesize == 0
    if rangesize == 1
      if noInfo
        @myWords.push(range.begin) unless filter.sortedClean?([range.begin])
      else
        # at least one bad word is known to be in this one-element range
        @myWords.push(range.begin)
      end
    else
      thisStrat = getStrategy(rangesize, noInfo)
      testRange = range.begin...(range.begin + thisStrat[0])
      testArray = []
      testRange.each {|k| testArray.push(k)}
      if filter.sortedClean?(testArray)
        getWordsRecursively(filter, (range.begin + thisStrat[0])...range.end, noInfo)
      else
        getWordsRecursively(filter, testRange, false)
        getWordsRecursively(filter, (range.begin + thisStrat[0])...range.end, true)
      end
    end
  end
end

# test
testsize = 1000
10.times do |level|
  testprob = 0.01 * (level + 1)

  myfilt = LanguageFilter.new(testsize, testprob)

  strat = QueryStrategy.new(testprob)
  testWords = strat.findWords(myfilt, testsize)
  number, found = myfilt.test(testWords)
  if found
    puts "#{testWords.size} words of #{testsize} found after #{number} calls (expected #{strat.getStrategy(testsize)[1]})"
  else
    puts "word list not found after #{number} calls"
  end
end


--
Michael Ulm
R&D Team
ISIS Information Systems Austria
tel: +43 2236 27551-219, fax: +43 2236 21081
e-mail: (e-mail address removed)
Visit our Website: www.isis-papyrus.com
 

James Edward Gray II

I removed the code to break the array into chunks whose size is a
power of two. [snip]

I was trying to update your code for the quiz archives based on this
message and got confused. Would you please repost your solution as it
stands now?

Thanks.

James Edward Gray II
 

Wayne Vucenic

Would you please repost your solution as it stands now?

Gladly!

Wayne

----------------------------------------------------------

class YourAlgorithm < RQ9Algorithm
  # Returns an array containing all banned words from @words
  def run()
    if @words.empty?
      []
    else
      findBanned(@words)
    end
  end

  # Returns an array containing all banned words from aWords
  # aWords.size is > 0
  def findBanned(aWords)
    if aWords.size == 1
      @filter.clean?(aWords[0]) ? [] : aWords
    elsif @filter.clean?(aWords.join(' '))
      []
    else
      iSplit = aWords.size / 2
      if @filter.clean?(aWords[0...iSplit].join(' '))
        # There is at least one banned word in 0..-1, but not in 0...iSplit,
        # so there must be one in iSplit..-1
        findBannedThereIsOne(aWords[iSplit..-1])
      else
        # From the test above we know there is a banned word in 0...iSplit
        findBannedThereIsOne(aWords[0...iSplit]) +
          findBanned(aWords[iSplit..-1])
      end
    end
  end

  # Returns an array containing all banned words from aWords
  # aWords.size is > 0
  # Our caller has determined there is at least one banned word in aWords
  def findBannedThereIsOne(aWords)
    if aWords.size == 1
      # Since we know there is at least one banned word, and since there is
      # only one word in the array, we know this word is banned without
      # having to call clean?
      aWords
    else
      iSplit = aWords.size / 2
      if @filter.clean?(aWords[0...iSplit].join(' '))
        # There is at least one banned word in 0..-1, but not in 0...iSplit,
        # so there must be one in iSplit..-1
        findBannedThereIsOne(aWords[iSplit..-1])
      else
        # From the test above we know there is a banned word in 0...iSplit
        findBannedThereIsOne(aWords[0...iSplit]) +
          findBanned(aWords[iSplit..-1])
      end
    end
  end
end
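
A quick way to poke at the class above outside the quiz harness (my own
sketch; it assumes RQ9Algorithm.new takes the word list and a filter and
stores them in @words and @filter, which is only a guess at the framework):

# Hypothetical toy filter: clean? is true iff the text contains no banned word.
class SimpleFilter
  def initialize(banned)
    @banned = banned
  end

  def clean?(text)
    (text.split & @banned).empty?
  end
end

words  = %w[alpha bravo charlie delta echo foxtrot golf hotel]
filter = SimpleFilter.new(%w[charlie golf])
p YourAlgorithm.new(words, filter).run   # assumes RQ9Algorithm.new(words, filter)
# => ["charlie", "golf"]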
 

Florian G. Pflug

Brian said:
I tried to implement an algorithm that is perfect regarding the chunk size given the percentage information. This is described in

http://ruby.brian-schroeder.de/quiz/detect_words/content/content.html
or prettier
http://ruby.brian-schroeder.de/quiz/detect_words/content.ps

The problem is that this algorithm was not significantly better than always dividing into three parts. (Surprisingly it was even slower). Maybe my math is wrong there, if you detect the mind-bug it would be very interesting.

I believe that you oversimplified the problem, and therefore your math
is wrong (or, rather, a too coarse approximation, as your results prove).

From your paper:
N... Number of Words.
n... Number of banned words.
r:=P[word w is banned]=n/N
p:=P[slice of size x passes] = (1-r)^x.

Your formula for p ((1-r)^x) assumes that the probability r is constant,
while in reality it changes with every word you put into your slice.

The probability of a slice of size x passing is the same as the
probability that none of the x words you randomly pick is banned.
Now, while you pick (you pick only non-banned words, because you
immediately stop if you get a banned one), the ratio banned/total
for the _remaining_ set constantly changes.

You can instantly see that your formula cannot be exact if you
try to calculate the probability of a slice of size N passing.
Your formula gives (1-n/N)^N, which is probably quite small, but still
greater than zero. The exact probability is obviously zero (assuming n > 0).

Additionally, the probability r changes every time you find a non-banned
slice. If, for example, you test 500 out of 1000 words, and the slice
passes, the probability that each remaining word is banned doubles.
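
A quick numerical illustration of this point (a sketch of my own; it just
compares the (1-r)^x approximation with the exact without-replacement
probability for N = 1000 words, n = 100 banned):

def exact_clean_prob(total, banned, slice)
  # Probability that 'slice' words drawn without replacement are all clean.
  return 0.0 if slice > total - banned
  (0...slice).inject(1.0) { |p, i| p * (total - banned - i) / (total - i) }
end

total, banned = 1000, 100
[10, 100, 500, 1000].each do |x|
  approx = (1.0 - banned.to_f / total)**x
  exact  = exact_clean_prob(total, banned, x)
  printf("x = %4d  approx = %.6f  exact = %.6f\n", x, approx, exact)
end
# At x = 1000 the approximation is tiny but nonzero; the exact value is 0.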

greetings, Florian Pflug
 

Brian Schröder

I believe that you oversimplified the problem, and therefore your math
is wrong (or, rather, a too coarse approximation, as your results prove).
[snip]

Hello Florian,

thank you for taking the time to think about my ramblings. I'd love to take
some time to think more about this, but I don't think I'll have the time.
Still, I'm sure I've learned something from you here.

best regards,

Brian
 
