"hat" container class [C++]

S

Siemel Naran

Sorry, maybe not very clear. It might help if we used chars for the items,
in order to distinguish the items from a virtual lookup table. Let's
redefine the set of objects as {'a', 'b', 'c', 'd', 'e', 'f' }, and say
that 'a' has a 50% chance of being selected. The proposed virtual array
might be { 1, 1, 1, 1, 1, 2, 3, 4, 5, 6 }, which refer to item numbers. And
further, let's say that we rolled a three out of ten.

Can item #1 be removed from the virtual mapping array in less
examinations/calculations than is linearly proportional to the number of
records?

If you use a special data structure, perhaps yes. You could use a standard
heap.
 
A

AngleWyrm

Siemel Naran said:
If you use a special data structure, perhaps yes. You could use a standard
heap.

And this is what I have been working on, a special kind of data structuring
in the form of a binary tree that never needs rebalancing. It's called a
male-female binary tree, and it can do inserts, deletes, and searches that
require at most a number of examinations/calculation that is
logarithmically proportional to the number of records.

http://home.comcast.net/~anglewyrm/hat.html
 
A

AngleWyrm

In studying your program, and my container, it has become apparent to me
that the constant RAND_MAX is important, as it defines the valid domain of
the random functions.

It also seems inappropriate to me to declare a function that requires
additional pre-existing outside values; the interface to the function
should specify everything necessary, without any assumptions. The scenario
of a library user trying to use the function, but without (re)defining
RAND_MAX could introduce erroneous assumptions about the actual range of
values returned by that function.

But an interface should also be as clean and clear as possible.

This functor can be called with or without a range parameter:

template<T>
T random_function( range = (T)RAND_MAX );

but it suffers from the use of RAND_MAX. I would like to be able to specify
RAND_MAX to this function, but with a default value. Does you know how this
might be done?
 
K

Kai-Uwe Bux

AngleWyrm said:
In studying your program, and my container, it has become apparent to me
that the constant RAND_MAX is important, as it defines the valid domain of
the random functions.

It also seems inappropriate to me to declare a function that requires
additional pre-existing outside values; the interface to the function
should specify everything necessary, without any assumptions. The scenario
of a library user trying to use the function, but without (re)defining
RAND_MAX could introduce erroneous assumptions about the actual range of
values returned by that function.

But an interface should also be as clean and clear as possible.

This functor can be called with or without a range parameter:

template<T>
T random_function( range = (T)RAND_MAX );

but it suffers from the use of RAND_MAX. I would like to be able to
specify RAND_MAX to this function, but with a default value. Does you know
how this might be done?

Hi,


am I assuming correctly that you are trying to solve the following problem:

The user shall be allowed to provide a RNG of his choosing. But this RNG is
supposed to be a raw RNG, i.e.:
a) the RNG provides random numbers in [0,N) for some N but
b) this RNG does not support the signature

unsigned long R ( unsigned long upper_bound );

returning a number in [0,upper_bound)

Thus, you want to rescale the raw random numbers in [0,N) into the needed
range yourself by some code of yours. I am not sure if that is a good
design decission. The Standard Library sets a different precedent in
std::random_shuffle, and I think for a good engineering reason:

The caution about low-order bits in the RNG is a good rule of thumb. It
applies foremost to linear congruence RNGs which are quite common. However,
in general some knowledge about the underlying RNG is required to code the
appropriate method of rescaling from [0,RAND_MAX) to [0,upper). Since you
cannot predict the RNG that the client will use, it is probably best to
leave the rescaling problem to the client too.


As for the technical C++ question about templates, what about passing
RAND_MAX as a second parameter:

template < typename T >
T random_function ( T range,
T max_range = static_cast<T>( RAND_MAX ) ) {

}

But, maybe I am totally misunderstanding what you are trying to do.


Best

Kai-Uwe
 
A

AngleWyrm

am I assuming correctly that you are trying to solve the following:
The user shall be allowed to provide a RNG of his choosing.
But this RNG is supposed to be a raw RNG, i.e.:
a) the RNG provides random numbers in [0,N) for some N but
b) this RNG does not support the signature

unsigned long R ( unsigned long upper_bound );

returning a number in [0,upper_bound)

Thus, you want to rescale the raw random numbers in [0,N) into the needed
range yourself by some code of yours. I am not sure if that is a good
design decission. The Standard Library sets a different precedent in
std::random_shuffle, and I think for a good engineering reason:

Hi,
That is not quite what I meant to express. I agree that if an rng is
designed to be given a range, then it should be used that way. I was to
support two signatures:

a). unsigned long R ( unsigned long upper_bound ) // as in the stl
b). int R ( void ) // as in std::rand()

In the first case, the user's rng can be expected to handle it's own
domain, but in the second case a scaling function is required, which needs
RAND_MAX (or it's equivalent for user-defined functions with that
signature). Thus RAND_MAX is only relevant in the second case. Perhaps it
would be best to depreciate the second case, even though std::rand() is the
default if no random number generator is specified.

I can create a wrapper function for std::rand(), so that there is
essentially only one function signature to support. Then using the two
would be easier:


// wrapper redefines function signature for std::rand
unsigned long wrapper( unsigned long rng ){
return (unsigned long) /* use std::rand() */ 100;
};

// a custom generator
unsigned long custom( unsigned long rng ){
return (unsigned long) /* use custom method */ 200;
};

// templatized container, that optionally takes an RNG function
template < class T, unsigned long (*rng_func)(unsigned long range) =
wrapper>
class container{
public:
unsigned long (*random)(unsigned long r);
container(void):random(rng_func){};
};

int main()
{
container<int> my_container;
cout << my_container.random(1);

container<int, custom> my_other_container;
cout << endl << my_other_container.random(2);

system("pause");
}
 
A

AngleWyrm

Kai-Uwe Bux said:
Personally, I prefer to go with the precedent established by
std::random_shuffle, and assume that the user supplied RGN will be a
function object R that provides the signature

unsigned long R ( unsigned long );

where R(n) returns a pseudo random number in [0,n). Compare std:25.2.11
[lib.alg.random.shuffle].

The random number generators from the Boost libraries can all be used in
this way as the library comes with an appropriate adapter. Most clients of
your class will probably use the default RNG of RNGs from that library.


New revision posted. The hat class now supports optional user-defined
random number generator at construction time, with the signature:

unsigned long rand_func( unsigned long range )

Example: a hat full of strings, using the default rand():
hat<string> hat_instance;

Example#2: a hat full of strings, using a custom random number generator
hat<string, my_random_gen> hat_instance;
 
M

Martin Eisenberg

AngleWyrm said:
In studying your program, and my container, it has become
apparent to me that the constant RAND_MAX is important, as it
defines the valid domain of the random functions.

It also seems inappropriate to me to declare a function that
requires additional pre-existing outside values; the interface
to the function should specify everything necessary, without any
assumptions. The scenario of a library user trying to use the
function, but without (re)defining RAND_MAX could introduce
erroneous assumptions about the actual range of values returned
by that function.

But an interface should also be as clean and clear as possible.

This functor can be called with or without a range parameter:

template<T>
T random_function( range = (T)RAND_MAX );

but it suffers from the use of RAND_MAX. I would like to be able
to specify RAND_MAX to this function, but with a default value.
Does you know how this might be done?

Do I understand correctly that you want to provide a library of RNG
functions that can be used with your hat? If so, why do you want to
pull in RAND_MAX? RAND_MAX pertains to a particular generator, namely
std::rand(), and you're not bound to it for your own generators. I'd
even think you don't want to be since it would constrain the maximal
sum of weights of hat containees once and for all. Another
possibility is to make your RNG's functors and have them provide
their respective maxima. Let me illustrate what I mean:

#include <limits>
#include <algorithm>
#include <cstdlib>

struct NonRNG {
enum { Max = std::numeric_limits<unsigned long>::max() };

unsigned long operator() (unsigned long range = Max)
{
return std::min(++state, range);
}

private:
unsigned long state;
};

struct StdRNG {
enum { Max = RAND_MAX };

StdRNG() : state(0) {}
StdRNG(unsigned int seed) : state(seed) {}
StdRNG(const StdRNG& rhs) : state(rhs.state) {}

unsigned long operator() (unsigned long range = Max)
{
std::srand(state);
state = std::rand();
return state % (range + 1);
} // Yes, I have read that % is bad here...

private:
unsigned int state;
};

template<typename T, typename Random = StdRNG>
class hat {
public:
hat(const Random& useRandom = Random()) : random(useRandom) {}

void put(T item, unsigned long weight)
{
if(tree[0].family_weight > Random::Max - weight)
throw "Sum of weights too large!";
// ...
}

T& peek()
{
unsigned long random_weight = random( tree[0].family_weight );
return tree[ find_index(0, random_weight) ].self;
}

private:
Random random;
};

Finally, this might answer your literal question:

template<typename T, T Max = T(RAND_MAX)>
T random_function(T range = Max);


Martin
 
K

Kai-Uwe Bux

AngleWyrm said:
Kai-Uwe Bux said:
Personally, I prefer to go with the precedent established by
std::random_shuffle, and assume that the user supplied RGN will be a
function object R that provides the signature

unsigned long R ( unsigned long );

where R(n) returns a pseudo random number in [0,n). Compare std:25.2.11
[lib.alg.random.shuffle].

The random number generators from the Boost libraries can all be used in
this way as the library comes with an appropriate adapter. Most clients of
your class will probably use the default RNG of RNGs from that library.


New revision posted. The hat class now supports optional user-defined
random number generator at construction time, with the signature:

unsigned long rand_func( unsigned long range )

Example: a hat full of strings, using the default rand():
hat<string> hat_instance;

Example#2: a hat full of strings, using a custom random number generator
hat<string, my_random_gen> hat_instance;

Hi,


I had a look. Looks pretty cool.

As for the O(log N) performance: You might want to have a look at

Y. Matias, J.S. Vitter, W.-C. Hi:
Dynamic Generation of Discrete Random Variates
Symposium on Discrete Algorithms 4 (1993) 361-370

They describe a more complicated method that yields better asymptotic
performance, expected amortized O(log* N) to be precise. Moreover, in the
degenerate but importante special case of a uniform distribution, their
method is expected amortized constant time. I have not seen any
implementation of their method, so I cannot say whether the overhead
generated by their more involved approach will eat away the asymptotic
advantage for all practical purposes. In particular, I do *not* suggest
changing your implementation. But I thought, you might be interested anyway.


Best

Kai-Uwe
 
K

Kai-Uwe Bux

AngleWyrm said:
Kai-Uwe Bux said:
Personally, I prefer to go with the precedent established by
std::random_shuffle, and assume that the user supplied RGN will be a
function object R that provides the signature

unsigned long R ( unsigned long );

where R(n) returns a pseudo random number in [0,n). Compare std:25.2.11
[lib.alg.random.shuffle].

The random number generators from the Boost libraries can all be used in
this way as the library comes with an appropriate adapter. Most clients of
your class will probably use the default RNG of RNGs from that library.


New revision posted. The hat class now supports optional user-defined
random number generator at construction time, with the signature:

unsigned long rand_func( unsigned long range )

Example: a hat full of strings, using the default rand():
hat<string> hat_instance;

Example#2: a hat full of strings, using a custom random number generator
hat<string, my_random_gen> hat_instance;

Hi,


I have thought a little about the interface:

void put( ... ) // insert an object
T& peek() // obtain a random object with replacement
T pull() // obtain a random object without replacement

Note that the last method has to return the object by value. This could be
inefficient. Maybe a stack container like interface provides a slightly
better solution, like so:

void put() // insert an object
T& peek() // obtain a random object
void pop() // remove the object that was obtained by the last peek()

Now, the client code need not need to know in advance whether the object
shall be replaced or not. This can determined based upon which object was
drawn from the hat. If you wanted to do that with the current interface,
you would need a pull() returning the object by copy followed maybe by a
put(), also copying the object.

On second thought, the pull() method is still a very common special case
and probably should be left in the interface. Internally, it could be
realized as:

T pull ( void ) {
T result ( peek() );
pop();
return( result );
}

What do you think?


Best

Kai-Uwe
 
J

Joe Gottman

Kai-Uwe Bux said:
I have thought a little about the interface:

void put( ... ) // insert an object
T& peek() // obtain a random object with replacement
T pull() // obtain a random object without replacement

Note that the last method has to return the object by value. This could be
inefficient. Maybe a stack container like interface provides a slightly
better solution, like so:

void put() // insert an object
T& peek() // obtain a random object
void pop() // remove the object that was obtained by the last peek()

Now, the client code need not need to know in advance whether the object
shall be replaced or not. This can determined based upon which object was
drawn from the hat. If you wanted to do that with the current interface,
you would need a pull() returning the object by copy followed maybe by a
put(), also copying the object.

Your definition of pop() is very limiting. It makes it impossible to look
at several elements of the hat then remove one of them according to some
criterion (for instance, pick three elements at random and remove the median
value).

One possible solution is to have peek() return some kind of iterator into
the hat, and let pop() take an iterator as a parameter. On second thought,
a pop() function taking an iterator parameter is more aptly called erase(),
for consistency with the STL.

Joe Gottman
 
K

Kai-Uwe Bux

Joe said:
Your definition of pop() is very limiting. It makes it impossible to
look
at several elements of the hat then remove one of them according to some
criterion (for instance, pick three elements at random and remove the
median value).

One possible solution is to have peek() return some kind of iterator
into
the hat, and let pop() take an iterator as a parameter. On second
thought, a pop() function taking an iterator parameter is more aptly
called erase(), for consistency with the STL.

Joe Gottman

You are right.

So what about this interface:

class hat {
public:

a) constructor/destructor/...

b) provide an iterator type

iterator insert ( const Object & obj, unsigned long weight = 1 );
// enter an object and return the position where it was entered

iterator get ( void ) const;
// get an iterator to a random element
// according to the current weights

unsigned long get_weight ( const iterator & iter ) const;
// I forgot, what was the weight of this one.

void set_weight ( const iterator & iter );
// my bad, now I do not like this weight anymore.
}

void erase ( const iterator & iter );
// or maybe, I should get rid of this one altogether?
}

void clear ( void );
// damn, the whole thing is completely messed up.
// let's get rid of it all.

unsigned long size ( void ) const;
// does what you think it does

iterator begin ( void ) const;
// now, this is a container after all.

iterator end ( void ) const;
// see above.

}; // hat


Best

Kai-Uwe Bux
 
A

AngleWyrm

Kai-Uwe Bux said:
Joe Gottman wrote:
So what about this interface:

class hat {
public:

a) constructor/destructor/...

b) provide an iterator type

iterator insert ( const Object & obj, unsigned long weight = 1 );
// enter an object and return the position where it was entered

iterator get ( void ) const;
// get an iterator to a random element
// according to the current weights

unsigned long get_weight ( const iterator & iter ) const;
// I forgot, what was the weight of this one.

void set_weight ( const iterator & iter );
// my bad, now I do not like this weight anymore.
}

void erase ( const iterator & iter );
// or maybe, I should get rid of this one altogether?
}

Revision 1.14:

Implemented the erase(iterator) functionality, and removed the returns by value
constraint from the pull() function. There is considerable merit to dynamically
changing weight values through the set_weight() functions, so it will probably
be next.

Iterators currently do something like the way associative containers return
pairs, although I'm not sure that this is the best way to go. For instance:

hat<int>::iterator iter = some_hat.begin();
cout << "item: " << iter->item << " weight: " << iter->weight;

This offers a way to rapidly access item and weight values, but unfortunately
suggests that it might be possible to directly modify weight through an
iterator. That cannot be the case, just as is true of keys in an associative
container.
 
K

Kai-Uwe Bux

AngleWyrm said:
Revision 1.14:

Implemented the erase(iterator) functionality, and removed the returns by
value constraint from the pull() function. There is considerable merit to
dynamically changing weight values through the set_weight() functions, so
it will probably be next.

Iterators currently do something like the way associative containers
return pairs, although I'm not sure that this is the best way to go. For
instance:

hat<int>::iterator iter = some_hat.begin();
cout << "item: " << iter->item << " weight: " << iter->weight;

This offers a way to rapidly access item and weight values, but
unfortunately suggests that it might be possible to directly modify weight
through an iterator. That cannot be the case, just as is true of keys in
an associative container.

I found the set/get_weight methods quite usefull: for instance you can
simulate a hat containing /0000 As and B0000 Bs just like so:

CharHat h;
h.insert( 'A', 70000 );
h.insert( 'B', 50000 );
while ( ! h.empty() ) {
CharHat::iterator iter = h.grab();
std::cout << *iter << std::endl;
CharHat::weight_type weight = h.get_weight( iter );
--weight;
if ( weight == 0 ) {
h.erase( iter );
} else {
h.set_weight( iter, weight );
}
}
h.clear();

So, here I am assuming that iterators just point to the objects and not to
some struct containing the object and its weight. As you pointed out, once
you settle on the pair semantics for iterators, it is natural to wish that
something like

iter->weight = 5

changes the weight. If you feel so inclined, you could achieve this using
reference proxies: the iterator would define its operator* and operator->
methods to return reference and pointer proxies which in turn need to
cleverly define the assignment operator= and a conversion operator so that
they can be used as rvalues too. In my opinion, this is not quite worth the
effort. Thus, I would suggest:

template < typename Object,
typename unsigned_type = unsigned long,
typename RNG = CstdlibRandomNumberGenerator >
class hat {
private:
typedef std::vector< Object > ObjectVector;

public:

typedef Object value_type;
typedef typename ObjectVector::iterator iterator;
typedef typename ObjectVector::const_iterator const_iterator;
typedef typename ObjectVector::reference reference;
typedef typename ObjectVector::const_reference const_reference;
typedef typename ObjectVector::pointer pointer;
typedef typename ObjectVector::size_type size_type;
typedef typename ObjectVector::difference_type difference_type;
typedef unsigned_type weight_type;

hat ( void );

hat ( const RNG & rng );

hat ( const hat & other );

~hat ( void );

const hat & operator= ( const hat & other );

bool operator< ( const hat & other );

bool operator<= ( const hat & other );

bool operator> ( const hat & other );

bool operator>= ( const hat & other );

bool operator== ( const hat & other );

bool operator!= ( const hat & other );

void reserve ( size_type cap );

iterator insert ( const Object & obj,
unsigned_type weight = 1 );

const_iterator grab ( void ) const;

template < typename Another_RNG >
const_iterator grab ( Another_RNG & rng ) const;

iterator grab ( void );

template < typename Another_RNG >
iterator grab ( Another_RNG & rng );

weight_type get_weight ( const iterator & iter ) const;

void set_weight ( const iterator & iter,
const unsigned_type & new_weight );

void erase ( const iterator & iter );

void clear ( void );

size_type size ( void ) const;

bool empty ( void ) const;

const_iterator begin ( void ) const;

iterator begin ( void );

const_iterator end ( void ) const;

iterator end ( void );


/*
| The following are just some shortcuts that
| might be usefull for standard problems. They also
| serve for compatibility to the hat container by
| AngleWyrm.
*/

const value_type & peek ( void ) const;

template < typename Another_RNG >
const value_type & peek ( Another_RNG & rng ) const;

value_type pull ( void );

template < typename Another_RNG >
value_type pull ( Another_RNG & rng );

}; // hat

This interface, is what I have implemented based on your ideas. As you can
see, I prefer somewhat fat interfaces for library containers.
 
A

AngleWyrm

Kai-Uwe Bux said:
AngleWyrm wrote:

So, here I am assuming that iterators just point to the objects and not to
some struct containing the object and its weight. As you pointed out, once
you settle on the pair semantics for iterators, it is natural to wish that
something like

iter->weight = 5

changes the weight. If you feel so inclined, you could achieve this using
reference proxies: the iterator would define its operator* and operator->
methods to return reference and pointer proxies which in turn need to
cleverly define the assignment operator= and a conversion operator so that
they can be used as rvalues too. In my opinion, this is not quite worth the
effort.

This shines some light on an issue about the interface that has been bothering
me for a while now: Ideally an iterator should dereference to the contained
objects, increment over contained objects, not some in-between node. The current
arrangement feels a bit like exposing the implementation, a thing I would
rather not do. In persuing iterator proxies, it occurred to me that I might just
be glossing over the issue, and with that thought I suddenly felt like I was
working on something partway between a kludge and outright misrepresentation of
the functionality.

So now I am desirous to attain a clean iterator, free from such burdens of the
current implementation. My approach is going to be rather than storing the data
within a node structure, it will be stored within it's own array. It will then
become possible to properly use iterator de-referencing for the items, and
weights will be through a get_weight/set_weight interface. Were the iterators to
point directly to the objects, then comparison functions on objects would be
handled by those objects, as they should be.

I've also been working on a variant of something we persued in an earlier
thread--a different set of performance characteristics. In situations where the
container will be loaded once, and then drawn upon many times for random items,
it becomes desirable to have the fastest possible access, even at the expense of
a little setup time. It is possible to create a fast_peek() function that has
constant-time access to the contained items, if one pays in advance with an
initialize_fast_peek(). The method behind this scheme is to arrange a lookup
table of iterators, which are repeated according to their weights. The weights
are scaled by the greatest common divisor of the set of weights, and thus the
number of times a given object is pointed to by repeated iterators in the table
is minimized.
 
A

AngleWyrm

Revision 1.20 is up
It now supports the get_weight/set_weight functionality.
The storage mechanism has been changed, so that iterators are no longer pointing
to nodes, but to an actual container of items. It is now possible to dereference
an iterator and get exactly what is expected from such behavior.
 
K

Kai-Uwe Bux

AngleWyrm said:
I've also been working on a variant of something we persued in an earlier
thread--a different set of performance characteristics. In situations
where the container will be loaded once, and then drawn upon many times
for random items, it becomes desirable to have the fastest possible
access, even at the expense of a little setup time. It is possible to
create a fast_peek() function that has constant-time access to the
contained items, if one pays in advance with an initialize_fast_peek().
The method behind this scheme is to arrange a lookup table of iterators,
which are repeated according to their weights. The weights are scaled by
the greatest common divisor of the set of weights, and thus the number of
times a given object is pointed to by repeated iterators in the table is
minimized.

I am not sure that I understand the method you are proposing. Could you
explain the structure of this lookup table? Also, what would be its size
and what would be the time needed to set it up. I am deeply confused by
this gcd bit. I hope, the table size does not explode if someone specified
weights like:

2, 3, 5, 7, 11, 13, 17, 19, 23, 29, and 31.


Best

Kai-Uwe Bux
 
A

AngleWyrm

Kai-Uwe Bux said:
I am not sure that I understand the method you are proposing. Could you
explain the structure of this lookup table? Also, what would be its size
and what would be the time needed to set it up. I am deeply confused by
this gcd bit. I hope, the table size does not explode if someone specified
weights like:

2, 3, 5, 7, 11, 13, 17, 19, 23, 29, and 31.

You have hit it on the head as far as the performance cost of initializing.
Table of pointers size would be a function of the weights; the largest table
size is the sum of the weights -- a worst-case scenario that takes place when
the weights are unique primes. Setup time is also atrocious, being a function of
weights x N, significantly longer than linear performance. It's probably not
worth the cost, both in space and time, of a setup routine in exchange for
constant-time read-only access later.
 
K

Kai-Uwe Bux

AngleWyrm said:
You have hit it on the head as far as the performance cost of
initializing. Table of pointers size would be a function of the weights;
the largest table size is the sum of the weights -- a worst-case scenario
that takes place when the weights are unique primes. Setup time is also
atrocious, being a function of weights x N, significantly longer than
linear performance. It's probably not worth the cost, both in space and
time, of a setup routine in exchange for constant-time read-only access
later.

Hi,


from this reply, I can actually guess what you had in mind. Now, that
lookup table would allow you to draw items with replacement. Removing an
object from the table would mean to remove all the copies of iterators
pointing to it, which comes with prohibitive costs.

Thus, the problem that you really want to solve is this:

Given a weights on a finite set, realize a datastructure that allows for
constant time generation of a random element in the set with probability
distribution defined by the given weights.

Example: Set { a, b, c } weights: w(a) = 5, w(b)=12, w(c)=9.

We want a fast algorithm that yields a with probability 5/25, b ....

In the 1970s, J.A. Walker came up with the following idea. I will explain
it for the example above. Consider the following table:

=================
| left|right
| a: 15 | b
| b: 25 | c
| c: 26 | x
=================

To generate a random letter, proceed as follows:

1) choose a letter L in { a, b, c } with *equal* probabiliy.
2) generate a random integer X uniformly in [0,26).
3) if X < left(L) return L.
4) otherwise, return right(L).

It is a fun exercise to verify that this actually generates the
distribution we wanted. It is also an interesting exercise to check that
such a magic table always can be found, and that the time needed to
construct it is actually linear in the number of letters. You might want to
read Knuth: TAOCP Vol 2 page 120f.

Here is a piece of code that might be usefull to you. It implements
Walker's method.

Check:

a.out | sort -n | uniq -c

produced:

4929 0
12032 1
9039 2

So, you can see that this really modells the distribution that was
specified.

I am sorry, but I did not have the time to document the code properly.

// walker.cc (C) Kai-Uwe Bux [2004]
// ================================

#include <vector>
#include <utility>

class WalkerAlias {
/*
| This class crudely implements J.A. Walker's alias method.
| See:
|
| D.E. Knuth: "The Art of Computer Programming",
| Vol. 2, p. 120f
*/
public:

typedef unsigned long WeightType;
typedef std::vector< unsigned long > WeightVector;
typedef WeightVector::size_type IndexType;
typedef std::pair< WeightType, IndexType > EntryType;

private:

std::vector< EntryType > table;
WeightType total;

// ==================================================
// | accessor methods |
// ==================================================

inline
const IndexType & the_index ( const IndexType & pos ) const {
return( table[pos].second );
}

inline
IndexType & the_index ( const IndexType & pos ) {
return( table[pos].second );
}

inline
const WeightType & the_weight ( const IndexType & pos ) const {
return( table[pos].first );
}

inline
WeightType & the_weight ( const IndexType & pos ) {
return( table[pos].first );
}

inline
IndexType the_size ( void ) const {
return( table.size() );
}

// ==================================================
// | init data |
// ==================================================
/*
| This init routine is a bad hack: it is hard to
| understand and only conjectured to be O(n).
*/

inline
void setup_table ( const WeightVector & weights ) {
table.reserve( weights.size() );
for ( IndexType pos = 0; pos < weights.size(); ++pos ) {
table.push_back( EntryType( weights[pos]*weights.size(),
weights.size() ) );
total += weights[pos];
}
IndexType big_begin = 0;
for ( IndexType pos = 0; pos < the_size(); ++pos ) {
fix_slot( pos, big_begin );
}
}

inline
void fix_slot ( IndexType slot,
IndexType & begin_big ) {
// pre: all weights in [0,begin_big) are at most average.
// all slots in [0,slot) are fine.
// post: all weights in [0,begin_big) are at most average.
// all slots in [0,slot] are fine.
// side: begin_big does not decrease but may increase.
while( ( the_index( slot ) == the_size() )
&&
( the_weight( slot ) < total ) ) {
// not yet fixed and in need of a complement
while ( the_weight( begin_big ) <= total ) {
++ begin_big;
}
the_index( slot ) = begin_big;
the_weight( begin_big ) -= ( total - the_weight( slot ) );
slot = begin_big;
}
}

// ==================================================
// | random choice |
// ==================================================

template < typename RNG >
inline
IndexType grab ( RNG & rng ) const {
/*
| * Pick a random index according to the
| originally specified weights.
| * This clearly is a constant time routine: No looping at all.
| However, we need to calls to the RNG.
*/
IndexType slot ( rng( the_size() ) );
if ( rng( total ) < the_weight( slot ) ) {
return( slot );
} else {
return( the_index( slot ) );
}
}

public:

WalkerAlias ( const WeightVector & weights ) :
total ( 0 )
{
setup_table( weights );
}

~WalkerAlias ( void ) {
}

template < typename RNG >
inline
IndexType operator() ( RNG & rng ) const {
return( grab( rng ) );
}

}; // WalkerAlias


#include <iostream>
#include <cstdlib>

class CstdlibRandomNumberGenerator {
private:

static
const unsigned long rand_max = RAND_MAX;

public:

CstdlibRandomNumberGenerator ( void ) {}

CstdlibRandomNumberGenerator ( unsigned long seed ) {
std::srand( seed );
}

unsigned long operator() ( unsigned long bound ) {
/*
| We will use the higher order bits of rand().
| Moreover, we avoid floating point arithmetic.
| The cost is that we might have to call rand()
| multiple times.
*/
unsigned long int int_width = rand_max / bound;
unsigned long int max_valid = int_width * bound;
/*
| max_valid is at least rand_max / 2.
|
| The following loop will be traversed
| rand_max/max_valid < 2 times on the average:
*/
while ( true ) {
unsigned long int raw_random ( std::rand() );
if ( raw_random < max_valid ) {
return( raw_random / int_width );
}
}
}

}; // CstdlibRandomNumberGenerator

int main ( void ) {
CstdlibRandomNumberGenerator rng ( 129342 );
WalkerAlias::WeightVector weights;
weights.push_back( 5 );
weights.push_back( 12 );
weights.push_back( 9 );
WalkerAlias alias ( weights );
for ( unsigned long i = 0; i < 26000; ++i ) {
std::cout << alias( rng ) << std::endl;
}
}

// end of file


Best

Kai-Uwe
 
A

AngleWyrm

The C++ hat random selection container:
http://home.comcast.net/~anglewyrm/hat.html
Kai-Uwe Bux said:
Thus, the problem that you really want to solve is this:

Given a weights on a finite set, realize a datastructure that allows for
constant time generation of a random element in the set with probability
distribution defined by the given weights.

Example: Set { a, b, c } weights: w(a) = 5, w(b)=12, w(c)=9.

We want a fast algorithm that yields a with probability 5/25, b ....

In the 1970s, J.A. Walker came up with the following idea. I will explain
it for the example above. Consider the following table:

=================
| left|right
| a: 15 | b
| b: 25 | c
| c: 26 | x
=================

To generate a random letter, proceed as follows:

1) choose a letter L in { a, b, c } with *equal* probabiliy.
2) generate a random integer X uniformly in [0,26).
3) if X < left(L) return L.
4) otherwise, return right(L).

It is a fun exercise to verify that this actually generates the
distribution we wanted. It is also an interesting exercise to check that
such a magic table always can be found, and that the time needed to
construct it is actually linear in the number of letters. You might want to
read Knuth: TAOCP Vol 2 page 120f.

Here is a piece of code that might be usefull to you. It implements
Walker's method.

Check:

a.out | sort -n | uniq -c

produced:

4929 0
12032 1
9039 2

So, you can see that this really modells the distribution that was
specified.

UNIX :)
Wound up downloading a uniq port from sourceforge just to perform the check in
DOS. If anyone is interested in such command-line utils for DOS, they can be had
at:
http://sourceforge.net/project/showfiles.php?group_id=23617&package_id=26492

I just got a copy of "Random Variables With General Distributions" by ALASTAIR
J. WALKER, but the original OCR scans at the ACM are buggy, and also written in
FORTRAN, with single-letter variables. I'm currently translating it to C++ so
that I can see what he did first-hand, because I have to decipher FORTRAN,
rather than read it. From what I've seen so far, it appears that his algorithm
will make it quite possible to use floating point weights, which has been
requested as a feature.

I haven't gotten to the magic-table part yet, but it's very interesting to me.

Don't yet have a copy of TAOCP (I know, sinner that I am) but I'm gonna have to
pick one up soon.
 
A

AngleWyrm

The C++ hat random selection container:
http://home.comcast.net/~anglewyrm/hat.html
Kai-Uwe Bux said:
Thus, the problem that you really want to solve is this:

Given a weights on a finite set, realize a datastructure that allows for
constant time generation of a random element in the set with probability
distribution defined by the given weights.

Example: Set { a, b, c } weights: w(a) = 5, w(b)=12, w(c)=9.

We want a fast algorithm that yields a with probability 5/25, b ....

In the 1970s, J.A. Walker came up with the following idea. I will explain
it for the example above. Consider the following table:

=================
| left|right
| a: 15 | b
| b: 25 | c
| c: 26 | x
=================

To generate a random letter, proceed as follows:

1) choose a letter L in { a, b, c } with *equal* probabiliy.
2) generate a random integer X uniformly in [0,26).
3) if X < left(L) return L.
4) otherwise, return right(L).

It is a fun exercise to verify that this actually generates the
distribution we wanted. It is also an interesting exercise to check that
such a magic table always can be found, and that the time needed to
construct it is actually linear in the number of letters. You might want to
read Knuth: TAOCP Vol 2 page 120f.

Here is a piece of code that might be usefull to you. It implements
Walker's method.

Check:

a.out | sort -n | uniq -c

produced:

4929 0
12032 1
9039 2

So, you can see that this really modells the distribution that was
specified.

UNIX :)
Wound up downloading a uniq port from sourceforge just to perform the check in
DOS. If anyone is interested in such command-line utils for DOS, they can be had
at:
http://sourceforge.net/project/showfiles.php?group_id=23617&package_id=26492

I just got a copy of "Random Variables With General Distributions" by ALASTAIR
J. WALKER, but the original OCR scans at the ACM are buggy, and also written in
FORTRAN, with single-letter variables. I'm currently translating it to C++ so
that I can see what he did first-hand, because I have to decipher FORTRAN,
rather than read it. From what I've seen so far, it appears that his algorithm
will make it quite possible to use floating point weights, which has been
requested as a feature.

I haven't gotten to the magic-table part yet, but it's very interesting to me.

Don't yet have a copy of TAOCP (I know, sinner that I am) but I'm gonna have to
pick one up soon.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
474,172
Messages
2,570,934
Members
47,477
Latest member
ColumbusMa

Latest Threads

Top