Good People,
I am honestly overwhelmed by the time you have each taken to respond.
I was/am completely unprepared! When I saw 9 messages in this thread,
I thought they all would be chiding me. I am pleasantly surprised by
your time and generosity.
Firstly, I need to sit down and really read the responses. I wanted to
post right away to allay any fears that I wouldn't respond in a timely
fashion. I just need a little time to digest and respond to each
poster as appropriate. I will do this.
Secondly, as you have all given this serious thought, I should refine
the requirement for you, as it may lead to a more interesting
discussion. I am certain that you have all basically "solved" the
problem (well, it's not really a problem, per se) already, so I am
posting a refinement _not_ to ask for further assistance. I am posting
more info FYI, if anything.
It is a real problem, and I am designing a real product. This is no
homework assignment, to be sure. That said, the purpose is to provide
work for grid nodes. I will have n clients/nodes asking the
server for work/tasks. I am centrally distributing all the server work
that can be done by grid nodes out to those nodes. It is that simple.
I used to work for a leader in the grid business, and I saw an
opportunity in my product design to create/use a grid. That work gave
me insight into how complex grid work/jobs can be described and
managed. Having such an appreciation also showed me a completely
different way to utilize grids, and this random weighting is only one
of the ways I want to do things differently than my past employer did.
The work for the nodes is varied. It can be going out and scraping a
web page, doing a mathematical calculation, or parsing text against a
dictionary, for example.
"Priority" is a loose cover term for all the components of priority.
There is declared priority from the sender's (or task's) perspective
(only considers the task itself), and there is priority from the queue
manager's perspective (takes everything into account). The priority
then extends to the receiver/node, and finally the submission of the
results to the server/sender, and how those results are processed.
But this is only for "priority" in general. There are several types of
priority, each falling under the functional blocks/flow described
above. And yes, each and every priority can be adjusted en route.
Things I have to take into account include:
- Moving average of task type's completion
- Network lag to/from the server
- Network lag to/from the target (from the node)
- Node characteristics (think "health": CPU, RAM, I/O, etc.), both as
reported and when the task is due for execution on the node
- Latest time/date by which the task must be returned
- Earliest time/date before which the task must wait to start
- Action upon expiration (delete, keep waiting, wait without priority
escalation, etc.)
- # of assignment retries on node and server
- # of error retries on node and server
- Task failure action(s)
- Predecessor tasks this task must wait for
- Preceding task this task needs to be executed immediately after
- Re-assignment options (move task waiting on one node to another
node)
- Execution requirements (CPU, RAM, I/O, idle time, etc.)
- All the priorities and metadata that apply for each node
(environmental, computational, etc.)
- ...and there are quite a few more
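For concreteness, here is a minimal sketch (in Java, since the server is J2EE) of how a few of these factors might ride along on a task record. Every name here is hypothetical; the real list above is much larger, and the urgency() heuristic is just one illustrative way a deadline could feed into a priority.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Hypothetical task record carrying a handful of the scheduling
// factors listed above; all field names are illustrative only.
class TaskMeta {
    enum ExpirationAction { DELETE, KEEP_WAITING, WAIT_NO_ESCALATION }

    final String taskType;          // e.g. "scrape", "calc", "parse"
    final int declaredPriority;     // sender's view of the task alone
    final Instant notBefore;        // earliest start time
    final Instant notAfter;         // latest acceptable return time
    final int maxAssignRetries;
    final int maxErrorRetries;
    final ExpirationAction onExpire;

    TaskMeta(String taskType, int declaredPriority, Instant notBefore,
             Instant notAfter, int maxAssignRetries, int maxErrorRetries,
             ExpirationAction onExpire) {
        this.taskType = taskType;
        this.declaredPriority = declaredPriority;
        this.notBefore = notBefore;
        this.notAfter = notAfter;
        this.maxAssignRetries = maxAssignRetries;
        this.maxErrorRetries = maxErrorRetries;
        this.onExpire = onExpire;
    }

    // One possible urgency heuristic: the closer the deadline, the
    // more weight the declared priority gets. Purely illustrative.
    double urgency(Instant now) {
        long secondsLeft = Math.max(1, now.until(notAfter, ChronoUnit.SECONDS));
        return declaredPriority / (double) secondsLeft;
    }
}
```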
So as I see it, all of these factors, when combined with environmental
conditions, contribute to a priority. I believe it will be impossible
to assign one number to such a complex thing and call it "The
Priority". Nor do I believe that a change in one metadata priority
will necessitate changing any/all others. Just dealing with this logic
between metadata requirements/restrictions is a job in itself.
So the methodology is as such:
A queue of work exists on the server. It is comprised of tasks. These
tasks are unassigned.
A process chooses appropriate tasks from the queue for inclusion into
a "Job", or collection of tasks, bound for a known, available, active
node. It is this process, and a few more, that I was asking about here
initially. The selection of tasks in a large queue, weighted by their
existing priority.
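For the selection step itself, one approach worth sketching is priority-weighted random sampling: each task's current composite priority acts as its weight, so higher-priority tasks are drawn more often without ever starving the low-priority ones. This is only an assumption about how the selection could work, not a summary of anyone's proposal:

```java
import java.util.List;
import java.util.Random;

// Sketch of priority-weighted random selection from a queue.
// pickIndex returns the index of the chosen task; weights are the
// tasks' current composite priorities (all assumed > 0).
class WeightedPicker {
    private final Random rng;

    WeightedPicker(Random rng) { this.rng = rng; }

    int pickIndex(List<Double> weights) {
        double total = 0;
        for (double w : weights) total += w;
        double r = rng.nextDouble() * total;   // point in [0, total)
        double cumulative = 0;
        for (int i = 0; i < weights.size(); i++) {
            cumulative += weights.get(i);
            if (r < cumulative) return i;      // falls in task i's slice
        }
        return weights.size() - 1;             // guard against rounding
    }
}
```

Drawing repeatedly and removing each picked task would fill a "Job"; a deterministic top-k pass would be the obvious alternative if starving low-priority tasks were acceptable.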
Anyway, that job is sent to the node for execution. The node iterates
through the job, doing each task. After a set period, it returns what
results it has, and continues until all tasks are completed. If any
tasks time out, then existing rules determine how the situation is
handled. The node may, for example, retain the asynchronous jobs and
ask for another job, adding what it receives to the existing tasks.
When the job is returned to the server, the server saves the results,
and the process continues. I will say this, especially since threading
was mentioned: the server is J2EE.
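The node-side loop described above could be sketched like this. Task, Result, runTask, and sendResults are all placeholders, and a fixed batch count stands in for the "set period" after which partial results are returned:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Sketch of the node-side loop: work through a job's tasks, flushing
// partial results back to the server every BATCH tasks. All names
// here are placeholders, not the product's actual API.
class NodeLoop {
    interface Task { }
    interface Result { }
    interface Executor { Result runTask(Task t); }
    interface Uplink { void sendResults(List<Result> partial); }

    static final int BATCH = 5;   // stand-in for "a set period"

    static void runJob(Deque<Task> job, Executor exec, Uplink server) {
        List<Result> pending = new ArrayList<>();
        while (!job.isEmpty()) {
            pending.add(exec.runTask(job.poll()));
            if (pending.size() >= BATCH) {
                server.sendResults(new ArrayList<>(pending));
                pending.clear();
            }
        }
        if (!pending.isEmpty()) server.sendResults(pending); // final flush
    }
}
```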
Again, let me express my gratitude at your responses already. I will
prepare a more thorough response shortly.
Thanks!
pat