distributed initialisation code


Roedy Green

I wondered what sort of techniques you use for this sort of problem.

Let's say I have a program that processes hundreds of files. Each one
may be processed by some set of hundreds of processing modules that
may be called many times per file, for different parts of it.

In that processing code there is a need for:

one-time startup initialisation = handled by static init.

shut down = handled by Runtime.getRuntime().addShutdownHook
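
For illustration, the minimal pattern I mean for those two, with invented class and method names:

public class Processor
{
    // one-time startup initialisation
    static
    {
        loadTables();
    }

    // shut down
    static
    {
        Runtime.getRuntime().addShutdownHook( new Thread( Processor::flushStats ) );
    }

    private static void loadTables() { /* read config, build lookup tables */ }
    private static void flushStats() { /* write summary statistics */ }
}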

But what about some code that should be run every time a new file is
loaded?

Possible solutions

1. mix code from many different processors together in a newFileInit
method. All those variables must be public. This badly breaks
encapsulation.

2. Pass a boolean to every processor on every call to tell it if this
represents a new file. Most of the time, the boolean is of no
interest.

3. some kind of way of registering a callback that gets called on
reading a new file. Such a scheme might be easily extendable to
handle various conditions without disturbing existing code. It would
be like addShutdownHook but for more general conditions. Is there a
canned solution?
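
Something along these lines is what I have in mind; a minimal sketch, all names invented, analogous to addShutdownHook but for "new file" events:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public final class FileEvents
{
    private static final List<Runnable> newFileHooks =
        Collections.synchronizedList( new ArrayList<>() );

    /** A processor registers code to run each time a new file is loaded. */
    public static void addNewFileHook( Runnable hook )
    {
        newFileHooks.add( hook );
    }

    /** The driver calls this once per file, before any processing starts. */
    public static void fireNewFile()
    {
        synchronized ( newFileHooks )
        {
            for ( Runnable hook : newFileHooks )
            {
                hook.run();
            }
        }
    }
}

Each processor could register its own per-file reset from its static init, so its per-file fields stay private to it.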
 

Roedy Green

3. some kind of way of registering a callback that gets called on
reading a new file. Such a scheme might be easily extendable to
handle various conditions without disturbing existing code. It would
be like addShutdownHook but for more general conditions. Is there a
canned solution?

4. have the processors call boolean methods to see if it is time to do
various special processing.
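
One way option 4 could look, sketched here with a generation counter instead of a plain boolean so repeated calls within the same file stay cheap; all names are invented:

import java.util.concurrent.atomic.AtomicLong;

// hypothetical helper; the driver bumps the counter once per loaded file
final class FileContext
{
    private static final AtomicLong generation = new AtomicLong();
    static void nextFile()          { generation.incrementAndGet(); }
    static long currentGeneration() { return generation.get(); }
}

// one processor: it detects the new file itself, so its state stays private
final class SomeProcessor
{
    private long seenGeneration = -1;

    void process( String part )        // "part" stands for whatever unit you hand it
    {
        long g = FileContext.currentGeneration();
        if ( g != seenGeneration )
        {
            seenGeneration = g;
            resetPerFileState();       // this processor's own per-file init
        }
        // ... normal processing of the part ...
    }

    private void resetPerFileState() { /* ... */ }
}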
 

Marcel Müller

I wondered what sort of techniques you use for this sort of problem.

Let's say I have a program that processes hundreds of files. Each one
may be processed by some set of hundreds of processing modules that
may be called many times per file, for different parts of it.

In that processing code there is a need for:

one-time startup initialisation = handled by static init.

shut down = handled by Runtime.getRuntime().addShutdownHook

But what about some code that should be run every time a new file is
loaded?

Possible solutions

1. mix code from many different processors together in a newFileInit
method. All those variables must be public. This badly breaks
encapsulation.
?

2. Pass a boolean to every processor on every call to tell it if this
represents a new file. Most of the time, the boolean is of no
interest.

This may create a race condition: one processor might receive "new file"
while a second one already gets another task from that file, although
the per-file initialization has not yet completed.

3. some kind of way of registering a callback that gets called on
reading a new file. Such a scheme might be easily extendable to
handle various conditions without disturbing existing code. It would
be like addShutdownHook but for more general conditions. Is there a
canned solution?

This is still a race condition for the same reason. OK, your callback
might be synchronized, but this might block several parallel workers for
some time.

I would recommend 4:
Treat the initialization as a processing task like any other of your
tasks, but with one exception: the other tasks are not scheduled until
this task has completed. You need a controller anyway that distributes
your processing modules over the available resources, and this
controller can enforce that constraint. So at first only initialization
tasks are scheduled. When one of them has completed, the corresponding
processing tasks for that file are put in the queue.

Of course, now you have two kinds of processing tasks. But I would not
implement this with a boolean. You have three types of objects:
- #1 representing the system resources, i.e. the worker threads.
- #2 the work items. They can be of different types, but with a common
interface. Let us call it ITask.
- #3 the scheduler or controller. This is a singleton and handles the
work item pool. This object needs to be synchronized. Basically it is
only a queue, maybe with priorities.

So your program operates as follows:
- Place at least one initial task in the queue, i.e. a file
initialization task.
- Spawn the worker threads.
- Each worker thread calls some getWork() method of the controller
singleton. If it receives null, the worker thread terminates; otherwise
it calls doWork() on the ITask interface. getWork() might block if there
is currently nothing to do but the worker is likely to be reused later.
- The task might be a file initializer: its doWork() does the
initialization work and places the required tasks for this file in the
global task queue.
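
Roughly like this, with all names only placeholders; in this sketch the workers block in getWork() and are stopped by interruption instead of the null return described above, just to keep it short:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** #2: the common interface of all work items. */
interface ITask
{
    void doWork();
}

/** #3: the controller singleton holding the global work item pool. */
final class Controller
{
    static final Controller INSTANCE = new Controller();
    private final BlockingQueue<ITask> queue = new LinkedBlockingQueue<>();

    void putWork( ITask task )
    {
        queue.add( task );
    }

    /** Blocks while there is nothing to do; a real version would also signal shutdown. */
    ITask getWork() throws InterruptedException
    {
        return queue.take();
    }
}

/** A file initialization task; when it completes it schedules the per-file tasks. */
final class FileInitTask implements ITask
{
    private final String fileName;
    FileInitTask( String fileName ) { this.fileName = fileName; }

    public void doWork()
    {
        // ... open the file, build the shared per-file state ...
        // then put the processing tasks for this file in the global queue
        Controller.INSTANCE.putWork( () -> { /* process one part of fileName */ } );
        Controller.INSTANCE.putWork( () -> { /* process another part of fileName */ } );
    }
}

/** #1: the worker threads representing the system resources. */
final class Worker extends Thread
{
    public void run()
    {
        try
        {
            while ( true )
            {
                Controller.INSTANCE.getWork().doWork();
            }
        }
        catch ( InterruptedException e )
        {
            // interrupted = told to terminate
        }
    }
}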

This concept is widely extensible. You might even have a shutdown task
for each file. The initialization object for a file might have done its
work already, but that does not mean it no longer exists. The child
tasks it created may hold a reference to it, maybe simply by using
nested classes, and they might notify their parent when their work is
done. (The called method of the parent needs to be synchronized!) The
parent may count the child tasks and, once the last one has completed,
put a final cleanup task in the global queue. The cleanup task might
have higher priority, to get rid of memory objects more quickly. In this
special case the cleanup might even be done in place, because you are
sure that no one can be waiting for the synchronized method anymore. But
be very careful with things like this: staying in a synchronized context
for too long can cause serious trouble like deadlocks or simply no
parallelism.
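
A sketch of that parent/child counting, reusing the Controller and ITask from the sketch above; again all names are invented and the priority handling is left out:

/** Variant of the file initialization task: it counts its child tasks
    and schedules a final cleanup task once the last one has completed. */
final class CountingFileTask implements ITask
{
    private final String fileName;
    private int openChildren;

    CountingFileTask( String fileName ) { this.fileName = fileName; }

    public void doWork()
    {
        // ... per-file initialization ...
        final int parts = 10;                             // however many tasks this file needs
        synchronized ( this ) { openChildren = parts; }   // count them before any child can finish
        for ( int p = 0; p < parts; ++p )
        {
            Controller.INSTANCE.putWork( () ->
            {
                // ... process one part of fileName ...
                childDone();
            } );
        }
    }

    /** Called by each child when it has finished; synchronized as noted above. */
    private synchronized void childDone()
    {
        if ( --openChildren == 0 )
        {
            // the last child has completed: schedule the cleanup for this file
            Controller.INSTANCE.putWork( () -> { /* release per-file resources */ } );
        }
    }
}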


Marcel
 
