Walter Roberson
I am hoping someone can point me to an existing module or give
a good algorithm hint for this.
I process large-ish text files, performing a bit of work on each line.
The processing of each line can often be done mostly independently of
the other lines, but it would be better if all the output for a given
line were to precede the output for the next line. Some manner of
internal buffering for that purpose would be fine.
The per-line CPU time is short, by the way, but many lines call for DNS
lookups, and I'd like to do some of those lookups in parallel while
preferably maintaining the output order -as if- the lines were processed
sequentially (assuming the processing of each is independent).
I started writing this up in terms of ordered queues of processing
requests, and callbacks, but before I got very far on that approach, I
realized that in theory I could instead do something akin to having a
set of filehandles, barely change the existing code, and do some
behind-the-scenes work so that what was printed to any one filehandle
did not get sent to stdout until all the previous filehandles had
terminated.
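For what it's worth, the ordered-output idea can be sketched in a few
lines -- Python here rather than Perl, purely for illustration, and the
names are made up. The pool runs the per-line work in parallel, but the
results are collected in input order, which is exactly the sequencing I
want:

```python
from concurrent.futures import ThreadPoolExecutor

def process(line):
    # Stand-in for the real per-line work (e.g. a DNS lookup).
    return line.upper()

lines = ["alpha", "beta", "gamma"]
with ThreadPoolExecutor(max_workers=4) as pool:
    # map() runs process() on the lines concurrently, but yields the
    # results in input order, so the output is sequenced correctly
    # even when a later line happens to finish first.
    results = list(pool.map(process, lines))
```

The buffering is implicit: a result that finishes early just waits in
the pool until its turn to be yielded.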
Generalizing, I can see that this same mechanism could come up in other
contexts -- one might want to start threads that execute
independently, each with a thread-localized filehandle it could write
to, with the outputs automatically being multiplexed
into a single filehandle so that all output from threads started
earlier appeared before that of threads started later. Would someone
have some ideas on good ways to accomplish this?
While writing the previous paragraph, I realized that in the program
I'm working on most immediately, output is always the last thing
a thread would do, after the "work" is done. So what I could do is,
just before output, queue on a semaphore that is not released
until the previous thread finishes. But I could imagine
other programs with intermediate outputs.
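That semaphore chaining looks something like this (again sketched in
Python, with events standing in for the semaphores; illustrative only).
Each thread does its work in parallel, then blocks until the previous
thread has finished its output:

```python
import threading

out = []
n = 3
# One event per thread, plus one that the last thread sets.
events = [threading.Event() for _ in range(n + 1)]
events[0].set()  # thread 0 may produce its output immediately

def worker(i):
    result = "output for line %d" % i  # the "work", done in parallel
    events[i].wait()                   # queue until the previous thread is done
    out.append(result)                 # output is the last thing the thread does
    events[i + 1].set()                # release the next thread

threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The chain guarantees the outputs land in start order no matter which
thread's work completes first.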
I will not be able to literally use perl threads; threaded
perl fails one of its build tests on my IRIX system
("known problem", "put here to stop you from putting it into
production"). I guess there's always forking.
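If it does come down to forking, a pipe-per-child scheme would give the
same ordering for free: fork a child per unit of work, have each child
write its output to its own pipe, and have the parent read the pipes in
fork order. A minimal sketch (Python for brevity; the same shape works
with Perl's pipe/fork):

```python
import os

lines = ["one", "two", "three"]
pipes = []
for line in lines:
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:                     # child: do the work, write, exit
        os.close(r)
        os.write(w, line.upper().encode())  # stand-in for the real work
        os.close(w)
        os._exit(0)
    os.close(w)                      # parent keeps only the read end
    pipes.append((pid, r))

results = []
for pid, r in pipes:                 # read the pipes in fork order, so
    chunks = []                      # output is sequenced regardless of
    while True:                      # which child finished first
        data = os.read(r, 4096)
        if not data:
            break
        chunks.append(data)
    os.close(r)
    os.waitpid(pid, 0)
    results.append(b"".join(chunks).decode())
```

A fork per line is obviously too heavy for short per-line work; in
practice one would fork a fixed pool of children and hand each a chunk
of lines, but the ordered-read idea is the same.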