speaking of forking -- parallel database fetches?

DJ Stunks · Nov 24, 2006

Hey all,

I have a question about performing two long-running selects from a
database in parallel. I've only written a few scripts which do things
in parallel so I'm not an expert by any means.

I have two straightforward SELECT statements, but both take about two
minutes to complete. I'd like to run them in parallel, but I obviously
need access to all the rows - what's the best way to do so?

I was thinking something along these lines (pseudocode below) but I was
hoping there would be some way to give the parent access to the
statement handle itself so it could pull the rows once the queries were
complete rather than pulling all the rows in the child, and serializing
and passing to the parent.

Any ideas or maybe modules which could be handy? (I looked at both
Acme::Spork and Parallel::ForkManager but neither are appropriate)

TIA,
-jp

#!/usr/bin/perl <pseudocode>

use strict;
use warnings;

my @kids = (
{ query => 'select * from big_table' },
{ query => 'select * from another_big_table' },
);

my $pid;
for my $kid (@kids) {
$pid = open( my $fh, '|-');
die "Can't fork: $!\n" if not defined $pid;

@{ $kid }{ 'pid','handle' } = ($pid,$fh);
}

if ( $pid == 0) { # I'm one of the children

# connect to the db
# prepare query
# execute query
# wait for results
# foreach @row = $sth->fetchrow_array
# print join( $;, @row ), "\n"

# exit; (exit or waitpid? do I know the parent read everything?)
}
else { # I'm the parent

# while ( my $line = < $kids[0]{handle} >) {
# @row = split $;, $line;
# do whatever with @row
#
# while ( my $line = < $kids[1]{handle} >) {
# etc.
}

__END__

xhoster · Nov 24, 2006

DJ Stunks said:
Hey all,

I have a question about performing two long-running selects from a
database in parallel. I've only written a few scripts which do things
in parallel so I'm not an expert by any means.

I have two straightforward SELECT statements, but both take about two
minutes to complete. I'd like to run them in parallel, but I obviously
need access to all the rows - what's the best way to do so?

How many rows are we talking?

I was thinking something along these lines (pseudocode below) but I was
hoping there would be some way to give the parent access to the
statement handle itself so it could pull the rows once the queries were
complete

I'm pretty sure that that ain't gonna happen.

rather than pulling all the rows in the child, and serializing
and passing to the parent.

Any ideas or maybe modules which could be handy? (I looked at both
Acme::Spork and Parallel::ForkManager but neither are appropriate)

Parallel::Jobs might help, although it isn't very easy to use and wants to
run commands rather than Perl code (so you would have to fire up a new Perl
interpreter from scratch, and it wouldn't inherit variables, etc, from the
existing one). I've posted here a while ago a simple mod of
Parallel::ForkManager to allow back tack from the child to the parent, but
it send the data as a slug of serialized data, not as a stream. Whether
this is deadly or not depends on how many rows.

http://groups.google.com/group/comp.lang.perl.modules/browse_frm/thread/9c8
ef79472740156/

Also, there is Parallel::Simple, which I never used but looks promising.

my $pid;
for my $kid (@kids) {
$pid = open( my $fh, '|-');
die "Can't fork: $!\n" if not defined $pid;

@{ $kid }{ 'pid','handle' } = ($pid,$fh);
}

if ( $pid == 0) { # I'm one of the children

How does the child know which child it is? It would have to loop over
@kids comparing it's pid to the stored pid (which it can't do, because
the stored pid is stored only in the parent, not the child). Better to
move this else block into the inside of the the for my $kid loop, that way
the kid automatically know who it is by looking in $kid.

....

# wait for results
# foreach @row = $sth->fetchrow_array

Technically, you can't waid for results as a separate action from
calling fetchrow_array. fetchrow_array is inherently waiting for results.

# print join( $;, @row ), "\n"

# exit; (exit or waitpid? do I know the parent read everything?)

exit. You almost certainly don't want to waitpid (for what? The parent?).
At this point, you stuffed everything you have into the buffer up to the
parent. That is all you an do. If the parent doesn't read it all, what
is the child to do about it?

}
else { # I'm the parent

# while ( my $line = < $kids[0]{handle} >) {
# @row = split $;, $line;
# do whatever with @row

You will want to "close $kids[0]{handle} or [die|warn] $! $?;" to make sure
the kid ended things on a good note. The close implicilty waits for the
the child to exit.

Xho

Rocco Caputo · Nov 25, 2006

I have two straightforward SELECT statements, but both take about two
minutes to complete. I'd like to run them in parallel, but I obviously
need access to all the rows - what's the best way to do so? [...]
Any ideas or maybe modules which could be handy? (I looked at both
Acme::Spork and Parallel::ForkManager but neither are appropriate)

There are four ways to do this with POE. Whether they're appropriate
is left as an exercise.

http://search.cpan.org/search?query=POE+DBI&mode=dist

DJ Stunks · Nov 25, 2006

Rocco said:
I have two straightforward SELECT statements, but both take about two
minutes to complete. I'd like to run them in parallel, but I obviously
need access to all the rows - what's the best way to do so? [...]
Any ideas or maybe modules which could be handy? (I looked at both
Acme::Spork and Parallel::ForkManager but neither are appropriate)

Click to expand...

There are four ways to do this with POE. Whether they're appropriate
is left as an exercise.

http://search.cpan.org/search?query=POE+DBI&mode=dist

this looks great, time to start reading

you know POE had occurred to me, my search terms must have been bad
choices.

Thanks to both who responded,
-jp

Peter J. Holzer · Nov 26, 2006

I have two straightforward SELECT statements, but both take about two
minutes to complete. I'd like to run them in parallel, but I obviously
need access to all the rows - what's the best way to do so?

I was thinking something along these lines (pseudocode below) but I was
hoping there would be some way to give the parent access to the
statement handle itself so it could pull the rows once the queries were
complete

Theoretically, that's possible. In practice, database client libaries
don't support that.

[...]

if ( $pid == 0) { # I'm one of the children

# connect to the db
# prepare query
# execute query
# wait for results
# foreach @row = $sth->fetchrow_array
# print join( $;, @row ), "\n"

# exit; (exit or waitpid? do I know the parent read everything?)
}
else { # I'm the parent

# while ( my $line = < $kids[0]{handle} >) {
# @row = split $;, $line;
# do whatever with @row
#
# while ( my $line = < $kids[1]{handle} >) {
# etc.
}

You are reading all the lines from kid 0 before the first line from kid 1
here. This means that kid 1 will block as soon as it has written enough
records to fill the pipe (typically a few kB). Not a problem if your
queried only return a few rows, but if they are long-running because
they return lots of rows, they won't really run in parallel.

You may want to look at select (or IO::Select), but be warned that select
and <> don't mix well - you need to use sysread instead.

hp

DJ Stunks · Nov 27, 2006

DJ said:
Rocco said:

I have two straightforward SELECT statements, but both take about two
minutes to complete. I'd like to run them in parallel, but I obviously
need access to all the rows - what's the best way to do so? [...]
Any ideas or maybe modules which could be handy? (I looked at both
Acme::Spork and Parallel::ForkManager but neither are appropriate)

Click to expand...

There are four ways to do this with POE. Whether they're appropriate
is left as an exercise.

http://search.cpan.org/search?query=POE+DBI&mode=dist

Click to expand...

this looks great, time to start reading

Well, I went with EasyDBI which looks ideal, however, my queries are
both coming back with the error "Died". The queries work through
straight DBI, and both of them take the right amount of time through
EasyDBI, but no rows return, just the error.

I'm not sure if anyone is familiar with this module. It doesn't look
like I can get any more detailed debugging as to what "Died". I'll try
one of the other 4 suggestions tomorrow and see how it goes.

If it matters, the queries use the DBD::ODBC driver and I'm running on
Win32 (ActiveState Perl 5.8.7 [813], DBD::ODBC 1.13, POE 0.9500,
EasyDBI 1.14)

Thanks for the suggestion though, I'll keep at it.
-jp

Linux: using "clone3" and "waitid"	0	Oct 17, 2023
fork,exec, and parallel processing	4	Mar 26, 2007
How to synchronize I/O (Parallel::ForkManager)	1	Jul 28, 2010
Need suggestions to improve code	2	Aug 24, 2013
Daemon Server, Forking, Defunct Processes	2	Sep 11, 2008
PID of exec	0	Oct 25, 2007
Parallel::ForkManager, Net::HTTP and catching Timeout::Error	2	Dec 30, 2009
behavior of run_on_wait in Parallel::ForkManager	0	Nov 2, 2004

speaking of forking -- parallel database fetches?

DJ Stunks

xhoster

Rocco Caputo

DJ Stunks

Peter J. Holzer

DJ Stunks

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads