s.ross
Architecting large Web service downloads in Ruby.
I've run up against a performance bottleneck (one I could have
predicted), but I fear my design is causing additional grief. Here's an outline:
Background
==========
My app has to download from 1,000 to 20,000 rows from a third-party
Web service (over HTTP, using xml-rpc, which uses net/http). No row
has any indicator when it was last updated, so local caches are
difficult if not impossible to reliably maintain. Everything has to be
considered "dirty". These rows can be downloaded in batches of up to
100, so it's on the order of 90 - 120 seconds over a quick net
connection to grab them and insert them into the primary table
synchronously.
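To make the batch step concrete, here is a minimal sketch of the synchronous download loop described above. `fetch_batch` is a hypothetical stand-in for the real XML-RPC call (the actual service, method names, and paging parameters are not shown in the post); here it just fabricates rows so the loop structure is runnable:

```ruby
# Synchronous batched download: pull rows 100 at a time, accumulate locally.
BATCH_SIZE = 100

def fetch_batch(offset, limit)
  # Hypothetical stand-in: the real app would make one XML-RPC request
  # over net/http here and return up to `limit` master rows.
  (offset...(offset + limit)).map { |id| { id: id, active: id.even? } }
end

def download_all(total_rows)
  rows = []
  (0...total_rows).step(BATCH_SIZE) do |offset|
    limit = [BATCH_SIZE, total_rows - offset].min
    rows.concat(fetch_batch(offset, limit))
  end
  rows
end
```

At 100 rows per request, 20,000 rows is only 200 round trips, which is why this step finishes in a couple of minutes while the per-row detail fetches dominate.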
The rub is that each row has a single detail row that is quite a bit
bulkier. Each of the master rows has an Active flag, and at any given
time between 50 and 80% of them are active. Iterating all the active
rows and populating the detail rows with individual Web service calls
takes on the order of 45-85 minutes, which is the real performance
problem. The data is usable without the detail information, but
minimally so.
The Question
============
Assuming we can't improve the request/response rate of the Web service
calls or the granularity of the return data, is there a way to
implement some parallelism? I took a hack at it by creating a thread
pool, keeping up to 50 alive at once. Obviously tunable. However, the
problem is that every so often, a response is pretty darn garbled. Is
there a thread-safety issue in net/http that is causing results to be
stepped on? If that's the case, does a different approach suggest
itself? Or is it a "you're screwed, be patient" situation?
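One pattern that sidesteps shared-client trouble entirely: give each worker thread its own client object and feed work through a `Queue` (which is thread-safe in Ruby). Sharing a single connection object across threads is a plausible source of the garbled responses. Below is a minimal sketch under that assumption; `DetailClient` is a hypothetical stand-in for a per-thread XMLRPC::Client, so the example runs without a network:

```ruby
require "thread"

# Hypothetical stand-in for an XMLRPC::Client. The key point is that
# each worker thread constructs its OWN instance rather than sharing one.
class DetailClient
  def fetch_detail(id)
    # Real app: one XML-RPC call over this thread's private connection.
    { id: id, detail: "detail-#{id}" }
  end
end

def fetch_details(ids, workers: 8)
  queue = Queue.new
  ids.each { |id| queue << id }
  results = Queue.new # Queue is also safe for collecting results

  threads = workers.times.map do
    Thread.new do
      client = DetailClient.new # per-thread client, never shared
      loop do
        id = begin
          queue.pop(true) # non-blocking pop; raises ThreadError when empty
        rescue ThreadError
          break
        end
        results << client.fetch_detail(id)
      end
    end
  end
  threads.each(&:join)
  Array.new(results.size) { results.pop }
end
```

A pool of 8-16 such workers is usually enough to saturate a connection; 50 threads each holding an open HTTP connection may hit server-side limits before it buys more throughput, so the worker count is worth tuning downward first.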
Thanks for reading, and HAHAHAHAHAHAHA is a perfectly acceptable
answer.
--steve