Q: Architecting large Web service download app in Ruby


s.ross


I've run up against a performance bottleneck (that I could have
predicted), but fear my design is causing more grief. Here's an outline:

Background
==========

My app has to download from 1,000 to 20,000 rows from a third-party
Web service (over HTTP, using xml-rpc, which uses net/http). No row
carries any indicator of when it was last updated, so a local cache is
difficult, if not impossible, to maintain reliably. Everything has to
be considered "dirty". The rows can be downloaded in batches of up to
100, so it takes on the order of 90-120 seconds over a quick net
connection to grab them all and insert them into the primary table
synchronously.

The rub is that each row has a single detail row that is quite a bit
bulkier. Each of the master rows has an Active flag, and at any given
time between 50 and 80% of them are active. Iterating all the active
rows and populating the detail rows with individual Web service calls
takes on the order of 45-85 minutes, which is the real performance
problem. The data is usable without the detail information, but
minimally so.

The Question
============

Assuming we can't improve the request/response rate of the Web service
calls or the granularity of the return data, is there a way to
implement some parallelism? I took a hack at it by creating a thread
pool that keeps up to 50 threads alive at once (obviously tunable). However, the
problem is that every so often, a response is pretty darn garbled. Is
there a thread-safety issue in net/http that is causing results to be
stepped on? If that's the case, does a different approach suggest
itself? Or is it a "you're screwed, be patient" situation?
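
Roughly, the shape of what I tried (fetch_detail, save_detail, and
row_ids here are stand-ins for the real service call, insert step, and
master-row keys):

    require 'thread'

    POOL_SIZE = 50  # tunable

    queue = Queue.new
    row_ids.each { |id| queue << id }      # the master-row keys

    workers = Array.new(POOL_SIZE) do
      Thread.new do
        loop do
          id = queue.pop(true) rescue break  # non-blocking pop; done when drained
          save_detail(id, fetch_detail(id))  # one Web service call per row
        end
      end
    end
    workers.each(&:join)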

Thanks for reading, and HAHAHAHAHAHAHA is a perfectly acceptable
answer :)

--steve
 

James Gray

> Is there a thread-safety issue in net/http that is causing results
> to be stepped on?

I can't say I know for sure, but I'm doubting it.

> If that's the case, does a different approach suggest itself?

Well, you can fork() processes instead of threads, if you're not on
Windows. That would eliminate the thread-safety concerns. You may have
to work out some IPC issues, though, if each process can't work totally
on its own.
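
Roughly this shape, assuming the ids can be partitioned up front so
each child owns its slice (chunks, fetch_detail, and save_detail are
stand-in names):

    # Partition the ids up front so each child owns its slice and needs
    # no IPC; every child gets its own interpreter and connections.
    chunks.each do |chunk|
      fork do
        chunk.each { |id| save_detail(id, fetch_detail(id)) }
      end
    end
    Process.waitall  # block until every child has exited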

James Edward Gray II
 

s.ross

> I can't say I know for sure, but I'm doubting it.
>
> Well, you can fork() processes instead of threads, if you're not on
> Windows. That would eliminate the thread-safety concerns. You may have
> to work out some IPC issues, though, if each process can't work
> totally on its own.
>
> James Edward Gray II

Ok, I looked into it further, and there is a thread-safety issue, and
it's in the xml-rpc library. If you use the #call method from several
threads, one thread's response buffer can be overwritten by a response
arriving in another thread. If you use #call_async instead, a new
server connection and response buffer are used for each call, which
makes it thread-safe. And the performance win is astonishing!
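
Something like this sketch, with a made-up endpoint and method name;
the point is that each call gets its own connection:

    require 'xmlrpc/client'

    client = XMLRPC::Client.new2('http://example.com/rpc')  # made-up endpoint

    threads = active_ids.map do |id|
      Thread.new do
        # call_async opens a fresh connection and response buffer per
        # call, so threads can't clobber each other the way they can
        # when sharing one #call connection.
        save_detail(id, client.call_async('detail.fetch', id))
      end
    end
    threads.each(&:join)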

Using a separate process with database connections means either DRb and
some interesting IPC, BackgroundRb, or something else like that. I'm
just pleased that I could bring the average turnaround per request
from 65ms to 4ms (averaged over 1,000 combined calls to the master and
detail services). W00t!
 

Phlip

s.ross said:
> Assuming we can't improve the request/response rate of the Web service
> calls or the granularity of the return data, is there a way to
> implement some parallelism?

Run the entire downloader from a cron task.

Your question ass-umes that you must run it all out of one controller
action. Wrong mindset!

And BTW 10,000 XML records should be trivial, so you might look for a bottleneck
there. I would not read them all as a huge Ruby string and then convert them
into a huge DOM model in memory. That would thrash. I would use what I think is
called the "SAX" model of reading, where you register a callback for each node
type, then let your reader stream them in...
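
In Ruby's stdlib, REXML's stream listener gives you that shape; a
sketch (the 'row' element name is a guess at the feed's format):

    require 'rexml/document'
    require 'rexml/streamlistener'

    class RowListener
      include REXML::StreamListener

      def tag_start(name, attrs)
        @current = '' if name == 'row'   # 'row' is an assumed element name
      end

      def text(data)
        @current << data if @current
      end

      def tag_end(name)
        return unless name == 'row'
        process_row(@current)  # handle one row, then let it be discarded
        @current = nil
      end
    end

    # Streams the document through the listener instead of building a DOM.
    REXML::Document.parse_stream(File.new('rows.xml'), RowListener.new)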
 

s.ross

> Run the entire downloader from a cron task.
>
> Your question ass-umes that you must run it all out of one controller
> action. Wrong mindset!

I hope that's not what my question ass-umes. I am able to get the
master records in chunks of 20-100. And they parse just fine. The hope
was to make the detail retrieval of these records happen in parallel
with fetching the next batch -- which I have successfully done.
> And BTW 10,000 XML records should be trivial, so you might look for
> a bottleneck there. I would not read them all as a huge Ruby string
> and then convert them into a huge DOM model in memory. That would
> thrash. I would use what I think is called the "SAX" model of
> reading, where you register a callback for each node type, then let
> your reader stream them in...

Using SAX-style callbacks is just fine in the event you have a poorly
bounded rowset count. I have a pretty well-bounded count, and parsing
the chunked data makes it quite manageable without callbacks.

I had considered the cron task, but that's one step ahead of where I am
right now. I'm running the downloads from the console to determine the
acceptability of how the thing is architected. As I noted in a followup
post to the list, I discovered that using XMLRPC::Client#call can
expose some potential data corruption in a multi-threaded
implementation. XMLRPC::Client#call_async does not have that problem.
By shifting the detail record fetches into threads that begin after
each chunk of master records is read, I increased the effective
processing efficiency by around 2.5x: while the next master Web service
fetch was blocking on its response, all the little detail fetches were
purring right along in their own threads.
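
In outline, the pipeline looks something like this sketch
(each_master_chunk, insert_master, and friends are stand-in names):

    detail_threads = []

    each_master_chunk do |chunk|    # yields 20-100 master rows per batch
      insert_master(chunk)
      # Fire off this chunk's detail fetches, then loop straight back to
      # block on the next master request; these threads run while that
      # request is in flight.
      chunk.select(&:active?).each do |row|
        detail_threads << Thread.new(row) do |r|
          save_detail(r.id, client.call_async('detail.fetch', r.id))
        end
      end
    end

    detail_threads.each(&:join)  # drain any stragglers at the end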

Thx,

Steve
 

hemant


> ... by shifting the detail record fetches into threads that begin
> after each chunk of master records is read, I increased the effective
> processing efficiency by around 2.5x ...
Or port XML-RPC so that it works on top of an evented architecture such
as EventMachine or Packet (in which case you can use traditional
workers for concurrent downloads).
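
For instance, something like this rough sketch with the
em-http-request gem, posting raw XML-RPC payloads (the endpoint and
payload builder are made up):

    require 'eventmachine'
    require 'em-http-request'  # third-party gem

    EM.run do
      pending = ids.size
      ids.each do |id|
        http = EventMachine::HttpRequest.new('http://example.com/rpc').post(
          :head => { 'Content-Type' => 'text/xml' },
          :body => xmlrpc_payload('detail.fetch', id)  # made-up payload builder
        )
        http.callback do
          handle_response(http.response)               # made-up handler
          EM.stop if (pending -= 1).zero?
        end
        http.errback { EM.stop if (pending -= 1).zero? }
      end
    end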
 
