Zamdrist said:
I'm looking into SQLXMLBulkLoad Com object right now, looks like what
I'll need to use.
Looks like a reasonable place to start. Of course it might not give
you the table structure you want, but consider loading into an ugly
scratch table, then fixing that through the SQL you're familiar with.
Some notes:
1. It's the web. The web is slow, broken and wrong, usually at least
one of these at any time. Get your content down into a local temp file
before trying to throw it at a bulk loader COM object. Even if it
_claims_ to load from any URL, it probably won't do so reliably. The
XMLHTTP COM object will be useful here.
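The shape of that step, sketched in Python rather than the XMLHTTP COM object (the function name is mine):

```python
import os
import tempfile
import urllib.request

def fetch_to_local(url, suffix=".xml"):
    """Pull the feed down to a local temp file and return its path.

    The bulk loader then reads a stable local copy instead of
    streaming straight off a slow, flaky network connection."""
    with urllib.request.urlopen(url) as resp:
        data = resp.read()
    fd, path = tempfile.mkstemp(suffix=suffix)
    with os.fdopen(fd, "wb") as f:
        f.write(data)  # raw bytes: no re-encoding on the way through
    return path
```

Keep the file around until the load succeeds, so a failed run can be retried without another download.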
2. Encoding and entities. You shouldn't need to know about these,
until one day it all breaks.
2.1 Use UTF-8 for everything you make, and allow as many tools as
possible to sniff for encodings on incoming content - they're usually
good at it. Don't specify encodings explicitly unless you have to -
they're a contract to get it right in the future too, and if you can't
guarantee this to work you're just writing tomorrow's bugs for
yourself.
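A minimal sniff-then-decode sketch in Python (the Latin-1 fallback is my assumption - pick whatever your suppliers actually send):

```python
def decode_feed(raw: bytes) -> str:
    """Honour a UTF-8 BOM if present, try strict UTF-8, and only
    then fall back to Latin-1 (which never raises, but may be
    wrong -- so log it whenever you hit that branch)."""
    if raw.startswith(b"\xef\xbb\xbf"):
        return raw[3:].decode("utf-8")
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")
```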
2.2 You shouldn't need entities and shouldn't use them. If things
break because your supplier used them (depressingly common in the RSS
world) it's because they've assumed the pre-declaration of the HTML
entity set (typically accented foreign language characters).
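If you're stuck consuming such a feed, one pragmatic repair - sketched here in Python rather than the COM-era tooling discussed above - is to expand the HTML entity names, but only when the document fails to parse as XML:

```python
import html
import xml.etree.ElementTree as ET

def repair_html_entities(xml_text: str) -> str:
    """XML predeclares only five entities (&amp; &lt; &gt; &quot;
    &apos;); a supplier writing &eacute; is assuming the HTML set,
    so the document won't parse. Expanding the names is a blunt
    fix -- it also expands &amp;, hence only doing it on failure."""
    try:
        ET.fromstring(xml_text)
        return xml_text                 # already well-formed: leave alone
    except ET.ParseError:
        return html.unescape(xml_text)  # &eacute; -> é, etc.
```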
3. I wouldn't use VBScript for _anything_, JavaScript / JScript beats
it every time (just for the error handling, if nothing else).
4. You need an XML Schema (XSD) to make SQLXMLBulkLoad work. This
will affect both the data typing and the inferred table layout.
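That schema is an annotated XSD using the `urn:schemas-microsoft-com:mapping-schema` namespace; a minimal sketch (the element, table, and column names here are made up):

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:sql="urn:schemas-microsoft-com:mapping-schema">
  <!-- Each <Item> element becomes a row in the StagingItems table -->
  <xsd:element name="Item" sql:relation="StagingItems">
    <xsd:complexType>
      <xsd:attribute name="id"    type="xsd:integer" sql:field="ItemID"/>
      <xsd:attribute name="title" type="xsd:string"  sql:field="Title"/>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

The `type` attributes drive the data typing; the `sql:relation` / `sql:field` annotations drive the table layout.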
5. As with any ETL, pull the thing in first to an unconstrained table,
then worry about setting up constraints, indexes and foreign keys
afterwards. Drop them before the import if you have to, then re-create
them once the data is in -- checking them row-by-row during the load
is far too slow.
6. Assume it will break. Don't throw away the old data until you've
loaded the new data. This includes caching the incoming XML document,
validating it against the schema, checking it's not a simple (but
valid) error message saying "Service off-line today", loading it into
the DB (any old table), then loading it into the DB (the right tables,
structure, and fully indexed and keyed). _Any_ of these steps can
fail, and in a large decoupled near-real-time system like this, they
will do so from time to time.
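The staged pipeline above, sketched in Python with the individual stages left as callables (all the names are mine, and the error-page check is a placeholder for whatever your feed actually sends):

```python
def reload_feed(xml_text, validate, load_staging, load_final,
                is_error_page=lambda t: "off-line" in t.lower()):
    """Run every stage before touching the live tables; any stage
    raising leaves the old data untouched."""
    if is_error_page(xml_text):        # valid XML, but not data
        raise ValueError("feed is an error page; keeping old data")
    validate(xml_text)                 # e.g. schema validation
    load_staging(xml_text)             # any old table
    load_final()                       # the right tables, keys, indexes
```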
7. Time your downloads. Time the intervals between them. Watch out for
whether the data is actually changing. Use some variant on an
exponential back-off algorithm so that you catch the changes promptly,
but don't hammer on their servers every few minutes (try measuring the
actual change rate and re-loading at twice this frequency). Maybe they
publish metadata suggesting how often to try, or they offer a
lightweight notification service to tell you when to grab the big one.
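A sketch of that adaptive interval (the halving/doubling factors and the bounds are my assumptions):

```python
def next_interval(changed: bool, interval: float,
                  floor: float = 300.0, ceiling: float = 86400.0) -> float:
    """Poll faster while the feed is actually changing, back off
    exponentially while it isn't; clamp between 5 minutes and a day."""
    interval = interval / 2 if changed else interval * 2
    return max(floor, min(ceiling, interval))
```

Feed each download's "did the content change?" answer back in, and the interval tracks the feed's real change rate instead of a fixed timer.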
8. Log everything. Log when you tried, when it worked, when it failed,
and how long it took. If it takes twice as long to download at 3am,
then don't try and grab it at 3am! They're probably running some
cron'ed batch job at that time of night and you're hitting it. Making
reliable systems means caring about this level of trivia.
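Point 8 as a thin Python wrapper (the logger name and messages are mine):

```python
import logging
import time

log = logging.getLogger("feedloader")

def timed(fetch):
    """Log when we tried, whether it worked, and how long it took --
    the raw material for spotting that 3am slowdown."""
    start = time.monotonic()
    try:
        result = fetch()
    except Exception:
        log.error("fetch failed after %.1fs", time.monotonic() - start)
        raise
    log.info("fetch ok in %.1fs", time.monotonic() - start)
    return result
```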
9. Be polite. Hammering a feed service's servers too much will tend to
get your IP blocked.