So what happens if there’s an error part way? Do you end up with some
records inserted and others not?
Since the load is usually used to populate an empty table, you simply
empty the table and do it again if it doesn't complete for some reason.
If you're doing a partial load you'd normally dump the table and/or take
a backup beforehand.
That’s not acceptable in a situation where other processes are
concurrently making queries on the table.
Obviously, but you're unlikely to be running anything while you're doing
a bulk load. If the database is used to support interactive processing
you'd normally only use bulk loading during recovery after a disk failure
or as part of a database migration or restructuring exercise.
In fact the only case I know of where bulk loads are a normal part of the
data life cycle is data warehousing, where they are the norm.
The last month's transactions are removed from the online system and bulk
loaded into the data warehouse for analysis. In this case the bulk loads
are either done overnight or during a weekend. However, any RDBMS that
has been optimised for data warehousing will almost certainly have the
ability to segment its fact table. This has two benefits:
- you can bulk load a new segment offline and put it online when the
  load is complete. This can be done during normal operation (there's a
  small sketch of the idea after this list).
- the DBMS can automatically split a query and run parallel copies
against each segment before combining the result sets for return
to the program that made the query.
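To make the first point above concrete, here's a rough sketch in Python with
sqlite3. SQLite has no real segmentation, so the segments are faked as one
table per month sitting behind a UNION ALL view, and all table and column
names are invented for the example. The shape of the idea is the same: the
new segment is loaded while it is invisible to queries, then brought online
in one quick step.

    import sqlite3

    con = sqlite3.connect("warehouse.db")

    # Two "segments" that are already online, exposed through a single view
    # that the reporting queries use.
    con.execute("CREATE TABLE IF NOT EXISTS fact_2024_01 (txn_id INTEGER, amount REAL)")
    con.execute("CREATE TABLE IF NOT EXISTS fact_2024_02 (txn_id INTEGER, amount REAL)")
    con.execute("""CREATE VIEW IF NOT EXISTS fact_all AS
                   SELECT * FROM fact_2024_01
                   UNION ALL SELECT * FROM fact_2024_02""")

    # Bulk load the new month into its own segment; queries against fact_all
    # never see it while the load is in progress.
    con.execute("CREATE TABLE IF NOT EXISTS fact_2024_03 (txn_id INTEGER, amount REAL)")
    new_rows = [(1, 9.99), (2, 120.00)]        # stand-in for last month's extract
    with con:                                  # one transaction for the whole load
        con.executemany("INSERT INTO fact_2024_03 VALUES (?, ?)", new_rows)

    # "Put the segment online": recreate the view to include the new table.
    con.execute("DROP VIEW fact_all")
    con.execute("""CREATE VIEW fact_all AS
                   SELECT * FROM fact_2024_01
                   UNION ALL SELECT * FROM fact_2024_02
                   UNION ALL SELECT * FROM fact_2024_03""")
    con.commit()
    con.close()

A real warehouse DBMS does the equivalent with partition attach/detach rather
than view rewrites, but the load-offline-then-publish pattern is the same.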
And how does this address the point of whether to load the new records
into a new table or the same table?
That's a system design choice. There's a limited set of options and it's
up to the designer to choose one:
- if you want to load data into a new table and rename CURRENT->OLD,
  NEW->CURRENT, you have to design all queries to operate on both the OLD
  and CURRENT tables (a sketch of this approach follows the list)
- if your DBMS supports table segmentation and lets you take segments
  offline, you can bulk load as I already described.
- you can stop the system during a quiet time while you do a bulk load
  and/or minimise overheads by using very large transactions.
- you can bite the bullet, accept the performance hit and simply run a
relatively slow load process using small transactions during normal
operation.
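Here's a minimal sketch of the first option (the CURRENT->OLD, NEW->CURRENT
rename), again in Python with sqlite3 and made-up table names (accounts,
accounts_new, accounts_old). A real system would also have to rebuild
indexes, constraints and permissions on the new table, which I've left out.

    import sqlite3

    con = sqlite3.connect("app.db")
    con.execute("CREATE TABLE IF NOT EXISTS accounts (id INTEGER PRIMARY KEY, balance REAL)")

    # Bulk load the replacement data into NEW while CURRENT stays online.
    con.execute("DROP TABLE IF EXISTS accounts_new")
    con.execute("CREATE TABLE accounts_new (id INTEGER PRIMARY KEY, balance REAL)")
    fresh_rows = [(1, 100.0), (2, 250.5)]      # stand-in for the bulk-load source
    with con:                                  # single transaction for the load
        con.executemany("INSERT INTO accounts_new VALUES (?, ?)", fresh_rows)

    # The swap: CURRENT -> OLD, NEW -> CURRENT. Queries written to work on
    # both tables keep working whichever side of the swap they run on.
    con.execute("DROP TABLE IF EXISTS accounts_old")
    con.execute("ALTER TABLE accounts RENAME TO accounts_old")
    con.execute("ALTER TABLE accounts_new RENAME TO accounts")
    con.commit()
    con.close()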
There are different overheads associated with each choice. A good system
designer will understand them, know their relative weight for the target
hardware and database, and use this information to optimise the design.
What I said about transactional and journalling overheads was
explanation: unless you understand those overheads you'll never understand
why you should expect bulk loading to be a lot faster than doing it with
transactions designed to logically group related operations, or (much, much
worse) with implied commits. The latter really is the worst of all worlds
since it maximises journalling overhead.
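If you want to see that overhead for yourself, here's a quick and dirty
Python + sqlite3 sketch that loads the same rows twice: once with an implied
commit after every insert, once inside a single large transaction. The
absolute numbers depend entirely on your hardware and DBMS, but the gap
between the two runs should make the point.

    import sqlite3, time

    rows = [(i, i * 1.5) for i in range(5000)]

    def timed_load(commit_per_row):
        con = sqlite3.connect("load_test.db", isolation_level=None)  # manual transaction control
        con.execute("DROP TABLE IF EXISTS t")
        con.execute("CREATE TABLE t (id INTEGER, val REAL)")
        start = time.perf_counter()
        if commit_per_row:
            for r in rows:                     # an implied commit after every insert
                con.execute("INSERT INTO t VALUES (?, ?)", r)
        else:
            con.execute("BEGIN")
            con.executemany("INSERT INTO t VALUES (?, ?)", rows)
            con.execute("COMMIT")              # one journal flush for the whole batch
        elapsed = time.perf_counter() - start
        con.close()
        return elapsed

    print("implied commit per row:", timed_load(True))
    print("one large transaction :", timed_load(False))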