> I've done this. Not millions, but certainly on the order of tens of
> thousands.
It's a scaling issue. Millions of INSERT or UPDATE requests can
take hours. That's when you need the speedup of bulk loading.
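Roughly, the difference looks like this. A sketch only, assuming the
pymysql driver, a made-up "prices" table, and a tab-separated dump
(the server also has to allow local_infile):

    # Sketch: bulk load vs. row-at-a-time, using the pymysql driver and a
    # hypothetical "prices" table.  Names and paths are made up.
    import pymysql

    conn = pymysql.connect(host="localhost", user="loader", password="x",
                           database="marketdata", local_infile=True)
    cur = conn.cursor()

    # Slow way: one round trip and one index update per row.
    # cur.execute("INSERT INTO prices (symbol, price) VALUES (%s, %s)", row)

    # Bulk way: the server reads the file itself and batches the work.
    cur.execute("""
        LOAD DATA LOCAL INFILE '/tmp/prices.tsv'
        INTO TABLE prices
        FIELDS TERMINATED BY '\\t'
        (symbol, price)
    """)
    conn.commit()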
> May not always be flexible enough.
True; you can't do anything with LOAD DATA INFILE but load data.
If you need selects or joins within inserts, you may have to do it
the long way.
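The long way is ordinary INSERTs, which you can at least batch, or an
INSERT ... SELECT when the new rows come from a join. A sketch, with
made-up table names, reusing the connection from the sketch above:

    # Sketch: the "long way" when the load needs a join.  "staging" and
    # "symbols" are made-up tables; conn/cur are from the sketch above.
    rows = [("IBM", 135.20), ("AAPL", 172.50)]
    cur.executemany(
        "INSERT INTO prices (symbol, price) VALUES (%s, %s)", rows)

    # Rows derived from other tables: keep the join on the server side.
    cur.execute("""
        INSERT INTO prices (symbol, price)
        SELECT s.symbol, s.price
        FROM staging AS s
        JOIN symbols AS k ON k.symbol = s.symbol
    """)
    conn.commit()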
> Why should that be? The database cannot read a text file any faster
> than I can.
Because the indices are updated in one big update, rather than
after each change to the database.
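You can get a similar effect with ordinary INSERTs, at least on MyISAM
tables, by turning key maintenance off around the load and rebuilding
the indexes in one pass afterwards. A sketch (InnoDB largely ignores
DISABLE KEYS):

    # Sketch: defer non-unique index maintenance during a bulk insert on
    # a MyISAM table, then rebuild the indexes in one pass.  Table name
    # is made up; InnoDB ignores DISABLE KEYS.
    cur.execute("ALTER TABLE prices DISABLE KEYS")
    cur.executemany(
        "INSERT INTO prices (symbol, price) VALUES (%s, %s)", rows)
    cur.execute("ALTER TABLE prices ENABLE KEYS")
    conn.commit()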
Also note that there are some issues with doing a huge volume of
updates in one MySQL InnoDB transaction. The system has to keep the
data needed to undo the updates, and there's a limit on the amount of
pending transaction history that can be stored.
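If you are stuck pushing the rows through ordinary transactions,
committing in chunks keeps the pending undo history bounded. A sketch,
with an arbitrary chunk size and the conn/cur from above:

    # Sketch: commit every few thousand rows so InnoDB's undo history
    # stays bounded.  Chunk size is a guess; tune it for your server.
    def insert_in_chunks(conn, cur, rows, chunk=10000):
        for i in range(0, len(rows), chunk):
            cur.executemany(
                "INSERT INTO prices (symbol, price) VALUES (%s, %s)",
                rows[i:i + chunk])
            conn.commit()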
It's common to load data into a new, empty table, then, once
the reload has succeeded, do a RENAME like CURRENT->OLD, NEW->CURRENT.
Rename of multiple tables is atomic and interlocked with other
operations, so you can replace an entire table on a live server.
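The swap itself is a single statement. Roughly, with made-up table
names:

    # Sketch of the reload-then-rename pattern.  The multi-table RENAME
    # is atomic, so other sessions see either the old table or the new
    # one, never a half-loaded table.
    cur.execute("CREATE TABLE prices_new LIKE prices")
    # ... LOAD DATA INFILE into prices_new, then sanity-check it ...
    cur.execute("RENAME TABLE prices TO prices_old, prices_new TO prices")
    cur.execute("DROP TABLE prices_old")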
I have some bulk databases which are updated from external
sources. The PhishTank database is updated with UPDATE statements
every three hours. But the Open Directory database is updated by
downloading a big gzipped file of XML, creating a new database
table, then renaming. That load takes hours, once a week.
(All this applies to MySQL, and to some extent, Postgres.
If you're using SQLite, it's different. But a million records
is big for SQLite, which is, after all, a "lite" database.
At 10,000 records, you don't have to worry about any of this
stuff. At 1,000,000 records, you do.)
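For SQLite, the nearest equivalent to bulk loading is executemany()
inside a single transaction, since committing per row is what kills
you. A sketch, with made-up names:

    # Sketch: SQLite bulk insert with the standard-library sqlite3
    # module, one transaction around the whole batch.  Names are made up.
    import sqlite3

    conn = sqlite3.connect("catalog.db")
    conn.execute("CREATE TABLE IF NOT EXISTS prices (symbol TEXT, price REAL)")
    rows = [("IBM", 135.20), ("AAPL", 172.50)]   # normally read from a file
    with conn:                    # commits once, at the end of the block
        conn.executemany(
            "INSERT INTO prices (symbol, price) VALUES (?, ?)", rows)
    conn.close()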
John Nagle