psycopg2: connect copy_from and copy_to


Thomas Guettler

Hi,

I want to copy data from a production database to
a testing database. So far I have used psycopg2's
copy_from and copy_to with two cursors and a temporary file.

This works, but it would be better if the data
were piped from one cursor to the other without a
temporary file.

The psycopg2 methods look like this:
copy_to(fd, table)    # read data out of the table into fd
copy_from(fd, table)  # write data from fd into the table

Using select() (waiting until the fd has data) does not work,
since I pass in the fd myself. If copy_to() returned a file
descriptor, I could use it to pass the data to copy_from().

I found no way to connect them. Maybe it could be done
with threading, but I would like to avoid threads.

Since the script will run only on Unix, I could use pipes. But
this would fail too, since copy_to() would block before I can
call copy_from(): the pipe's buffer fills up and nobody is
reading the data from the other end.
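
(For reference, the threads-plus-pipe variant I am trying to avoid
would look roughly like this; src_cur, dst_cur and the table name
"mytable" are placeholders:)

import os
import threading

rfd, wfd = os.pipe()
rfile = os.fdopen(rfd, 'r')
wfile = os.fdopen(wfd, 'w')

def producer():
    try:
        src_cur.copy_to(wfile, 'mytable')   # fills the pipe with COPY data
    finally:
        wfile.close()                       # EOF lets copy_from() finish

t = threading.Thread(target=producer)
t.start()
dst_cur.copy_from(rfile, 'mytable')         # drains the pipe concurrently
t.join()
rfile.close()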

Any suggestions?

Thomas
 

Chris

Hi,

I want to copy data from a production database to
a testing database. So far I have used psycopg2's
copy_from and copy_to with two cursors and a temporary file.
[...]
Any suggestions?

Thomas

Doesn't Postgres come with export/import apps? That would be easiest
(and faster).

Else,

prod_cursor.execute('SELECT data FROM production')
for record in prod_cursor.fetchall():
    dev_cursor.execute('INSERT INTO testing VALUES (%s)', record)

that is one way of doing it
 

james.pye

Doesn't Postgres come with export/import apps? That would be easiest
(and faster).

Yes, PostgreSQL core has import/export apps, but they tend to target
general administration rather than transactional loading/moving of
data; i.e., dumping and restoring a database or schema. There is a
pgFoundry project called pgloader that appears to target the more
general scenario of, well, loading data, but I imagine most people end
up writing custom ETL for data flow.
Else,

prod_cursor.execute('SELECT data FROM production')
for record in prod_cursor.fetchall():
    dev_cursor.execute('INSERT INTO testing VALUES (%s)', record)

that is one way of doing it

In any high-volume case you don't want to do that. The current best
practice for loading data into an existing PostgreSQL database is:
create a temp table, load into the temp table using COPY, then merge
from the temp table into the destination (the last part is actually
the tricky one ;).
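
With psycopg2 the pattern might look roughly like this (a sketch
only; the destination table, its "id" column and the dump file are
hypothetical):

import psycopg2

conn = psycopg2.connect(database='testdb')
cur = conn.cursor()

# Stage the rows in a temp table first, loaded via COPY.
cur.execute("CREATE TEMP TABLE staging (LIKE destination INCLUDING DEFAULTS)")
datafile = open('dump.copy')
cur.copy_from(datafile, 'staging')
datafile.close()

# The tricky part: merge. Here, a simple insert-if-missing on "id".
cur.execute("""
    INSERT INTO destination
    SELECT s.* FROM staging s
    WHERE NOT EXISTS (SELECT 1 FROM destination d WHERE d.id = s.id)
""")
conn.commit()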
 

Thomas Guettler

Doesn't Postgres come with export/import apps? That would be easiest
(and faster).

Yes, you can use "pg_dump production ... | psql testdb", but this can
lead to deadlocks if you call it from a Python script that is in the
middle of a transaction: the Python script holds a lock on a table, so
psql can't write to it.
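
(The call from the script looks roughly like this; I have elided the
exact pg_dump arguments above, so the flags here are only
placeholders:)

import subprocess

dump = subprocess.Popen(['pg_dump', '--data-only', 'production'],
                        stdout=subprocess.PIPE)
load = subprocess.Popen(['psql', 'testdb'], stdin=dump.stdout)
dump.stdout.close()   # let pg_dump receive SIGPIPE if psql exits early
load.wait()
dump.wait()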

I don't think calling pg_dump and psql/pg_restore is faster.

prod_cursor.execute('SELECT data FROM production')
for record in prod_cursor.fetchall():
    dev_cursor.execute('INSERT INTO testing VALUES (%s)', record)

I know, but COPY is much faster.

Thomas
 

james.pye

Any suggestions?

If you don't mind trying out some beta-quality software, you can try
my pg_proboscis driver. It has a DBAPI 2.0 interface, but to use
COPY, you'll need the GreenTrunk interface:

import postgresql.interface.proboscis.dbapi2 as db

# yeah, it doesn't use libpq, so you'll need to "spell" everything out.
# And, no dsn either, just keywords.
src = db.connect(user='pgsql', host='localhost', port=5432,
                 database='src')
dst = db.connect(user='pgsql', host='localhost', port=5432,
                 database='dst')

fromq = src.greentrunk.Query("COPY tabl TO STDOUT")
toq = dst.greentrunk.Query("COPY tabl FROM STDIN")

toq(fromq())


It's mostly pure Python, so if you don't have any indexes on the
target table, you'll probably only get about 100,000-150,000 records
per second (of course, it depends on how beefy your CPU is). With
indexes on a large destination table, I don't imagine the pure-Python
COPY being the bottleneck.

$ easy_install pg_proboscis

Notably, it currently (version 0.9) uses the qmark paramstyle, but I
plan to make 1.0 much more psyco. =)
[python.projects.postgresql.org; some of the docs are outdated atm due
to a recent flurry of development =]
 

Chris

Yes, you can use "pg_dump production ... | psql testdb", but this can
lead to deadlocks if you call it from a Python script that is in the
middle of a transaction: the Python script holds a lock on a table, so
psql can't write to it.


I'm used to Oracle, which doesn't exhibit this problem... ;)
 

james.pye

Yes, you can use "pg_dump production ... | psql testdb", but this can
lead to deadlocks if you call it from a Python script that is in the
middle of a transaction: the Python script holds a lock on a table, so
psql can't write to it.

Hrm? Deadlocks where? Have you considered a cooperative user lock?
Are you just copying data, i.e., no DDL or indexes?
What is the script doing? Updating a table with unique indexes?
I don't think calling pg_dump and psql/pg_restore is faster.

Normally it will be. I've heard people cite cases of COPY running at
about a million records per second into "nicely" configured systems.
However, if psycopg2's COPY is implemented in C, I'd imagine it could
achieve similar speeds: psql and psycopg2, both being libpq-based, are
bound to have similar capabilities, assuming interpreted Python code
is avoided when feeding the data to libpq.
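
As for the temp file that started the thread: psycopg2's copy_to()
will write to anything with a write() method, so an in-memory buffer
can stand in for the file. A rough sketch (cursor and table names are
placeholders):

from cStringIO import StringIO   # io.BytesIO on more recent Pythons

buf = StringIO()
src_cur.copy_to(buf, 'mytable')     # COPY ... TO STDOUT, into memory
buf.seek(0)
dst_cur.copy_from(buf, 'mytable')   # COPY ... FROM STDIN, out of memory

(Of course this holds the whole table in RAM, unlike the temp file.)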
I know, but COPY is much faster.

yessir.
 

Thomas Guettler

If you don't mind trying out some beta-quality software, you can try
my pg_proboscis driver. It has a DBAPI 2.0 interface, but to use
COPY, you'll need the GreenTrunk interface:

So far I have been happy with psycopg2. Why are you developing pg_proboscis?

Thomas
 

james.pye

So far I have been happy with psycopg2.

Yeah. psyco is good.
Why are you developing pg_proboscis?

[Good or bad as they may be]

1. Alternate interface ("greentrunk").
2. A non-libpq implementation yields better control over the wire,
which allows:
   a. Custom communication channels (not limited to libpq's worldview).
   b. Leveraging protocol features that libpq's API does not fully
accommodate (think bulk INSERTs using prepared statements with lower
round-trip costs).
   c. Custom "sub-protocols" (I use this to implement a remote Python
command/console [pdb'ing stored Python procedures, zero-network-latency
scripts]).
3. Makes use of binary types to reduce bandwidth usage. (I started
developing this before libpq had the ability to describe statements to
derive statement parameter types and cursor column types(?), so using
the binary format was painful at best.)
4. Has potential for being used in event-driven applications without
threads.
5. Better control/understanding of running queries allows for
automatically interrupting operations when exceptions occur.
   [The last two may be possible using libpq's async interfaces, but
I'm not entirely sure.]
6. Arguably greater (well, *easier* is likely a better word)
portability. While I have yet to get it to work with other Pythons
(PyPy, IronPython, Jython), the potential to work with these alternate
implementations is there. The real impediment is missing/inconsistent
features in those implementations (setuptools support; the missing os
module in IronPython (i know, i know, and I don't care. Last time I
checked, it's missing from the default install that's broken :p);
Jython is still at 2.2, iirc).
7. Bit of a license zealot. psycopg2 is [L?]GPL; pg_proboscis is BSD
[or MIT if you like], like PostgreSQL. (Yes, Darcy's interface is
BSD-licensed (iirc), but it too is libpq-based.)

In sum, it yields greater control over the connection, which I believe
will lead to a more flexible and higher-quality interface than a
libpq-based solution.
[The primary pain point I've had is implementing all the
authentication mechanisms supported by PG.]
 
