Using python to delta-load files into a central DB

Chris Nethery · Apr 12, 2007

Hello everyone,

I have a challenging issue I need to overcome and was hoping I might gain
some insights from this group.

I am trying to speed up the process I am using, which is as follows:

1) I have roughly 700 files that are modified throughout the day by users,
within a separate application

2) As modifications are made to the files, I use a polling service and mimic
the lock-file strategy used by the separate software application

3) I generate a single 'load' file and bulk insert into a load table

4) I update/insert/delete from the load table

This is just too time consuming, in my opinion.

At present, users of the separate application can run recalculation
functions that modify all 700 files at once, causing my code to take the
whole ball of wax, rather than just the data that has changed.

What I would like to do is spawn separate processes and load only the delta
data. The data must be 100% reliable, so I'm leary of using something like
difflib. I also want to make sure that my code scales since the number of
files is ever-increasing.

I would be grateful for any feedback you could provide.

Thank you,

Chris Nethery

Gabriel Genellina · Apr 13, 2007

At present, users of the separate application can run recalculation
functions that modify all 700 files at once, causing my code to take the
whole ball of wax, rather than just the data that has changed.

Are they text files, or what?
What kind of modifications? some lines changed/deleted/added? a column
recalculated along the whole file?

What I would like to do is spawn separate processes and load only the
delta
data. The data must be 100% reliable, so I'm leary of using something
like
difflib. I also want to make sure that my code scales since the number
of
files is ever-increasing.

Why don't you like difflib? AFAIK it has no known bugs.

Chris Nethery · Apr 13, 2007

Gabriel,

Thank you for your reply.

Yes, they are tab-delimited text files that will change very little
throughout the day.

But, this is messy, antiquated 80s junk, nonetheless.

Rows are designated either by a row type or they contain a comment. Each
row type has an identity value, but the 'comment' rows do not. The comment
rows, however, are logically associated with the last occurring row type.
When I generate my bulk insert file, I add the identity of the last
occurring row type to the comment rows, and generate and populate an
additional identity column in order to retain the order of the comments.

Generally rows will either be added or changed, but sometimes rows will be
removed. Typically, only 1-5 new rows will be added to a file in a given
day, but users sometimes make manual corrections/deletions to older rows and
sometimes certain column values are recalculated.

Did I mention that the header contains another implied hierarchy?
Fortunately, I can just ignore it and strip it off.

Thank you,

Chris Nethery

Gabriel Genellina · Apr 13, 2007

Yes, they are tab-delimited text files that will change very little
throughout the day.
But, this is messy, antiquated 80s junk, nonetheless.
Ugh...

Rows are designated either by a row type or they contain a comment. Each
row type has an identity value, but the 'comment' rows do not. The
comment
rows, however, are logically associated with the last occurring row type.
When I generate my bulk insert file, I add the identity of the last
occurring row type to the comment rows, and generate and populate an
additional identity column in order to retain the order of the comments.
Generally rows will either be added or changed, but sometimes rows will
be
removed. Typically, only 1-5 new rows will be added to a file in a given
day, but users sometimes make manual corrections/deletions to older rows
and
sometimes certain column values are recalculated.

http://tgolden.sc.sabren.com/python/win32_how_do_i/watch_directory_for_changes.html

You could keep a copy of all files - let's say, as they were yesterday.
When you want to process the changes, iterate over all files and see if
they are newer than your copy. You could use the filecmp module:
http://docs.python.org/lib/module-filecmp.html
For each modified file: load it, and process the comments adding the
associated row type and the identity. Just do the same with the
"yesterday" file. (I assume they're not so big that you can keep both in
memory). You have then two lists of lines; then, use the functions in
module difflib to detect the changed lines; based on those results,
generate your database inserts/deletes/updates.

This way you will not process the unchanged files, and inside each file,
you will ignore unchanged lines. At least in principle it should be faster
than redoing all from scratch each time...

Did I mention that the header contains another implied hierarchy?
Fortunately, I can just ignore it and strip it off.

good - I imagine it's enough work as it is now...

Chris Nethery · Apr 13, 2007

Gabriel,

I think that would work well. Also, thank you for suggesting the use of
filecmp. I have never used this module, but it looks like a much better
solution than what I had been doing previously--using os.stat and performing
a DB lookup in order to verify that the filename and timestamp existed in a
'file update' table. Also, if the only limitation to difflib is that both
files reside in memory, I should be fine. The largest of all of these files
is just over 200k, which should be fine. If memory serves me right, I can't
use more than 4MB, so I should be fine. And, if I spawn separate processes
for generating the delta files, I should be able to speed things up even
more.

Thanks again for your help!

Best Regards,

Christopher Nethery

How can I upload a tar.bz2 file to OpenStack swift object storage container using the Python swift client?	2	Mar 22, 2024
I need help in understanding these files on my phone, Could someone help me understand these files? Urgent help needed. Please help.	3	Jun 4, 2023
Need to load 200 users into Form security DB	7	Mar 5, 2009
python daemon - compress data and load data into MySQL by pyodbc	9	Sep 3, 2009
Auto-authorization for Google Drive using Python in Mac OS X	1	Nov 4, 2013
How to load a GMAP into a Div using Ajax. cross-domain problem??	1	Aug 11, 2009
importing .txt data into sqlite using python	0	Dec 17, 2008
How to round trip python and sqlite dates	1	Nov 17, 2013

Using python to delta-load files into a central DB

Chris Nethery

Gabriel Genellina

Chris Nethery

Gabriel Genellina

Chris Nethery

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads