python file synchronization

silentnights

Hi All,

I have the following problem: an appliance (A) generates records and
writes them to a file (X). The appliance is accessible through FTP
from a server (B). Another central server (C) runs a Django app, and
I need to continuously get the records from file (X) onto it.

The problems are as follows:
1. (A) writes heavily to the file, so copying the file leaves an
incomplete line at the end.
2. I have many (A)s and (B)s that I need to get the data from.
3. I can't afford to lose any records from file (X).

My current implementation is as follows:
1. Server (B) copies the file (X) through FTP.
2. Server (B) makes a copy of file (X) to file (Y.time_stamp), ignoring
the last line to avoid incomplete lines.
3. Server (B) periodically makes copies of file (X) and copies the lines
starting from the previously ignored line to file (Y.time_stamp).

4. Server (C) mounts the diffs_dir locally.
5. Server (C) creates file (Y.time_stamp.lock) in target_dir, then copies
file (Y.time_stamp) to the local target_dir, then deletes
(Y.time_stamp.lock).

6. A daemon running on Server (C) reads the file list from target_dir
and processes those files that don't have a matching *.lock file. This
procedure avoids reading a file until it is completely copied.
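
Steps 2 and 3 amount to an incremental copy that always stops at the last newline. A minimal sketch of that idea, assuming records are newline-terminated; the helper name `copy_complete_lines` and the saved byte offset are illustrative, not the actual implementation described above:

```python
def copy_complete_lines(src_path, dst_path, offset=0):
    """Append every complete line found in src_path at or after byte
    `offset` to dst_path. Return the offset to resume from next time."""
    with open(src_path, "rb") as src:
        src.seek(offset)
        chunk = src.read()
    # Keep only the bytes up to and including the last newline; a trailing
    # partial line is left for the next call.
    end = chunk.rfind(b"\n") + 1  # 0 when the chunk has no newline at all
    if end:
        with open(dst_path, "ab") as dst:
            dst.write(chunk[:end])
    return offset + end
```

The caller has to persist the returned offset between runs; if it is lost, records are either re-copied or skipped.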

The above is implemented and working. The problem is that it requires
so many syncs, has a high overhead, and is hard to debug.

I greatly appreciate your thoughts and suggestions.

Lastly, I want to note that I am not a programming guru, still a noob,
but I am trying to learn from the experts. :)
 
Cameron Simpson

| I have the following problem, I have an appliance (A) which generates
| records and write them into file (X), the appliance is accessible
| throw ftp from a server (B). I have another central server (C) that
| runs a Django App, that I need to get continuously the records from
| file (A).
|
| The problems are as follows:
| 1. (A) is heavily writing to the file, so copying the file will result
| of uncompleted line at the end.
| 2. I have many (A)s and (B)s that I need to get the data from.
| 3. I can't afford losing any records from file (X)
[...]
| The above is implemented and working, the problem is that It required
| so many syncs and has a high overhead and It's hard to debug.

Yep.

I would change the file discipline. Accept that FTP is slow and has no
locking. Accept that reading records from an actively growing file is
often tricky and sometimes impossible depending on the record format.
So don't. Hand off completed files regularly and keep the incomplete
file small.

Have (A) write records to a file whose name clearly shows the file to be
incomplete. Eg "data.new". Every so often (even once a second), _if_ the
file is not empty: close it, _rename_ to "data.timestamp" or
"data.sequence-number", open a new "data.new" for new records.
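
A rough sketch of that write-then-rename discipline, assuming the writer on (A) can be changed and happens to be Python (the thread does not say either). The class name `RotatingRecordWriter` and the "data.<epoch>.<sequence>" naming are made up for illustration:

```python
import os
import time

class RotatingRecordWriter:
    """Write records to an in-progress file, hand it off by renaming."""

    def __init__(self, directory, in_progress="data.new"):
        self.directory = directory
        self.tmp_path = os.path.join(directory, in_progress)
        self.seq = 0
        self.fh = open(self.tmp_path, "ab")

    def write(self, record):
        # One record per line; make sure it ends with exactly one newline.
        self.fh.write(record.rstrip(b"\n") + b"\n")

    def rotate(self):
        """If the in-progress file is non-empty, close it, rename it to a
        completed name, and start a new one. Return the new path or None."""
        self.fh.flush()
        if os.fstat(self.fh.fileno()).st_size == 0:
            return None
        self.fh.close()
        self.seq += 1
        final = os.path.join(
            self.directory, "data.%d.%06d" % (int(time.time()), self.seq))
        os.rename(self.tmp_path, final)  # same directory, so a plain rename
        self.fh = open(self.tmp_path, "ab")
        return final

    def close(self):
        self.fh.close()
```

Calling `rotate()` from a timer (say once a second) keeps the incomplete file small, and anything not named "data.new" is safe to fetch.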

Have the FTP client fetch only the completed files.
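
For the fetching side, something along these lines might do, using the standard `ftplib` module. The host, credentials, directory names and the "data." prefix are placeholders, and it assumes the server's NLST reply contains bare file names, which is common but not guaranteed:

```python
import os
from ftplib import FTP

def completed_names(names, in_progress="data.new", prefix="data."):
    """Pick the completed data files out of a directory listing."""
    return sorted(
        n for n in names
        if n.startswith(prefix) and n != in_progress and not n.endswith(".part")
    )

def fetch_completed(host, user, password, remote_dir, local_dir, seen):
    """Download completed files not yet in `seen` (a set of names)."""
    with FTP(host) as ftp:
        ftp.login(user, password)
        ftp.cwd(remote_dir)
        for name in completed_names(ftp.nlst()):
            if name in seen:
                continue
            # Download under a temporary name, rename once it is whole.
            part = os.path.join(local_dir, name + ".part")
            with open(part, "wb") as out:
                ftp.retrbinary("RETR " + name, out.write)
            os.rename(part, os.path.join(local_dir, name))
            seen.add(name)
    return seen
```

The `seen` set would need to be saved somewhere durable if no record may be lost or fetched twice across restarts.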

You can perform a similar effort for the socket daemon: look only for
completed data files. Reading the filenames from a directory is very
fast if you don't stat() them (i.e. just os.listdir). Just open and scan
any new files that appear.
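
As a sketch of that scan, assuming in-progress files can be recognised by suffix (".new", ".part" and ".lock" here are just examples) and that `handle_record` is whatever the daemon does with one record:

```python
import os

def new_completed_files(directory, seen,
                        in_progress_suffixes=(".new", ".part", ".lock")):
    """Names in `directory` that are complete and not yet processed.
    Uses os.listdir only, so no per-file stat() calls are made."""
    return sorted(
        n for n in os.listdir(directory)
        if n not in seen and not n.endswith(in_progress_suffixes)
    )

def process_new(directory, seen, handle_record):
    """Feed every line of every new completed file to handle_record."""
    for name in new_completed_files(directory, seen):
        with open(os.path.join(directory, name), "rb") as fh:
            for line in fh:
                handle_record(line.rstrip(b"\n"))
        seen.add(name)
```

Run `process_new` in a loop with a short sleep; each file is opened once, after it is already complete.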

That would be my first cut.
--
Cameron Simpson <[email protected]> DoD#743
http://www.cskk.ezoshosting.com/cs/

Performing random acts of moral ambiguity.
- Jeff Miller <[email protected]>
 
Dennis Lee Bieber

| After searching more yesterday, I found that local mv is atomic, so instead
| of creating the lock files, I will copy the new diffs to tmp dir, and after
| the copy is over, mv it to actual diffs dir, that will avoid reading It
| while It's still being copied.

Are your tmp directory and your "diffs" directory on the same
physical volume? If so, "mv" is a rename operation, that only affects
the directory information. If the volumes are different, then "mv"
reverts to a copy/delete file operation.
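
One way to check this from Python is to compare the device IDs that `os.stat` reports for the two directories. This is only a sketch; the function name is made up, and unusual setups (some network or bind mounts) may need a closer look:

```python
import os

def same_filesystem(path_a, path_b):
    """True if both paths report the same device ID, in which case a
    move between them can be done as a plain rename."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev
```

If this returns False for the tmp and diffs directories, moving a file between them involves copying data, so a reader can see a partly written file.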

To avoid problems in the future (say the "diffs" machine is
reconfigured with an additional drive and "tmp" is now mounted on the
new drive) you might be better off taking part of the suggestion to use
a special file name to indicate an "in-work" file...

    diffs.timestamp.part

say, and when ready, just

    mv diffs.timestamp.part diffs.timestamp

This leaves them in the same physical location and directory.
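
In Python the same idea could look roughly like this. The function name `publish_copy` is illustrative, and readers are assumed to skip any name ending in ".part":

```python
import os
import shutil

def publish_copy(src_path, dst_dir, name):
    """Copy src_path into dst_dir under `name`, exposing the final name
    only after the whole file has been written."""
    final = os.path.join(dst_dir, name)
    part = final + ".part"
    with open(src_path, "rb") as src, open(part, "wb") as dst:
        shutil.copyfileobj(src, dst)
        dst.flush()
        os.fsync(dst.fileno())  # push the data to disk before renaming
    # Both names are in the same directory, so this is a rename, not a copy.
    os.replace(part, final)
    return final
```

Because the ".part" file and the final file share a directory, the rename stays a rename even if volumes are rearranged later.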
 
