Ara.T.Howard wrote:
plus dirwatch is really designed to setup a processing system
I don't know the internals nor the api for dirwatch, but could you explain
where the difference would be ?
well, dirwatch is an application vs. an api. so you don't have something
like
  open('directory').on('created') do |file|
    puts "#{ file } created"
  end
or however you might imagine an api for watching directory events...
with dirwatch, which is a command line tool, you'd do something like this to
setup a watch
~ > dirwatch some_directory create
this initializes an sqlite database, config files, log files, generates sample
scripts, etc. all this will end up in ./some_directory/.dirwatch/. example:
jib:~ > mkdir some_directory
jib:~ > dirwatch some_directory/ create
---
/home/ahoward/some_directory:
dirwatch_dir : /home/ahoward/some_directory/.dirwatch
db : /home/ahoward/some_directory/.dirwatch/db
logs_dir : /home/ahoward/some_directory/.dirwatch/logs
config : /home/ahoward/some_directory/.dirwatch/dirwatch.conf
commands_dir : /home/ahoward/some_directory/.dirwatch/commands
if we peeked in dirwatch.conf we'd see something like
...
...
...
actions:
  updated :
    -
      command: simple.sh
      type: simple
      pattern: ^.*$
      timing: sync
    -
      command: yaml.rb
      type: simple
      pattern: ^.*$
      timing: sync
...
...
...
(did i mention i love yaml? ;-) )
the 'actions' section is where you setup what to do on certain events. the
possible events are 'created', 'modified', 'deleted', or 'existing' (all of
which are pretty obvious) and the action 'updated' which is the union of
'created' and 'modified'. so this config is saying that, whenever a file is
updated we'll run two commands 'simple.sh' and 'yaml.rb'. note that a list of
commands can be specified - they will be run in that order. each command in
the list is configured with a few parameters
command:
  the command to run. the .dirwatch/commands/ directory is pre-pended to PATH
  when running commands so it's convenient to put them there. the
  example/auto-generated commands are in that directory.
type:
  this is the calling convention. for example simple commands are called
  like
    simple.sh file_that_was_updated mtime_of_that_file
  once for each file. yaml commands are called like
    yaml.rb < (list of __every__ updated file and its mtime on stdin in yaml format)
  there are two other types but essentially you just have a choice - your
  script is run once for every file or it gets all the files at once on
  stdin.
pattern:
  only files matching this regex will get passed to this command. dirwatch
  itself has a --pattern option which causes it to see only files matching
  that pattern but that affects everything. this is on a per command basis.
  so you might see
    updated :
      -
        command: gif2png
        type: simple
        pattern: ^.*\.gif$
        timing: sync
      -
        command: png2ps
        type: simple
        pattern: ^.*\.png$
        timing: sync
timing:
  'sync' or 'async': whether we wait for each command to finish or just spawn
  it in the background and collect exit_status later. async is extremely
  dangerous on systems that could update 1,000,000 files at once.
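to make the per-command pattern idea concrete, here's a rough ruby sketch of that kind of filtering. this is not dirwatch's actual code - the Command struct and the paths_for helper are made up for illustration, and the patterns are just the ones from the example config.

```ruby
# hypothetical stand-in for one entry of the 'updated' list in dirwatch.conf
Command = Struct.new(:command, :type, :pattern, :timing)

# return only the paths this command's pattern matches
def paths_for(command, paths)
  regex = Regexp.new(command.pattern)
  paths.select { |path| regex.match?(path) }
end

commands = [
  Command.new('gif2png', 'simple', '^.*\.gif$', 'sync'),
  Command.new('png2ps',  'simple', '^.*\.png$', 'sync'),
]
updated = %w[ a.gif b.png notes.txt ]

# each command only sees the files its own pattern selects
commands.each do |c|
  puts "#{ c.command } <- #{ paths_for(c, updated).inspect }"
end
```

notes.txt matches neither pattern so, in this sketch, no command would ever be handed it.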
next you'd simply start dirwatch using
jib:~ > dirwatch some_directory/ watch
I, [2005-07-21T09:04:48.668571 #27750] INFO -- : ** STARTED **
I, [2005-07-21T09:04:48.669050 #27750] INFO -- : config </home/ahoward/some_directory/.dirwatch/dirwatch.conf>
I, [2005-07-21T09:04:48.669252 #27750] INFO -- : flat <false>
I, [2005-07-21T09:04:48.669324 #27750] INFO -- : files_only <false>
I, [2005-07-21T09:04:48.682278 #27750] INFO -- : no_follow <false>
I, [2005-07-21T09:04:48.682358 #27750] INFO -- : pattern <>
I, [2005-07-21T09:04:48.682461 #27750] INFO -- : n_loops <>
I, [2005-07-21T09:04:48.682629 #27750] INFO -- : interval <00:05:00>
I, [2005-07-21T09:04:48.683028 #27750] INFO -- : lockfile </home/ahoward/some_directory/.dirwatch.lock>
I, [2005-07-21T09:04:48.683147 #27750] INFO -- : tmpwatch[all] <false>
I, [2005-07-21T09:04:48.683213 #27750] INFO -- : tmpwatch[nodirs] <false>
I, [2005-07-21T09:04:48.683278 #27750] INFO -- : tmpwatch[force] <true>
I, [2005-07-21T09:04:48.683454 #27750] INFO -- : tmpwatch[age] <30 days> == <2592000.0s>
I, [2005-07-21T09:04:48.683530 #27750] INFO -- : tmpwatch[rm] <rm_rf>
...
...
...
now, if i dropped a file into some_directory/ in another terminal:
jib:~/some_directory > touch a
i'd see this in the terminal running dirwatch
I, [2005-07-21T09:06:13.721967 #27839] INFO -- : ACTION.UPDATED.0.0 - cmd : simple.sh '/home/ahoward/some_directory/a' '2005-07-21 15:05:38.000000'
I, [2005-07-21T09:06:13.795296 #27839] INFO -- : ACTION.UPDATED.0.0 - exit_status : 0
the 'ACTION.UPDATED.0.0' is a unique tag that makes finding the exit_status easy
in the event that the command was run 'async' and its exit_status ends up in
the log 4000 lines later...
when running from the console like this the stdout of the command run shows
too, so i also saw this - the output of running simple.sh - in the terminal
running dirwatch:
dirwatch_dir: </home/ahoward/some_directory>
dirwatch_action: <updated>
dirwatch_type: <simple>
dirwatch_n_paths: <1>
dirwatch_path_idx: <0>
dirwatch_path: </home/ahoward/some_directory/a>
dirwatch_mtime: <2005-07-21 15:05:38.000000>
dirwatch_pid: <27839>
dirwatch_id: <ACTION.UPDATED.0.0>
command_line: </home/ahoward/some_directory/a 2005-07-21 15:05:38.000000>
path: </home/ahoward/some_directory/a>
mtime: <2005-07-21 15:05:38.000000>
simple.sh basically just prints its environment and the argv it was called
with, here's the whole script:
jib:~/some_directory > cat .dirwatch/commands/simple.sh
#!/bin/sh
echo "dirwatch_dir: <$DIRWATCH_DIR>"
echo "dirwatch_action: <$DIRWATCH_ACTION>"
echo "dirwatch_type: <$DIRWATCH_TYPE>"
echo "dirwatch_n_paths: <$DIRWATCH_N_PATHS>"
echo "dirwatch_path_idx: <$DIRWATCH_PATH_IDX>"
echo "dirwatch_path: <$DIRWATCH_PATH>"
echo "dirwatch_mtime: <$DIRWATCH_MTIME>"
echo "dirwatch_pid: <$DIRWATCH_PID>"
echo "dirwatch_id: <$DIRWATCH_ID>"
echo "command_line: <$@>"
path=$1
mtime=$2
echo "path: <$path>"
echo "mtime: <$mtime>"
you'll notice quite a bit of information is passed via the environment and
that the mtime is also passed in on the command line. typical programs won't
use all this - but it's there. 'dirwatch --help' explains the meaning of
these environment variables.
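since commands can be any executable, the same thing could be written in ruby. a minimal sketch, assuming only the environment variable names simple.sh prints above - the describe helper and DIRWATCH_KEYS constant are invented for this example:

```ruby
# the suffixes of the DIRWATCH_* variables shown in simple.sh
DIRWATCH_KEYS = %w[ DIR ACTION TYPE N_PATHS PATH_IDX PATH MTIME PID ID ]

# build the same kind of report simple.sh echoes. env is anything that
# answers [] with a string key (ENV or a plain hash), argv is the
# [path, mtime] pair a 'simple' command is called with.
def describe(env, argv)
  lines = DIRWATCH_KEYS.map do |key|
    "dirwatch_#{ key.downcase }: <#{ env["DIRWATCH_#{ key }"] }>"
  end
  path, mtime = argv
  lines << "path: <#{ path }>" << "mtime: <#{ mtime }>"
end

puts describe(ENV, ARGV)
```

run outside of dirwatch the variables are simply unset, so every value prints as <>.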
so, normally you don't run like that (from the console) and instead have
something like this in your crontab to maintain an 'immortal' daemon
*/15 * * * * dirwatch /home/ahoward/some_directory watch --daemon
this does NOT start a daemon every fifteen minutes. the daemon always sets up
a lockfile and refuses to start if one is already running. so, this just
makes sure exactly one daemon is running at all times - even after machine
reboots or if some bug causes dirwatch to crash. this may seem a bit odd but
those of you that don't have root on all your boxes in the office will
understand why it can work like that - you can setup robust daemons without
any special privileges. of course you can start it from init.d and it
supports 'start', 'stop', and 'restart' arguments too so this is trivial.
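the 'refuse to start if one is already running' idea can be sketched in a few lines of ruby. note this uses File#flock as a stand-in - i'm not showing how dirwatch's lockfile handling really works, and acquire_lock is a made-up name. also note that on platforms where flock is emulated with fcntl locks, two opens inside one process may not conflict the way they do here.

```ruby
require 'tmpdir'

# try to take an exclusive, non-blocking lock. returns the open file on
# success (a daemon would keep it open for its whole life) or nil when
# someone else already holds the lock.
def acquire_lock(path)
  file = File.open(path, File::RDWR | File::CREAT, 0644)
  if file.flock(File::LOCK_EX | File::LOCK_NB)
    file
  else
    file.close
    nil
  end
end

Dir.mktmpdir do |dir|
  lockfile = File.join(dir, '.dirwatch.lock')
  first  = acquire_lock(lockfile)   # plays the already running daemon
  second = acquire_lock(lockfile)   # plays the cron-started attempt
  puts first  ? 'first watcher: running'  : 'first watcher: refused'
  puts second ? 'second watcher: running' : 'second watcher: refused'
  first.close if first
end
```

the cron entry then costs almost nothing: every fifteen minutes a process starts, fails to get the lock, and exits.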
so that's it basically. dirwatch simply scans a directory, compares what it
finds to what's in its database (sqlite), and runs appropriate actions in the
way you've configured it to do, and then sleeps for a while. it never stops,
automatically rolls logs, and does some other stuff too. there's a whole lot
of options like recursing into subdirectories, ignoring anything that's not a
file, a tmpwatch like facility built-in, etc. but you can read about that
with --help.
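the scan-and-compare step boils down to something like the following. it's only a sketch: plain hashes of path => mtime stand in for the sqlite inventory, and classify is a hypothetical helper, not part of dirwatch.

```ruby
# compare a fresh scan against the stored inventory and sort every path
# into the actions described above. 'updated' is the union of 'created'
# and 'modified'.
def classify(stored, scanned)
  events = { 'created' => [], 'modified' => [], 'deleted' => [], 'existing' => [] }
  scanned.each do |path, mtime|
    if !stored.key?(path)
      events['created'] << path       # not in the database yet
    elsif stored[path] != mtime
      events['modified'] << path      # known, but mtime changed
    else
      events['existing'] << path      # known and untouched
    end
  end
  events['deleted'] = stored.keys - scanned.keys
  events['updated'] = events['created'] + events['modified']
  events
end

stored  = { 'a' => 100, 'b' => 200, 'c' => 300 }
scanned = { 'a' => 100, 'b' => 250, 'd' => 400 }
classify(stored, scanned).each do |action, paths|
  puts "#{ action }: #{ paths.inspect }"
end
```

after firing the configured commands for each bucket, a watcher would save the scan as the new inventory and sleep until the next loop.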
cheers.
btw. i inlined the output of --help below. note that i just did a massive
re-write so some of this is a little off, but it's close.
-a
NAME
dirwatch v0.9.0
SYNOPSIS
dirwatch [ options ]+ mode [ directory = ./ ]
DESCRIPTION
dirwatch is a tool used to rapidly build processing systems from file system
events.
dirwatch manages an sqlite database that mirrors the state of a directory and
then triggers user definable event handlers for certain filesystem activities
such as file creation, modification, deletion, etc. dirwatch can also implement
a tmpwatch like behaviour to ensure files of a certain age are removed from
the directory being watched. dirwatch normally runs as a daemon process by
first synchronizing the database inventory with that of the directory and then
firing appropriate triggers as they occur.
-----------------------------------------------------------------------------
the following actions may have triggers configured for them
-----------------------------------------------------------------------------
created -> a file was detected that was not already in the database
modified -> a file in the database was detected as being modified
updated -> a file was created or modified (union of these two actions)
deleted -> a file in the database is no longer in the directory
existing -> a file in the database still exists in the directory and has not
been modified
-----------------------------------------------------------------------------
the command line 'mode' must be one of the following
-----------------------------------------------------------------------------
create (c) -> initialize the database and supporting files
watch (w) -> monitor directory and trigger actions in the foreground
start (S) -> spawn a daemon watcher in the background
restart (R) -> (re)spawn a daemon watcher in the background
stop (H) -> stop/halt any currently running watcher
status (T) -> determine if any watcher is currently running
truncate (D) -> truncate/delete all entries from the database
archive (a) -> create a tar.gz archive of a watch's directory contents
list (l) -> dump database to stdout in silky smooth yaml format
for all modes the command line argument must be the name of the directory to
which to apply the operation - which defaults to the current directory.
-----------------------------------------------------------------------------
mode: create (c)
-----------------------------------------------------------------------------
initializes a storage directory with all required database files, logs,
command directories, sample configuration, sample programs, etc.
by default the storage dir will be created as a subdirectory of the directory
specified as the 'directory' command line argument, eg:
directory/.dirwatch/
the --dirwatch_dir option can be used to specify an alternate location. this
is particularly important to use if you, for instance, have an external
program like tmpwatch running which might delete this directory!
when a dirwatch storage directory is created a few files and directories are
created underneath it. the hierarchy is
directory/.dirwatch/
  commands/
  logs/
  db
  dirwatch.conf
  dirwatch.pid
where
commands/ -> any programs placed here will be automatically found as
this location is added to PATH
logs/ -> logs are kept here and are auto-rolled so no scrubbing is needed
db -> this is an sqlite database file
dirwatch.conf -> a yaml configuration file used to configure which commands
to trigger for which actions
dirwatch.pid -> a file containing the pid of the daemon process
examples:
0) initialize the directory incoming_data/ to be dirwatched using all
defaults
~ > dirwatch create incoming_data/
1) initialize the directory incoming_data/ to be dirwatched storing all
metadata in /usr/local/dirwatch/incoming_data
~ > dirwatch create incoming_data/ --dirwatch_dir=/usr/local/dirwatch/incoming_data/
-----------------------------------------------------------------------------
mode: start (S)
-----------------------------------------------------------------------------
dirwatch is normally run in daemon mode. the start mode is equivalent to
running in 'watch' mode with the '--daemon' and '--quiet' flags.
examples:
~ > dirwatch start incoming_data/
-----------------------------------------------------------------------------
mode: restart (R)
-----------------------------------------------------------------------------
'restart' mode checks a watcher's pidfile and either restarts the currently
running watcher or starts a new one as in 'start' mode. this is equivalent to
sending SIGHUP to the watcher daemon process.
examples:
~ > dirwatch restart incoming_data/
-----------------------------------------------------------------------------
mode: stop (H)
-----------------------------------------------------------------------------
'stop' mode checks for any process watching the specified directory and kills
this process if it exists. this is equivalent to sending TERM to the watcher
daemon process. the process will not exit immediately but will do at the
first possible safe opportunity. do not kill -9 the daemon process.
examples:
~ > dirwatch stop incoming_data/
-----------------------------------------------------------------------------
mode: status (T)
-----------------------------------------------------------------------------
'status' mode reports whether or not a watcher is running for the given
directory.
examples:
~ > dirwatch status incoming_data/
-----------------------------------------------------------------------------
mode: truncate (D)
-----------------------------------------------------------------------------
'truncate' (delete) mode atomically empties the database of all state.
examples:
~ > dirwatch truncate incoming_data/
-----------------------------------------------------------------------------
mode: archive (a)
-----------------------------------------------------------------------------
archive mode is used to atomically create a tgz file of the storage
directory for a given directory while respecting the locking subsystem.
examples:
~ > dirwatch archive incoming_data/
essentially this is useful for making hot backups. your system must have the
tar command for this to operate.
-----------------------------------------------------------------------------
mode: watch (w)
-----------------------------------------------------------------------------
this is the biggie.
dirwatch is designed to run as a daemon, updating the database inventory at
the interval specified by the '--interval' option (5 minutes by default) and
firing appropriate trigger commands. two watchers may not watch the same
dir simultaneously and attempting to start a second watcher will fail when
the second watcher is unable to obtain the pid lockfile. it is a non-fatal
error to attempt to start another watcher when one is running and this failure
can be made silent by using the '--quiet' option. the reason for this is to
allow a crontab entry to be used to make the daemon 'immortal'. for example,
the following crontab entry
*/15 * * * * dirwatch directory --daemon --dbdir=0 \
--files_only --flat \
--interval=10minutes --quiet
or (same but shorter)
*/15 * * * * dirwatch directory -D -d0 -f -F -i10m -q
will __attempt__ to start a daemon watching 'directory' every fifteen minutes.
if the daemon is not already running one will be started, otherwise dirwatch will
simply fail silently (no cron email sent due to stderr).
this feature allows a normal user to setup daemon processes that not only will
run after machine reboot, but which will also come back after a crash or any
other fatal program behaviour.
the meaning of the options in the above crontab entry are as follows
--daemon -> become a child of init and run forever
--dbdir -> the storage directory, here the default is specified
--files_only -> inventory files only (default is files and directories)
--flat -> do not recurse into subdirectories (default recurses)
--interval -> generate inventory, at minimum, every 10 minutes
--quiet -> be quiet when failing due to another daemon already watching
as the watcher runs and maintains the inventory it is noted when
files/directories (entries) have been created, modified, updated, deleted, or
are existing. these entries are then handled by user definable triggers as
specified in the config file. the config file is of the format
...
actions :
  created :
    commands :
      ...
  updated :
    commands :
      ...
...
...
where the commands to be run for each trigger type are enumerated. each
command entry is of the following format:
...
-
command : command to run
type : calling convention
pattern : filter files further by this pattern
timing : synchronous or asynchronous execution
...
the meaning of each field is as follows:
command: this is the program to run. the search path for the program is
determined dynamically by the action run. for instance, when a
file is discovered to be 'modified' the search path for the
command will be
dbdir/commands/modified/ + dbdir/commands/ + $PATH
this dynamic path setting simply allows for short pathnames if
commands are stored in the dbdir/commands/* subdirectories.
type: there are four types of commands. the type merely indicates the
calling convention of the program. when commands are run there
are two pieces of information which must be passed to the
program, the file in question and the mtime of that file. the
mtime is less important but programs may use it to know if the file
has been changed since they were spawned. mtime will probably be
ignored for most commands. the four types of commands fall into
two categories: those commands called once for each file and those
types of commands called once with __all__ files
each file:
simple: the command will be called with three arguments: the file
in question, the mtime date, and the mtime time. eg:
command foobar.txt 2002-11-04 01:01:01.1234
expanded: the command will have the strings '@file' and
'@mtime' replaced with appropriate values. eg:
command '@file' '@mtime'
expands to (and is called as)
command 'foobar.txt' '2002-11-04 01:01:01.1234'
all at once:
filter: the stdin of the program will be given a list where each
line contains three items, the file, the mtime date, and
the mtime time.
yaml: the stdin of the program will be given a list where each
entry contains two items, the file and the mtime. the
format of the list is valid yaml and the schema is an
array of hashes with the keys 'path' and 'mtime'.
pattern: all the files for a given action are filtered by this pattern,
and only those files matching pattern will have triggers fired.
timing: if timing is asynchronous the command will be run and not waited
for before starting the next command. asynchronous commands may
yield better performance but may also result in many commands
being run at once. asynchronous commands should not load the
system heavily unless one is looking to freeze a machine.
synchronous commands are spawned and waited for before the next
command is started. a side effect of synchronous commands is
that the time spent waiting may sum to an amount of time greater
than the interval ('--interval' option) specified - if the amount
of time running commands exceeds the interval the next inventory
simply begins immediately with no pause. because of this one
should think of the interval used as a minimum bound only,
especially when synchronous commands are used.
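the difference between the two timings can be sketched like this in ruby. everything here is an assumption made for illustration - run_all, the job hashes and the way tags are attached are not dirwatch's implementation, and the tags merely imitate the 'ACTION.UPDATED.0.0' style seen in the logs.

```ruby
require 'rbconfig'

# run every job. 'sync' jobs are waited for before the next one starts,
# anything else is left running and its exit status collected at the end.
# returns a hash of tag => exit status.
def run_all(jobs)
  pending  = {}   # pid => tag for jobs still running in the background
  statuses = {}

  jobs.each do |job|
    pid = Process.spawn(*job[:argv])
    if job[:timing] == 'sync'
      _, status = Process.wait2(pid)
      statuses[job[:tag]] = status.exitstatus
    else
      pending[pid] = job[:tag]
    end
  end

  # reap the background jobs; the tag ties a late status to its command
  pending.each do |pid, tag|
    _, status = Process.wait2(pid)
    statuses[tag] = status.exitstatus
  end
  statuses
end

ruby = RbConfig.ruby
jobs = [
  { tag: 'ACTION.UPDATED.0.0', timing: 'async', argv: [ruby, '-e', 'sleep 0.2; exit 3'] },
  { tag: 'ACTION.UPDATED.0.1', timing: 'sync',  argv: [ruby, '-e', 'exit 0'] },
]
run_all(jobs).each { |tag, status| puts "#{ tag } - exit_status : #{ status }" }
```

with a million updated files the async branch would try to spawn a million processes at once, which is exactly the freeze-a-machine case mentioned above.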
note that sample commands of each type are auto-generated in the
dbdir/commands directory. reading these should answer any questions regarding
the calling conventions of any of the four types. for other questions see
the sample config, which is also auto-generated.
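for the 'yaml' calling convention, a consumer might look roughly like this. it's a sketch that assumes the stdin payload really is an array of hashes with 'path' and 'mtime' keys as described above; read_entries is a made-up helper and the sample data is invented, with the mtime quoted so it stays a string.

```ruby
require 'yaml'
require 'stringio'

# parse what a 'yaml' type command would receive on stdin and return
# [path, mtime] pairs. Time is permitted in case mtimes arrive unquoted.
def read_entries(io)
  entries = YAML.safe_load(io.read, permitted_classes: [Time]) || []
  entries.map { |entry| [entry.fetch('path'), entry.fetch('mtime').to_s] }
end

sample = <<~YAML
  - path: /home/ahoward/some_directory/a
    mtime: '2005-07-21 15:05:38.000000'
  - path: /home/ahoward/some_directory/b
    mtime: '2005-07-21 15:06:02.000000'
YAML

read_entries(StringIO.new(sample)).each do |path, mtime|
  puts "#{ path } (#{ mtime })"
end
```

a real command would call read_entries($stdin) instead of feeding it a StringIO.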
-----------------------------------------------------------------------------
mode: list (l)
-----------------------------------------------------------------------------
dump the contents of the database in yaml format for easy viewing/parsing
ENVIRONMENT
for dirwatch itself:
export SLDB_DEBUG=1 -> cause sldb library actions (sql) to be logged
export LOCKFILE_DEBUG=1 -> cause lockfile library actions to be logged
for programs run by dirwatch the following environment variables will be set:
DIRWATCH_DIR -> the directory being watched
DIRWATCH_ACTION -> action type, one of 'instance', 'created', 'modified',
'updated', 'deleted', or 'existing'
DIRWATCH_TYPE -> command type, one of 'simple', 'expanded', 'filter', or
'yaml'
DIRWATCH_N_PATHS -> the total number of paths for this action. the paths
themselves will be passed to the program in a different
way depending on DIRWATCH_TYPE, for instance on the
command line or on stdin, but this number will always
be the total number of paths the program should expect.
DIRWATCH_PATH_IDX -> for some command types, like 'simple', the program will
be run more than once to handle all paths since calling
convention only allows the program to be called with
one path at a time. this number is the index of the
current path in such cases. for instance, a 'simple'
program may only be called with one path at a time so
if 10 files were created in the directory that would
result in the program being called 10 times. in each
case DIRWATCH_N_PATHS would be 10 and DIRWATCH_PATH_IDX
would range from 0 to 9 for each of the 10 calls to the
program. in the case of 'filter' and 'yaml' command
types, where every path is given at once on stdin this
value will be equal to DIRWATCH_N_PATHS
DIRWATCH_PATH -> for 'simple' and 'expanded' command types, which are
called once for each path, this will contain the path
the program is being called with. in the case of
'filter' or 'yaml' command types the variable contains
the string 'stdin' implying that all paths are
available on stdin.
DIRWATCH_MTIME -> for 'simple' and 'expanded' command types, which are
called once for each path, this will contain the mtime
the program is being called with. in the case of
'filter' or 'yaml' command types the variable contains
the string 'stdin' implying that all mtimes are
available on stdin.
DIRWATCH_PID -> the pid of dirwatch watcher process
DIRWATCH_ID -> an identifier for this action that will be unique for
any given run of a dirwatch watcher process.
restarting the watcher resets the generator. this
identifier is logged in the dirwatch watcher logs so it is
useful to match program logs with dirwatch logs
PATH -> the normal shell path. for each program run the PATH
is modified to contain the commands dir of the dirwatch
watcher process. normally this is
$DIRWATCH_DIR/.dirwatch/commands/
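one use for DIRWATCH_N_PATHS and DIRWATCH_PATH_IDX is letting a per-file command notice when it is handling the last path of a batch. a small sketch, assuming the 0-based indexing described above; last_call_in_batch? is a hypothetical helper and the loop just simulates the ten calls from the example.

```ruby
# true when DIRWATCH_PATH_IDX points at the final path of the batch.
# env is ENV or a plain hash with the same string keys.
def last_call_in_batch?(env)
  n_paths = Integer(env.fetch('DIRWATCH_N_PATHS'))
  idx     = Integer(env.fetch('DIRWATCH_PATH_IDX'))
  idx == n_paths - 1
end

# simulate a 'simple' command being called once for each of 10 new files
10.times do |idx|
  env = { 'DIRWATCH_N_PATHS' => '10', 'DIRWATCH_PATH_IDX' => idx.to_s }
  puts "call #{ idx }: last=#{ last_call_in_batch?(env) }"
end
```

this only makes sense for 'simple' and 'expanded' commands, and with async timing the last index is not necessarily the last call to finish.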
FILES
directory/.dirwatch/ -> dirwatch data files
directory/.dirwatch/dirwatch.conf -> default configuration file
directory/.dirwatch/commands/ -> default location for triggers
directory/.dirwatch/db -> sldb/sqlite database
directory/.dirwatch/dirwatch.pid -> default pidfile
directory/.dirwatch/logs/ -> automatically rolled log files
DIAGNOSTICS
success -> $? == 0
failure -> $? != 0
AUTHOR
(e-mail address removed)
BUGS
1 < bugno && bugno < 42
OPTIONS
--help, -h
this message
--log=path, -l
set log file - (default stderr)
--verbosity=verbosity, -v
0|fatal < 1|error < 2|warn < 3|info < 4|debug - (default info)
--config=path
valid path - specify config file (default nil)
--template=[path]
valid path - generate a template config file in path (default stdout)
--dirwatch_dir=dirwatch_dir
specify dirwatch storage dir
--daemon, -d
specify daemon mode
--quiet, -q
be wery wery quiet
--flat, -F
do not recurse into subdirectories
--files_only, -f
consider only files
--no_follow, -n
do not follow links
--pattern=pattern, -p
consider only entries that match pattern
--n_loops=n_loops, -N
loop only this many times before exiting
--interval=interval, -i
sleep at least this long between loops
--lockfile=[lockfile], -k
specify a lockfile path
--show_input, -s
show input to all commands run