Sandy Norton
Hi folks,
I have been mulling over an idea for a very simple Python-based
personal document management system. The idea grew out of the
following typical problem:
I accumulate a lot of files (documents, archives, PDFs, images, etc.)
on a daily basis, and storing them in a hierarchical file system is
simple but unsatisfactory:
- deeply nested hierarchies are a pain to navigate
  and to reorganize
- different file systems have inconsistent and weak schemes
  for storing metadata, e.g. compare the variety of incompatible
  schemes on Windows alone (Office docs vs. PDFs, etc.)
I would like a personal document management system that:
- has adequate, usable performance
- can accommodate data files of up to 50 MB
- is simple and easy to use
- promotes maximum programmability
- allows for the selective replication (or backup) of data
  over a network
- allows for multiple (custom) classification schemes
- is portable across operating systems
The system should promote the following simple pattern:

receive file -> drop it into 'special' folder

after an arbitrary period of doing the above n times -> run application
    for each file in folder:
        if automatic metadata extraction is possible:
            scan file for metadata and populate fields accordingly
            fill in missing metadata
        else:
            enter metadata
        store file

every now and then:
    run replicator function of application -> will back up data
    over a network
    # this will make specified files available to co-workers
    # accessing a much larger web-based non-personal version of the
    # doc management system.
My initial prototyping efforts involved creating a single test table in
MySQL (later to include fields for Dublin Core metadata elements)
with a BLOB field for the data itself. My present dev platform is
Windows XP Pro, MySQL 4.1.1-alpha, MySQL-python connector v0.9.2
and Python 2.3.3. However, I will be testing the same app on Mac OS X
and Linux Mandrake 9.2 as well.
The first problem I've run into is that MySQL or the MySQL
connector crashes when the size of one BLOB reaches a certain point:
in this case, an .avi file of 7.2 MB.
Here's the code:
<code>
import sys, time, os, zlib
import MySQLdb, _mysql

def initDB(db='test'):
    connection = MySQLdb.Connect("localhost", "sa")
    cursor = connection.cursor()
    cursor.execute("use %s;" % db)
    return (connection, cursor)

def close(connection, cursor):
    # close the cursor before its connection
    cursor.close()
    connection.close()

def drop_table(cursor):
    try:
        cursor.execute("drop table tstable")
    except MySQLdb.Error:
        pass

def create_table(cursor):
    cursor.execute('''create table tstable
                      ( id INTEGER PRIMARY KEY AUTO_INCREMENT,
                        name VARCHAR(100),
                        data BLOB
                      );''')

def process(data):
    data = zlib.compress(data, 9)
    return _mysql.escape_string(data)

def populate_table(cursor):
    files = [(f, os.path.join('testdocs', f))
             for f in os.listdir('testdocs')]
    for filename, filepath in files:
        t1 = time.time()
        data = open(filepath, 'rb').read()
        data = process(data)
        # IMPORTANT: you have to quote the binary text even after
        # escaping it.
        cursor.execute('''insert into tstable (id, name, data)
                          values (NULL, '%s', '%s')''' % (filename, data))
        print time.time() - t1, 'seconds for', filepath

def main():
    connection, cursor = initDB()
    drop_table(cursor)
    create_table(cursor)
    populate_table(cursor)
    close(connection, cursor)

if __name__ == "__main__":
    t1 = time.time()
    main()
    print '=> it took total', time.time() - t1, 'seconds to complete'
</code>
0.155999898911 seconds for testdocs\business plan.doc
0.0160000324249 seconds for testdocs\concept2businessprocess.pdf
0.0160000324249 seconds for testdocs\diagram.vsd
0.0149998664856 seconds for testdocs\logo.jpg
<traceback>
Traceback (most recent call last):
  File "test_blob.py", line 59, in ?
    main()
  File "test_blob.py", line 53, in main
    populate_table(cursor)
  File "test_blob.py", line 44, in populate_table
    cursor.execute('''insert into tstable (id, name, data) values
    (NULL, '%s', '%s')''' % (filename, data))
  File "C:\Engines\Python23\Lib\site-packages\MySQLdb\cursors.py", line 95, in execute
    return self._execute(query, args)
  File "C:\Engines\Python23\Lib\site-packages\MySQLdb\cursors.py", line 114, in _execute
    self.errorhandler(self, exc, value)
  File "C:\Engines\Python23\Lib\site-packages\MySQLdb\connections.py", line 33, in defaulterrorhandler
    raise errorclass, errorvalue
_mysql_exceptions.OperationalError: (2006, 'MySQL server has gone away')
</traceback>
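(Aside: error 2006 appearing once the payload crosses a size threshold is commonly tied to the server's `max_allowed_packet` limit, and a plain MySQL `BLOB` column itself holds at most 2^16 - 1 bytes, so multi-MB files would need `MEDIUMBLOB` or `LONGBLOB` in any case. Independent of that, binding the payload as a DB-API parameter lets the driver handle escaping and quoting. A minimal round-trip sketch of that style, using the stdlib `sqlite3` module so it stays self-contained; with MySQLdb the placeholder would be `%s` rather than `?`:)

```python
import sqlite3, zlib, os

def store_file(conn, name, raw):
    # zlib-compress the payload, then bind it as a parameter so the
    # driver handles all quoting/escaping of the binary data.
    blob = zlib.compress(raw, 9)
    conn.execute("INSERT INTO tstable (name, data) VALUES (?, ?)",
                 (name, sqlite3.Binary(blob)))

def fetch_file(conn, name):
    row = conn.execute("SELECT data FROM tstable WHERE name = ?",
                       (name,)).fetchone()
    return zlib.decompress(row[0]) if row else None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tstable "
             "(id INTEGER PRIMARY KEY, name TEXT, data BLOB)")

payload = os.urandom(1 << 16)          # 64 KB of random test bytes
store_file(conn, "sample.bin", payload)
assert fetch_file(conn, "sample.bin") == payload
```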
My questions are:
- Is my test code at fault?
- Is this the wrong approach to begin with, i.e. is it a bad idea to
  store the data itself in the database?
- Am I using the wrong database? (Or is the connector just buggy?)
Thanks to all.
best regards,
Sandy Norton