vd12005
Hello,
While experimenting with writing an inverted index (see:
http://en.wikipedia.org/wiki/Inverted_index), I ran out of memory with
a classic dict (I have thousands of documents and millions of terms;
stemming and other filtering are not considered yet, I wanted to understand
how to handle GBs of text first). I found ZODB and tried to use it a bit,
but I think I must be misunderstanding how to use it, even after reading
http://www.zope.org/Wikis/ZODB/guide/node3.html...
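To make the memory problem concrete, here is a minimal sketch of the kind of plain-dict index meant above: each term maps to a dict of document id to occurrence count. The function name build_index and the toy documents are illustrative assumptions, not the actual code that ran out of memory.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to {doc_id: occurrences}, entirely in memory."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for term in text.split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


docs = ["the cat sat", "the cat and the dog"]  # toy data, one document per entry
index = build_index(docs)
print(sorted(index["the"].items()))  # prints [(0, 1), (1, 2)]
```

With millions of distinct terms, every term gets its own inner dict, which is what makes this layout expensive in RAM.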
I would like to use it once to build my inverted index, save it to disk
via a FileStorage,
and then reuse this previously created inverted index from the
same FileStorage, but it looks like I am unable to
reread/reload it into memory, or I am missing how to do it...
First, each time I run the code below, it looks like everything is added
a second time. Is there a way to rewrite/replace it instead? And how am I
supposed to use it after the initial creation? I thought that opening the
same FileStorage would reload my object inside dbroot, but it doesn't.
I am also interested in the cache mechanisms: are they transparent?
Or maybe you know a good tutorial for understanding ZODB?
Thanks for any help, regards.
Here is some sample code:
import sys
import transaction
from ZODB import FileStorage, DB
from BTrees.OOBTree import OOBTree
from BTrees.OIBTree import OIBTree
from persistent import Persistent

class IDF2:
    def __init__(self):
        self.docs = OIBTree()
        self.idfs = OOBTree()

    def add(self, term, fromDoc):
        self.docs[fromDoc] = self.docs.get(fromDoc, 0) + 1
        if not self.idfs.has_key(term):
            self.idfs[term] = OIBTree()
        self.idfs[term][fromDoc] = self.idfs[term].get(fromDoc, 0) + 1

    def N(self, term):
        "total number of occurrences of 'term'"
        return sum(self.idfs[term].values())

    def n(self, term):
        "number of documents containing 'term'"
        return len(self.idfs[term])

    def ndocs(self):
        "number of documents"
        return len(self.docs)

    def __getitem__(self, key):
        return self.idfs[key]

    def iterdocs(self):
        for doc in self.docs.iterkeys():
            yield doc

    def iterterms(self):
        for term in self.idfs.iterkeys():
            yield term

storage = FileStorage.FileStorage("%s.fs" % sys.argv[1])
db = DB(storage)
conn = db.open()
dbroot = conn.root()
if not dbroot.has_key('idfs'):
    dbroot['idfs'] = IDF2()
idfs = dbroot['idfs']

for i, line in enumerate(open(sys.argv[1])):
    # considering doc is linenumber...
    for word in line.split():
        idfs.add(word, i)

# Commit the change
transaction.commit()
---
I was expecting:
storage = FileStorage.FileStorage("%s.fs" % sys.argv[1])
db = DB(storage)
conn = db.open()
dbroot = conn.root()
print dbroot.has_key('idfs')
=> to return True
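
To illustrate the behaviour expected above (build once, then find the same object under the same key on the next run), here is a minimal sketch that uses the standard library's shelve module as a stand-in. It is not ZODB and says nothing about how ZODB decides what to persist; the path and the toy documents are assumptions for illustration only.

```python
import os
import shelve
import tempfile

# Hypothetical location for the on-disk index (a fresh temp directory).
path = os.path.join(tempfile.mkdtemp(), "index")

# First run: build the index in memory, then store it under a fixed key.
index = {}
for doc_id, line in enumerate(["a b a", "b c"]):
    for term in line.split():
        postings = index.setdefault(term, {})
        postings[doc_id] = postings.get(doc_id, 0) + 1

with shelve.open(path) as db:
    db["idfs"] = index  # assignment replaces any earlier copy, it does not add to it

# Second run: reopen the same file and read the index back.
with shelve.open(path) as db:
    found = "idfs" in db
    reloaded = db["idfs"]

print(found)  # prints True
```

This is the "key is still there after reopening" behaviour I am hoping to get from dbroot with a FileStorage.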