ZODB for inverted index?

vd12005 · Oct 23, 2006

Hello,

While trying to write an inverted index (see
http://en.wikipedia.org/wiki/Inverted_index), I ran out of memory with
a plain dict. (I have thousands of documents and millions of terms;
stemming and other filtering are not considered yet, as I wanted to
understand how to handle gigabytes of text first.) I found ZODB and tried
to use it a bit, but I think I must be misunderstanding how to use it,
even after reading http://www.zope.org/Wikis/ZODB/guide/node3.html...

I would like to use it once to build my inverted index and save it to
disk via a FileStorage, and then reuse that index later from the same
FileStorage. But it looks like I am unable to reread/reload it into
memory, or I am missing how to do it...

First, each time I run the code below, it looks like everything is added
a second time. Is there a way to rewrite/replace the index instead? And
how am I supposed to use it after the initial creation? I thought that
opening the same FileStorage would reload my object inside dbroot, but it
doesn't. I am also interested in the cache mechanisms: are they
transparent?

Or maybe you know a good tutorial for understanding ZODB?

Thanks for any help, regards.

Here is some sample code:

import sys
import transaction
from ZODB import FileStorage, DB
from BTrees.OOBTree import OOBTree
from BTrees.OIBTree import OIBTree
from persistent import Persistent

class IDF2:
    def __init__(self):
        self.docs = OIBTree()   # doc -> number of terms added from it
        self.idfs = OOBTree()   # term -> OIBTree(doc -> occurrences)
    def add(self, term, fromDoc):
        self.docs[fromDoc] = self.docs.get(fromDoc, 0) + 1
        if not self.idfs.has_key(term):
            self.idfs[term] = OIBTree()
        self.idfs[term][fromDoc] = self.idfs[term].get(fromDoc, 0) + 1
    def N(self, term):
        "total number of occurrences of 'term'"
        return sum(self.idfs[term].values())
    def n(self, term):
        "number of documents containing 'term'"
        return len(self.idfs[term])
    def ndocs(self):
        "number of documents"
        return len(self.docs)
    def __getitem__(self, key):
        return self.idfs[key]
    def iterdocs(self):
        for doc in self.docs.iterkeys():
            yield doc
    def iterterms(self):
        for term in self.idfs.iterkeys():
            yield term

storage = FileStorage.FileStorage("%s.fs" % sys.argv[1])
db = DB(storage)
conn = db.open()
dbroot = conn.root()
if not dbroot.has_key('idfs'):
    dbroot['idfs'] = IDF2()
idfs = dbroot['idfs']

for i, line in enumerate(open(sys.argv[1])):
    # considering doc is linenumber...
    for word in line.split():
        idfs.add(word, i)
# Commit the change
transaction.commit()

---
I was expecting that running this afterwards:

storage = FileStorage.FileStorage("%s.fs" % sys.argv[1])
db = DB(storage)
conn = db.open()
dbroot = conn.root()
print dbroot.has_key('idfs')

would print True.

Larry Bates · Oct 23, 2006

You may want to take a quick look at ZCatalogs. They are
for indexing ZODB objects. I may not be understanding
what you are trying to do. I suspect that you really need
to store everything in a database (MySQL/Postgres/etc) for
maximal flexibility.
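
As a rough sketch of the database route, here is the same kind of counting done with SQLite from the Python standard library, standing in for MySQL/Postgres (an assumption made only so the example is self-contained). The table layout and the function names are hypothetical, not something from the original post.

```python
import sqlite3
from collections import Counter


def open_index(path=":memory:"):
    """Open (or create) a postings table keyed by (term, doc)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS postings ("
        " term TEXT NOT NULL, doc INTEGER NOT NULL, freq INTEGER NOT NULL,"
        " PRIMARY KEY (term, doc))"
    )
    return conn


def add_document(conn, doc_id, text):
    """Store per-term counts for one document; re-adding a doc overwrites it."""
    counts = Counter(text.split())
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO postings (term, doc, freq) VALUES (?, ?, ?)",
            [(term, doc_id, freq) for term, freq in counts.items()],
        )


def total_occurrences(conn, term):
    """Total number of occurrences of term (N in the original class)."""
    row = conn.execute(
        "SELECT COALESCE(SUM(freq), 0) FROM postings WHERE term = ?", (term,)
    ).fetchone()
    return row[0]


def document_frequency(conn, term):
    """Number of documents containing term (n in the original class)."""
    row = conn.execute(
        "SELECT COUNT(*) FROM postings WHERE term = ?", (term,)
    ).fetchone()
    return row[0]


conn = open_index()
add_document(conn, 0, "a b a")
add_document(conn, 1, "b c")
add_document(conn, 0, "a b a")  # indexing the same doc again does not double-count
```

Because (term, doc) is the primary key, running the indexer twice replaces rows instead of adding everything a second time.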

-Larry

vd12005 · Oct 25, 2006

Thanks for your reply.

Anyway, can someone help me with how to "rewrite" and "reload" a class
instance when using ZODB?

regards

Gabriel Genellina · Oct 25, 2006

vd12005 said:
anyway can someone help me on how to "rewrite" and "reload" a class
instance when using ZODB ?

What do you mean?

--
Gabriel Genellina
Softlab SRL

Klaas · Oct 25, 2006

Hi. I'm not familiar with ZODB, but you might consider Berkeley DB,
which behaves like a disk-backed dictionary with an in-memory cache.

-Mike

robert · Oct 26, 2006

You have to use Persistent as the base class:

class IDF2(Persistent):
    ....

and maybe (?) reassign idfs.idfs = idfs.idfs, or set idfs._p_changed = 1 or so. I don't remember the latter exactly.

But I doubt that ZODB's memory management is intelligent enough (perhaps with some extra control?) to really improve your task in terms of memory usage (swapping blackout).

Other ideas:

* This is often the best method to balance memory and disk in extreme index applications: use the filesystem directly (that is, (escaped) filenames/subdirectories) for your index. You just append your pointers to the files. The OS cache is already a good, careful memory/disk balancer, and you can add some extra cache logic in your application. This works best with filesystems that deal well with small files
(but maybe many of your words have long index lists anyway...)
(To reduce the number of files/inodes, you could bulk many items into one pickle/shelve/anydbm file by using sub-hash keys. Example: 1 million words => 10000 files x ~100 sub-entries x 10000 refs.)

* A fast relational/dictionary database (MySQL)

* Advanced memory-mapped file techniques / C OODBMS (ObjectStore/PSE); a 64-bit OS if > 3 GB
(That's the technique telecoms often use to run their tables fast, but this is maybe too advanced...)
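
The first idea above (one file per term, pointers appended to it) could be sketched roughly like this in Python 3. The directory layout, the "t_" filename prefix, the 4-byte little-endian encoding of document ids and all function names are assumptions chosen for illustration.

```python
import os
import struct
import tempfile
import zlib
from urllib.parse import quote


def term_path(index_dir, term):
    """Map a term to an escaped filename inside one of 256 subdirectories."""
    raw = term.encode("utf-8")
    subdir = "%02x" % (zlib.crc32(raw) % 256)
    # quote(..., safe="") escapes '/' and friends; the prefix avoids '.' and '..'
    return os.path.join(index_dir, subdir, "t_" + quote(term, safe=""))


def append_postings(index_dir, term, doc_ids):
    """Append document ids to the term's file as unsigned 32-bit integers."""
    path = term_path(index_dir, term)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "ab") as f:
        f.write(struct.pack("<%dI" % len(doc_ids), *doc_ids))


def read_postings(index_dir, term):
    """Read back every document id recorded for the term, in append order."""
    path = term_path(index_dir, term)
    if not os.path.exists(path):
        return []
    with open(path, "rb") as f:
        data = f.read()
    return list(struct.unpack("<%dI" % (len(data) // 4), data))


index_dir = tempfile.mkdtemp()
append_postings(index_dir, "zodb", [1, 2])
append_postings(index_dir, "zodb", [7])
append_postings(index_dir, "a/b", [3])
```

Appending never rewrites existing data, so memory use stays flat while indexing. The sketch ignores very long terms and case-insensitive filesystems, both of which would need extra escaping.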

-robert
