Stormbringer
Hi,
I want to implement a fulltext search for messages in a forum. More
precisely: for each identified word in a message I store a
(wordId, msgId) pair, and when I search for something I want to be
able to retrieve all msgId values for a given wordId very quickly.
For starters I used PySqLite, with a table indexed on wordId. It works
fine, but it consumes far too much disk space. For example, 19103
pairs take 1686528 bytes, which comes to around 88 bytes per pair of
integers - way too much in my opinion, especially considering the
range of those integers (one is in the range 1..100000 and the other
in the range 1..500000).
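For reference, the setup described above can be reproduced with a small sketch. This uses Python's standard sqlite3 module rather than PySqLite, and the table and column names are invented for illustration:

```python
import os
import sqlite3
import tempfile

# Recreate the described schema: one (wordId, msgId) row per pair,
# with an index on wordId for fast lookup of all messages for a word.
path = os.path.join(tempfile.mkdtemp(), "index.db")
con = sqlite3.connect(path)
con.execute("CREATE TABLE word_msg (wordId INTEGER, msgId INTEGER)")
con.execute("CREATE INDEX idx_word ON word_msg (wordId)")
con.executemany(
    "INSERT INTO word_msg VALUES (?, ?)",
    [(w, m) for w in range(1, 101) for m in range(1, 11)],  # 1000 pairs
)
con.commit()

# The fast-path query: all msgIds for one wordId.
msgs = [r[0] for r in
        con.execute("SELECT msgId FROM word_msg WHERE wordId = ?", (42,))]
con.close()

# Bytes per pair is the figure that grows too fast in the report above.
print(len(msgs), os.path.getsize(path) / 1000.0, "bytes/pair")
```

The per-pair cost comes mostly from SQLite's per-row and per-index-entry overhead, not from the integers themselves.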
So my problem now is what I should use to store my data while still
being able to retrieve all messages for a given word very quickly.
Either I find a workaround to make sqlite store the values in fewer
bytes, or I use something else.
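One possible workaround along those lines (a sketch only, with invented names, not tested against the real data): keep one row per word and pack its posting list into a blob of fixed-width integers, so SQLite pays per-word rather than per-pair overhead.

```python
import sqlite3
from array import array

# One row per word; all msgIds for that word packed into a blob of
# 4-byte unsigned integers via the array module.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE postings (wordId INTEGER PRIMARY KEY, msgIds BLOB)")

def add_pair(word_id, msg_id):
    row = con.execute("SELECT msgIds FROM postings WHERE wordId = ?",
                      (word_id,)).fetchone()
    ids = array("I", row[0]) if row else array("I")
    ids.append(msg_id)
    con.execute("INSERT OR REPLACE INTO postings VALUES (?, ?)",
                (word_id, ids.tobytes()))

def lookup(word_id):
    row = con.execute("SELECT msgIds FROM postings WHERE wordId = ?",
                      (word_id,)).fetchone()
    return list(array("I", row[0])) if row else []

add_pair(7, 100)
add_pair(7, 250)
add_pair(9, 300)
print(lookup(7))  # [100, 250]
```

This costs 4 bytes per message id plus one row per word, at the price of rewriting a word's blob on every insert, so it suits batch indexing better than message-at-a-time updates.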
I looked at the (g)dbm family, but they store one entry per key, and
both key and value must be strings. Using one would mean storing the
list of messages for a word as a single string; every time I find a
new message containing that word, I would have to replace the value
with the old string plus the new message id, which I think is
inefficient.
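The append pattern described above would look roughly like this sketch (using Python's dbm.dumb backend for portability; the comma-separated encoding is one arbitrary choice):

```python
import dbm.dumb
import os
import tempfile

# dbm-style store: keys and values must be byte strings, so the posting
# list for a word is one comma-separated string.
path = os.path.join(tempfile.mkdtemp(), "words")
db = dbm.dumb.open(path, "c")

def add(word, msg_id):
    key = word.encode()
    old = db[key] if key in db else b""
    # The inefficiency in question: the whole value is rewritten on
    # every append, so indexing cost grows with posting-list length.
    db[key] = old + (b"," if old else b"") + str(msg_id).encode()

def lookup(word):
    key = word.encode()
    raw = db[key] if key in db else b""
    return [int(x) for x in raw.split(b",")] if raw else []

add("python", 12)
add("python", 57)
print(lookup("python"))  # [12, 57]
```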
At the moment I am looking at BerkeleyDB; I am still checking whether
it will suit my needs.
Any advice or hints? I'm looking only for embedded solutions like
PySqLite; if a database requires installing a server, it's not an
option.
Thanks in advance,
Andrei