"Standard" Full Text Search Engine

Martin Marcher

Hello,

is there something like a standard full text search engine?

I'm thinking of the Python equivalent of what Lucene is for Java or
Ferret is for Rails. Preferably something that isn't exactly a clone of
one of those, but rather something with a Python-friendly API.

Things I'd like to have:

* different languages are supported (it seems most FTSs do only english)
* I'd like to be able to provide an identifier (if I index files in
the filesystem that would be the filename, or an ID if it lives in a
database, or whatever applies)
* I'd like to pass it just some (user defined) keywords along with
the actual content (as a string, a list of strings or whatever) and to
retrieve results by searching for keywords
* something like a priority should be assignable to different fields
(like field: title(priority=10, content="My Draft"),
keywords(priority=50, list_of_keywords))
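To make the wish list above concrete, here is a minimal sketch of the kind of API it describes: caller-supplied identifiers, fields given as (priority, content) pairs, and keyword search ranked by the summed priorities. The class `TinyIndex` and its methods are hypothetical names made up for illustration; this is a toy in-memory inverted index, not the API of any engine mentioned in this thread.

```python
from collections import defaultdict
import re


class TinyIndex:
    """Toy in-memory inverted index; illustrative only, not a real library."""

    def __init__(self):
        # term -> {doc_id: accumulated priority}
        self._postings = defaultdict(dict)

    @staticmethod
    def _terms(value):
        # Accept a single string or a list of strings.
        parts = [value] if isinstance(value, str) else list(value)
        for part in parts:
            # \w is Unicode-aware in Python 3, so non-English words tokenize too.
            for term in re.findall(r"\w+", part.lower()):
                yield term

    def add(self, doc_id, **fields):
        """Index a document; each field is a (priority, content) pair."""
        for priority, content in fields.values():
            for term in self._terms(content):
                scores = self._postings[term]
                scores[doc_id] = scores.get(doc_id, 0) + priority

    def search(self, query):
        """Return ids of documents matching any query term, best score first."""
        totals = defaultdict(int)
        for term in self._terms(query):
            for doc_id, weight in self._postings.get(term, {}).items():
                totals[doc_id] += weight
        return sorted(totals, key=lambda d: (-totals[d], str(d)))


index = TinyIndex()
# The identifier can be a filename, a database ID, or whatever applies.
index.add("/docs/draft.txt", title=(10, "My Draft"), keywords=(50, ["python", "search"]))
index.add(42, title=(10, "Search engines"), keywords=(50, ["lucene"]))
print(index.search("search"))
```

A real engine would add stemming, stop words, phrase queries and on-disk storage; the sketch only shows how identifiers and per-field priorities could fit together.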

Unnecessary:

* built-in parsing of different files

The "standard" I'm referring to would be something with a large and
active user base. Just as WSGI is _the_ thing to refer to when doing
webapps, $FTS-Engine should be _the_ engine to refer to for full text
search.

any hints?
 
Diez B. Roggisch

There are several Python Lucene implementations available, and a project
called NUCULAR recently turned up here. There is also ZCatalog, the
full-text-indexing technology used in Zope, which should be usable
outside of Zope.

But "the" search technology doesn't exist. I personally would most probably
go for the Lucene-based stuff, because there you may also get auxiliary
tools written in Java.

Diez
 
Stephan Diehl

I'm using swish-e (swish-e.org) for all my indexing needs. I'm not sure if
there's a Python binding available; I'm using swish-e as an external
executable and live quite happily with that.
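Driving swish-e as an external executable from Python could look roughly like the sketch below. The `-f` (index file) and `-w` (search words) options are swish-e's standard search flags; the result-line layout assumed here (rank, path, quoted title, size, with `#` header lines and a closing `.`) is the default output as far as I recall, so treat the regular expression as an assumption and check it against your own swish-e output. The function names and the index file name are made up for illustration.

```python
import re
import subprocess

# Assumed shape of a swish-e result line: rank, path, quoted title, size.
RESULT_LINE = re.compile(r'^(\d+)\s+(.+?)\s+"(.*)"\s+(\d+)$')


def parse_results(output):
    """Turn swish-e's text output into (rank, path, title) tuples."""
    hits = []
    for line in output.splitlines():
        # Header lines ('# ...'), the closing '.', and anything else that
        # does not look like a hit simply fail to match and are skipped.
        match = RESULT_LINE.match(line.strip())
        if match:
            rank, path, title, _size = match.groups()
            hits.append((int(rank), path, title))
    return hits


def search(words, index_file="index.swish-e", binary="swish-e"):
    """Run the external swish-e executable and parse what it prints."""
    completed = subprocess.run(
        [binary, "-f", index_file, "-w", words],
        capture_output=True, text=True, check=False,
    )
    return parse_results(completed.stdout)


# Parsing demonstrated on a hand-written sample, so no swish-e install is needed.
sample = '''# SWISH format: 2.4.5
# Search words: python
# Number of hits: 2
1000 docs/intro.html "Introduction" 2048
512 docs/my notes.txt "Notes" 300
.
'''
print(parse_results(sample))
```

Keeping the parsing separate from the subprocess call makes it easy to test without the binary installed.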
 
aaron.watters

Diez B. Roggisch said:
There are several python lucene implementations available, and recently here
a project called NUCULAR turned up. And there is ZCatalog, the
full-text-indexing technology used in Zope, but which should be usable
outside of zope...

Thanks for the NUCULAR mention (http://nucular.sourceforge.net). It
certainly doesn't meet all the requirements requested (very few users
yet, some features missing). Please give it a look, however. It's easy
to use and fast. How fast it is compared to others I can't say,
especially since some of the numbers I see quoted out there are really
incredible (how can an indexer be faster than "cp"?) -- I suspect some
sort of trickery, frankly.

Anyway, if you want a feature like proximity searching or
some sort of internationalization support (it works with unicode, but
that's probably not enough), please let me know. I focused on
the core indexing and retrieval functionality, and I think a lot of
additional features can be added easily.
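As an illustration of why "works with unicode" is probably not enough for multi-language search, here is a small sketch of one common normalization step: folding case and stripping accents so that spelling variants end up under the same index term. This is not how Nucular does it (the thread does not say); `normalize_token` is a hypothetical helper built only on the standard library, and it leaves out language-specific work such as stemming.

```python
import unicodedata


def normalize_token(token):
    """Fold case and strip combining accents so variants index alike."""
    # NFKD splits a character like 'é' into 'e' plus a combining accent.
    decomposed = unicodedata.normalize("NFKD", token)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a more aggressive lower() (for example 'ß' becomes 'ss').
    return stripped.casefold()


for word in ["Café", "CAFE", "Straße", "Über"]:
    print(word, "->", normalize_token(word))
```

Whether accent stripping is desirable depends on the language, so a real engine would make this step configurable per field or per locale.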

fwiw, -- Aaron Watters

===
% make love
don't know how to make love. stopping.
 
Martin Marcher

2007/10/26, aaron.watters said:
Thanks for the NUCULAR mention (http://nucular.sourceforge.net). It
certainly doesn't meet all the requirements requested (very few users
yet, some features missing). Please give it a look, however. It's easy to use
and fast. How fast it is compared to others I can't say, especially
since some of the numbers I see quoted out there are really incredible (how
can an indexer be faster than "cp"?) -- I suspect some sort of
trickery, frankly.

For starters I think I will go with nucular. It seems good enough,
lightweight and easy to use.

aaron.watters said:
Anyway, if you want a feature like proximity searching or
some sort of internationalization support (it works with unicode, but
that's probably not enough), please let me know. I focused on
the core indexing and retrieval functionality, and I think a lot of
additional features can be added easily.

I don't know much about the internals of search engines but I'll
probably report back with a few suggestions after some time of usage
:)
 
Paul Rubin

Martin Marcher said:
is there something like a standard full text search engine?

I'm thinking of the equivalent for python like lucene is for java or
ferret for rails. Preferrably something that isn't exactly a clone of
one of those but more that is python friendly in terms of the API it
provides.

Ferret is basically a Lucene clone, originally written in Ruby but
with the intensive parts later rewritten in C for speed since the Ruby
version was too slow. There was something similar done in Python
(PyLucene, I think) that was also pretty slow.

Solr (a wrapper around Lucene) has a reasonable set of Python
bindings. Solr has become very popular among web developers because
it's pretty easy to set up and use. However, its flexibility is not
all that great.

Nucular looks promising though still in a fairly early stage.
Suggestion for Aaron: it would be great if Nucular used the same
directives as Solr (i.e. say <field/> instead of <fld/> and fix other
such gratuitous differences) and implemented more Solr/Lucene features.
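For readers who haven't seen the Solr directives referred to above: Solr's XML update message wraps each document in `<add><doc>` and gives every value its own `<field name="...">` element. The sketch below builds such a message with the standard library only. The field names and the helper `solr_add_message` are made up for illustration, and Nucular's own `<fld/>` format is deliberately not shown since its details are not given in this thread.

```python
import xml.etree.ElementTree as ET


def solr_add_message(doc):
    """Serialize a dict as a Solr-style <add><doc>...</doc></add> message."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        # Multi-valued fields are repeated, one <field> element per value.
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(item)
    return ET.tostring(add, encoding="unicode")


message = solr_add_message(
    {"id": "draft-1", "title": "My Draft", "keywords": ["python", "search"]}
)
print(message)
```

Posting the message to a running Solr instance is left out, since the update URL depends on how the server is set up.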
 
Paul Boddie

Paul Rubin said:
Ferret is basically a Lucene clone, originally written in Ruby but
with the intensive parts later rewritten in C for speed since the Ruby
version was too slow. There was something similar done in Python
(PyLucene, I think) that was also pretty slow.

You're thinking of Lupy, whose authors/supporters then seemed to
switch to Xapian:

http://www.divmod.org/projects/lupy

Meanwhile, PyLucene doesn't seem particularly slow to me. Provided you
can build the software (it requires gcj), it seems to work rapidly and
reliably - the only problem I've ever had was related to a threading
bug in Python 2.3 which was subsequently fixed by the Python core
developers.

Paul
 
