Technical Details
Technology
regain is based on Jakarta Lucene, a library for creating and searching search indices.
regain itself is written in 100% pure Java. The only non-Java parts are the
plugins that read the Excel, PowerPoint and Word formats. For Excel and Word,
however, there are alternatives in 100% pure Java.
Searching with regain
The work of regain is split into two parts: the
creation of the search index and the
search on the search index.
The following image gives you an overview of how regain searches.
The crawler searches a website or a directory tree for documents. In the
configuration you may specify what exactly should be crawled. From each
document the actual text is extracted using so-called preparators. The text is
added to the search index.
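
The extraction step described above can be sketched as follows. This is only an illustration under stated assumptions, not regain's actual code: the names `TextExtractor`, `HtmlExtractor`, `PlainTextExtractor` and `CrawlSketch` are hypothetical, the documents are held in memory instead of being crawled from a website or directory tree, and the choice of extractor by file extension is an assumption.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a preparator: turns a raw document into plain text. */
interface TextExtractor {
    boolean accepts(String path);
    String extractText(String rawContent);
}

/** Strips tags from HTML documents (very rough, for illustration only). */
class HtmlExtractor implements TextExtractor {
    public boolean accepts(String path) {
        return path.endsWith(".html");
    }

    public String extractText(String rawContent) {
        return rawContent.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }
}

/** Passes plain text files through, trimming surrounding whitespace. */
class PlainTextExtractor implements TextExtractor {
    public boolean accepts(String path) {
        return path.endsWith(".txt");
    }

    public String extractText(String rawContent) {
        return rawContent.trim();
    }
}

class CrawlSketch {
    /** Runs every document through the first extractor that accepts it; others are skipped. */
    public static Map<String, String> crawl(Map<String, String> rawDocuments,
                                            List<TextExtractor> extractors) {
        Map<String, String> extracted = new LinkedHashMap<>();
        for (Map.Entry<String, String> doc : rawDocuments.entrySet()) {
            for (TextExtractor extractor : extractors) {
                if (extractor.accepts(doc.getKey())) {
                    extracted.put(doc.getKey(), extractor.extractText(doc.getValue()));
                    break;
                }
            }
        }
        return extracted;
    }

    public static void main(String[] args) {
        Map<String, String> raw = new LinkedHashMap<>();
        raw.put("index.html", "<p>Hello <b>world</b></p>");
        raw.put("notes.txt", "  plain text  ");
        raw.put("image.bin", "no extractor accepts this file");

        List<TextExtractor> extractors =
                Arrays.asList(new HtmlExtractor(), new PlainTextExtractor());
        // The extracted text is what would then be handed to the search index.
        crawl(raw, extractors).forEach((path, text) -> System.out.println(path + ": " + text));
    }
}
```

Keeping one extractor per format means a new document type can be supported by adding a class, without touching the crawl loop.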
After you've created a search index, you are able to perform searches. The
search index is built in such a way that searching is very fast.
This is the whole trick of search engines: by using a cleverly structured
search index, the time needed for a full-text search is moved from the actual
search (where a user waits for the results) to the index creation (which runs
automatically in the background).
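
To illustrate why an index makes searching fast, here is a minimal sketch of an inverted index, the general kind of data structure that libraries such as Lucene build on. It is not Lucene's or regain's implementation: the class `InvertedIndexSketch` and its methods are hypothetical, and the tokenization (lowercasing and splitting on non-alphanumeric characters) is a simplifying assumption.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class InvertedIndexSketch {
    // Maps each term to the set of documents that contain it.
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Index creation: the expensive step, done once ahead of time. */
    public void addDocument(String docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, key -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Search: one map lookup instead of a scan over every document's text. */
    public Set<String> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument("a.txt", "Lucene builds a search index");
        index.addDocument("b.txt", "The crawler feeds the index.");
        System.out.println(index.search("index"));
        System.out.println(index.search("crawler"));
    }
}
```

All tokenizing work happens in `addDocument`, so a query only pays for a single lookup, no matter how much text was indexed beforehand.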
Rating the search results
The search results are rated by the relative frequency of the search terms
in the document. If a search term appears very often in a document, the
document is ranked higher. The length of a document is taken into account as
well: a document with 100 words that contains a search term 5 times (5%) will
be rated as a better hit than a document with 1000 words containing the search
term 10 times (1%).
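
The comparison above can be checked with a few lines of code. This is a deliberately simplified model of the rating as described here, namely occurrences divided by document length; Lucene's real scoring is more involved. The class and method names are hypothetical.

```java
class RelativeFrequencySketch {
    /** Occurrences of the search term divided by the document's total word count. */
    public static double relativeFrequency(int termCount, int totalWords) {
        if (totalWords <= 0) {
            throw new IllegalArgumentException("totalWords must be positive");
        }
        return (double) termCount / totalWords;
    }

    public static void main(String[] args) {
        double shortDoc = relativeFrequency(5, 100);   // 5 hits in 100 words
        double longDoc = relativeFrequency(10, 1000);  // 10 hits in 1000 words
        System.out.println("short document: " + shortDoc);
        System.out.println("long document:  " + longDoc);
        System.out.println("short document ranks higher: " + (shortDoc > longDoc));
    }
}
```

Although the longer document contains the term twice as often in absolute terms, its relative frequency is five times lower, so it is rated as the weaker hit.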