Features
regain offers many features that are important for an effective search
engine.
You can find more detailed information about specific features in the
regain help.
Search
regain uses the search syntax of Lucene, which makes it possible to express
very specific search queries. The most important options are the
following:
- Boolean operators
- Wildcards
- Phonetic search
- Grouping
- and much more. You can find more information about the search syntax
here.
- Multi-index search: Search multiple indexes from a single search form.
This is fully transparent to the user.
- URL rewriting: The search can rewrite document URLs. This enables
you to index documents from file://c:/www-data/intranet/docs and show them
in the browser as http://intranet.murfman.de/docs.
- Advanced search: All values that are in the index for a field can
be offered as a drop-down list on the search page. This is particularly
useful together with auxiliary fields.
- File-to-HTTP bridge: For security reasons, some browsers do not open file
links from HTTP pages. Therefore all documents that are in the index are
provided over HTTP. Of course this can be switched off, and in the desktop
search these documents are only accessible from the local host.
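
The URL rewriting described above comes down to replacing one URL prefix
with another before a result is displayed. The following is a minimal
sketch of that idea, not regain's actual implementation: the function
rewrite_url and the rule list REWRITE_RULES are hypothetical names, and
only the two URLs are taken from the example above.

```python
def rewrite_url(url, rules):
    """Return url with the first matching prefix replaced.

    rules is a list of (old_prefix, new_prefix) pairs (hypothetical format).
    URLs that match no rule are returned unchanged.
    """
    for old_prefix, new_prefix in rules:
        if url.startswith(old_prefix):
            return new_prefix + url[len(old_prefix):]
    return url


# Hypothetical rule list using the prefixes from the example above.
REWRITE_RULES = [
    ("file://c:/www-data/intranet/docs", "http://intranet.murfman.de/docs"),
]

indexed = "file://c:/www-data/intranet/docs/offers/offer.pdf"
print(rewrite_url(indexed, REWRITE_RULES))
# prints http://intranet.murfman.de/docs/offers/offer.pdf
```

In this sketch, documents outside the rewritten directory keep their
original URL.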
Defining the search space
With regain you can specify precisely what should be indexed and what
should not.
- White and black list: With a white list and a black list you can define
precisely which documents the crawler should process. For example, you can
index everything from
http://www.murfman.de except for
http://www.murfman.de/dynamiccontent.
- Several sources in one index: You may index documents from different
file systems and/or web sites in the same search index.
- Partial indexing: Suppose your search index contains documents from
a network drive (file server) and a website. You can update only the
documents from the network drive. This way you can update some drives
every hour and others only every week.
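
The white list and black list check can be pictured as a pair of prefix
tests on each URL the crawler encounters. The sketch below illustrates
this under the assumption that both lists contain plain URL prefixes; the
function should_index and the list names are hypothetical and do not
reflect regain's configuration format.

```python
def should_index(url, whitelist, blacklist):
    """Decide whether the crawler should process url.

    A URL is processed if it starts with at least one white list prefix
    and with no black list prefix (assumption: plain prefix matching).
    """
    on_whitelist = any(url.startswith(prefix) for prefix in whitelist)
    on_blacklist = any(url.startswith(prefix) for prefix in blacklist)
    return on_whitelist and not on_blacklist


# Prefixes taken from the example above.
WHITELIST = ["http://www.murfman.de"]
BLACKLIST = ["http://www.murfman.de/dynamiccontent"]

for candidate in (
    "http://www.murfman.de/index.html",
    "http://www.murfman.de/dynamiccontent/page?id=1",
    "http://www.example.org/index.html",
):
    print(candidate, should_index(candidate, WHITELIST, BLACKLIST))
```

Here the black list takes precedence, so a URL that matches both lists is
skipped.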
Indexing
- Hot deployment: Switch to a new search index without restarting your
servlet engine (e.g. Tomcat).
- Stopword list: Define words that should not be indexed.
- Analysis files: If desired, all intermediate steps of the indexing process
can be written out as files. This lets you see exactly what ends up in the
search index.
- Content extraction for HTML: Index only the actual content of your web
pages. regain removes the navigation and footer for you.
- Path extraction for HTML: Show the navigation path of your web pages in
the search results.
- Dead link detection: As a by-product of crawling, all dead links found
(links to non-existing documents) are written out.
- Breakpoints: The crawler periodically creates so-called breakpoints.
At each breakpoint, the current state of the search index is copied into a
separate directory. If the index update is interrupted (e.g. because the
computer is shut down), the crawler will continue from the last breakpoint
the next time it is started.
- Auxiliary fields: The index can be extended by auxiliary fields that
are extracted from a document's URL. Example: Suppose you have a directory
with a subdirectory for every project. Then you can generate an auxiliary
field with the project name. As a result, you get only documents from the
directory of the project "otto23" when searching for
"Offer project:otto23".
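
Extracting an auxiliary field from a document's URL, as in the project
example above, can be sketched with a regular expression. Everything
specific here is an assumption made for illustration: the directory layout
file://fileserver/projects/<project>/..., the pattern PROJECT_PATTERN and
the function auxiliary_fields are hypothetical and not part of regain.

```python
import re

# Hypothetical layout: one subdirectory per project below .../projects/.
# The first capture group becomes the value of the "project" field.
PROJECT_PATTERN = re.compile(r"^file://fileserver/projects/([^/]+)/")


def auxiliary_fields(url):
    """Return the auxiliary fields derived from url as a dict.

    URLs outside the assumed project layout yield no auxiliary fields.
    """
    match = PROJECT_PATTERN.match(url)
    if match is None:
        return {}
    return {"project": match.group(1)}


print(auxiliary_fields("file://fileserver/projects/otto23/offers/offer.doc"))
# prints {'project': 'otto23'}
```

With such a field stored in the index, a query like "Offer project:otto23"
only matches documents whose project field has the value otto23.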
Expandability and customization
regain is designed to be easily adaptable to your needs.
- Preparators: Each file format is prepared for indexing by a so-called
preparator. You can specify which preparators regain should use. In
addition, regain can easily be extended to support more file formats.
- Tag Library for the search: regain offers a Tag Library for creating the
JavaServer Page for the search. This makes adapting the search page to
your website's design particularly easy.
- Configuration: regain is highly adaptable. The whole configuration of the
crawler is in one XML file.
- Access rights management: You can integrate access rights management
that ensures that users only see results for documents they have read
access to.
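
One way to picture access rights management is as a filter on the result
list that compares the groups a user belongs to with the groups allowed to
read each document. The sketch below assumes such a group-based model; the
function filter_readable and the "groups" key on each hit are hypothetical
and are not regain's actual interface.

```python
def filter_readable(hits, user_groups):
    """Keep only the hits the user is allowed to read.

    Assumption: each hit is a dict with a "groups" list naming the groups
    that have read access, and the user is described by a list of groups.
    """
    allowed = set(user_groups)
    return [hit for hit in hits if allowed & set(hit["groups"])]


# Hypothetical search results with their read groups.
hits = [
    {"url": "http://intranet.murfman.de/docs/offer.pdf", "groups": ["sales"]},
    {"url": "http://intranet.murfman.de/docs/salary.xls", "groups": ["hr"]},
]

for hit in filter_readable(hits, ["sales", "support"]):
    print(hit["url"])
# prints http://intranet.murfman.de/docs/offer.pdf
```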