News
2014-07-31, Til Schneider
Version 2.1.0 STABLE has been released
- Auxiliary fields can now use the (non-URL-encoded) file system path instead of the URL.
2013-03-21, Thomas Tesche
Version 2.0.4 STABLE has been released
- Bugfix: Added the new Office XML/ZIP mime-types to the search result request handler
2013-03-02, Thomas Tesche
Version 2.0.2 STABLE has been released
- Bugfix: Regression: Searching multiple indexes didn't work in version 2.0.0 ("this IndexReader is already closed" exception)
2013-02-16, Thomas Tesche
Version 2.0.1 STABLE has been released
- Update: PDFPreparator (add missing libs, upgrade to PDFBox 1.7.1)
2012-12-06, Benjamin Pick, Thomas Tesche
Version 2.0.0 STABLE has been released
- Try all other preparators in case the designated preparator doesn't work
- Allow more than one wildcard mime-type declaration (positive or negative). Example: mimetype:"text/*" -mimetype:"text/plain"
- HtmlPreparator: Follows links in framesets
- JavaPreparator: Enums are now parsed as well (enum names and constants are extracted)
- MP3Preparator and GenericAudioPreparator can now handle read-only files
- IndexWriterManager: Close all files to prevent "Too many open files..." and "this IndexReader is closed" errors
2012-06-20, Benjamin Pick, Thomas Tesche
Version 1.8.0 PREVIEW has been released
- Lucene has been updated from 3.1 to 3.6.
- Desktop Version: Configuration now uses one-time tokens.
- Improve HTML Escaping for JSP Tags.
2012-04-12, Benjamin Pick, Thomas Tesche
Version 1.7.12 PREVIEW has been released
- Some bugfixes and new features related to custom SearchAccessController/CrawlerAccessController
- Update POI Library to 3.8.
- XML Interface now encodes correctly.
2012-02-03, Benjamin Pick, Thomas Tesche
Version 1.7.11 PREVIEW has been released
- NEW: Introducing dynamic blacklisting.
- Update Crawler Thumbnailer to version 0.3 (Stable).
2011-12-11, Thomas Tesche, Benjamin Pick
Version 1.7.10 PREVIEW has been released
- NEW: Add XML interface to search results (search_xml.jsp). The URL parameters are identical to search.jsp.
- All tags now have an optional attribute 'escape' to escape their output (html, xml or none)
- JarPreparator which indexes the filenames in *.jar, *.war, *.ear archives.
- ZipPreparator which indexes the filenames in *.zip archives.
- BUGFIX: DocumentFactory.createDocument(): IO-Stream was never closed
- IndexUpdateManager.checkUpdate(): IO-Stream was never closed
- No NullPointerException if the AnalyzerType is unknown; re-index instead
- build.xml: Allow compiling when only JAVA_HOME is set
- build.xml: Print a message when build.properties is not found
- Regression from 1.7.9: File2Http-Bridge failed to work
2011-08-16, Thomas Tesche
Version 1.7.9 PREVIEW has been released
- CONFIGURATION: Namespaces of Desktop Taglib are now in DesktopConfiguration.xml.
- Taglib Classes can be added in web/taglib as namespace.jar (Desktop)
- Bugfix: Crawler didn't work (could not access search.IndexConfig.getLuceneVersion())
2011-07-31, Thomas Tesche
Version 1.7.8 PREVIEW has been released
- New: Crawler plugin infrastructure (contributed by Benjamin); see also the documentation.
- Update of PDFBox: Improvements in performance and stability of text extraction from PDF files.
- Bugfix: The field 'filename' is now searched correctly.
2011-06-03, Thomas Tesche
Version 1.7.7 STABLE has been released
- Added icons for docx, xlsx and pptx in the search results.
- Improved error handling and error message for a non-existent conf directory
- Lib updates: PDFBox, Lucene, JAudiotagger
- Added predefined analyzer definitions for French and Italian.
- English stopword example in CrawlerConfiguration.xml
- English responses in FormTag (settings)
- Annotation extraction from PDF documents
- The system tray is now displayed correctly on 64-bit systems.
2010-12-21
Version 1.7.3 STABLE has been released
- Updates for Lucene, PDFBox, POI (office docs), Aperture and jcifs (samba access)
- Meta data extraction for PDF and all office documents
- BUGFIX: TrayIcon for Linux (32bit)
2010-09-26
Version 1.7.0 STABLE has been released
- Authentication for http now possible.
- A list of all crawled URLs is written at the end of crawling.
- BUGFIX: Added missing Apache Commons libs.
2009-11-29
Version 1.6.4 has been released
- New Sorting: relevance, last-modified, size, title, mimetype, path, filename. The sorting feature has to be configured in the SearchConfiguration.xml.
Only sorting by relevance is enabled in the delivered default config (same behaviour as before on the search results page).
- The last-modified date will be displayed on every search hit.
- UPDATED: Lucene to version 2.9.1
- UPDATED: PDFBox to version 0.8.0
- Obsolete PoiMsWord, -Excel, -Powerpoint and -VisioPreparator classes are removed.
- DEPRECATED: SingleSearchResults, MultipleSearchResults, MergedHits. These classes will be removed in one of the next distributions (depending on
the Lucene 3.0 update)
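The new sorting feature is enabled per field in the SearchConfiguration.xml. A purely hypothetical sketch of such a configuration follows; the element and attribute names here are assumptions, so check the SearchConfiguration.xml shipped with the release for the real schema:

```xml
<!-- Hypothetical sketch: only sorting by relevance is enabled, matching
     the delivered default config; element names are assumptions. -->
<sorting>
  <field name="relevance" enabled="true"/>
  <field name="last-modified" enabled="false"/>
  <field name="size" enabled="false"/>
</sorting>
```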
2009-09-05
Version 1.6.2 has been released
- URLCleaner for removing parts from the URL (e.g. session IDs)
- regain is built with Java 1.6 again because some of the libs require 1.6
- BUGFIX: RTF documents are now indexed correctly (replace wrong mimetype, disable SimpleRTFPreparator)
- <starturls/>, <whitelist/> and <blacklist/> accept <![CDATA[]]> sections. Useful for URLs which contain an &.
- BUGFIX: Multiple protocol entries sometimes occurred when using the setup page of the desktop version.
- BUGFIX: Update of the samba lib (jcifs). Handling of smb URLs improved (username and password are no longer stored in the index).
- BUGFIX: The 'Cached' text is now internationalized
- Example of French stopwords in CrawlerConfiguration.xml
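The CDATA support mentioned above lets you keep a raw & in a URL without XML-escaping it. A sketch (the URL is made up, and the exact nesting inside <whitelist/> may differ from regain's real schema):

```xml
<!-- The CDATA section keeps the & from being parsed as an XML entity -->
<whitelist>
  <![CDATA[http://www.example.com/page?id=1&lang=en]]>
</whitelist>
```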
2009-03-08
Version 1.6 has been released
- Left truncation of query terms: *ack finds back, stack and so on
- Highlighting for wildcard and fuzzy searches (contribution: A. Larsson)
- BUGFIX: Local files couldn't be executed after the execution of rewriteRules
- Mime messages are fetched only once from the IMAP server. To force reindexing, the documents or the complete index have to be dropped.
- New input field for imap(s) URLs on the config page (desktop search)
- Content storage (for the new preview function from 1.5.2) can be disabled with <storeContentForPreview/> {true,false} in the crawler configuration.
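Disabling the preview content storage could then look like this in the CrawlerConfiguration.xml (the tag name comes from the entry above; its placement within the file is an assumption):

```xml
<!-- Disable storing document content for the preview feature -->
<storeContentForPreview>false</storeContentForPreview>
```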
2008-08-07
Version 1.5.1 has been released
- Bugfix: In some cases no file contents were indexed.
- MP3Preparator extracts ID3v2 or ID3v1 tags
- Generic audio preparator which extracts metadata from mp4 (iTunes), Ogg Vorbis and FLAC
- JavaPreparator for *.java files (not included in the 'standard' distribution)
- smb/CIFS driver
- New HTML parser for better content extraction
- Bugfix: Filenames are now indexed correctly
- Link extraction switched from regexp-based extraction to HTMLParser link extraction
- Priority for preparators
- Highlighting for content and title
- Preparator selection on the basis of mime-types
- Mimetype detection (based on file extension and MagicMime)
- "Extended Search" switched from extension selection to mimetype selection
- -notrayicon command line parameter for the desktop search (the TrayIcon is never shown; code contribution by Stefan Gottlieb)
- Lucene updated to version 2.3.2
- Date format of the field 'last-modified' changed to "YYYYMMDD". Range searches can now be applied to the field. (code contribution by filiadata)
- Bugfix: Default locale handling in SharedTag (code contribution by filiadata)
- Bugfix: Removing anchors from URLs
- Definition of a default index update interval
- Bugfix: Deletion of temporary files is handled more safely
- Bugfix: Improved mime-type detection
- Bugfix: http:// links which end with a / are now extracted and indexed
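With the "YYYYMMDD" format for 'last-modified', standard Lucene range query syntax applies to that field. For example, to find documents modified during 2008:

```
last-modified:[20080101 TO 20081231]
```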
2007-12-01
Version 1.2.3 has been released
- Bugfix: In some cases no file contents were indexed.
2007-11-01
Version 1.2.2 has been released
- It is now possible to use any Lucene analyzer.
- Bugfix: The attribute 'beautified' of the hit_url tag was missing in the TLD definition.
- Bugfix: Fixed URL-encoding problems when using the file-to-http-bridge.
2007-10-30
Version 1.2.1 has been released
- Bugfix: In regain 1.2 some libs were missing. This was fixed with version 1.2.1.
2007-10-20
Version 1.2 has been released
- The search results now show icons indicating the file's type.
- The index fields "size" and "last-modified" are now searchable.
- New preparator: EmptyPreparator (contributed by Gerhard Olsson). This
preparator extracts no content from the files assigned to it, so only the path
and the filename are written to the index (helpful for all file types having no
matching preparator).
- The maximum number of terms per document is now configurable using the
maxFieldLength tag in the CrawlerConfiguration.xml. Default is 10000.
- The IfilterPreparator now works under Windows Server 2003, too.
- The values for the search:input_fieldlist tag may now be determined at
indexing time. This operation, which is slow for large indexes, therefore no
longer has to be executed when searching for the first time. This may be
configured using the valuePrefetchFields tag in the CrawlerConfiguration.xml.
- Several bugfixes
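The two CrawlerConfiguration.xml tags named above could be set roughly as follows. This is a sketch: only the tag names and the 10000 default come from the entries above; the placement, the example values and the space-separated field list syntax are assumptions:

```xml
<!-- Raise the per-document term limit above the default of 10000 -->
<maxFieldLength>30000</maxFieldLength>

<!-- Precompute value lists for search:input_fieldlist at indexing time;
     the field names here are made-up examples -->
<valuePrefetchFields>mimetype project</valuePrefetchFields>
```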
2006-03-27
Version 1.1.1 has been released
There were two bugs in the server variant, which are fixed in the new version.
2006-02-26
Final version 1.1 has been released
- regain now searches the URLs, too.
- The desktop search now shows the last log messages.
- Better handling of HTTP-Redirects. (Thanks to Gerhard Olsson)
- Auxiliary fields have new options: "tokenize", "store" and "index".
- Added documentation of the Tag Library.
- The search mask now accepts multiple "query" parameters
(they are just concatenated)
- The Jacob preparators have been improved. (Thanks to Reinhard Balling)
- New preparator ExternalPreparator: This preparator calls an external program
or script in order to get the text of documents. (Thanks to Paul Ortyl)
- Completed Italian localization. (Thanks to Franco Lombardo)
- Some Bugfixes
2005-12-05
Version 1.1 Beta 6 has been released
- New preparator: The PoiMsPowerPointPreparator is a platform-independent
preparator for PowerPoint. (Thanks to Gerhard Olsson)
- New preparator: The IfilterPreparator uses the IFilter interface of
Microsoft to read various file formats. Unfortunately it only works on Windows.
- Multi index search: In the SearchConfiguration.xml several indexes may now be
specified as default.
- Auxiliary fields now handle case sensitivity better.
- The HTTP user agent sent by the crawler to web servers may now be configured in
the CrawlerConfiguration.xml. This way the crawler can identify itself as
Internet Explorer, for example.
- Several bugfixes
2005-08-15
Error in version 1.1 Beta 5
Unfortunately two libraries are missing in version 1.1 Beta 5, so the
search mask does not work. Thus, I provided a corrected version
1.1 Beta 5a. This error only concerns the server variant, not the
desktop variant.
2005-08-13
Version 1.1 Beta 5 has been released
- Multi index search: It is now possible to search several search indexes over
one search mask. The search query is executed on every index and the results
are merged afterwards.
- The white list and the black list now allow regular expressions, too.
- Search mask: The location of resources and the configuration is now detected
more reliably. Therefore regain works properly even if Tomcat is running
as a service.
- Search mask: The file-to-http-bridge may now be switched off.
- Crawler: The crawler needs less memory now when crawling directories.
- Crawler: The crawler now adds failed documents to the index as well, so
they are not retried the next time the crawler runs. But if the crawler
is executed with the option "-retryFailedDocs", all failed documents are
retried.
- The HTML preparator now also handles the extensions .jsp, .php, .php3, .php4
and .asp.
- It's now possible to specify in the CrawlerConfiguration.xml which documents
should be prepared with a certain preparator.
- Several bugfixes
2005-04-13
Version 1.1 Beta 4 has been released
- Access rights management: It is now possible to integrate an access rights
management that ensures a user only sees results for documents he has
read permission for.
- Search: The search taglib has now a tag "hit_field", that writes an index
field value. The tag "hit_summary" was thereby removed.
- Search: If you don't want to read the search config from an XML file or if
you don't want to write the location of the XML file in the web.xml, you
may write your own SearchConfigFactory class and create the config on your
own. The SearchConfigFactory class is specified in the web.xml.
- Server search: The enclosed JSP pages did not work.
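Wiring a custom SearchConfigFactory into the web.xml could look roughly like this. This is a hypothetical sketch: the init parameter name and the example class package are assumptions, not regain's documented interface:

```xml
<!-- Hypothetical parameter name; the referenced class must implement
     regain's SearchConfigFactory and build the search config in code -->
<context-param>
  <param-name>searchConfigFactoryClass</param-name>
  <param-value>com.example.MySearchConfigFactory</param-value>
</context-param>
```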
2005-03-17
Version 1.1 Beta 3 has been released
- Crawler: Bugfix: The PoiMsExcelPreparator could not handle all number and date formats.
- Crawler: The error log is now more detailed (with stack traces).
- Crawler: Preparators are now encapsulated in their own jars. Thus the regain.jar only contains what regain itself needs, and preparators may be replaced more easily. Other developers may also provide preparators that can be plugged in very easily.
The configuration of the preparators is still in the CrawlerConfiguration.xml, but not all preparators must be declared any more. The preparators are executed in the same order as they are configured, with the unconfigured preparators afterwards.
- Desktop search: The desktop search now runs under Linux, too.
- Search: Bugfix: Files whose URL contains a double slash (e.g. on network drives: //fileserver/bla/blubb) couldn't be loaded.
- Desktop search: Bugfix: In the search results, umlauts were presented incorrectly.
- Desktop search: On the status page a currently running index update can now be paused and an index update can be started.
- Crawler: Bugfix: The HtmlPreparator could not handle all files.
2005-03-12
Version 1.1 Beta 2 has been released
- Crawler: The crawler now periodically creates so-called breakpoints, copying the current state of the search index into a separate directory. If the index update is cancelled (e.g. if the computer is shut down), the crawler continues from the last breakpoint.
- Desktop search: The status page now shows the timing results.
2005-03-10
Version 1.1 Beta 1 has been released
- Desktop search: regain now provides a desktop search besides the server search.
The desktop search provides many features that make it very easy to use:
- An installer for Windows.
- Integration in the task bar under Linux and Windows.
- Configuration over the browser.
- Status monitoring over the browser.
- Crawler: There is now a preparator for OpenOffice and StarOffice documents.
- All: Updated to the newest versions of the used projects.
- Crawler: Preparators are now configurable by the CrawlerConfiguration.xml.
- Search: The search is now configured by the SearchConfiguration.xml, not the web.xml any more. The web.xml only contains the path to the SearchConfiguration.xml.
- Search: The search now provides URL rewriting. This way you can index documents in file://c:/www-data/intranet/docs and show the documents in the browser as http://intranet.murfman.de/docs.
- Crawler: Auxiliary fields: The index may now be extended by auxiliary fields that are extracted from a document's URL.
Example: Assume you have a directory with a sub directory for every project. Then you can generate an auxiliary field with the project name. This way you
get only documents from that directory when searching for "Offer project:otto23".
- Search: Advanced search: All values that are in the index for one field may now be provided as a combo box on the search page. This is particularly useful together with auxiliary fields.
- Search: For security reasons some browsers do not load file links from http pages. Thus all documents that are in the index are now provided over HTTP. Of course, in the desktop search these documents are only accessible from the local host.
- Crawler: The JacobMsWordPreparator now regards styles. Thus it is possible to extract headlines that are weighted higher when searching.
- Crawler: The JacobMsOfficePreparators are now able to extract the description fields (title, author, etc.)
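The project example above could be expressed as an auxiliary field that applies a regular expression to the document URL. A purely hypothetical sketch; the tag and attribute names are assumptions about regain's CrawlerConfiguration.xml schema:

```xml
<!-- Hypothetical sketch: extract the first path segment below /projects/
     into an index field named "project" -->
<auxiliaryFieldList>
  <auxiliaryField field="project" regexGroup="1">
    <regex>^file://.*/projects/([^/]+)/</regex>
  </auxiliaryField>
</auxiliaryFieldList>
```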
2004-07-28
The first version of regain has been released!
Today I received the official confirmation from dm-drogerie markt, which
permits me to publish regain under the LGPL. The project is from now on
available in the download area and on the CVS server at SourceForge.