Regain 2.1.0-STABLE API

net.sf.regain
Class RegainToolkit

java.lang.Object
  extended by net.sf.regain.RegainToolkit

public class RegainToolkit
extends Object

Contains helper methods that are used by both the crawler and the search mask.

Author:
Til Schneider, www.murfman.de

Nested Class Summary
private static class RegainToolkit.LowercasingReader
          Reads all characters from a nested reader in lowercase.
private static class RegainToolkit.WrapperAnalyzer
          An analyzer that converts a document to lowercase before passing it to a nested analyzer.
 
Field Summary
private static boolean ANALYSE_ANALYZER
          Specifies whether the words identified by the analyzer should be printed.
static String FIELD_ACCESS_CONTROL_GROUPS
          The field name where the access control groups are stored
static String INDEX_ENCODING
          The encoding used for storing URLs in the index
private static List<File> jarFolders
           
private static org.apache.lucene.util.Version LUCENE_VERSION
          The Lucene version that matches the embedded Lucene jars.
private static String mLineSeparator
          The cached, system-specific line separator.
private static String mSystemDefaultEncoding
          The cached system's default encoding.
private static int SIZE_GB
          The number of bytes in a GB (gigabyte).
private static int SIZE_KB
          The number of bytes in a kB (kilobyte).
private static int SIZE_MB
          The number of bytes in a MB (megabyte).
 
Constructor Summary
RegainToolkit()
           
 
Method Summary
static void addLibraryJarPath(File file)
          Add a new library path where Jars can be loaded from.
static String bytesToString(long bytes)
          Returns a human-readable String for a number of bytes.
static String bytesToString(long bytes, int fractionDigits)
          Returns a human-readable String for a number of bytes.
static String bytesToString(long bytes, int fractionDigits, Locale locale)
          Returns a human-readable String for a number of bytes.
static String bytesToString(long bytes, Locale locale)
          Returns a human-readable String for a number of bytes.
static void checkGroupArray(Object accessController, String[] groupArr)
          Checks an array of group names.
static boolean containsWhitespace(String str)
          Checks whether the given String contains whitespace.
static void copyDirectory(File fromDir, File toDir, boolean copySubDirs)
          Copies a directory.
static void copyDirectory(File fromDir, File toDir, boolean copySubDirs, String excludeExtension)
          Copies a directory.
static void copyFile(File from, File to)
          Copies a file.
private static org.apache.lucene.analysis.Analyzer createAnalysingAnalyzer(org.apache.lucene.analysis.Analyzer nestedAnalyzer)
          Creates an analyzer that analyses the calls to a nested analyzer.
static org.apache.lucene.analysis.Analyzer createAnalyzer(String analyzerType, String[] stopWordList, String[] exclusionList, String[] untokenizedFieldNames)
          Creates an analyzer that is used by both the crawler and the search mask.
static Object createClassInstance(String className, Class<?> superClass, ClassLoader classLoader)
          Loads a class and creates an instance.
static Object createClassInstance(String className, Class<?> superClass, String jarFileName)
          Loads a class and creates an instance.
static String createHighlightedFieldIdent(String fieldName)
          Creates a field identifier for fields with highlighted content.
static String createSummaryFromContent(String content, int maxLength)
          Creates a summary from the given content. Returns null if no summary could be created.
static void deleteDirectory(File dir)
          Deletes a directory with all its subdirectories and files.
static String fileNameToUrl(String fileName)
          Returns the URL of a file name.
static String fileToCanonicalUrl(File file)
          Gets the canonical URL of a file (no symbolic links, normalised names etc).
static String fileToUrl(File file)
          Returns the URL of a file.
static PathFilenamePair fragmentUrl(String url)
          Constructs a path-filename pair from a given URL.
static long getDirectorySize(File dir)
          Gets the size of a directory with all files.
static String getLineSeparator()
          Returns the line separator of this operating system.
static org.apache.lucene.util.Version getLuceneVersion()
           
static String getSystemDefaultEncoding()
          Returns the system's default encoding.
static String lastModifiedToString(Date lastModified)
          Converts a Date object into a String with the format "YYYY-MM-DD HH:MM".
static void pipe(InputStream in, OutputStream out)
          Writes all data provided by the InputStream to the OutputStream.
static void pipe(Reader reader, Writer writer)
          Writes all data from the reader to the writer.
static HashMap<String,String[]> readFieldValues(org.apache.lucene.index.IndexReader indexReader, String[] fieldNameArr, File indexDir)
          Returns the distinct values of one or more fields.
static String[] readListFromFile(File file)
          Reads a word list from a file.
static String readStringFromFile(File file)
          Reads a String from a file.
static String readStringFromStream(InputStream stream)
          Reads a String from a stream.
static String readStringFromStream(InputStream stream, String charsetName)
          Reads a String from a stream.
static String removeProtocol(String path)
          Removes the protocol from a given path.
static String replace(String source, String[] patternArr, String[] replacementArr)
          Replaces all occurrences of a list of patterns in a string with their replacements.
static String replace(String source, String pattern, String replacement)
          Replaces all occurrences of pattern in a string with replacement.
private static File searchJarFile(String jarFileName)
           
static String[] splitString(String str, String delim)
          Splits a String into a string array.
static String[] splitString(String str, String delim, boolean trimSplits)
          Splits a String into a string array.
static Date stringToLastModified(String asString)
          Converts a String with the format "YYYY-MM-DD HH:MM" into a Date object.
static String toPercentString(double value)
          Returns a value as a percentage with two decimal places.
static String toTimeString(long time)
          Gets a human readable String for a time.
static String urlDecode(String text, String encoding)
          URL-decodes a String.
static String urlEncode(String text, String encoding)
          URL-encodes a String.
static File urlToFile(String url)
          Gets the file that is described by a URL with the file:// protocol.
static String urlToFileName(String url)
          Gets the file name that is described by a URL with the file:// protocol.
static jcifs.smb.SmbFile urlToSmbFile(String url)
          Gets the smbfile that is described by a URL with the smb:// protocol.
static String urlToSmbFileName(String url)
          Gets the smb file name that is described by a URL with the smb:// protocol.
static String urlToWhitespacedFileName(String url)
          Gets the 'real' file name that is described by a URL with the file:// protocol.
static void writeListToFile(String[] wordList, File file)
          Writes a word list to a file.
static void writeToFile(byte[] data, File file)
          Writes data to a file.
static void writeToFile(String text, File file)
          Writes a String into a file.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

INDEX_ENCODING

public static final String INDEX_ENCODING
The encoding used for storing URLs in the index

See Also:
Constant Field Values

FIELD_ACCESS_CONTROL_GROUPS

public static final String FIELD_ACCESS_CONTROL_GROUPS
The field name where the access control groups are stored

See Also:
Constant Field Values

ANALYSE_ANALYZER

private static final boolean ANALYSE_ANALYZER
Specifies whether the words identified by the analyzer should be printed.

See Also:
Constant Field Values

SIZE_KB

private static final int SIZE_KB
The number of bytes in a kB (kilobyte).

See Also:
Constant Field Values

SIZE_MB

private static final int SIZE_MB
The number of bytes in a MB (megabyte).

See Also:
Constant Field Values

SIZE_GB

private static final int SIZE_GB
The number of bytes in a GB (gigabyte).

See Also:
Constant Field Values

mSystemDefaultEncoding

private static String mSystemDefaultEncoding
The cached system's default encoding.


mLineSeparator

private static String mLineSeparator
The cached, system-specific line separator.


LUCENE_VERSION

private static final org.apache.lucene.util.Version LUCENE_VERSION
The Lucene version that matches the embedded Lucene jars.


jarFolders

private static List<File> jarFolders
Constructor Detail

RegainToolkit

public RegainToolkit()
Method Detail

getLuceneVersion

public static org.apache.lucene.util.Version getLuceneVersion()

deleteDirectory

public static void deleteDirectory(File dir)
                            throws RegainException
Deletes a directory with all its subdirectories and files.

Parameters:
dir - The directory to delete.
Throws:
RegainException - If deleting failed.

pipe

public static void pipe(Reader reader,
                        Writer writer)
                 throws IOException
Writes all data from the reader to the writer.

Neither the reader nor the writer will be closed. This has to be done by the caller!

Parameters:
reader - The reader that provides the data.
writer - The writer where to write the data.
Throws:
IOException - If reading or writing failed.
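
The contract above (copy everything, close nothing) can be illustrated with a minimal sketch. This is not the Regain implementation; the class name PipeSketch and the buffer size are assumptions made for the example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical stand-in for illustration only, not the Regain source.
final class PipeSketch {

    // Copies all characters from reader to writer. Neither is closed here.
    static void pipe(Reader reader, Writer writer) throws IOException {
        char[] buffer = new char[8192]; // buffer size is an arbitrary choice
        int len;
        while ((len = reader.read(buffer)) != -1) {
            writer.write(buffer, 0, len);
        }
    }

    public static void main(String[] args) throws IOException {
        // The caller owns both ends, so the caller closes them.
        try (Reader in = new StringReader("hello regain");
             StringWriter out = new StringWriter()) {
            pipe(in, out);
            System.out.println(out);
        }
    }
}
```

Leaving the close to the caller lets the same writer receive data from several readers in sequence.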

pipe

public static void pipe(InputStream in,
                        OutputStream out)
                 throws IOException
Writes all data provided by the InputStream to the OutputStream.

Neither the InputStream nor the OutputStream will be closed. This has to be done by the caller!

Parameters:
in - The InputStream that provides the data.
out - The OutputStream to write the data to.
Throws:
IOException - If reading or writing failed.

copyFile

public static void copyFile(File from,
                            File to)
                     throws RegainException
Copies a file.

Parameters:
from - The source file.
to - The target file.
Throws:
RegainException - If copying failed.

copyDirectory

public static void copyDirectory(File fromDir,
                                 File toDir,
                                 boolean copySubDirs,
                                 String excludeExtension)
                          throws RegainException
Copies a directory.

Parameters:
fromDir - The source directory.
toDir - The target directory.
copySubDirs - Specifies whether to copy sub directories.
excludeExtension - The file extension to exclude.
Throws:
RegainException - If copying the directory failed.

copyDirectory

public static void copyDirectory(File fromDir,
                                 File toDir,
                                 boolean copySubDirs)
                          throws RegainException
Copies a directory.

Parameters:
fromDir - The source directory.
toDir - The target directory.
copySubDirs - Specifies whether to copy sub directories.
Throws:
RegainException - If copying the directory failed.

readStringFromStream

public static String readStringFromStream(InputStream stream,
                                          String charsetName)
                                   throws RegainException
Reads a String from a stream.

Parameters:
stream - The stream to read the String from
charsetName - The name of the charset to use.
Returns:
The stream content as String.
Throws:
RegainException - If reading the String failed.

readStringFromStream

public static String readStringFromStream(InputStream stream)
                                   throws RegainException
Reads a String from a stream.

Parameters:
stream - The stream to read the String from
Returns:
The stream content as String.
Throws:
RegainException - If reading the String failed.

readStringFromFile

public static String readStringFromFile(File file)
                                 throws RegainException
Reads a String from a file.

Parameters:
file - The file to read the String from.
Returns:
The content of the file as a String, or null if the file does not exist.
Throws:
RegainException - If reading failed.

readListFromFile

public static String[] readListFromFile(File file)
                                 throws RegainException
Reads a word list from a file.

Parameters:
file - The file to read the list from.
Returns:
The lines of the file.
Throws:
RegainException - If reading failed.

writeToFile

public static void writeToFile(byte[] data,
                               File file)
                        throws RegainException
Writes data to a file.

Parameters:
data - The data to write.
file - The file to write to.
Throws:
RegainException - If writing failed.

writeToFile

public static void writeToFile(String text,
                               File file)
                        throws RegainException
Writes a String into a file.

Parameters:
text - The string.
file - The file to write to.
Throws:
RegainException - If writing failed.

writeListToFile

public static void writeListToFile(String[] wordList,
                                   File file)
                            throws RegainException
Writes a word list to a file. Each item of the list is written on its own line.

Parameters:
wordList - The word list.
file - The file to write to.
Throws:
RegainException - If writing failed.

getDirectorySize

public static long getDirectorySize(File dir)
Gets the size of a directory with all files.

Parameters:
dir - The directory to get the size for.
Returns:
The size of the directory.
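
A minimal sketch of how such a size calculation might look, assuming that sub directories are included recursively (the documentation does not say so explicitly). The class name DirectorySizeSketch is hypothetical and this is not the Regain source.

```java
import java.io.File;

// Hypothetical stand-in for illustration only, not the Regain source.
final class DirectorySizeSketch {

    // Sums the lengths of all files below dir, descending into sub directories.
    static long getDirectorySize(File dir) {
        File[] children = dir.listFiles(); // null if dir is not a readable directory
        if (children == null) {
            return 0;
        }
        long size = 0;
        for (File child : children) {
            size += child.isDirectory() ? getDirectorySize(child) : child.length();
        }
        return size;
    }

    public static void main(String[] args) {
        System.out.println(getDirectorySize(new File(".")) + " bytes");
    }
}
```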

readFieldValues

public static HashMap<String,String[]> readFieldValues(org.apache.lucene.index.IndexReader indexReader,
                                                       String[] fieldNameArr,
                                                       File indexDir)
                                                throws RegainException
Returns the distinct values of one or more fields.

If an index directory is provided, the values are read from cache files in that directory. If there are no matching cache files, the values are extracted from the search index and the cache files are then created, so the next call will be faster.

Parameters:
indexReader - The index reader to use for reading the field values.
fieldNameArr - The names of the fields to read the distinct values for.
indexDir - The index directory where to read or write the cached distinct values. May be null.
Returns:
A hashmap that maps each field name (key, String) to the sorted array of its distinct values (value, String[]).
Throws:
RegainException - If reading from the index failed, or if reading or writing a cache file failed.
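
The read-through caching described for this method can be sketched without Lucene. In the example below the index is simulated by a Supplier, and the cache file name "field_values_<field>.txt" as well as the class name FieldValueCacheSketch are invented for illustration; the real cache layout is not documented here.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;
import java.util.function.Supplier;

// Hypothetical stand-in for illustration only, not the Regain source.
final class FieldValueCacheSketch {

    // Returns the sorted distinct values of one field, using a cache file if present.
    static String[] readDistinctValues(String fieldName, File indexDir,
                                       Supplier<List<String>> extractor) throws IOException {
        // Assumed cache file name; indexDir may be null, which disables caching.
        File cacheFile = (indexDir == null) ? null
                : new File(indexDir, "field_values_" + fieldName + ".txt");

        if (cacheFile != null && cacheFile.isFile()) {
            // Cache hit: the file holds one value per line.
            List<String> lines = Files.readAllLines(cacheFile.toPath(), StandardCharsets.UTF_8);
            return lines.toArray(new String[0]);
        }

        // Cache miss: extract from the (simulated) index, sort and deduplicate.
        String[] values = new TreeSet<>(extractor.get()).toArray(new String[0]);
        if (cacheFile != null) {
            Files.write(cacheFile.toPath(), Arrays.asList(values), StandardCharsets.UTF_8);
        }
        return values;
    }
}
```

On the second call with the same directory the extractor is no longer invoked, which is the speed-up the description refers to.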

createAnalyzer

public static org.apache.lucene.analysis.Analyzer createAnalyzer(String analyzerType,
                                                                 String[] stopWordList,
                                                                 String[] exclusionList,
                                                                 String[] untokenizedFieldNames)
                                                          throws RegainException
Creates an analyzer that is used by both the crawler and the search mask. Both must use the same analyzer, which is the reason this method exists.

Parameters:
analyzerType - The type of the analyzer to create. Either a classname or "english" or "german".
stopWordList - All words that should not be indexed.
exclusionList - All words that shouldn't be changed by the analyzer.
untokenizedFieldNames - The names of the fields that should not be tokenized.
Returns:
The analyzer.
Throws:
RegainException - If the creation failed.

createAnalysingAnalyzer

private static org.apache.lucene.analysis.Analyzer createAnalysingAnalyzer(org.apache.lucene.analysis.Analyzer nestedAnalyzer)
Creates an analyzer that analyses the calls to a nested analyzer.

This is helpful for debugging when you want to check what an analyzer outputs for particular queries.

Parameters:
nestedAnalyzer - The nested analyzer that should be analysed.
Returns:
An analyzer that analyses the calls to a nested analyzer.

replace

public static String replace(String source,
                             String pattern,
                             String replacement)
Replaces all occurrences of pattern in a string with replacement.

Note: pattern may be a substring of replacement.

Parameters:
source - The string to search in.
pattern - The pattern to be replaced.
replacement - The replacement for each occurrence of the pattern.
Returns:
A string where all occurrences of pattern are replaced by replacement.
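
The note that pattern may be a substring of replacement implies that inserted text is never scanned again. A minimal sketch of a replace loop with that property follows; the class name ReplaceSketch and the handling of an empty pattern are assumptions, and this is not the Regain source.

```java
// Hypothetical stand-in for illustration only, not the Regain source.
final class ReplaceSketch {

    static String replace(String source, String pattern, String replacement) {
        if (pattern.isEmpty()) {
            return source; // assumption: an empty pattern changes nothing
        }
        StringBuilder result = new StringBuilder();
        int pos = 0;
        int hit;
        while ((hit = source.indexOf(pattern, pos)) != -1) {
            // Copy the text before the match, then the replacement.
            result.append(source, pos, hit).append(replacement);
            // Continue behind the match in the source, so the replacement
            // itself is never searched and the loop always terminates.
            pos = hit + pattern.length();
        }
        return result.append(source, pos, source.length()).toString();
    }

    public static void main(String[] args) {
        System.out.println(replace("a-b-c", "-", "--")); // prints "a--b--c"
    }
}
```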

replace

public static String replace(String source,
                             String[] patternArr,
                             String[] replacementArr)
Replaces all occurrences of a list of patterns in a string with their replacements.

Note: The string is searched from left to right, so the pattern that matches earliest in the string is replaced first. Example: replace("abcd", { "bc", "ab", "cd" }, { "x", "1", "2" }) will return "12" (the pattern "bc" won't be applied, since "ab" matches before).

Note: If two patterns match at the same position, then the first one defined will be applied. Example: replace("abcd", { "ab", "abc" }, { "1", "2" }) will return "1cd".

Parameters:
source - The string to search in.
patternArr - The patterns to be replaced.
replacementArr - The replacement for each occurrence of the corresponding pattern.
Returns:
A string where all occurrences of the patterns are replaced by their replacements.
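
The two notes above define a precedence rule: the earliest match wins, and ties go to the pattern defined first. A minimal sketch that reproduces both documented examples is given below. The class name MultiReplaceSketch is hypothetical and the code is not taken from Regain; skipping empty patterns is an assumption.

```java
// Hypothetical stand-in for illustration only, not the Regain source.
final class MultiReplaceSketch {

    static String replace(String source, String[] patternArr, String[] replacementArr) {
        StringBuilder result = new StringBuilder();
        int pos = 0;
        while (pos < source.length()) {
            int bestHit = -1;
            int bestIdx = -1;
            for (int i = 0; i < patternArr.length; i++) {
                if (patternArr[i].isEmpty()) {
                    continue; // assumption: empty patterns are ignored
                }
                int hit = source.indexOf(patternArr[i], pos);
                // Strictly smaller: on a tie the earlier-defined pattern is kept.
                if (hit != -1 && (bestHit == -1 || hit < bestHit)) {
                    bestHit = hit;
                    bestIdx = i;
                }
            }
            if (bestIdx == -1) {
                break; // no pattern matches in the rest of the string
            }
            result.append(source, pos, bestHit).append(replacementArr[bestIdx]);
            pos = bestHit + patternArr[bestIdx].length();
        }
        return result.append(source, pos, source.length()).toString();
    }

    public static void main(String[] args) {
        System.out.println(replace("abcd",
                new String[] {"bc", "ab", "cd"}, new String[] {"x", "1", "2"})); // prints "12"
        System.out.println(replace("abcd",
                new String[] {"ab", "abc"}, new String[] {"1", "2"})); // prints "1cd"
    }
}
```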

toPercentString

public static String toPercentString(double value)
Returns a value as a percentage with two decimal places.

Parameters:
value - The value (between 0 and 1).
Returns:
The value as a percentage.
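
As an illustration of the conversion, here is a minimal sketch that scales a value between 0 and 1 to a percentage with two decimal places. The exact output format of the real method (decimal separator, spacing, percent sign) is not documented here, so the format below and the class name PercentSketch are assumptions.

```java
import java.util.Locale;

// Hypothetical stand-in for illustration only, not the Regain source.
final class PercentSketch {

    // Assumed format: English decimal point, two fraction digits, trailing '%'.
    static String toPercentString(double value) {
        return String.format(Locale.ENGLISH, "%.2f%%", value * 100);
    }

    public static void main(String[] args) {
        System.out.println(toPercentString(0.5)); // prints "50.00%"
    }
}
```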

bytesToString

public static String bytesToString(long bytes)
Returns a human-readable String for a number of bytes.

Parameters:
bytes - The number of bytes.
Returns:
A String that represents the number of bytes.

bytesToString

public static String bytesToString(long bytes,
                                   Locale locale)
Returns a human-readable String for a number of bytes.

Parameters:
bytes - The number of bytes.
locale - The locale to use for formatting the numbers.
Returns:
A String that represents the number of bytes.

bytesToString

public static String bytesToString(long bytes,
                                   int fractionDigits)
Returns a human-readable String for a number of bytes.

Parameters:
bytes - The number of bytes.
fractionDigits - The number of fraction digits.
Returns:
A String that represents the number of bytes.

bytesToString

public static String bytesToString(long bytes,
                                   int fractionDigits,
                                   Locale locale)
Returns a human-readable String for a number of bytes.

Parameters:
bytes - The number of bytes.
fractionDigits - The number of fraction digits.
locale - The locale to use for formatting the numbers.
Returns:
A String that represents the number of bytes.
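
A minimal sketch of how the fractionDigits and locale parameters might interact. It assumes binary units (1 kB = 1024 bytes) and the unit labels "Byte", "kB", "MB" and "GB"; neither the constant values nor the labels are confirmed by this page, and the class name BytesSketch is hypothetical.

```java
import java.text.NumberFormat;
import java.util.Locale;

// Hypothetical stand-in for illustration only, not the Regain source.
final class BytesSketch {
    // Assumed values; the real SIZE_* constants are private and not shown here.
    private static final long SIZE_KB = 1024;
    private static final long SIZE_MB = 1024 * SIZE_KB;
    private static final long SIZE_GB = 1024 * SIZE_MB;

    static String bytesToString(long bytes, int fractionDigits, Locale locale) {
        // The locale decides the decimal separator, e.g. '.' or ','.
        NumberFormat format = NumberFormat.getInstance(locale);
        format.setMinimumFractionDigits(fractionDigits);
        format.setMaximumFractionDigits(fractionDigits);

        // Pick the largest unit that yields a value of at least 1.
        if (bytes >= SIZE_GB) {
            return format.format((double) bytes / SIZE_GB) + " GB";
        } else if (bytes >= SIZE_MB) {
            return format.format((double) bytes / SIZE_MB) + " MB";
        } else if (bytes >= SIZE_KB) {
            return format.format((double) bytes / SIZE_KB) + " kB";
        }
        return bytes + " Byte";
    }

    public static void main(String[] args) {
        System.out.println(bytesToString(1536, 1, Locale.ENGLISH)); // prints "1.5 kB"
        System.out.println(bytesToString(1536, 1, Locale.GERMAN));  // prints "1,5 kB"
    }
}
```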

toTimeString

public static String toTimeString(long time)
Gets a human readable String for a time.

Parameters:
time - The time in milliseconds.
Returns:
The time as String.

lastModifiedToString

public static String lastModifiedToString(Date lastModified)
Converts a Date object into a String with the format "YYYY-MM-DD HH:MM". This is necessary to have an unambiguous, human-readable format.

This format is intentionally not localized, in order to keep it unambiguous. Localization has to be done by the search mask.

Parameters:
lastModified - The Date object to convert.
Returns:
A String with the format "YYYY-MM-DD HH:MM".
See Also:
stringToLastModified(String)

stringToLastModified

public static Date stringToLastModified(String asString)
                                 throws RegainException
Converts a String with the format "YYYY-MM-DD HH:MM" into a Date object.

Parameters:
asString - The String to convert.
Returns:
The converted Date object.
Throws:
RegainException - If the String has a wrong format.
See Also:
lastModifiedToString(Date)
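
The two conversion methods form a round trip over a fixed, non-localized format. A minimal sketch with java.text.SimpleDateFormat follows, where the documented "YYYY-MM-DD HH:MM" corresponds to the pattern "yyyy-MM-dd HH:mm". Using the JVM default time zone and throwing ParseException instead of RegainException are assumptions, as is the class name LastModifiedSketch.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

// Hypothetical stand-in for illustration only, not the Regain source.
final class LastModifiedSketch {
    private static final String PATTERN = "yyyy-MM-dd HH:mm";

    // SimpleDateFormat is not thread-safe, so a fresh instance is created per call.
    static String lastModifiedToString(Date lastModified) {
        return new SimpleDateFormat(PATTERN, Locale.ROOT).format(lastModified);
    }

    static Date stringToLastModified(String asString) throws ParseException {
        return new SimpleDateFormat(PATTERN, Locale.ROOT).parse(asString);
    }

    public static void main(String[] args) throws ParseException {
        String text = lastModifiedToString(new Date());
        System.out.println(text + " -> " + stringToLastModified(text));
    }
}
```

Because the format has minute resolution, a round trip through the String form drops seconds and milliseconds.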

splitString

public static String[] splitString(String str,
                                   String delim)
Splits a String into a string array.

Parameters:
str - The String to split.
delim - The String that separates the items to split.
Returns:
An array of the items.

splitString

public static String[] splitString(String str,
                                   String delim,
                                   boolean trimSplits)
Splits a String into a string array.

Parameters:
str - The String to split.
delim - The String that separates the items to split.
trimSplits - Specifies whether String.trim() should be called for every split.
Returns:
An array of the items.
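
A minimal sketch of splitting on a literal delimiter with optional trimming. Whether the real method keeps empty items or accepts an empty delimiter is not documented here; the sketch keeps empty items and rejects an empty delimiter. The class name SplitSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration only, not the Regain source.
final class SplitSketch {

    static String[] splitString(String str, String delim, boolean trimSplits) {
        if (delim.isEmpty()) {
            throw new IllegalArgumentException("delim must not be empty");
        }
        List<String> items = new ArrayList<>();
        int pos = 0;
        int hit;
        // delim is treated as a literal string, not as a regular expression.
        while ((hit = str.indexOf(delim, pos)) != -1) {
            items.add(str.substring(pos, hit));
            pos = hit + delim.length();
        }
        items.add(str.substring(pos)); // the rest after the last delimiter

        String[] result = new String[items.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = trimSplits ? items.get(i).trim() : items.get(i);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(String.join("|", splitString("a; b ;c", ";", true))); // prints "a|b|c"
    }
}
```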

getLineSeparator

public static String getLineSeparator()
Returns the line separator of this operating system.

Returns:
\n or \r\n or \n\r, according to what JVM specifies.

getSystemDefaultEncoding

public static String getSystemDefaultEncoding()
Returns the system's default encoding.

Returns:
the system's default encoding.

containsWhitespace

public static boolean containsWhitespace(String str)
Checks whether the given String contains whitespace.

Parameters:
str - The String to check.
Returns:
Whether the given String contains whitespace.

checkGroupArray

public static void checkGroupArray(Object accessController,
                                   String[] groupArr)
                            throws RegainException
Checks an array of group names.

Parameters:
accessController - The (search or crawler) access controller that returned the array of group names.
groupArr - The array of group names to check.
Throws:
RegainException - If the array of group names is not valid.

createClassInstance

public static Object createClassInstance(String className,
                                         Class<?> superClass,
                                         ClassLoader classLoader)
                                  throws RegainException
Loads a class and creates an instance.

Parameters:
className - The name of the class to load and create an instance of.
superClass - The super class the class must extend.
classLoader - The class loader to use for loading the class. May be null
Returns:
An object of the class.
Throws:
RegainException - If loading the class or creating the instance failed or if the class is no instance of the given super class.
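
The load, check and instantiate sequence described above can be sketched with plain reflection. The sketch assumes a public no-argument constructor, falls back to its own class loader when classLoader is null, and reports a type mismatch with IllegalArgumentException instead of RegainException; all of these are assumptions, as is the class name ClassInstanceSketch.

```java
// Hypothetical stand-in for illustration only, not the Regain source.
final class ClassInstanceSketch {

    static Object createClassInstance(String className, Class<?> superClass,
                                      ClassLoader classLoader)
            throws ReflectiveOperationException {
        // Assumed fallback when no class loader is given.
        ClassLoader loader = (classLoader != null)
                ? classLoader
                : ClassInstanceSketch.class.getClassLoader();

        Class<?> clazz = Class.forName(className, true, loader);

        // Reject classes that are not a subtype of the required super class.
        if (!superClass.isAssignableFrom(clazz)) {
            throw new IllegalArgumentException(
                    className + " is not a subtype of " + superClass.getName());
        }
        // Requires an accessible no-argument constructor.
        return clazz.getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws ReflectiveOperationException {
        Object list = createClassInstance("java.util.ArrayList", java.util.List.class, null);
        System.out.println(list.getClass().getName()); // prints "java.util.ArrayList"
    }
}
```

Checking the type before instantiation means a misconfigured class name fails without running any constructor of the wrong class.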

addLibraryJarPath

public static void addLibraryJarPath(File file)
Add a new library path where Jars can be loaded from.

Parameters:
file - The directory to add. Non-existing directories are silently discarded.

searchJarFile

private static File searchJarFile(String jarFileName)

createClassInstance

public static Object createClassInstance(String className,
                                         Class<?> superClass,
                                         String jarFileName)
                                  throws RegainException
Loads a class and creates an instance.

Parameters:
className - The name of the class to load and create an instance of.
superClass - The super class the class must extend.
jarFileName - The name of the jar file to load the class from. May be null or relative to a library path.
Returns:
An object of the class.
Throws:
RegainException - If loading the class or creating the instance failed or if the class is no instance of the given super class.

urlToFileName

public static String urlToFileName(String url)
                            throws RegainException
Gets the file name that is described by a URL with the file:// protocol.

Parameters:
url - The URL to get the file name for.
Returns:
The file name that matches the URL.
Throws:
RegainException - If the URL's protocol isn't file://.

urlToWhitespacedFileName

public static String urlToWhitespacedFileName(String url)
                                       throws RegainException
Gets the 'real' file name that is described by a URL with the file:// protocol. This file name does not contain a path, protocol, or drive letter.

Parameters:
url - The URL to extract the file name from.
Returns:
The file name that matches the URL.
Throws:
RegainException

fragmentUrl

public static PathFilenamePair fragmentUrl(String url)
                                    throws RegainException
Constructs a path-filename pair from a given URL.

Parameters:
url - the url
Returns:
a path-filename pair
Throws:
RegainException

removeProtocol

public static String removeProtocol(String path)
Removes the protocol from a given path.

Parameters:
path - the path
Returns:
a path without a protocol

urlToFile

public static File urlToFile(String url)
                      throws RegainException
Gets the file that is described by a URL with the file:// protocol.

Parameters:
url - The URL to get the file for.
Returns:
The file that matches the URL.
Throws:
RegainException - If the URL's protocol isn't file://.

urlToSmbFile

public static jcifs.smb.SmbFile urlToSmbFile(String url)
                                      throws RegainException
Gets the smbfile that is described by a URL with the smb:// protocol.

Parameters:
url - The URL to get the smbfile for.
Returns:
The smbfile that matches the URL.
Throws:
RegainException - If the URL's protocol isn't smb://.

urlToSmbFileName

public static String urlToSmbFileName(String url)
                               throws RegainException
Gets the smb file name that is described by a URL with the smb:// protocol.

Parameters:
url - The URL to get the file name for.
Returns:
The smb file name that matches the URL.
Throws:
RegainException - If the URL's protocol isn't smb://.

fileNameToUrl

public static String fileNameToUrl(String fileName)
                            throws RegainException
Returns the URL of a file name.

Parameters:
fileName - The file name to get the URL for
Returns:
The URL of the file.
Throws:
RegainException - If URL-encoding failed.

fileToUrl

public static String fileToUrl(File file)
                        throws RegainException
Returns the URL of a file.

Parameters:
file - The file to get the URL for
Returns:
The URL of the file.
Throws:
RegainException - If URL-encoding failed.

fileToCanonicalUrl

public static String fileToCanonicalUrl(File file)
                                 throws RegainException
Gets the canonical URL of a file (no symbolic links, normalised names etc.). Symbolic link detection may fail in certain situations, for example on NFS file systems.

Parameters:
file - The file to get the canonical URL for
Returns:
The URL of the file.
Throws:
RegainException - If URL-encoding failed.

urlEncode

public static String urlEncode(String text,
                               String encoding)
                        throws RegainException
URL-encodes a String.

Parameters:
text - The String to URL-encode.
encoding - The encoding to use.
Returns:
The URL-encoded String.
Throws:
RegainException - If URL-encoding failed.

urlDecode

public static String urlDecode(String text,
                               String encoding)
                        throws RegainException
URL-decodes a String.

Parameters:
text - The String to URL-decode.
encoding - The encoding to use.
Returns:
The URL-decoded String.
Throws:
RegainException - If URL-decoding failed.
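
A minimal sketch of an encode and decode pair that takes the encoding as a name, built on java.net.URLEncoder and java.net.URLDecoder. That Regain delegates to these classes is an assumption; the sketch also reports an unknown encoding with IllegalArgumentException instead of RegainException, and the class name UrlCodecSketch is hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical stand-in for illustration only, not the Regain source.
final class UrlCodecSketch {

    static String urlEncode(String text, String encoding) {
        try {
            return URLEncoder.encode(text, encoding);
        } catch (UnsupportedEncodingException exc) {
            // The checked exception is wrapped so callers see one failure type.
            throw new IllegalArgumentException("Unsupported encoding: " + encoding, exc);
        }
    }

    static String urlDecode(String text, String encoding) {
        try {
            return URLDecoder.decode(text, encoding);
        } catch (UnsupportedEncodingException exc) {
            throw new IllegalArgumentException("Unsupported encoding: " + encoding, exc);
        }
    }

    public static void main(String[] args) {
        String encoded = urlEncode("my file/name", "UTF-8");
        System.out.println(encoded);                     // prints "my+file%2Fname"
        System.out.println(urlDecode(encoded, "UTF-8")); // prints "my file/name"
    }
}
```

Both directions must use the same encoding, which is presumably why the class exposes the INDEX_ENCODING constant for URLs stored in the index.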

createSummaryFromContent

public static String createSummaryFromContent(String content,
                                              int maxLength)
Creates a summary from the given content.

The method returns null if no summary could be created.

Parameters:
content - The content to create the summary from.
maxLength - The maximum length of the created summary.
Returns:
The summary (the first n characters of the content).
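
A minimal sketch of a summary that is simply the leading part of the content. Which inputs count as "no summary could be created" is not documented here; the sketch assumes null or empty content and a non-positive maxLength. The real method may apply further rules, and the class name SummarySketch is hypothetical.

```java
// Hypothetical stand-in for illustration only, not the Regain source.
final class SummarySketch {

    static String createSummaryFromContent(String content, int maxLength) {
        // Assumed conditions under which no summary can be created.
        if (content == null || content.isEmpty() || maxLength <= 0) {
            return null;
        }
        // Short content is returned unchanged, longer content is cut off.
        return content.length() <= maxLength ? content : content.substring(0, maxLength);
    }

    public static void main(String[] args) {
        System.out.println(createSummaryFromContent("Regain is a search engine", 6)); // prints "Regain"
    }
}
```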

createHighlightedFieldIdent

public static String createHighlightedFieldIdent(String fieldName)
Creates a field identifier for fields with highlighted content. All highlighted content will be stored in a field named 'highlightedOldfieldname', where oldfieldname is the original field name converted to lowercase before renaming.

The method returns null if no field identifier could be created.

Parameters:
fieldName - The name of the field to create the highlighted field identifier for.
Returns:
the new field identifier
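
The naming scheme 'highlightedOldfieldname' can be sketched as follows: lowercase the field name, capitalize its first character, and prepend "highlighted". Returning null for a null or empty name is an assumption, and the class name HighlightIdentSketch is hypothetical.

```java
import java.util.Locale;

// Hypothetical stand-in for illustration only, not the Regain source.
final class HighlightIdentSketch {

    static String createHighlightedFieldIdent(String fieldName) {
        if (fieldName == null || fieldName.isEmpty()) {
            return null; // assumed case in which no identifier can be created
        }
        // Lowercase first, then capitalize only the first character.
        String lower = fieldName.toLowerCase(Locale.ROOT);
        return "highlighted" + Character.toUpperCase(lower.charAt(0)) + lower.substring(1);
    }

    public static void main(String[] args) {
        System.out.println(createHighlightedFieldIdent("content")); // prints "highlightedContent"
    }
}
```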

Regain 2.1.0-STABLE, Copyright (C) 2004-2010 Til Schneider, www.murfman.de, Thomas Tesche, www.clustersystems.info