XmlCrawlerConfig (API documentation for Regain 2.1.0-STABLE)

net.sf.regain.crawler.config
Class XmlCrawlerConfig

java.lang.Object
  net.sf.regain.crawler.config.XmlCrawlerConfig

All Implemented Interfaces:: CrawlerConfig

public class XmlCrawlerConfig
extends Object
implements CrawlerConfig

Reads the configuration settings from an XML file and makes them available.

Author:: Til Schneider, www.murfman.de

Field Summary
`private String`	`mAnalyzerType` The analyzer type to use.
`private AuxiliaryField[]`	`mAuxiliaryFieldArr` The list of the auxiliary fields.
`private UrlMatcher[]`	`mBlackList` The black list.
`private int`	`mBreakpointInterval` The interval between two breakpoints in minutes.
`private boolean`	`mBuildIndex` Specifies whether a search index should be created.
`private String`	`mCrawlerAccessControllerClass` The class name of the CrawlerAccessController to use.
`private Properties`	`mCrawlerAccessControllerConfig` The configuration of the CrawlerAccessController.
`private String`	`mCrawlerAccessControllerJar` The name of the JAR file to load the CrawlerAccessController from.
`private PreparatorSettings[]`	`mCrawlerPluginSettingsArr` The list with the crawler plugin settings.
`private String[]`	`mExclusionList` Contains all words that the analyzer should not modify during indexing.
`private String`	`mFinishedWithFatalsFileName` The name of the control file for a failed index build.
`private String`	`mFinishedWithoutFatalsFileName` The name of the control file for a successful index build.
`private UrlPattern[]`	`mHtmlParserUrlPatterns` The UrlPatterns the HTML parser should use to identify URLs.
`private int`	`mHttpTimeoutSecs` The timeout for HTTP downloads.
`private String`	`mIndexDir` The directory where the search index should be located.
`private boolean`	`mLoadUnparsedUrls` Specifies whether URLs should be loaded that are neither parsed nor indexed.
`private int`	`mMaxCycleCount` The maximum count of equal occurrences of path-parts in a URI.
`private double`	`mMaxFailedDocuments` The maximum percentage of failed documents (0..100) that is tolerated for an index to be released.
`private int`	`mMaxFieldLength` The maximum number of terms per document.
`private int`	`mMaxSummaryLength` The maximum number of characters that will be copied from the content to the summary.
`private PreparatorSettings[]`	`mPreparatorSettingsArr` The list with the preparator settings.
`private String`	`mProxyHost` The host name of the proxy server.
`private String`	`mProxyPassword` The password for authenticating with the proxy server.
`private String`	`mProxyPort` The port of the proxy server.
`private String`	`mProxyUser` The user name for authenticating with the proxy server.
`private StartUrl[]`	`mStartUrls` The start URLs.
`private String[]`	`mStopWordList` List of all stop words (words which will not be indexed).
`private String[]`	`mURLCleaners`
`private String[]`	`mUseLinkTextAsTitleRegexList` The regular expressions that a document's URL must match so that the text of the link that pointed to the document is used as the document title instead of the actual document title.
`private String`	`mUserAgent` The user agent the crawler should use to identify itself to the HTTP server(s).
`private String[]`	`mValuePrefetchFields` The names of the fields to prefetch the distinct values for.
`private WhiteListEntry[]`	`mWhiteListEntryArr` The white list.
`private boolean`	`mWriteAnalysisFiles` Specifies whether analysis files should be written.
`private boolean`	`storeContentForPreview` Flag for enabling/disabling content for a preview on the result page.

Constructor Summary
`XmlCrawlerConfig(File xmlFile)` Creates a new XmlCrawlerConfig instance.

Method Summary
`String`	`getAnalyzerType()` Returns the analyzer type to use.
`AuxiliaryField[]`	`getAuxiliaryFieldList()` Gets the list of the auxiliary fields.
`UrlMatcher[]`	`getBlackList()` Gets the black list.
`int`	`getBreakpointInterval()` Returns the interval between two breakpoints in minutes.
`boolean`	`getBuildIndex()` Returns whether a search index should be created.
`String`	`getCrawlerAccessControllerClass()` Gets the class name of the `CrawlerAccessController` to use.
`Properties`	`getCrawlerAccessControllerConfig()` Gets the configuration of the `CrawlerAccessController`.
`String`	`getCrawlerAccessControllerJar()` Gets the name of the JAR file to load the `CrawlerAccessController` from.
`PreparatorSettings[]`	`getCrawlerPluginSettingsList()` Gets the list with the crawler plugin settings.
`String[]`	`getExclusionList()` Returns all words that the analyzer should not modify during indexing.
`String`	`getFinishedWithFatalsFileName()` Returns the name of the control file for a failed index build.
`String`	`getFinishedWithoutFatalsFileName()` Returns the name of the control file for a successful index build.
`UrlPattern[]`	`getHtmlParserUrlPatterns()` Returns the UrlPatterns the HTML parser should use to identify URLs.
`int`	`getHttpTimeoutSecs()` Returns the timeout for HTTP downloads.
`String`	`getIndexDir()` Returns the directory where the finished search index should be located.
`boolean`	`getLoadUnparsedUrls()` Returns whether URLs should be loaded that are neither parsed nor indexed.
`int`	`getMaxCycleCount()` Returns the maximum count of equal occurrences of path-parts in a URI.
`double`	`getMaxFailedDocuments()` Returns the maximum percentage of failed documents (0..1). If the ratio of failed documents to the total number of documents is greater than this percentage, the index is discarded.
`int`	`getMaxFieldLength()` Returns the maximum number of terms that will be indexed for a single field in a document.
`int`	`getMaxSummaryLength()` Returns the maximum number of characters that will be copied from the content to the summary.
`PreparatorSettings[]`	`getPreparatorSettingsList()` Gets the list with the preparator settings.
`String`	`getProxyHost()` Returns the host name of the proxy server.
`String`	`getProxyPassword()` Returns the password for authenticating with the proxy server.
`String`	`getProxyPort()` Returns the port of the proxy server.
`String`	`getProxyUser()` Returns the user name for authenticating with the proxy server.
`StartUrl[]`	`getStartUrls()` Returns the start URLs where the crawler process should begin.
`String[]`	`getStopWordList()` Returns all words that should not be indexed.
`boolean`	`getStoreContentForPreview()` Returns the flag for enabling/disabling the content preview.
`String[]`	`getUntokenizedFieldNames()` Returns the names of the fields that shouldn't be tokenized.
`String[]`	`getURLCleaners()` Returns the URLCleaners.
`String[]`	`getUseLinkTextAsTitleRegexList()` Returns the regular expressions that a document's URL must match so that the text of the link that pointed to the document is used as the document title instead of the actual document title.
`String`	`getUserAgent()` Returns the user agent the crawler should use to identify itself to the HTTP server(s).
`String[]`	`getValuePrefetchFields()` The names of the fields to prefetch the distinct values for.
`WhiteListEntry[]`	`getWhiteList()` Gets the white list.
`boolean`	`getWriteAnalysisFiles()` Returns whether analysis files should be written.
`private void`	`readAuxiliaryFieldList(org.w3c.dom.Node config)` Reads the list of auxiliary fields.
`private void`	`readBlackList(org.w3c.dom.Node config)` Reads the black list from the configuration.
`private void`	`readControlFileConfig(org.w3c.dom.Node config)` Reads the names of the control files from the configuration.
`private void`	`readCrawlerAccessController(org.w3c.dom.Node config)` Reads which CrawlerAccessController to use.
`private void`	`readCrawlerPluginConfigSettingsList(org.w3c.dom.Node config, File xmlFile)` Reads the list of crawler plugin settings.
`private void`	`readHtmlParserUrlPatterns(org.w3c.dom.Node config)` Reads the URL-patterns for the old HTML-parser from the config.
`private void`	`readHttpTimeoutSecs(org.w3c.dom.Element config)` Reads the timeout for HTTP downloads from the configuration.
`private void`	`readIndexConfig(org.w3c.dom.Node config)` Reads the settings that concern the search index from the configuration.
`private void`	`readLoadUnparsedUrls(org.w3c.dom.Element config)` Reads from the configuration whether documents should be loaded that are neither indexed nor searched for URLs.
`private void`	`readMaxCycleCount(org.w3c.dom.Element config)` Reads the value for the cycle detection.
`private void`	`readMaxSummaryLength(org.w3c.dom.Element config)` Reads the value for the maximum summary length.
`private PreparatorConfig`	`readPreparatorConfig(org.w3c.dom.Node prepConfig, File xmlFile, String className)` Reads the configuration of a preparator from a node.
`private void`	`readPreparatorSettingsList(org.w3c.dom.Node config, File xmlFile)` Reads the list of preparator settings.
`private void`	`readProxyConfig(org.w3c.dom.Node config)` Reads the proxy settings from the configuration.
`private org.apache.regexp.RE`	`readRegexChild(org.w3c.dom.Node node)` Reads the regex child node from a node.
`private void`	`readStartUrls(org.w3c.dom.Node config)` Reads the start URLs from the configuration.
`private void`	`readURLCleaner(org.w3c.dom.Element config)` Reads the URLCleaners from the config.
`private void`	`readUseLinkTextAsTitleRegexList(org.w3c.dom.Node config)` Reads from the configuration the list of regular expressions that a document's URL must match so that the text of the link that pointed to the document is used as the document title instead of the actual document title.
`private void`	`readUserAgent(org.w3c.dom.Element config)` Reads the user agent from the config.
`private void`	`readWhiteList(org.w3c.dom.Node config)` Reads the white list from the configuration.

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

mProxyHost

private String mProxyHost

The host name of the proxy server.

mProxyPort

private String mProxyPort

The port of the proxy server.

mProxyUser

private String mProxyUser

The user name for authenticating with the proxy server.

mProxyPassword

private String mProxyPassword

The password for authenticating with the proxy server.

mUserAgent

private String mUserAgent

The user agent the crawler should use to identify itself to the HTTP server(s).

mLoadUnparsedUrls

private boolean mLoadUnparsedUrls

Specifies whether URLs should be loaded that are neither parsed nor indexed.

mBuildIndex

private boolean mBuildIndex

Specifies whether a search index should be created.

mHttpTimeoutSecs

private int mHttpTimeoutSecs

The timeout for HTTP downloads. This value determines the maximum total time in seconds that an HTTP download may take.

mIndexDir

private String mIndexDir

The directory where the search index should be located.

mMaxFieldLength

private int mMaxFieldLength

The maximum number of terms per document.

mMaxCycleCount

private int mMaxCycleCount

The maximum count of equal occurrences of path-parts in a URI.

mAnalyzerType

private String mAnalyzerType

The analyzer type to use.

mStopWordList

private String[] mStopWordList

List of all stop words (words which will not be indexed).

mExclusionList

private String[] mExclusionList

Contains all words that the analyzer should not modify during indexing.

mWriteAnalysisFiles

private boolean mWriteAnalysisFiles

Specifies whether analysis files should be written.

mBreakpointInterval

private int mBreakpointInterval

The interval between two breakpoints in minutes.

mMaxFailedDocuments

private double mMaxFailedDocuments

The maximum percentage of failed documents (0..100) that is tolerated for an index to be released.

mFinishedWithoutFatalsFileName

private String mFinishedWithoutFatalsFileName

The name of the control file for a successful index build.

mFinishedWithFatalsFileName

private String mFinishedWithFatalsFileName

The name of the control file for a failed index build.

mStartUrls

private StartUrl[] mStartUrls

The start URLs.

mHtmlParserUrlPatterns

private UrlPattern[] mHtmlParserUrlPatterns

The UrlPatterns the HTML parser should use to identify URLs.

mBlackList

private UrlMatcher[] mBlackList

The black list.

mWhiteListEntryArr

private WhiteListEntry[] mWhiteListEntryArr

The white list.

mValuePrefetchFields

private String[] mValuePrefetchFields

The names of the fields to prefetch the distinct values for.

mUseLinkTextAsTitleRegexList

private String[] mUseLinkTextAsTitleRegexList

The regular expressions that a document's URL must match so that the text of the link that pointed to the document is used as the document title instead of the actual document title.

mPreparatorSettingsArr

private PreparatorSettings[] mPreparatorSettingsArr

The list with the preparator settings.

mCrawlerPluginSettingsArr

private PreparatorSettings[] mCrawlerPluginSettingsArr

The list with the crawler plugin settings.

mAuxiliaryFieldArr

private AuxiliaryField[] mAuxiliaryFieldArr

The list of the auxiliary fields. May be null.

mCrawlerAccessControllerClass

private String mCrawlerAccessControllerClass

The class name of the CrawlerAccessController to use.

mCrawlerAccessControllerJar

private String mCrawlerAccessControllerJar

The name of the JAR file to load the CrawlerAccessController from.

mCrawlerAccessControllerConfig

private Properties mCrawlerAccessControllerConfig

The configuration of the CrawlerAccessController.

mMaxSummaryLength

private int mMaxSummaryLength

The maximum number of characters that will be copied from the content to the summary.

storeContentForPreview

private boolean storeContentForPreview

Flag for enabling/disabling content for a preview on the result page.

mURLCleaners

private String[] mURLCleaners

Constructor Detail

XmlCrawlerConfig

public XmlCrawlerConfig(File xmlFile)
                 throws RegainException

Creates a new XmlCrawlerConfig instance.

Parameters:: xmlFile - The XML file to read the configuration from.
Throws:: RegainException - If the configuration could not be read correctly.

Method Detail

readURLCleaner

private void readURLCleaner(org.w3c.dom.Element config)
                     throws RegainException

Reads the URLCleaners from the config. URLCleaners are regular expressions whose matches are replaced with an empty string, which in effect removes each match from the URL.

Parameters:: config - The configuration to read from.
Throws:: RegainException - If the configuration has an error.
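
The effect of a URLCleaner can be illustrated with a minimal sketch. `UrlCleanerSketch` and its `clean` method are hypothetical names, not part of Regain, and the sketch assumes `java.util.regex` rather than whichever regex engine the crawler actually applies to these strings.

```java
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Regain. */
final class UrlCleanerSketch {

  /** Removes every match of every cleaner regex from the URL. */
  static String clean(String url, String[] urlCleaners) {
    String result = url;
    if (urlCleaners != null) {
      for (String cleaner : urlCleaners) {
        // Replacing each match with "" removes it from the URL.
        result = Pattern.compile(cleaner).matcher(result).replaceAll("");
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // Assumed example cleaner: strips a session id from the path.
    String[] cleaners = { ";jsessionid=[A-Za-z0-9]+" };
    System.out.println(
        clean("http://example.com/page;jsessionid=ABC123?x=1", cleaners));
  }
}
```

A typical use would be to strip session identifiers so that the same document is not crawled under several URLs.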

readMaxCycleCount

private void readMaxCycleCount(org.w3c.dom.Element config)
                        throws RegainException

Reads the value for the cycle detection.

Parameters:: config - The configuration to read from.
Throws:: RegainException - If the configuration has an error.

readLoadUnparsedUrls

private void readLoadUnparsedUrls(org.w3c.dom.Element config)
                           throws RegainException

Reads from the configuration whether documents should be loaded that are neither indexed nor searched for URLs.

Parameters:: config - The configuration to read from.
Throws:: RegainException - If the configuration has an error.

readHttpTimeoutSecs

private void readHttpTimeoutSecs(org.w3c.dom.Element config)
                          throws RegainException

Reads the timeout for HTTP downloads from the configuration.

Parameters:: config - The configuration to read from.
Throws:: RegainException - If the configuration has an error.
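
Reading a numeric setting such as the HTTP timeout from an `org.w3c.dom.Element` could look roughly like the sketch below. The element name `httpTimeout`, the class `ConfigIntReaderSketch`, and the default value are assumptions for illustration; the real method throws `RegainException`, for which `IllegalArgumentException` stands in here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper, not part of Regain. */
final class ConfigIntReaderSketch {

  /** Reads the integer text of the first element with the given name. */
  static int readIntChild(Element config, String childName, int defaultValue) {
    // Note: getElementsByTagName also finds nested descendants.
    NodeList children = config.getElementsByTagName(childName);
    if (children.getLength() == 0) {
      return defaultValue; // setting is absent: fall back to the default
    }
    String text = children.item(0).getTextContent().trim();
    try {
      return Integer.parseInt(text);
    } catch (NumberFormatException e) {
      throw new IllegalArgumentException(
          "Value of <" + childName + "> is not an integer: " + text, e);
    }
  }

  /** Parses an XML string and returns its root element. */
  static Element parse(String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
          .getDocumentElement();
    } catch (Exception e) {
      throw new IllegalStateException("Could not parse XML", e);
    }
  }

  public static void main(String[] args) {
    Element config = parse(
        "<configuration><httpTimeout>180</httpTimeout></configuration>");
    System.out.println(readIntChild(config, "httpTimeout", 60));
  }
}
```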

readUserAgent

private void readUserAgent(org.w3c.dom.Element config)
                    throws RegainException

Reads the user agent from the config.

Parameters:: config - The configuration to read from.
Throws:: RegainException - If the configuration has an error.

readProxyConfig

private void readProxyConfig(org.w3c.dom.Node config)
                      throws RegainException

Reads the proxy settings from the configuration.

Parameters:: config - The configuration to read from.
Throws:: RegainException - If the configuration has an error.

readIndexConfig

private void readIndexConfig(org.w3c.dom.Node config)
                      throws RegainException

Reads the settings that concern the search index from the configuration.

Parameters:: config - The configuration to read from.
Throws:: RegainException - If the configuration has an error.

readControlFileConfig

private void readControlFileConfig(org.w3c.dom.Node config)
                            throws RegainException

Reads the names of the control files from the configuration.

Parameters:: config - The configuration to read from.
Throws:: RegainException - If the configuration has an error.

readStartUrls

private void readStartUrls(org.w3c.dom.Node config)
                    throws RegainException

Reads the start URLs from the configuration.

Parameters:: config - The configuration to read from.
Throws:: RegainException - If the configuration has an error.

readHtmlParserUrlPatterns

private void readHtmlParserUrlPatterns(org.w3c.dom.Node config)
                                throws RegainException

Reads the URL-patterns for the old HTML-parser from the config.

These are used to identify URLs when an HTML document is searched.

Parameters:: config - The configuration to read from.
Throws:: RegainException - If the configuration has an error.

readBlackList

private void readBlackList(org.w3c.dom.Node config)
                    throws RegainException

Reads the black list from the configuration.

Documents whose URL matches an entry of the black list won't be processed.

Parameters:: config - The configuration to read from.
Throws:: RegainException - If the configuration has an error.

readWhiteList

private void readWhiteList(org.w3c.dom.Node config)
                    throws RegainException

Reads the white list from the configuration.

Documents will only be processed if their URL matches an entry of the white list.

Parameters:: config - The configuration to read from.
Throws:: RegainException - If the configuration has an error.

readUseLinkTextAsTitleRegexList

private void readUseLinkTextAsTitleRegexList(org.w3c.dom.Node config)
                                      throws RegainException

Reads from the configuration the list of regular expressions that a document's URL must match so that the text of the link that pointed to the document is used as the document title instead of the actual document title.

Parameters:: config - The configuration to read from.
Throws:: RegainException - If the configuration has an error.

readPreparatorSettingsList

private void readPreparatorSettingsList(org.w3c.dom.Node config,
                                        File xmlFile)
                                 throws RegainException

Reads the list of preparator settings.

Parameters:: config - The configuration to read from; xmlFile - The file the configuration was read from.
Throws:: RegainException - If the configuration has errors.

readCrawlerPluginConfigSettingsList

private void readCrawlerPluginConfigSettingsList(org.w3c.dom.Node config,
                                                 File xmlFile)
                                          throws RegainException

Reads the list of crawler plugin settings. (optional)

Parameters:: config - The configuration to read from; xmlFile - The file the configuration was read from.
Throws:: RegainException - If the configuration has errors.

readAuxiliaryFieldList

private void readAuxiliaryFieldList(org.w3c.dom.Node config)
                             throws RegainException

Reads the list of auxiliary fields.

Parameters:: config - The configuration to read from
Throws:: RegainException - If the configuration has errors.

readRegexChild

private org.apache.regexp.RE readRegexChild(org.w3c.dom.Node node)
                                     throws RegainException

Reads the regex child node from a node. Can also read the old style, where the regex is directly in the node text.

Parameters:: node - The node to read the regex node from
Returns:: The compiled regular expression
Throws:: RegainException - If there is no regular expression or if the regex could not be compiled.
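
The two supported styles (a dedicated child node, or the regex directly in the node text) can be sketched as follows. This is an illustration under assumptions: the child element name `regex` and the class `RegexChildSketch` are hypothetical, `java.util.regex.Pattern` is used in place of `org.apache.regexp.RE`, and `IllegalArgumentException` stands in for `RegainException`.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical helper, not part of Regain. */
final class RegexChildSketch {

  static Pattern readRegexChild(Node node) {
    String regexText = null;
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      Node child = children.item(i);
      // New style: the regex sits in a child element (name assumed).
      if (child.getNodeType() == Node.ELEMENT_NODE
          && "regex".equals(child.getNodeName())) {
        regexText = child.getTextContent();
        break;
      }
    }
    if (regexText == null) {
      // Old style: the regex is the text of the node itself.
      regexText = node.getTextContent();
    }
    regexText = (regexText == null) ? "" : regexText.trim();
    if (regexText.isEmpty()) {
      throw new IllegalArgumentException("No regular expression given");
    }
    try {
      return Pattern.compile(regexText);
    } catch (PatternSyntaxException e) {
      throw new IllegalArgumentException("Regex does not compile: " + regexText, e);
    }
  }

  /** Parses an XML string and returns its root node. */
  static Node parse(String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
          .getDocumentElement();
    } catch (Exception e) {
      throw new IllegalStateException("Could not parse XML", e);
    }
  }
}
```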

readPreparatorConfig

private PreparatorConfig readPreparatorConfig(org.w3c.dom.Node prepConfig,
                                              File xmlFile,
                                              String className)
                                       throws RegainException

Reads the configuration of a preparator from a node.

Parameters:: prepConfig - The node to read the preparator config from.; xmlFile - The file the configuration was read from.; className - The class name of the preparator.
Returns:: The configuration of a preparator.
Throws:: RegainException - If the configuration has errors.

readCrawlerAccessController

private void readCrawlerAccessController(org.w3c.dom.Node config)
                                  throws RegainException

Reads which CrawlerAccessController to use.

Parameters:: config - The configuration to read from.
Throws:: RegainException - If the configuration has errors.

getProxyHost

public String getProxyHost()

Returns the host name of the proxy server. Returns null if no host was configured.

Specified by:: getProxyHost in interface CrawlerConfig

Returns:: The host name of the proxy server.

getProxyPort

public String getProxyPort()

Returns the port of the proxy server. Returns null if no port was configured.

Specified by:: getProxyPort in interface CrawlerConfig

Returns:: The port of the proxy server.

getProxyUser

public String getProxyUser()

Returns the user name for authenticating with the proxy server. Returns null if no user name was configured.

Specified by:: getProxyUser in interface CrawlerConfig

Returns:: The user name for authenticating with the proxy server.

getProxyPassword

public String getProxyPassword()

Returns the password for authenticating with the proxy server. Returns null if no password was configured.

Specified by:: getProxyPassword in interface CrawlerConfig

Returns:: The password for authenticating with the proxy server.
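
Because the proxy getters return strings and may return null, a caller has to handle the unconfigured case and convert the port itself. The sketch below shows one way to do that with `java.net.Proxy`; `ProxySettingsSketch` is a hypothetical name, and the fallback port 80 is an assumption, not documented Regain behavior.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;

/** Hypothetical helper, not part of Regain. */
final class ProxySettingsSketch {

  /** Builds a Proxy from values as returned by getProxyHost()/getProxyPort(). */
  static Proxy toProxy(String proxyHost, String proxyPort) {
    if (proxyHost == null) {
      return Proxy.NO_PROXY; // no host configured: connect directly
    }
    // Assumption: fall back to port 80 when no port was configured.
    int port = (proxyPort == null) ? 80 : Integer.parseInt(proxyPort.trim());
    return new Proxy(Proxy.Type.HTTP,
        InetSocketAddress.createUnresolved(proxyHost, port));
  }

  public static void main(String[] args) {
    System.out.println(toProxy("proxy.example.com", "8080").type());
  }
}
```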

getUserAgent

public String getUserAgent()

Description copied from interface: CrawlerConfig

Returns the user agent the crawler should use to identify itself to the HTTP server(s). If null, the default (Java) user agent should be used.

Specified by:: getUserAgent in interface CrawlerConfig

Returns:: the user agent to use.

getHttpTimeoutSecs

public int getHttpTimeoutSecs()

Returns the timeout for HTTP downloads. This value determines the maximum total time in seconds that an HTTP download may take.

Specified by:: getHttpTimeoutSecs in interface CrawlerConfig

Returns:: The timeout for HTTP downloads.

getLoadUnparsedUrls

public boolean getLoadUnparsedUrls()

Returns whether URLs should be loaded that are neither parsed nor indexed.

Specified by:: getLoadUnparsedUrls in interface CrawlerConfig

Returns:: Whether URLs should be loaded that are neither parsed nor indexed.

getBuildIndex

public boolean getBuildIndex()

Returns whether a search index should be created.

Specified by:: getBuildIndex in interface CrawlerConfig

Returns:: Whether a search index should be created.

getIndexDir

public String getIndexDir()

Returns the directory where the finished search index should be located.

Specified by:: getIndexDir in interface CrawlerConfig

Returns:: The directory where the finished search index should be located.

getAnalyzerType

public String getAnalyzerType()

Returns the analyzer type to use.

Specified by:: getAnalyzerType in interface CrawlerConfig

Returns:: The analyzer type to use.

getMaxFieldLength

public int getMaxFieldLength()

Description copied from interface: CrawlerConfig

Returns the maximum number of terms that will be indexed for a single field in a document.

Is <= 0 if Lucene's default should be used.

Specified by:: getMaxFieldLength in interface CrawlerConfig

Returns:: the maximum number of terms per document.

getStopWordList

public String[] getStopWordList()

Returns all words that should not be indexed.

Specified by:: getStopWordList in interface CrawlerConfig

Returns:: All words that should not be indexed.

getExclusionList

public String[] getExclusionList()

Returns all words that the analyzer should not modify during indexing.

Specified by:: getExclusionList in interface CrawlerConfig

Returns:: All words that the analyzer should not modify during indexing.

getWriteAnalysisFiles

public boolean getWriteAnalysisFiles()

Returns whether analysis files should be written.

These files help to check the quality of the index build and are created in a subdirectory of the index directory.

Specified by:: getWriteAnalysisFiles in interface CrawlerConfig

Returns:: Whether analysis files should be written.

getBreakpointInterval

public int getBreakpointInterval()

Returns the interval between two breakpoints in minutes. If set to 0, no breakpoints will be created.

Specified by:: getBreakpointInterval in interface CrawlerConfig

Returns:: the interval between two breakpoints in minutes.
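
How a caller might interpret this value, including the special case 0, is sketched below. `BreakpointSketch` and `isBreakpointDue` are hypothetical; the sketch only illustrates the documented semantics (minutes, 0 disables breakpoints) and does not reflect how the crawler schedules breakpoints internally.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper, not part of Regain. */
final class BreakpointSketch {

  /** Returns true if enough time has passed to create the next breakpoint. */
  static boolean isBreakpointDue(long lastBreakpointMillis, long nowMillis,
                                 int breakpointIntervalMinutes) {
    if (breakpointIntervalMinutes <= 0) {
      return false; // 0 means: never create breakpoints
    }
    long intervalMillis = TimeUnit.MINUTES.toMillis(breakpointIntervalMinutes);
    return nowMillis - lastBreakpointMillis >= intervalMillis;
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    System.out.println(isBreakpointDue(now - 700_000L, now, 10));
  }
}
```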

getMaxFailedDocuments

public double getMaxFailedDocuments()

Returns the maximum percentage of failed documents (0..1).

If the ratio of failed documents to the total number of documents is greater than this percentage, the index is discarded.

Failed documents are documents that either do not exist (dead link) or could not be read.

Specified by:: getMaxFailedDocuments in interface CrawlerConfig

Returns:: The maximum percentage of failed documents.
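
The discard rule described above comes down to a single ratio comparison. The following sketch assumes the 0..1 scale stated for this getter; `FailedDocumentsSketch` and the handling of an empty crawl are illustrative assumptions.

```java
/** Hypothetical helper, not part of Regain. */
final class FailedDocumentsSketch {

  /**
   * Returns true if the share of failed documents exceeds the tolerated
   * maximum (given as a fraction between 0 and 1).
   */
  static boolean shouldDiscardIndex(int failedDocuments, int totalDocuments,
                                    double maxFailedDocuments) {
    if (totalDocuments <= 0) {
      return false; // assumption: nothing crawled means nothing to compare
    }
    double failedRatio = (double) failedDocuments / totalDocuments;
    return failedRatio > maxFailedDocuments;
  }

  public static void main(String[] args) {
    // 5 of 100 documents failed, 1 % is tolerated.
    System.out.println(shouldDiscardIndex(5, 100, 0.01));
  }
}
```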

getFinishedWithoutFatalsFileName

public String getFinishedWithoutFatalsFileName()

Returns the name of the control file for a successful index build.

This file is created when the index was built without any fatal errors.

Returns null if no control file should be created.

Specified by:: getFinishedWithoutFatalsFileName in interface CrawlerConfig

Returns:: The name of the control file for a successful index build.

getFinishedWithFatalsFileName

public String getFinishedWithFatalsFileName()

Returns the name of the control file for a failed index build.

This file is created when the index was built but fatal errors occurred.

Returns null if no control file should be created.

Specified by:: getFinishedWithFatalsFileName in interface CrawlerConfig

Returns:: The name of the control file for a failed index build.

getStoreContentForPreview

public boolean getStoreContentForPreview()

Returns the flag for enabling/disabling the content preview.

Specified by:: getStoreContentForPreview in interface CrawlerConfig

Returns:: true if the content preview is enabled and the whole content should be stored in the index.

getStartUrls

public StartUrl[] getStartUrls()

Returns the start URLs where the crawler process should begin.

Specified by:: getStartUrls in interface CrawlerConfig

Returns:: The start URLs.

getHtmlParserUrlPatterns

public UrlPattern[] getHtmlParserUrlPatterns()

Returns the UrlPatterns the HTML parser should use to identify URLs.

Specified by:: getHtmlParserUrlPatterns in interface CrawlerConfig

Returns:: The UrlPatterns for the HTML parser.

getBlackList

public UrlMatcher[] getBlackList()

Gets the black list.

The black list is an array of UrlMatchers that a URL must not match in order to be processed.

Specified by:: getBlackList in interface CrawlerConfig

Returns:: The black list.

getWhiteList

public WhiteListEntry[] getWhiteList()

Gets the white list.

The white list is an array of WhiteListEntry objects; a URL must match one of them in order to be processed.

Specified by:: getWhiteList in interface CrawlerConfig

Returns:: The white list.
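
Taken together, the white list and the black list act as a filter: a URL is processed only if it matches a white list entry and no black list entry. The sketch below models this with plain `Predicate<String>` matchers as stand-ins for `UrlMatcher` and `WhiteListEntry`, whose real interfaces are not shown here; all names in it are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Regain. */
final class UrlFilterSketch {

  /** Matcher that accepts URLs starting with the given prefix. */
  static Predicate<String> prefix(String urlPrefix) {
    return url -> url.startsWith(urlPrefix);
  }

  /** Matcher that accepts URLs containing a match of the given regex. */
  static Predicate<String> regex(String expression) {
    Pattern pattern = Pattern.compile(expression);
    return url -> pattern.matcher(url).find();
  }

  /** A URL is accepted if it is on the white list and not on the black list. */
  static boolean isAccepted(String url,
                            List<Predicate<String>> whiteList,
                            List<Predicate<String>> blackList) {
    boolean onWhiteList = whiteList.stream().anyMatch(m -> m.test(url));
    boolean onBlackList = blackList.stream().anyMatch(m -> m.test(url));
    return onWhiteList && !onBlackList;
  }

  public static void main(String[] args) {
    List<Predicate<String>> white = Arrays.asList(prefix("http://example.com/"));
    List<Predicate<String>> black = Arrays.asList(regex("/private/"));
    System.out.println(isAccepted("http://example.com/docs/a.html", white, black));
  }
}
```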

getValuePrefetchFields

public String[] getValuePrefetchFields()

Description copied from interface: CrawlerConfig

The names of the fields to prefetch the distinct values for.

Used for speeding up the search:input_fieldlist tag.

Specified by:: getValuePrefetchFields in interface CrawlerConfig

Returns:: the names of the fields to prefetch the distinct values for. May be null or empty.

getUseLinkTextAsTitleRegexList

public String[] getUseLinkTextAsTitleRegexList()

Returns the regular expressions that a document's URL must match so that the text of the link that pointed to the document is used as the document title instead of the actual document title.

Specified by:: getUseLinkTextAsTitleRegexList in interface CrawlerConfig

Returns:: The regular expressions that determine the documents for which the link text should be used as the title.
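
The title selection this list controls can be sketched as below. `LinkTextTitleSketch` is a hypothetical name, and the sketch assumes that the whole URL has to match an expression (full match with `java.util.regex`); the actual matching semantics are not stated in this documentation.

```java
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Regain. */
final class LinkTextTitleSketch {

  /** Picks the link text as title if the URL matches one of the expressions. */
  static String chooseTitle(String url, String documentTitle, String linkText,
                            String[] useLinkTextAsTitleRegexList) {
    if (linkText != null && useLinkTextAsTitleRegexList != null) {
      for (String regex : useLinkTextAsTitleRegexList) {
        // Assumption: the expression has to match the whole URL.
        if (Pattern.matches(regex, url)) {
          return linkText;
        }
      }
    }
    return documentTitle; // default: keep the document's own title
  }

  public static void main(String[] args) {
    String[] regexList = { ".*\\.pdf" };
    System.out.println(chooseTitle("http://example.com/report.pdf",
        "untitled", "Annual report", regexList));
  }
}
```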

getPreparatorSettingsList

public PreparatorSettings[] getPreparatorSettingsList()

Gets the list with the preparator settings.

Specified by:: getPreparatorSettingsList in interface CrawlerConfig

Returns:: The list with the preparator settings.

getCrawlerPluginSettingsList

public PreparatorSettings[] getCrawlerPluginSettingsList()

Gets the list with the crawler plugin settings.

Specified by:: getCrawlerPluginSettingsList in interface CrawlerConfig

Returns:: The list with the crawler plugin settings.

getAuxiliaryFieldList

public AuxiliaryField[] getAuxiliaryFieldList()

Gets the list of the auxiliary fields.

Specified by:: getAuxiliaryFieldList in interface CrawlerConfig

Returns:: The list of the auxiliary fields. May be null.

getCrawlerAccessControllerClass

public String getCrawlerAccessControllerClass()

Gets the class name of the CrawlerAccessController to use. Returns null if no CrawlerAccessController should be used.

Specified by:: getCrawlerAccessControllerClass in interface CrawlerConfig

Returns:: The class name of the CrawlerAccessController.

getCrawlerAccessControllerJar

public String getCrawlerAccessControllerJar()

Gets the name of the JAR file to load the CrawlerAccessController from. Returns null if the CrawlerAccessController is already in the classpath.

Specified by:: getCrawlerAccessControllerJar in interface CrawlerConfig

Returns:: The name of the JAR file to load the CrawlerAccessController from.

getCrawlerAccessControllerConfig

public Properties getCrawlerAccessControllerConfig()

Gets the configuration of the CrawlerAccessController. May be null.

Specified by:: getCrawlerAccessControllerConfig in interface CrawlerConfig

Returns:: The configuration of the CrawlerAccessController.

getMaxCycleCount

public int getMaxCycleCount()

Returns the maximum count of equal occurrences of path-parts in a URI.

Specified by:: getMaxCycleCount in interface CrawlerConfig

Returns:: MaxCycleCount
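
One plausible reading of this setting is a guard against crawler loops such as `/a/b/a/b/a/b/...`. The sketch below counts how often each path segment occurs and flags a URI once a segment occurs more often than the limit. `CycleDetectionSketch` is hypothetical, and both the segment-wise counting and the strict "greater than" comparison are assumptions about the behavior.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper, not part of Regain. */
final class CycleDetectionSketch {

  /** Returns true if any path part occurs more than maxCycleCount times. */
  static boolean exceedsMaxCycleCount(String uri, int maxCycleCount) {
    String path = URI.create(uri).getPath();
    if (path == null) {
      return false; // opaque URIs have no path to inspect
    }
    Map<String, Integer> occurrences = new HashMap<>();
    for (String part : path.split("/")) {
      if (part.isEmpty()) {
        continue; // skip the empty part before the leading slash
      }
      int count = occurrences.merge(part, 1, Integer::sum);
      if (count > maxCycleCount) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    System.out.println(
        exceedsMaxCycleCount("http://example.com/a/b/a/b/a/b", 2));
  }
}
```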

getMaxSummaryLength

public int getMaxSummaryLength()

Returns the maximum number of characters that will be copied from the content to the summary.

Specified by:: getMaxSummaryLength in interface CrawlerConfig

Returns:: MaxSummaryLength
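
In its simplest form, applying this limit means cutting the content after the configured number of characters. The sketch below shows that truncation; `SummarySketch` is a hypothetical name, and the real crawler may cut at word boundaries or treat the value differently.

```java
/** Hypothetical helper, not part of Regain. */
final class SummarySketch {

  /** Copies at most maxSummaryLength characters of the content. */
  static String toSummary(String content, int maxSummaryLength) {
    int limit = Math.max(0, maxSummaryLength);
    if (content.length() <= limit) {
      return content; // short content is used unchanged
    }
    return content.substring(0, limit);
  }

  public static void main(String[] args) {
    System.out.println(toSummary("Hello world", 5));
  }
}
```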

readMaxSummaryLength

private void readMaxSummaryLength(org.w3c.dom.Element config)
                           throws RegainException

Reads the value for the maximum summary length.

Parameters:: config - The configuration to read from.
Throws:: RegainException - If the configuration has an error.

getUntokenizedFieldNames

public String[] getUntokenizedFieldNames()

Returns the names of the fields that shouldn't be tokenized.

Specified by:: getUntokenizedFieldNames in interface CrawlerConfig

Returns:: The names of the fields that shouldn't be tokenized.

getURLCleaners

public String[] getURLCleaners()

Returns the URLCleaners. URLCleaners are regular expressions whose matches are replaced with an empty string, which in effect removes each match from the URL.

Specified by:: getURLCleaners in interface CrawlerConfig

Returns:: the URLCleaners
