DummyCrawlerConfig (API documentation for Regain 2.1.0-STABLE)

net.sf.regain.crawler.config
Class DummyCrawlerConfig

java.lang.Object
  net.sf.regain.crawler.config.DummyCrawlerConfig

All Implemented Interfaces:: CrawlerConfig

public class DummyCrawlerConfig
extends Object
implements CrawlerConfig

Provides all configurable settings as hard-coded values.

Author:: Til Schneider, www.murfman.de
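As a rough illustration of what "hard-coded" means here, the sketch below implements a cut-down configuration interface with fixed return values. `MiniCrawlerConfig`, `HardcodedMiniConfig`, and the concrete values are hypothetical and only mirror three getters of `CrawlerConfig`; the real interface has many more methods, and the values the real class returns are not documented on this page.

```java
/** Hypothetical, cut-down stand-in for CrawlerConfig; only three getters are mirrored. */
interface MiniCrawlerConfig {
    String getProxyHost();
    int getHttpTimeoutSecs();
    boolean getBuildIndex();
}

/** Hard-coded implementation in the spirit of DummyCrawlerConfig. All values are illustrative. */
final class HardcodedMiniConfig implements MiniCrawlerConfig {

    @Override
    public String getProxyHost() {
        return null; // null means that no proxy host is configured
    }

    @Override
    public int getHttpTimeoutSecs() {
        return 180; // assumed example value, in seconds
    }

    @Override
    public boolean getBuildIndex() {
        return true; // assumed example value
    }

    public static void main(String[] args) {
        MiniCrawlerConfig config = new HardcodedMiniConfig();
        System.out.println("proxyHost=" + config.getProxyHost());
        System.out.println("httpTimeoutSecs=" + config.getHttpTimeoutSecs());
        System.out.println("buildIndex=" + config.getBuildIndex());
    }
}
```

A configuration like this can be useful in tests, where reading a real configuration file would be unnecessary overhead.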

Constructor Summary
`DummyCrawlerConfig()`

Method Summary
`String`	`getAnalyzerType()` Returns the analyzer type to use.
`AuxiliaryField[]`	`getAuxiliaryFieldList()` Gets the list of the auxiliary fields.
`UrlMatcher[]`	`getBlackList()` Gets the black list.
`int`	`getBreakpointInterval()` Returns the interval between two breakpoints in minutes.
`boolean`	`getBuildIndex()` Returns whether a search index should be built.
`String`	`getCrawlerAccessControllerClass()` Gets the class name of the `CrawlerAccessController` to use.
`Properties`	`getCrawlerAccessControllerConfig()` Gets the configuration of the `CrawlerAccessController`.
`String`	`getCrawlerAccessControllerJar()` Gets the name of jar file to load the `CrawlerAccessController` from.
`PreparatorSettings[]`	`getCrawlerPluginSettingsList()` Gets the list with the crawler plugin settings.
`String[]`	`getExclusionList()` Returns all words that the analyzer should not modify during indexing.
`String`	`getFinishedWithFatalsFileName()` Returns the name of the control file for a failed index creation.
`String`	`getFinishedWithoutFatalsFileName()` Returns the name of the control file for a successful index creation.
`UrlPattern[]`	`getHtmlParserUrlPatterns()` Returns the UrlPatterns the HTML parser should use to identify URLs.
`int`	`getHttpTimeoutSecs()` Returns the timeout for HTTP downloads.
`String`	`getIndexDir()` Returns the directory in which the search index should be located.
`boolean`	`getLoadUnparsedUrls()` Returns whether URLs should be loaded that are neither parsed nor indexed.
`int`	`getMaxCycleCount()` Returns the maximum count of equal occurrences of path parts in a URI.
`double`	`getMaxFailedDocuments()` Returns the maximum percentage of failed documents, expressed as a fraction (0..1). If the ratio of failed documents to the total number of documents exceeds this value, the index is discarded.
`int`	`getMaxFieldLength()` Returns the maximum number of terms that will be indexed for a single field in a document.
`int`	`getMaxSummaryLength()` Returns the maximum number of characters that will be copied from the content to the summary.
`PreparatorSettings[]`	`getPreparatorSettingsList()` Gets the list with the preparator settings.
`String`	`getProxyHost()` Returns the host name of the proxy server.
`String`	`getProxyPassword()` Returns the password for authenticating with the proxy server.
`String`	`getProxyPort()` Returns the port of the proxy server.
`String`	`getProxyUser()` Returns the user name for authenticating with the proxy server.
`StartUrl[]`	`getStartUrls()` Returns the start URLs at which the crawler process should begin.
`String[]`	`getStopWordList()` Returns all words that should not be indexed.
`boolean`	`getStoreContentForPreview()` Returns the flag for enabling/disabling the content-preview
`String[]`	`getUntokenizedFieldNames()` Returns the names of the fields that shouldn't be tokenized.
`String[]`	`getURLCleaners()` Returns the URLCleaners.
`String[]`	`getUseLinkTextAsTitleRegexList()` Returns the regular expressions that a document's URL must match so that the text of the link that pointed to the document is used as the document title instead of the actual document title.
`String`	`getUserAgent()` Returns the user agent the crawler should use to identify itself to the HTTP server(s).
`String[]`	`getValuePrefetchFields()` The names of the fields to prefetch the distinct values for.
`WhiteListEntry[]`	`getWhiteList()` Gets the white list.
`boolean`	`getWriteAnalysisFiles()` Returns whether analysis files should be written.

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Constructor Detail

DummyCrawlerConfig

public DummyCrawlerConfig()

Method Detail

getStoreContentForPreview

public boolean getStoreContentForPreview()

Returns the flag for enabling/disabling the content-preview

Specified by:: getStoreContentForPreview in interface CrawlerConfig

Returns:: boolean true if content preview is enabled and the whole content should be stored in the index

getProxyHost

public String getProxyHost()

Returns the host name of the proxy server. Returns null if no host was configured.

Specified by:: getProxyHost in interface CrawlerConfig

Returns:: The host name of the proxy server.

getMaxCycleCount

public int getMaxCycleCount()

Returns the maximum count of equal occurrences of path parts in a URI.

Specified by:: getMaxCycleCount in interface CrawlerConfig

Returns:: The maximum cycle count.
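The following minimal sketch shows one way such a limit could be checked. It is not Regain's implementation: the class `CycleCheckSketch`, the method name, and the choice to treat a URI as cyclic once a path part occurs more often than the limit are all assumptions made for illustration.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper, not part of Regain. */
final class CycleCheckSketch {

    /**
     * Returns true if any path part of the URI occurs more often than maxCycleCount.
     * Assumption: "more often than" (rather than "at least as often as") triggers the limit.
     */
    static boolean exceedsMaxCycleCount(String uri, int maxCycleCount) {
        String path = URI.create(uri).getPath();
        if (path == null) {
            return false; // opaque URIs such as mailto: have no path
        }
        Map<String, Integer> counts = new HashMap<>();
        for (String part : path.split("/")) {
            if (part.isEmpty()) {
                continue; // skip the empty segment before the leading slash
            }
            int count = counts.merge(part, 1, Integer::sum);
            if (count > maxCycleCount) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String uri = "http://example.com/a/b/a/b/a/b";
        System.out.println(exceedsMaxCycleCount(uri, 2));
        System.out.println(exceedsMaxCycleCount(uri, 3));
    }
}
```

A check of this kind guards against crawler traps in which relative links keep appending the same directories to a path.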

getProxyPort

public String getProxyPort()

Returns the port of the proxy server. Returns null if no port was configured.

Specified by:: getProxyPort in interface CrawlerConfig

Returns:: The port of the proxy server.

getProxyUser

public String getProxyUser()

Returns the user name for authenticating with the proxy server. Returns null if no user name was configured.

Specified by:: getProxyUser in interface CrawlerConfig

Returns:: The user name for authenticating with the proxy server.

getProxyPassword

public String getProxyPassword()

Returns the password for authenticating with the proxy server. Returns null if no password was configured.

Specified by:: getProxyPassword in interface CrawlerConfig

Returns:: The password for authenticating with the proxy server.
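Because the proxy getters return null when nothing is configured, a caller has to handle that case explicitly. The sketch below shows one possible way to turn the host and port strings into a `java.net.Proxy`; the class `ProxySketch` and the rule of falling back to a direct connection when either value is null are assumptions, not documented Regain behavior. User name and password are not handled here; in plain Java they would typically be supplied through a `java.net.Authenticator`.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;

/** Hypothetical helper, not part of Regain. */
final class ProxySketch {

    /**
     * Builds an HTTP proxy from the configured host and port.
     * Assumption: if either value is null, a direct connection is used.
     */
    static Proxy proxyFor(String proxyHost, String proxyPort) {
        if (proxyHost == null || proxyPort == null) {
            return Proxy.NO_PROXY;
        }
        // The port is a String in the configuration, so it has to be parsed first.
        int port = Integer.parseInt(proxyPort);
        return new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(proxyHost, port));
    }

    public static void main(String[] args) {
        System.out.println(proxyFor(null, null).type());
        System.out.println(proxyFor("proxy.example.com", "3128").type());
    }
}
```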

getUserAgent

public String getUserAgent()

Description copied from interface: CrawlerConfig

Returns the user agent the crawler should use to identify itself to the HTTP server(s). If null, the default (Java) user agent should be used.

Specified by:: getUserAgent in interface CrawlerConfig

Returns:: the user agent to use.

getHttpTimeoutSecs

public int getHttpTimeoutSecs()

Returns the timeout for HTTP downloads. This value determines the maximum total time in seconds that an HTTP download may take.

Specified by:: getHttpTimeoutSecs in interface CrawlerConfig

Returns:: The timeout for HTTP downloads.

getLoadUnparsedUrls

public boolean getLoadUnparsedUrls()

Returns whether URLs should be loaded that are neither parsed nor indexed.

Specified by:: getLoadUnparsedUrls in interface CrawlerConfig

Returns:: Whether URLs should be loaded that are neither parsed nor indexed.

getBuildIndex

public boolean getBuildIndex()

Returns whether a search index should be built.

Specified by:: getBuildIndex in interface CrawlerConfig

Returns:: Whether a search index should be built.

getIndexDir

public String getIndexDir()

Returns the directory in which the search index should be located.

Specified by:: getIndexDir in interface CrawlerConfig

Returns:: The directory in which the search index should be located.

getAnalyzerType

public String getAnalyzerType()

Returns the analyzer type to use.

Specified by:: getAnalyzerType in interface CrawlerConfig

Returns:: The analyzer type to use.

getMaxFieldLength

public int getMaxFieldLength()

Description copied from interface: CrawlerConfig

Returns the maximum number of terms that will be indexed for a single field in a document.

Is <= 0 if Lucene's default should be used.

Specified by:: getMaxFieldLength in interface CrawlerConfig

Returns:: the maximum number of terms per document.

getStopWordList

public String[] getStopWordList()

Returns all words that should not be indexed.

Specified by:: getStopWordList in interface CrawlerConfig

Returns:: All words that should not be indexed.

getExclusionList

public String[] getExclusionList()

Returns all words that the analyzer should not modify during indexing.

Specified by:: getExclusionList in interface CrawlerConfig

Returns:: All words that the analyzer should not modify during indexing.

getWriteAnalysisFiles

public boolean getWriteAnalysisFiles()

Returns whether analysis files should be written.

These files help to check the quality of the index creation and are created in a subdirectory of the index directory.

Specified by:: getWriteAnalysisFiles in interface CrawlerConfig

Returns:: Whether analysis files should be written.

getBreakpointInterval

public int getBreakpointInterval()

Returns the interval between two breakpoints in minutes. If set to 0, no breakpoints will be created.

Specified by:: getBreakpointInterval in interface CrawlerConfig

Returns:: the interval between two breakpoints in minutes.

getMaxFailedDocuments

public double getMaxFailedDocuments()

Returns the maximum percentage of failed documents, expressed as a fraction (0..1).

If the ratio of failed documents to the total number of documents exceeds this value, the index is discarded.

Failed documents are documents that either do not exist (dead link) or could not be read.

Specified by:: getMaxFailedDocuments in interface CrawlerConfig

Returns:: The maximum percentage of failed documents.
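The threshold rule described above can be sketched as follows. `FailedDocumentCheck` and its method are hypothetical names and not part of Regain; how an empty crawl (zero documents) is handled is also an assumption.

```java
/** Hypothetical helper, not part of Regain. */
final class FailedDocumentCheck {

    /**
     * Returns true if the share of failed documents exceeds the configured maximum.
     *
     * @param failedCount        number of documents that did not exist or could not be read
     * @param totalCount         total number of documents
     * @param maxFailedDocuments threshold in the range 0..1, as returned by getMaxFailedDocuments()
     */
    static boolean shouldDiscardIndex(int failedCount, int totalCount, double maxFailedDocuments) {
        if (totalCount <= 0) {
            // Assumption: without any documents there is no ratio to compare.
            return false;
        }
        double failedRatio = (double) failedCount / totalCount;
        return failedRatio > maxFailedDocuments;
    }

    public static void main(String[] args) {
        System.out.println(shouldDiscardIndex(3, 10, 0.25)); // ratio 0.3
        System.out.println(shouldDiscardIndex(2, 10, 0.25)); // ratio 0.2
    }
}
```

Note that the comparison is strict: a ratio exactly equal to the threshold does not discard the index, matching the wording "exceeds".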

getFinishedWithoutFatalsFileName

public String getFinishedWithoutFatalsFileName()

Returns the name of the control file for a successful index creation.

This file is created when the index was built without any fatal errors.

Returns null if no control file should be created.

Specified by:: getFinishedWithoutFatalsFileName in interface CrawlerConfig

Returns:: The name of the control file for a successful index creation.

getFinishedWithFatalsFileName

public String getFinishedWithFatalsFileName()

Returns the name of the control file for a failed index creation.

This file is created when the index was built but fatal errors occurred.

Returns null if no control file should be created.

Specified by:: getFinishedWithFatalsFileName in interface CrawlerConfig

Returns:: The name of the control file for a failed index creation.

getStartUrls

public StartUrl[] getStartUrls()

Returns the start URLs at which the crawler process should begin.

Specified by:: getStartUrls in interface CrawlerConfig

Returns:: The start URLs.

getHtmlParserUrlPatterns

public UrlPattern[] getHtmlParserUrlPatterns()

Returns the UrlPatterns the HTML parser should use to identify URLs.

Specified by:: getHtmlParserUrlPatterns in interface CrawlerConfig

Returns:: The UrlPatterns for the HTML parser.

getBlackList

public UrlMatcher[] getBlackList()

Gets the black list.

The black list is an array of UrlMatchers that a URL must not match in order to be processed.

Specified by:: getBlackList in interface CrawlerConfig

Returns:: The black list.

getWhiteList

public WhiteListEntry[] getWhiteList()

Gets the white list.

The white list is an array of WhiteListEntry objects that a URL must match in order to be processed.

Specified by:: getWhiteList in interface CrawlerConfig

Returns:: The white list.
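Taken together, the black list and the white list decide whether a URL is processed. The sketch below illustrates that combination with plain `Predicate<String>` objects standing in for `UrlMatcher` and `WhiteListEntry`, whose actual APIs are not shown on this page. The class `UrlFilterSketch` is hypothetical, and so is the reading that a URL must match at least one white list entry and no black list entry.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical helper, not part of Regain. Predicates stand in for UrlMatcher and WhiteListEntry. */
final class UrlFilterSketch {

    /**
     * Assumption: a URL is processed if at least one white list entry matches
     * and no black list entry matches.
     */
    static boolean isAccepted(String url,
                              List<Predicate<String>> whiteList,
                              List<Predicate<String>> blackList) {
        boolean onWhiteList = whiteList.stream().anyMatch(matcher -> matcher.test(url));
        boolean onBlackList = blackList.stream().anyMatch(matcher -> matcher.test(url));
        return onWhiteList && !onBlackList;
    }

    public static void main(String[] args) {
        List<Predicate<String>> whiteList =
                Arrays.<Predicate<String>>asList(url -> url.startsWith("http://example.com/"));
        List<Predicate<String>> blackList =
                Arrays.<Predicate<String>>asList(url -> url.contains("/private/"));

        System.out.println(isAccepted("http://example.com/docs/a.html", whiteList, blackList));
        System.out.println(isAccepted("http://example.com/private/a.html", whiteList, blackList));
        System.out.println(isAccepted("http://other.example.org/a.html", whiteList, blackList));
    }
}
```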

getValuePrefetchFields

public String[] getValuePrefetchFields()

Description copied from interface: CrawlerConfig

The names of the fields to prefetch the distinct values for.

Used for speeding up the search:input_fieldlist tag.

Specified by:: getValuePrefetchFields in interface CrawlerConfig

Returns:: the names of the fields to prefetch the distinct values for. May be null or empty.

getUseLinkTextAsTitleRegexList

public String[] getUseLinkTextAsTitleRegexList()

Returns the regular expressions that a document's URL must match so that the text of the link that pointed to the document is used as the document title instead of the actual document title.

Specified by:: getUseLinkTextAsTitleRegexList in interface CrawlerConfig

Returns:: The regular expressions that determine the documents for which the link text should be used as the title.
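The title selection described above could look roughly like the sketch below. `TitleChoiceSketch` and its parameters are hypothetical, and the sketch assumes that a regular expression has to match the whole URL; the real crawler may match differently.

```java
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Regain. */
final class TitleChoiceSketch {

    /**
     * Returns the link text if the URL matches one of the regular expressions,
     * otherwise the document's own title.
     * Assumption: a regex must match the entire URL.
     */
    static String chooseTitle(String url, String documentTitle, String linkText, String[] regexList) {
        if (regexList != null && linkText != null) {
            for (String regex : regexList) {
                if (Pattern.matches(regex, url)) {
                    return linkText;
                }
            }
        }
        return documentTitle;
    }

    public static void main(String[] args) {
        String[] regexList = { ".*\\.pdf" };
        System.out.println(chooseTitle("http://example.com/report.pdf", "Untitled", "Annual report", regexList));
        System.out.println(chooseTitle("http://example.com/index.html", "Home", "Start page", regexList));
    }
}
```

This is handy for document types such as PDF files, whose embedded titles are often missing or meaningless.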

getPreparatorSettingsList

public PreparatorSettings[] getPreparatorSettingsList()

Gets the list with the preparator settings.

Specified by:: getPreparatorSettingsList in interface CrawlerConfig

Returns:: The list with the preparator settings.

getCrawlerPluginSettingsList

public PreparatorSettings[] getCrawlerPluginSettingsList()

Gets the list with the crawler plugin settings.

Specified by:: getCrawlerPluginSettingsList in interface CrawlerConfig

Returns:: The list with the crawler plugin settings.

getAuxiliaryFieldList

public AuxiliaryField[] getAuxiliaryFieldList()

Gets the list of the auxiliary fields.

Specified by:: getAuxiliaryFieldList in interface CrawlerConfig

Returns:: The list of the auxiliary fields. May be null.

getCrawlerAccessControllerClass

public String getCrawlerAccessControllerClass()

Gets the class name of the CrawlerAccessController to use. Returns null if no CrawlerAccessController should be used.

Specified by:: getCrawlerAccessControllerClass in interface CrawlerConfig

Returns:: The class name of the CrawlerAccessController.

getCrawlerAccessControllerJar

public String getCrawlerAccessControllerJar()

Gets the name of jar file to load the CrawlerAccessController from. Returns null if the CrawlerAccessController already is in the classpath.

Specified by:: getCrawlerAccessControllerJar in interface CrawlerConfig

Returns:: The name of jar file to load the CrawlerAccessController from.

getCrawlerAccessControllerConfig

public Properties getCrawlerAccessControllerConfig()

Gets the configuration of the CrawlerAccessController. May be null.

Specified by:: getCrawlerAccessControllerConfig in interface CrawlerConfig

Returns:: The configuration of the CrawlerAccessController.

getUntokenizedFieldNames

public String[] getUntokenizedFieldNames()

Returns the names of the fields that shouldn't be tokenized.

Specified by:: getUntokenizedFieldNames in interface CrawlerConfig

Returns:: The names of the fields that shouldn't be tokenized.

getMaxSummaryLength

public int getMaxSummaryLength()

Returns the maximum number of characters that will be copied from the content to the summary.

Specified by:: getMaxSummaryLength in interface CrawlerConfig

Returns:: MaxSummaryLength

getURLCleaners

public String[] getURLCleaners()

Returns the URLCleaners. URLCleaners are regular expressions whose matches are replaced with an empty string, which in effect removes the match from the URL.

Specified by:: getURLCleaners in interface CrawlerConfig

Returns:: the URLCleaners
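A minimal sketch of how such cleaners might be applied is shown below. `UrlCleanerSketch` is a hypothetical helper, and the session-ID pattern is only an example; neither is taken from Regain.

```java
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Regain. */
final class UrlCleanerSketch {

    /** Removes every match of every cleaner regex from the URL, in array order. */
    static String clean(String url, String[] urlCleaners) {
        if (urlCleaners == null) {
            return url; // assumption: null means that no cleaners are configured
        }
        String result = url;
        for (String regex : urlCleaners) {
            result = Pattern.compile(regex).matcher(result).replaceAll("");
        }
        return result;
    }

    public static void main(String[] args) {
        // Example cleaner: strips a ";jsessionid=..." path parameter.
        String[] cleaners = { ";jsessionid=[^?#]*" };
        System.out.println(clean("http://example.com/page;jsessionid=ABC123?x=1", cleaners));
    }
}
```

Cleaning URLs this way helps to avoid indexing the same document several times under URLs that differ only in volatile parts.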
