DocumentFactory (API documentation for Regain 2.1.0-STABLE)

Regain 2.1.0-STABLE API

net.sf.regain.crawler.document
Class DocumentFactory

java.lang.Object
  net.sf.regain.crawler.document.DocumentFactory

public class DocumentFactory
extends Object

Factory that creates a Lucene document from the URL and the raw data of a document. The resulting document contains only the document's text, cleaned of all formatting, together with its URL and its title.

Author:: Til Schneider, www.murfman.de
See Also:: Document

Field Summary
`private File`	`mAnalysisDir` The directory in which analysis files are to be created.
`private CrawlerConfig`	`mConfig` The crawler config.
`private CrawlerAccessController`	`mCrawlerAccessController` The `CrawlerAccessController` to use for identifying the groups that are allowed to read a document.
`private static String`	`MIME_TYPE_UNKNOWN`
`(package private) org.semanticdesktop.aperture.mime.identifier.MimeTypeIdentifier`	`mimeTypeIdentifier` The MIME type identifier.
`private static org.apache.log4j.Logger`	`mLog` The logger for this class
`private int`	`mMaxSummaryLength` The maximum amount of characters which will be copied from content to summary
`private Preparator[]`	`mPreparatorArr` The preparators.
`private Profiler[]`	`mPreparatorProfilerArr` The profilers that measure the processing done by the preparators.
`private org.apache.regexp.RE[]`	`mUseLinkTextAsTitleReArr` The regular expressions that, when one of them matches, cause the text of the link to a document to be used as its title instead of the document's own title.
`private Profiler`	`mWriteAnalysisProfiler` The profiler that measures the addition to the index.
`private CrawlerPluginManager`	`pluginManager` Crawler Plugin Manager instance
`private boolean`	`storeContentForPreview` Whether the whole content should be stored in the index for a preview on the result page.

Constructor Summary
`DocumentFactory(CrawlerConfig config, File analysisDir)` Creates a new instance of DocumentFactory.

Method Summary
`void`	`close()` Releases all resources that were used by the preparators.
`private org.apache.lucene.document.Document`	`createDocument(Preparator preparator, Profiler preparatorProfiler, RawDocument rawDocument)` Creates a lucene `Document` from a `RawDocument` using a certain Preparator.
`org.apache.lucene.document.Document`	`createDocument(RawDocument rawDocument, ErrorLogger errorLogger)` Creates a lucene `Document` from a `RawDocument`.
`private org.apache.lucene.document.Document`	`createDocument(RawDocument rawDocument, String cleanedContent, String title, String summary, String metadata, String headlines, PathElement[] path, Map<String,String> additionalFieldMap)` Create a lucene `Document`.
`private org.apache.lucene.document.Document`	`createSubstituteDocument(RawDocument rawDocument)` Creates a substitute lucene `Document` for a `RawDocument`.
`private String`	`createSummaryFromContent(String content)` Creates a summary from the content of a document.
`private File`	`getAnalysisFile(String url, String extension)` Creates the file name of an analysis file.
`private boolean`	`hasContent(String str)` Returns whether the string has content.
`private String`	`pathToString(PathElement[] path)` Converts a path into a string.
`void`	`writeAnalysisFile(String url, String extension, String content)` Writes an analysis file.
`private void`	`writeContentAnalysisFile(RawDocument rawDocument)` Writes an analysis file with the content of the raw document.

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

MIME_TYPE_UNKNOWN

private static final String MIME_TYPE_UNKNOWN

See Also:: Constant Field Values

mLog

private static org.apache.log4j.Logger mLog

The logger for this class

mConfig

private CrawlerConfig mConfig

The crawler config.

mMaxSummaryLength

private int mMaxSummaryLength

The maximum amount of characters which will be copied from content to summary

storeContentForPreview

private boolean storeContentForPreview

Whether the whole content should be stored in the index for a preview on the result page.

mAnalysisDir

private File mAnalysisDir

The directory in which analysis files are to be created. Is null if no analysis files should be created.

mPreparatorArr

private Preparator[] mPreparatorArr

The preparators.

mPreparatorProfilerArr

private Profiler[] mPreparatorProfilerArr

The profilers that measure the processing done by the preparators.

mCrawlerAccessController

private CrawlerAccessController mCrawlerAccessController

The CrawlerAccessController to use for identifying the groups that are allowed to read a document. May be null.

mUseLinkTextAsTitleReArr

private org.apache.regexp.RE[] mUseLinkTextAsTitleReArr

The regular expressions that, when one of them matches, cause the text of the link to a document to be used as its title instead of the document's own title.
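To illustrate the rule this field encodes, here is a minimal sketch. It uses `java.util.regex.Pattern` in place of `org.apache.regexp.RE`, and it assumes the expressions are tested against the document URL, which this page does not state. The class and method names are hypothetical and not part of Regain.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch; not the Regain source.
final class LinkTextTitleSketch {

    /**
     * Returns the link text as title if any pattern matches the URL
     * (assumption: the patterns are applied to the URL), otherwise the
     * document's own title.
     */
    static String chooseTitle(String url, String documentTitle,
                              String linkText, List<Pattern> patterns) {
        for (Pattern pattern : patterns) {
            // Fall back to the document title when no link text is known.
            if (linkText != null && pattern.matcher(url).find()) {
                return linkText;
            }
        }
        return documentTitle;
    }

    public static void main(String[] args) {
        List<Pattern> patterns = Arrays.asList(Pattern.compile("\\.pdf$"));
        System.out.println(chooseTitle("http://example.com/report.pdf",
                "untitled", "Annual report", patterns));
        System.out.println(chooseTitle("http://example.com/index.html",
                "Home", "Start page", patterns));
    }
}
```

Such a rule is typically useful for formats whose embedded titles are unreliable, where the text of the referring link describes the document better.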

mWriteAnalysisProfiler

private Profiler mWriteAnalysisProfiler

The profiler that measures the addition to the index.

mimeTypeIdentifier

org.semanticdesktop.aperture.mime.identifier.MimeTypeIdentifier mimeTypeIdentifier

The MIME type identifier.

pluginManager

private CrawlerPluginManager pluginManager

Crawler Plugin Manager instance

Constructor Detail

DocumentFactory

public DocumentFactory(CrawlerConfig config,
                       File analysisDir)
                throws RegainException

Creates a new instance of DocumentFactory.

Parameters:: config - The crawler configuration.; analysisDir - The directory in which to store the analysis files. Is null if no analysis files should be created.
Throws:: RegainException - If a preparator could not be created or if a regex has a syntax error.

Method Detail

createDocument

public org.apache.lucene.document.Document createDocument(RawDocument rawDocument,
                                                          ErrorLogger errorLogger)

Creates a lucene Document from a RawDocument.

Parameters:: rawDocument - The raw document.; errorLogger - The error logger to use for logging errors.
Returns:: The lucene document with the prepared data or null if the document couldn't be created.

createDocument

private org.apache.lucene.document.Document createDocument(Preparator preparator,
                                                           Profiler preparatorProfiler,
                                                           RawDocument rawDocument)
                                                    throws RegainException

Creates a lucene Document from a RawDocument using a certain Preparator.

Parameters:: preparator - The preparator to use.; preparatorProfiler - The profiler of the preparator.; rawDocument - The raw document.
Returns:: The lucene document with the prepared data.
Throws:: RegainException - If creating the document failed.

createSubstituteDocument

private org.apache.lucene.document.Document createSubstituteDocument(RawDocument rawDocument)
                                                              throws RegainException

Creates a substitute lucene Document for a RawDocument.

Substitute documents have no "content" field, but a "preparation-error" field. They are added to the index if preparation failed. This way at least the URL may be searched and following spider runs are much faster as a previously failed document is not retried.

Parameters:: rawDocument - The document to create the substitute document for.
Returns:: The substitute document.
Throws:: RegainException - If the user groups that are allowed to read this document couldn't be determined.
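The idea of a substitute document can be sketched with a plain map standing in for a Lucene `Document`. The field names "content" and "preparation-error" come from the description above. The "url" field name, the error message stored as the field value, and all class and method names are assumptions made for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-alone sketch; a Map stands in for a Lucene Document.
final class SubstituteDocumentSketch {

    /** Builds a substitute: searchable by URL, flagged as failed, no content. */
    static Map<String, String> createSubstituteDocument(String url, String errorMessage) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("url", url);                        // assumed field name
        fields.put("preparation-error", errorMessage); // assumed field value
        return fields;
    }

    /** A substitute has a "preparation-error" field and no "content" field. */
    static boolean isSubstitute(Map<String, String> document) {
        return document.containsKey("preparation-error")
                && !document.containsKey("content");
    }

    public static void main(String[] args) {
        Map<String, String> doc =
                createSubstituteDocument("http://example.com/broken.pdf", "parse failed");
        System.out.println(isSubstitute(doc));
    }
}
```

A later crawler run could use a check like `isSubstitute` to recognise a previously failed document and skip it instead of retrying the preparation.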

createDocument

private org.apache.lucene.document.Document createDocument(RawDocument rawDocument,
                                                           String cleanedContent,
                                                           String title,
                                                           String summary,
                                                           String metadata,
                                                           String headlines,
                                                           PathElement[] path,
                                                           Map<String,String> additionalFieldMap)
                                                    throws RegainException

Create a lucene Document.

Parameters:: rawDocument - The raw document to create the lucene Document for.; cleanedContent - The content of the document. (May be null, if the content couldn't be extracted. In this case a substitute document is created); title - The title. May be null.; summary - The summary. May be null.; metadata - The cleaned meta data. May be null.; headlines - The headlines. May be null.; path - The path to the document. May be null.; additionalFieldMap - The additional fields provided by the preparator.
Returns:: The lucene Document.
Throws:: RegainException - If the user groups that are allowed to read this document couldn't be determined.

hasContent

private boolean hasContent(String str)

Returns whether the string has content. This is the case if it is neither null nor an empty string.

Parameters:: str - The string to check.
Returns:: Whether the string has content.
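The check described here amounts to a one-line test. The following is a minimal sketch of that rule in a hypothetical stand-alone class; it is not copied from the Regain source.

```java
// Hypothetical stand-alone sketch; not the Regain source.
final class HasContentSketch {

    /** Returns true if str is neither null nor the empty string. */
    static boolean hasContent(String str) {
        return str != null && !str.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(hasContent(null));   // false
        System.out.println(hasContent(""));     // false
        System.out.println(hasContent("text")); // true
        System.out.println(hasContent(" "));    // true: this sketch does not trim
    }
}
```

Note that a string consisting only of whitespace counts as content under this rule, because the description only excludes null and the empty string.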

createSummaryFromContent

private String createSummaryFromContent(String content)

Creates a summary from the content of a document.

Returns null if no summary can be created.

Parameters:: content - The content for which the summary should be created.
Returns:: A summary of the document, or null if none could be created.
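This page does not describe how the summary is derived. One plausible approach, given the `mMaxSummaryLength` field, is to copy at most that many characters from the content. The sketch below assumes exactly that and additionally cuts at the last space inside the limit so that no word is split; both the strategy and the names are assumptions, not the Regain implementation.

```java
// Hypothetical stand-alone sketch; the truncation strategy is an assumption.
final class SummarySketch {

    /** Returns at most maxSummaryLength characters of content, or null. */
    static String createSummaryFromContent(String content, int maxSummaryLength) {
        if (content == null || maxSummaryLength <= 0) {
            return null;
        }
        String text = content.trim();
        if (text.isEmpty()) {
            return null; // nothing to summarise
        }
        if (text.length() <= maxSummaryLength) {
            return text;
        }
        // Prefer the last space at or before the limit to avoid splitting a word.
        int cut = text.lastIndexOf(' ', maxSummaryLength);
        if (cut <= 0) {
            cut = maxSummaryLength; // no space found: hard cut
        }
        return text.substring(0, cut).trim();
    }

    public static void main(String[] args) {
        System.out.println(createSummaryFromContent("The quick brown fox", 10));
        System.out.println(createSummaryFromContent("   ", 10));
    }
}
```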

pathToString

private String pathToString(PathElement[] path)

Converts a path into a string.

Parameters:: path - The path.
Returns:: The path as a string.

writeContentAnalysisFile

private void writeContentAnalysisFile(RawDocument rawDocument)

Writes an analysis file with the content of the raw document.

Parameters:: rawDocument -

writeAnalysisFile

public void writeAnalysisFile(String url,
                              String extension,
                              String content)

Writes an analysis file.

An analysis file contains the data of the document at each intermediate step of the preparation. It helps to check the quality of the index creation and is created in a subdirectory of the index directory.

Parameters:: url - The URL of the document.; extension - The extension the analysis file should get.; content - The content to write to the file.

getAnalysisFile

private File getAnalysisFile(String url,
                             String extension)

Creates the file name of an analysis file.

Parameters:: url - The URL of the document.; extension - The extension the analysis file should get.
Returns:: The file name of an analysis file.
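How the URL is mapped to a file name is not specified on this page. A minimal sketch of one possible mapping follows: every character outside a conservative safe set is replaced by an underscore and the extension is appended after a dot. The replacement rule, the dot separator, the extra `analysisDir` parameter, and the names are all assumptions.

```java
import java.io.File;

// Hypothetical stand-alone sketch; the URL-to-name mapping is an assumption.
final class AnalysisFileSketch {

    /** Derives a file in analysisDir from a document URL and an extension. */
    static File getAnalysisFile(File analysisDir, String url, String extension) {
        // Keep letters, digits, dot, underscore, and hyphen; replace the rest.
        String safeName = url.replaceAll("[^A-Za-z0-9._-]", "_");
        return new File(analysisDir, safeName + "." + extension);
    }

    public static void main(String[] args) {
        File file = getAnalysisFile(new File("analysis"),
                "http://example.com/a b.html", "txt");
        System.out.println(file.getName());
    }
}
```

A mapping like this is not injective: two different URLs can collapse to the same file name, which is acceptable for debugging output but would not be for a cache.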

close

public void close()

Releases all resources that were used by the preparators.

Called at the very end of the crawler process, after all documents have been processed.
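The calling pattern implied by this page (create documents one by one, tolerate a null result, call `close()` once at the very end) can be sketched as follows. The `Factory` interface is a hypothetical stand-in that uses strings instead of `RawDocument` and Lucene `Document`; it is not the Regain API, and the try/finally block is a design choice of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone sketch of the calling pattern; not the Regain API.
final class FactoryLifecycleSketch {

    /** Stand-in for the two public operations described on this page. */
    interface Factory {
        String createDocument(String rawDocument); // null if creation failed
        void close();
    }

    /** Processes all raw documents, then releases the factory's resources. */
    static List<String> indexAll(Factory factory, List<String> rawDocuments) {
        List<String> indexed = new ArrayList<>();
        try {
            for (String raw : rawDocuments) {
                String document = factory.createDocument(raw);
                if (document != null) { // skip documents that could not be created
                    indexed.add(document);
                }
            }
        } finally {
            factory.close(); // once, after all documents have been processed
        }
        return indexed;
    }

    public static void main(String[] args) {
        Factory factory = new Factory() {
            public String createDocument(String raw) {
                return raw.isEmpty() ? null : raw.toUpperCase();
            }
            public void close() {
                System.out.println("closed");
            }
        };
        System.out.println(indexAll(factory, Arrays.asList("a", "", "b")));
    }
}
```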
