Regain 2.1.0-STABLE API
java.lang.Object
  net.sf.regain.crawler.document.DocumentFactory
public class DocumentFactory
Factory that creates a Lucene document from the URL and the raw data of a document. The resulting document contains only the document's text, stripped of formatting, along with its URL and its title.
Document
| Field Summary | |
|---|---|
private File |
mAnalysisDir
The directory in which analysis files are to be created. |
private CrawlerConfig |
mConfig
The crawler config. |
private CrawlerAccessController |
mCrawlerAccessController
The CrawlerAccessController to use for identifying the groups that
are allowed to read a document. |
private static String |
MIME_TYPE_UNKNOWN
|
(package private) org.semanticdesktop.aperture.mime.identifier.MimeTypeIdentifier |
mimeTypeIdentifier
The MIME type identifier. |
private static org.apache.log4j.Logger |
mLog
The logger for this class |
private int |
mMaxSummaryLength
The maximum number of characters that are copied from the content into the summary. |
private Preparator[] |
mPreparatorArr
The preparators. |
private Profiler[] |
mPreparatorProfilerArr
The profilers that measure the processing done by the preparators. |
private org.apache.regexp.RE[] |
mUseLinkTextAsTitleReArr
The regular expressions that, if one of them applies, cause the link text pointing to a document to be used as its title instead of the document's own title. |
private Profiler |
mWriteAnalysisProfiler
The profiler that measures the addition to the index. |
private CrawlerPluginManager |
pluginManager
Crawler Plugin Manager instance |
private boolean |
storeContentForPreview
Whether the whole content should be stored in the index for a preview on the result page. |
| Constructor Summary | |
|---|---|
DocumentFactory(CrawlerConfig config,
File analysisDir)
Creates a new instance of DocumentFactory. |
|
| Method Summary | |
|---|---|
void |
close()
Releases all resources that were used by the preparators. |
private org.apache.lucene.document.Document |
createDocument(Preparator preparator,
Profiler preparatorProfiler,
RawDocument rawDocument)
Creates a lucene Document from a RawDocument using a
certain Preparator. |
org.apache.lucene.document.Document |
createDocument(RawDocument rawDocument,
ErrorLogger errorLogger)
Creates a lucene Document from a RawDocument. |
private org.apache.lucene.document.Document |
createDocument(RawDocument rawDocument,
String cleanedContent,
String title,
String summary,
String metadata,
String headlines,
PathElement[] path,
Map<String,String> additionalFieldMap)
Create a lucene Document. |
private org.apache.lucene.document.Document |
createSubstituteDocument(RawDocument rawDocument)
Creates a substitute lucene Document for a RawDocument. |
private String |
createSummaryFromContent(String content)
Creates a summary from the content of a document. |
private File |
getAnalysisFile(String url,
String extension)
Creates the file name of an analysis file. |
private boolean |
hasContent(String str)
Returns whether the string has content. |
private String |
pathToString(PathElement[] path)
Converts a path into a string. |
void |
writeAnalysisFile(String url,
String extension,
String content)
Writes an analysis file. |
private void |
writeContentAnalysisFile(RawDocument rawDocument)
Writes an analysis file with the content of the raw document. |
| Methods inherited from class java.lang.Object |
|---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Field Detail |
|---|
private static final String MIME_TYPE_UNKNOWN
private static org.apache.log4j.Logger mLog
private CrawlerConfig mConfig
private int mMaxSummaryLength
private boolean storeContentForPreview
private File mAnalysisDir
Is null if no analysis files are to be created.
private Preparator[] mPreparatorArr
private Profiler[] mPreparatorProfilerArr
private CrawlerAccessController mCrawlerAccessController
The CrawlerAccessController to use for identifying the groups that
are allowed to read a document. May be null.
private org.apache.regexp.RE[] mUseLinkTextAsTitleReArr
private Profiler mWriteAnalysisProfiler
org.semanticdesktop.aperture.mime.identifier.MimeTypeIdentifier mimeTypeIdentifier
private CrawlerPluginManager pluginManager
| Constructor Detail |
|---|
public DocumentFactory(CrawlerConfig config,
File analysisDir)
throws RegainException
config - The crawler configuration.
analysisDir - The directory in which to store the analysis files. Is null if no analysis files should be created.
RegainException - If a preparator could not be created or if a regex has a syntax error.
| Method Detail |
|---|
public org.apache.lucene.document.Document createDocument(RawDocument rawDocument,
ErrorLogger errorLogger)
Creates a lucene Document from a RawDocument.
rawDocument - The raw document.
errorLogger - The error logger to use for logging errors.
Returns null if the document couldn't be created.
private org.apache.lucene.document.Document createDocument(Preparator preparator,
Profiler preparatorProfiler,
RawDocument rawDocument)
throws RegainException
Creates a lucene Document from a RawDocument using a certain Preparator.
preparator - The preparator to use.
preparatorProfiler - The profiler of the preparator.
rawDocument - The raw document.
RegainException - If creating the document failed.
private org.apache.lucene.document.Document createSubstituteDocument(RawDocument rawDocument)
throws RegainException
Creates a substitute lucene Document for a RawDocument.
Substitute documents have no "content" field, but a "preparation-error" field. They are added to the index if preparation failed. This way at least the URL can be searched, and subsequent spider runs are much faster because a previously failed document is not retried.
rawDocument - The document to create the substitute document for.
RegainException - If the user groups that are allowed to read this document couldn't be determined.
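The difference between a regular and a substitute document can be sketched as follows. This is only an illustration under stated assumptions: a plain Map stands in for the lucene Document, the field name "url" and the use of an error message as the value of "preparation-error" are assumptions, and SubstituteDocumentSketch and createFields are hypothetical names that are not part of Regain.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in: a Map of field name to value replaces the lucene Document.
final class SubstituteDocumentSketch {

    /**
     * Builds the fields of an index entry. If cleanedContent is null,
     * preparation is treated as failed and a substitute entry is built:
     * no "content" field, but a "preparation-error" field.
     */
    public static Map<String, String> createFields(String url,
                                                   String cleanedContent,
                                                   String errorMessage) {
        Map<String, String> fields = new LinkedHashMap<>();
        // Assumption: the URL is stored under "url" so it stays searchable.
        fields.put("url", url);
        if (cleanedContent != null) {
            fields.put("content", cleanedContent);
        } else {
            // Assumption: the field value carries the error message.
            fields.put("preparation-error", errorMessage);
        }
        return fields;
    }
}
```

A later crawler run could then check for the "preparation-error" field to decide that a document failed before and does not need to be retried.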
private org.apache.lucene.document.Document createDocument(RawDocument rawDocument,
String cleanedContent,
String title,
String summary,
String metadata,
String headlines,
PathElement[] path,
Map<String,String> additionalFieldMap)
throws RegainException
Creates a lucene Document.
rawDocument - The raw document to create the lucene Document for.
cleanedContent - The content of the document. May be null if the content couldn't be extracted; in this case a substitute document is created.
title - The title. May be null.
summary - The summary. May be null.
metadata - The cleaned metadata. May be null.
headlines - The headlines. May be null.
path - The path to the document. May be null.
additionalFieldMap - The additional fields provided by the preparator.
Returns the lucene Document.
RegainException - If the user groups that are allowed to read this document couldn't be determined.
private boolean hasContent(String str)
Returns whether the string has content, that is, whether it is neither null nor an empty string.
str - The string to examine.
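A minimal sketch of such a check, following only the description above (neither null nor empty). The class name HasContentSketch is hypothetical, and the actual Regain implementation may differ, for example in how it treats whitespace-only strings.

```java
// Hypothetical illustration of the check described for hasContent.
final class HasContentSketch {

    /** Returns true only if str is neither null nor the empty string. */
    public static boolean hasContent(String str) {
        return str != null && !str.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(hasContent(null));
        System.out.println(hasContent(""));
        System.out.println(hasContent("text"));
    }
}
```

Under this reading a string consisting only of whitespace still counts as content.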
private String createSummaryFromContent(String content)
If no summary is possible, null is returned.
content - The content for which the summary is to be created.
Returns null if no summary could be created.
private String pathToString(PathElement[] path)
path - The path.
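For createSummaryFromContent, documented further up, one possible reading is sketched here. It assumes that the summary is the leading part of the content, limited to the maximum summary length (compare mMaxSummaryLength) and cut at a word boundary where possible. The truncation strategy, the extra maxSummaryLength parameter, and the class name SummarySketch are assumptions, not the documented behaviour.

```java
// Hypothetical sketch: summary = leading part of the content, at most
// maxSummaryLength characters, cut at the last space when there is one.
final class SummarySketch {

    /** Returns a summary, or null if none can be created. */
    public static String createSummaryFromContent(String content,
                                                  int maxSummaryLength) {
        if (content == null || content.isEmpty() || maxSummaryLength <= 0) {
            return null; // no summary possible
        }
        if (content.length() <= maxSummaryLength) {
            return content; // short content is its own summary
        }
        // Look for the last space at or before the length limit.
        int cut = content.lastIndexOf(' ', maxSummaryLength);
        if (cut <= 0) {
            cut = maxSummaryLength; // no usable word boundary: hard cut
        }
        return content.substring(0, cut);
    }
}
```

With the content "one two three four" and a limit of 9 characters, this sketch yields "one two".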
private void writeContentAnalysisFile(RawDocument rawDocument)
rawDocument -
public void writeAnalysisFile(String url,
String extension,
String content)
An analysis file contains the data of the document at each intermediate step of the preparation. It helps to check the quality of the index creation and is created in a subdirectory of the index directory.
url - The URL of the document.
extension - The extension the analysis file should get.
content - The content to write to the file.
private File getAnalysisFile(String url,
String extension)
url - The URL of the document.
extension - The extension the analysis file should get.
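How getAnalysisFile and writeAnalysisFile could work together is sketched below. The way the URL is turned into a file name (replacing every character outside letters, digits, dot, underscore and hyphen with an underscore), the UTF-8 encoding, the extension being passed without a leading dot, and the class name AnalysisFileSketch are all assumptions. Only the null check on the analysis directory follows the field documentation above.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical sketch of deriving and writing an analysis file.
final class AnalysisFileSketch {

    /** Derives the analysis file for a URL (naming scheme is an assumption). */
    public static File getAnalysisFile(File analysisDir, String url, String extension) {
        // Replace characters that are unsafe in file names with '_'.
        String safeName = url.replaceAll("[^A-Za-z0-9._-]", "_");
        return new File(analysisDir, safeName + "." + extension);
    }

    /** Writes the content, or does nothing if no analysis directory is set. */
    public static void writeAnalysisFile(File analysisDir, String url,
                                         String extension, String content)
            throws IOException {
        if (analysisDir == null) {
            return; // null means: create no analysis files
        }
        Files.createDirectories(analysisDir.toPath());
        File file = getAnalysisFile(analysisDir, url, extension);
        Files.write(file.toPath(), content.getBytes(StandardCharsets.UTF_8));
    }
}
```

Writing one file per URL and extension keeps the output of each intermediate preparation step separate, which fits the purpose of checking index quality step by step.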
public void close()
Called at the very end of the crawler process, after all documents have been processed.