Regain 2.1.0-STABLE API | ||||||||
java.lang.Object
  net.sf.regain.crawler.document.DocumentFactory
public class DocumentFactory
Factory that creates a Lucene document from the URL and the raw data of a document. The resulting document contains only the document's text, stripped of formatting, together with its URL and its title.
Document
Field Summary | |
---|---|
private File |
mAnalysisDir
The directory in which analysis files are to be created. |
private CrawlerConfig |
mConfig
The crawler config. |
private CrawlerAccessController |
mCrawlerAccessController
The CrawlerAccessController to use for identifying the groups that
are allowed to read a document. |
private static String |
MIME_TYPE_UNKNOWN
|
(package private) org.semanticdesktop.aperture.mime.identifier.MimeTypeIdentifier |
mimeTypeIdentifier
The MIME type identifier. |
private static org.apache.log4j.Logger |
mLog
The logger for this class |
private int |
mMaxSummaryLength
The maximum number of characters that will be copied from the content to the summary. |
private Preparator[] |
mPreparatorArr
The preparators. |
private Profiler[] |
mPreparatorProfilerArr
The profilers that measure the processing done by the preparators. |
private org.apache.regexp.RE[] |
mUseLinkTextAsTitleReArr
The regular expressions that, when one of them matches, cause the link text pointing to the document to be used as its title instead of the document title. |
private Profiler |
mWriteAnalysisProfiler
The profiler that measures the addition to the index. |
private CrawlerPluginManager |
pluginManager
Crawler Plugin Manager instance |
private boolean |
storeContentForPreview
Whether the whole content should be stored in the index for a preview on the result page. |
Constructor Summary | |
---|---|
DocumentFactory(CrawlerConfig config,
File analysisDir)
Creates a new instance of DocumentFactory. |
Method Summary | |
---|---|
void |
close()
Releases all resources that were used by the preparators. |
private org.apache.lucene.document.Document |
createDocument(Preparator preparator,
Profiler preparatorProfiler,
RawDocument rawDocument)
Creates a lucene Document from a RawDocument using a
certain Preparator. |
org.apache.lucene.document.Document |
createDocument(RawDocument rawDocument,
ErrorLogger errorLogger)
Creates a Lucene Document from a RawDocument. |
private org.apache.lucene.document.Document |
createDocument(RawDocument rawDocument,
String cleanedContent,
String title,
String summary,
String metadata,
String headlines,
PathElement[] path,
Map<String,String> additionalFieldMap)
Creates a Lucene Document. |
private org.apache.lucene.document.Document |
createSubstituteDocument(RawDocument rawDocument)
Creates a substitute Lucene Document for a RawDocument. |
private String |
createSummaryFromContent(String content)
Creates a summary from the content of a document. |
private File |
getAnalysisFile(String url,
String extension)
Creates the file name of an analysis file. |
private boolean |
hasContent(String str)
Returns whether the string has content. |
private String |
pathToString(PathElement[] path)
Converts a path into a string. |
void |
writeAnalysisFile(String url,
String extension,
String content)
Writes an analysis file. |
private void |
writeContentAnalysisFile(RawDocument rawDocument)
Writes an analysis file with the content of the raw document. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
private static final String MIME_TYPE_UNKNOWN
private static org.apache.log4j.Logger mLog
private CrawlerConfig mConfig
private int mMaxSummaryLength
private boolean storeContentForPreview
private File mAnalysisDir
The directory in which analysis files are to be created. Is null if no analysis files should be created.
private Preparator[] mPreparatorArr
private Profiler[] mPreparatorProfilerArr
private CrawlerAccessController mCrawlerAccessController
The CrawlerAccessController to use for identifying the groups that are allowed to read a document. May be null.
private org.apache.regexp.RE[] mUseLinkTextAsTitleReArr
private Profiler mWriteAnalysisProfiler
org.semanticdesktop.aperture.mime.identifier.MimeTypeIdentifier mimeTypeIdentifier
private CrawlerPluginManager pluginManager
Constructor Detail |
---|
public DocumentFactory(CrawlerConfig config, File analysisDir) throws RegainException
Creates a new instance of DocumentFactory.
Parameters:
config - The crawler configuration.
analysisDir - The directory in which to store the analysis files. Is null if no analysis files should be created.
Throws:
RegainException - If a preparator could not be created or if a regex has a syntax error.
Method Detail |
---|
public org.apache.lucene.document.Document createDocument(RawDocument rawDocument, ErrorLogger errorLogger)
Creates a Lucene Document from a RawDocument.
Parameters:
rawDocument - The raw document.
errorLogger - The error logger to use for logging errors.
Returns:
The created document, or null if the document couldn't be created.
private org.apache.lucene.document.Document createDocument(Preparator preparator, Profiler preparatorProfiler, RawDocument rawDocument) throws RegainException
Creates a Lucene Document from a RawDocument using a certain Preparator.
Parameters:
preparator - The preparator to use.
preparatorProfiler - The profiler of the preparator.
rawDocument - The raw document.
Throws:
RegainException - If creating the document failed.
private org.apache.lucene.document.Document createSubstituteDocument(RawDocument rawDocument) throws RegainException
Creates a substitute Lucene Document for a RawDocument.
Substitute documents have no "content" field but a "preparation-error" field instead. They are added to the index if preparation failed. This way at least the URL can be searched, and subsequent spider runs are much faster because a previously failed document is not retried.
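The substitute-document idea can be illustrated without Lucene. The following is only a minimal sketch under stated assumptions: a plain `Map` stands in for a Lucene Document, the field name "url" and the stored error message are assumptions, and `needsPreparation` is a hypothetical helper that is not part of DocumentFactory. Only the field names "content" and "preparation-error" come from the description above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SubstituteDocumentSketch {

    /** Field map for a document whose preparation succeeded. */
    static Map<String, String> regularDocument(String url, String cleanedContent) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("url", url);                 // "url" is an assumed field name
        fields.put("content", cleanedContent);
        return fields;
    }

    /** Field map for a document whose preparation failed. */
    static Map<String, String> substituteDocument(String url, String errorMessage) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("url", url);                         // the URL stays searchable
        fields.put("preparation-error", errorMessage);  // takes the place of "content"
        return fields;
    }

    /** Hypothetical check: a later run skips entries that already failed. */
    static boolean needsPreparation(Map<String, String> indexedFields) {
        return indexedFields == null || !indexedFields.containsKey("preparation-error");
    }

    public static void main(String[] args) {
        Map<String, String> failed =
                substituteDocument("http://example.org/broken.pdf", "could not parse");
        System.out.println(failed.keySet());
        System.out.println("retry: " + needsPreparation(failed));
    }
}
```

Whether Regain stores the error message itself in the "preparation-error" field is not stated here; the sketch only shows why the presence of that field is enough to skip a retry.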
Parameters:
rawDocument - The document to create the substitute document for.
Throws:
RegainException - If the user groups that are allowed to read this document couldn't be determined.
private org.apache.lucene.document.Document createDocument(RawDocument rawDocument, String cleanedContent, String title, String summary, String metadata, String headlines, PathElement[] path, Map<String,String> additionalFieldMap) throws RegainException
Creates a Lucene Document.
Parameters:
rawDocument - The raw document to create the Lucene Document for.
cleanedContent - The content of the document. May be null if the content couldn't be extracted; in this case a substitute document is created.
title - The title. May be null.
summary - The summary. May be null.
metadata - The cleaned metadata. May be null.
headlines - The headlines. May be null.
path - The path to the document. May be null.
additionalFieldMap - The additional fields provided by the preparator.
Returns:
The Lucene Document.
Throws:
RegainException - If the user groups that are allowed to read this document couldn't be determined.
private boolean hasContent(String str)
Returns whether the string has content. This is the case if it is neither null nor an empty string.
Parameters:
str - The string to check.
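As a minimal sketch of the check described for hasContent, the rule "neither null nor an empty string" could look like the following. The class name is hypothetical, and treating a whitespace-only string as content is an assumption, since the description does not say whether the string is trimmed first.

```java
final class HasContentSketch {

    /** True if the string is neither null nor empty (whitespace is not trimmed here). */
    static boolean hasContent(String str) {
        return str != null && !str.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(hasContent(null));
        System.out.println(hasContent(""));
        System.out.println(hasContent("title"));
    }
}
```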
private String createSummaryFromContent(String content)
Creates a summary from the content of a document. If no summary can be created, null is returned.
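One plausible reading of this behavior, combined with the mMaxSummaryLength field, is sketched below. It is not the Regain implementation: the class name is hypothetical, the limit is passed as a parameter instead of being read from the field, and cutting at the last space inside the limit is an assumption made only to keep words intact.

```java
final class SummarySketch {

    /**
     * Returns at most maxSummaryLength characters of the content, or null
     * if no summary can be built (null or blank content, non-positive limit).
     */
    static String createSummaryFromContent(String content, int maxSummaryLength) {
        if (content == null || maxSummaryLength <= 0) {
            return null;
        }
        String trimmed = content.trim();
        if (trimmed.isEmpty()) {
            return null;
        }
        if (trimmed.length() <= maxSummaryLength) {
            return trimmed;
        }
        // Assumption: prefer cutting at the last space at or before the limit.
        int cut = trimmed.lastIndexOf(' ', maxSummaryLength);
        if (cut <= 0) {
            cut = maxSummaryLength; // no space found, cut hard at the limit
        }
        return trimmed.substring(0, cut).trim();
    }

    public static void main(String[] args) {
        System.out.println(createSummaryFromContent("hello world foo", 8));
    }
}
```

Returning null instead of an empty string lets the caller pass the result straight on as an optional summary, which matches the "May be null" note on the summary parameter of createDocument.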
Parameters:
content - The content for which the summary should be created.
Returns:
The summary, or null if none could be created.
private String pathToString(PathElement[] path)
Converts a path into a string.
Parameters:
path - The path.
private void writeContentAnalysisFile(RawDocument rawDocument)
Writes an analysis file with the content of the raw document.
Parameters:
rawDocument - The raw document.
public void writeAnalysisFile(String url, String extension, String content)
Writes an analysis file.
An analysis file contains the data of the document at each intermediate step of the preparation. It helps to check the quality of the index creation and is created in a subdirectory of the index directory.
Parameters:
url - The URL of the document.
extension - The extension that the analysis file should get.
content - The content to write to the file.
private File getAnalysisFile(String url, String extension)
Creates the file name of an analysis file.
Parameters:
url - The URL of the document.
extension - The extension that the analysis file should get.
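How a URL and an extension are turned into a file name is not specified here. The sketch below shows one possible scheme, purely as an assumption: every character outside a small safe set is replaced with an underscore, the extension is appended after a dot, and a null analysis directory means that no analysis file is wanted. The class name and the extra analysisDir parameter are hypothetical.

```java
import java.io.File;

final class AnalysisFileSketch {

    /** Maps a URL to a file inside analysisDir, or returns null if analysis is disabled. */
    static File getAnalysisFile(File analysisDir, String url, String extension) {
        if (analysisDir == null) {
            return null; // no analysis files should be created
        }
        // Assumption: replace anything that is unsafe in a file name with '_'.
        String safeName = url.replaceAll("[^A-Za-z0-9._-]", "_");
        return new File(analysisDir, safeName + "." + extension);
    }

    public static void main(String[] args) {
        File file = getAnalysisFile(new File("analysis"), "http://example.org/a b.html", "txt");
        System.out.println(file.getName());
    }
}
```

A scheme like this can map two different URLs to the same name; a real implementation might add a hash of the URL to avoid collisions.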
public void close()
Releases all resources that were used by the preparators. Called at the very end of the crawler process, after all documents have been processed.