Regain 2.1.0-STABLE API
java.lang.Object
  net.sf.regain.crawler.document.AbstractPreparator
public abstract class AbstractPreparator
Abstract implementation of a preparator.
Implements the getter methods and takes care of the clean-up between two
preparations (see cleanUp()).
Child classes may set the values using the setter methods.
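
The division of labour described above (the base class owns the getters and the clean-up, a child class only sets values) can be sketched roughly as follows. This is a minimal, self-contained illustration and not the Regain source: `SketchPreparator`, `PlainTextSketchPreparator`, the `prepare(String)` signature and the `"length"` field are hypothetical stand-ins, chosen so the example compiles without the Regain library.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for AbstractPreparator; NOT the Regain class.
abstract class SketchPreparator {
    private String title;
    private String cleanedContent;
    private Map<String, String> additionalFields;

    // Child classes implement the extraction and call the setters below.
    // (Assumed signature; the real prepare method takes a RawDocument.)
    abstract void prepare(String rawText);

    String getTitle() { return title; }
    String getCleanedContent() { return cleanedContent; }
    Map<String, String> getAdditionalFields() { return additionalFields; }

    void setTitle(String title) { this.title = title; }
    void setCleanedContent(String cleanedContent) { this.cleanedContent = cleanedContent; }

    void addAdditionalField(String fieldName, String fieldValue) {
        if (additionalFields == null) {
            additionalFields = new HashMap<>();
        }
        additionalFields.put(fieldName, fieldValue);
    }

    // Resets all per-document state so nothing leaks into the next document.
    void cleanUp() {
        title = null;
        cleanedContent = null;
        additionalFields = null;
    }
}

// Hypothetical child class: the first line is the title, the whole text is the content.
class PlainTextSketchPreparator extends SketchPreparator {
    @Override
    void prepare(String rawText) {
        String trimmed = rawText.trim();
        int lineEnd = trimmed.indexOf('\n');
        setTitle(lineEnd < 0 ? trimmed : trimmed.substring(0, lineEnd));
        setCleanedContent(trimmed);
        addAdditionalField("length", String.valueOf(trimmed.length()));
    }
}
```

In this sketch the caller reads the getters after `prepare` and then calls `cleanUp()` before handing over the next document.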
| Field Summary | |
|---|---|
private HashMap<String,String> |
mAdditionalFieldMap
The additional fields that should be indexed. |
private String |
mCleanedContent
The cleaned content. |
private String |
mCleanedMetaData
The cleaned meta data of the document. |
private String |
mHeadlines
The extracted headlines. |
private String[] |
mMimeTypes
The MIME types assigned to the preparator. |
private PathElement[] |
mPath
The path by which the document can be reached. |
private int |
mPriority
The priority of the preparator. |
private String |
mSummary
The summary of the document. |
private String |
mTitle
The title that was found. |
private org.apache.regexp.RE |
mUrlRegex
The regular expression a URL must match to be prepared by this preparator. |
| Fields inherited from interface net.sf.regain.crawler.document.Preparator |
|---|
DEFAULT_BUFFER_SIZE |
| Constructor Summary | |
|---|---|
AbstractPreparator()
Creates a new instance of AbstractPreparator. |
|
AbstractPreparator(org.apache.regexp.RE urlRegex)
Creates a new instance of AbstractPreparator. |
|
AbstractPreparator(String mimeType)
Creates a new instance of AbstractPreparator. |
|
AbstractPreparator(String[] mimeTypeArr)
Creates a new instance of AbstractPreparator. |
|
| Method Summary | |
|---|---|
boolean |
accepts(RawDocument rawDocument)
Gets whether the preparator is able to process the given document. |
void |
addAdditionalField(String fieldName,
String fieldValue)
Adds an additional field to the current document. |
void |
cleanUp()
Releases all resources used for handling a document. |
void |
close()
Frees all resources reserved by the preparator. |
protected String |
concatenateStringParts(List<String> parts,
int maxPartsUsed)
Concatenate all parts together, use ', ' as delimiter. |
private static org.apache.regexp.RE |
createExtentionRegex(String extention)
Creates a regex that matches a file extension. |
private static org.apache.regexp.RE |
createExtentionRegex(String[] extentionArr)
Creates a regex that matches a set of file extensions. |
Map<String,String> |
getAdditionalFields()
Gets additional fields that should be indexed. |
String |
getCleanedContent()
Returns the content of the document, stripped of formatting information. |
String |
getCleanedMetaData()
|
String |
getHeadlines()
Returns the headlines of the document. |
PathElement[] |
getPath()
Returns the path by which the document can be reached. |
int |
getPriority()
Gets the priority of the preparator. |
String |
getSummary()
Returns a summary of the document. |
String |
getTitle()
Returns the title of the document. |
void |
init(PreparatorConfig config)
Initializes the preparator. |
void |
setCleanedContent(String cleanedContent)
Sets the content, stripped of formatting information, of the document currently being prepared. |
void |
setCleanedMetaData(String mCleanedMetaData)
|
void |
setHeadlines(String headlines)
Sets the headlines that were found in the document currently being prepared. |
void |
setPath(PathElement[] path)
Sets the path by which the document can be reached. |
void |
setPriority(int priority)
Sets the priority of the preparator. |
void |
setSummary(String summary)
Sets the summary of the document currently being prepared. |
void |
setTitle(String title)
Sets the title of the document currently being prepared. |
void |
setUrlRegex(org.apache.regexp.RE urlRegex)
Sets the regular expression a URL must match to be prepared by this preparator. |
| Methods inherited from class java.lang.Object |
|---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Methods inherited from interface net.sf.regain.crawler.document.Preparator |
|---|
prepare |
| Field Detail |
|---|
private org.apache.regexp.RE mUrlRegex
private String mTitle
private String mCleanedContent
private String mSummary
private String mCleanedMetaData
private String mHeadlines
May be null.
private PathElement[] mPath
private HashMap<String,String> mAdditionalFieldMap
private String[] mMimeTypes
private int mPriority
| Constructor Detail |
|---|
public AbstractPreparator()
Creates a new instance of AbstractPreparator.
The preparator won't accept any documents until a new rule has been defined
using setUrlRegex(RE).
See Also: setUrlRegex(RE), accepts(RawDocument)

public AbstractPreparator(org.apache.regexp.RE urlRegex)
Creates a new instance of AbstractPreparator.
If urlRegex is null, the preparator won't accept any documents.
Parameters: urlRegex - The regex a URL must match to be accepted by this
preparator (may be null).
See Also: setUrlRegex(RE), accepts(RawDocument)
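
The "null regex accepts nothing" rule stated for this constructor and for setUrlRegex(RE) can be illustrated with a small sketch. It is an assumption-laden stand-in, not the Regain implementation: it uses `java.util.regex.Pattern` instead of `org.apache.regexp.RE` so that it runs without extra libraries, checks a plain URL string instead of a `RawDocument`, and the class name `UrlRuleSketch` is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in that mirrors only the URL rule described above.
class UrlRuleSketch {
    // null means "no rule defined yet", so no document is accepted.
    private Pattern urlRegex;

    UrlRuleSketch() {
        this(null);
    }

    UrlRuleSketch(Pattern urlRegex) {
        this.urlRegex = urlRegex;
    }

    void setUrlRegex(Pattern urlRegex) {
        this.urlRegex = urlRegex;
    }

    // Accepts a URL only if a rule exists and matches somewhere in the URL.
    boolean accepts(String url) {
        return urlRegex != null && urlRegex.matcher(url).find();
    }
}
```

A freshly constructed `UrlRuleSketch` rejects every URL until `setUrlRegex` is called with a non-null pattern.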

public AbstractPreparator(String mimeType)
throws RegainException
Creates a new instance of AbstractPreparator.
If mimeType is null or empty, the preparator won't accept any
documents.
Parameters: mimeType - The file extension a URL must have to be accepted by
this preparator.
Throws: RegainException - If creating the preparator failed.
See Also: setUrlRegex(RE), accepts(RawDocument)

public AbstractPreparator(String[] mimeTypeArr)
throws RegainException
Creates a new instance of AbstractPreparator.
If mimeTypeArr is null or empty, the preparator won't accept
any documents.
Parameters: mimeTypeArr - The file extensions, one of which a URL must have
to be accepted by this preparator.
Throws: RegainException - If creating the preparator failed.
See Also: setUrlRegex(RE), accepts(RawDocument)

| Method Detail |
|---|
private static org.apache.regexp.RE createExtentionRegex(String extention)
throws RegainException
Creates a regex that matches a file extension.
If extention is null or empty, null will be returned.
Parameters: extention - The file extension to create the regex for.
Throws: RegainException - If the regex couldn't be created.

private static org.apache.regexp.RE createExtentionRegex(String[] extentionArr)
throws RegainException
Creates a regex that matches a set of file extensions.
If extentionArr is null or empty, null will be returned.
Parameters: extentionArr - The file extensions to create the regex for.
Throws: RegainException - If the regex couldn't be created.
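
One plausible way to build such an extension regex is sketched below. The actual pattern Regain generates is not shown in this documentation, so the pattern shape (a dot, the alternatives, end of string), the case-insensitive matching, the use of `java.util.regex.Pattern` in place of `org.apache.regexp.RE`, and the names `ExtensionRegexSketch`, `forExtension` and `forExtensions` are all assumptions made for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper; only the "null or empty input yields null" rule comes from the docs.
class ExtensionRegexSketch {

    // Builds a pattern that matches URLs ending in ".<one of the extensions>".
    static Pattern forExtensions(String[] extensions) {
        if (extensions == null || extensions.length == 0) {
            return null;
        }
        StringBuilder alternatives = new StringBuilder();
        for (String extension : extensions) {
            if (alternatives.length() > 0) {
                alternatives.append('|');
            }
            // Quote each extension so characters such as '+' are taken literally.
            alternatives.append(Pattern.quote(extension));
        }
        // Assumption: extensions are compared case-insensitively.
        return Pattern.compile("\\.(" + alternatives + ")$", Pattern.CASE_INSENSITIVE);
    }

    // Single-extension variant, delegating to the array variant.
    static Pattern forExtension(String extension) {
        if (extension == null || extension.isEmpty()) {
            return null;
        }
        return forExtensions(new String[] { extension });
    }
}
```

Returning null for missing input matches the documented contract, and a null regex in turn means that no document is accepted.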

public void init(PreparatorConfig config)
throws RegainException
Initializes the preparator.
Does nothing by default. May be overridden by subclasses.
Specified by: init in interface Pluggable
Parameters: config - The configuration for this preparator.
Throws: RegainException - If the regular expression or the configuration
has an error.

public void setUrlRegex(org.apache.regexp.RE urlRegex)
Sets the regular expression a URL must match to be prepared by this preparator.
If urlRegex is null, the preparator won't accept any documents.
Specified by: setUrlRegex in interface Preparator
Parameters: urlRegex - The new URL regex (may be null).
See Also: accepts(RawDocument)

public boolean accepts(RawDocument rawDocument)
Gets whether the preparator is able to process the given document.
Specified by: accepts in interface Preparator
Parameters: rawDocument - The document to check.
See Also: setUrlRegex(RE)

public String getTitle()
Returns the title of the document.
If no title could be extracted, null is returned.
Specified by: getTitle in interface Preparator

public void setTitle(String title)
Sets the title of the document currently being prepared.
Specified by: setTitle in interface WriteablePreparator
Parameters: title - The title.

public String getCleanedContent()
Returns the content of the document, stripped of formatting information.
Specified by: getCleanedContent in interface Preparator

public void setCleanedContent(String cleanedContent)
Sets the content, stripped of formatting information, of the document currently being prepared.
Specified by: setCleanedContent in interface WriteablePreparator
Parameters: cleanedContent -

public String getCleanedMetaData()
Specified by: getCleanedMetaData in interface Preparator

public void setCleanedMetaData(String mCleanedMetaData)
Specified by: setCleanedMetaData in interface WriteablePreparator
Parameters: mCleanedMetaData - The cleaned meta data to set.

public String getSummary()
Returns a summary of the document.
Since creating a summary is not easily possible, null is returned.
Specified by: getSummary in interface Preparator

public void setSummary(String summary)
Sets the summary of the document currently being prepared.
Specified by: setSummary in interface WriteablePreparator
Parameters: summary - The summary.

public String getHeadlines()
Returns the headlines of the document.
These are not the title of the document itself, but only the sub-headlines used within the document. With the help of these headlines a better relevance can be calculated.
If no headlines were found, null is returned.
Specified by: getHeadlines in interface Preparator

public void setHeadlines(String headlines)
Sets the headlines that were found in the document currently being prepared.
Specified by: setHeadlines in interface WriteablePreparator
Parameters: headlines - The headlines.

public PathElement[] getPath()
Returns the path by which the document can be reached.
If no path is available, null is returned.
Specified by: getPath in interface Preparator

public void setPath(PathElement[] path)
Sets the path by which the document can be reached.
Parameters: path - The path by which the document can be reached.

public Map<String,String> getAdditionalFields()
Gets additional fields that should be indexed.
These fields will be indexed and stored.
Specified by: getAdditionalFields in interface Preparator
Returns: The additional fields; may be null.
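
public void addAdditionalField(String fieldName,
String fieldValue)
Adds an additional field to the current document.
This field will be indexed and stored.
Specified by: addAdditionalField in interface WriteablePreparator
Parameters: fieldName - The name of the field.
fieldValue - The value of the field.

public int getPriority()
Gets the priority of the preparator.
Specified by: getPriority in interface Preparator

public void setPriority(int priority)
Sets the priority of the preparator.
Specified by: setPriority in interface Preparator
Parameters: priority - The priority, read from the configuration or taken from the default settings.

public void cleanUp()
Releases all resources used for handling a document.
Specified by: cleanUp in interface Preparator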

protected String concatenateStringParts(List<String> parts,
int maxPartsUsed)
Concatenates all parts, using ', ' as the delimiter.
Parameters: parts - The parts to concatenate.
maxPartsUsed - The number of parts used for the concatenation.
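
The concatenation described above can be sketched as follows. Only the ', ' delimiter and the two parameters come from this documentation. The handling of a null list or a non-positive `maxPartsUsed` (an empty string is returned) and the name `ConcatSketch` are assumptions, not documented behaviour.

```java
import java.util.List;

// Hypothetical stand-alone version of the helper, for illustration only.
class ConcatSketch {

    // Joins at most maxPartsUsed leading parts with ", ".
    static String concatenateStringParts(List<String> parts, int maxPartsUsed) {
        if (parts == null || maxPartsUsed <= 0) {
            return ""; // assumed behaviour for missing input
        }
        int count = Math.min(maxPartsUsed, parts.size());
        return String.join(", ", parts.subList(0, count));
    }
}
```

For example, joining the parts "a", "b" and "c" with `maxPartsUsed` set to 2 gives "a, b" in this sketch.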

public void close()
throws RegainException
Frees all resources reserved by the preparator.
Is called at the end of the crawler process after all documents were processed.
Specified by: close in interface Preparator
Throws: RegainException - If freeing the resources failed.
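
Taken together, init, cleanUp and close suggest a call order along the following lines. This is a hedged sketch of how a crawler might drive a preparator, not the Regain crawler itself: the `Lifecycle` interface, the string-based `prepare` method and the `crawl` and `trace` helpers are hypothetical, and calling init exactly once before the first document is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

class LifecycleSketch {

    // Hypothetical stand-in for the lifecycle methods; not the Regain interface.
    interface Lifecycle {
        void init();
        void prepare(String document);
        void cleanUp();
        void close();
    }

    // Drives one preparator over all documents.
    static void crawl(Lifecycle preparator, List<String> documents) {
        preparator.init();
        try {
            for (String document : documents) {
                try {
                    preparator.prepare(document);
                } finally {
                    // Clean up between two preparations, even if one failed.
                    preparator.cleanUp();
                }
            }
        } finally {
            // Called once, after all documents were processed.
            preparator.close();
        }
    }

    // Records the order of the calls for a given list of documents.
    static List<String> trace(List<String> documents) {
        final List<String> calls = new ArrayList<>();
        crawl(new Lifecycle() {
            public void init() { calls.add("init"); }
            public void prepare(String document) { calls.add("prepare " + document); }
            public void cleanUp() { calls.add("cleanUp"); }
            public void close() { calls.add("close"); }
        }, documents);
        return calls;
    }
}
```

The `finally` blocks reflect the intent that per-document state is released after every document and that the preparator's own resources are released exactly once at the end.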