Regain 2.1.0-STABLE API
java.lang.Object
  net.sf.regain.crawler.document.AbstractPreparator
public abstract class AbstractPreparator
Abstract implementation of a preparator.
Implements the getter methods and handles the clean-up between two preparations (see cleanUp()).
Subclasses may set the values using the setter methods.
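The division of labour described above can be illustrated with a minimal sketch. It does not use the Regain classes: `SketchPreparator` and `PlainTextSketchPreparator` are hypothetical stand-ins, and `prepare` takes a plain `String` here instead of a `RawDocument`. The base class holds the values and resets them in `cleanUp()`, while the subclass only extracts the values and stores them through the setters.

```java
/** Hypothetical stand-in for AbstractPreparator; not the Regain source. */
abstract class SketchPreparator {
  private String title;
  private String cleanedContent;

  /** Subclasses extract the values and store them with the setters. */
  abstract void prepare(String rawText);

  String getTitle() { return title; }
  String getCleanedContent() { return cleanedContent; }

  void setTitle(String title) { this.title = title; }
  void setCleanedContent(String cleanedContent) { this.cleanedContent = cleanedContent; }

  /** Resets the per-document state so nothing leaks into the next document. */
  void cleanUp() {
    title = null;
    cleanedContent = null;
  }
}

/** Hypothetical subclass: the first line becomes the title, the trimmed text the content. */
final class PlainTextSketchPreparator extends SketchPreparator {
  @Override
  void prepare(String rawText) {
    String trimmed = rawText.trim();
    int lineEnd = trimmed.indexOf('\n');
    setTitle(lineEnd < 0 ? trimmed : trimmed.substring(0, lineEnd).trim());
    setCleanedContent(trimmed);
  }
}
```

A caller would invoke `prepare(...)`, read the getters, and then call `cleanUp()` before handing over the next document.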
Field Summary | |
---|---|
private HashMap<String,String> | mAdditionalFieldMap: The additional fields that should be indexed. |
private String | mCleanedContent: The cleaned content. |
private String | mCleanedMetaData: The cleaned meta data of the document. |
private String | mHeadlines: The extracted headlines. |
private String[] | mMimeTypes: The MIME types assigned to the preparator. |
private PathElement[] | mPath: The path by which the document can be reached. |
private int | mPriority: The priority of the preparator. |
private String | mSummary: The summary of the document. |
private String | mTitle: The title that was found. |
private org.apache.regexp.RE | mUrlRegex: The regular expression a URL must match to be prepared by this preparator. |
Fields inherited from interface net.sf.regain.crawler.document.Preparator |
---|
DEFAULT_BUFFER_SIZE |
Constructor Summary | |
---|---|
public | AbstractPreparator(): Creates a new instance of AbstractPreparator. |
public | AbstractPreparator(org.apache.regexp.RE urlRegex): Creates a new instance of AbstractPreparator. |
public | AbstractPreparator(String mimeType): Creates a new instance of AbstractPreparator. |
public | AbstractPreparator(String[] mimeTypeArr): Creates a new instance of AbstractPreparator. |
Method Summary | |
---|---|
boolean | accepts(RawDocument rawDocument): Gets whether the preparator is able to process the given document. |
void | addAdditionalField(String fieldName, String fieldValue): Adds an additional field to the current document. |
void | cleanUp(): Releases all resources used for handling a document. |
void | close(): Frees all resources reserved by the preparator. |
protected String | concatenateStringParts(List<String> parts, int maxPartsUsed): Concatenates all parts, using ', ' as the delimiter. |
private static org.apache.regexp.RE | createExtentionRegex(String extention): Creates a regex that matches a file extension. |
private static org.apache.regexp.RE | createExtentionRegex(String[] extentionArr): Creates a regex that matches a set of file extensions. |
Map<String,String> | getAdditionalFields(): Gets additional fields that should be indexed. |
String | getCleanedContent(): Gets the content of the document with formatting information removed. |
String | getCleanedMetaData() |
String | getHeadlines(): Gets the headlines of the document. |
PathElement[] | getPath(): Gets the path by which the document can be reached. |
int | getPriority(): Gets the priority of the preparator. |
String | getSummary(): Gets a summary of the document. |
String | getTitle(): Gets the title of the document. |
void | init(PreparatorConfig config): Initializes the preparator. |
void | setCleanedContent(String cleanedContent): Sets the content, with formatting information removed, of the document currently being prepared. |
void | setCleanedMetaData(String mCleanedMetaData) |
void | setHeadlines(String headlines): Sets the headlines that were found in the document currently being prepared. |
void | setPath(PathElement[] path): Sets the path by which the document can be reached. |
void | setPriority(int priority): Sets the priority of the preparator. |
void | setSummary(String summary): Sets the summary of the document currently being prepared. |
void | setTitle(String title): Sets the title of the document currently being prepared. |
void | setUrlRegex(org.apache.regexp.RE urlRegex): Sets the regular expression a URL must match to be prepared by this preparator. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Methods inherited from interface net.sf.regain.crawler.document.Preparator |
---|
prepare |
Field Detail |
---|
private org.apache.regexp.RE mUrlRegex
private String mTitle
private String mCleanedContent
private String mSummary
private String mCleanedMetaData
private String mHeadlines
May be null.
private PathElement[] mPath
private HashMap<String,String> mAdditionalFieldMap
private String[] mMimeTypes
private int mPriority
Constructor Detail |
---|
public AbstractPreparator()
The preparator won't accept any documents until a rule has been defined using setUrlRegex(RE).
See Also:
setUrlRegex(RE), accepts(RawDocument)
public AbstractPreparator(org.apache.regexp.RE urlRegex)
If urlRegex is null, the preparator won't accept any documents.
Parameters:
urlRegex - the regex a URL must match to be accepted by this preparator (may be null)
See Also:
setUrlRegex(RE), accepts(RawDocument)
public AbstractPreparator(String mimeType) throws RegainException
If mimeType is null or empty, the preparator won't accept any documents.
Parameters:
mimeType - The file extension a URL must have to be accepted by this preparator.
Throws:
RegainException - If creating the preparator failed.
See Also:
setUrlRegex(RE), accepts(RawDocument)
public AbstractPreparator(String[] mimeTypeArr) throws RegainException
If mimeTypeArr is null or empty, the preparator won't accept any documents.
Parameters:
mimeTypeArr - The file extensions, one of which a URL must have to be accepted by this preparator.
Throws:
RegainException - If creating the preparator failed.
See Also:
setUrlRegex(RE), accepts(RawDocument)
Method Detail |
---|
private static org.apache.regexp.RE createExtentionRegex(String extention) throws RegainException
If extention is null or empty, null will be returned.
Parameters:
extention - The file extension to create the regex for.
Throws:
RegainException - If the regex couldn't be created.
private static org.apache.regexp.RE createExtentionRegex(String[] extentionArr) throws RegainException
If extentionArr is null or empty, null will be returned.
Parameters:
extentionArr - The file extensions to create the regex for.
Throws:
RegainException - If the regex couldn't be created.
public void init(PreparatorConfig config) throws RegainException
Does nothing by default. May be overridden by subclasses.
Specified by:
init in interface Pluggable
Parameters:
config - The configuration for this preparator.
Throws:
RegainException - If the regular expression or the configuration has an error.
public void setUrlRegex(org.apache.regexp.RE urlRegex)
If urlRegex is null, the preparator won't accept any documents.
Specified by:
setUrlRegex in interface Preparator
Parameters:
urlRegex - the new URL regex (may be null)
See Also:
accepts(RawDocument)
public boolean accepts(RawDocument rawDocument)
Specified by:
accepts in interface Preparator
Parameters:
rawDocument - The document to check.
See Also:
setUrlRegex(RE)
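The rule documented for setUrlRegex(RE) and the extension-regex helpers (a null regex means no document is accepted, and a null or empty extension list yields a null regex) can be sketched as follows. This is only an illustration under stated assumptions: it uses `java.util.regex.Pattern` in place of `org.apache.regexp.RE`, checks a URL string rather than a `RawDocument`, and the class name `UrlRuleSketch` and the exact pattern built are hypothetical. The real accepts(RawDocument) may apply further checks.

```java
import java.util.regex.Pattern;

/** Hypothetical model of the URL rule; not the Regain implementation. */
final class UrlRuleSketch {
  /** Current rule; null means "accept nothing". */
  private Pattern urlRegex;

  /** Builds a pattern for URLs ending in one of the extensions, or null if there are none. */
  static Pattern createExtensionRegex(String[] extensions) {
    if (extensions == null || extensions.length == 0) {
      return null;
    }
    StringBuilder alternatives = new StringBuilder();
    for (String extension : extensions) {
      if (extension == null || extension.isEmpty()) {
        continue; // skip unusable entries
      }
      if (alternatives.length() > 0) {
        alternatives.append('|');
      }
      alternatives.append(Pattern.quote(extension));
    }
    if (alternatives.length() == 0) {
      return null;
    }
    return Pattern.compile(".*\\.(" + alternatives + ")$");
  }

  void setUrlRegex(Pattern urlRegex) {
    this.urlRegex = urlRegex;
  }

  /** A URL is accepted only if a rule is set and the URL matches it. */
  boolean accepts(String url) {
    return urlRegex != null && url != null && urlRegex.matcher(url).matches();
  }
}
```

Keeping the "no rule" state as null means a preparator created with the no-argument constructor rejects everything until a rule is set, which is the behaviour described above.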
public String getTitle()
If no title could be extracted, null is returned.
Specified by:
getTitle in interface Preparator
public void setTitle(String title)
Specified by:
setTitle in interface WriteablePreparator
Parameters:
title - The title.
public String getCleanedContent()
Specified by:
getCleanedContent in interface Preparator
public void setCleanedContent(String cleanedContent)
Specified by:
setCleanedContent in interface WriteablePreparator
Parameters:
cleanedContent - the cleaned content
public String getCleanedMetaData()
Specified by:
getCleanedMetaData in interface Preparator
public void setCleanedMetaData(String mCleanedMetaData)
Specified by:
setCleanedMetaData in interface WriteablePreparator
Parameters:
mCleanedMetaData - the cleaned meta data to set
public String getSummary()
Since a summary cannot easily be produced, null is returned.
Specified by:
getSummary in interface Preparator
public void setSummary(String summary)
Specified by:
setSummary in interface WriteablePreparator
Parameters:
summary - The summary.
public String getHeadlines()
These are not the heading of the document itself, but only the sub-headings used within the document. These headlines make it possible to compute a better relevance.
If no headlines were found, null is returned.
Specified by:
getHeadlines in interface Preparator
public void setHeadlines(String headlines)
Specified by:
setHeadlines in interface WriteablePreparator
Parameters:
headlines - The headlines.
public PathElement[] getPath()
If no path is available, null is returned.
Specified by:
getPath in interface Preparator
public void setPath(PathElement[] path)
Parameters:
path - The path by which the document can be reached.
public Map<String,String> getAdditionalFields()
These fields will be indexed and stored.
Specified by:
getAdditionalFields in interface Preparator
Returns:
The additional fields that should be indexed, or null.
public void addAdditionalField(String fieldName, String fieldValue)
This field will be indexed and stored.
Specified by:
addAdditionalField in interface WriteablePreparator
Parameters:
fieldName - The name of the field.
fieldValue - The value of the field.
public int getPriority()
Specified by:
getPriority in interface Preparator
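As an illustration of how addAdditionalField(String, String) and getAdditionalFields() might fit together, here is a minimal sketch. It assumes that the field map is created lazily, that the getter returns null when no field was added, and that cleanUp() discards the map between documents. These are assumptions drawn from the descriptions on this page, not from the Regain source, and `AdditionalFieldsSketch` is a hypothetical class name.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of the additional-field handling; not the Regain implementation. */
final class AdditionalFieldsSketch {
  /** Created lazily; stays null until the first field is added. */
  private Map<String, String> additionalFields;

  /** Adds a field to the document currently being prepared. */
  void addAdditionalField(String fieldName, String fieldValue) {
    if (additionalFields == null) {
      additionalFields = new HashMap<>();
    }
    additionalFields.put(fieldName, fieldValue);
  }

  /** Returns the fields added so far, or null if none were added. */
  Map<String, String> getAdditionalFields() {
    return additionalFields;
  }

  /** Drops the fields so the next document starts without any. */
  void cleanUp() {
    additionalFields = null;
  }
}
```

Callers of such a getter need a null check before iterating over the result.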
public void setPriority(int priority)
Specified by:
setPriority in interface Preparator
Parameters:
priority - the priority, read from the configuration or taken from the default settings
public void cleanUp()
Specified by:
cleanUp in interface Preparator
protected String concatenateStringParts(List<String> parts, int maxPartsUsed)
Parameters:
parts - the parts to concatenate
maxPartsUsed - the number of parts used for the concatenation
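A minimal sketch of what concatenateStringParts might do, based only on the summary ("use ', ' as delimiter") and the parameter descriptions. It assumes that the first maxPartsUsed parts are taken and that a null or empty list yields an empty string; neither detail is stated on this page, and `ConcatSketch` is a hypothetical class name.

```java
import java.util.List;

/** Hypothetical model of concatenateStringParts; not the Regain implementation. */
final class ConcatSketch {
  /** Joins at most maxPartsUsed leading parts with ", ". */
  static String concatenateStringParts(List<String> parts, int maxPartsUsed) {
    if (parts == null || parts.isEmpty() || maxPartsUsed <= 0) {
      return ""; // assumption: nothing to join gives an empty string
    }
    int count = Math.min(parts.size(), maxPartsUsed);
    return String.join(", ", parts.subList(0, count));
  }
}
```

For example, joining the parts "a", "b" and "c" with a limit of 2 gives "a, b" in this sketch.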
public void close() throws RegainException
Called at the end of the crawling process, after all documents have been processed.
Specified by:
close in interface Preparator
Throws:
RegainException - If freeing the resources failed.