AbstractPreparator (API documentation for Regain 2.1.0-STABLE)


net.sf.regain.crawler.document
Class AbstractPreparator

java.lang.Object
  net.sf.regain.crawler.document.AbstractPreparator

All Implemented Interfaces:: Pluggable, Preparator, WriteablePreparator

Direct Known Subclasses:: AbstractJacobMsOfficePreparator, DispatcherPreparator, EmptyPreparator, ExternalPreparator, FilenamePreparator, GenericAudioPreparator, HtmlPreparator, IfilterPreparator, JarPreparator, JavaPreparator, MessagePreparator, MP3Preparator, OpenOfficePreparator, PdfBoxPreparator, PlainTextPreparator, PoiMsOfficePreparator, SimpleRtfPreparator, SwingRtfPreparator, XmlPreparator, ZipPreparator

public abstract class AbstractPreparator
extends Object
implements Preparator, WriteablePreparator

Abstract implementation of a preparator.

Implements the getter methods and handles the clean-up between two preparations (see cleanUp()).

Child classes may set the values using the protected setter methods.

Author:: Til Schneider, www.murfman.de
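
The division of labour described above (the base class stores the results and resets them, the child class only fills them in) can be sketched roughly as follows. This is not the Regain source: `SketchPreparator` and `FirstLineTitlePreparator` are hypothetical stand-ins, and `prepare` takes a plain String here, whereas the real method works on a RawDocument.

```java
/**
 * Hypothetical stand-in for AbstractPreparator (not the Regain class):
 * it stores the results of one preparation and clears them in cleanUp().
 */
abstract class SketchPreparator {
  private String title;
  private String cleanedContent;

  public String getTitle() { return title; }
  public String getCleanedContent() { return cleanedContent; }

  public void setTitle(String title) { this.title = title; }
  public void setCleanedContent(String cleanedContent) { this.cleanedContent = cleanedContent; }

  /** Resets the per-document state so the next preparation starts clean. */
  public void cleanUp() {
    title = null;
    cleanedContent = null;
  }

  /** Simplified assumption: the real prepare method takes a RawDocument, not a String. */
  public abstract void prepare(String rawText);
}

/** Hypothetical child class: first line becomes the title, the trimmed text the content. */
class FirstLineTitlePreparator extends SketchPreparator {
  @Override
  public void prepare(String rawText) {
    String trimmed = rawText.trim();
    int lineEnd = trimmed.indexOf('\n');
    setTitle(lineEnd < 0 ? trimmed : trimmed.substring(0, lineEnd).trim());
    setCleanedContent(trimmed);
  }

  public static void main(String[] args) {
    FirstLineTitlePreparator preparator = new FirstLineTitlePreparator();
    preparator.prepare("Release notes\nFixed the crawler timeout.");
    System.out.println(preparator.getTitle()); // prints "Release notes"
    preparator.cleanUp();
    System.out.println(preparator.getTitle()); // prints "null"
  }
}
```

The child class never touches the fields directly, so the reset logic stays in one place.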

Field Summary
`private HashMap<String,String>`	`mAdditionalFieldMap` The additional fields that should be indexed.
`private String`	`mCleanedContent` The cleaned content.
`private String`	`mCleanedMetaData` The cleaned meta data of the document.
`private String`	`mHeadlines` The extracted headlines.
`private String[]`	`mMimeTypes` The MIME types assigned to the preparator.
`private PathElement[]`	`mPath` The path by which the document can be reached.
`private int`	`mPriority` The priority of the preparator.
`private String`	`mSummary` The summary of the document.
`private String`	`mTitle` The title that was found.
`private org.apache.regexp.RE`	`mUrlRegex` The regular expression a URL must match to be prepared by this preparator.

Fields inherited from interface net.sf.regain.crawler.document.Preparator
`DEFAULT_BUFFER_SIZE`

Constructor Summary
`AbstractPreparator()` Creates a new instance of AbstractPreparator.
`AbstractPreparator(org.apache.regexp.RE urlRegex)` Creates a new instance of AbstractPreparator.
`AbstractPreparator(String mimeType)` Creates a new instance of AbstractPreparator.
`AbstractPreparator(String[] mimeTypeArr)` Creates a new instance of AbstractPreparator.

Method Summary
`boolean`	`accepts(RawDocument rawDocument)` Gets whether the preparator is able to process the given document.
`void`	`addAdditionalField(String fieldName, String fieldValue)` Adds an additional field to the current document.
`void`	`cleanUp()` Releases all resources used for handling a document.
`void`	`close()` Frees all resources reserved by the preparator.
`protected String`	`concatenateStringParts(List<String> parts, int maxPartsUsed)` Concatenates all parts, using ', ' as the delimiter.
`private static org.apache.regexp.RE`	`createExtentionRegex(String extention)` Creates a regex that matches a file extension.
`private static org.apache.regexp.RE`	`createExtentionRegex(String[] extentionArr)` Creates a regex that matches a set of file extensions.
`Map<String,String>`	`getAdditionalFields()` Gets additional fields that should be indexed.
`String`	`getCleanedContent()` Returns the content of the document, stripped of formatting information.
`String`	`getCleanedMetaData()`
`String`	`getHeadlines()` Returns the headlines of the document.
`PathElement[]`	`getPath()` Returns the path by which the document can be reached.
`int`	`getPriority()` Gets the priority of the preparator.
`String`	`getSummary()` Returns a summary of the document.
`String`	`getTitle()` Returns the title of the document.
`void`	`init(PreparatorConfig config)` Initializes the preparator.
`void`	`setCleanedContent(String cleanedContent)` Sets the content, stripped of formatting information, of the document currently being prepared.
`void`	`setCleanedMetaData(String mCleanedMetaData)`
`void`	`setHeadlines(String headlines)` Sets the headlines found in the document currently being prepared.
`void`	`setPath(PathElement[] path)` Sets the path by which the document can be reached.
`void`	`setPriority(int priority)` Sets the priority of the preparator.
`void`	`setSummary(String summary)` Sets the summary of the document currently being prepared.
`void`	`setTitle(String title)` Sets the title of the document currently being prepared.
`void`	`setUrlRegex(org.apache.regexp.RE urlRegex)` Sets the regular expression a URL must match to be prepared by this preparator.

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Methods inherited from interface net.sf.regain.crawler.document.Preparator
`prepare`

Field Detail

mUrlRegex

private org.apache.regexp.RE mUrlRegex

The regular expression a URL must match to be prepared by this preparator.

mTitle

private String mTitle

The title that was found.

mCleanedContent

private String mCleanedContent

The cleaned content.

mSummary

private String mSummary

The summary of the document.

mCleanedMetaData

private String mCleanedMetaData

The cleaned meta data of the document.

mHeadlines

private String mHeadlines

The extracted headlines. May be null.

mPath

private PathElement[] mPath

The path by which the document can be reached.

mAdditionalFieldMap

private HashMap<String,String> mAdditionalFieldMap

The additional fields that should be indexed.

mMimeTypes

private String[] mMimeTypes

The MIME types assigned to the preparator.

mPriority

private int mPriority

The priority of the preparator. Used for the selection of preparators.

Constructor Detail

AbstractPreparator

public AbstractPreparator()

Creates a new instance of AbstractPreparator.

The preparator won't accept any documents until a rule is defined using setUrlRegex(RE).

See Also:: setUrlRegex(RE), accepts(RawDocument)

AbstractPreparator

public AbstractPreparator(org.apache.regexp.RE urlRegex)

Creates a new instance of AbstractPreparator.

If urlRegex is null, the preparator won't accept any documents.

Parameters:: urlRegex - The regex a URL must match to be accepted by this preparator (may be null).
See Also:: setUrlRegex(RE), accepts(RawDocument)

AbstractPreparator

public AbstractPreparator(String mimeType)
                   throws RegainException

Creates a new instance of AbstractPreparator.

If mimeType is null or empty, the preparator won't accept any documents.

Parameters:: mimeType - The file extension a URL must have to be accepted by this preparator.
Throws:: RegainException - If creating the preparator failed.
See Also:: setUrlRegex(RE), accepts(RawDocument)

AbstractPreparator

public AbstractPreparator(String[] mimeTypeArr)
                   throws RegainException

Creates a new instance of AbstractPreparator.

If mimeTypeArr is null or empty, the preparator won't accept any documents.

Parameters:: mimeTypeArr - The file extensions, one of which a URL must have to be accepted by this preparator.
Throws:: RegainException - If creating the preparator failed.
See Also:: setUrlRegex(RE), accepts(RawDocument)

Method Detail

createExtentionRegex

private static org.apache.regexp.RE createExtentionRegex(String extention)
                                                  throws RegainException

Creates a regex that matches a file extension.

If extention is null or empty, null will be returned.

Parameters:: extention - The file extension to create the regex for.
Returns:: The regex.
Throws:: RegainException - If the regex couldn't be created.

createExtentionRegex

private static org.apache.regexp.RE createExtentionRegex(String[] extentionArr)
                                                  throws RegainException

Creates a regex that matches a set of file extensions.

If extentionArr is null or empty, null will be returned.

Parameters:: extentionArr - The file extensions to create the regex for.
Returns:: The regex.
Throws:: RegainException - If the regex couldn't be created.
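
As an illustration of the idea, an extension regex could be built along the following lines. This is a minimal sketch, not the Regain implementation: it uses `java.util.regex.Pattern` instead of `org.apache.regexp.RE`, and the pattern shape (`\.(ext1|ext2)$`, case-insensitive) and the name `ExtensionRegexSketch` are assumptions. Only the null-for-null-or-empty behaviour is taken from the description above.

```java
import java.util.StringJoiner;
import java.util.regex.Pattern;

final class ExtensionRegexSketch {

  /**
   * Builds a pattern matching URLs that end in one of the given extensions.
   * Returns null if the array is null or empty, mirroring the documented contract.
   * The pattern shape itself is an assumption made for this sketch.
   */
  static Pattern createExtensionRegex(String[] extensions) {
    if (extensions == null || extensions.length == 0) {
      return null;
    }
    // Produces a regex of the form \.(ext1|ext2)$ with every extension quoted literally.
    StringJoiner alternatives = new StringJoiner("|", "\\.(", ")$");
    for (String extension : extensions) {
      alternatives.add(Pattern.quote(extension));
    }
    return Pattern.compile(alternatives.toString(), Pattern.CASE_INSENSITIVE);
  }

  public static void main(String[] args) {
    Pattern regex = createExtensionRegex(new String[] {"pdf", "txt"});
    System.out.println(regex.matcher("http://example.com/manual.PDF").find()); // prints "true"
    System.out.println(regex.matcher("http://example.com/manual.html").find()); // prints "false"
  }
}
```

Quoting each extension keeps characters such as `+` or `.` inside an extension from being read as regex syntax.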

init

public void init(PreparatorConfig config)
          throws RegainException

Initializes the preparator.

Does nothing by default. May be overridden by subclasses.

Specified by:: init in interface Pluggable

Parameters:: config - The configuration for this preparator.
Throws:: RegainException - If the regular expression or the configuration has an error.

setUrlRegex

public void setUrlRegex(org.apache.regexp.RE urlRegex)

Sets the regular expression a URL must match to be prepared by this preparator.

If urlRegex is null, the preparator won't accept any documents.

Specified by:: setUrlRegex in interface Preparator

Parameters:: urlRegex - the new URL regex (may be null)
See Also:: accepts(RawDocument)

accepts

public boolean accepts(RawDocument rawDocument)

Gets whether the preparator is able to process the given document. This is the case if its URL matches the URL regex.

Specified by:: accepts in interface Preparator

Parameters:: rawDocument - The document to check.
Returns:: Whether the preparator is able to process the given document.
See Also:: setUrlRegex(RE)
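
The acceptance rule, including the documented case that a null regex accepts nothing, might look like this in outline. The sketch below is an assumption-laden stand-in: `UrlAcceptSketch` is a hypothetical class, it takes the URL as a String rather than a RawDocument, and it uses `java.util.regex.Pattern` with `find()` in place of `org.apache.regexp.RE`.

```java
import java.util.regex.Pattern;

final class UrlAcceptSketch {

  /** The URL rule; null means that no document is accepted. */
  private Pattern urlRegex;

  void setUrlRegex(Pattern urlRegex) {
    this.urlRegex = urlRegex;
  }

  /** Accepts a URL only if a rule is set and the rule matches somewhere in the URL. */
  boolean accepts(String url) {
    return urlRegex != null && urlRegex.matcher(url).find();
  }

  public static void main(String[] args) {
    UrlAcceptSketch preparator = new UrlAcceptSketch();
    System.out.println(preparator.accepts("http://example.com/a.pdf")); // prints "false"
    preparator.setUrlRegex(Pattern.compile("\\.pdf$"));
    System.out.println(preparator.accepts("http://example.com/a.pdf")); // prints "true"
  }
}
```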

getTitle

public String getTitle()

Returns the title of the document.

If no title could be extracted, null is returned.

Specified by:: getTitle in interface Preparator

Returns:: The title of the document.

setTitle

public void setTitle(String title)

Sets the title of the document currently being prepared.

Specified by:: setTitle in interface WriteablePreparator

Parameters:: title - The title.

getCleanedContent

public String getCleanedContent()

Returns the content of the document, stripped of formatting information.

Specified by:: getCleanedContent in interface Preparator

Returns:: The cleaned content.

setCleanedContent

public void setCleanedContent(String cleanedContent)

Sets the content, stripped of formatting information, of the document currently being prepared.

Specified by:: setCleanedContent in interface WriteablePreparator

Parameters:: cleanedContent - The cleaned content.

getCleanedMetaData

public String getCleanedMetaData()

Specified by:: getCleanedMetaData in interface Preparator

Returns:: The cleaned meta data of the document.

setCleanedMetaData

public void setCleanedMetaData(String mCleanedMetaData)

Specified by:: setCleanedMetaData in interface WriteablePreparator

Parameters:: mCleanedMetaData - The cleaned meta data to set.

getSummary

public String getSummary()

Returns a summary of the document.

Since a summary cannot easily be produced, null is returned.

Specified by:: getSummary in interface Preparator

Returns:: A summary of the document.

setSummary

public void setSummary(String summary)

Sets the summary of the document currently being prepared.

Specified by:: setSummary in interface WriteablePreparator

Parameters:: summary - The summary.

getHeadlines

public String getHeadlines()

Returns the headlines of the document.

These are not the title of the document itself, but only the sub-headings used within the document. These headlines allow a better relevance to be calculated.

If no headlines were found, null is returned.

Specified by:: getHeadlines in interface Preparator

Returns:: The headlines of the document.

setHeadlines

public void setHeadlines(String headlines)

Sets the headlines found in the document currently being prepared.

Specified by:: setHeadlines in interface WriteablePreparator

Parameters:: headlines - The headlines.

getPath

public PathElement[] getPath()

Returns the path by which the document can be reached.

If no path is available, null is returned.

Specified by:: getPath in interface Preparator

Returns:: The path by which the document can be reached.

setPath

public void setPath(PathElement[] path)

Sets the path by which the document can be reached.

Parameters:: path - The path by which the document can be reached.

getAdditionalFields

public Map<String,String> getAdditionalFields()

Gets additional fields that should be indexed.

These fields will be indexed and stored.

Specified by:: getAdditionalFields in interface Preparator

Returns:: The additional fields or null.

addAdditionalField

public void addAdditionalField(String fieldName,
                               String fieldValue)

Adds an additional field to the current document.

This field will be indexed and stored.

Specified by:: addAdditionalField in interface WriteablePreparator

Parameters:: fieldName - The name of the field.; fieldValue - The value of the field.
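
One way to obtain the documented "additional fields or null" behaviour is to create the map only when the first field is added. The following sketch assumes exactly that; the lazy creation, the reset in `cleanUp()`, and the class name `AdditionalFieldsSketch` are illustrative guesses, not taken from the Regain source.

```java
import java.util.HashMap;
import java.util.Map;

final class AdditionalFieldsSketch {

  /** Stays null until the first field is added (assumption made for this sketch). */
  private HashMap<String, String> additionalFieldMap;

  /** Adds a field for the current document, creating the map on first use. */
  void addAdditionalField(String fieldName, String fieldValue) {
    if (additionalFieldMap == null) {
      additionalFieldMap = new HashMap<>();
    }
    additionalFieldMap.put(fieldName, fieldValue);
  }

  /** Returns the additional fields, or null if none were added. */
  Map<String, String> getAdditionalFields() {
    return additionalFieldMap;
  }

  /** Drops the fields of the current document before the next one is prepared. */
  void cleanUp() {
    additionalFieldMap = null;
  }

  public static void main(String[] args) {
    AdditionalFieldsSketch preparator = new AdditionalFieldsSketch();
    System.out.println(preparator.getAdditionalFields()); // prints "null"
    preparator.addAdditionalField("author", "Jane Doe");
    System.out.println(preparator.getAdditionalFields().get("author")); // prints "Jane Doe"
  }
}
```

Callers of such a getter need a null check before iterating over the fields.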

getPriority

public int getPriority()

Gets the priority of the preparator.

Specified by:: getPriority in interface Preparator

Returns:: The priority.

setPriority

public void setPriority(int priority)

Sets the priority of the preparator.

Specified by:: setPriority in interface Preparator

Parameters:: priority - The priority, read from the config or taken from the default settings.

cleanUp

public void cleanUp()

Releases all resources used for handling a document.

Specified by:: cleanUp in interface Preparator

concatenateStringParts

protected String concatenateStringParts(List<String> parts,
                                        int maxPartsUsed)

Concatenates all parts, using ', ' as the delimiter. If a part is empty or consists only of whitespace, it is skipped.

Parameters:: parts - The parts to concatenate.; maxPartsUsed - The number of parts used for the concatenation.
Returns:: The resulting string with all parts concatenated.
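
The concatenation rule can be sketched as below. The description does not say whether maxPartsUsed counts skipped parts, so this sketch assumes it limits the number of non-blank parts that actually end up in the result; `ConcatSketch` is a hypothetical name and the code is not the Regain implementation.

```java
import java.util.Arrays;
import java.util.List;

final class ConcatSketch {

  /**
   * Joins the non-blank parts with ", ".
   * Assumption: maxPartsUsed caps the number of parts that are actually appended.
   */
  static String concatenateStringParts(List<String> parts, int maxPartsUsed) {
    StringBuilder result = new StringBuilder();
    int used = 0;
    for (String part : parts) {
      if (used >= maxPartsUsed) {
        break;
      }
      if (part == null || part.trim().isEmpty()) {
        continue; // empty or whitespace-only parts are skipped
      }
      if (used > 0) {
        result.append(", ");
      }
      result.append(part);
      used++;
    }
    return result.toString();
  }

  public static void main(String[] args) {
    List<String> parts = Arrays.asList("Alice", "  ", "Bob", "Carol");
    System.out.println(concatenateStringParts(parts, 2)); // prints "Alice, Bob"
  }
}
```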

close

public void close()
           throws RegainException

Frees all resources reserved by the preparator.

Is called at the end of the crawler process after all documents were processed.

Specified by:: close in interface Preparator

Throws:: RegainException - If freeing the resources failed.
