Regain 2.1.0-STABLE API

net.sf.regain.crawler.document
Class AbstractPreparator

java.lang.Object
  extended by net.sf.regain.crawler.document.AbstractPreparator
All Implemented Interfaces:
Pluggable, Preparator, WriteablePreparator
Direct Known Subclasses:
AbstractJacobMsOfficePreparator, DispatcherPreparator, EmptyPreparator, ExternalPreparator, FilenamePreparator, GenericAudioPreparator, HtmlPreparator, IfilterPreparator, JarPreparator, JavaPreparator, MessagePreparator, MP3Preparator, OpenOfficePreparator, PdfBoxPreparator, PlainTextPreparator, PoiMsOfficePreparator, SimpleRtfPreparator, SwingRtfPreparator, XmlPreparator, ZipPreparator

public abstract class AbstractPreparator
extends Object
implements Preparator, WriteablePreparator

Abstract implementation of a preparator.

Implements the getter methods and handles the clean-up between two preparations (see cleanUp()).

Subclasses may set the values using the setter methods.

Author:
Til Schneider, www.murfman.de

Field Summary
private  HashMap<String,String> mAdditionalFieldMap
          The additional fields that should be indexed.
private  String mCleanedContent
          The cleaned content.
private  String mCleanedMetaData
          The cleaned meta data of the document.
private  String mHeadlines
          The extracted headlines.
private  String[] mMimeTypes
          The MIME types assigned to the preparator.
private  PathElement[] mPath
          The path by which the document can be reached.
private  int mPriority
          The priority of the preparator.
private  String mSummary
          The summary of the document.
private  String mTitle
          The title that was found.
private  org.apache.regexp.RE mUrlRegex
          The regular expression a URL must match to be prepared by this preparator.
 
Fields inherited from interface net.sf.regain.crawler.document.Preparator
DEFAULT_BUFFER_SIZE
 
Constructor Summary
AbstractPreparator()
          Creates a new instance of AbstractPreparator.
AbstractPreparator(org.apache.regexp.RE urlRegex)
          Creates a new instance of AbstractPreparator.
AbstractPreparator(String mimeType)
          Creates a new instance of AbstractPreparator.
AbstractPreparator(String[] mimeTypeArr)
          Creates a new instance of AbstractPreparator.
 
Method Summary
 boolean accepts(RawDocument rawDocument)
          Gets whether the preparator is able to process the given document.
 void addAdditionalField(String fieldName, String fieldValue)
          Adds an additional field to the current document.
 void cleanUp()
          Releases all resources used for handling a document.
 void close()
          Frees all resources reserved by the preparator.
protected  String concatenateStringParts(List<String> parts, int maxPartsUsed)
          Concatenate all parts together, use ', ' as delimiter.
private static org.apache.regexp.RE createExtentionRegex(String extention)
          Creates a regex that matches a file extension.
private static org.apache.regexp.RE createExtentionRegex(String[] extentionArr)
          Creates a regex that matches a set of file extensions.
 Map<String,String> getAdditionalFields()
          Gets additional fields that should be indexed.
 String getCleanedContent()
          Gets the content of the document, stripped of formatting information.
 String getCleanedMetaData()
           
 String getHeadlines()
          Gets the headlines of the document.
 PathElement[] getPath()
          Gets the path by which the document can be reached.
 int getPriority()
          Gets the priority of the preparator
 String getSummary()
          Gets a summary of the document.
 String getTitle()
          Gets the title of the document.
 void init(PreparatorConfig config)
          Initializes the preparator.
 void setCleanedContent(String cleanedContent)
          Sets the content, stripped of formatting information, of the document currently being prepared.
 void setCleanedMetaData(String mCleanedMetaData)
           
 void setHeadlines(String headlines)
          Sets the headlines that were found in the document currently being prepared.
 void setPath(PathElement[] path)
          Sets the path by which the document can be reached.
 void setPriority(int priority)
          Sets the priority of the preparator
 void setSummary(String summary)
          Sets the summary of the document currently being prepared.
 void setTitle(String title)
          Sets the title of the document currently being prepared.
 void setUrlRegex(org.apache.regexp.RE urlRegex)
          Sets the regular expression a URL must match to be prepared by this preparator.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface net.sf.regain.crawler.document.Preparator
prepare
 

Field Detail

mUrlRegex

private org.apache.regexp.RE mUrlRegex
The regular expression a URL must match to be prepared by this preparator.


mTitle

private String mTitle
The title that was found.


mCleanedContent

private String mCleanedContent
The cleaned content.


mSummary

private String mSummary
The summary of the document.


mCleanedMetaData

private String mCleanedMetaData
The cleaned meta data of the document.


mHeadlines

private String mHeadlines
The extracted headlines. May be null.


mPath

private PathElement[] mPath
The path by which the document can be reached.


mAdditionalFieldMap

private HashMap<String,String> mAdditionalFieldMap
The additional fields that should be indexed.


mMimeTypes

private String[] mMimeTypes
The MIME types assigned to the preparator.


mPriority

private int mPriority
The priority of the preparator. Used for the selection of preparators

Constructor Detail

AbstractPreparator

public AbstractPreparator()
Creates a new instance of AbstractPreparator.

The preparator won't accept any documents until a rule is defined using setUrlRegex(RE).

See Also:
setUrlRegex(RE), accepts(RawDocument)

AbstractPreparator

public AbstractPreparator(org.apache.regexp.RE urlRegex)
Creates a new instance of AbstractPreparator.

If urlRegex is null, the preparator won't accept any documents.

Parameters:
urlRegex - The regex a URL must match to be accepted by this preparator (may be null).
See Also:
setUrlRegex(RE), accepts(RawDocument)

AbstractPreparator

public AbstractPreparator(String mimeType)
                   throws RegainException
Creates a new instance of AbstractPreparator.

If mimeType is null or empty, the preparator won't accept any documents.

Parameters:
mimeType - The file extension a URL must have to be accepted by this preparator.
Throws:
RegainException - If creating the preparator failed.
See Also:
setUrlRegex(RE), accepts(RawDocument)

AbstractPreparator

public AbstractPreparator(String[] mimeTypeArr)
                   throws RegainException
Creates a new instance of AbstractPreparator.

If mimeTypeArr is null or empty, the preparator won't accept any documents.

Parameters:
mimeTypeArr - The file extensions, one of which a URL must have to be accepted by this preparator.
Throws:
RegainException - If creating the preparator failed.
See Also:
setUrlRegex(RE), accepts(RawDocument)
Method Detail

createExtentionRegex

private static org.apache.regexp.RE createExtentionRegex(String extention)
                                                  throws RegainException
Creates a regex that matches a file extension.

If extention is null or empty, null will be returned.

Parameters:
extention - The file extension to create the regex for.
Returns:
The regex.
Throws:
RegainException - If the regex couldn't be created.

createExtentionRegex

private static org.apache.regexp.RE createExtentionRegex(String[] extentionArr)
                                                  throws RegainException
Creates a regex that matches a set of file extensions.

If extentionArr is null or empty, null will be returned.

Parameters:
extentionArr - The file extensions to create the regex for.
Returns:
The regex.
Throws:
RegainException - If the regex couldn't be created.
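
The two helper methods above build a URL regex from file extensions. The following is a minimal sketch of that idea, not Regain's implementation: it uses java.util.regex.Pattern instead of org.apache.regexp.RE, the class ExtensionRegexSketch and the method name createExtensionRegex are hypothetical, and the pattern shape (a dot, one of the extensions, end of URL) is an assumption.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in; not part of the Regain API.
class ExtensionRegexSketch {

    // Single-extension variant: returns null for a null or empty extension.
    static Pattern createExtensionRegex(String extension) {
        if (extension == null || extension.isEmpty()) {
            return null;
        }
        return createExtensionRegex(new String[] { extension });
    }

    // Array variant: returns null for a null or empty array.
    static Pattern createExtensionRegex(String[] extensions) {
        if (extensions == null || extensions.length == 0) {
            return null;
        }
        StringBuilder alternatives = new StringBuilder();
        for (String extension : extensions) {
            if (alternatives.length() > 0) {
                alternatives.append('|');
            }
            // Quote each extension so regex metacharacters are taken literally.
            alternatives.append(Pattern.quote(extension));
        }
        // Assumption: the extension must follow a dot at the very end of the URL.
        return Pattern.compile("\\.(" + alternatives + ")$");
    }

    public static void main(String[] args) {
        Pattern regex = createExtensionRegex(new String[] { "pdf", "doc" });
        System.out.println(regex.matcher("http://example.com/report.doc").find());  // true
        System.out.println(regex.matcher("http://example.com/report.docx").find()); // false
        System.out.println(createExtensionRegex(new String[0]) == null);            // true
    }
}
```

Returning null for empty input matches the documented contract, and a null regex in turn means the preparator accepts no documents.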

init

public void init(PreparatorConfig config)
          throws RegainException
Initializes the preparator.

Does nothing by default. May be overridden by subclasses.

Specified by:
init in interface Pluggable
Parameters:
config - The configuration for this preparator.
Throws:
RegainException - If the regular expression or the configuration has an error.

setUrlRegex

public void setUrlRegex(org.apache.regexp.RE urlRegex)
Sets the regular expression a URL must match to be prepared by this preparator.

If urlRegex is null, the preparator won't accept any documents.

Specified by:
setUrlRegex in interface Preparator
Parameters:
urlRegex - the new URL regex (may be null)
See Also:
accepts(RawDocument)

accepts

public boolean accepts(RawDocument rawDocument)
Gets whether the preparator is able to process the given document. This is the case if its URL matches the URL regex.

Specified by:
accepts in interface Preparator
Parameters:
rawDocument - The document to check.
Returns:
Whether the preparator is able to process the given document.
See Also:
setUrlRegex(RE)
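
The interplay of setUrlRegex and accepts can be sketched as follows. This is a self-contained illustration under stated assumptions, not the class's actual code: java.util.regex.Pattern stands in for org.apache.regexp.RE, a plain URL string stands in for RawDocument, the class name UrlRegexAcceptSketch is hypothetical, and a match anywhere in the URL is assumed to count.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in; not part of the Regain API.
class UrlRegexAcceptSketch {

    // May be null; a null regex means "accept nothing".
    private Pattern urlRegex;

    void setUrlRegex(Pattern urlRegex) {
        this.urlRegex = urlRegex;
    }

    boolean accepts(String url) {
        if (urlRegex == null) {
            return false;
        }
        // Assumption: a match anywhere in the URL is sufficient.
        return urlRegex.matcher(url).find();
    }

    public static void main(String[] args) {
        UrlRegexAcceptSketch preparator = new UrlRegexAcceptSketch();
        // No rule defined yet, so nothing is accepted.
        System.out.println(preparator.accepts("http://example.com/a.txt")); // false

        preparator.setUrlRegex(Pattern.compile("\\.txt$"));
        System.out.println(preparator.accepts("http://example.com/a.txt")); // true
        System.out.println(preparator.accepts("http://example.com/a.pdf")); // false
    }
}
```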

getTitle

public String getTitle()
Gets the title of the document.

If no title could be extracted, null is returned.

Specified by:
getTitle in interface Preparator
Returns:
The title of the document.

setTitle

public void setTitle(String title)
Sets the title of the document currently being prepared.

Specified by:
setTitle in interface WriteablePreparator
Parameters:
title - The title.

getCleanedContent

public String getCleanedContent()
Gets the content of the document, stripped of formatting information.

Specified by:
getCleanedContent in interface Preparator
Returns:
The cleaned content.

setCleanedContent

public void setCleanedContent(String cleanedContent)
Sets the content, stripped of formatting information, of the document currently being prepared.

Specified by:
setCleanedContent in interface WriteablePreparator
Parameters:
cleanedContent - The cleaned content.

getCleanedMetaData

public String getCleanedMetaData()
Specified by:
getCleanedMetaData in interface Preparator
Returns:
The cleaned meta data of the document.

setCleanedMetaData

public void setCleanedMetaData(String mCleanedMetaData)
Specified by:
setCleanedMetaData in interface WriteablePreparator
Parameters:
mCleanedMetaData - The cleaned meta data to set.

getSummary

public String getSummary()
Gets a summary of the document.

Since creating a summary is not easily possible, null is returned.

Specified by:
getSummary in interface Preparator
Returns:
A summary of the document.

setSummary

public void setSummary(String summary)
Sets the summary of the document currently being prepared.

Specified by:
setSummary in interface WriteablePreparator
Parameters:
summary - The summary.

getHeadlines

public String getHeadlines()
Gets the headlines of the document.

These are not the headline of the document itself, but only the sub-headlines used within the document. These headlines make it possible to calculate a better relevance.

If no headlines were found, null is returned.

Specified by:
getHeadlines in interface Preparator
Returns:
The headlines of the document.

setHeadlines

public void setHeadlines(String headlines)
Sets the headlines that were found in the document currently being prepared.

Specified by:
setHeadlines in interface WriteablePreparator
Parameters:
headlines - The headlines.

getPath

public PathElement[] getPath()
Gets the path by which the document can be reached.

If no path is available, null is returned.

Specified by:
getPath in interface Preparator
Returns:
The path by which the document can be reached.

setPath

public void setPath(PathElement[] path)
Sets the path by which the document can be reached.

Parameters:
path - The path by which the document can be reached.

getAdditionalFields

public Map<String,String> getAdditionalFields()
Gets additional fields that should be indexed.

These fields will be indexed and stored.

Specified by:
getAdditionalFields in interface Preparator
Returns:
The additional fields or null.

addAdditionalField

public void addAdditionalField(String fieldName,
                               String fieldValue)
Adds an additional field to the current document.

This field will be indexed and stored.

Specified by:
addAdditionalField in interface WriteablePreparator
Parameters:
fieldName - The name of the field.
fieldValue - The value of the field.
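
How addAdditionalField and getAdditionalFields fit together can be illustrated with a small stand-in. The class AdditionalFieldsSketch is hypothetical, and creating the map lazily on the first add is an assumption chosen to be consistent with the documented return value "The additional fields or null".

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in; not part of the Regain API.
class AdditionalFieldsSketch {

    // Assumption: the map stays null until the first field is added.
    private HashMap<String, String> additionalFieldMap;

    void addAdditionalField(String fieldName, String fieldValue) {
        if (additionalFieldMap == null) {
            additionalFieldMap = new HashMap<>();
        }
        additionalFieldMap.put(fieldName, fieldValue);
    }

    // Returns the additional fields, or null if none were added.
    Map<String, String> getAdditionalFields() {
        return additionalFieldMap;
    }

    public static void main(String[] args) {
        AdditionalFieldsSketch preparator = new AdditionalFieldsSketch();
        System.out.println(preparator.getAdditionalFields()); // null

        preparator.addAdditionalField("author", "Jane Doe");
        System.out.println(preparator.getAdditionalFields().get("author")); // Jane Doe
    }
}
```

Callers of such a getter need a null check before iterating over the fields.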

getPriority

public int getPriority()
Gets the priority of the preparator

Specified by:
getPriority in interface Preparator
Returns:
The priority of the preparator.

setPriority

public void setPriority(int priority)
Sets the priority of the preparator

Specified by:
setPriority in interface Preparator
Parameters:
priority - The priority, read from the configuration or taken from the default settings.

cleanUp

public void cleanUp()
Releases all resources used for handling a document.

Specified by:
cleanUp in interface Preparator
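
The class description says the clean-up between two preparations is handled here. A minimal sketch of what that could look like is shown below; the class CleanUpSketch is hypothetical, only two of the per-document values are modelled, and resetting them to null is an assumption rather than documented behaviour.

```java
// Hypothetical stand-in; not part of the Regain API.
class CleanUpSketch {

    private String title;
    private String cleanedContent;

    void setTitle(String title) {
        this.title = title;
    }

    String getTitle() {
        return title;
    }

    void setCleanedContent(String cleanedContent) {
        this.cleanedContent = cleanedContent;
    }

    String getCleanedContent() {
        return cleanedContent;
    }

    // Assumption: every per-document value is reset to null so that nothing
    // from the previous document leaks into the next preparation.
    void cleanUp() {
        title = null;
        cleanedContent = null;
    }

    public static void main(String[] args) {
        CleanUpSketch preparator = new CleanUpSketch();
        preparator.setTitle("First document");
        preparator.setCleanedContent("Some text");
        System.out.println(preparator.getTitle()); // First document

        preparator.cleanUp();
        System.out.println(preparator.getTitle());          // null
        System.out.println(preparator.getCleanedContent()); // null
    }
}
```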

concatenateStringParts

protected String concatenateStringParts(List<String> parts,
                                        int maxPartsUsed)
Concatenates all parts, using ', ' as the delimiter. If a part is empty or consists only of whitespace, it is skipped.

Parameters:
parts - The parts to concatenate.
maxPartsUsed - The number of parts used for concatenation.
Returns:
The resulting string with all single parts concatenated.
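
The concatenation rule can be sketched as follows. This is an illustration, not the method's actual code: the class ConcatenateSketch is hypothetical, and it assumes that maxPartsUsed limits the number of non-blank parts that end up in the result, which the documentation does not state explicitly.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in; not part of the Regain API.
class ConcatenateSketch {

    static String concatenateStringParts(List<String> parts, int maxPartsUsed) {
        StringBuilder result = new StringBuilder();
        int used = 0;
        for (String part : parts) {
            if (used >= maxPartsUsed) {
                break;
            }
            // Skip parts that are null, empty, or whitespace only.
            if (part == null || part.trim().isEmpty()) {
                continue;
            }
            if (used > 0) {
                result.append(", ");
            }
            result.append(part);
            // Assumption: only parts that are actually used count toward the limit.
            used++;
        }
        return result.toString();
    }

    public static void main(String[] args) {
        List<String> parts = Arrays.asList("Alpha", "  ", "Beta", "Gamma");
        System.out.println(concatenateStringParts(parts, 2)); // Alpha, Beta
        System.out.println(concatenateStringParts(parts, 5)); // Alpha, Beta, Gamma
    }
}
```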

close

public void close()
           throws RegainException
Frees all resources reserved by the preparator.

Is called at the end of the crawler process after all documents were processed.

Specified by:
close in interface Preparator
Throws:
RegainException - If freeing the resources failed.


Regain 2.1.0-STABLE, Copyright (C) 2004-2010 Til Schneider, www.murfman.de, Thomas Tesche, www.clustersystems.info