Regain 2.1.0-STABLE API

net.sf.regain.crawler.plugin
Class AbstractCrawlerPlugin

java.lang.Object
  extended by net.sf.regain.crawler.plugin.AbstractCrawlerPlugin
All Implemented Interfaces:
Pluggable, CrawlerPlugin
Direct Known Subclasses:
FilesizeFilterPlugin

public abstract class AbstractCrawlerPlugin
extends Object
implements CrawlerPlugin

Abstract crawler plugin. Contains an empty stub method for each event, so subclasses only need to override the events they care about.

Author:
Benjamin
See Also:
CrawlerPlugin

Constructor Summary
AbstractCrawlerPlugin()
           
 
Method Summary
 boolean checkDynamicBlacklist(String url, String sourceUrl, String sourceLinkText)
          Allows blacklisting of specific URLs.
 void init(PreparatorConfig config)
          Initializes the preparator or plugin.
 void onAcceptURL(String url, CrawlerJob job)
          Called during the crawling process when a new URL is added to the processing queue.
 void onAfterPrepare(RawDocument document, WriteablePreparator preparator)
          Called after a document has been prepared to be added to the index.
 void onBeforePrepare(RawDocument document, WriteablePreparator preparator)
          Called before a document is prepared to be added to the index.
 void onCreateIndexEntry(org.apache.lucene.document.Document doc, org.apache.lucene.index.IndexWriter index)
          Called when a document is added to the index.
 void onDeclineURL(String url)
          Called during the crawling process when a new URL is declined and therefore not added to the processing queue.
 void onDeleteIndexEntry(org.apache.lucene.document.Document doc, org.apache.lucene.index.IndexReader index)
          Called when a document is deleted from the index.
 void onFinishCrawling(Crawler crawler)
          Called after the crawling process has finished or aborted (because of an exception).
 void onStartCrawling(Crawler crawler)
          Called before the crawling process starts (Crawler::run()).
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

AbstractCrawlerPlugin

public AbstractCrawlerPlugin()
Method Detail

onStartCrawling

public void onStartCrawling(Crawler crawler)
Description copied from interface: CrawlerPlugin
Called before the crawling process starts (Crawler::run()). This may be called multiple times during the lifetime of a plugin instance, but CrawlerPlugin::onFinishCrawling() is always called in between.

Specified by:
onStartCrawling in interface CrawlerPlugin
Parameters:
crawler - The crawler instance that is about to begin crawling

onFinishCrawling

public void onFinishCrawling(Crawler crawler)
Description copied from interface: CrawlerPlugin
Called after the crawling process has finished or aborted (because of an exception). This may be called multiple times during the lifetime of a plugin instance.

Specified by:
onFinishCrawling in interface CrawlerPlugin
Parameters:
crawler - The crawler instance that has finished (or aborted) crawling
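
As an illustration of the start/finish pairing described above, here is a minimal sketch of a plugin that tracks whether a crawl is in progress and how many runs have ended. The names RunCounterPlugin, isRunning and getCompletedRuns are hypothetical, and the empty Crawler class is only a stand-in so the sketch compiles without Regain on the classpath. A real plugin would extend AbstractCrawlerPlugin and use net.sf.regain.crawler.Crawler.

```java
// Stand-in for net.sf.regain.crawler.Crawler (assumption: only needed so this compiles alone).
final class Crawler {}

// Hypothetical plugin; a real one would declare "extends AbstractCrawlerPlugin".
class RunCounterPlugin {
    private boolean running = false;
    private int completedRuns = 0;

    public void onStartCrawling(Crawler crawler) {
        // Per-run state belongs here: the same plugin instance may see several runs.
        running = true;
    }

    public void onFinishCrawling(Crawler crawler) {
        // Also reached when the crawl was aborted because of an exception.
        running = false;
        completedRuns++;
    }

    public boolean isRunning() {
        return running;
    }

    public int getCompletedRuns() {
        return completedRuns;
    }
}
```

Because every start is followed by a finish before the next start, a single boolean is enough to track the running state in this sketch.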

checkDynamicBlacklist

public boolean checkDynamicBlacklist(String url,
                                     String sourceUrl,
                                     String sourceLinkText)
Description copied from interface: CrawlerPlugin
Allows blacklisting of specific URLs. This method is called when the URL would normally be accepted, i.e. when it is included in the whitelist and not included in the blacklist.

Specified by:
checkDynamicBlacklist in interface CrawlerPlugin
Parameters:
url - URL of the crawling job that should normally be added.
sourceUrl - The URL where the url above has been found (a-Tag, PDF or similar)
sourceLinkText - The label of the URL in the document where the url above has been found.
Returns:
true to blacklist this URL; false to allow it.
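
The following sketch shows one way an override of this method might look: it blacklists URLs that carry a jsessionid parameter. The class name SessionIdBlacklistPlugin and the choice of rule are assumptions made for illustration. The class is written without "extends AbstractCrawlerPlugin" only so that it compiles without Regain on the classpath.

```java
import java.util.regex.Pattern;

// Hypothetical plugin; a real one would declare "extends AbstractCrawlerPlugin".
class SessionIdBlacklistPlugin {
    // Matches a jsessionid path or query parameter, ignoring case.
    private static final Pattern SESSION_ID =
        Pattern.compile("[;?&]jsessionid=", Pattern.CASE_INSENSITIVE);

    public boolean checkDynamicBlacklist(String url, String sourceUrl, String sourceLinkText) {
        if (url == null) {
            // Nothing to inspect, so do not blacklist.
            return false;
        }
        // true blacklists the URL, false lets it through.
        return SESSION_ID.matcher(url).find();
    }
}
```

The sourceUrl and sourceLinkText arguments are unused here. A rule based on where a link was found, or on its label, would inspect them in the same way.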

onAcceptURL

public void onAcceptURL(String url,
                        CrawlerJob job)
Description copied from interface: CrawlerPlugin
Called during the crawling process when a new URL is added to the processing queue. As the queue is filled recursively, these calls can occur between prepare calls.

Specified by:
onAcceptURL in interface CrawlerPlugin
Parameters:
url - URL that was just accepted
job - CrawlerJob that was created as a consequence

onDeclineURL

public void onDeclineURL(String url)
Description copied from interface: CrawlerPlugin
Called during the crawling process when a new URL is declined and therefore not added to the processing queue. Note that ignored URLs (that is, URLs that were already accepted or declined before) do not appear here.

Specified by:
onDeclineURL in interface CrawlerPlugin
Parameters:
url - URL that was just declined
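
A minimal sketch of how this event could be used to collect declined URLs for a later report. The class name DeclinedUrlLogPlugin and the accessor getDeclinedUrls are hypothetical. The class omits "extends AbstractCrawlerPlugin" only so that it compiles without Regain on the classpath.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical plugin; a real one would declare "extends AbstractCrawlerPlugin".
class DeclinedUrlLogPlugin {
    private final List<String> declined = new ArrayList<String>();

    public void onDeclineURL(String url) {
        // Record the URL in the order in which it was declined.
        declined.add(url);
    }

    // Read-only view of the URLs declined so far.
    public List<String> getDeclinedUrls() {
        return Collections.unmodifiableList(declined);
    }
}
```

Since URLs that were already accepted or declined are not reported again, the sketch does not deduplicate the list itself.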

onCreateIndexEntry

public void onCreateIndexEntry(org.apache.lucene.document.Document doc,
                               org.apache.lucene.index.IndexWriter index)
Description copied from interface: CrawlerPlugin
Called when a document is added to the index. This may be a newly indexed document, or a document that has changed since it was last indexed and is therefore reindexed.

Specified by:
onCreateIndexEntry in interface CrawlerPlugin
Parameters:
doc - Document to write
index - Lucene Index Writer

onDeleteIndexEntry

public void onDeleteIndexEntry(org.apache.lucene.document.Document doc,
                               org.apache.lucene.index.IndexReader index)
Description copied from interface: CrawlerPlugin
Called when a document is deleted from the index. Note that when a document is replaced by another one ("update index"), the replacement is added to the index first; deleting the old document is part of the clean-up phase at the end.

Specified by:
onDeleteIndexEntry in interface CrawlerPlugin
Parameters:
doc - Document to read
index - Lucene Index Reader

onBeforePrepare

public void onBeforePrepare(RawDocument document,
                            WriteablePreparator preparator)
Description copied from interface: CrawlerPlugin
Called before a document is prepared to be added to the index. (A good point to fill in default values.)

Specified by:
onBeforePrepare in interface CrawlerPlugin
Parameters:
document - Regain document that will be analysed
preparator - Preparator that was chosen to analyse this document

onAfterPrepare

public void onAfterPrepare(RawDocument document,
                           WriteablePreparator preparator)
Description copied from interface: CrawlerPlugin
Called after a document has been prepared to be added to the index. Here you can override the results of the preparator, if necessary. (Note that not all documents that are prepared will be added to the index. They may be parsed only in order to extract URLs or because the file was changed on the same day.)

Specified by:
onAfterPrepare in interface CrawlerPlugin
Parameters:
document - Regain document that was analysed
preparator - Preparator that has analysed this document

init

public void init(PreparatorConfig config)
          throws RegainException
Description copied from interface: Pluggable
Initializes the preparator or plugin.

Specified by:
init in interface Pluggable
Parameters:
config - The configuration for this preparator or plugin.
Throws:
RegainException - When the regular expression or the configuration has an error.
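
To illustrate the failure mode named above, the following sketch validates a regular expression at initialization time and reports a bad value as a checked exception. Everything Regain-specific is replaced by stand-ins: a plain Map takes the place of PreparatorConfig, ConfigException takes the place of RegainException, and the setting name "urlPattern" is invented for the example. None of these reflect the actual Regain API.

```java
import java.util.Map;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Stand-in for RegainException (assumption: only needed so this compiles alone).
class ConfigException extends Exception {
    ConfigException(String message, Throwable cause) {
        super(message, cause);
    }
}

// Hypothetical plugin; a real one would extend AbstractCrawlerPlugin
// and receive a PreparatorConfig instead of a Map.
class PatternConfigPlugin {
    private Pattern urlPattern;

    public void init(Map<String, String> settings) throws ConfigException {
        String regex = settings.get("urlPattern"); // hypothetical setting name
        if (regex == null) {
            throw new ConfigException("Missing setting 'urlPattern'", null);
        }
        try {
            // Compile once at startup so a bad pattern fails early.
            urlPattern = Pattern.compile(regex);
        } catch (PatternSyntaxException exc) {
            throw new ConfigException("Invalid regular expression: " + regex, exc);
        }
    }

    public Pattern getUrlPattern() {
        return urlPattern;
    }
}
```

Compiling the pattern in init rather than on first use means a configuration error surfaces before crawling starts instead of in the middle of a run.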

Regain 2.1.0-STABLE, Copyright (C) 2004-2010 Til Schneider, www.murfman.de, Thomas Tesche, www.clustersystems.info