Regain 2.1.0-STABLE API

net.sf.regain.crawler.plugin
Interface CrawlerPlugin

All Superinterfaces:
Pluggable
All Known Implementing Classes:
AbstractCrawlerPlugin, FilesizeFilterPlugin

public interface CrawlerPlugin
extends Pluggable

All crawler plugins need to satisfy this interface. If you want to implement only some of these methods, you can inherit empty stub methods from AbstractCrawlerPlugin. A typical call order may be: onStartCrawling, onAcceptURL, onAcceptURL, onDeclineURL, onBeforePrepare, onAfterPrepare, onCreateIndexEntry, onBeforePrepare, onAfterPrepare, onCreateIndexEntry, ..., onDeleteIndexEntry, onDeleteIndexEntry, onDeleteIndexEntry, ..., onFinishCrawling.
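
The stub-inheritance pattern described above can be sketched as follows. This is a minimal illustration, not regain code: CrawlerStandIn and AbstractPluginStandIn are hypothetical placeholders for the real Crawler and AbstractCrawlerPlugin classes, included only so the sketch compiles without regain on the classpath, and CallOrderRecorder is an invented example plugin.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for regain's Crawler class (assumption, not the real type).
class CrawlerStandIn {}

// Hypothetical stand-in for AbstractCrawlerPlugin: every callback is an empty stub.
// Only three of the interface's callbacks are mirrored here to keep the sketch short.
abstract class AbstractPluginStandIn {
    public void onStartCrawling(CrawlerStandIn crawler) {}
    public void onDeclineURL(String url) {}
    public void onFinishCrawling(CrawlerStandIn crawler) {}
}

// Overrides only the callbacks it cares about; onDeclineURL stays a no-op stub.
class CallOrderRecorder extends AbstractPluginStandIn {
    final List<String> calls = new ArrayList<>();

    @Override
    public void onStartCrawling(CrawlerStandIn crawler) {
        calls.add("onStartCrawling");
    }

    @Override
    public void onFinishCrawling(CrawlerStandIn crawler) {
        calls.add("onFinishCrawling");
    }
}
```

In a real plugin the class would extend AbstractCrawlerPlugin instead of the stand-in, and the callbacks would receive the actual regain types.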

Author:
Benjamin

Method Summary
 boolean checkDynamicBlacklist(String url, String sourceUrl, String sourceLinkText)
          Allows blacklisting of specific URLs.
 void onAcceptURL(String url, CrawlerJob job)
          Called during the crawling process when a new URL is added to the processing queue.
 void onAfterPrepare(RawDocument document, WriteablePreparator preparator)
          Called after a document has been prepared to be added to the index.
 void onBeforePrepare(RawDocument document, WriteablePreparator preparator)
          Called before a document is prepared to be added to the index.
 void onCreateIndexEntry(org.apache.lucene.document.Document doc, org.apache.lucene.index.IndexWriter index)
          Called when a document is added to the index.
 void onDeclineURL(String url)
          Called during the crawling process when a new URL is declined, i.e. not added to the processing queue.
 void onDeleteIndexEntry(org.apache.lucene.document.Document doc, org.apache.lucene.index.IndexReader index)
          Called when a document is deleted from the index.
 void onFinishCrawling(Crawler crawler)
          Called after the crawling process has finished or was aborted (because of an exception).
 void onStartCrawling(Crawler crawler)
          Called before the crawling process starts (Crawler::run()).
 
Methods inherited from interface net.sf.regain.crawler.document.Pluggable
init
 

Method Detail

onStartCrawling

void onStartCrawling(Crawler crawler)
Called before the crawling process starts (Crawler::run()). This may be called multiple times during the lifetime of a plugin instance, but CrawlerPlugin::onFinishCrawling() is always called in between.

Parameters:
crawler - The crawler instance that is about to begin crawling

onFinishCrawling

void onFinishCrawling(Crawler crawler)
Called after the crawling process has finished or was aborted (because of an exception). This may be called multiple times during the lifetime of a plugin instance.

Parameters:
crawler - The crawler instance that has finished (or aborted) crawling
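
Because onStartCrawling and onFinishCrawling may each be called several times on the same plugin instance, per-run state is best reset at the start of every run. The following is a minimal sketch of that idea; CrawlerPlaceholder is a hypothetical stand-in for regain's Crawler class so the code compiles on its own, and RunTimer is an invented example, not part of regain.

```java
// Hypothetical stand-in for regain's Crawler class (assumption, not the real type).
class CrawlerPlaceholder {}

// Measures each crawl run separately and counts how many runs have completed.
class RunTimer {
    private long startNanos;
    private boolean running;
    private int completedRuns;

    public void onStartCrawling(CrawlerPlaceholder crawler) {
        // Reset per-run state: this callback can fire again after a finished run.
        startNanos = System.nanoTime();
        running = true;
    }

    public void onFinishCrawling(CrawlerPlaceholder crawler) {
        if (!running) {
            return; // defensive: ignore a finish without a preceding start
        }
        long elapsedMs = (System.nanoTime() - startNanos) / 1_000_000;
        running = false;
        completedRuns++;
        System.out.println("Crawl run " + completedRuns + " took " + elapsedMs + " ms");
    }

    public int getCompletedRuns() {
        return completedRuns;
    }

    public boolean isRunning() {
        return running;
    }
}
```

Since onFinishCrawling is also called when the crawl was aborted by an exception, any cleanup placed there should not assume that the run succeeded.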

checkDynamicBlacklist

boolean checkDynamicBlacklist(String url,
                              String sourceUrl,
                              String sourceLinkText)
Allows blacklisting of specific URLs. This function is called when the URL would normally be accepted, i.e. when it is included in the whitelist and not included in the blacklist.

Parameters:
url - URL of the crawling job that should normally be added.
sourceUrl - The URL where the url above has been found (a-Tag, PDF or similar)
sourceLinkText - The label of the URL in the document where the url above has been found.
Returns:
True: blacklist this URL. False: Allow this URL.
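
As an illustration of the return convention (true rejects, false allows), here is a minimal sketch of a blacklist check. The two rules used, rejecting URLs that carry a jsessionid parameter and rejecting links labelled "Logout", are assumptions chosen for the example, and LogoutLinkBlacklist is an invented class name; a real plugin would put this method in a class that implements CrawlerPlugin or extends AbstractCrawlerPlugin.

```java
import java.util.Locale;

class LogoutLinkBlacklist {

    // Same signature as CrawlerPlugin.checkDynamicBlacklist.
    // Returns true to blacklist the URL, false to allow it.
    public boolean checkDynamicBlacklist(String url, String sourceUrl, String sourceLinkText) {
        // Example rule 1 (assumption): skip URLs that embed a session id.
        if (url != null && url.toLowerCase(Locale.ROOT).contains("jsessionid=")) {
            return true;
        }
        // Example rule 2 (assumption): never follow a link labelled "Logout".
        if (sourceLinkText != null && sourceLinkText.trim().equalsIgnoreCase("logout")) {
            return true;
        }
        // Everything else keeps its normal whitelist/blacklist outcome.
        return false;
    }
}
```

The null checks are a precaution only; whether the crawler can actually pass null for the link text is not stated in this documentation.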

onAcceptURL

void onAcceptURL(String url,
                 CrawlerJob job)
Called during the crawling process when a new URL is added to the processing queue. As the queue is filled recursively, these calls can occur between prepare calls.

Parameters:
url - URL that was just accepted
job - CrawlerJob that was created as a consequence

onDeclineURL

void onDeclineURL(String url)
Called during the crawling process when a new URL is declined, i.e. not added to the processing queue. Note that ignored URLs (that is, URLs that were already accepted or declined before) do not appear here.

Parameters:
url - URL that was just declined
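
The two URL callbacks can be combined to keep a simple record of crawl decisions. The sketch below is illustrative only: CrawlerJobPlaceholder is a hypothetical stand-in for regain's CrawlerJob so the code compiles without the library, and UrlDecisionLog and its summary method are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for regain's CrawlerJob class (assumption, not the real type).
class CrawlerJobPlaceholder {}

// Records which URLs were accepted and which were declined during a crawl.
class UrlDecisionLog {
    private final List<String> accepted = new ArrayList<>();
    private final List<String> declined = new ArrayList<>();

    public void onAcceptURL(String url, CrawlerJobPlaceholder job) {
        accepted.add(url);
    }

    public void onDeclineURL(String url) {
        // Each URL should show up here at most once, because URLs that were
        // already accepted or declined are ignored and not reported again.
        declined.add(url);
    }

    public String summary() {
        return accepted.size() + " accepted, " + declined.size() + " declined";
    }
}
```

A plugin like this could print summary() from onFinishCrawling to report the totals for a run.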

onCreateIndexEntry

void onCreateIndexEntry(org.apache.lucene.document.Document doc,
                        org.apache.lucene.index.IndexWriter index)
Called when a document is added to the index. This may be a newly indexed document, or a document that has changed since it was last indexed and is therefore reindexed.

Parameters:
doc - Document to write
index - Lucene Index Writer

onDeleteIndexEntry

void onDeleteIndexEntry(org.apache.lucene.document.Document doc,
                        org.apache.lucene.index.IndexReader index)
Called when a document is deleted from the index. Note that when a document is replaced by another one ("update index"), the new document is added to the index first; deleting the old one is part of the clean-up phase at the end.

Parameters:
doc - Document to read
index - Lucene Index Reader

onBeforePrepare

void onBeforePrepare(RawDocument document,
                     WriteablePreparator preparator)
Called before a document is prepared to be added to the index. (This is a good point to fill in default values.)

Parameters:
document - Regain document that will be analysed
preparator - Preparator that was chosen to analyse this document

onAfterPrepare

void onAfterPrepare(RawDocument document,
                    WriteablePreparator preparator)
Called after a document has been prepared to be added to the index. Here you can override the results of the preparator, if necessary. (Note that not all documents that are prepared will be added to the index. They may be parsed only in order to extract URLs or because the file was changed on the same day.)

Parameters:
document - Regain document that was analysed
preparator - Preparator that has analysed this document

Regain 2.1.0-STABLE, Copyright (C) 2004-2010 Til Schneider, www.murfman.de, Thomas Tesche, www.clustersystems.info