CrawlerPlugin (API documentation for Regain 2.1.0-STABLE)

net.sf.regain.crawler.plugin
Interface CrawlerPlugin

All Superinterfaces:: Pluggable

All Known Implementing Classes:: AbstractCrawlerPlugin, FilesizeFilterPlugin

public interface CrawlerPlugin
extends Pluggable

All crawler plugins must implement this interface. If you want to implement only some of these methods, you can inherit empty stub methods from AbstractCrawlerPlugin. A typical call order may be:

onStartCrawling
onAcceptURL
onAcceptURL
onDeclineURL
onBeforePrepare
onAfterPrepare
onCreateIndexEntry
onBeforePrepare
onAfterPrepare
onCreateIndexEntry
...
onDeleteIndexEntry
onDeleteIndexEntry
onDeleteIndexEntry
...
onFinishCrawling

Author:: Benjamin

Method Summary
`boolean`	`checkDynamicBlacklist(String url, String sourceUrl, String sourceLinkText)` Allows blacklisting of specific URLs.
`void`	`onAcceptURL(String url, CrawlerJob job)` Called during the crawling process when a new URL is added to the processing Queue.
`void`	`onAfterPrepare(RawDocument document, WriteablePreparator preparator)` Called after a document has been prepared to be added to the index.
`void`	`onBeforePrepare(RawDocument document, WriteablePreparator preparator)` Called before a document is prepared to be added to the index.
`void`	`onCreateIndexEntry(org.apache.lucene.document.Document doc, org.apache.lucene.index.IndexWriter index)` Called when a document is added to the index.
`void`	`onDeclineURL(String url)` Called during the crawling process when a new URL is declined to be added to the processing Queue.
`void`	`onDeleteIndexEntry(org.apache.lucene.document.Document doc, org.apache.lucene.index.IndexReader index)` Called when a document is deleted from the index.
`void`	`onFinishCrawling(Crawler crawler)` Called after the crawling process has finished or aborted (because of an exception).
`void`	`onStartCrawling(Crawler crawler)` Called before the crawling process starts (Crawler::run()).

Methods inherited from interface net.sf.regain.crawler.document.Pluggable
`init`

Method Detail

onStartCrawling

void onStartCrawling(Crawler crawler)

Called before the crawling process starts (Crawler::run()). This may be called multiple times during the lifetime of a plugin instance, but CrawlerPlugin::onFinishCrawling() is always called in between.

Parameters:: crawler - The crawler instance that is about to begin crawling

onFinishCrawling

void onFinishCrawling(Crawler crawler)

Called after the crawling process has finished or aborted (because of an exception). This may be called multiple times during the lifetime of a plugin instance.

Parameters:: crawler - The crawler instance that has finished (or aborted) crawling
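
Because onStartCrawling and onFinishCrawling may each be called several times on the same plugin instance, per-run state is best reset at the start of every run. The following is a minimal sketch of that pattern. CrawlRunTracker and its method names are hypothetical helpers, not part of Regain; a real plugin would presumably call them from its own onStartCrawling, onCreateIndexEntry and onFinishCrawling implementations.

```java
// Hypothetical helper, not part of Regain: keeps per-run state consistent
// across repeated start/finish pairs on one plugin instance.
class CrawlRunTracker {
    private boolean running = false;
    private int completedRuns = 0;
    private int documentsThisRun = 0;

    // Intended to be called from onStartCrawling(Crawler crawler).
    void start() {
        if (running) {
            // The documentation states that onFinishCrawling always comes in between.
            throw new IllegalStateException("previous run was not finished");
        }
        running = true;
        documentsThisRun = 0; // reset per-run state, the instance may be reused
    }

    // Intended to be called from onCreateIndexEntry(...).
    void documentIndexed() {
        documentsThisRun++;
    }

    // Intended to be called from onFinishCrawling(Crawler crawler),
    // which also fires when the crawl was aborted by an exception.
    void finish() {
        running = false;
        completedRuns++;
    }

    int completedRuns() {
        return completedRuns;
    }

    int documentsThisRun() {
        return documentsThisRun;
    }
}
```

The counter for the current run is cleared in start() rather than in finish(), so the numbers of the last run remain readable until the next one begins.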

checkDynamicBlacklist

boolean checkDynamicBlacklist(String url,
                              String sourceUrl,
                              String sourceLinkText)

Allows blacklisting of specific URLs. This method is called when the URL would normally be accepted, i.e. it is included in the whitelist and not included in the blacklist.

Parameters:: url - URL of the crawling job that would normally be added; sourceUrl - The URL of the document in which the url above was found (a tag, PDF or similar); sourceLinkText - The label of the link in the document in which the url above was found
Returns:: true to blacklist this URL, false to allow it.
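
As an illustration of the return convention, the sketch below rejects URLs that carry a session ID and links labelled "logout". The class SessionLinkBlacklist and both rules are assumptions made for this example. The class does not implement CrawlerPlugin, so that it compiles without Regain on the classpath; only the parameter list is taken from the signature above. Whether sourceLinkText can be null is not stated here, so the sketch guards against it.

```java
import java.util.Locale;

// Hypothetical example class, not part of Regain.
class SessionLinkBlacklist {

    // Same parameter list as CrawlerPlugin.checkDynamicBlacklist.
    // Returns true to blacklist the URL, false to allow it.
    boolean checkDynamicBlacklist(String url, String sourceUrl, String sourceLinkText) {
        // Rule 1 (assumed): skip URLs that embed a servlet session ID.
        if (url.toLowerCase(Locale.ROOT).contains("jsessionid=")) {
            return true;
        }
        // Rule 2 (assumed): skip links whose label is "logout", ignoring case and padding.
        if (sourceLinkText != null && sourceLinkText.trim().equalsIgnoreCase("logout")) {
            return true;
        }
        return false;
    }
}
```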

onAcceptURL

void onAcceptURL(String url,
                 CrawlerJob job)

Called during the crawling process when a new URL is added to the processing queue. As the queue is filled recursively, these calls can occur between prepare calls.

Parameters:: url - URL that just was accepted; job - CrawlerJob that was created as a consequence

onDeclineURL

void onDeclineURL(String url)

Called during the crawling process when a new URL is declined, i.e. not added to the processing queue. Note that ignored URLs (that is, URLs that were already accepted or declined before) do not appear here.

Parameters:: url - URL that just was declined
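
The two URL callbacks can be combined into a simple decision log. The following sketch is hypothetical: UrlDecisionLog is not a Regain class, and its onAcceptURL omits the CrawlerJob parameter of the real method so that the example compiles on its own.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical example class, not part of Regain.
class UrlDecisionLog {
    // Insertion-ordered sets keep the URLs in the order the crawler reported them.
    private final Set<String> accepted = new LinkedHashSet<>();
    private final Set<String> declined = new LinkedHashSet<>();

    // Simplified: the real onAcceptURL also receives the created CrawlerJob.
    void onAcceptURL(String url) {
        accepted.add(url);
    }

    void onDeclineURL(String url) {
        declined.add(url);
    }

    String summary() {
        return accepted.size() + " accepted, " + declined.size() + " declined";
    }
}
```

Since URLs that were already accepted or declined are not reported again, each URL should be seen at most once per run; the sets are only a safeguard against double counting.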

onCreateIndexEntry

void onCreateIndexEntry(org.apache.lucene.document.Document doc,
                        org.apache.lucene.index.IndexWriter index)

Called when a document is added to the index. This may be a newly indexed document, or a document that has changed since the last crawl and is therefore reindexed.

Parameters:: doc - Document to write; index - Lucene Index Writer

onDeleteIndexEntry

void onDeleteIndexEntry(org.apache.lucene.document.Document doc,
                        org.apache.lucene.index.IndexReader index)

Called when a document is deleted from the index. Note that when a document is replaced by another one ("update index"), the replacement is added to the index first; deleting the old entry is part of the clean-up phase at the end.

Parameters:: doc - Document to read; index - Lucene Index Reader

onBeforePrepare

void onBeforePrepare(RawDocument document,
                     WriteablePreparator preparator)

Called before a document is prepared to be added to the index. (This is a good point to fill in default values.)

Parameters:: document - Regain document that will be analysed; preparator - Preparator that was chosen to analyse this document

onAfterPrepare

void onAfterPrepare(RawDocument document,
                    WriteablePreparator preparator)

Called after a document has been prepared to be added to the index. Here you can override the results of the preparator, if necessary. (Note that not all documents that are prepared will be added to the index. They may be parsed only in order to extract URLs or because the file was changed on the same day.)

Parameters:: document - Regain document that was analysed; preparator - Preparator that has analysed this document