AbstractCrawlerPlugin (API documentation for Regain 2.1.0-STABLE)

net.sf.regain.crawler.plugin
Class AbstractCrawlerPlugin

java.lang.Object
  net.sf.regain.crawler.plugin.AbstractCrawlerPlugin

All Implemented Interfaces:: Pluggable, CrawlerPlugin

Direct Known Subclasses:: FilesizeFilterPlugin

public abstract class AbstractCrawlerPlugin
extends Object
implements CrawlerPlugin

Abstract crawler plugin. Contains an empty stub method for each event, so subclasses only need to override the events they care about.

Author:: Benjamin
See Also:: CrawlerPlugin

Constructor Summary
`AbstractCrawlerPlugin()`

Method Summary
`boolean`	`checkDynamicBlacklist(String url, String sourceUrl, String sourceLinkText)` Allows specific URLs to be blacklisted.
`void`	`init(PreparatorConfig config)` Initializes the preparator or plugin.
`void`	`onAcceptURL(String url, CrawlerJob job)` Called during the crawling process when a new URL is added to the processing queue.
`void`	`onAfterPrepare(RawDocument document, WriteablePreparator preparator)` Called after a document has been prepared for addition to the index.
`void`	`onBeforePrepare(RawDocument document, WriteablePreparator preparator)` Called before a document is prepared for addition to the index.
`void`	`onCreateIndexEntry(org.apache.lucene.document.Document doc, org.apache.lucene.index.IndexWriter index)` Called when a document is added to the index.
`void`	`onDeclineURL(String url)` Called during the crawling process when a new URL is declined, i.e. not added to the processing queue.
`void`	`onDeleteIndexEntry(org.apache.lucene.document.Document doc, org.apache.lucene.index.IndexReader index)` Called when a document is deleted from the index.
`void`	`onFinishCrawling(Crawler crawler)` Called after the crawling process has finished or aborted (because of an exception).
`void`	`onStartCrawling(Crawler crawler)` Called before the crawling process starts (Crawler::run()).

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Constructor Detail

AbstractCrawlerPlugin

public AbstractCrawlerPlugin()

Method Detail

onStartCrawling

public void onStartCrawling(Crawler crawler)

Description copied from interface: CrawlerPlugin

Called before the crawling process starts (Crawler::run()). This may be called multiple times during the lifetime of a plugin instance, but CrawlerPlugin::onFinishCrawling() is always called in between.

Specified by:: onStartCrawling in interface CrawlerPlugin

Parameters:: crawler - The crawler instance that is about to begin crawling

onFinishCrawling

public void onFinishCrawling(Crawler crawler)

Description copied from interface: CrawlerPlugin

Called after the crawling process has finished or aborted (because of an exception). This may be called multiple times during the lifetime of a plugin instance.

Specified by:: onFinishCrawling in interface CrawlerPlugin

Parameters:: crawler - The crawler instance that has finished (or aborted) crawling
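
The start/finish pairing described above can be sketched as follows. This is a minimal illustration, not code from Regain: `CrawlTimer` and its accessors are hypothetical names, and the empty `Crawler` class is a stand-in for Regain's crawler type so that the example compiles without the Regain jar. In a real plugin the class would extend AbstractCrawlerPlugin and both methods would carry `@Override`. Because one plugin instance may see several crawl runs, per-run state is reset in `onStartCrawling`.

```java
// Stand-in for Regain's Crawler class (assumption: only used as a parameter type here).
class Crawler {}

// Hypothetical plugin body that times each crawl run.
class CrawlTimer {
  private long startNanos;
  private boolean running;
  private int completedRuns;

  // Called before each run; reset the per-run state here.
  public void onStartCrawling(Crawler crawler) {
    startNanos = System.nanoTime();
    running = true;
  }

  // Called after each run, whether it finished normally or aborted.
  public void onFinishCrawling(Crawler crawler) {
    if (!running) {
      return; // defensive: no matching start was seen
    }
    running = false;
    completedRuns++;
    long millis = (System.nanoTime() - startNanos) / 1_000_000L;
    System.out.println("Crawl run " + completedRuns + " took " + millis + " ms");
  }

  public boolean isRunning() {
    return running;
  }

  public int getCompletedRuns() {
    return completedRuns;
  }

  public static void main(String[] args) {
    CrawlTimer timer = new CrawlTimer();
    Crawler crawler = new Crawler();
    // Two runs on the same plugin instance, each start followed by a finish.
    timer.onStartCrawling(crawler);
    timer.onFinishCrawling(crawler);
    timer.onStartCrawling(crawler);
    timer.onFinishCrawling(crawler);
  }
}
```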

checkDynamicBlacklist

public boolean checkDynamicBlacklist(String url,
                                     String sourceUrl,
                                     String sourceLinkText)

Description copied from interface: CrawlerPlugin

Allows specific URLs to be blacklisted. This method is called only when the URL would otherwise be accepted, i.e. it is included in the whitelist and not included in the blacklist.

Specified by:: checkDynamicBlacklist in interface CrawlerPlugin

Parameters:: url - URL of the crawling job that would normally be added.; sourceUrl - The URL of the document in which the url above was found (an a tag, a PDF or similar); sourceLinkText - The label of the link in the document in which the url above was found.
Returns:: true to blacklist this URL; false to allow it.
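
As a rough illustration of the return contract, the sketch below rejects links that look like logout or print-view links. The class name `LogoutLinkBlacklist` and the matching rules are made up for this example, and the class does not extend AbstractCrawlerPlugin so that it compiles without the Regain jar; a real plugin would extend it and mark the method with `@Override`. The null check on `sourceLinkText` is a precaution, not documented behaviour.

```java
import java.util.Locale;

// Hypothetical stand-in for a subclass of AbstractCrawlerPlugin.
class LogoutLinkBlacklist {

  // Same signature as in the class above; the rules are illustrative only.
  public boolean checkDynamicBlacklist(String url, String sourceUrl, String sourceLinkText) {
    String lowerUrl = url.toLowerCase(Locale.ROOT);
    if (lowerUrl.contains("logout") || lowerUrl.contains("print=1")) {
      return true; // true: blacklist this URL
    }
    // Assumption: the link label might be missing, so guard against null.
    if (sourceLinkText != null && sourceLinkText.trim().equalsIgnoreCase("log out")) {
      return true;
    }
    return false; // false: allow this URL
  }

  public static void main(String[] args) {
    LogoutLinkBlacklist plugin = new LogoutLinkBlacklist();
    System.out.println(plugin.checkDynamicBlacklist(
        "http://example.org/account/logout", "http://example.org/", "Sign off"));
    System.out.println(plugin.checkDynamicBlacklist(
        "http://example.org/docs/intro.html", "http://example.org/", "Introduction"));
  }
}
```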

onAcceptURL

public void onAcceptURL(String url,
                        CrawlerJob job)

Description copied from interface: CrawlerPlugin

Called during the crawling process when a new URL is added to the processing queue. As the queue is filled recursively, these calls can occur between prepare calls.

Specified by:: onAcceptURL in interface CrawlerPlugin

Parameters:: url - URL that was just accepted; job - CrawlerJob that was created as a consequence

onDeclineURL

public void onDeclineURL(String url)

Description copied from interface: CrawlerPlugin

Called during the crawling process when a new URL is declined, i.e. not added to the processing queue. Note that ignored URLs (that is, URLs that were already accepted or declined before) do not appear here.

Specified by:: onDeclineURL in interface CrawlerPlugin

Parameters:: url - URL that was just declined
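
One possible use of this event is to tally declined URLs per host for later reporting. The sketch below is an assumption-laden example rather than part of Regain: `DeclinedUrlStats`, `getDeclinedPerHost` and the placeholder keys are hypothetical, and the class omits `extends AbstractCrawlerPlugin` so that it compiles on its own. It makes no claim about which thread the crawler calls the method from.

```java
import java.net.URI;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a subclass of AbstractCrawlerPlugin.
class DeclinedUrlStats {
  // Host name (or a placeholder key) mapped to the number of declined URLs.
  private final Map<String, Integer> declinedPerHost = new TreeMap<String, Integer>();

  // Same signature as in the class above; counts each declined URL once.
  public void onDeclineURL(String url) {
    String host = hostOf(url);
    Integer count = declinedPerHost.get(host);
    declinedPerHost.put(host, count == null ? 1 : count + 1);
  }

  public Map<String, Integer> getDeclinedPerHost() {
    return declinedPerHost;
  }

  // Extracts the host; URLs without a host or with invalid syntax get placeholder keys.
  private static String hostOf(String url) {
    try {
      String host = URI.create(url).getHost();
      return host != null ? host : "(no host)";
    } catch (IllegalArgumentException e) {
      return "(unparsable)";
    }
  }

  public static void main(String[] args) {
    DeclinedUrlStats stats = new DeclinedUrlStats();
    stats.onDeclineURL("http://example.org/a");
    stats.onDeclineURL("http://example.org/b");
    stats.onDeclineURL("file:///tmp/report.txt");
    System.out.println(stats.getDeclinedPerHost());
  }
}
```

Since ignored URLs never reach this method, the counts cover only first-time declines.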

onCreateIndexEntry

public void onCreateIndexEntry(org.apache.lucene.document.Document doc,
                               org.apache.lucene.index.IndexWriter index)

Description copied from interface: CrawlerPlugin

Called when a document is added to the index. This may be a newly indexed document, or a document that has changed since it was last indexed and is therefore reindexed.

Specified by:: onCreateIndexEntry in interface CrawlerPlugin

Parameters:: doc - Document to write; index - Lucene Index Writer

onDeleteIndexEntry

public void onDeleteIndexEntry(org.apache.lucene.document.Document doc,
                               org.apache.lucene.index.IndexReader index)

Description copied from interface: CrawlerPlugin

Called when a document is deleted from the index. Note that when a document is replaced by another one ("update index"), the new document is added to the index first; deleting the old one is part of the clean-up phase at the end.

Specified by:: onDeleteIndexEntry in interface CrawlerPlugin

Parameters:: doc - Document to read; index - Lucene Index Reader

onBeforePrepare

public void onBeforePrepare(RawDocument document,
                            WriteablePreparator preparator)

Description copied from interface: CrawlerPlugin

Called before a document is prepared for addition to the index. (This is a good point to fill in default values.)

Specified by:: onBeforePrepare in interface CrawlerPlugin

Parameters:: document - Regain document that will be analysed; preparator - Preparator that was chosen to analyse this document

onAfterPrepare

public void onAfterPrepare(RawDocument document,
                           WriteablePreparator preparator)

Description copied from interface: CrawlerPlugin

Called after a document has been prepared for addition to the index. Here you can override the results of the preparator, if necessary. (Note that not all documents that are prepared will be added to the index. They may be parsed only in order to extract URLs or because the file was changed on the same day.)

Specified by:: onAfterPrepare in interface CrawlerPlugin

Parameters:: document - Regain document that was analysed; preparator - Preparator that has analysed this document

init

public void init(PreparatorConfig config)
          throws RegainException

Description copied from interface: Pluggable

Initializes the preparator or plugin.

Specified by:: init in interface Pluggable

Parameters:: config - The configuration for this preparator or plugin.
Throws:: RegainException - When the regular expression or the configuration has an error.

Regain 2.1.0-STABLE, Copyright (C) 2004-2010 Til Schneider, www.murfman.de, Thomas Tesche, www.clustersystems.info