Regain 2.1.0-STABLE API
java.lang.Object
  net.sf.regain.crawler.plugin.AbstractCrawlerPlugin
public abstract class AbstractCrawlerPlugin
Abstract crawler plugin. Contains an empty stub method for each event.
CrawlerPlugin
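Because every event has an empty stub, a subclass presumably only needs to override the events it cares about. The following is a minimal sketch of that pattern, not Regain code: `PluginStub` is a hypothetical stand-in for a subset of `AbstractCrawlerPlugin` so that the example compiles on its own, and `Object` stands in for `CrawlerJob`. In a real plugin the class would extend `net.sf.regain.crawler.plugin.AbstractCrawlerPlugin` instead.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class UrlCounterSketch {

    /** Hypothetical stand-in for a subset of AbstractCrawlerPlugin; NOT the Regain class. */
    static abstract class PluginStub {
        public void onAcceptURL(String url, Object job) { }  // empty stub (Object replaces CrawlerJob)
        public void onDeclineURL(String url) { }             // empty stub
    }

    /** Overrides only the two URL events and counts how often each one fires. */
    static class UrlCounterPlugin extends PluginStub {
        final AtomicInteger accepted = new AtomicInteger();
        final AtomicInteger declined = new AtomicInteger();

        @Override
        public void onAcceptURL(String url, Object job) {
            accepted.incrementAndGet();
        }

        @Override
        public void onDeclineURL(String url) {
            declined.incrementAndGet();
        }
    }

    public static void main(String[] args) {
        UrlCounterPlugin plugin = new UrlCounterPlugin();
        plugin.onAcceptURL("http://example.org/a", null);
        plugin.onAcceptURL("http://example.org/b", null);
        plugin.onDeclineURL("http://example.org/private");
        System.out.println(plugin.accepted.get() + " accepted, "
                + plugin.declined.get() + " declined");
    }
}
```

All other events keep their inherited no-op behaviour, which is the point of deriving from an abstract adapter class rather than implementing the full interface.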
Constructor Summary | |
---|---|
AbstractCrawlerPlugin()
|
Method Summary | |
---|---|
boolean | checkDynamicBlacklist(String url, String sourceUrl, String sourceLinkText): Allows specific URLs to be blacklisted. |
void | init(PreparatorConfig config): Initializes the preparator or plugin. |
void | onAcceptURL(String url, CrawlerJob job): Called during the crawling process when a new URL is added to the processing queue. |
void | onAfterPrepare(RawDocument document, WriteablePreparator preparator): Called after a document has been prepared to be added to the index. |
void | onBeforePrepare(RawDocument document, WriteablePreparator preparator): Called before a document is prepared to be added to the index. |
void | onCreateIndexEntry(org.apache.lucene.document.Document doc, org.apache.lucene.index.IndexWriter index): Called when a document is added to the index. |
void | onDeclineURL(String url): Called during the crawling process when a new URL is declined, i.e. not added to the processing queue. |
void | onDeleteIndexEntry(org.apache.lucene.document.Document doc, org.apache.lucene.index.IndexReader index): Called when a document is deleted from the index. |
void | onFinishCrawling(Crawler crawler): Called after the crawling process has finished or was aborted (because of an exception). |
void | onStartCrawling(Crawler crawler): Called before the crawling process starts (Crawler::run()). |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public AbstractCrawlerPlugin()
Method Detail |
---|
public void onStartCrawling(Crawler crawler)
Specified by:
onStartCrawling in interface CrawlerPlugin
Parameters:
crawler - The crawler instance that is about to begin crawling

public void onFinishCrawling(Crawler crawler)
Specified by:
onFinishCrawling in interface CrawlerPlugin
Parameters:
crawler - The crawler instance that is about to finish crawling

public boolean checkDynamicBlacklist(String url, String sourceUrl, String sourceLinkText)
Specified by:
checkDynamicBlacklist in interface CrawlerPlugin
Parameters:
url - URL of the crawling job that should normally be added.
sourceUrl - The URL where the url above has been found (a-Tag, PDF or similar)
sourceLinkText - The label of the URL in the document where the url above has been found.
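As an illustration of how the three parameters might be used together, here is a minimal sketch of a blacklist check. It makes two assumptions that this page does not confirm: that returning `true` means the URL is blacklisted, and that `sourceLinkText` may be `null`. The rules themselves (calendar query strings, "logout" links) are hypothetical. The class is standalone so that it compiles without Regain; in a real plugin the method would be an `@Override` in a subclass of `AbstractCrawlerPlugin`.

```java
import java.util.regex.Pattern;

public class BlacklistSketch {

    // Hypothetical rule: calendar pages produce endless "?month=N" URLs.
    private static final Pattern CALENDAR = Pattern.compile("[?&]month=\\d+");

    /**
     * Same parameter list as checkDynamicBlacklist.
     * ASSUMPTION: true means "blacklist this URL", false means "allow it".
     */
    public boolean checkDynamicBlacklist(String url, String sourceUrl, String sourceLinkText) {
        if (CALENDAR.matcher(url).find()) {
            return true;
        }
        // Hypothetical rule: never follow links labelled "logout".
        return sourceLinkText != null
                && sourceLinkText.trim().equalsIgnoreCase("logout");
    }

    public static void main(String[] args) {
        BlacklistSketch plugin = new BlacklistSketch();
        System.out.println(plugin.checkDynamicBlacklist(
                "http://example.org/cal?month=12", "http://example.org/", "Next month"));
        System.out.println(plugin.checkDynamicBlacklist(
                "http://example.org/docs/manual.pdf", "http://example.org/", "Manual"));
    }
}
```

Since the method is described as handling URLs "that should normally be added", a check like this would only narrow what the static whitelist and blacklist already allow.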
public void onAcceptURL(String url, CrawlerJob job)
Specified by:
onAcceptURL in interface CrawlerPlugin
Parameters:
url - URL that was just accepted
job - CrawlerJob that was created as a consequence

public void onDeclineURL(String url)
Specified by:
onDeclineURL in interface CrawlerPlugin
Parameters:
url - URL that was just declined

public void onCreateIndexEntry(org.apache.lucene.document.Document doc, org.apache.lucene.index.IndexWriter index)
Specified by:
onCreateIndexEntry in interface CrawlerPlugin
Parameters:
doc - Document to write
index - Lucene Index Writer

public void onDeleteIndexEntry(org.apache.lucene.document.Document doc, org.apache.lucene.index.IndexReader index)
Specified by:
onDeleteIndexEntry in interface CrawlerPlugin
Parameters:
doc - Document to read
index - Lucene Index Reader

public void onBeforePrepare(RawDocument document, WriteablePreparator preparator)
Specified by:
onBeforePrepare in interface CrawlerPlugin
Parameters:
document - Regain document that will be analysed
preparator - Preparator that was chosen to analyse this document

public void onAfterPrepare(RawDocument document, WriteablePreparator preparator)
Specified by:
onAfterPrepare in interface CrawlerPlugin
Parameters:
document - Regain document that was analysed
preparator - Preparator that has analysed this document

public void init(PreparatorConfig config) throws RegainException
Specified by:
init in interface Pluggable
Parameters:
config - The configuration for this preparator or plugin.
Throws:
RegainException - When the regular expression or the configuration has an error.
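The contract above (fail in `init` when a regular expression or a setting is invalid) can be sketched as follows. This is an assumption-laden illustration, not Regain code: `ConfigException` is a hypothetical stand-in for `RegainException`, a plain `Map` replaces `PreparatorConfig` (whose API is not shown on this page), and the setting name `urlPattern` is invented.

```java
import java.util.Map;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class InitSketch {

    /** Hypothetical stand-in for RegainException; NOT the Regain class. */
    static class ConfigException extends Exception {
        ConfigException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    private Pattern urlPattern;

    /** Stand-in for init(PreparatorConfig): a Map replaces the config object. */
    public void init(Map<String, String> config) throws ConfigException {
        String regex = config.get("urlPattern"); // hypothetical setting name
        if (regex == null) {
            throw new ConfigException("Missing setting 'urlPattern'", null);
        }
        try {
            // Compile once at start-up so a bad pattern fails before crawling begins.
            urlPattern = Pattern.compile(regex);
        } catch (PatternSyntaxException exc) {
            throw new ConfigException("Invalid regular expression: " + regex, exc);
        }
    }

    public boolean matches(String url) {
        return urlPattern.matcher(url).matches();
    }

    public static void main(String[] args) {
        try {
            InitSketch plugin = new InitSketch();
            plugin.init(Map.of("urlPattern", ".*\\.pdf"));
            System.out.println(plugin.matches("http://example.org/manual.pdf"));
            new InitSketch().init(Map.of("urlPattern", "([unclosed"));
        } catch (ConfigException exc) {
            System.out.println("rejected: " + exc.getMessage());
        }
    }
}
```

Validating in `init` rather than in an event method means a misconfigured plugin is reported once, up front, instead of on every URL or document.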