Regain 2.1.0-STABLE API
public interface CrawlerPlugin
All crawler plugins must implement this interface. If you want to implement only some of these methods, you can inherit empty stub methods from AbstractCrawlerPlugin.

A typical call order may be:

  onStartCrawling
  onAcceptURL
  onAcceptURL
  onDeclineURL
  onBeforePrepare
  onAfterPrepare
  onCreateIndexEntry
  onBeforePrepare
  onAfterPrepare
  onCreateIndexEntry
  ...
  onDeleteIndexEntry
  onDeleteIndexEntry
  onDeleteIndexEntry
  ...
  onFinishCrawling
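The stub-inheritance idea and the call order described above can be sketched without the Regain jar. This is only an illustration under stated assumptions: `MiniCrawlerPlugin`, `MiniAbstractPlugin` and `UrlLogPlugin` are hypothetical stand-ins, not Regain types, and the callbacks take plain Strings (or nothing) instead of the `Crawler` and `CrawlerJob` arguments of the real interface.

```java
import java.util.ArrayList;
import java.util.List;

public class CallOrderSketch {

    // Hypothetical stand-in for a subset of CrawlerPlugin. The real methods
    // take Regain types (Crawler, CrawlerJob) that are left out here.
    interface MiniCrawlerPlugin {
        void onStartCrawling();
        void onAcceptURL(String url);
        void onDeclineURL(String url);
        void onFinishCrawling();
    }

    // Plays the role the documentation assigns to AbstractCrawlerPlugin:
    // every callback is an empty stub.
    static abstract class MiniAbstractPlugin implements MiniCrawlerPlugin {
        @Override public void onStartCrawling() {}
        @Override public void onAcceptURL(String url) {}
        @Override public void onDeclineURL(String url) {}
        @Override public void onFinishCrawling() {}
    }

    // Overrides only the two URL callbacks; the rest stay no-ops.
    static class UrlLogPlugin extends MiniAbstractPlugin {
        final List<String> events = new ArrayList<>();

        @Override public void onAcceptURL(String url) {
            events.add("accept " + url);
        }

        @Override public void onDeclineURL(String url) {
            events.add("decline " + url);
        }
    }

    // Drives the plugin in the order listed as typical in the documentation.
    public static List<String> simulate() {
        UrlLogPlugin plugin = new UrlLogPlugin();
        plugin.onStartCrawling();
        plugin.onAcceptURL("http://example.org/a");
        plugin.onAcceptURL("http://example.org/b");
        plugin.onDeclineURL("http://example.org/private");
        plugin.onFinishCrawling();
        return plugin.events;
    }

    public static void main(String[] args) {
        simulate().forEach(System.out::println);
    }
}
```

In a real plugin the same pattern would presumably apply by extending AbstractCrawlerPlugin and overriding only the callbacks of interest, with the signatures listed in the method summary.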
Method Summary

Return type | Method | Description |
---|---|---|
boolean | checkDynamicBlacklist(String url, String sourceUrl, String sourceLinkText) | Allows specific URLs to be blacklisted. |
void | onAcceptURL(String url, CrawlerJob job) | Called during the crawling process when a new URL is added to the processing queue. |
void | onAfterPrepare(RawDocument document, WriteablePreparator preparator) | Called after a document has been prepared to be added to the index. |
void | onBeforePrepare(RawDocument document, WriteablePreparator preparator) | Called before a document is prepared to be added to the index. |
void | onCreateIndexEntry(org.apache.lucene.document.Document doc, org.apache.lucene.index.IndexWriter index) | Called when a document is added to the index. |
void | onDeclineURL(String url) | Called during the crawling process when a new URL is declined and therefore not added to the processing queue. |
void | onDeleteIndexEntry(org.apache.lucene.document.Document doc, org.apache.lucene.index.IndexReader index) | Called when a document is deleted from the index. |
void | onFinishCrawling(Crawler crawler) | Called after the crawling process has finished or was aborted (because of an exception). |
void | onStartCrawling(Crawler crawler) | Called before the crawling process starts (Crawler::run()). |
Methods inherited from interface net.sf.regain.crawler.document.Pluggable |
---|
init |
Method Detail |
---|
void onStartCrawling(Crawler crawler)
  crawler - The crawler instance that is about to begin crawling

void onFinishCrawling(Crawler crawler)
  crawler - The crawler instance that is about to finish crawling

boolean checkDynamicBlacklist(String url, String sourceUrl, String sourceLinkText)
  url - URL of the crawling job that should normally be added.
  sourceUrl - The URL where the url above has been found (a-Tag, PDF or similar)
  sourceLinkText - The label of the URL in the document where the url above has been found.
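A blacklist check with the three parameters described above might look like the following sketch. It is not Regain's implementation: the suffix list and the "logout" rule are invented for illustration, and the meaning of the return value is an assumption (here `true` is taken to mean "blacklist this URL"), since this page does not document it.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class DynamicBlacklistSketch {

    // Hypothetical rule set, not part of Regain.
    private static final List<String> BLOCKED_SUFFIXES = Arrays.asList(".zip", ".exe");

    // Same parameter list as CrawlerPlugin.checkDynamicBlacklist.
    // ASSUMPTION: returning true means "blacklist this URL".
    public static boolean checkDynamicBlacklist(String url, String sourceUrl,
                                                String sourceLinkText) {
        String lowerUrl = url.toLowerCase(Locale.ROOT);
        for (String suffix : BLOCKED_SUFFIXES) {
            if (lowerUrl.endsWith(suffix)) {
                return true; // binary download, skip it
            }
        }
        // The link text may be missing, so guard against null before inspecting it.
        return sourceLinkText != null
                && sourceLinkText.toLowerCase(Locale.ROOT).contains("logout");
    }

    public static void main(String[] args) {
        System.out.println(checkDynamicBlacklist(
                "http://example.org/setup.exe", "http://example.org/", "Download"));
        System.out.println(checkDynamicBlacklist(
                "http://example.org/doc.pdf", "http://example.org/", "Manual"));
    }
}
```

The `sourceUrl` parameter is unused in this sketch; a rule that depends on where the link was found (for example, ignoring all links from one page) would inspect it in the same way.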
void onAcceptURL(String url, CrawlerJob job)
  url - URL that was just accepted
  job - CrawlerJob that was created as a consequence

void onDeclineURL(String url)
  url - URL that was just declined

void onCreateIndexEntry(org.apache.lucene.document.Document doc, org.apache.lucene.index.IndexWriter index)
  doc - Document to write
  index - Lucene Index Writer

void onDeleteIndexEntry(org.apache.lucene.document.Document doc, org.apache.lucene.index.IndexReader index)
  doc - Document to read
  index - Lucene Index Reader

void onBeforePrepare(RawDocument document, WriteablePreparator preparator)
  document - Regain document that will be analysed
  preparator - Preparator that was chosen to analyse this document

void onAfterPrepare(RawDocument document, WriteablePreparator preparator)
  document - Regain document that was analysed
  preparator - Preparator that has analysed this document
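Because onBeforePrepare and onAfterPrepare bracket the preparation of each document, one possible use is measuring how long preparation takes. The sketch below shows that pairing under simplifying assumptions: a plain String key stands in for the `RawDocument` argument, the `WriteablePreparator` argument is omitted, and the class `PrepareTimingSketch` and its injectable clock are hypothetical, not part of Regain.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongSupplier;

public class PrepareTimingSketch {

    private final LongSupplier clock;
    // Start time per document that is currently being prepared.
    private final Map<String, Long> started = new LinkedHashMap<>();
    // Finished measurements, in completion order.
    private final Map<String, Long> elapsed = new LinkedHashMap<>();

    // The clock is injected so the sketch can be exercised deterministically.
    public PrepareTimingSketch(LongSupplier clock) {
        this.clock = clock;
    }

    // Stand-in for onBeforePrepare(RawDocument, WriteablePreparator).
    public void onBeforePrepare(String documentKey) {
        started.put(documentKey, clock.getAsLong());
    }

    // Stand-in for onAfterPrepare(RawDocument, WriteablePreparator).
    public void onAfterPrepare(String documentKey) {
        Long start = started.remove(documentKey);
        if (start != null) {
            // Ignore "after" calls that have no matching "before" call.
            elapsed.put(documentKey, clock.getAsLong() - start);
        }
    }

    public Map<String, Long> elapsed() {
        return elapsed;
    }

    public static void main(String[] args) {
        PrepareTimingSketch timing = new PrepareTimingSketch(System::nanoTime);
        timing.onBeforePrepare("http://example.org/a.pdf");
        timing.onAfterPrepare("http://example.org/a.pdf");
        System.out.println(timing.elapsed());
    }
}
```

Whether onAfterPrepare is also called when preparation fails is not stated on this page, so entries left in `started` at the end of a crawl should not be read as errors without checking the Regain source.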