Regain 2.1.0-STABLE API
public interface CrawlerPlugin
All crawler plugins must implement this interface. If you want to implement only some of these methods, you can inherit empty stub methods from AbstractCrawlerPlugin.

A typical call order may be:

onStartCrawling
onAcceptURL
onAcceptURL
onDeclineURL
onBeforePrepare
onAfterPrepare
onCreateIndexEntry
onBeforePrepare
onAfterPrepare
onCreateIndexEntry
...
onDeleteIndexEntry
onDeleteIndexEntry
onDeleteIndexEntry
...
onFinishCrawling
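To make the call order concrete, here is a minimal sketch that records lifecycle callbacks in the order they arrive. It does not use the real Regain types: `LifecycleCallbacks` is a hypothetical stand-in that mirrors only four of the callback names and drops the `Crawler` and `CrawlerJob` parameters, so the example compiles without Regain on the classpath. In a real plugin you would presumably extend AbstractCrawlerPlugin and override the methods with their documented signatures.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a subset of the CrawlerPlugin callbacks.
// The real methods take Crawler / CrawlerJob arguments, omitted here.
interface LifecycleCallbacks {
    void onStartCrawling();
    void onAcceptURL(String url);
    void onDeclineURL(String url);
    void onFinishCrawling();
}

// Records every callback in the order it is invoked.
class CallOrderRecorder implements LifecycleCallbacks {
    private final List<String> events = new ArrayList<>();

    @Override public void onStartCrawling() { events.add("onStartCrawling"); }
    @Override public void onAcceptURL(String url) { events.add("onAcceptURL " + url); }
    @Override public void onDeclineURL(String url) { events.add("onDeclineURL " + url); }
    @Override public void onFinishCrawling() { events.add("onFinishCrawling"); }

    // Returns a copy so callers cannot modify the internal log.
    List<String> events() { return new ArrayList<>(events); }

    // Simulates a tiny crawl by invoking the callbacks by hand.
    static List<String> demo() {
        CallOrderRecorder recorder = new CallOrderRecorder();
        recorder.onStartCrawling();
        recorder.onAcceptURL("http://example.com/a");
        recorder.onDeclineURL("http://example.com/b");
        recorder.onFinishCrawling();
        return recorder.events();
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

A recorder like this can be handy while developing a plugin, since the order in which a given crawl triggers the callbacks depends on the URLs and documents it encounters.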
| Method Summary | |
|---|---|
| boolean | checkDynamicBlacklist(String url, String sourceUrl, String sourceLinkText) Allows blacklisting of specific URLs. |
| void | onAcceptURL(String url, CrawlerJob job) Called during the crawling process when a new URL is added to the processing queue. |
| void | onAfterPrepare(RawDocument document, WriteablePreparator preparator) Called after a document has been prepared to be added to the index. |
| void | onBeforePrepare(RawDocument document, WriteablePreparator preparator) Called before a document is prepared to be added to the index. |
| void | onCreateIndexEntry(org.apache.lucene.document.Document doc, org.apache.lucene.index.IndexWriter index) Called when a document is added to the index. |
| void | onDeclineURL(String url) Called during the crawling process when a new URL is declined and not added to the processing queue. |
| void | onDeleteIndexEntry(org.apache.lucene.document.Document doc, org.apache.lucene.index.IndexReader index) Called when a document is deleted from the index. |
| void | onFinishCrawling(Crawler crawler) Called after the crawling process has finished or was aborted (because of an exception). |
| void | onStartCrawling(Crawler crawler) Called before the crawling process starts (Crawler::run()). |
| Methods inherited from interface net.sf.regain.crawler.document.Pluggable |
|---|
| init |
| Method Detail |
|---|
void onStartCrawling(Crawler crawler)
crawler - The crawler instance that is about to begin crawling
void onFinishCrawling(Crawler crawler)
crawler - The crawler instance that is about to finish crawling
boolean checkDynamicBlacklist(String url, String sourceUrl, String sourceLinkText)
url - URL of the crawling job that would normally be added.
sourceUrl - The URL of the document in which the url above was found (HTML a tag, PDF, or similar).
sourceLinkText - The label of the link in the document in which the url above was found.
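As an illustration of how the three parameters might be used, the sketch below implements one possible blacklist rule as a plain static method. Two things are assumptions, not facts from this page: that a return value of true means "blacklist this URL", and the rule itself (reject URLs carrying a `sessionid=` query parameter and links labelled "Logout"). The class name `ExampleBlacklistRule` is hypothetical.

```java
import java.util.Locale;

// Hypothetical helper showing one way to decide on a dynamic blacklist entry.
class ExampleBlacklistRule {

    // Assumption: true means "blacklist this URL", false means "allow it".
    static boolean checkDynamicBlacklist(String url, String sourceUrl, String sourceLinkText) {
        // Rule 1 (illustrative): skip URLs that carry a session ID.
        String lowerUrl = url.toLowerCase(Locale.ROOT);
        if (lowerUrl.contains("sessionid=")) {
            return true;
        }
        // Rule 2 (illustrative): skip links labelled "Logout".
        // The link text is treated as optional, so guard against null.
        if (sourceLinkText != null && sourceLinkText.trim().equalsIgnoreCase("logout")) {
            return true;
        }
        // sourceUrl is unused by these rules; it is kept to match the signature.
        return false;
    }

    public static void main(String[] args) {
        System.out.println(checkDynamicBlacklist(
                "http://example.com/page?sessionid=42", "http://example.com/", "Page"));
        System.out.println(checkDynamicBlacklist(
                "http://example.com/doc.pdf", "http://example.com/", "Manual"));
    }
}
```

Keeping the rule in a static method makes it easy to test in isolation before wiring it into a plugin class.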
void onAcceptURL(String url, CrawlerJob job)
url - URL that was just accepted
job - CrawlerJob that was created as a consequence
void onDeclineURL(String url)
url - URL that was just declined
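One simple use of the accept and decline callbacks is to gather crawl statistics. The following sketch counts both kinds of events. `UrlDecisionCounter` is a hypothetical class, and its `onAcceptURL` omits the `CrawlerJob` parameter so that the example compiles without Regain. `AtomicInteger` is used only as a precaution; this page does not say whether callbacks can arrive from more than one thread.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical counter for accepted and declined URLs.
class UrlDecisionCounter {
    private final AtomicInteger accepted = new AtomicInteger();
    private final AtomicInteger declined = new AtomicInteger();

    // Mirrors onAcceptURL(String url, CrawlerJob job) without the job argument.
    void onAcceptURL(String url) { accepted.incrementAndGet(); }

    // Mirrors onDeclineURL(String url).
    void onDeclineURL(String url) { declined.incrementAndGet(); }

    int accepted() { return accepted.get(); }
    int declined() { return declined.get(); }

    // Human-readable summary, e.g. for logging when crawling finishes.
    String summary() {
        return accepted.get() + " accepted, " + declined.get() + " declined";
    }

    public static void main(String[] args) {
        UrlDecisionCounter counter = new UrlDecisionCounter();
        counter.onAcceptURL("http://example.com/a");
        counter.onDeclineURL("http://example.com/b");
        System.out.println(counter.summary());
    }
}
```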
void onCreateIndexEntry(org.apache.lucene.document.Document doc, org.apache.lucene.index.IndexWriter index)
doc - Document to write
index - Lucene Index Writer
void onDeleteIndexEntry(org.apache.lucene.document.Document doc, org.apache.lucene.index.IndexReader index)
doc - Document to read
index - Lucene Index Reader
void onBeforePrepare(RawDocument document, WriteablePreparator preparator)
document - Regain document that will be analysed
preparator - Preparator that was chosen to analyse this document
void onAfterPrepare(RawDocument document, WriteablePreparator preparator)
document - Regain document that was analysed
preparator - Preparator that has analysed this document