Regain 2.1.0-STABLE API

net.sf.regain.crawler
Class Crawler

java.lang.Object
  net.sf.regain.crawler.Crawler
public class Crawler
Crawls all configured start pages for URLs. Depending on the configuration, the pages found are only loaded, added to the search index, or in turn searched for further URLs.

For each URL the black list and the white list decide whether it is ignored or processed. If loadUnparsedUrls is set to false, URLs that are neither parsed nor indexed are ignored as well.
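The sketch below is a minimal, hypothetical usage example based only on the constructor and the run method documented on this page. The loadConfigSomehow helper is a placeholder, and the package names assumed for CrawlerConfig and RegainException may differ.

```java
import java.util.Properties;

import net.sf.regain.RegainException;                 // assumed package
import net.sf.regain.crawler.Crawler;
import net.sf.regain.crawler.config.CrawlerConfig;    // assumed package

public class CrawlerUsageSketch {

  public static void main(String[] args) throws RegainException {
    // Placeholder: how the configuration is loaded is not covered on this page.
    CrawlerConfig config = loadConfigSomehow();

    // Authentication properties for protected hosts
    // (see readAuthenticationProperties and accountPasswordStore below).
    Properties authProps = new Properties();

    Crawler crawler = new Crawler(config, authProps);

    // updateIndex:     update an already existing index
    // retryFailedDocs: retry documents that couldn't be prepared last time
    // onlyEntriesArr:  names of white list entries to update; null = all
    crawler.run(true, false, null);
  }

  private static CrawlerConfig loadConfigSomehow() {
    throw new UnsupportedOperationException("configuration loading not shown");
  }
}
```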
Field Summary
Modifier and Type | Field | Description
---|---|---
(package private) Map<String,AccountPasswordEntry> | accountPasswordStore | Username/password map for the configured host names.
private CrawlerConfig | mConfiguration | The configuration with the preferences.
private Profiler | mCrawlerJobProfiler | The profiler that measures the whole crawler jobs.
private CrawlerJob | mCurrentJob | The current crawler job.
private List<Object[]> | mDeadlinkList | Contains all found dead links.
private int | mErrorCount | The number of errors that occurred.
private int | mFatalErrorCount | The number of fatal errors that occurred.
private org.apache.regexp.RE[] | mHtmlParserPatternReArr | The regular expressions that belong to the respective UrlPattern for the HTML parser.
private UrlPattern[] | mHtmlParserUrlPatternArr | The UrlPattern the HTML parser should use to identify URLs.
private Profiler | mHtmlParsingProfiler | The profiler that measures the HTML parser.
private IndexWriterManager | mIndexWriterManager | The IndexWriterManager to use for adding documents to the index.
private LinkedList<CrawlerJob> | mJobList | The list of jobs that still have to be processed.
private static org.apache.log4j.Logger | mLog | The logger for this class.
private boolean | mShouldPause | Specifies whether the crawler should pause as soon as possible.
private UrlChecker | mUrlChecker | The URL checker.
private CrawlerPluginManager | pluginManager | The plugin manager.
Constructor Summary
Constructor | Description
---|---
Crawler(CrawlerConfig config, Properties authProps) | Creates a new instance of Crawler.
Method Summary
Modifier and Type | Method | Description
---|---|---
private void | addJob(String url, String sourceUrl, boolean shouldBeParsed, boolean shouldBeIndexed, String sourceLinkText) | Analyzes the URL and decides whether it should be processed or not.
private void | addStartUrls() | Adds all start URLs to the job list.
private void | createCrawlerJobs(RawDocument rawDocument) | Creates crawler jobs from enclosed links.
private File | createTempDir() | 
int | getAddedDocCount() | Gets the number of documents that were added to the index.
long | getCurrentJobTime() | Gets the time the crawler has already been working on the current job.
String | getCurrentJobUrl() | Gets the URL of the current job.
int | getErrorCount() | Returns the number of errors (this includes fatal and non-fatal errors).
int | getFatalErrorCount() | Returns the number of fatal errors.
int | getFinishedJobCount() | Gets the number of processed documents.
int | getInitialDocCount() | Gets the number of documents that were in the (old) index when the IndexWriterManager was created.
int | getRemovedDocCount() | Gets the number of documents that will be removed from the index.
boolean | getShouldPause() | Gets whether the crawler is currently pausing or will pause soon.
private void | handleDocumentLoadingException(RegainException exc, CrawlerJob job) | Handles an exception caused by a failed document loading.
private boolean | isExceptionFromDeadLink(Throwable thr) | Checks whether the exception was caused by a dead link.
void | logError(String msg, Throwable thr, boolean fatal) | Logs an error.
private void | parseDirectory(File dir) | Searches a directory for URLs, i.e. files and sub-directories.
private void | parseIMAPFolder(String folderUrl) | Searches an IMAP directory for folders and counts the contained messages; the method creates a new job for every non-empty folder.
private void | parseSmbDirectory(jcifs.smb.SmbFile dir) | Searches a Samba directory for URLs, i.e. files and sub-directories.
private void | readAuthenticationProperties(Properties authProps) | Reads the authentication properties of all entries.
void | run(boolean updateIndex, boolean retryFailedDocs, String[] onlyEntriesArr) | Executes the crawler process and prints out statistics, the dead-link list and the error list at the end.
void | setShouldPause(boolean shouldPause) | Sets whether the crawler should pause.
private WhiteListEntry[] | useOnlyWhiteListEntries(WhiteListEntry[] whiteList, String[] onlyEntriesArr, boolean updateIndex) | Sets the "should be updated" flag for each entry in the white list.
private void | writeCrawledURLsList() | Writes the URLs of all crawl jobs into a file.
private void | writeDeadlinkAndErrorList() | Writes the dead-link and error list to the log file and additionally to a separate file.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail
private static org.apache.log4j.Logger mLog
private CrawlerConfig mConfiguration
private UrlChecker mUrlChecker
private LinkedList<CrawlerJob> mJobList
private int mErrorCount
private int mFatalErrorCount
Fatal errors are errors that prevented the index from being created or updated.
private CrawlerJob mCurrentJob
private List<Object[]> mDeadlinkList
Contains Object[]s with two elements: The first is the URL that couldn't be found (a String), the second the URL of the document where the dead link was found (a String).
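For illustration only, a small sketch (not from the regain sources) that reads entries shaped like the description above, two-element Object[] values holding the dead URL and the source URL:

```java
import java.util.ArrayList;
import java.util.List;

public class DeadlinkListSketch {
  public static void main(String[] args) {
    // Entries shaped like the mDeadlinkList description above:
    // element 0 = the URL that couldn't be found,
    // element 1 = the URL of the document where the dead link was found.
    List<Object[]> deadlinkList = new ArrayList<Object[]>();
    deadlinkList.add(new Object[] {
        "http://example.com/missing.html", "http://example.com/index.html" });

    for (Object[] entry : deadlinkList) {
      String deadUrl = (String) entry[0];
      String sourceUrl = (String) entry[1];
      System.out.println("Dead link " + deadUrl + " found in " + sourceUrl);
    }
  }
}
```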
private UrlPattern[] mHtmlParserUrlPatternArr
private org.apache.regexp.RE[] mHtmlParserPatternReArr
See also: mHtmlParserUrlPatternArr
private Profiler mCrawlerJobProfiler
private Profiler mHtmlParsingProfiler
private IndexWriterManager mIndexWriterManager
private boolean mShouldPause
Map<String,AccountPasswordEntry> accountPasswordStore
private CrawlerPluginManager pluginManager
Constructor Detail

public Crawler(CrawlerConfig config, Properties authProps) throws RegainException

Parameters:
  config - The configuration.
Throws:
  RegainException - If the regular expressions have errors.

Method Detail
public int getFinishedJobCount()
public int getInitialDocCount()
public int getAddedDocCount()
public int getRemovedDocCount()
public String getCurrentJobUrl()
public long getCurrentJobTime()
public void setShouldPause(boolean shouldPause)

Parameters:
  shouldPause - Whether the crawler should pause.

public boolean getShouldPause()
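A hypothetical sketch of how a second thread might use setShouldPause and getShouldPause while another thread executes run; the helper method and its surrounding threading are assumptions for illustration:

```java
import net.sf.regain.crawler.Crawler;

public class CrawlerPauseSketch {

  // Hypothetical helper: 'crawler' is assumed to be running in another
  // thread that called crawler.run(...).
  static void pauseForMaintenance(Crawler crawler) throws InterruptedException {
    crawler.setShouldPause(true);    // ask the crawler to pause as soon as possible

    // ... do maintenance work while crawler.getShouldPause() reports pausing ...
    Thread.sleep(10_000);

    crawler.setShouldPause(false);   // let the crawler continue
  }
}
```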
private void addJob(String url, String sourceUrl, boolean shouldBeParsed, boolean shouldBeIndexed, String sourceLinkText)

Analyzes the URL and decides whether it should be processed or not. If so, a new job is created and added to the job list.

Parameters:
  url - The URL of the job to check.
  sourceUrl - The URL of the document in which the URL of the job to check was found.
  shouldBeParsed - Specifies whether the URL should be parsed.
  shouldBeIndexed - Specifies whether the URL should be indexed.
  sourceLinkText - The text of the link in which the URL was found. Is null if the URL was not found in a link (i.e. an a tag) or if there is no link text for other reasons.

public void run(boolean updateIndex, boolean retryFailedDocs, String[] onlyEntriesArr)
Executes the crawler process and prints out statistics, the dead-link list and the error list at the end.

Parameters:
  updateIndex - Specifies whether an already existing index should be updated.
  retryFailedDocs - Specifies whether a document that couldn't be prepared the last time should be retried.
  onlyEntriesArr - The names of the white list entries that should be updated. If null or empty, all entries will be updated.

private File createTempDir()
private void handleDocumentLoadingException(RegainException exc, CrawlerJob job)
Parameters:
  exc - The exception to check.
  job - The job of the document.

private void addStartUrls()
private void readAuthenticationProperties(Properties authProps)
private WhiteListEntry[] useOnlyWhiteListEntries(WhiteListEntry[] whiteList, String[] onlyEntriesArr, boolean updateIndex)
Parameters:
  whiteList - The white list to process.
  onlyEntriesArr - The names of the white list entries that should be updated. If null or empty, all entries will be updated.
  updateIndex - Specifies whether an already existing index will be updated in this crawler run.
private void writeCrawledURLsList()
private void writeDeadlinkAndErrorList()
private boolean isExceptionFromDeadLink(Throwable thr)
Parameters:
  thr - The exception to check.
private void parseDirectory(File dir) throws RegainException
Parameters:
  dir - The directory to parse.
Throws:
  RegainException - If encoding of the found URLs failed.

private void parseSmbDirectory(jcifs.smb.SmbFile dir) throws RegainException

Parameters:
  dir - The directory to parse.
Throws:
  RegainException - If encoding of the found URLs failed.

private void parseIMAPFolder(String folderUrl) throws RegainException

Parameters:
  folderUrl - The folder to parse.
Throws:
  RegainException - If encoding of the found URLs failed.

private void createCrawlerJobs(RawDocument rawDocument) throws RegainException

Parameters:
  rawDocument - A document with or without links.
Throws:
  RegainException - If an exception occurs during job creation.

public int getErrorCount()
Returns the number of errors (this includes fatal and non-fatal errors).
See also: getFatalErrorCount()

public int getFatalErrorCount()

Returns the number of fatal errors. Fatal errors are errors that prevented the index from being created or updated.
See also: getErrorCount()
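A hypothetical monitoring sketch that reads the statistics getters documented above while the crawler runs in another thread; the helper method is illustrative only:

```java
import net.sf.regain.crawler.Crawler;

public class CrawlerProgressSketch {

  // Hypothetical helper: prints progress counters while another thread
  // executes crawler.run(...). All getters used here are documented above.
  static void printProgress(Crawler crawler) {
    System.out.println("Current job:   " + crawler.getCurrentJobUrl()
        + " (running for " + crawler.getCurrentJobTime() + ")");
    System.out.println("Finished jobs: " + crawler.getFinishedJobCount());
    System.out.println("Added docs:    " + crawler.getAddedDocCount());
    System.out.println("Errors:        " + crawler.getErrorCount()
        + " (fatal: " + crawler.getFatalErrorCount() + ")");
  }
}
```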
public void logError(String msg, Throwable thr, boolean fatal)

Specified by:
  logError in interface ErrorLogger

Parameters:
  msg - The error message.
  thr - The error. May be null.
  fatal - Specifies whether the error was fatal. An error is fatal if it prevented the index from being created.
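Since logError implements the ErrorLogger interface method noted above, here is a hypothetical sketch of reporting a non-fatal and a fatal error through it; the messages and the exception are made up for illustration:

```java
import net.sf.regain.crawler.Crawler;

public class LogErrorSketch {

  // Hypothetical: report errors through the Crawler's ErrorLogger method.
  static void reportErrors(Crawler crawler) {
    // Non-fatal: a single document failed, the index can still be built.
    crawler.logError("Could not load http://example.com/missing.html",
        new java.io.IOException("404 Not Found"), false);

    // Fatal: the index itself could not be created or updated.
    crawler.logError("Could not open the index directory", null, true);
  }
}
```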