Regain 2.1.0-STABLE API

net.sf.regain.crawler
Class Crawler

java.lang.Object
  extended by net.sf.regain.crawler.Crawler
All Implemented Interfaces:
ErrorLogger

public class Crawler
extends Object
implements ErrorLogger

Crawls all configured start pages for URLs. Depending on the configuration, the pages found are only loaded, added to the search index, or searched for further URLs in turn.

For each URL, the black list and the white list decide whether it is ignored or processed. If loadUnparsedUrls is set to false, URLs that would neither be parsed nor indexed are ignored as well.
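
As a rough illustration, that decision can be sketched as follows. This is a self-contained sketch, not regain code: the class, its regex-based matching, and the explicit loadUnparsedUrls parameter are illustrative only (regain's actual list handling lives in the UrlChecker):

    import java.util.List;
    import java.util.regex.Pattern;

    class UrlDecisionSketch {
        /** Decides whether a URL should be processed, per the rules above. */
        static boolean shouldProcess(String url,
                                     List<Pattern> whiteList,
                                     List<Pattern> blackList,
                                     boolean loadUnparsedUrls,
                                     boolean shouldBeParsed,
                                     boolean shouldBeIndexed) {
            // A URL is ignored if it matches the black list ...
            for (Pattern p : blackList) {
                if (p.matcher(url).find()) return false;
            }
            // ... or matches no white list entry.
            boolean onWhiteList = false;
            for (Pattern p : whiteList) {
                if (p.matcher(url).find()) { onWhiteList = true; break; }
            }
            if (!onWhiteList) return false;
            // With loadUnparsedUrls == false, URLs that would neither be
            // parsed nor indexed are ignored as well.
            if (!loadUnparsedUrls && !shouldBeParsed && !shouldBeIndexed) return false;
            return true;
        }
    }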

Author:
Til Schneider, www.murfman.de

Field Summary
(package private)  Map<String,AccountPasswordEntry> accountPasswordStore
          Username/password map for the configured hostnames.
private  CrawlerConfig mConfiguration
          The configuration with the preferences.
private  Profiler mCrawlerJobProfiler
          The profiler that measures the whole crawler jobs.
private  CrawlerJob mCurrentJob
          The current crawler job.
private  List<Object[]> mDeadlinkList
          Contains all found dead links.
private  int mErrorCount
          The number of errors that have occurred.
private  int mFatalErrorCount
          The number of fatal errors that have occurred.
private  org.apache.regexp.RE[] mHtmlParserPatternReArr
          The regular expressions that belong to the respective UrlPattern for the HTML-Parser.
private  UrlPattern[] mHtmlParserUrlPatternArr
          The UrlPattern the HTML-Parser should use to identify URLs.
private  Profiler mHtmlParsingProfiler
          The profiler that measures the HTML-Parser.
private  IndexWriterManager mIndexWriterManager
          The IndexWriterManager to use for adding documents to the index.
private  LinkedList<CrawlerJob> mJobList
          The list of jobs that still have to be processed.
private static org.apache.log4j.Logger mLog
          The logger for this class.
private  boolean mShouldPause
          Specifies whether the crawler should pause as soon as possible.
private  UrlChecker mUrlChecker
          The URL checker.
private  CrawlerPluginManager pluginManager
          Plugin Manager
 
Constructor Summary
Crawler(CrawlerConfig config, Properties authProps)
          Creates a new instance of Crawler.
 
Method Summary
private  void addJob(String url, String sourceUrl, boolean shouldBeParsed, boolean shouldBeIndexed, String sourceLinkText)
          Analyzes the URL and decides whether it should be processed or not.
private  void addStartUrls()
          Adds all start URLs to the job list.
private  void createCrawlerJobs(RawDocument rawDocument)
          Creates crawler jobs from the links contained in a document.
private  File createTempDir()
          Creates a temporary directory.
 int getAddedDocCount()
          Gets the number of documents that were added to the index.
 long getCurrentJobTime()
          Get the time the crawler is already working on the current job.
 String getCurrentJobUrl()
          Gets the URL of the current job.
 int getErrorCount()
          Returns the number of errors (this includes fatal and non-fatal errors).
 int getFatalErrorCount()
          Returns the number of fatal errors.
 int getFinishedJobCount()
          Gets the number of processed documents.
 int getInitialDocCount()
          Gets the number of documents that were in the (old) index when the IndexWriterManager was created.
 int getRemovedDocCount()
          Gets the number of documents that will be removed from the index.
 boolean getShouldPause()
          Gets whether the crawler is currently pausing or is pausing soon.
private  void handleDocumentLoadingException(RegainException exc, CrawlerJob job)
          Handles an exception caused by a failed document load.
private  boolean isExceptionFromDeadLink(Throwable thr)
          Checks whether the exception was caused by a dead link.
 void logError(String msg, Throwable thr, boolean fatal)
          Logs an error.
private  void parseDirectory(File dir)
          Searches a directory for URLs, that is, files and sub-directories.
private  void parseIMAPFolder(String folderUrl)
          Searches an IMAP directory for folders and counts the messages they contain. The method creates a new job for every non-empty folder.
private  void parseSmbDirectory(jcifs.smb.SmbFile dir)
          Searches a Samba directory for URLs, that is, files and sub-directories.
private  void readAuthenticationProperties(Properties authProps)
          Reads the authentication properties of all entries.
 void run(boolean updateIndex, boolean retryFailedDocs, String[] onlyEntriesArr)
          Executes the crawler process and prints out statistics, the dead link list, and the error list at the end.
 void setShouldPause(boolean shouldPause)
          Sets whether the crawler should pause.
private  WhiteListEntry[] useOnlyWhiteListEntries(WhiteListEntry[] whiteList, String[] onlyEntriesArr, boolean updateIndex)
          Sets the "should be updated"-flag for each entry in the white list.
private  void writeCrawledURLsList()
          Writes the URLs of all crawl jobs into a file.
private  void writeDeadlinkAndErrorList()
          Writes the dead link and error list to the log file and additionally to a separate file.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

mLog

private static org.apache.log4j.Logger mLog
The logger for this class.


mConfiguration

private CrawlerConfig mConfiguration
The configuration with the preferences.


mUrlChecker

private UrlChecker mUrlChecker
The URL checker.


mJobList

private LinkedList<CrawlerJob> mJobList
The list of jobs that still have to be processed.


mErrorCount

private int mErrorCount
The number of errors that have occurred.


mFatalErrorCount

private int mFatalErrorCount
The number of fatal errors that have occurred.

Fatal errors are errors that prevented the index from being created or updated.


mCurrentJob

private CrawlerJob mCurrentJob
The current crawler job. May be null.


mDeadlinkList

private List<Object[]> mDeadlinkList
Contains all found dead links.

Contains Object[]s with two elements: The first is the URL that couldn't be found (a String), the second the URL of the document where the dead link was found (a String).
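
For illustration, an entry might be created and read like this (a snippet, not code from the class; the URLs are hypothetical):

    // Sketch: the layout of a dead link entry, per the description above.
    java.util.List<Object[]> deadlinkList = new java.util.LinkedList<Object[]>();
    // element 0: the URL that couldn't be found,
    // element 1: the URL of the document where the dead link was found
    deadlinkList.add(new Object[] {
        "http://www.example.com/missing.html",   // hypothetical dead URL
        "http://www.example.com/index.html" });  // hypothetical source document
    String deadUrl   = (String) deadlinkList.get(0)[0];
    String sourceUrl = (String) deadlinkList.get(0)[1];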


mHtmlParserUrlPatternArr

private UrlPattern[] mHtmlParserUrlPatternArr
The UrlPattern the HTML-Parser should use to identify URLs.


mHtmlParserPatternReArr

private org.apache.regexp.RE[] mHtmlParserPatternReArr
The regular expressions that belong to the respective UrlPattern for the HTML-Parser.

See Also:
mHtmlParserUrlPatternArr

mCrawlerJobProfiler

private Profiler mCrawlerJobProfiler
The profiler that measures the whole crawler jobs.


mHtmlParsingProfiler

private Profiler mHtmlParsingProfiler
The profiler that measures the HTML-Parser.


mIndexWriterManager

private IndexWriterManager mIndexWriterManager
The IndexWriterManager to use for adding documents to the index.


mShouldPause

private boolean mShouldPause
Specifies whether the crawler should pause as soon as possible.


accountPasswordStore

Map<String,AccountPasswordEntry> accountPasswordStore
Username/password map for the configured hostnames.


pluginManager

private CrawlerPluginManager pluginManager
Plugin Manager

Constructor Detail

Crawler

public Crawler(CrawlerConfig config,
               Properties authProps)
        throws RegainException
Creates a new instance of Crawler.

Parameters:
config - The configuration with the preferences.
authProps - The authentication properties.
Throws:
RegainException - If the regular expressions have errors.
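
A typical instantiation might look as follows. This is a sketch: it assumes an XML-based CrawlerConfig implementation named XmlCrawlerConfig with a File constructor, and the file names are illustrative; only the Crawler(CrawlerConfig, Properties) signature is taken from the documentation above.

    import java.io.File;
    import java.io.FileInputStream;
    import java.util.Properties;
    import net.sf.regain.crawler.Crawler;
    import net.sf.regain.crawler.config.CrawlerConfig;
    import net.sf.regain.crawler.config.XmlCrawlerConfig;

    public class CrawlerSetup {
        public static void main(String[] args) throws Exception {
            CrawlerConfig config =
                new XmlCrawlerConfig(new File("CrawlerConfiguration.xml")); // assumed constructor
            Properties authProps = new Properties();
            authProps.load(new FileInputStream("authentication.properties")); // assumed file name
            Crawler crawler = new Crawler(config, authProps);
        }
    }
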
Method Detail

getFinishedJobCount

public int getFinishedJobCount()
Gets the number of processed documents.

Returns:
The number of processed documents.

getInitialDocCount

public int getInitialDocCount()
Gets the number of documents that were in the (old) index when the IndexWriterManager was created.

Returns:
The initial number of documents in the index.

getAddedDocCount

public int getAddedDocCount()
Gets the number of documents that were added to the index.

Returns:
The number of documents added to the index.

getRemovedDocCount

public int getRemovedDocCount()
Gets the number of documents that will be removed from the index.

Returns:
The number of documents removed from the index.

getCurrentJobUrl

public String getCurrentJobUrl()
Gets the URL of the current job. Returns null if the crawler currently has no job.

Returns:
The URL of the current job.

getCurrentJobTime

public long getCurrentJobTime()
Gets the time the crawler has already been working on the current job.

Returns:
The current working time in milliseconds. Returns -1 if the crawler currently has no job.

setShouldPause

public void setShouldPause(boolean shouldPause)
Sets whether the crawler should pause.

Parameters:
shouldPause - Whether the crawler should pause.

getShouldPause

public boolean getShouldPause()
Gets whether the crawler is currently pausing or is pausing soon.

Returns:
Whether the crawler is currently pausing.
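
Since run(boolean, boolean, String[]) blocks the calling thread, pausing is typically requested from a second thread. A minimal sketch, assuming crawler is executing run() elsewhere:

    // Sketch: controlling a running crawler from a second thread.
    crawler.setShouldPause(true);               // request a pause as soon as possible
    boolean pausing = crawler.getShouldPause(); // true: pausing now or soon
    // ... do maintenance work while the crawler rests ...
    crawler.setShouldPause(false);              // let the crawler continue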

addJob

private void addJob(String url,
                    String sourceUrl,
                    boolean shouldBeParsed,
                    boolean shouldBeIndexed,
                    String sourceLinkText)
Analyzes the URL and decides whether it should be processed or not.

If so, a new job is created and added to the job list.

Parameters:
url - The URL of the job to check.
sourceUrl - The URL of the document in which the URL of the job to check was found.
shouldBeParsed - Specifies whether the URL should be parsed.
shouldBeIndexed - Specifies whether the URL should be indexed.
sourceLinkText - The text of the link in which the URL was found. Is null if the URL was not found within a link (i.e. an a tag) or if no link text is available for other reasons.
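
The flow this describes might be sketched as follows; the UrlChecker method and the CrawlerJob constructor shown here are assumptions, not the real regain API:

    // Sketch of the flow described above (hypothetical API names).
    private void addJobSketch(String url, String sourceUrl,
                              boolean shouldBeParsed, boolean shouldBeIndexed,
                              String sourceLinkText) {
        if (!mUrlChecker.isUrlAccepted(url)) { // hypothetical white/black list check
            return;                            // the URL is ignored
        }
        CrawlerJob job = new CrawlerJob(url, sourceUrl, sourceLinkText,
                                        shouldBeParsed, shouldBeIndexed); // assumed signature
        mJobList.add(job);                     // enqueue the job for processing
    }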

run

public void run(boolean updateIndex,
                boolean retryFailedDocs,
                String[] onlyEntriesArr)
Executes the crawler process and prints out statistics, the dead link list, and the error list at the end.

Parameters:
updateIndex - Specifies whether an already existing index should be updated.
retryFailedDocs - Specifies whether a document that couldn't be prepared the last time should be retried.
onlyEntriesArr - The names of the white list entries that should be updated. If null or empty, all entries will be updated.
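
A usage sketch, continuing the instantiation example above (the entry name "intranet" is hypothetical; pass null to update all white list entries):

    // Sketch: a typical crawler run with the documented parameters.
    crawler.run(
        true,                        // updateIndex: update the existing index
        false,                       // retryFailedDocs: don't retry failed documents
        new String[] { "intranet" }  // only update this white list entry
    );
    System.out.println("Errors: " + crawler.getErrorCount()
        + " (fatal: " + crawler.getFatalErrorCount() + ")");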

createTempDir

private  File createTempDir()
Creates a temporary directory.

handleDocumentLoadingException

private void handleDocumentLoadingException(RegainException exc,
                                            CrawlerJob job)
Handles an exception caused by a failed document load. Checks whether the exception was caused by a dead link and adds it to the dead link list if necessary.

Parameters:
exc - The exception to check.
job - The job of the document.

addStartUrls

private void addStartUrls()
Adds all start URLs to the job list.


readAuthenticationProperties

private void readAuthenticationProperties(Properties authProps)
Reads the authentication properties of all entries.


useOnlyWhiteListEntries

private WhiteListEntry[] useOnlyWhiteListEntries(WhiteListEntry[] whiteList,
                                                 String[] onlyEntriesArr,
                                                 boolean updateIndex)
Sets the "should be updated"-flag for each entry in the white list.

Parameters:
whiteList - The white list to process.
onlyEntriesArr - The names of the white list entries that should be updated. If null or empty, all entries will be updated.
updateIndex - Specifies whether an already existing index will be updated in this crawler run.
Returns:
The processed white list.
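
A sketch of the described flag handling, assuming hypothetical getName() and setShouldBeUpdated() accessors on WhiteListEntry:

    // Sketch: mark each white list entry that should be updated.
    private static WhiteListEntry[] sketchUseOnly(WhiteListEntry[] whiteList,
                                                  String[] onlyEntriesArr) {
        boolean updateAll = (onlyEntriesArr == null) || (onlyEntriesArr.length == 0);
        for (WhiteListEntry entry : whiteList) {
            boolean update = updateAll;
            if (!update) {
                for (String name : onlyEntriesArr) {
                    if (name.equals(entry.getName())) { update = true; break; }
                }
            }
            entry.setShouldBeUpdated(update); // hypothetical setter
        }
        return whiteList;
    }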

writeCrawledURLsList

private void writeCrawledURLsList()
Writes the URLs of all crawl jobs into a file.


writeDeadlinkAndErrorList

private void writeDeadlinkAndErrorList()
Writes the dead link and error list to the log file and additionally to a separate file. These files are located in a sub-directory named 'log'. If indexing is enabled, this sub-directory is located in the index directory; if indexing is disabled, it is located in the current directory.


isExceptionFromDeadLink

private boolean isExceptionFromDeadLink(Throwable thr)
Checks whether the exception was caused by a dead link.

Parameters:
thr - The exception to check.
Returns:
Whether the exception was caused by a dead link.
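
One common way to implement such a check is to walk the Throwable cause chain and look for typical "not found" failures; the checks regain actually performs may differ, so this is an illustration only:

    // Sketch: detect a dead link by inspecting the cause chain.
    private static boolean looksLikeDeadLink(Throwable thr) {
        for (Throwable t = thr; t != null; t = t.getCause()) {
            if (t instanceof java.io.FileNotFoundException
                    || t instanceof java.net.UnknownHostException) {
                return true; // resource or host does not exist
            }
        }
        return false;
    }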

parseDirectory

private void parseDirectory(File dir)
                     throws RegainException
Searches a directory for URLs, that is, files and sub-directories. The method creates a new job for every match.

Parameters:
dir - the directory to parse
Throws:
RegainException - If encoding of the found URLs failed.
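
A sketch of the described directory walk, assuming files become index jobs and sub-directories become parse jobs (the flag assignment is an assumption, and the URL encoding step is simplified to File.toURI()):

    // Sketch: create a crawler job for every file and sub-directory.
    private void parseDirectorySketch(java.io.File dir) {
        java.io.File[] children = dir.listFiles();
        if (children == null) return; // not a directory or I/O error
        String sourceUrl = dir.toURI().toString();
        for (java.io.File child : children) {
            String url = child.toURI().toString(); // "file:/..." URL
            if (child.isDirectory()) {
                addJob(url, sourceUrl, true, false, null);  // parse the sub-directory
            } else {
                addJob(url, sourceUrl, false, true, null);  // index the file
            }
        }
    }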

parseSmbDirectory

private void parseSmbDirectory(jcifs.smb.SmbFile dir)
                        throws RegainException
Searches a Samba directory for URLs, that is, files and sub-directories. The method creates a new job for every match.

Parameters:
dir - the directory to parse
Throws:
RegainException - If encoding of the found URLs failed.

parseIMAPFolder

private void parseIMAPFolder(String folderUrl)
                      throws RegainException
Searches an IMAP directory for folders and counts the messages they contain. The method creates a new job for every non-empty folder.

Parameters:
folderUrl - the folder to parse
Throws:
RegainException - If encoding of the found URLs failed.

createCrawlerJobs

private void createCrawlerJobs(RawDocument rawDocument)
                        throws RegainException
Creates crawler jobs from the links contained in the document. Every link is checked against the white and black list.

Parameters:
rawDocument - A document with or without links
Throws:
RegainException - if an exception occurs during job creation

getErrorCount

public int getErrorCount()
Returns the number of errors (this includes fatal and non-fatal errors).

Returns:
The number of errors.
See Also:
getFatalErrorCount()

getFatalErrorCount

public int getFatalErrorCount()
Returns the number of fatal errors.

Fatal errors are errors that prevented the index from being created or updated.

Returns:
The number of fatal errors.
See Also:
getErrorCount()

logError

public void logError(String msg,
                     Throwable thr,
                     boolean fatal)
Logs an error.

Specified by:
logError in interface ErrorLogger
Parameters:
msg - The error message.
thr - The error. May be null.
fatal - Specifies whether the error was fatal. An error is fatal if it prevented the index from being created.
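
A usage sketch with illustrative messages, calling the method through the ErrorLogger interface this class implements:

    // Sketch: reporting a non-fatal and a fatal error.
    try {
        // ... load or index a document ...
    } catch (Exception exc) {
        crawler.logError("Loading document failed", exc, false); // non-fatal
    }
    crawler.logError("Could not create the index directory", null, true); // fatal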


Regain 2.1.0-STABLE, Copyright (C) 2004-2010 Til Schneider, www.murfman.de, Thomas Tesche, www.clustersystems.info