Regain 2.1.0-STABLE API

net.sf.regain.crawler
Class Crawler

java.lang.Object
  extended by net.sf.regain.crawler.Crawler
All Implemented Interfaces:
ErrorLogger

public class Crawler
extends Object
implements ErrorLogger

Crawls all configured start pages for URLs. Depending on the configuration, the pages found are only loaded, added to the search index, or searched for further URLs in turn.

For each URL, the black list and the white list decide whether it is ignored or processed. If loadUnparsedUrls is set to false, URLs that would neither be parsed nor indexed are ignored as well.
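
As a rough illustration, that decision can be sketched as follows. This is a self-contained sketch, not regain code: the class, its regex-based matching, and the explicit loadUnparsedUrls parameter are illustrative only (regain's actual list handling lives in the UrlChecker):

    import java.util.List;
    import java.util.regex.Pattern;

    class UrlDecisionSketch {
        /** Decides whether a URL should be processed, per the rules above. */
        static boolean shouldProcess(String url,
                                     List<Pattern> whiteList,
                                     List<Pattern> blackList,
                                     boolean loadUnparsedUrls,
                                     boolean shouldBeParsed,
                                     boolean shouldBeIndexed) {
            // A URL is ignored if it matches the black list ...
            for (Pattern p : blackList) {
                if (p.matcher(url).find()) return false;
            }
            // ... or matches no white list entry.
            boolean onWhiteList = false;
            for (Pattern p : whiteList) {
                if (p.matcher(url).find()) { onWhiteList = true; break; }
            }
            if (!onWhiteList) return false;
            // With loadUnparsedUrls == false, URLs that would neither be
            // parsed nor indexed are ignored as well.
            if (!loadUnparsedUrls && !shouldBeParsed && !shouldBeIndexed) return false;
            return true;
        }
    }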

Author:
Til Schneider, www.murfman.de

Field Summary
(package private)  Map<String,AccountPasswordEntry> accountPasswordStore
          Username/password map for the configured hostnames.
private  CrawlerConfig mConfiguration
          The configuration with the preferences.
private  Profiler mCrawlerJobProfiler
          The profiler that measures the whole crawler jobs.
private  CrawlerJob mCurrentJob
          The current crawler job.
private  List<Object[]> mDeadlinkList
          Contains all found dead links.
private  int mErrorCount
          The number of errors that have occurred.
private  int mFatalErrorCount
          The number of fatal errors that have occurred.
private  org.apache.regexp.RE[] mHtmlParserPatternReArr
          The regular expressions that belong to the respective UrlPattern for the HTML-Parser.
private  UrlPattern[] mHtmlParserUrlPatternArr
          The UrlPattern the HTML-Parser should use to identify URLs.
private  Profiler mHtmlParsingProfiler
          The profiler that measures the HTML-Parser.
private  IndexWriterManager mIndexWriterManager
          The IndexWriterManager to use for adding documents to the index.
private  LinkedList<CrawlerJob> mJobList
          The list of jobs that still have to be processed.
private static org.apache.log4j.Logger mLog
          The logger for this class.
private  boolean mShouldPause
          Specifies whether the crawler should pause as soon as possible.
private  UrlChecker mUrlChecker
          The URL checker.
private  CrawlerPluginManager pluginManager
          Plugin Manager
 
Constructor Summary
Crawler(CrawlerConfig config, Properties authProps)
          Creates a new instance of Crawler.
 
Method Summary
private  void addJob(String url, String sourceUrl, boolean shouldBeParsed, boolean shouldBeIndexed, String sourceLinkText)
          Analyzes the URL and decides whether it should be processed or not.
private  void addStartUrls()
          Adds all start URLs to the job list.
private  void createCrawlerJobs(RawDocument rawDocument)
          Creates crawler jobs from the links contained in a document.
private  File createTempDir()
          Creates a temporary directory.
 int getAddedDocCount()
          Gets the number of documents that were added to the index.
 long getCurrentJobTime()
          Get the time the crawler is already working on the current job.
 String getCurrentJobUrl()
          Gets the URL of the current job.
 int getErrorCount()
          Returns the number of errors (this includes fatal and non-fatal errors).
 int getFatalErrorCount()
          Returns the number of fatal errors.
 int getFinishedJobCount()
          Gets the number of processed documents.
 int getInitialDocCount()
          Gets the number of documents that were in the (old) index when the IndexWriterManager was created.
 int getRemovedDocCount()
          Gets the number of documents that will be removed from the index.
 boolean getShouldPause()
          Gets whether the crawler is currently pausing or is pausing soon.
private  void handleDocumentLoadingException(RegainException exc, CrawlerJob job)
          Handles an exception caused by a failed document load.
private  boolean isExceptionFromDeadLink(Throwable thr)
          Checks whether the exception was caused by a dead link.
 void logError(String msg, Throwable thr, boolean fatal)
          Logs an error.
private  void parseDirectory(File dir)
          Searches a directory for URLs, that is, files and sub-directories.
private  void parseIMAPFolder(String folderUrl)
          Searches an IMAP directory for folders and counts the messages they contain. The method creates a new job for every non-empty folder.
private  void parseSmbDirectory(jcifs.smb.SmbFile dir)
          Searches a Samba directory for URLs, that is, files and sub-directories.
private  void readAuthenticationProperties(Properties authProps)
          Reads the authentication properties of all entries.
 void run(boolean updateIndex, boolean retryFailedDocs, String[] onlyEntriesArr)
          Executes the crawler process and prints out statistics, the dead link list, and the error list at the end.
 void setShouldPause(boolean shouldPause)
          Sets whether the crawler should pause.
private  WhiteListEntry[] useOnlyWhiteListEntries(WhiteListEntry[] whiteList, String[] onlyEntriesArr, boolean updateIndex)
          Sets the "should be updated"-flag for each entry in the white list.
private  void writeCrawledURLsList()
          Writes the URLs of all crawl jobs into a file.
private  void writeDeadlinkAndErrorList()
          Writes the dead link and error list to the log file and additionally to a separate file.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

mLog

private static org.apache.log4j.Logger mLog
The logger for this class.


mConfiguration

private CrawlerConfig mConfiguration
The configuration with the preferences.


mUrlChecker

private UrlChecker mUrlChecker
The URL checker.


mJobList

private LinkedList<CrawlerJob> mJobList
The list of jobs that still have to be processed.


mErrorCount

private int mErrorCount
The number of errors that have occurred.


mFatalErrorCount

private int mFatalErrorCount
The number of fatal errors that have occurred.

Fatal errors are errors that prevented the index from being created or updated.


mCurrentJob

private CrawlerJob mCurrentJob
The current crawler job. May be null.


mDeadlinkList

private List<Object[]> mDeadlinkList
Contains all found dead links.

Contains Object[]s with two elements: The first is the URL that couldn't be found (a String), the second the URL of the document where the dead link was found (a String).
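
For illustration, an entry might be created and read like this (a snippet, not code from the class; the URLs are hypothetical):

    // Sketch: the layout of a dead link entry, per the description above.
    java.util.List<Object[]> deadlinkList = new java.util.LinkedList<Object[]>();
    // element 0: the URL that couldn't be found,
    // element 1: the URL of the document where the dead link was found
    deadlinkList.add(new Object[] {
        "http://www.example.com/missing.html",   // hypothetical dead URL
        "http://www.example.com/index.html" });  // hypothetical source document
    String deadUrl   = (String) deadlinkList.get(0)[0];
    String sourceUrl = (String) deadlinkList.get(0)[1];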


mHtmlParserUrlPatternArr

private UrlPattern[] mHtmlParserUrlPatternArr
The UrlPattern the HTML-Parser should use to identify URLs.


mHtmlParserPatternReArr

private org.apache.regexp.RE[] mHtmlParserPatternReArr
The regular expressions that belong to the respective UrlPattern for the HTML-Parser.

See Also:
mHtmlParserUrlPatternArr

mCrawlerJobProfiler

private Profiler mCrawlerJobProfiler
The profiler that measures the whole crawler jobs.


mHtmlParsingProfiler

private Profiler mHtmlParsingProfiler
The profiler that measures the HTML-Parser.


mIndexWriterManager

private IndexWriterManager mIndexWriterManager
The IndexWriterManager to use for adding documents to the index.


mShouldPause

private boolean mShouldPause
Specifies whether the crawler should pause as soon as possible.


accountPasswordStore

Map<String,AccountPasswordEntry> accountPasswordStore
Username/password map for the configured hostnames.


pluginManager

private CrawlerPluginManager pluginManager
Plugin Manager

Constructor Detail

Crawler

public Crawler(CrawlerConfig config,
               Properties authProps)
        throws RegainException
Creates a new instance of Crawler.

Parameters:
config - The configuration with the preferences.
authProps - The authentication properties.
Throws:
RegainException - If the regular expressions have errors.
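
A typical instantiation might look as follows. This is a sketch: it assumes an XML-based CrawlerConfig implementation named XmlCrawlerConfig with a File constructor, and the file names are illustrative; only the Crawler(CrawlerConfig, Properties) signature is taken from the documentation above.

    import java.io.File;
    import java.io.FileInputStream;
    import java.util.Properties;
    import net.sf.regain.crawler.Crawler;
    import net.sf.regain.crawler.config.CrawlerConfig;
    import net.sf.regain.crawler.config.XmlCrawlerConfig;

    public class CrawlerSetup {
        public static void main(String[] args) throws Exception {
            CrawlerConfig config =
                new XmlCrawlerConfig(new File("CrawlerConfiguration.xml")); // assumed constructor
            Properties authProps = new Properties();
            authProps.load(new FileInputStream("authentication.properties")); // assumed file name
            Crawler crawler = new Crawler(config, authProps);
        }
    }
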
Method Detail

getFinishedJobCount

public int getFinishedJobCount()
Gets the number of processed documents.

Returns:
The number of processed documents.

getInitialDocCount

public int getInitialDocCount()
Gets the number of documents that were in the (old) index when the IndexWriterManager was created.

Returns:
The initial number of documents in the index.

getAddedDocCount

public int getAddedDocCount()
Gets the number of documents that were added to the index.

Returns:
The number of documents added to the index.

getRemovedDocCount

public int getRemovedDocCount()
Gets the number of documents that will be removed from the index.

Returns:
The number of documents removed from the index.

getCurrentJobUrl

public String getCurrentJobUrl()
Gets the URL of the current job. Returns null if the crawler currently has no job.

Returns:
The URL of the current job.

getCurrentJobTime

public long getCurrentJobTime()
Gets the time the crawler has already been working on the current job.

Returns:
The current working time in milliseconds. Returns -1 if the crawler currently has no job.

setShouldPause

public void setShouldPause(boolean shouldPause)
Sets whether the crawler should pause.

Parameters:
shouldPause - Whether the crawler should pause.

getShouldPause

public boolean getShouldPause()
Gets whether the crawler is currently pausing or is pausing soon.

Returns:
Whether the crawler is currently pausing.
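
Since run(boolean, boolean, String[]) blocks the calling thread, pausing is typically requested from a second thread. A minimal sketch, assuming crawler is executing run() elsewhere:

    // Sketch: controlling a running crawler from a second thread.
    crawler.setShouldPause(true);               // request a pause as soon as possible
    boolean pausing = crawler.getShouldPause(); // true: pausing now or soon
    // ... do maintenance work while the crawler rests ...
    crawler.setShouldPause(false);              // let the crawler continue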

addJob

private void addJob(String url,
                    String sourceUrl,
                    boolean shouldBeParsed,
                    boolean shouldBeIndexed,
                    String sourceLinkText)
Analyzes the URL and decides whether it should be processed or not.

If so, a new job is created and added to the job list.

Parameters:
url - The URL of the job to check.
sourceUrl - The URL of the document in which the URL of the job to check was found.
shouldBeParsed - Specifies whether the URL should be parsed.
shouldBeIndexed - Specifies whether the URL should be indexed.
sourceLinkText - The text of the link in which the URL was found. Is null if the URL was not found within a link (i.e. an a tag) or if no link text is available for other reasons.
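
The flow this describes might be sketched as follows; the UrlChecker method and the CrawlerJob constructor shown here are assumptions, not the real regain API:

    // Sketch of the flow described above (hypothetical API names).
    private void addJobSketch(String url, String sourceUrl,
                              boolean shouldBeParsed, boolean shouldBeIndexed,
                              String sourceLinkText) {
        if (!mUrlChecker.isUrlAccepted(url)) { // hypothetical white/black list check
            return;                            // the URL is ignored
        }
        CrawlerJob job = new CrawlerJob(url, sourceUrl, sourceLinkText,
                                        shouldBeParsed, shouldBeIndexed); // assumed signature
        mJobList.add(job);                     // enqueue the job for processing
    }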

run

public void run(boolean updateIndex,
                boolean retryFailedDocs,
                String[] onlyEntriesArr)
Executes the crawler process and prints out statistics, the dead link list, and the error list at the end.

Parameters:
updateIndex - Specifies whether an already existing index should be updated.
retryFailedDocs - Specifies whether a document that couldn't be prepared the last time should be retried.
onlyEntriesArr - The names of the white list entries that should be updated. If null or empty, all entries will be updated.
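
A usage sketch, continuing the instantiation example above (the entry name "intranet" is hypothetical; pass null to update all white list entries):

    // Sketch: a typical crawler run with the documented parameters.
    crawler.run(
        true,                        // updateIndex: update the existing index
        false,                       // retryFailedDocs: don't retry failed documents
        new String[] { "intranet" }  // only update this white list entry
    );
    System.out.println("Errors: " + crawler.getErrorCount()
        + " (fatal: " + crawler.getFatalErrorCount() + ")");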

createTempDir

private  File createTempDir()
Creates a temporary directory.

handleDocumentLoadingException

private void handleDocumentLoadingException(RegainException exc,
                                            CrawlerJob job)
Handles an exception caused by a failed document load. Checks whether the exception was caused by a dead link and adds it to the dead link list if necessary.

Parameters:
exc - The exception to check.
job - The job of the document.

addStartUrls

private void addStartUrls()
Adds all start URLs to the job list.


readAuthenticationProperties

private void readAuthenticationProperties(Properties authProps)
Reads the authentication properties of all entries.


useOnlyWhiteListEntries

private WhiteListEntry[] useOnlyWhiteListEntries(WhiteListEntry[] whiteList,
                                                 String[] onlyEntriesArr,
                                                 boolean updateIndex)
Sets the "should be updated"-flag for each entry in the white list.

Parameters:
whiteList - The white list to process.
onlyEntriesArr - The names of the white list entries that should be updated. If null or empty, all entries will be updated.
updateIndex - Specifies whether an already existing index will be updated in this crawler run.
Returns:
The processed white list.
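
A sketch of the described flag handling, assuming hypothetical getName() and setShouldBeUpdated() accessors on WhiteListEntry:

    // Sketch: mark each white list entry that should be updated.
    private static WhiteListEntry[] sketchUseOnly(WhiteListEntry[] whiteList,
                                                  String[] onlyEntriesArr) {
        boolean updateAll = (onlyEntriesArr == null) || (onlyEntriesArr.length == 0);
        for (WhiteListEntry entry : whiteList) {
            boolean update = updateAll;
            if (!update) {
                for (String name : onlyEntriesArr) {
                    if (name.equals(entry.getName())) { update = true; break; }
                }
            }
            entry.setShouldBeUpdated(update); // hypothetical setter
        }
        return whiteList;
    }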

writeCrawledURLsList

private void writeCrawledURLsList()
Writes the URLs of all crawl jobs into a file.


writeDeadlinkAndErrorList

private void writeDeadlinkAndErrorList()
Writes the dead link and error list to the log file and additionally to a separate file. These files are located in a sub-directory named 'log'. If indexing is enabled, this sub-directory is located in the index directory; if indexing is disabled, it is located in the current directory.


isExceptionFromDeadLink

private boolean isExceptionFromDeadLink(Throwable thr)
Checks whether the exception was caused by a dead link.

Parameters:
thr - The exception to check.
Returns:
Whether the exception was caused by a dead link.
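
One common way to implement such a check is to walk the Throwable cause chain and look for typical "not found" failures; the checks regain actually performs may differ, so this is an illustration only:

    // Sketch: detect a dead link by inspecting the cause chain.
    private static boolean looksLikeDeadLink(Throwable thr) {
        for (Throwable t = thr; t != null; t = t.getCause()) {
            if (t instanceof java.io.FileNotFoundException
                    || t instanceof java.net.UnknownHostException) {
                return true; // resource or host does not exist
            }
        }
        return false;
    }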

parseDirectory

private void parseDirectory(File dir)
                     throws RegainException
Searches a directory for URLs, that is, files and sub-directories. The method creates a new job for every match.

Parameters:
dir - the directory to parse
Throws:
RegainException - If encoding of the found URLs failed.
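
A sketch of the described directory walk, assuming files become index jobs and sub-directories become parse jobs (the flag assignment is an assumption, and the URL encoding step is simplified to File.toURI()):

    // Sketch: create a crawler job for every file and sub-directory.
    private void parseDirectorySketch(java.io.File dir) {
        java.io.File[] children = dir.listFiles();
        if (children == null) return; // not a directory or I/O error
        String sourceUrl = dir.toURI().toString();
        for (java.io.File child : children) {
            String url = child.toURI().toString(); // "file:/..." URL
            if (child.isDirectory()) {
                addJob(url, sourceUrl, true, false, null);  // parse the sub-directory
            } else {
                addJob(url, sourceUrl, false, true, null);  // index the file
            }
        }
    }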

parseSmbDirectory

private void parseSmbDirectory(jcifs.smb.SmbFile dir)
                        throws RegainException
Searches a Samba directory for URLs, that is, files and sub-directories. The method creates a new job for every match.

Parameters:
dir - the directory to parse
Throws:
RegainException - If encoding of the found URLs failed.

parseIMAPFolder

private void parseIMAPFolder(String folderUrl)
                      throws RegainException
Searches an IMAP directory for folders and counts the messages they contain. The method creates a new job for every non-empty folder.

Parameters:
folderUrl - the folder to parse
Throws:
RegainException - If encoding of the found URLs failed.

createCrawlerJobs

private void createCrawlerJobs(RawDocument rawDocument)
                        throws RegainException
Creates crawler jobs from the links contained in the document. Every link is checked against the white and black list.

Parameters:
rawDocument - A document with or without links
Throws:
RegainException - if an exception occurs during job creation

getErrorCount

public int getErrorCount()
Returns the number of errors (this includes fatal and non-fatal errors).

Returns:
The number of errors.
See Also:
getFatalErrorCount()

getFatalErrorCount

public int getFatalErrorCount()
Returns the number of fatal errors.

Fatal errors are errors that prevented the index from being created or updated.

Returns:
The number of fatal errors.
See Also:
getErrorCount()

logError

public void logError(String msg,
                     Throwable thr,
                     boolean fatal)
Logs an error.

Specified by:
logError in interface ErrorLogger
Parameters:
msg - The error message.
thr - The error. May be null.
fatal - Specifies whether the error was fatal. An error is fatal if it prevented the index from being created.
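
A usage sketch with illustrative messages, calling the method through the ErrorLogger interface this class implements:

    // Sketch: reporting a non-fatal and a fatal error.
    try {
        // ... load or index a document ...
    } catch (Exception exc) {
        crawler.logError("Loading document failed", exc, false); // non-fatal
    }
    crawler.logError("Could not create the index directory", null, true); // fatal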


Regain 2.1.0-STABLE, Copyright (C) 2004-2010 Til Schneider, www.murfman.de, Thomas Tesche, www.clustersystems.info