Regain 2.1.0-STABLE API

net.sf.regain.crawler
Class UrlChecker

java.lang.Object
  extended by net.sf.regain.crawler.UrlChecker

public class UrlChecker
extends Object

Decides whether a URL was already accepted or ignored.

For this decision we take advantage of a specialty of the processing of file URLs: since directories are searched by tree traversal, we can be sure that we never find the same file twice in one crawler run. The only thing we have to ensure to make this true is that the start URLs are prefix-free among each other (which is done by normalizeStartUrls(StartUrl[])).

For http URLs we have to remember all accepted or ignored URLs, because http URLs are found by page parsing, which can randomly yield any URL.

Author:
Til Schneider, www.murfman.de

Field Summary
private  HashSet<String> mAcceptedUrlSet
          Contains all http-URLs that have been accepted.
private  UrlMatcher[] mBlackListArr
          The black list.
private  int mIgnoredCount
          The number of URLs that have been ignored.
private  HashSet<String> mIgnoredUrlSet
          Contains all http-URLs that have been ignored.
private static org.apache.log4j.Logger mLog
          The logger for this class.
private  WhiteListEntry[] mWhiteListEntryArr
          The white list.
 
Constructor Summary
UrlChecker(WhiteListEntry[] whiteList, UrlMatcher[] blackList)
          Creates a new instance of UrlChecker.
 
Method Summary
 UrlMatcher[] createPreserveUrlMatcherArr()
          Creates an array of UrlMatchers that identify URLs that should not be deleted from the search index.
 int getIgnoredCount()
          Gets the number of URLs that have been ignored.
 HashSet<String> getmAcceptedUrlSet()
          Gets the set of accepted URLs.
 boolean hasNoCycles(String url, int maxCycles)
          This method tries to detect cycles in an URI.
 UrlMatcher isUrlAccepted(String url)
          Checks whether the URL is accepted by the black and the white list.
 StartUrl[] normalizeStartUrls(StartUrl[] urlArr)
          Normalizes the start URLs.
 void setAccepted(String url)
          Used by the crawler to set the accepted state for a certain URL.
 void setIgnored(String url)
          Used by the crawler to set the ignored state for a certain URL.
 boolean shouldBeKeptInIndex(String url)
          Decides whether the given URL should be kept in the search index.
 boolean wasAlreadyAccepted(String url)
          Decides whether the given URL was already accepted in a crawler run.
 boolean wasAlreadyIgnored(String url)
          Decides whether the given URL was already ignored in a crawler run.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

mLog

private static org.apache.log4j.Logger mLog
The logger for this class.


mAcceptedUrlSet

private HashSet<String> mAcceptedUrlSet
Contains all http-URLs that have been accepted.


mIgnoredUrlSet

private HashSet<String> mIgnoredUrlSet
Contains all http-URLs that have been ignored.


mIgnoredCount

private int mIgnoredCount
The number of URLs that have been ignored.


mWhiteListEntryArr

private WhiteListEntry[] mWhiteListEntryArr
The white list.

The white list is an array of WhiteListEntry objects; a URL must match one of them in order to be processed.


mBlackListArr

private UrlMatcher[] mBlackListArr
The black list.

The black list is an array of UrlMatchers; a URL must not match any of them in order to be processed.

Constructor Detail

UrlChecker

public UrlChecker(WhiteListEntry[] whiteList,
                  UrlMatcher[] blackList)
Creates a new instance of UrlChecker.

Parameters:
whiteList - The white list: an array of WhiteListEntry objects; a URL must match one of them in order to be processed.
blackList - The black list: an array of UrlMatchers; a URL must not match any of them in order to be processed.
Method Detail

normalizeStartUrls

public StartUrl[] normalizeStartUrls(StartUrl[] urlArr)
Normalizes the start URLs.

Parameters:
urlArr - The start URLs to normalize.
Returns:
The normalized start URLs.
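The class description above requires the start URLs to be prefix-free among each other. The following is a hypothetical, self-contained sketch of that normalization over plain strings (Regain's real method works on StartUrl objects and may behave differently): a start URL is dropped when another, shorter start URL is already a prefix of it, because the tree traversal of the shorter one covers it.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: class and method names are stand-ins,
// not Regain's real API. Makes a list of start URLs prefix-free.
public class StartUrlNormalizer {

    public static List<String> normalize(List<String> startUrls) {
        List<String> result = new ArrayList<>();
        for (String candidate : startUrls) {
            boolean covered = false;
            for (String other : startUrls) {
                // Drop 'candidate' if a different start URL is its prefix:
                // the traversal of 'other' already visits everything below it.
                if (!other.equals(candidate) && candidate.startsWith(other)) {
                    covered = true;
                    break;
                }
            }
            if (!covered) {
                result.add(candidate);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
            "file:///data/docs/",
            "file:///data/docs/manuals/",  // covered by the entry above
            "file:///data/archive/");
        System.out.println(normalize(urls));
        // [file:///data/docs/, file:///data/archive/]
    }
}
```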

hasNoCycles

public boolean hasNoCycles(String url,
                           int maxCycles)
This method tries to detect cycles in a URI. Every part of the path is compared to each of the others. If the same part occurs more than maxCycles times, the URI is marked as a 'cycle URI'.

Parameters:
maxCycles - The maximum number of occurrences of the same path part
url - The URI to be checked
Returns:
true if the URI has no cycles, false if cycles were detected.
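The heuristic described above can be sketched as follows. This is an illustrative reimplementation under the stated behavior (count repeated path segments, flag the URI when a segment occurs more than maxCycles times), not Regain's actual code:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the documented cycle heuristic; the real
// implementation in UrlChecker may differ in details.
public class CycleCheck {

    public static boolean hasNoCycles(String url, int maxCycles) {
        Map<String, Integer> counts = new HashMap<>();
        for (String part : url.split("/")) {
            if (part.isEmpty()) continue;
            // Count each path segment; too many repeats suggest a
            // crawler trap like /a/b/a/b/a/...
            int n = counts.merge(part, 1, Integer::sum);
            if (n > maxCycles) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(hasNoCycles("http://host/a/b/a/b/a/c", 2)); // false: "a" occurs 3 times
        System.out.println(hasNoCycles("http://host/a/b/c", 2));       // true
    }
}
```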

isUrlAccepted

public UrlMatcher isUrlAccepted(String url)
Checks whether the URL is accepted by the black and the white list.

This is the case if it matches no prefix from the black list and at least one prefix from the white list.

Parameters:
url - The URL to check.
Returns:
Whether the URL is accepted by the black and the white list.
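The decision rule (no black-list match, at least one white-list match) can be illustrated with a minimal prefix-based sketch. Regain's real UrlMatcher supports more than plain prefixes and this method actually returns a UrlMatcher; the boolean version below only demonstrates the accept logic:

```java
// Illustrative sketch only; names and signatures are stand-ins.
public class AcceptCheck {

    public static boolean isUrlAccepted(String url, String[] whiteList, String[] blackList) {
        for (String black : blackList) {
            if (url.startsWith(black)) {
                return false; // any black-list match vetoes the URL
            }
        }
        for (String white : whiteList) {
            if (url.startsWith(white)) {
                return true; // one white-list match is enough
            }
        }
        return false; // no white-list entry matched
    }

    public static void main(String[] args) {
        String[] white = { "http://example.org/docs/" };
        String[] black = { "http://example.org/docs/private/" };
        System.out.println(isUrlAccepted("http://example.org/docs/index.html", white, black));    // true
        System.out.println(isUrlAccepted("http://example.org/docs/private/x.html", white, black)); // false
    }
}
```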

createPreserveUrlMatcherArr

public UrlMatcher[] createPreserveUrlMatcherArr()
Creates an array of UrlMatchers that identify URLs that should not be deleted from the search index.

This list corresponds to the white list entries whose shouldBeUpdated flag is false.

Returns:
An array of UrlMatchers that identify URLs that should not be deleted from the search index.
See Also:
WhiteListEntry.shouldBeUpdated()
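A hypothetical sketch of how such a preserve list could be derived: keep the entry of every white-list item whose shouldBeUpdated flag is false. The WhiteListEntry record below is a stand-in for Regain's real class, reduced to the two fields this logic needs:

```java
import java.util.ArrayList;
import java.util.List;

public class PreserveListBuilder {

    // Stand-in for net.sf.regain.crawler.config.WhiteListEntry.
    record WhiteListEntry(String urlPrefix, boolean shouldBeUpdated) {}

    public static List<String> createPreserveList(List<WhiteListEntry> whiteList) {
        List<String> preserve = new ArrayList<>();
        for (WhiteListEntry entry : whiteList) {
            if (!entry.shouldBeUpdated()) {
                // Entries that are never re-crawled must not be purged
                // from the index, so their URL pattern is preserved.
                preserve.add(entry.urlPrefix());
            }
        }
        return preserve;
    }

    public static void main(String[] args) {
        List<WhiteListEntry> whiteList = List.of(
            new WhiteListEntry("http://example.org/static/", false),
            new WhiteListEntry("http://example.org/news/", true));
        System.out.println(createPreserveList(whiteList));
        // [http://example.org/static/]
    }
}
```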

wasAlreadyAccepted

public boolean wasAlreadyAccepted(String url)
Decides whether the given URL was already accepted in a crawler run.

Parameters:
url - The URL to check.
Returns:
Whether the URL was already accepted.

wasAlreadyIgnored

public boolean wasAlreadyIgnored(String url)
Decides whether the given URL was already ignored in a crawler run.

Parameters:
url - The URL to check.
Returns:
Whether the URL was already ignored.

shouldBeKeptInIndex

public boolean shouldBeKeptInIndex(String url)
                            throws RegainException
Decides whether the given URL should be kept in the search index.

Parameters:
url - The URL to check.
Returns:
Whether the URL should be kept in the search index.
Throws:
RegainException - If the url is invalid.

setAccepted

public void setAccepted(String url)
Used by the crawler to set the accepted state for a certain URL.

Parameters:
url - The URL that was accepted by the crawler.

setIgnored

public void setIgnored(String url)
Used by the crawler to set the ignored state for a certain URL.

Parameters:
url - The URL that was ignored by the crawler.
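The bookkeeping behind setAccepted/setIgnored and the wasAlready* queries can be sketched with two HashSets, matching the mAcceptedUrlSet, mIgnoredUrlSet, and mIgnoredCount fields documented above. Class and method names here are illustrative stand-ins:

```java
import java.util.HashSet;
import java.util.Set;

// Minimal sketch of UrlChecker's accepted/ignored state tracking.
public class UrlStateTracker {

    private final Set<String> acceptedUrls = new HashSet<>();
    private final Set<String> ignoredUrls = new HashSet<>();
    private int ignoredCount = 0;

    public void setAccepted(String url) {
        acceptedUrls.add(url);
    }

    public void setIgnored(String url) {
        ignoredUrls.add(url);
        ignoredCount++; // separate counter, mirroring mIgnoredCount
    }

    public boolean wasAlreadyAccepted(String url) {
        return acceptedUrls.contains(url);
    }

    public boolean wasAlreadyIgnored(String url) {
        return ignoredUrls.contains(url);
    }

    public int getIgnoredCount() {
        return ignoredCount;
    }

    public static void main(String[] args) {
        UrlStateTracker tracker = new UrlStateTracker();
        tracker.setAccepted("http://example.org/a.html");
        tracker.setIgnored("http://example.org/b.html");
        System.out.println(tracker.wasAlreadyAccepted("http://example.org/a.html")); // true
        System.out.println(tracker.wasAlreadyIgnored("http://example.org/a.html"));  // false
        System.out.println(tracker.getIgnoredCount());                               // 1
    }
}
```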

getIgnoredCount

public int getIgnoredCount()
Gets the number of URLs that have been ignored.

Returns:
The number of URLs that have been ignored.

getmAcceptedUrlSet

public HashSet<String> getmAcceptedUrlSet()
Gets the set of accepted URLs.

Returns:
The set of accepted URLs (mAcceptedUrlSet).


Regain 2.1.0-STABLE, Copyright (C) 2004-2010 Til Schneider, www.murfman.de, Thomas Tesche, www.clustersystems.info