Regain 2.1.0-STABLE API

net.sf.regain.crawler
Class UrlChecker

java.lang.Object
  extended by net.sf.regain.crawler.UrlChecker

public class UrlChecker
extends Object

Decides whether a URL was already accepted or ignored.

For this decision we take advantage of a specialty of the processing of file URLs: since directories are searched by tree traversal, we can be sure that we never find the same file twice in one crawler run. The only thing we have to ensure to make this true is that the start URLs are prefix-free among each other (which is done by normalizeStartUrls(StartUrl[])).

For http URLs we have to remember all accepted or ignored URLs, because http URLs are found by page parsing, which can randomly yield any URL.

Author:
Til Schneider, www.murfman.de

Field Summary
private  HashSet<String> mAcceptedUrlSet
          Contains all http-URLs that have been accepted.
private  UrlMatcher[] mBlackListArr
          The black list.
private  int mIgnoredCount
          The number of URLs that have been ignored.
private  HashSet<String> mIgnoredUrlSet
          Contains all http-URLs that have been ignored.
private static org.apache.log4j.Logger mLog
          The logger for this class.
private  WhiteListEntry[] mWhiteListEntryArr
          The white list.
 
Constructor Summary
UrlChecker(WhiteListEntry[] whiteList, UrlMatcher[] blackList)
          Creates a new instance of UrlChecker.
 
Method Summary
 UrlMatcher[] createPreserveUrlMatcherArr()
          Creates an array of UrlMatchers that identify URLs that should not be deleted from the search index.
 int getIgnoredCount()
          Gets the number of URLs that have been ignored.
 HashSet<String> getmAcceptedUrlSet()
          Gets the set of accepted URLs.
 boolean hasNoCycles(String url, int maxCycles)
          This method tries to detect cycles in an URI.
 UrlMatcher isUrlAccepted(String url)
          Checks whether the URL is accepted by the black and the white list.
 StartUrl[] normalizeStartUrls(StartUrl[] urlArr)
          Normalizes the start URLs.
 void setAccepted(String url)
          Used by the crawler to set the accepted state for a certain URL.
 void setIgnored(String url)
          Used by the crawler to set the ignored state for a certain URL.
 boolean shouldBeKeptInIndex(String url)
          Decides whether the given URL should be kept in the search index.
 boolean wasAlreadyAccepted(String url)
          Decides whether the given URL was already accepted in a crawler run.
 boolean wasAlreadyIgnored(String url)
          Decides whether the given URL was already ignored in a crawler run.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

mLog

private static org.apache.log4j.Logger mLog
The logger for this class.


mAcceptedUrlSet

private HashSet<String> mAcceptedUrlSet
Contains all http-URLs that have been accepted.


mIgnoredUrlSet

private HashSet<String> mIgnoredUrlSet
Contains all http-URLs that have been ignored.


mIgnoredCount

private int mIgnoredCount
The number of URLs that have been ignored.


mWhiteListEntryArr

private WhiteListEntry[] mWhiteListEntryArr
The white list.

The white list is an array of WhiteListEntry objects; a URL must match one of them in order to be processed.


mBlackListArr

private UrlMatcher[] mBlackListArr
The black list.

The black list is an array of UrlMatchers; a URL must not match any of them in order to be processed.

Constructor Detail

UrlChecker

public UrlChecker(WhiteListEntry[] whiteList,
                  UrlMatcher[] blackList)
Creates a new instance of UrlChecker.

Parameters:
whiteList - The white list: an array of WhiteListEntry objects; a URL must match one of them in order to be processed.
blackList - The black list: an array of UrlMatchers; a URL must not match any of them in order to be processed.
Method Detail

normalizeStartUrls

public StartUrl[] normalizeStartUrls(StartUrl[] urlArr)
Normalizes the start URLs.

Parameters:
urlArr - The start URLs to normalize.
Returns:
The normalized start URLs.
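The class description above requires the start URLs to be prefix-free among each other. The following is a hypothetical, self-contained sketch of that normalization over plain strings (Regain's real method works on StartUrl objects and may behave differently): a start URL is dropped when another, shorter start URL is already a prefix of it, because the tree traversal of the shorter one covers it.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: class and method names are stand-ins,
// not Regain's real API. Makes a list of start URLs prefix-free.
public class StartUrlNormalizer {

    public static List<String> normalize(List<String> startUrls) {
        List<String> result = new ArrayList<>();
        for (String candidate : startUrls) {
            boolean covered = false;
            for (String other : startUrls) {
                // Drop 'candidate' if a different start URL is its prefix:
                // the traversal of 'other' already visits everything below it.
                if (!other.equals(candidate) && candidate.startsWith(other)) {
                    covered = true;
                    break;
                }
            }
            if (!covered) {
                result.add(candidate);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
            "file:///data/docs/",
            "file:///data/docs/manuals/",  // covered by the entry above
            "file:///data/archive/");
        System.out.println(normalize(urls));
        // [file:///data/docs/, file:///data/archive/]
    }
}
```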

hasNoCycles

public boolean hasNoCycles(String url,
                           int maxCycles)
This method tries to detect cycles in a URI. Every part of the path is compared to each of the others. If the same part occurs more than maxCycles times, the URI is marked as a 'cycle URI'.

Parameters:
maxCycles - The maximum number of occurrences of the same path part
url - The URI to be checked
Returns:
true if the URI has no cycles, false if cycles were detected.
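The heuristic described above can be sketched as follows. This is an illustrative reimplementation under the stated behavior (count repeated path segments, flag the URI when a segment occurs more than maxCycles times), not Regain's actual code:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the documented cycle heuristic; the real
// implementation in UrlChecker may differ in details.
public class CycleCheck {

    public static boolean hasNoCycles(String url, int maxCycles) {
        Map<String, Integer> counts = new HashMap<>();
        for (String part : url.split("/")) {
            if (part.isEmpty()) continue;
            // Count each path segment; too many repeats suggest a
            // crawler trap like /a/b/a/b/a/...
            int n = counts.merge(part, 1, Integer::sum);
            if (n > maxCycles) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(hasNoCycles("http://host/a/b/a/b/a/c", 2)); // false: "a" occurs 3 times
        System.out.println(hasNoCycles("http://host/a/b/c", 2));       // true
    }
}
```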

isUrlAccepted

public UrlMatcher isUrlAccepted(String url)
Checks whether the URL is accepted by the black and the white list.

This is the case if it matches no prefix from the black list and at least one prefix from the white list.

Parameters:
url - The URL to check.
Returns:
Whether the URL is accepted by the black and the white list.
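The decision rule (no black-list match, at least one white-list match) can be illustrated with a minimal prefix-based sketch. Regain's real UrlMatcher supports more than plain prefixes and this method actually returns a UrlMatcher; the boolean version below only demonstrates the accept logic:

```java
// Illustrative sketch only; names and signatures are stand-ins.
public class AcceptCheck {

    public static boolean isUrlAccepted(String url, String[] whiteList, String[] blackList) {
        for (String black : blackList) {
            if (url.startsWith(black)) {
                return false; // any black-list match vetoes the URL
            }
        }
        for (String white : whiteList) {
            if (url.startsWith(white)) {
                return true; // one white-list match is enough
            }
        }
        return false; // no white-list entry matched
    }

    public static void main(String[] args) {
        String[] white = { "http://example.org/docs/" };
        String[] black = { "http://example.org/docs/private/" };
        System.out.println(isUrlAccepted("http://example.org/docs/index.html", white, black));    // true
        System.out.println(isUrlAccepted("http://example.org/docs/private/x.html", white, black)); // false
    }
}
```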

createPreserveUrlMatcherArr

public UrlMatcher[] createPreserveUrlMatcherArr()
Creates an array of UrlMatchers that identify URLs that should not be deleted from the search index.

This list corresponds to the white list entries whose shouldBeUpdated flag is false.

Returns:
An array of UrlMatchers that identify URLs that should not be deleted from the search index.
See Also:
WhiteListEntry.shouldBeUpdated()
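A hypothetical sketch of how such a preserve list could be derived: keep the entry of every white-list item whose shouldBeUpdated flag is false. The WhiteListEntry record below is a stand-in for Regain's real class, reduced to the two fields this logic needs:

```java
import java.util.ArrayList;
import java.util.List;

public class PreserveListBuilder {

    // Stand-in for net.sf.regain.crawler.config.WhiteListEntry.
    record WhiteListEntry(String urlPrefix, boolean shouldBeUpdated) {}

    public static List<String> createPreserveList(List<WhiteListEntry> whiteList) {
        List<String> preserve = new ArrayList<>();
        for (WhiteListEntry entry : whiteList) {
            if (!entry.shouldBeUpdated()) {
                // Entries that are never re-crawled must not be purged
                // from the index, so their URL pattern is preserved.
                preserve.add(entry.urlPrefix());
            }
        }
        return preserve;
    }

    public static void main(String[] args) {
        List<WhiteListEntry> whiteList = List.of(
            new WhiteListEntry("http://example.org/static/", false),
            new WhiteListEntry("http://example.org/news/", true));
        System.out.println(createPreserveList(whiteList));
        // [http://example.org/static/]
    }
}
```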

wasAlreadyAccepted

public boolean wasAlreadyAccepted(String url)
Decides whether the given URL was already accepted in a crawler run.

Parameters:
url - The URL to check.
Returns:
Whether the URL was already accepted.

wasAlreadyIgnored

public boolean wasAlreadyIgnored(String url)
Decides whether the given URL was already ignored in a crawler run.

Parameters:
url - The URL to check.
Returns:
Whether the URL was already ignored.

shouldBeKeptInIndex

public boolean shouldBeKeptInIndex(String url)
                            throws RegainException
Decides whether the given URL should be kept in the search index.

Parameters:
url - The URL to check.
Returns:
Whether the URL should be kept in the search index.
Throws:
RegainException - If the url is invalid.

setAccepted

public void setAccepted(String url)
Used by the crawler to set the accepted state for a certain URL.

Parameters:
url - The URL that was accepted by the crawler.

setIgnored

public void setIgnored(String url)
Used by the crawler to set the ignored state for a certain URL.

Parameters:
url - The URL that was ignored by the crawler.
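The bookkeeping behind setAccepted/setIgnored and the wasAlready* queries can be sketched with two HashSets, matching the mAcceptedUrlSet, mIgnoredUrlSet, and mIgnoredCount fields documented above. Class and method names here are illustrative stand-ins:

```java
import java.util.HashSet;
import java.util.Set;

// Minimal sketch of UrlChecker's accepted/ignored state tracking.
public class UrlStateTracker {

    private final Set<String> acceptedUrls = new HashSet<>();
    private final Set<String> ignoredUrls = new HashSet<>();
    private int ignoredCount = 0;

    public void setAccepted(String url) {
        acceptedUrls.add(url);
    }

    public void setIgnored(String url) {
        ignoredUrls.add(url);
        ignoredCount++; // separate counter, mirroring mIgnoredCount
    }

    public boolean wasAlreadyAccepted(String url) {
        return acceptedUrls.contains(url);
    }

    public boolean wasAlreadyIgnored(String url) {
        return ignoredUrls.contains(url);
    }

    public int getIgnoredCount() {
        return ignoredCount;
    }

    public static void main(String[] args) {
        UrlStateTracker tracker = new UrlStateTracker();
        tracker.setAccepted("http://example.org/a.html");
        tracker.setIgnored("http://example.org/b.html");
        System.out.println(tracker.wasAlreadyAccepted("http://example.org/a.html")); // true
        System.out.println(tracker.wasAlreadyIgnored("http://example.org/a.html"));  // false
        System.out.println(tracker.getIgnoredCount());                               // 1
    }
}
```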

getIgnoredCount

public int getIgnoredCount()
Gets the number of URLs that have been ignored.

Returns:
The number of URLs that have been ignored.

getmAcceptedUrlSet

public HashSet<String> getmAcceptedUrlSet()
Gets the set of accepted URLs.

Returns:
The set of accepted URLs (mAcceptedUrlSet).


Regain 2.1.0-STABLE, Copyright (C) 2004-2010 Til Schneider, www.murfman.de, Thomas Tesche, www.clustersystems.info