Regain 2.1.0-STABLE API
java.lang.Object
  net.sf.regain.crawler.UrlChecker
public class UrlChecker
Decides whether a URL was already accepted or ignored.
For this decision we take advantage of a peculiarity in the processing of file
URLs: since directories are searched by tree traversal, we can be sure that
we never find the same file twice in one crawler run. The only thing we have
to ensure for this to hold is that the start URLs are prefix-free among each
other (which is done by normalizeStartUrls(StartUrl[])).
For http-URLs we have to remember all accepted or ignored URLs, because http-URLs are found by page parsing, which can turn up any URL at any point of the crawl.
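The bookkeeping described above can be sketched roughly as follows. This is a minimal illustration, not the Regain source: the class name SeenUrlSketch, the "file://" prefix test, and the choice to count every ignored URL (including file URLs) are assumptions made for the example.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of the accepted/ignored bookkeeping; not Regain code. */
final class SeenUrlSketch {
    // Only non-file URLs are remembered, because file URLs are reached by
    // tree traversal and cannot show up twice in one run.
    private final Set<String> acceptedUrls = new HashSet<>();
    private final Set<String> ignoredUrls = new HashSet<>();
    private int ignoredCount = 0;

    // Assumption: a file URL is recognized by its "file://" prefix.
    private static boolean isFileUrl(String url) {
        return url.startsWith("file://");
    }

    void setAccepted(String url) {
        if (!isFileUrl(url)) {
            acceptedUrls.add(url);
        }
    }

    void setIgnored(String url) {
        ignoredCount++; // Assumption: every ignored URL is counted.
        if (!isFileUrl(url)) {
            ignoredUrls.add(url);
        }
    }

    boolean wasAlreadyAccepted(String url) {
        return acceptedUrls.contains(url);
    }

    boolean wasAlreadyIgnored(String url) {
        return ignoredUrls.contains(url);
    }

    int getIgnoredCount() {
        return ignoredCount;
    }

    public static void main(String[] args) {
        SeenUrlSketch seen = new SeenUrlSketch();
        seen.setAccepted("http://example.org/index.html");
        seen.setAccepted("file:///data/docs/readme.txt");
        System.out.println(seen.wasAlreadyAccepted("http://example.org/index.html"));
        System.out.println(seen.wasAlreadyAccepted("file:///data/docs/readme.txt"));
    }
}
```

In this sketch a file URL is never reported as already seen, which is only safe if the start URLs are prefix-free.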
Field Summary | |
---|---|
private HashSet<String> | mAcceptedUrlSet: Contains all http-URLs that have been accepted. |
private UrlMatcher[] | mBlackListArr: The black list. |
private int | mIgnoredCount: The number of URLs that have been ignored. |
private HashSet<String> | mIgnoredUrlSet: Contains all http-URLs that have been ignored. |
private static org.apache.log4j.Logger | mLog: The logger for this class. |
private WhiteListEntry[] | mWhiteListEntryArr: The white list. |
Constructor Summary | |
---|---|
UrlChecker(WhiteListEntry[] whiteList, UrlMatcher[] blackList): Creates a new instance of UrlChecker. |
Method Summary | |
---|---|
UrlMatcher[] | createPreserveUrlMatcherArr(): Creates an array of UrlMatchers that identify URLs that should not be deleted from the search index. |
int | getIgnoredCount(): Gets the number of URLs that have been ignored. |
HashSet<String> | getmAcceptedUrlSet(): Gets the set of accepted URLs. |
boolean | hasNoCycles(String url, int maxCycles): This method tries to detect cycles in a URI. |
UrlMatcher | isUrlAccepted(String url): Checks whether the URL is accepted by the black list and the white list. |
StartUrl[] | normalizeStartUrls(StartUrl[] urlArr): Normalizes the start URLs. |
void | setAccepted(String url): Used by the crawler to set the accepted state for a certain URL. |
void | setIgnored(String url): Used by the crawler to set the ignored state for a certain URL. |
boolean | shouldBeKeptInIndex(String url): Decides whether the given URL should be kept in the search index. |
boolean | wasAlreadyAccepted(String url): Decides whether the given URL was already accepted in a crawler run. |
boolean | wasAlreadyIgnored(String url): Decides whether the given URL was already ignored in a crawler run. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
private static org.apache.log4j.Logger mLog
private HashSet<String> mAcceptedUrlSet
private HashSet<String> mIgnoredUrlSet
private int mIgnoredCount
private WhiteListEntry[] mWhiteListEntryArr
The white list is an array of WhiteListEntry objects; a URL must match at least one of them in order to be processed.
private UrlMatcher[] mBlackListArr
The black list is an array of UrlMatchers; a URL must not match any of them in order to be processed.
Constructor Detail |
---|
public UrlChecker(WhiteListEntry[] whiteList, UrlMatcher[] blackList)
whiteList
- The white list. The white list is an array of
WhiteListEntry objects; a URL must match at least one of them in order to be processed.
blackList
- The black list. The black list is an array of UrlMatchers;
a URL must not match any of them in order to be processed.
Method Detail |
---|
public StartUrl[] normalizeStartUrls(StartUrl[] urlArr)
urlArr
- The start URLs to normalize.
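To illustrate what "prefix-free" start URLs means, here is a minimal sketch that works on plain strings instead of StartUrl objects. The class and method names are hypothetical, and the plain string-prefix comparison is a simplification; the actual normalization in Regain may apply further rules that this page does not describe.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of making a URL list prefix-free; not Regain code. */
final class PrefixFreeSketch {
    /**
     * Drops every URL that starts with another URL of the list.
     * After sorting, a prefix always appears directly before the URLs it
     * covers, so comparing against the last kept entry is sufficient.
     */
    static List<String> makePrefixFree(String[] urls) {
        String[] sorted = urls.clone();
        Arrays.sort(sorted);
        List<String> kept = new ArrayList<>();
        for (String url : sorted) {
            if (kept.isEmpty() || !url.startsWith(kept.get(kept.size() - 1))) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] startUrls = {
            "file:///data/docs/sub",
            "file:///data/docs",
            "file:///other"
        };
        // "file:///data/docs/sub" is dropped because it is covered by "file:///data/docs".
        System.out.println(makePrefixFree(startUrls));
    }
}
```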
public boolean hasNoCycles(String url, int maxCycles)
maxCycles
- The maximum number of occurrences of the same path part
url
- The URI to be checked
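One way such a check could work is to count how often each path part occurs in the URI. The sketch below is an assumption about the approach, not the Regain implementation: the class name is hypothetical, and treating "more than maxCycles occurrences" as a cycle is a guess at the exact threshold.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of path-part cycle detection; not Regain code. */
final class CycleSketch {
    /**
     * Returns true if no path part of the URI occurs more often than maxCycles.
     * Assumption: a part occurring more than maxCycles times indicates a cycle.
     */
    static boolean hasNoCycles(String url, int maxCycles) {
        String path = URI.create(url).getPath();
        if (path == null) {
            return true; // Opaque URIs (e.g. mailto:) have no path to inspect.
        }
        Map<String, Integer> counts = new HashMap<>();
        for (String part : path.split("/")) {
            if (part.isEmpty()) {
                continue; // Skip the empty piece before the leading slash.
            }
            int occurrences = counts.merge(part, 1, Integer::sum);
            if (occurrences > maxCycles) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(hasNoCycles("http://example.org/a/b/a/b/a/b", 2));
        System.out.println(hasNoCycles("http://example.org/a/b/c", 2));
    }
}
```

A check of this kind guards against link structures that keep nesting the same directory name, such as relative links that resolve to ever deeper paths.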
public UrlMatcher isUrlAccepted(String url)
This is the case if it matches no prefix from the black list and at least one prefix from the white list.
url
- The URL to check.
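The acceptance rule for isUrlAccepted (no black list match, at least one white list match) can be sketched with plain prefix strings. The names below are hypothetical, and returning the matching white list prefix wrapped in an Optional is an assumption made for the example; this page only states that the real method returns a UrlMatcher.

```java
import java.util.Optional;

/** Hypothetical sketch of the black list / white list decision; not Regain code. */
final class AcceptSketch {
    /**
     * Returns the first white list prefix that accepts the URL, or an empty
     * Optional if the URL hits the black list or matches no white list prefix.
     */
    static Optional<String> findAcceptingPrefix(String url, String[] whiteList, String[] blackList) {
        // The black list is checked first: a single match rejects the URL.
        for (String black : blackList) {
            if (url.startsWith(black)) {
                return Optional.empty();
            }
        }
        // The URL must then match at least one white list prefix.
        for (String white : whiteList) {
            if (url.startsWith(white)) {
                return Optional.of(white);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String[] whiteList = {"http://example.org/"};
        String[] blackList = {"http://example.org/private/"};
        System.out.println(findAcceptingPrefix("http://example.org/docs/a.html", whiteList, blackList));
        System.out.println(findAcceptingPrefix("http://example.org/private/b.html", whiteList, blackList));
    }
}
```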
public UrlMatcher[] createPreserveUrlMatcherArr()
This list corresponds to the white list entries whose shouldBeUpdated flag is false.
WhiteListEntry.shouldBeUpdated()
public boolean wasAlreadyAccepted(String url)
url
- The URL to check.
public boolean wasAlreadyIgnored(String url)
url
- The URL to check.
public boolean shouldBeKeptInIndex(String url) throws RegainException
url
- The URL to check.
RegainException
- If the URL is invalid.
public void setAccepted(String url)
url
- The URL that was accepted by the crawler.
public void setIgnored(String url)
url
- The URL that was ignored by the crawler.
public int getIgnoredCount()
public HashSet<String> getmAcceptedUrlSet()