Regain 2.1.0-STABLE API

net.sf.regain.crawler
Class CrawlerToolkit

java.lang.Object
  extended by net.sf.regain.crawler.CrawlerToolkit

public class CrawlerToolkit
extends Object

Contains helper methods for the crawler.

Author:
Til Schneider, www.murfman.de, Gerhard Olsson, Thomas Tesche

Field Summary
private static org.apache.log4j.Logger mLog
          The logger for this class
private static java.util.regex.Pattern urlPatternLeft
           
 
Constructor Summary
CrawlerToolkit()
           
 
Method Summary
static String cleanFromHtmlTags(String text)
          Strips the tags from HTML text and converts all HTML entities into their corresponding characters.
static String cleanURL(String url, String[] urlCleaners)
          Removes unwanted parts from the URL.
static String completeDirectory(String url)
          Completes a URL that denotes a directory but doesn't end with a slash.
static String createURLFromProps(String[] parts)
           
static String createURLWithoutPath(String completeUrl)
          Extracts the left part of a URL (protocol, host, port).
static String[] executeNativeCommand(String[] commandArr)
          Executes a native command and returns its output.
static String extractCredentialsFromProtocolHostFragment(String urlFragment)
          Extracts the username and password from a URL fragment consisting of protocol and host/domain.
static AccountPasswordEntry findAuthenticationValuesForURL(String url, Map<String,AccountPasswordEntry> authMap)
          Sets the account and password for a URL if the store contains an account/password entry matching the URL.
static InputStream getHttpStream(URL url)
          Originally copied from javax.swing.JEditorPane#getStream(...).
static void initHttpClient(CrawlerConfig config)
          Initializes the HTTP client
static byte[] loadFile(File file)
          Loads a file from the file system and returns the content
static byte[] loadFileFromStream(InputStream inputStream, int length)
          Loads content from an InputStream and returns the content
static byte[] loadHttpDocument(String url)
          Downloads a document from an HTTP server and returns its content.
static void printActiveThreads()
          Prints the active threads to System.out.
static String removeAnchor(String url)
          Removes anchors from URLs like http://mydomain.com/index.html#anchor
static String replaceAuthenticationValuesInURL(String url, AccountPasswordEntry entry)
          Sets account and password for a URL
static String replaceHtmlEntities(String text)
          Converts all HTML entities into their corresponding characters.
static String toAbsoluteUrl(String url, String parentUrl)
          Converts the given HTTP URL into an absolute URL.
private static String toCommand(String[] commandArr)
          Returns a human readable command string for a command.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

mLog

private static org.apache.log4j.Logger mLog
The logger for this class


urlPatternLeft

private static java.util.regex.Pattern urlPatternLeft
Constructor Detail

CrawlerToolkit

public CrawlerToolkit()
Method Detail

createURLFromProps

public static String createURLFromProps(String[] parts)

toCommand

private static String toCommand(String[] commandArr)
Returns a human readable command string for a command.

Parameters:
commandArr - The command separated in executable and parameters.
Returns:
The human readable command, where the parameters follow the executable separated by spaces.
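
A minimal sketch of the joining described above, assuming plain space separation with no quoting or escaping. The class name CommandStringSketch is hypothetical, and the actual Regain implementation may differ.

```java
final class CommandStringSketch {
    // Joins the executable and its parameters with single spaces.
    // Assumption: no quoting or escaping of parameters that contain spaces.
    static String toCommand(String[] commandArr) {
        return String.join(" ", commandArr);
    }
}
```

The resulting string is meant for log messages only; it is not safe to pass back to a shell.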

executeNativeCommand

public static String[] executeNativeCommand(String[] commandArr)
                                     throws RegainException
Executes a native command and returns its output.

Parameters:
commandArr - An array containing the command to execute and its parameters.
Returns:
The output of the command as an array of lines.
Throws:
RegainException - If executing failed.
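
One way such a helper could be written with the JDK's ProcessBuilder is sketched below. This is an illustration under assumptions, not the Regain source: the class name NativeCommandSketch is hypothetical, stderr is merged into stdout, a non-zero exit code counts as failure, and unchecked exceptions stand in for RegainException.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class NativeCommandSketch {
    // Runs the command and returns its combined output as an array of lines.
    static String[] executeNativeCommand(String[] commandArr) {
        ProcessBuilder builder = new ProcessBuilder(commandArr);
        builder.redirectErrorStream(true); // one combined stream, so stderr cannot block the process
        List<String> lines = new ArrayList<>();
        try {
            Process process = builder.start();
            try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(process.getInputStream()))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    lines.add(line);
                }
            }
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                // Assumption: a non-zero exit code is treated as a failed execution.
                throw new IllegalStateException("Command exited with code " + exitCode);
            }
        } catch (IOException exc) {
            throw new UncheckedIOException("Executing the command failed", exc);
        } catch (InterruptedException exc) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag for callers
            throw new IllegalStateException("Interrupted while waiting for the command", exc);
        }
        return lines.toArray(new String[0]);
    }
}
```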

getHttpStream

public static InputStream getHttpStream(URL url)
                                 throws RedirectException,
                                        HttpStreamException
Originally copied from javax.swing.JEditorPane#getStream(...).

Fetches a stream for the given URL, which is about to be loaded by the setPage method. By default, this simply opens the URL and returns the stream. This can be reimplemented to do useful things like fetch the stream from a cache, monitor the progress of the stream, etc.

This method is expected to have the side effect of establishing the content type, and therefore setting the appropriate EditorKit to use for loading the stream.

If the stream was an HTTP connection, redirects will be followed and the resulting URL will be set as the Document.StreamDescriptionProperty so that relative URLs can be properly resolved.

Parameters:
url - the URL of the page
Returns:
a stream reading data from the specified URL.
Throws:
RedirectException - if the URL redirects to another URL.
HttpStreamException - if something went wrong.

loadHttpDocument

public static byte[] loadHttpDocument(String url)
                               throws RegainException
Downloads a document from an HTTP server and returns its content.

Parameters:
url - The URL of the document to load.
Returns:
The content of the document.
Throws:
RegainException - If loading failed.

loadFile

public static byte[] loadFile(File file)
                       throws RegainException
Loads a file from the file system and returns the content

Parameters:
file - The file to load
Returns:
byte[] The content of file
Throws:
RegainException - in case of problems while loading

loadFileFromStream

public static byte[] loadFileFromStream(InputStream inputStream,
                                        int length)
                                 throws RegainException
Loads content from an InputStream and returns the content

Parameters:
inputStream - the stream to read
Returns:
byte[] The content of the source
Throws:
RegainException - in case of problems while loading
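
A hedged sketch of reading a stream fully into a byte array. The role of the length parameter is not documented above, so this sketch assumes it is only a size hint for the buffer. The class name StreamLoadSketch is hypothetical, and UncheckedIOException stands in for RegainException.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class StreamLoadSketch {
    // Reads the stream to its end and returns everything that was read.
    // Assumption: 'length' only pre-sizes the buffer; non-positive values use a default size.
    static byte[] loadFileFromStream(InputStream inputStream, int length) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(length > 0 ? length : 8192);
        byte[] buffer = new byte[4096];
        try {
            int read;
            while ((read = inputStream.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        } catch (IOException exc) {
            throw new UncheckedIOException("Loading from stream failed", exc);
        }
        return out.toByteArray();
    }
}
```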

toAbsoluteUrl

public static String toAbsoluteUrl(String url,
                                   String parentUrl)
Converts the given HTTP URL into an absolute URL.

If the URL is already absolute, it is returned unchanged.

Parameters:
url - The URL to convert.
parentUrl - The URL that the URL to convert is relative to. This URL must be absolute.
Returns:
The absolute version of the given URL.
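
The resolution rule described above can be illustrated with java.net.URI. This is a sketch, not the Regain implementation: the class name AbsoluteUrlSketch is hypothetical, and URI.create throws IllegalArgumentException for strings that are not valid URIs.

```java
import java.net.URI;

final class AbsoluteUrlSketch {
    // Resolves 'url' against the absolute 'parentUrl'.
    // An already absolute 'url' is returned unchanged by URI.resolve.
    static String toAbsoluteUrl(String url, String parentUrl) {
        return URI.create(parentUrl).resolve(url).toString();
    }
}
```

For example, resolving images/logo.png against http://example.com/docs/index.html yields http://example.com/docs/images/logo.png.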

completeDirectory

public static String completeDirectory(String url)
Completes a URL that denotes a directory but doesn't end with a slash.

Parameters:
url - the URL to check and fix
Returns:
fixed URL
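
How a directory URL is recognized is not stated above. The sketch below assumes a simple heuristic: a last path segment without a dot is treated as a directory. The class name DirectoryUrlSketch and the heuristic itself are assumptions for illustration only.

```java
import java.net.URI;

final class DirectoryUrlSketch {
    // Appends a slash when the URL looks like a directory.
    // Assumption: a last path segment without '.' denotes a directory.
    static String completeDirectory(String url) {
        URI uri = URI.create(url);
        if (uri.getRawQuery() != null || uri.getRawFragment() != null) {
            return url; // leave URLs with a query or an anchor untouched
        }
        String path = uri.getRawPath();
        if (path == null || path.endsWith("/")) {
            return url; // opaque URI, or already ends with a slash
        }
        if (path.isEmpty()) {
            return url + "/"; // host only, e.g. http://example.com
        }
        String lastSegment = path.substring(path.lastIndexOf('/') + 1);
        return lastSegment.contains(".") ? url : url + "/";
    }
}
```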

removeAnchor

public static String removeAnchor(String url)
Removes anchors from URLs like http://mydomain.com/index.html#anchor

Parameters:
url - a URL with or without an anchor
Returns:
the URL without an anchor
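
A minimal sketch of the anchor removal, assuming the anchor is everything from the first '#' onward. The class name AnchorSketch is hypothetical.

```java
final class AnchorSketch {
    // Cuts the URL at the first '#'; URLs without an anchor are returned as-is.
    static String removeAnchor(String url) {
        int anchorStart = url.indexOf('#');
        return anchorStart == -1 ? url : url.substring(0, anchorStart);
    }
}
```

With the example above, http://mydomain.com/index.html#anchor becomes http://mydomain.com/index.html.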

printActiveThreads

public static void printActiveThreads()
Prints the active threads to System.out. Useful for debugging.
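
A sketch of how active threads can be listed with the JDK's Thread.enumerate. The class name ActiveThreadsSketch and the output format are assumptions. Note that Thread.enumerate only covers the current thread's group and its subgroups.

```java
import java.util.ArrayList;
import java.util.List;

final class ActiveThreadsSketch {
    // Collects the names of the active threads in the current thread group.
    static List<String> activeThreadNames() {
        // activeCount() is only an estimate, so leave some headroom in the array.
        Thread[] threads = new Thread[Thread.activeCount() + 16];
        int count = Thread.enumerate(threads);
        List<String> names = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            names.add(threads[i].getName());
        }
        return names;
    }

    // Prints one line per active thread to System.out.
    static void printActiveThreads() {
        for (String name : activeThreadNames()) {
            System.out.println("Active thread: " + name);
        }
    }
}
```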


initHttpClient

public static void initHttpClient(CrawlerConfig config)
Initializes the HTTP client

Parameters:
config - The configuration to read the settings from.

replaceHtmlEntities

public static String replaceHtmlEntities(String text)
Converts all HTML entities into their corresponding characters.

Parameters:
text - The text whose HTML entities should be converted.
Returns:
The converted text.
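
The conversion can be illustrated with a small regex-based sketch. It handles numeric entities and only a handful of named ones. The class name HtmlEntitySketch and the entity table are assumptions; the entity set Regain actually supports is not listed here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HtmlEntitySketch {
    // Deliberately small table of named entities (assumption, not Regain's list).
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("nbsp", "\u00a0");
        NAMED.put("auml", "\u00e4");
        NAMED.put("ouml", "\u00f6");
        NAMED.put("uuml", "\u00fc");
    }

    // Matches decimal (&#228;), hexadecimal (&#xE4;) and named (&auml;) entities.
    private static final Pattern ENTITY =
        Pattern.compile("&(#[0-9]{1,7}|#[xX][0-9a-fA-F]{1,6}|[A-Za-z][A-Za-z0-9]*);");

    static String replaceHtmlEntities(String text) {
        Matcher matcher = ENTITY.matcher(text);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            String decoded = decode(matcher.group(1), matcher.group());
            matcher.appendReplacement(result, Matcher.quoteReplacement(decoded));
        }
        matcher.appendTail(result);
        return result.toString();
    }

    // Returns the decoded character, or the original entity text if it is unknown.
    private static String decode(String body, String original) {
        if (body.charAt(0) != '#') {
            String named = NAMED.get(body);
            return named != null ? named : original;
        }
        boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
        int codePoint = hex ? Integer.parseInt(body.substring(2), 16)
                            : Integer.parseInt(body.substring(1));
        return Character.isValidCodePoint(codePoint)
            ? new String(Character.toChars(codePoint))
            : original;
    }
}
```

Unknown named entities are left untouched rather than dropped, so no text is lost when the table is incomplete.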

cleanFromHtmlTags

public static String cleanFromHtmlTags(String text)
Strips the tags from HTML text and converts all HTML entities into their corresponding characters.

Parameters:
text - The HTML text to clean.
Returns:
The text with its tags removed.
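
A deliberately simple sketch of the tag-stripping step only, with the entity conversion omitted. It assumes well-formed tags without a '>' inside attribute values, comments, or scripts. The class name TagStripSketch is hypothetical.

```java
final class TagStripSketch {
    // Removes everything between '<' and the next '>' (simplification, see assumptions above).
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }
}
```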

findAuthenticationValuesForURL

public static AccountPasswordEntry findAuthenticationValuesForURL(String url,
                                                                  Map<String,AccountPasswordEntry> authMap)
                                                           throws RegainException
Sets the account and password for a URL if the store contains an account/password entry matching the URL.

Parameters:
url - the url for enrichment
authMap - accountPasswordStore
Returns:
modified url
Throws:
RegainException

replaceAuthenticationValuesInURL

public static String replaceAuthenticationValuesInURL(String url,
                                                      AccountPasswordEntry entry)
Sets account and password for a URL

Parameters:
url - the URL for enrichment
entry - the account password entry
Returns:
URL with replacement
Throws:
RegainException
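
The accessors of AccountPasswordEntry are not documented on this page, so the following sketch takes account and password as plain strings. The class name UrlCredentialsSketch and the method name withCredentials are hypothetical. Existing credentials are replaced, and special characters are not percent-encoded.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UrlCredentialsSketch {
    // Group 1: "scheme://", optional existing "user:pw@" (dropped), group 2: the rest.
    private static final Pattern URL_PARTS =
        Pattern.compile("^([A-Za-z][A-Za-z0-9+.-]*://)(?:[^/@]*@)?(.*)$");

    // Inserts account:password@ directly after the protocol part of the URL.
    static String withCredentials(String url, String account, String password) {
        Matcher matcher = URL_PARTS.matcher(url);
        if (!matcher.matches()) {
            return url; // not a scheme://... URL, leave unchanged (assumption)
        }
        return matcher.group(1) + account + ":" + password + "@" + matcher.group(2);
    }
}
```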

createURLWithoutPath

public static String createURLWithoutPath(String completeUrl)
                                   throws RegainException
Extracts the left part of a URL (protocol, host, port).

Parameters:
completeUrl -
Returns:
the resulting URL (e.g. http://bl.dfs.dk:8080/mypath/fil.jsp?query will be http://bl.dfs.dk:8080/)
Throws:
RegainException
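
The extraction can be sketched with java.net.URI. The class name UrlLeftPartSketch is hypothetical, and IllegalArgumentException stands in for RegainException. Any user info in the URL is dropped in this sketch, which may differ from the private urlPatternLeft-based implementation.

```java
import java.net.URI;

final class UrlLeftPartSketch {
    // Returns "scheme://host[:port]/" for an absolute URL.
    static String createURLWithoutPath(String completeUrl) {
        URI uri = URI.create(completeUrl);
        if (uri.getScheme() == null || uri.getHost() == null) {
            throw new IllegalArgumentException("Not an absolute URL with a host: " + completeUrl);
        }
        String port = uri.getPort() == -1 ? "" : ":" + uri.getPort();
        return uri.getScheme() + "://" + uri.getHost() + port + "/";
    }
}
```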

cleanURL

public static String cleanURL(String url,
                              String[] urlCleaners)
Removes unwanted parts from the URL.

Parameters:
url -
urlCleaners -
Returns:
cleaned URL
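
The parameters are undocumented above. The sketch below assumes that each entry of urlCleaners is a regular expression whose matches are removed from the URL, for example to strip session IDs. That interpretation and the class name UrlCleanerSketch are assumptions.

```java
final class UrlCleanerSketch {
    // Removes every match of every cleaner pattern from the URL.
    // Assumption: each cleaner is a java.util.regex pattern.
    static String cleanURL(String url, String[] urlCleaners) {
        if (urlCleaners == null) {
            return url;
        }
        String cleaned = url;
        for (String cleaner : urlCleaners) {
            cleaned = cleaned.replaceAll(cleaner, "");
        }
        return cleaned;
    }
}
```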

extractCredentialsFromProtocolHostFragment

public static String extractCredentialsFromProtocolHostFragment(String urlFragment)
Extracts the username and password from a URL fragment consisting of protocol and host/domain. Example: http://tester:secret@host.sld.tld/

Parameters:
urlFragment - the fragment which contains protocol, optional user/pw and host+domain.
Returns:
the user:pw if it exists in the urlFragment
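
For a URL such as http://tester:secret@host.sld.tld/, the credentials part can be read with java.net.URI, as sketched below. The class name CredentialsExtractSketch is hypothetical, and returning an empty string when no credentials are present is an assumption, since the behavior for that case is not documented here.

```java
import java.net.URI;

final class CredentialsExtractSketch {
    // Returns "user:pw" from the authority part, or "" if there is none (assumption).
    static String extractCredentials(String urlFragment) {
        String userInfo = URI.create(urlFragment).getRawUserInfo();
        return userInfo != null ? userInfo : "";
    }
}
```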

Regain 2.1.0-STABLE, Copyright (C) 2004-2010 Til Schneider, www.murfman.de, Thomas Tesche, www.clustersystems.info