Regain 2.1.0-STABLE API
java.lang.Object
  net.sf.regain.crawler.CrawlerToolkit
public class CrawlerToolkit
Contains helper methods for the crawler.
Field Summary | |
---|---|
private static org.apache.log4j.Logger |
mLog
The logger for this class |
private static java.util.regex.Pattern |
urlPatternLeft
|
Constructor Summary | |
---|---|
CrawlerToolkit()
|
Method Summary | |
---|---|
static String |
cleanFromHtmlTags(String text)
Cleans HTML text of its tags and converts all HTML entities into their corresponding characters. |
static String |
cleanURL(String url,
String[] urlCleaners)
Removes unwanted parts from the URL. |
static String |
completeDirectory(String url)
Completes a URL which denotes a directory but does not end with a slash. |
static String |
createURLFromProps(String[] parts)
|
static String |
createURLWithoutPath(String completeUrl)
Extract left part of URL (protocol, host, port). |
static String[] |
executeNativeCommand(String[] commandArr)
Executes a native command and returns its output. |
static String |
extractCredentialsFromProtocolHostFragment(String urlFragment)
Extract the username, password from a given protocol, host-domain url fragment. |
static AccountPasswordEntry |
findAuthenticationValuesForURL(String url,
Map<String,AccountPasswordEntry> authMap)
Sets the account and password for a URL if the store contains an account/password entry matching the URL. |
static InputStream |
getHttpStream(URL url)
Originally copied from javax.swing.JEditorPane#getStream(...). |
static void |
initHttpClient(CrawlerConfig config)
Initializes the HTTP client |
static byte[] |
loadFile(File file)
Loads a file from the file system and returns the content |
static byte[] |
loadFileFromStream(InputStream inputStream,
int length)
Loads content from an InputStream and returns it. |
static byte[] |
loadHttpDocument(String url)
Downloads a document from an HTTP server and returns its content. |
static void |
printActiveThreads()
Prints the active threads to System.out. |
static String |
removeAnchor(String url)
Removes anchors from URLs like http://mydomain.com/index.html#anchor |
static String |
replaceAuthenticationValuesInURL(String url,
AccountPasswordEntry entry)
Sets account and password for a URL |
static String |
replaceHtmlEntities(String text)
Converts all HTML entities into their corresponding characters. |
static String |
toAbsoluteUrl(String url,
String parentUrl)
Converts the given HTTP URL into an absolute URL. |
private static String |
toCommand(String[] commandArr)
Returns a human readable command string for a command. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
private static org.apache.log4j.Logger mLog
private static java.util.regex.Pattern urlPatternLeft
Constructor Detail |
---|
public CrawlerToolkit()
Method Detail |
---|
public static String createURLFromProps(String[] parts)
private static String toCommand(String[] commandArr)
commandArr
- The command separated in executable and parameters.
public static String[] executeNativeCommand(String[] commandArr) throws RegainException
commandArr
- An array containing the command to execute and its parameters.
RegainException
- If executing failed.
public static InputStream getHttpStream(URL url) throws RedirectException, HttpStreamException
Fetches a stream for the given URL, which is about to be loaded by the setPage method. By default, this simply opens the URL and returns the stream. This can be reimplemented to do useful things like fetch the stream from a cache, monitor the progress of the stream, etc.
This method is expected to have the side effect of establishing the content type, and therefore setting the appropriate EditorKit to use for loading the stream.
If the stream was an HTTP connection, redirects will be followed and the resulting URL will be set as the Document.StreamDescriptionProperty so that relative URLs can be properly resolved.
url
- the URL of the page
RedirectException
- if the URL redirects to another URL.
HttpStreamException
- if something went wrong.
public static byte[] loadHttpDocument(String url) throws RegainException
url
- The URL of the document to load.
RegainException
- If loading failed.
public static byte[] loadFile(File file) throws RegainException
file
- The file to load
RegainException
- in case of problems while loading
public static byte[] loadFileFromStream(InputStream inputStream, int length) throws RegainException
inputStream
- the stream to read
RegainException
- in case of problems while loading
public static String toAbsoluteUrl(String url, String parentUrl)
If the URL was already absolute, it is returned unchanged.
url
- The URL to convert.
parentUrl
- The URL that the URL to convert is relative to. This URL must be absolute.
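The contract described for toAbsoluteUrl (absolute input returned unchanged, relative input resolved against an absolute parent) can be illustrated with the standard java.net.URI class. This is a minimal sketch and not the Regain implementation; the class ToAbsoluteUrlSketch and the method resolveAgainstParent are hypothetical names introduced only for this illustration.

```java
import java.net.URI;

// Hypothetical illustration; not taken from the Regain sources.
class ToAbsoluteUrlSketch {

    /**
     * Returns url unchanged if it already has a scheme, otherwise resolves it
     * against parentUrl, which is assumed to be absolute.
     */
    static String resolveAgainstParent(String url, String parentUrl) {
        URI candidate = URI.create(url);
        if (candidate.isAbsolute()) {
            return url; // already absolute: hand it back untouched
        }
        return URI.create(parentUrl).resolve(candidate).toString();
    }

    public static void main(String[] args) {
        String parent = "http://example.com/docs/index.html";
        System.out.println(resolveAgainstParent("page.html", parent));
        System.out.println(resolveAgainstParent("/top.html", parent));
        System.out.println(resolveAgainstParent("http://other.org/a.html", parent));
    }
}
```

URI.create throws IllegalArgumentException for strings that are not valid URIs, so a crawler handling arbitrary links would probably need extra tolerance that this sketch leaves out.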
public static String completeDirectory(String url)
url
- the URL to check and fix
public static String removeAnchor(String url)
url
- a URL with or without an anchor
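As an illustration of what removing an anchor means for a URL such as http://mydomain.com/index.html#anchor, the following sketch cuts the string at the first '#'. It is an assumption about the behaviour, not the Regain source; RemoveAnchorSketch and stripAnchor are hypothetical names.

```java
// Hypothetical illustration; not taken from the Regain sources.
class RemoveAnchorSketch {

    /** Returns the URL without its "#anchor" part, or unchanged if it has none. */
    static String stripAnchor(String url) {
        int hashPos = url.indexOf('#');
        return (hashPos == -1) ? url : url.substring(0, hashPos);
    }

    public static void main(String[] args) {
        System.out.println(stripAnchor("http://mydomain.com/index.html#anchor"));
        System.out.println(stripAnchor("http://mydomain.com/index.html"));
    }
}
```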
public static void printActiveThreads()
public static void initHttpClient(CrawlerConfig config)
config
- The configuration to read the settings from.
public static String replaceHtmlEntities(String text)
text
- The text whose HTML entities should be converted.
public static String cleanFromHtmlTags(String text)
text
- The HTML text to clean.
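The two HTML helpers above can be pictured with a small stand-alone sketch: one method replaces a handful of named entities, the other strips tags and then replaces entities. The entity table, the regular expression, and the names HtmlCleanSketch, replaceEntities and stripTags are all assumptions made for this example; the actual Regain code may cover more entities and handle malformed markup differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; not taken from the Regain sources.
class HtmlCleanSketch {

    // Deliberately tiny entity table; "&amp;" comes last so that "&amp;lt;"
    // turns into the literal text "&lt;" rather than "<".
    private static final Map<String, String> ENTITIES = new LinkedHashMap<>();
    static {
        ENTITIES.put("&lt;", "<");
        ENTITIES.put("&gt;", ">");
        ENTITIES.put("&quot;", "\"");
        ENTITIES.put("&nbsp;", " ");
        ENTITIES.put("&auml;", "\u00E4");
        ENTITIES.put("&amp;", "&");
    }

    /** Replaces the entities listed in ENTITIES with their characters. */
    static String replaceEntities(String text) {
        String result = text;
        for (Map.Entry<String, String> entry : ENTITIES.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    /** Removes anything that looks like a tag, then replaces entities. */
    static String stripTags(String html) {
        String withoutTags = html.replaceAll("<[^>]*>", "");
        return replaceEntities(withoutTags);
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>Fish &amp; <b>chips</b></p>"));
        System.out.println(replaceEntities("1 &lt; 2"));
    }
}
```

Tags are stripped before entities are replaced so that escaped markup such as &lt;b&gt; survives as visible text instead of being removed as a tag.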
public static AccountPasswordEntry findAuthenticationValuesForURL(String url, Map<String,AccountPasswordEntry> authMap) throws RegainException
url
- the URL to enrich
authMap
- the account/password store
RegainException
public static String replaceAuthenticationValuesInURL(String url, AccountPasswordEntry entry)
url
- the URL to enrich
entry
- the account/password entry
RegainException
public static String createURLWithoutPath(String completeUrl) throws RegainException
completeUrl
-
RegainException
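The summary describes createURLWithoutPath as extracting the left part of a URL (protocol, host, port). A rough sketch of that idea using java.net.URI follows. It is not the Regain implementation: UrlWithoutPathSketch and leftPart are hypothetical names, and whether the real method appends a trailing slash or keeps credentials is not stated on this page.

```java
import java.net.URI;

// Hypothetical illustration; not taken from the Regain sources.
class UrlWithoutPathSketch {

    /** Returns "scheme://host[:port]" for a hierarchical URL. */
    static String leftPart(String completeUrl) {
        URI uri = URI.create(completeUrl);
        if (uri.getScheme() == null || uri.getHost() == null) {
            throw new IllegalArgumentException("No scheme or host in: " + completeUrl);
        }
        StringBuilder left = new StringBuilder();
        left.append(uri.getScheme()).append("://").append(uri.getHost());
        if (uri.getPort() != -1) {
            left.append(':').append(uri.getPort()); // only when a port is given explicitly
        }
        return left.toString();
    }

    public static void main(String[] args) {
        System.out.println(leftPart("http://example.com:8080/docs/index.html?x=1"));
        System.out.println(leftPart("https://example.com/a/b"));
    }
}
```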
public static String cleanURL(String url, String[] urlCleaners)
url
-
urlCleaners
-
public static String extractCredentialsFromProtocolHostFragment(String urlFragment)
urlFragment
- the fragment which contains protocol, optional user/pw and host+domain.
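To make the idea of a "protocol, optional user/pw and host+domain" fragment concrete, the sketch below pulls the user-info part out of such a fragment with java.net.URI. The page does not say in what form the real method returns the credentials, so returning the raw "user:password" text (or an empty string when none is present) is purely an assumption; CredentialsSketch and userInfoOf are hypothetical names.

```java
import java.net.URI;

// Hypothetical illustration; not taken from the Regain sources.
class CredentialsSketch {

    /**
     * Returns the "user:password" portion of a fragment such as
     * "http://user:secret@host.example.com", or "" if there is none.
     */
    static String userInfoOf(String urlFragment) {
        String userInfo = URI.create(urlFragment).getUserInfo();
        return (userInfo == null) ? "" : userInfo;
    }

    public static void main(String[] args) {
        System.out.println(userInfoOf("http://user:secret@host.example.com"));
        System.out.println(userInfoOf("http://host.example.com").isEmpty());
    }
}
```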