|
Regain 2.1.0-STABLE API | ||||||||
java.lang.Object
  net.sf.regain.crawler.document.AbstractPreparator
    net.sf.regain.crawler.preparator.HtmlPreparator
public class HtmlPreparator
Prepares an HTML document for indexing.
The document is parsed and its title is extracted.
Field Summary | |
---|---|
private List<HtmlContentExtractor> |
mContentExtractorList
The HtmlContentExtractors that cut the content to be indexed out of the HTML documents. |
private static org.apache.log4j.Logger |
mLog
The logger for this class |
private List<HtmlPathExtractor> |
mPathExtractorList
The HtmlPathExtractors that extract the respective path from the HTML documents. |
Fields inherited from interface net.sf.regain.crawler.document.Preparator |
---|
DEFAULT_BUFFER_SIZE |
Constructor Summary | |
---|---|
HtmlPreparator()
Creates a new instance of HtmlPreparator. |
Method Summary | |
---|---|
private String |
extractHtmlTitle(String content)
Extracts the title from an HTML document. |
private int |
getIntParam(Map<String,String> configSection,
String paramName)
Gets an int parameter from a configuration section |
void |
init(PreparatorConfig config)
Initializes the preparator. |
private boolean |
isIndexOf(String content,
String expected,
int pos)
Checks whether an expected substring is at a certain position. |
void |
prepare(RawDocument rawDocument)
Prepares a document for indexing. |
Methods inherited from class net.sf.regain.crawler.document.AbstractPreparator |
---|
accepts, addAdditionalField, cleanUp, close, concatenateStringParts, getAdditionalFields, getCleanedContent, getCleanedMetaData, getHeadlines, getPath, getPriority, getSummary, getTitle, setCleanedContent, setCleanedMetaData, setHeadlines, setPath, setPriority, setSummary, setTitle, setUrlRegex |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
private static org.apache.log4j.Logger mLog
private List<HtmlContentExtractor> mContentExtractorList
private List<HtmlPathExtractor> mPathExtractorList
Constructor Detail |
---|
public HtmlPreparator() throws RegainException
RegainException
- If creating the preparator failed.
Method Detail |
---|
public void init(PreparatorConfig config) throws RegainException
init
in interface Pluggable
init
in class AbstractPreparator
config
- The configuration.
RegainException
- If the configuration has an error.
private int getIntParam(Map<String,String> configSection, String paramName) throws RegainException
configSection
- The configuration section to get the int param from.
paramName
- The name of the parameter.
RegainException
- If the parameter is not set or is not a number.
public void prepare(RawDocument rawDocument) throws RegainException
rawDocument
- The document to prepare.
RegainException
- If something goes wrong during preparation.
private String extractHtmlTitle(String content)
content
- The content (the raw HTML data) of the document whose title should be determined.
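The page does not show how extractHtmlTitle is implemented. As a rough illustration of what extracting a title from raw HTML content can look like, here is a minimal sketch, not Regain's actual code. The class name HtmlTitleSketch, the regular expression, the whitespace normalization, and the null return for a missing title are all assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch; not part of the Regain API.
final class HtmlTitleSketch {

    // Assumed pattern: the first <title ...>...</title> element,
    // matched case-insensitively and across line breaks.
    private static final Pattern TITLE_PATTERN = Pattern.compile(
        "<title[^>]*>(.*?)</title>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /**
     * Returns the text of the first title element with runs of whitespace
     * collapsed to single spaces, or null if no title element is found.
     */
    static String extractHtmlTitle(String content) {
        Matcher matcher = TITLE_PATTERN.matcher(content);
        if (!matcher.find()) {
            return null;
        }
        return matcher.group(1).replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><TITLE> Regain\n search </TITLE></head></html>";
        System.out.println(extractHtmlTitle(html)); // prints "Regain search"
    }
}
```

A regular expression is enough for a sketch like this, but it does not decode HTML entities or cope with malformed markup, which a real preparator may need to handle.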
private boolean isIndexOf(String content, String expected, int pos)
content
- The String in which to check for the expected substring.
expected
- The expected substring.
pos
- The position where the substring is expected.
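The check described for isIndexOf (is the expected substring located exactly at position pos?) can be sketched as follows. This is an illustration under assumptions, not the Regain source: the class name IsIndexOfSketch and the handling of out-of-range positions (returning false rather than throwing) are choices made for this example.

```java
// Hypothetical stand-alone sketch; not part of the Regain API.
final class IsIndexOfSketch {

    /**
     * Returns true if 'expected' occurs in 'content' starting exactly at 'pos'.
     * Out-of-range positions yield false (an assumption of this sketch).
     */
    static boolean isIndexOf(String content, String expected, int pos) {
        // Reject positions where 'expected' cannot fit into 'content'.
        if (pos < 0 || pos > content.length() - expected.length()) {
            return false;
        }
        // Compare character by character from 'pos' onwards.
        for (int i = 0; i < expected.length(); i++) {
            if (content.charAt(pos + i) != expected.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String html = "<html><title>Regain</title></html>";
        System.out.println(isIndexOf(html, "<title", 6)); // prints "true"
        System.out.println(isIndexOf(html, "<title", 0)); // prints "false"
    }
}
```

The standard library call content.startsWith(expected, pos) gives the same results as this loop, so a helper like this mainly serves readability at the call site.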