Regain 2.1.0-STABLE API | ||||||||
java.lang.Object
net.sf.regain.crawler.document.AbstractPreparator
net.sf.regain.crawler.preparator.HtmlPreparator
public class HtmlPreparator
Prepares an HTML document for indexing.
The document is parsed and its title is extracted.
| Field Summary | |
|---|---|
| private List<HtmlContentExtractor> | mContentExtractorList: The HtmlContentExtractor instances that cut the content to be indexed out of the HTML documents. |
| private static org.apache.log4j.Logger | mLog: The logger for this class. |
| private List<HtmlPathExtractor> | mPathExtractorList: The HtmlPathExtractor instances that extract the path from the HTML documents. |
| Fields inherited from interface net.sf.regain.crawler.document.Preparator |
|---|
| DEFAULT_BUFFER_SIZE |
| Constructor Summary | |
|---|---|
| HtmlPreparator() | Creates a new instance of HtmlPreparator. |
| Method Summary | |
|---|---|
| private String | extractHtmlTitle(String content): Extracts the title from an HTML document. |
| private int | getIntParam(Map<String,String> configSection, String paramName): Gets an int parameter from a configuration section. |
| void | init(PreparatorConfig config): Initializes the preparator. |
| private boolean | isIndexOf(String content, String expected, int pos): Checks whether an expected substring is at a certain position. |
| void | prepare(RawDocument rawDocument): Prepares a document for indexing. |
| Methods inherited from class net.sf.regain.crawler.document.AbstractPreparator |
|---|
| accepts, addAdditionalField, cleanUp, close, concatenateStringParts, getAdditionalFields, getCleanedContent, getCleanedMetaData, getHeadlines, getPath, getPriority, getSummary, getTitle, setCleanedContent, setCleanedMetaData, setHeadlines, setPath, setPriority, setSummary, setTitle, setUrlRegex |

| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Field Detail | 
|---|
private static org.apache.log4j.Logger mLog
private List<HtmlContentExtractor> mContentExtractorList
private List<HtmlPathExtractor> mPathExtractorList
| Constructor Detail | 
|---|
public HtmlPreparator()
               throws RegainException
Throws:
RegainException - If creating the preparator failed.

| Method Detail | 
|---|
public void init(PreparatorConfig config)
          throws RegainException
Specified by:
init in interface Pluggable
Overrides:
init in class AbstractPreparator
Parameters:
config - The configuration.
Throws:
RegainException - If the configuration has an error.
private int getIntParam(Map<String,String> configSection,
                        String paramName)
                 throws RegainException
Parameters:
configSection - The configuration section to get the int param from.
paramName - The name of the parameter.
Throws:
RegainException - If the parameter is not set or is not a number.
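
The contract above (fail when the parameter is not set or is not a number) can be illustrated with a minimal, self-contained sketch. This is not the Regain source: `IntParamSketch`, `ConfigException` (a stand-in for `RegainException`) and the parameter name `maxLength` are hypothetical, and trimming the value before parsing is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

class IntParamSketch {

    /** Hypothetical stand-in for RegainException so the sketch compiles on its own. */
    static class ConfigException extends Exception {
        ConfigException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Looks up paramName in configSection and parses the value as an int. */
    static int getIntParam(Map<String, String> configSection, String paramName)
            throws ConfigException {
        String asString = configSection.get(paramName);
        if (asString == null) {
            // Case 1 from the documentation: the parameter is not set.
            throw new ConfigException("Parameter '" + paramName + "' is not set", null);
        }
        try {
            // Assumption: surrounding whitespace is tolerated.
            return Integer.parseInt(asString.trim());
        } catch (NumberFormatException exc) {
            // Case 2 from the documentation: the parameter is not a number.
            throw new ConfigException(
                "Parameter '" + paramName + "' is not a number: " + asString, exc);
        }
    }

    public static void main(String[] args) throws ConfigException {
        Map<String, String> section = new HashMap<>();
        section.put("maxLength", "120"); // hypothetical parameter name
        System.out.println(getIntParam(section, "maxLength"));
    }
}
```

Wrapping the `NumberFormatException` keeps the original cause available to the caller while reporting both failure modes through a single exception type.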
public void prepare(RawDocument rawDocument)
             throws RegainException
Parameters:
rawDocument - The document to prepare.
Throws:
RegainException - If something goes wrong during preparation.
private String extractHtmlTitle(String content)
Parameters:
content - The content (the raw HTML data) of the document whose title is to be determined.
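
One way to extract a title from raw HTML is sketched below. This page does not document how `extractHtmlTitle` is implemented or what it returns when no title exists, so the regular expression, the whitespace normalization, the `null` result and the class name `HtmlTitleSketch` are all assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HtmlTitleSketch {

    // Matches <title ...>text</title>, ignoring case and allowing line breaks in the text.
    private static final Pattern TITLE_PATTERN = Pattern.compile(
        "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the text of the first title element, or null if none is found (assumed behavior). */
    static String extractHtmlTitle(String content) {
        Matcher matcher = TITLE_PATTERN.matcher(content);
        if (!matcher.find()) {
            return null;
        }
        // Collapse runs of whitespace (including line breaks) to single spaces.
        return matcher.group(1).replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Regain\n  search</title></head></html>";
        System.out.println(extractHtmlTitle(html));
    }
}
```

A regular expression is enough for a sketch, but it does not decode entities or handle malformed markup; a real HTML parser would be more robust.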
private boolean isIndexOf(String content,
                          String expected,
                          int pos)
Parameters:
content - The String in which to check for the expected substring.
expected - The expected substring.
pos - The position where the substring is expected.
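
The check described here can be expressed with the standard `String.startsWith(String, int)` method, as in the following sketch. Whether the Regain implementation does this, or compares case-insensitively, is not stated on this page; the class name `IndexOfSketch` is hypothetical.

```java
class IndexOfSketch {

    /** Returns true if expected occurs in content starting exactly at index pos. */
    static boolean isIndexOf(String content, String expected, int pos) {
        // startsWith with an offset returns false (rather than throwing)
        // when pos is negative or beyond the end of content.
        return content.startsWith(expected, pos);
    }

    public static void main(String[] args) {
        String html = "<html><title>Regain</title></html>";
        System.out.println(isIndexOf(html, "<title>", 6));
        System.out.println(isIndexOf(html, "<title>", 0));
    }
}
```

A case-insensitive variant could use `content.regionMatches(true, pos, expected, 0, expected.length())` instead.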