Regain 2.1.0-STABLE API

net.sf.regain.crawler.preparator
Class HtmlPreparator

java.lang.Object
  extended by net.sf.regain.crawler.document.AbstractPreparator
      extended by net.sf.regain.crawler.preparator.HtmlPreparator
All Implemented Interfaces:
Pluggable, Preparator, WriteablePreparator

public class HtmlPreparator
extends AbstractPreparator

Prepares an HTML document for indexing.

The document is parsed and its title is extracted.

Author:
Til Schneider, www.murfman.de

Field Summary
private  List<HtmlContentExtractor> mContentExtractorList
          The HtmlContentExtractors that cut the content to be indexed out of the HTML documents.
private static org.apache.log4j.Logger mLog
          The logger for this class
private  List<HtmlPathExtractor> mPathExtractorList
          The HtmlPathExtractors that extract the path from the HTML documents.
 
Fields inherited from interface net.sf.regain.crawler.document.Preparator
DEFAULT_BUFFER_SIZE
 
Constructor Summary
HtmlPreparator()
          Creates a new instance of HtmlPreparator.
 
Method Summary
private  String extractHtmlTitle(String content)
          Extracts the title from an HTML document.
private  int getIntParam(Map<String,String> configSection, String paramName)
          Gets an int parameter from a configuration section
 void init(PreparatorConfig config)
          Initializes the preparator.
private  boolean isIndexOf(String content, String expected, int pos)
          Checks whether an expected substring is at a certain position.
 void prepare(RawDocument rawDocument)
          Prepares a document for indexing.
 
Methods inherited from class net.sf.regain.crawler.document.AbstractPreparator
accepts, addAdditionalField, cleanUp, close, concatenateStringParts, getAdditionalFields, getCleanedContent, getCleanedMetaData, getHeadlines, getPath, getPriority, getSummary, getTitle, setCleanedContent, setCleanedMetaData, setHeadlines, setPath, setPriority, setSummary, setTitle, setUrlRegex
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

mLog

private static org.apache.log4j.Logger mLog
The logger for this class


mContentExtractorList

private List<HtmlContentExtractor> mContentExtractorList
The HtmlContentExtractors that cut the content to be indexed out of the HTML documents.


mPathExtractorList

private List<HtmlPathExtractor> mPathExtractorList
The HtmlPathExtractors that extract the path from the HTML documents.

Constructor Detail

HtmlPreparator

public HtmlPreparator()
               throws RegainException
Creates a new instance of HtmlPreparator.

Throws:
RegainException - If creating the preparator failed.
Method Detail

init

public void init(PreparatorConfig config)
          throws RegainException
Initializes the preparator.

Specified by:
init in interface Pluggable
Overrides:
init in class AbstractPreparator
Parameters:
config - The configuration.
Throws:
RegainException - If the configuration has an error.

getIntParam

private int getIntParam(Map<String,String> configSection,
                        String paramName)
                 throws RegainException
Gets an int parameter from a configuration section

Parameters:
configSection - The configuration section to get the int param from.
paramName - The name of the parameter
Returns:
The value of the parameter.
Throws:
RegainException - If the parameter is not set or is not a number.
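
The contract above (return the value, fail if the parameter is missing or not numeric) can be sketched as follows. This is a minimal illustration, not the Regain source: the class IntParamSketch and the exception ConfigException are hypothetical stand-ins (ConfigException takes the place of RegainException so the example compiles on its own), and trimming the raw value before parsing is an assumption.

```java
import java.util.Map;

/** Hypothetical stand-in class; not part of Regain. */
final class IntParamSketch {

  /** Stand-in for RegainException so that the sketch is self-contained. */
  static class ConfigException extends Exception {
    ConfigException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  /**
   * Reads paramName from configSection and parses it as an int.
   * Fails if the parameter is not set or is not a number.
   */
  static int getIntParam(Map<String, String> configSection, String paramName)
      throws ConfigException {
    String raw = configSection.get(paramName);
    if (raw == null) {
      throw new ConfigException("Parameter '" + paramName + "' is not set", null);
    }
    try {
      // Assumption: surrounding whitespace is tolerated.
      return Integer.parseInt(raw.trim());
    } catch (NumberFormatException exc) {
      throw new ConfigException(
          "Parameter '" + paramName + "' is not a number: " + raw, exc);
    }
  }
}
```

Wrapping the NumberFormatException keeps the original cause available to the caller while still reporting which parameter was at fault.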

prepare

public void prepare(RawDocument rawDocument)
             throws RegainException
Prepares a document for indexing.

Parameters:
rawDocument - The document to prepare.
Throws:
RegainException - If something goes wrong during preparation.

extractHtmlTitle

private String extractHtmlTitle(String content)
Extracts the title from an HTML document.

Parameters:
content - The content (the raw HTML data) of the document whose title is to be determined.
Returns:
The title of the HTML document.
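
One way such a title extraction could work is sketched below. The page does not show how HtmlPreparator actually locates the title, so everything here is an assumption: the class name HtmlTitleSketch, the use of a regular expression, the whitespace normalization, and returning null when no title element is found.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in class; not part of Regain. */
final class HtmlTitleSketch {

  // Matches the first <title>...</title> element, ignoring case and
  // allowing the title text to span several lines.
  private static final Pattern TITLE_PATTERN = Pattern.compile(
      "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

  /**
   * Returns the text of the first title element with whitespace runs
   * collapsed, or null if the content has no title element (assumption).
   */
  static String extractHtmlTitle(String content) {
    Matcher matcher = TITLE_PATTERN.matcher(content);
    if (!matcher.find()) {
      return null;
    }
    return matcher.group(1).replaceAll("\\s+", " ").trim();
  }
}
```

A regular expression is enough for a sketch like this, but it does not decode HTML entities or cope with malformed markup.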

isIndexOf

private boolean isIndexOf(String content,
                          String expected,
                          int pos)
Checks whether an expected substring is at a certain position.

Parameters:
content - The String in which to look for the expected substring.
expected - The expected substring.
pos - The position where the substring is expected.
Returns:
Whether the expected substring is really at this position.
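
The check described above corresponds closely to String.startsWith(String, int) from the Java standard library. The sketch below uses that call; the class name IsIndexOfSketch is hypothetical, and whether the Regain implementation compares case-sensitively is not stated on this page, so the case-sensitive comparison here is an assumption.

```java
/** Hypothetical stand-in class; not part of Regain. */
final class IsIndexOfSketch {

  /**
   * Returns true if expected occurs in content starting exactly at pos.
   * Out-of-range positions yield false rather than an exception.
   */
  static boolean isIndexOf(String content, String expected, int pos) {
    // Case-sensitive comparison (assumption).
    return content.startsWith(expected, pos);
  }
}
```

Compared with content.indexOf(expected) == pos, this form only inspects the characters at the given position instead of scanning the whole string.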

Regain 2.1.0-STABLE, Copyright (C) 2004-2010 Til Schneider, www.murfman.de, Thomas Tesche, www.clustersystems.info