Regain 2.1.0-STABLE API

net.sf.regain.crawler.preparator
Class JacobMsWordPreparator

java.lang.Object
  extended by net.sf.regain.crawler.document.AbstractPreparator
      extended by net.sf.regain.crawler.preparator.AbstractJacobMsOfficePreparator
          extended by net.sf.regain.crawler.preparator.JacobMsWordPreparator
All Implemented Interfaces:
Pluggable, Preparator, WriteablePreparator

public class JacobMsWordPreparator
extends AbstractJacobMsOfficePreparator

Prepares a Microsoft Word document for indexing using the Jacob API. Jacobgen was used to simplify access to that API.

The raw data of the document is stripped of formatting information, and the title is extracted.

Author:
Til Schneider, www.murfman.de

Field Summary
private  HashSet mHeadlineStyleNameSet
          The Word style names (style == format template) that are used by paragraphs holding a headline.
private static org.apache.log4j.Logger mLog
          The logger for this class
private  de.filiadata.lucene.spider.generated.msoffice2000.word.Application mWordApplication
          The Word application.
 
Fields inherited from interface net.sf.regain.crawler.document.Preparator
DEFAULT_BUFFER_SIZE
 
Constructor Summary
JacobMsWordPreparator()
          Creates a new instance of JacobMsWordPreparator.
 
Method Summary
private  void appendShape(de.filiadata.lucene.spider.generated.msoffice2000.word.Shape shape, StringBuffer buffer)
          Appends the text content of a shape to a StringBuffer.
 void close()
          Frees all resources reserved by the preparator.
private  String getSelection(de.filiadata.lucene.spider.generated.msoffice2000.word.Application wordAppl)
          Gets the currently selected text from a Word application.
 void init(PreparatorConfig config)
          Initializes the preparator.
 void prepare(RawDocument rawDocument)
          Prepares a document for indexing.
private  String removeBinaryStuff(String text)
          Removes all characters with a code less than 32 from the given String.
 
Methods inherited from class net.sf.regain.crawler.preparator.AbstractJacobMsOfficePreparator
readProperties
 
Methods inherited from class net.sf.regain.crawler.document.AbstractPreparator
accepts, addAdditionalField, cleanUp, concatenateStringParts, getAdditionalFields, getCleanedContent, getCleanedMetaData, getHeadlines, getPath, getPriority, getSummary, getTitle, setCleanedContent, setCleanedMetaData, setHeadlines, setPath, setPriority, setSummary, setTitle, setUrlRegex
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

mLog

private static org.apache.log4j.Logger mLog
The logger for this class


mWordApplication

private de.filiadata.lucene.spider.generated.msoffice2000.word.Application mWordApplication
The Word application. Is null as long as no document has been processed.


mHeadlineStyleNameSet

private HashSet mHeadlineStyleNameSet
The Word style names (style == format template) that are used by paragraphs holding a headline. Is null if no headline styles were configured.
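
The following is a minimal sketch of how a set of headline style names could be used to pick headline paragraphs, including the null case described above. It does not reproduce the class's code: the class `HeadlineFilter`, the method `collectHeadlines`, and the representation of a paragraph as a `{styleName, text}` pair are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Regain.
public class HeadlineFilter {

    /**
     * Returns the text of every paragraph whose style name is in the set.
     * Each paragraph is a two-element array: {styleName, text}.
     * A null set means no headline styles were configured, so nothing matches.
     */
    public static List<String> collectHeadlines(String[][] paragraphs,
                                                Set<String> headlineStyleNames) {
        List<String> headlines = new ArrayList<>();
        if (headlineStyleNames == null) {
            return headlines;
        }
        for (String[] paragraph : paragraphs) {
            if (headlineStyleNames.contains(paragraph[0])) {
                headlines.add(paragraph[1]);
            }
        }
        return headlines;
    }

    public static void main(String[] args) {
        // Style names here are example values only.
        Set<String> styles = new HashSet<>(Arrays.asList("Heading 1", "Heading 2"));
        String[][] paragraphs = {
            {"Heading 1", "Introduction"},
            {"Normal", "Some body text."},
            {"Heading 2", "Background"}
        };
        System.out.println(collectHeadlines(paragraphs, styles));
    }
}
```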

Constructor Detail

JacobMsWordPreparator

public JacobMsWordPreparator()
                      throws RegainException
Creates a new instance of JacobMsWordPreparator.

Throws:
RegainException - If creating the preparator failed.
Method Detail

init

public void init(PreparatorConfig config)
          throws RegainException
Initializes the preparator.

Specified by:
init in interface Pluggable
Overrides:
init in class AbstractJacobMsOfficePreparator
Parameters:
config - The configuration
Throws:
RegainException - If the configuration has an error.

prepare

public void prepare(RawDocument rawDocument)
             throws RegainException
Prepares a document for indexing.

Parameters:
rawDocument - The document to prepare.
Throws:
RegainException - If the preparation failed.

getSelection

private String getSelection(de.filiadata.lucene.spider.generated.msoffice2000.word.Application wordAppl)
Gets the currently selected text from a Word application.

Parameters:
wordAppl - The Word application to get the selected text from.
Returns:
The currently selected text.

appendShape

private void appendShape(de.filiadata.lucene.spider.generated.msoffice2000.word.Shape shape,
                         StringBuffer buffer)
Appends the text content of a shape to a StringBuffer.

Parameters:
shape - The shape to add.
buffer - The buffer to append the text to.

removeBinaryStuff

private String removeBinaryStuff(String text)
Removes all characters with a code less than 32 from the given String.

Parameters:
text - The String to remove the binary stuff from.
Returns:
The cleaned String.
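
As an illustration of the behavior described for removeBinaryStuff, the sketch below drops every character whose code is below 32. The actual method is private and its source is not shown here, so this is an assumption about one plausible implementation. The class name `ControlCharStripper` is hypothetical, and whether the original drops such characters or replaces them (for example with spaces) is not stated.

```java
// Hypothetical stand-alone class, not part of Regain.
public class ControlCharStripper {

    /** Returns a copy of the text without characters whose code is below 32. */
    public static String removeBinaryStuff(String text) {
        StringBuilder cleaned = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Codes 0..31 are control characters (this includes tab and newline).
            if (c >= 32) {
                cleaned.append(c);
            }
        }
        return cleaned.toString();
    }

    public static void main(String[] args) {
        String raw = "Hello" + (char) 7 + " Wor" + (char) 1 + "ld";
        System.out.println(removeBinaryStuff(raw)); // prints "Hello World"
    }
}
```

Note that this sketch also removes tabs and line breaks, since their codes are below 32.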

close

public void close()
           throws RegainException
Frees all resources reserved by the preparator.

Is called at the end of the crawler process after all documents were processed.

Specified by:
close in interface Preparator
Overrides:
close in class AbstractPreparator
Throws:
RegainException - If freeing the resources failed.
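
The field and method descriptions on this page suggest a lifecycle in which the Word application is created lazily (it is null until a document is processed) and released in close() once the crawler has finished. The sketch below illustrates that pattern only. `LazyResourceSketch` and `FakeApplication` are hypothetical stand-ins; no Jacob or Regain classes are used, and the real preparator may manage its resources differently.

```java
// Hypothetical illustration of lazy creation and release of a shared resource.
public class LazyResourceSketch {

    /** Stand-in for an expensive external application handle. */
    static class FakeApplication {
        boolean running = true;

        void quit() {
            running = false;
        }
    }

    // Stays null as long as no document was processed.
    private FakeApplication application;
    private int preparedCount = 0;

    /** Starts the application on first use, then reuses it for later documents. */
    public void prepare(String documentPath) {
        if (application == null) {
            application = new FakeApplication();
        }
        preparedCount++;
    }

    /** Frees the application if it was ever started; safe to call otherwise. */
    public void close() {
        if (application != null) {
            application.quit();
            application = null;
        }
    }

    public boolean isApplicationStarted() {
        return application != null;
    }

    public int getPreparedCount() {
        return preparedCount;
    }
}
```

Creating the handle on first use avoids starting an external application when no matching document is ever crawled, and the null check in close() keeps the call safe in that case.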

Regain 2.1.0-STABLE, Copyright (C) 2004-2010 Til Schneider, www.murfman.de, Thomas Tesche, www.clustersystems.info