Regain 2.1.0-STABLE API
java.lang.Object
  net.sf.regain.crawler.document.AbstractPreparator
    net.sf.regain.crawler.preparator.AbstractJacobMsOfficePreparator
      net.sf.regain.crawler.preparator.JacobMsWordPreparator
public class JacobMsWordPreparator
Prepares a Microsoft Word document for indexing using the Jacob API; Jacobgen was used to make access easier.
In the process, the document's raw data is stripped of formatting information and the title is extracted.
Field Summary | |
---|---|
private HashSet |
mHeadlineStyleNameSet
The word style names (style == format template) that are used by paragraphs holding a headline. |
private static org.apache.log4j.Logger |
mLog
The logger for this class |
private de.filiadata.lucene.spider.generated.msoffice2000.word.Application |
mWordApplication
The word application. |
Fields inherited from interface net.sf.regain.crawler.document.Preparator |
---|
DEFAULT_BUFFER_SIZE |
Constructor Summary | |
---|---|
JacobMsWordPreparator()
Creates a new instance of JacobMsWordPreparator. |
Method Summary | |
---|---|
private void |
appendShape(de.filiadata.lucene.spider.generated.msoffice2000.word.Shape shape,
StringBuffer buffer)
Appends the text content of a shape to a StringBuffer. |
void |
close()
Frees all resources reserved by the preparator. |
private String |
getSelection(de.filiadata.lucene.spider.generated.msoffice2000.word.Application wordAppl)
Gets the currently selected text from a Word application. |
void |
init(PreparatorConfig config)
Initializes the preparator. |
void |
prepare(RawDocument rawDocument)
Prepares a document for indexing. |
private String |
removeBinaryStuff(String text)
Removes all characters with a code less than 32 from the given String. |
Methods inherited from class net.sf.regain.crawler.preparator.AbstractJacobMsOfficePreparator |
---|
readProperties |
Methods inherited from class net.sf.regain.crawler.document.AbstractPreparator |
---|
accepts, addAdditionalField, cleanUp, concatenateStringParts, getAdditionalFields, getCleanedContent, getCleanedMetaData, getHeadlines, getPath, getPriority, getSummary, getTitle, setCleanedContent, setCleanedMetaData, setHeadlines, setPath, setPriority, setSummary, setTitle, setUrlRegex |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
private static org.apache.log4j.Logger mLog
private de.filiadata.lucene.spider.generated.msoffice2000.word.Application mWordApplication
The Word application. Is null as long as no document was processed.
private HashSet mHeadlineStyleNameSet
The Word style names (style == format template) that are used by paragraphs holding a headline. Is null if no headline styles were configured.
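The null convention described for mHeadlineStyleNameSet can be illustrated with a minimal sketch. The class name HeadlineStyleCheck and the method isHeadlineStyle are hypothetical and not part of Regain; the sketch also uses a generic Set<String>, whereas the documented field is a raw HashSet. It only shows one way a paragraph's style name could be tested against the configured set while treating null as "no headline styles configured".

```java
import java.util.Set;

// Hypothetical helper, not part of the Regain API.
final class HeadlineStyleCheck {

    private HeadlineStyleCheck() {
    }

    /**
     * Returns true if the given Word style name is one of the configured
     * headline styles. A null set means that no headline styles were
     * configured, so no paragraph is treated as a headline.
     */
    static boolean isHeadlineStyle(Set<String> headlineStyleNames, String styleName) {
        if (headlineStyleNames == null || styleName == null) {
            return false;
        }
        return headlineStyleNames.contains(styleName);
    }
}
```

The style names passed in would come from the preparator configuration; names such as "Heading 1" are only illustrative here.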
Constructor Detail |
---|
public JacobMsWordPreparator() throws RegainException
RegainException
- If creating the preparator failed.
Method Detail |
---|
public void init(PreparatorConfig config) throws RegainException
init
in interface Pluggable
init
in class AbstractJacobMsOfficePreparator
config
- The configuration
RegainException
- If the configuration has an error.
public void prepare(RawDocument rawDocument) throws RegainException
rawDocument
- The document to prepare.
RegainException
- If the preparation failed.
private String getSelection(de.filiadata.lucene.spider.generated.msoffice2000.word.Application wordAppl)
wordAppl
- The Word application to get the selected text from.
private void appendShape(de.filiadata.lucene.spider.generated.msoffice2000.word.Shape shape, StringBuffer buffer)
shape
- The shape to add.
buffer
- The buffer to append the text to.
private String removeBinaryStuff(String text)
text
- The String where to remove the binary stuff.
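Since removeBinaryStuff is private, its body is not shown on this page. The following is a minimal sketch of the documented behaviour, assuming that "characters less than 32" refers to the char code; the class name BinaryStuffRemover is hypothetical and the actual Regain implementation may differ.

```java
// Hypothetical stand-alone sketch, not the Regain source.
final class BinaryStuffRemover {

    private BinaryStuffRemover() {
    }

    /**
     * Returns a copy of the text without any character whose code is
     * below 32. Taken literally, this also drops tabs and line breaks.
     */
    static String removeBinaryStuff(String text) {
        StringBuilder cleaned = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= 32) {
                cleaned.append(c);
            }
        }
        return cleaned.toString();
    }
}
```

Read literally, the rule removes tab and newline characters as well, so any paragraph separation would have to be handled before this step.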
public void close() throws RegainException
Is called at the end of the crawler process after all documents were processed.
close
in interface Preparator
close
in class AbstractPreparator
RegainException
- If freeing the resources failed.
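The call order implied by init, prepare and close can be sketched as follows. All types here (SketchPreparator, RecordingPreparator, the crawl method) are hypothetical stand-ins rather than the Regain interfaces, and the document is reduced to a plain String; the sketch only illustrates that close is called once, after all documents were processed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical lifecycle sketch, not the Regain crawler.
final class PreparatorLifecycleSketch {

    private PreparatorLifecycleSketch() {
    }

    /** Stand-in for a preparator with the three lifecycle methods. */
    interface SketchPreparator {
        void init();
        void prepare(String documentUrl);
        void close();
    }

    /** Records every call so that the order can be inspected. */
    static final class RecordingPreparator implements SketchPreparator {
        final List<String> calls = new ArrayList<>();

        public void init() {
            calls.add("init");
        }

        public void prepare(String documentUrl) {
            calls.add("prepare:" + documentUrl);
        }

        public void close() {
            calls.add("close");
        }
    }

    /** Initializes once, prepares each document, then always closes. */
    static void crawl(SketchPreparator preparator, List<String> documentUrls) {
        preparator.init();
        try {
            for (String url : documentUrls) {
                preparator.prepare(url);
            }
        } finally {
            // Free resources even if preparing a document failed.
            preparator.close();
        }
    }
}
```

Placing close in a finally block is a design choice of this sketch: a preparator that holds an external application instance should release it even when a document cannot be prepared.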