Regain 2.1.0-STABLE API

net.sf.regain.crawler.preparator
Class MessagePreparator

java.lang.Object
  extended by net.sf.regain.crawler.document.AbstractPreparator
      extended by net.sf.regain.crawler.preparator.MessagePreparator
All Implemented Interfaces:
Pluggable, Preparator, WriteablePreparator

public class MessagePreparator
extends AbstractPreparator

This class prepares messages (MIME, RFC 822), specifically spoof email messages, for indexing.

The document contains the message text and the file names of the attachments.

Author:
Thomas Tesche, www.thtesche.com, Kevin Black (KJB)
See Also:
MessagePreparator

Field Summary
private static org.apache.log4j.Logger mLog
          The logger for this class
private static java.util.regex.Pattern mURLPattern
          Compiled regular expression that matches URLs in the message body.
 
Fields inherited from interface net.sf.regain.crawler.document.Preparator
DEFAULT_BUFFER_SIZE
 
Constructor Summary
MessagePreparator()
          Creates a new instance of MessagePreparator.
 
Method Summary
private  Collection<String> extractURLs(String text)
          Extract URLs from text source.
private  javax.mail.Address[] fixAddress(javax.mail.internet.AddressException ae, javax.mail.internet.MimeMessage message, String headerName)
          Addresses occasionally use semicolons rather than commas as separators, which causes an "Illegal semicolon, not in group" AddressException.
static String inputStreamAsString(InputStream stream)
          Get the content of an InputStream as String.
 void prepare(RawDocument rawDocument)
          Prepares the document for indexing.
private  String stripNoneWordChars(String uncleanString)
          Removes unwanted chars from a given string.
 
Methods inherited from class net.sf.regain.crawler.document.AbstractPreparator
accepts, addAdditionalField, cleanUp, close, concatenateStringParts, getAdditionalFields, getCleanedContent, getCleanedMetaData, getHeadlines, getPath, getPriority, getSummary, getTitle, init, setCleanedContent, setCleanedMetaData, setHeadlines, setPath, setPriority, setSummary, setTitle, setUrlRegex
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

mLog

private static org.apache.log4j.Logger mLog
The logger for this class


mURLPattern

private static java.util.regex.Pattern mURLPattern
Compiled regular expression that matches URLs in the message body.

Constructor Detail

MessagePreparator

public MessagePreparator()
                  throws RegainException
Creates a new instance of MessagePreparator.

Throws:
RegainException - If creating the preparator failed.
Method Detail

prepare

public void prepare(RawDocument rawDocument)
             throws RegainException
Prepares the document for indexing.

Parameters:
rawDocument - The document to prepare.
Throws:
RegainException - If the preparation fails.

fixAddress

private javax.mail.Address[] fixAddress(javax.mail.internet.AddressException ae,
                                        javax.mail.internet.MimeMessage message,
                                        String headerName)
Addresses occasionally use semicolons rather than commas as separators, which causes an "Illegal semicolon, not in group" AddressException. This helper function attempts to change the semicolons to commas and return the resulting Address array.

Parameters:
ae - Address Exception object
message - MIME Message object
headerName - Name of header, e.g. To, From, Reply-To
Returns:
if available, array of Addresses; otherwise, null
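
The semicolon-to-comma repair described above can be sketched roughly as follows. This is a minimal illustration, not the class's actual code: the class name AddressHeaderSketch and the method semicolonsToCommas are hypothetical, and the sketch works on the raw header string only. In the real helper the cleaned string would presumably be parsed again, for example with javax.mail.internet.InternetAddress.parse(String), which is left out here to keep the example free of external dependencies.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: normalizes an address header that uses semicolons
// as separators so that a comma-based address parser can accept it.
final class AddressHeaderSketch {

    // Splits on semicolons, trims each entry, drops empty entries (such as
    // the one left by a trailing semicolon), and rejoins with commas.
    // Caveat: this also rewrites semicolons inside quoted display names or
    // RFC 822 group syntax, so it is only suitable as a fallback after
    // normal parsing has already failed.
    static String semicolonsToCommas(String rawHeader) {
        List<String> parts = new ArrayList<>();
        for (String part : rawHeader.split(";")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                parts.add(trimmed);
            }
        }
        return String.join(", ", parts);
    }

    public static void main(String[] args) {
        System.out.println(semicolonsToCommas("a@example.com; b@example.com;"));
    }
}
```

Applying the rewrite only after an AddressException, as the description above suggests, leaves well-formed headers untouched.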

extractURLs

private Collection<String> extractURLs(String text)
Extract URLs from text source. The org.htmlparser functions were tried here (e.g. http://notetodogself.blogspot.com/2007/11/extract-links-using-htmlparser.html), but that technique missed quite a few URLs, so a regular expression is used instead.

Parameters:
text - input string of text or HTML
Returns:
Collection of the URLs found, i.e. strings matching the schemes https?|ftp|mailto
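
A regular-expression approach of this kind might look like the sketch below. The actual value of mURLPattern is not shown on this page, so the pattern, the class name UrlExtractionSketch, and the method name extractUrls are assumptions made for illustration; only the three scheme alternatives (https?, ftp, mailto) come from the description above.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of regex-based URL extraction from text or HTML.
final class UrlExtractionSketch {

    // Assumed pattern: a scheme (http, https, ftp, or mailto) followed by
    // every character up to whitespace, a quote, or an angle bracket.
    // Trailing punctuation such as a sentence-ending dot is not stripped.
    private static final Pattern URL_PATTERN = Pattern.compile(
            "\\b(?:(?:https?|ftp)://|mailto:)[^\\s\"'<>]+",
            Pattern.CASE_INSENSITIVE);

    // Collects every match in document order; duplicates are kept.
    static Collection<String> extractUrls(String text) {
        Collection<String> urls = new ArrayList<>();
        Matcher matcher = URL_PATTERN.matcher(text);
        while (matcher.find()) {
            urls.add(matcher.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://example.com/a?b=1\">link</a> "
                + "or mailto:joe@example.com now";
        System.out.println(extractUrls(html));
    }
}
```

Because the pattern runs over the raw string, it also finds URLs in plain-text bodies and in malformed markup, where an HTML parser has no link element to report.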

stripNoneWordChars

private String stripNoneWordChars(String uncleanString)
Removes unwanted chars from a given string.

Parameters:
uncleanString - the string to clean
Returns:
the cleaned string
inputStreamAsString

public static String inputStreamAsString(InputStream stream)
                                  throws IOException
Get the content of an InputStream as String.

Parameters:
stream - the InputStream
Returns:
the converted content as a String
Throws:
IOException
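
For illustration, a stream-to-string conversion with the same signature could be written as below. This is a sketch under stated assumptions rather than the class's implementation: the class name StreamToStringSketch is hypothetical, and the UTF-8 charset, the 4096-character buffer, and closing the stream at the end are choices made here, not behavior documented on this page.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of reading a whole InputStream into a String.
final class StreamToStringSketch {

    // Decodes the stream as UTF-8 (an assumption) and reads it in chunks
    // until end of stream. The stream is closed when reading finishes.
    static String inputStreamAsString(InputStream stream) throws IOException {
        StringBuilder content = new StringBuilder();
        try (InputStreamReader reader =
                new InputStreamReader(stream, StandardCharsets.UTF_8)) {
            char[] buffer = new char[4096];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                content.append(buffer, 0, read);
            }
        }
        return content.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = "Subject: hello\nBody text".getBytes(StandardCharsets.UTF_8);
        System.out.println(inputStreamAsString(new ByteArrayInputStream(raw)));
    }
}
```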

Regain 2.1.0-STABLE, Copyright (C) 2004-2010 Til Schneider, www.murfman.de, Thomas Tesche, www.clustersystems.info