Regain 2.1.0-STABLE API
java.lang.Object
  net.sf.regain.crawler.preparator.util.StripEntities
public class StripEntities
Strips HTML entities such as &quot; from a file, replacing them with their
Unicode equivalents. The methods can be used on text strings as well. Does not
strip tags, just entities. No longer requires entitiestochar.ser in the jar.
 
| Field Summary | |
|---|---|
| private static boolean | DEBUGGING: true to enable the testing code. |
| private static HashMap | entityToChar: allows lookup by entity name, to get the corresponding char. |
| static int | LONGEST_ENTITY: the longest an entity can be is 10 characters, at least in our tables, including the leading & and trailing ; |
| static int | SHORTEST_ENTITY: the shortest an entity can be is 4 characters, at least in our tables, including the leading & and trailing ; |
| Constructor Summary | |
|---|---|
| StripEntities() | |
| Method Summary | |
|---|---|
| static char | entityToChar(String entity): converts an entity to a single char. |
| static void | main(String[] args): test harness. |
| static char | possEntityToChar(String possEntity): checks a number of gauntlet conditions to ensure this is a valid entity. |
| static String | stripEntities(String text): converts HTML to text, converting entities such as &quot; back to " and &lt; back to <. Ordinary text passes through unchanged. |
| static String | stripHTMLTags(String html): removes tags from HTML, leaving just the raw text. |
| static String | stripNbsp(String text): converts all 160-style spaces (the result of stripEntities on &nbsp;) to ordinary spaces. |
| Methods inherited from class java.lang.Object | 
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Field Detail | 
|---|
public static final int LONGEST_ENTITY
public static final int SHORTEST_ENTITY
private static final boolean DEBUGGING
private static HashMap entityToChar
| Constructor Detail | 
|---|
public StripEntities()
| Method Detail | 
|---|
public static String stripNbsp(String text)
text - Text to convert
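
The conversion that stripNbsp describes can be sketched in a few lines. This is an illustrative stand-in, not Regain's source: the class name NbspSketch is hypothetical, and it assumes that "160-style space" means the no-break space U+00A0 (decimal 160).

```java
// Hypothetical stand-in for illustration only; not the Regain implementation.
final class NbspSketch {
    /**
     * Replaces every no-break space (U+00A0, decimal 160) with an ordinary
     * space (U+0020). All other characters are left untouched.
     */
    static String stripNbsp(String text) {
        return text.replace('\u00A0', ' ');
    }
}
```

With this sketch, NbspSketch.stripNbsp("a\u00A0b") returns "a b".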
public static void main(String[] args)
args - not used.
public static String stripEntities(String text)
text - raw text to be processed. Must not be null.
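
To make the behaviour of stripEntities concrete, here is a minimal sketch of one way such a scan could work. It is not the Regain source: the class name EntityScanSketch and its five-entry table are assumptions made for illustration, and only the bounds 4 and 10 are taken from the SHORTEST_ENTITY and LONGEST_ENTITY descriptions above. Numeric entities are left out to keep it short.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for illustration only; not the Regain implementation.
final class EntityScanSketch {
    // Lengths include the leading & and trailing ; as in the field summary.
    static final int SHORTEST_ENTITY = 4;
    static final int LONGEST_ENTITY = 10;

    // Deliberately tiny table; a real decoder would need many more entries.
    private static final Map<String, Character> TABLE = new HashMap<>();
    static {
        TABLE.put("quot", '"');
        TABLE.put("amp", '&');
        TABLE.put("lt", '<');
        TABLE.put("gt", '>');
        TABLE.put("nbsp", '\u00A0');
    }

    /** Replaces known named entities; everything else is copied unchanged. */
    static String stripEntities(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == '&') {
                int semi = text.indexOf(';', i + 1);
                int length = semi - i + 1; // only meaningful when semi != -1
                if (semi != -1 && length >= SHORTEST_ENTITY && length <= LONGEST_ENTITY) {
                    String name = text.substring(i + 1, semi).toLowerCase(Locale.ROOT);
                    Character decoded = TABLE.get(name);
                    if (decoded != null) {
                        out.append(decoded.charValue());
                        i = semi + 1; // skip past the whole entity
                        continue;
                    }
                }
            }
            out.append(c); // ordinary text, or an & that starts no known entity
            i++;
        }
        return out.toString();
    }
}
```

An ampersand that does not start a recognised entity, as in "a & b", is copied through as-is, which matches the statement that ordinary text passes unchanged.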
public static char possEntityToChar(String possEntity)
possEntity - string that may hold an entity. The leading & must already be
stripped, but the string may contain text past the ;
public static char entityToChar(String entity)
entity - String entity to convert. Must have the leading & and trailing ;
stripped; may be a #x123 or #123 style entity. Works faster if the
entity is in lower case.
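
The parameter contracts of possEntityToChar and entityToChar can be illustrated with a short sketch. It is written under several assumptions that the documentation above does not confirm: the class name EntityCharSketch is hypothetical, a return value of 0 is used to signal "not an entity", the hexadecimal form is taken to be #x followed by hex digits, and only a handful of named entities are recognised.

```java
import java.util.Locale;

// Hypothetical stand-in for illustration only; not the Regain implementation.
final class EntityCharSketch {
    // Lengths include the leading & and trailing ; as in the field summary.
    static final int SHORTEST_ENTITY = 4;
    static final int LONGEST_ENTITY = 10;

    /**
     * possEntity has its leading & stripped and may run past the ;.
     * Returns 0 (an assumption of this sketch) when no valid entity is found.
     */
    static char possEntityToChar(String possEntity) {
        int semi = possEntity.indexOf(';');
        // With the & removed, the ; of a valid entity sits at an index
        // between SHORTEST_ENTITY - 2 and LONGEST_ENTITY - 2.
        if (semi < SHORTEST_ENTITY - 2 || semi > LONGEST_ENTITY - 2) {
            return 0;
        }
        return entityToChar(possEntity.substring(0, semi));
    }

    /** entity has both the leading & and the trailing ; stripped. */
    static char entityToChar(String entity) {
        if (entity.isEmpty()) {
            return 0;
        }
        if (entity.charAt(0) == '#') {
            boolean hex = entity.length() > 1
                    && (entity.charAt(1) == 'x' || entity.charAt(1) == 'X');
            try {
                int code = hex
                        ? Integer.parseInt(entity.substring(2), 16)
                        : Integer.parseInt(entity.substring(1), 10);
                // Only code points that fit in a single char are accepted.
                return (code > 0 && code <= Character.MAX_VALUE) ? (char) code : 0;
            } catch (NumberFormatException e) {
                return 0; // e.g. "#", "#x", or non-digit content
            }
        }
        switch (entity.toLowerCase(Locale.ROOT)) {
            case "quot": return '"';
            case "amp":  return '&';
            case "lt":   return '<';
            case "gt":   return '>';
            case "nbsp": return '\u00A0';
            default:     return 0;
        }
    }
}
```

In this sketch, EntityCharSketch.entityToChar("#123") and EntityCharSketch.entityToChar("#x7B") both return '{', and EntityCharSketch.possEntityToChar("lt; b") returns '<'.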
public static String stripHTMLTags(String html)
html - input HTML
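
For orientation, a naive form of tag removal can be sketched with a regular expression. This is not how Regain is known to implement stripHTMLTags: the class name TagStripSketch is hypothetical, and the pattern assumes well-formed tags with no > inside attribute values, comments, or scripts.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for illustration only; not the Regain implementation.
final class TagStripSketch {
    // Matches a < followed by any run of non-> characters and a closing >.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    /** Deletes everything that looks like a tag and keeps the text between tags. */
    static String stripHTMLTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }
}
```

TagStripSketch.stripHTMLTags("<p>Hello <b>world</b></p>") returns "Hello world". Entities in the remaining text are untouched, so a decoding step would still be needed afterwards.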