Regain 2.1.0-STABLE API
java.lang.Object
net.sf.regain.crawler.preparator.util.StripEntities
public class StripEntities
Strips HTML entities such as &quot; from a file, replacing them with their Unicode equivalents. The methods can be used on text strings as well. Does not strip tags, just entities. No longer requires entitiestochar.ser in the jar.
Field Summary | |
---|---|
private static boolean | DEBUGGING: true to enable the testing code. |
private static HashMap | entityToChar: allows lookup by entity name to get the corresponding char. |
static int | LONGEST_ENTITY: the longest an entity can be is 10 characters, at least in our tables, including the leading & and trailing ; |
static int | SHORTEST_ENTITY: the shortest an entity can be is 4 characters, at least in our tables, including the leading & and trailing ; |
Constructor Summary |
---|
StripEntities() |
Method Summary | |
---|---|
static char | entityToChar(String entity): converts an entity to a single char. |
static void | main(String[] args): test harness. |
static char | possEntityToChar(String possEntity): checks a number of gauntlet conditions to ensure this is a valid entity. |
static String | stripEntities(String text): converts HTML to text, converting entities such as &quot; back to " and &lt; back to <. Ordinary text passes unchanged. |
static String | stripHTMLTags(String html): removes tags from HTML, leaving just the raw text. |
static String | stripNbsp(String text): converts all 160-style spaces (the result of stripEntities on &nbsp;) to ordinary spaces. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final int LONGEST_ENTITY
public static final int SHORTEST_ENTITY
private static final boolean DEBUGGING
private static HashMap entityToChar
Constructor Detail |
---|
public StripEntities()
Method Detail |
---|
public static String stripNbsp(String text)
Parameters: text - Text to convert
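The conversion described for stripNbsp can be illustrated with a minimal sketch. The class name NbspSketch is hypothetical, and the body is an assumption about the behavior (replace char 160 with char 32), not the library's actual source.

```java
// Hypothetical illustration; not the Regain implementation.
final class NbspSketch {
    /** Replaces every non-breaking space (char 160) with an ordinary space (char 32). */
    static String stripNbsp(String text) {
        return text.replace('\u00A0', ' ');
    }
}
```

A single-character replace keeps the string length unchanged, which is convenient when offsets into the text are tracked elsewhere.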
public static void main(String[] args)
Parameters: args - not used.
public static String stripEntities(String text)
Parameters: text - raw text to be processed. Must not be null.
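To make the behavior of stripEntities concrete, here is a rough sketch of one way such a scan could work: copy ordinary text through, and replace a recognized &name; sequence with its character. The class name StripEntitiesSketch and the five-entry table are assumptions for illustration only; the documented class presumably uses its own, larger table and may differ in detail.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; not the Regain implementation.
final class StripEntitiesSketch {
    // Tiny illustrative table, not the class's real one.
    private static final Map<String, Character> TABLE = new HashMap<>();
    static {
        TABLE.put("quot", '"');
        TABLE.put("amp", '&');
        TABLE.put("lt", '<');
        TABLE.put("gt", '>');
        TABLE.put("nbsp", '\u00A0');
    }

    /** Replaces known entities with their chars; everything else passes unchanged. */
    static String stripEntities(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == '&') {
                int semi = text.indexOf(';', i + 1);
                if (semi > i + 1) {
                    Character ch = TABLE.get(text.substring(i + 1, semi));
                    if (ch != null) {
                        out.append(ch.charValue());
                        i = semi + 1;   // skip past the whole entity
                        continue;
                    }
                }
            }
            out.append(c);              // ordinary char, or an & that is not an entity
            i++;
        }
        return out.toString();
    }
}
```

An unrecognized sequence such as the "&T;" in "AT&T;" is left as it is, so text that merely looks like an entity is not damaged.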
public static char possEntityToChar(String possEntity)
Parameters: possEntity - string that may hold an entity. The lead & must be stripped, but the string may contain text past the ;
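The "gauntlet" idea can be sketched as a series of cheap checks that reject a candidate before any table lookup. The sketch below reuses the documented bounds of 4 and 10 (which count the & and the ;). The class name PossEntitySketch, the small table, and the convention of returning 0 for "not an entity" are all assumptions, not taken from the library.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; not the Regain implementation.
final class PossEntitySketch {
    // Bounds as documented: lengths include the lead & and trailing ;
    static final int SHORTEST_ENTITY = 4;
    static final int LONGEST_ENTITY = 10;

    // Tiny illustrative table, not the class's real one.
    private static final Map<String, Character> TABLE = new HashMap<>();
    static {
        TABLE.put("quot", '"');
        TABLE.put("amp", '&');
        TABLE.put("lt", '<');
        TABLE.put("gt", '>');
    }

    /** Returns the entity's char, or 0 if possEntity does not start with a known entity (assumed convention). */
    static char possEntityToChar(String possEntity) {
        // The name sits between & and ; so its length excludes those two chars.
        int minName = SHORTEST_ENTITY - 2;
        int maxName = LONGEST_ENTITY - 2;
        if (possEntity.length() < minName + 1) {
            return 0;                       // too short to hold a name plus ;
        }
        int semi = possEntity.indexOf(';');
        if (semi < minName || semi > maxName) {
            return 0;                       // ; missing, too near, or too far
        }
        Character ch = TABLE.get(possEntity.substring(0, semi));
        return ch == null ? (char) 0 : ch.charValue();
    }
}
```

The length checks cost almost nothing, so a stray & in ordinary text is usually rejected without a substring allocation or a map lookup.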
public static char entityToChar(String entity)
Parameters: entity - String entity to convert. Must have the lead & and trailing ; stripped; may be a numeric entity in #x7B or #123 style. Works faster if the entity is in lower case.
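A lookup along the lines described for entityToChar might handle numeric forms first and then fall back to a name table. In the sketch below, the class name EntityToCharSketch, the table contents, the return value 0 for an unknown entity, and the lower-case retry (one possible reading of "works faster if entity in lower case") are assumptions. The hex form is written #x7B here, following standard HTML notation.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration; not the Regain implementation.
final class EntityToCharSketch {
    // Tiny illustrative table, not the class's real one.
    private static final Map<String, Character> TABLE = new HashMap<>();
    static {
        TABLE.put("quot", '"');
        TABLE.put("amp", '&');
        TABLE.put("lt", '<');
        TABLE.put("gt", '>');
    }

    /** Converts an entity (no lead &, no trailing ;) to a char, or 0 if unknown (assumed convention). */
    static char entityToChar(String entity) {
        if (entity.isEmpty()) {
            return 0;
        }
        if (entity.charAt(0) == '#') {
            try {
                boolean hex = entity.length() > 1
                        && (entity.charAt(1) == 'x' || entity.charAt(1) == 'X');
                int code = hex
                        ? Integer.parseInt(entity.substring(2), 16)
                        : Integer.parseInt(entity.substring(1), 10);
                // Only values that fit in a single Java char are accepted.
                return (code >= 0 && code <= Character.MAX_VALUE) ? (char) code : 0;
            } catch (NumberFormatException e) {
                return 0;               // malformed number
            }
        }
        Character ch = TABLE.get(entity);   // fast path: exact match
        if (ch == null) {
            // Slower fallback: retry in lower case.
            ch = TABLE.get(entity.toLowerCase(Locale.ROOT));
        }
        return ch == null ? (char) 0 : ch.charValue();
    }
}
```

Because the return type is a single char, code points above U+FFFF cannot be represented in this sketch and are rejected.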
public static String stripHTMLTags(String html)
Parameters: html - input HTML
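Tag removal of the kind stripHTMLTags describes can be approximated by dropping everything between a < and the next >. The sketch below is a deliberately naive assumption about the approach: the class name StripTagsSketch is hypothetical, and it does not handle comments, scripts, or a > inside an attribute value.

```java
// Hypothetical illustration; not the Regain implementation.
final class StripTagsSketch {
    /** Removes anything between < and the next >, keeping the text outside tags. */
    static String stripHTMLTags(String html) {
        StringBuilder out = new StringBuilder(html.length());
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;           // start skipping
            } else if (c == '>' && inTag) {
                inTag = false;          // tag closed, resume copying
            } else if (!inTag) {
                out.append(c);          // text outside any tag
            }
        }
        return out.toString();
    }
}
```

Entities are untouched by this pass, so a caller wanting plain text would presumably still run the result through stripEntities.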