Regain 2.1.0-STABLE API

net.sf.regain.crawler.preparator.util
Class StripEntities

java.lang.Object
  extended by net.sf.regain.crawler.preparator.util.StripEntities

public class StripEntities
extends Object

Strips HTML entities such as &quot; from a file, replacing them with their
Unicode equivalents. The methods can be used on text strings as well. Does not
strip tags, just entities. No longer requires entitiestochar.ser in the jar!
 

Since:
2002 July 14 version

version 1.0 - initial version

version 1.1 - optimise using text.indexOf('&') and sb.append(string) rather than processing character by character.

version 1.2 2004-07-21 - add stripHTMLTags - stripFile also strips tags - add stripNbsp

version 1.3 2005-06-20 - fix bug in possEntityToChar - exposed possEntityToChar as public

Version 1.4 2005-07-02 - check for null input

Version 1.5 2005-07-29 - no longer needs entitiestochar.ser file. Converted to JDK 1.5 back to 1.2

Version 1.6 2005-09-05 - faster code for stripHTMLTags that returns original string if nothing changed.

Version:
1.6
Author:
Roedy Green

Field Summary
private static boolean DEBUGGING
          true to enable the testing code.
private static HashMap entityToChar
          allows lookup by entity name, to get the corresponding char.
static int LONGEST_ENTITY
          The longest an entity can be is 10 characters, at least in our tables, including the leading & and trailing ;
static int SHORTEST_ENTITY
          The shortest an entity can be is 4 characters, at least in our tables, including the leading & and trailing ;
 
Constructor Summary
StripEntities()
           
 
Method Summary
static char entityToChar(String entity)
          convert an entity to a single char
static void main(String[] args)
          Test harness
static char possEntityToChar(String possEntity)
          Checks a number of gauntlet conditions to ensure this is a valid entity.
static String stripEntities(String text)
          Converts HTML to text, converting entities such as &quot; back to " and &lt; back to <. Ordinary text passes unchanged.
static String stripHTMLTags(String html)
          Removes tags from HTML leaving just the raw text.
static String stripNbsp(String text)
          Converts all 160-style spaces (the result of stripEntities on &nbsp;) to ordinary spaces.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

LONGEST_ENTITY

public static final int LONGEST_ENTITY
The longest an entity can be is 10 characters, at least in our tables, including the leading & and trailing ;

See Also:
Constant Field Values

SHORTEST_ENTITY

public static final int SHORTEST_ENTITY
The shortest an entity can be is 4 characters, at least in our tables, including the leading & and trailing ;

See Also:
Constant Field Values

DEBUGGING

private static final boolean DEBUGGING
true to enable the testing code.

See Also:
Constant Field Values

entityToChar

private static HashMap entityToChar
allows lookup by entity name, to get the corresponding char.

Constructor Detail

StripEntities

public StripEntities()
Method Detail

stripNbsp

public static String stripNbsp(String text)
Converts all 160-style spaces (the result of stripEntities on &nbsp;) to ordinary spaces.

Parameters:
text - Text to convert
Returns:
Text with 160-style spaces converted to ordinary spaces
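
The conversion described above can be illustrated with a minimal stand-alone sketch. It is not Regain's source: the class name NbspSketch is hypothetical, and the null handling is an assumption borrowed from the stripEntities description.

```java
// Hypothetical stand-in for illustration only; not the Regain implementation.
final class NbspSketch {

    /** Replaces every character 160 (the no-break space) with an ordinary space. */
    static String stripNbsp(String text) {
        if (text == null) {
            return null; // assumption: null passes through, as documented for stripEntities
        }
        return text.replace((char) 160, ' ');
    }
}
```

Text that contains no character 160 comes back with the same content, so the call is safe to apply to any decoded string.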

main

public static void main(String[] args)
Test harness

Parameters:
args - not used.

stripEntities

public static String stripEntities(String text)
Converts HTML to text, converting entities such as &quot; back to " and &lt; back to <. Ordinary text passes unchanged.

Parameters:
text - raw text to be processed. Must not be null.
Returns:
translated text. It also handles HTML 4.0 entities such as &hearts; &#123; and &x#123; and converts &nbsp; to character 160. null input returns null.
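
The scanning approach mentioned in the version 1.1 note (jump to the next & with indexOf and append whole runs of text) might look roughly like the sketch below. It is an illustration under stated assumptions, not the Regain code: the class name StripEntitiesSketch and its six-entry table are hypothetical, and numeric entities are left out to keep the loop readable.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for illustration only; not the Regain implementation.
final class StripEntitiesSketch {

    // Tiny sample table (assumption); the real class covers far more entities.
    private static final Map<String, Character> TABLE = new HashMap<>();
    static {
        TABLE.put("quot", '"');
        TABLE.put("amp", '&');
        TABLE.put("lt", '<');
        TABLE.put("gt", '>');
        TABLE.put("nbsp", (char) 160);
        TABLE.put("hearts", (char) 0x2665);
    }

    static String stripEntities(String text) {
        if (text == null) {
            return null; // null input returns null
        }
        int amp = text.indexOf('&');
        if (amp < 0) {
            return text; // ordinary text passes unchanged
        }
        StringBuilder sb = new StringBuilder(text.length());
        int pos = 0;
        while (amp >= 0) {
            sb.append(text, pos, amp); // copy the run before the '&' in one call
            int semi = text.indexOf(';', amp + 1);
            Character c = (semi < 0) ? null : TABLE.get(text.substring(amp + 1, semi));
            if (c != null) {
                sb.append(c.charValue()); // known entity: emit its character
                pos = semi + 1;
            } else {
                sb.append('&'); // not an entity: keep the '&' literally
                pos = amp + 1;
            }
            amp = text.indexOf('&', pos);
        }
        sb.append(text, pos, text.length()); // copy the tail
        return sb.toString();
    }
}
```

A stray ampersand, as in AT&T, is copied through untouched because no table entry matches the text that follows it.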

possEntityToChar

public static char possEntityToChar(String possEntity)
Checks a number of gauntlet conditions to ensure this is a valid entity. Converts Entity to corresponding char.

Parameters:
possEntity - string that may hold an entity. The leading & must be stripped, but the string may contain text past the ;
Returns:
corresponding unicode character, or 0 if the entity is invalid.
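
The kind of gauntlet this method runs can be sketched as follows. The exact checks in Regain are not documented here, so the ones below are assumptions derived from the SHORTEST_ENTITY and LONGEST_ENTITY values given on this page; the class name GauntletSketch and its four-entry table are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for illustration only; not the Regain implementation.
final class GauntletSketch {

    static final int SHORTEST_ENTITY = 4;  // e.g. "&lt;", counting '&' and ';'
    static final int LONGEST_ENTITY = 10;  // also counting '&' and ';'

    // Tiny sample table (assumption).
    private static final Map<String, Character> TABLE = new HashMap<>();
    static {
        TABLE.put("lt", '<');
        TABLE.put("gt", '>');
        TABLE.put("amp", '&');
        TABLE.put("quot", '"');
    }

    /** possEntity has its leading '&' removed and may run on past the ';'. */
    static char possEntityToChar(String possEntity) {
        // Too short to hold even the shortest name plus ';'.
        if (possEntity == null || possEntity.length() < SHORTEST_ENTITY - 1) {
            return 0;
        }
        // The index of ';' equals the length of the entity name.
        int semi = possEntity.indexOf(';');
        if (semi < SHORTEST_ENTITY - 2 || semi > LONGEST_ENTITY - 2) {
            return 0; // no ';', or the name is outside the table's length range
        }
        Character c = TABLE.get(possEntity.substring(0, semi));
        return c == null ? 0 : c; // 0 signals an invalid entity
    }
}
```

Rejecting candidates on length before the table lookup keeps the common case, an ampersand that is not an entity at all, cheap.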

entityToChar

public static char entityToChar(String entity)
convert an entity to a single char

Parameters:
entity - String entity to convert. Must have the leading & and trailing ; stripped; may be a x#123 or #123 style entity. Works faster if the entity is in lower case.
Returns:
equivalent character. 0 if not recognised.
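
For the numeric case, a conversion along these lines is one possibility. It is a sketch, not the Regain code: the class name NumericEntitySketch is hypothetical, and the hexadecimal form it accepts is the standard HTML spelling (#x7b), which is an assumption and may differ from the x#123 style named above.

```java
// Hypothetical stand-in for illustration only; not the Regain implementation.
final class NumericEntitySketch {

    /** entity has '&' and ';' already stripped, e.g. "#123" or "#x7b". */
    static char numericEntityToChar(String entity) {
        if (entity == null || entity.length() < 2 || entity.charAt(0) != '#') {
            return 0; // not a numeric entity
        }
        boolean hex = entity.charAt(1) == 'x' || entity.charAt(1) == 'X';
        int radix = hex ? 16 : 10;
        String digits = entity.substring(hex ? 2 : 1);
        if (digits.isEmpty()) {
            return 0;
        }
        // Reject signs and other non-digits that Integer.parseInt would tolerate.
        for (int i = 0; i < digits.length(); i++) {
            if (Character.digit(digits.charAt(i), radix) < 0) {
                return 0;
            }
        }
        try {
            int code = Integer.parseInt(digits, radix);
            // Only values that fit in a single char are returned.
            return (code > 0 && code <= 0xFFFF) ? (char) code : 0;
        } catch (NumberFormatException e) {
            return 0; // too many digits to fit in an int
        }
    }
}
```

Because the return type is char, code points above 0xFFFF cannot be represented and are reported as unrecognised in this sketch.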

stripHTMLTags

public static String stripHTMLTags(String html)
Removes tags from HTML, leaving just the raw text. Leaves entities as is, e.g. does not convert &amp; back to &. Similar to code in Quoter. Also removes comments.

Parameters:
html - input HTML
Returns:
raw text, with whitespaces collapsed to a single space, trimmed.
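
The behaviour described above (tags and comments removed, entities left alone, whitespace collapsed, result trimmed) can be approximated with regular expressions. This sketch is not the Regain code, which the version 1.6 note describes as hand-optimised; the class name TagStripSketch is hypothetical, and replacing each tag with a space rather than with nothing is an assumption.

```java
// Hypothetical stand-in for illustration only; not the Regain implementation.
final class TagStripSketch {

    static String stripTags(String html) {
        if (html == null) {
            return null; // assumption: null passes through
        }
        // Remove comments first so a '>' inside a comment cannot end a match early.
        String noComments = html.replaceAll("(?s)<!--.*?-->", " ");
        // Replace each tag with a space so adjacent blocks do not run together.
        String noTags = noComments.replaceAll("<[^>]*>", " ");
        // Collapse whitespace runs to a single space and trim the ends.
        return noTags.replaceAll("\\s+", " ").trim();
    }
}
```

Entities such as &amp; pass through unchanged, so a caller wanting plain text would decode them in a separate step.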


Regain 2.1.0-STABLE, Copyright (C) 2004-2010 Til Schneider, www.murfman.de, Thomas Tesche, www.clustersystems.info