public class HtmlSentenceExtractor extends SentenceExtractor
Extractor class for extracting NpChunkedSentence objects from a
String containing HTML. It is backed by an OpenNLP SentenceDetector object
and uses the code in HtmlUtils to extract plain text from HTML.

| Constructor and Description |
|---|
| HtmlSentenceExtractor(): Constructs a new HtmlSentenceExtractor object using the default OpenNLP SentenceDetector object, as returned by DefaultObjects.getDefaultSentenceDetector(). |
| HtmlSentenceExtractor(opennlp.tools.sentdetect.SentenceDetector detector): Constructs a new HtmlSentenceExtractor object using the given OpenNLP SentenceDetector object. |

| Modifier and Type | Method and Description |
|---|---|
| protected Collection<String> | extractCandidates(String htmlBlock): Runs the OpenNLP SentenceDetector object on the given String source, and returns an Iterable object over the detected sentences. |
| static void | main(String[] args): Extracts sentences from HTML passed via standard input, or through a file given as an argument to the program. |
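
The flow that the class description and extractCandidates summarize (HTML to plain text, then sentence detection) can be sketched with only the JDK. This is not the library's implementation: a crude regex tag stripper stands in for HtmlUtils, java.text.BreakIterator stands in for the OpenNLP SentenceDetector, and the names HtmlSentenceSketch, stripTags, and splitSentences are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class HtmlSentenceSketch {

    // Crude stand-in for HtmlUtils (assumption): replace tags with spaces
    // and collapse runs of whitespace.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ")
                   .replaceAll("\\s+", " ")
                   .trim();
    }

    // Stand-in for the OpenNLP SentenceDetector (assumption): split on the
    // JDK's own sentence boundaries.
    static List<String> splitSentences(String text) {
        List<String> sentences = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String s = text.substring(start, end).trim();
            if (!s.isEmpty()) {
                sentences.add(s);
            }
        }
        return sentences;
    }

    // Mirrors the shape of extractCandidates(String htmlBlock): HTML in, sentences out.
    static List<String> extractCandidates(String htmlBlock) {
        return splitSentences(stripTags(htmlBlock));
    }

    public static void main(String[] args) {
        String html = "<p>Sentence extraction is useful. It runs on <b>plain</b> text.</p>";
        // Print one detected sentence per line.
        for (String sentence : extractCandidates(html)) {
            System.out.println(sentence);
        }
    }
}
```

A regex tag stripper is only adequate for simple markup; real HTML with scripts, comments, or entities needs a proper parser, which is presumably why the class delegates that step to HtmlUtils.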

Methods inherited from superclasses: getSentenceDetector; addMapper, compose, extract, getMappers

public HtmlSentenceExtractor(opennlp.tools.sentdetect.SentenceDetector detector)
Constructs a new HtmlSentenceExtractor object using the given OpenNLP SentenceDetector object.
Parameters: detector -

public HtmlSentenceExtractor() throws IOException
Constructs a new HtmlSentenceExtractor object using the default OpenNLP SentenceDetector object, as returned by DefaultObjects.getDefaultSentenceDetector().
Throws: IOException

protected Collection<String> extractCandidates(String htmlBlock)
Description copied from class SentenceExtractor: runs the OpenNLP SentenceDetector object on the given String source, and returns an Iterable object over the detected sentences.
Overrides: extractCandidates in class SentenceExtractor
Parameters: htmlBlock - the source to extract from.

public static void main(String[] args) throws Exception
Extracts sentences from HTML passed via standard input, or through a file given as an argument to the program. Applies the BracketsRemover mapper class, and filters sentences using the SentenceEndFilter, SentenceStartFilter, and SentenceLengthFilter mapper classes. Prints the resulting sentences to standard output, one sentence per line.
Parameters: args -
Throws: Exception

Copyright © 2010-2013 University of Washington CSE. All Rights Reserved.
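
The post-processing that main describes (bracket removal followed by start, end, and length filters) can be illustrated with plain JDK streams. The rules below are assumptions chosen for illustration, not the actual behavior of BracketsRemover, SentenceStartFilter, SentenceEndFilter, or SentenceLengthFilter, and every name in the sketch is hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class SentenceCleanupSketch {

    // Assumed rule: drop bracketed spans such as "[1]" or "(see below)",
    // then tidy the whitespace left behind.
    static String removeBrackets(String s) {
        return s.replaceAll("\\[[^\\]]*\\]|\\([^)]*\\)", "")
                .replaceAll("\\s+", " ")
                .replaceAll("\\s+([.,;:!?])", "$1")
                .trim();
    }

    // Assumed rule: a sentence starts with an uppercase letter.
    static boolean startsLikeSentence(String s) {
        return !s.isEmpty() && Character.isUpperCase(s.charAt(0));
    }

    // Assumed rule: a sentence ends with '.', '!' or '?'.
    static boolean endsLikeSentence(String s) {
        return s.matches(".*[.!?]");
    }

    // Assumed bounds: between 10 and 400 characters.
    static boolean hasReasonableLength(String s) {
        return s.length() >= 10 && s.length() <= 400;
    }

    // Map first (bracket removal), then apply the three filters in sequence.
    static List<String> clean(List<String> sentences) {
        return sentences.stream()
                .map(SentenceCleanupSketch::removeBrackets)
                .filter(SentenceCleanupSketch::startsLikeSentence)
                .filter(SentenceCleanupSketch::endsLikeSentence)
                .filter(SentenceCleanupSketch::hasReasonableLength)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList(
                "Seattle is a city in Washington [1].",
                "click here to log in",
                "Home");
        // Print the surviving sentences to standard output, one per line.
        clean(raw).forEach(System.out::println);
    }
}
```

Filters of this kind are useful on web text because navigation labels and link fragments rarely look like complete sentences, so cheap surface checks discard much of the noise before any heavier processing.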