public class RuleBasedSegmenter extends Object implements SentenceSegmenter
| Modifier and Type | Field and Description |
|---|---|
| static Pattern | alphaNumParaLowerNum: Alphanumeric, maybe a space, paragraph mark, maybe a space, and lowercase letter or digit. |
| static Pattern | conventionalPara: End of sentence marker, one or more paragraph marks, maybe some starting punctuation, uppercase. |
| static Pattern | doubleLineBreak: Two consecutive line breaks. |
| static Pattern | endInsideQuotesPara: End of sentence marker, maybe a space, punctuation (quotes, brackets), space, maybe some more punctuation, maybe some space and uppercase. |
| static Pattern | endInsideQuotesSpace: End of sentence marker, maybe a space, punctuation (quotes, brackets), space, maybe some more punctuation, maybe some space and uppercase. |
| static Pattern | endPunctLinkPara: End of sentence markers, paragraph mark and link. |
| static Pattern | endPunctLinkSpace: End of sentence punctuation, maybe spaces and link. |
| static String | FINAL_PUNCT: Final punctuation in Unicode. |
| static String | INITIAL_PUNCT: Initial punctuation in Unicode. |
| static String | LINE_BREAK: The constant representing every line break in the original input text. |
| static Pattern | lineBreak: Line break pattern. |
| static Pattern | multiDotsParaStarters: Multi-dots, paragraph mark, sentence starters and uppercase. |
| static Pattern | multiDotsSpaceStarters: Multi-dots, space, sentence starters and uppercase. |
| static Pattern | noPeriodSpaceEnd: Non-period end of sentence markers (?!), one or more spaces, sentence starters. |
| static Pattern | paragraph: Paragraph pattern. |
| static String | PARAGRAPH: Constant representing a paragraph (a double line break) in the original input text. |
| static Pattern | punctSpaceUpper: End of sentence marker, sentence starter punctuation and uppercase. |
| static Pattern | spuriousParagraph: A paragraph mark followed by maybe some space and then lowercase or punctuation (not start of sentence markers) is a spurious paragraph. |
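
The field descriptions above read as regular-expression recipes. As an illustration, the description of noPeriodSpaceEnd (non-period end of sentence markers, one or more spaces, sentence starters) could be written roughly as below. This is a minimal sketch and not the pattern compiled by RuleBasedSegmenter: the class name, the constant names and the set of initial punctuation characters are assumptions made for the example.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: none of these names or regexes come from RuleBasedSegmenter.
class NoPeriodRuleSketch {

    // Assumed set of sentence-initial punctuation: straight quotes, opening
    // brackets, inverted marks, and a few opening typographic quotes.
    static final String INITIAL_PUNCT_SKETCH =
        "\"'\\(\\[\u00BF\u00A1\u00AB\u201C\u2018";

    // Group 1: one or more '?' or '!'. Then one or more spaces.
    // Group 2: optional initial punctuation followed by an uppercase letter.
    static final Pattern NO_PERIOD_SPACE_END = Pattern.compile(
        "([?!]+) +([" + INITIAL_PUNCT_SKETCH + "]*\\p{Lu})");

    // Replace the spaces at each rule match with a newline, then split on it.
    static String[] split(String text) {
        return NO_PERIOD_SPACE_END.matcher(text).replaceAll("$1\n$2").split("\n");
    }

    public static void main(String[] args) {
        for (String sentence : split("Really? Yes! \"Go home\" he said.")) {
            System.out.println(sentence);
        }
    }
}
```

With this sketch, a lowercase word after `?` or `!` (as in "Wow! that was close") does not trigger a split, which matches the intent of requiring a sentence starter after the spaces.
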
| Constructor and Description |
|---|
| RuleBasedSegmenter(String originalText, Properties properties): Construct a RuleBasedSegmenter from the original text and the properties. |
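
The calling pattern implied by the constructor and by segmentSentence() is: build one segmenter per input text, then ask it for the sentences. The sketch below shows that pattern with local stand-ins so that it runs without the library on the classpath. ToySegmenter and SentenceSegmenterStandIn are hypothetical names, the splitting rule is a deliberately simple placeholder, and the property keys the real class reads are not documented on this page, so an empty Properties object is passed.

```java
import java.util.Properties;

// Hypothetical usage sketch; the real RuleBasedSegmenter is not used here.
class SegmenterUsageSketch {
    public static void main(String[] args) {
        SentenceSegmenterStandIn segmenter =
            new ToySegmenter("One. Two? three. Four!", new Properties());
        for (String sentence : segmenter.segmentSentence()) {
            System.out.println(sentence);
        }
    }
}

// Local stand-in mirroring the single method listed for SentenceSegmenter.
interface SentenceSegmenterStandIn {
    String[] segmentSentence();
}

// Toy class with the same constructor shape as RuleBasedSegmenter.
class ToySegmenter implements SentenceSegmenterStandIn {
    private final String text;
    private final Properties properties; // kept only to mirror the signature; unused

    ToySegmenter(String originalText, Properties properties) {
        this.text = originalText;
        this.properties = properties;
    }

    @Override
    public String[] segmentSentence() {
        // Placeholder rule: split on whitespace that follows '.', '?' or '!'
        // and precedes an uppercase letter.
        return text.split("(?<=[.?!])\\s+(?=\\p{Lu})");
    }
}
```
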
| Modifier and Type | Method and Description |
|---|---|
| static String | buildText(String text) |
| String[] | segmentSentence() |
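
This page does not describe what buildText does, but the LINE_BREAK and PARAGRAPH constants suggest a preprocessing step in which line breaks in the input are replaced by marker strings that later rules (such as spuriousParagraph) can match. The following is a minimal sketch of that idea under stated assumptions: the marker values, the class and method names, and both regexes are invented for the example, and the spurious-paragraph rule is simplified to the lowercase case only.

```java
import java.util.regex.Pattern;

// Hypothetical sketch; marker values and regexes are assumptions, not library values.
class ParagraphMarkSketch {

    static final String LINE_BREAK_MARK = "<LB>";  // assumed marker for one line break
    static final String PARAGRAPH_MARK = "<PAR>";  // assumed marker for a double line break

    static final Pattern DOUBLE_LINE_BREAK = Pattern.compile("(\\r?\\n){2,}");
    static final Pattern SINGLE_LINE_BREAK = Pattern.compile("\\r?\\n");

    // Paragraph mark, maybe some spaces, then a lowercase letter.
    static final Pattern SPURIOUS_PARAGRAPH = Pattern.compile(
        Pattern.quote(PARAGRAPH_MARK) + " *(\\p{Ll})");

    // Mark paragraphs first so that the remaining breaks are single line breaks.
    static String markBreaks(String text) {
        String marked = DOUBLE_LINE_BREAK.matcher(text).replaceAll(PARAGRAPH_MARK);
        return SINGLE_LINE_BREAK.matcher(marked).replaceAll(LINE_BREAK_MARK);
    }

    // A paragraph mark followed by lowercase text is treated as a broken
    // sentence, so the mark is replaced by a single space.
    static String dropSpuriousParagraphs(String marked) {
        return SPURIOUS_PARAGRAPH.matcher(marked).replaceAll(" $1");
    }

    public static void main(String[] args) {
        System.out.println(markBreaks("First line\nsecond line.\n\nNew paragraph."));
        System.out.println(dropSpuriousParagraphs("broken in<PAR>the middle.<PAR>Next one."));
    }
}
```

Marking double breaks before single ones matters in this sketch: in the opposite order every newline would already have been consumed as a line break and no paragraph would be detected.
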
public static final String LINE_BREAK
public static final String PARAGRAPH
public static Pattern lineBreak
public static Pattern doubleLineBreak
public static Pattern paragraph
public static String INITIAL_PUNCT
public static String FINAL_PUNCT
public static Pattern endPunctLinkPara
public static Pattern conventionalPara
public static Pattern endInsideQuotesPara
public static Pattern multiDotsParaStarters
public static Pattern spuriousParagraph
public static Pattern alphaNumParaLowerNum
public static Pattern noPeriodSpaceEnd
public static Pattern multiDotsSpaceStarters
public static Pattern endInsideQuotesSpace
public static Pattern punctSpaceUpper
public static Pattern endPunctLinkSpace
public RuleBasedSegmenter(String originalText, Properties properties)
Parameters:
originalText - the text to be segmented
properties - the properties
public String[] segmentSentence()
Specified by:
segmentSentence in interface SentenceSegmenter
Copyright © 2015 IXA pipes. All rights reserved.