public class NonPeriodBreaker extends Object
| Modifier and Type | Field and Description |
|---|---|
| static Pattern | acronym: General acronyms. |
| static Pattern | alphabetic: Any alphabetic character. |
| static Pattern | nextCandidateWord: The word following the candidate, used to decide whether the candidate is a sentence breaker. |
| static String | NON_BREAKER_DIGITS: Words after which a period is not split when it is followed by a number. |
| static Pattern | nonSegmentedWords: Non-segmented words that are candidates for sentence breaking. |
| static Pattern | numbers: Do not segment numbers like 11.1. |
| static Pattern | startDigit: Starts with a digit. |
| static Pattern | startLower: Starts with a lowercase letter. |
| static Pattern | startPunct: Starts with punctuation that is not a beginning-of-sentence marker. |
| static Pattern | wordDot: Any non-whitespace text followed by a period. |
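The field descriptions above do not show the regular expressions themselves. As a rough illustration of the kind of checks they describe, the sketch below defines stand-in patterns with `java.util.regex`. Every regex, the class name `PeriodPatternSketch`, and the helper method names are assumptions made for this example; they are not the definitions used by `NonPeriodBreaker`.

```java
import java.util.regex.Pattern;

/** Illustrative only: these regexes are assumptions, not the library's definitions. */
final class PeriodPatternSketch {

    // Hypothetical stand-in for wordDot: non-whitespace text ending in a period.
    static final Pattern WORD_DOT = Pattern.compile("^(\\S+)\\.$");

    // Hypothetical stand-in for numbers: digit groups joined by periods, e.g. 11.1.
    static final Pattern NUMBERS = Pattern.compile("^\\d+(\\.\\d+)+$");

    // Hypothetical stand-in for acronym: uppercase letters each followed by a period.
    static final Pattern ACRONYM = Pattern.compile("^(\\p{Lu}\\.)+$");

    // Hypothetical stand-ins for startLower and startDigit.
    static final Pattern START_LOWER = Pattern.compile("^\\p{Ll}");
    static final Pattern START_DIGIT = Pattern.compile("^\\d");

    static boolean isWordDot(String token) {
        return WORD_DOT.matcher(token).matches();
    }

    static boolean isNumber(String token) {
        return NUMBERS.matcher(token).matches();
    }

    static boolean isAcronym(String token) {
        return ACRONYM.matcher(token).matches();
    }

    static boolean startsLower(String token) {
        return START_LOWER.matcher(token).find();
    }

    static boolean startsWithDigit(String token) {
        return START_DIGIT.matcher(token).find();
    }

    public static void main(String[] args) {
        // Print how each sample token is classified by the stand-in patterns.
        for (String token : new String[] {"etc.", "11.1", "U.S.", "word", "3rd"}) {
            System.out.println(token
                + " wordDot=" + isWordDot(token)
                + " number=" + isNumber(token)
                + " acronym=" + isAcronym(token)
                + " startLower=" + startsLower(token)
                + " startDigit=" + startsWithDigit(token));
        }
    }
}
```

Keeping each check as a precompiled static `Pattern` mirrors the field layout listed in the table, so the patterns are compiled once and shared across calls.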
| Constructor and Description |
|---|
| NonPeriodBreaker(Properties properties): Reads non-breaking prefix files from the resources to create exceptions for segmentation and tokenization. |
| Modifier and Type | Method and Description |
|---|---|
| String[] | segmenterExceptions(String[] lines): Segments the rest of the text, taking into account some exceptions for periods as sentence breakers. |
| String | TokenizerNonBreaker(String line): Decides when periods do not need to be tokenized. |
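To make the idea behind `segmenterExceptions` concrete, here is a minimal sketch of one way to undo false sentence breaks in already-segmented lines. It is not the library's implementation: the class `SegmenterExceptionSketch`, the methods `mergeFalseBreaks` and `isFalseBreak`, and the hard-coded prefix list are all hypothetical. The real class loads its non-breaking prefixes from resource files, and its rules may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative only: a simplified take on period exceptions, not the library's logic. */
final class SegmenterExceptionSketch {

    // Hypothetical prefix list, assumed for this example.
    private static final Set<String> NON_BREAKING =
        new HashSet<>(Arrays.asList("Mr", "Mrs", "Dr", "Prof"));

    /** Rejoins a segment with the next one when its final period is not a real break. */
    static String[] mergeFalseBreaks(String[] lines) {
        List<String> merged = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < lines.length; i++) {
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(lines[i]);
            boolean hasNext = i + 1 < lines.length;
            if (hasNext && isFalseBreak(lines[i], lines[i + 1])) {
                continue; // keep accumulating into the same sentence
            }
            merged.add(current.toString());
            current.setLength(0);
        }
        return merged.toArray(new String[0]);
    }

    /** A break is treated as false after a known prefix or before a lowercase start. */
    static boolean isFalseBreak(String line, String next) {
        String trimmed = line.trim();
        if (!trimmed.endsWith(".")) {
            return false;
        }
        // Last word of the segment, without its trailing period.
        String lastWord =
            trimmed.substring(trimmed.lastIndexOf(' ') + 1, trimmed.length() - 1);
        if (NON_BREAKING.contains(lastWord)) {
            return true;
        }
        String nextTrimmed = next.trim();
        return !nextTrimmed.isEmpty() && Character.isLowerCase(nextTrimmed.charAt(0));
    }

    public static void main(String[] args) {
        String[] lines = {"I met Mr.", "Smith yesterday.", "He was late."};
        for (String sentence : mergeFalseBreaks(lines)) {
            System.out.println(sentence);
        }
    }
}
```

With the sample input in `main`, the first two segments are rejoined because the first ends in the assumed prefix "Mr", while the third stays a separate sentence.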
public static Pattern nonSegmentedWords
public static Pattern nextCandidateWord
public static String NON_BREAKER_DIGITS
public static Pattern acronym
public static Pattern numbers
public static Pattern wordDot
public static Pattern alphabetic
public static Pattern startLower
public static Pattern startPunct
public static Pattern startDigit
public NonPeriodBreaker(Properties properties)
properties - the options
public String[] segmenterExceptions(String[] lines)
lines - the segmented sentences so far
Copyright © 2017 IXA pipes. All rights reserved.