Package org.languagetool.tokenizers.ca
Class CatalanWordTokenizer
java.lang.Object
org.languagetool.tokenizers.WordTokenizer
org.languagetool.tokenizers.ca.CatalanWordTokenizer
- All Implemented Interfaces:
Tokenizer
Tokenizes a sentence into words. Punctuation and whitespace get their own tokens.
Special treatment for hyphens and apostrophes in Catalan.
- Author:
- Jaume Ortolà
Field Summary
Fields inherited from class org.languagetool.tokenizers.WordTokenizer
REMOVED_EMOJI
Constructor Summary
Constructors
CatalanWordTokenizer()
Method Summary
Methods inherited from class org.languagetool.tokenizers.WordTokenizer
getProtocols, getTokenizingCharacters, isCurrencyExpression, isEMail, isUrl, joinEMails, joinEMailsAndUrls, joinUrls, replaceEmojis, restoreEmojis, splitCurrencyExpression
Field Details
INSTANCE
Constructor Details
CatalanWordTokenizer
public CatalanWordTokenizer()
Method Details
tokenize
- Specified by:
tokenize in interface Tokenizer
- Overrides:
tokenize in class WordTokenizer
- Parameters:
text - Text to tokenize
- Returns:
- List of tokens. Note: a special string xxCA_APOSxx is used to replace apostrophes, and xxCA_HYPHENxx to replace hyphens.
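The note on placeholder strings can be illustrated with a minimal, self-contained sketch. This is not the code of CatalanWordTokenizer: the class name PlaceholderTokenizerSketch, the delimiter set, and the rule "protect an apostrophe or hyphen only when it sits between two letters" are all assumptions made for illustration. The real class applies Catalan-specific rules for where to split, which this sketch does not attempt to reproduce, so its output will differ on real Catalan text. Only the two placeholder strings are taken from the documentation above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical illustration of placeholder-based tokenizing.
// It does NOT reproduce the rules of CatalanWordTokenizer.
class PlaceholderTokenizerSketch {
    // Placeholder strings named in the tokenize documentation.
    static final String APOS = "xxCA_APOSxx";
    static final String HYPHEN = "xxCA_HYPHENxx";

    // Assumed delimiter set, chosen for this sketch only.
    static final String DELIMS = " \t\n.,;:!?()'-";

    static List<String> tokenize(String text) {
        // Step 1: hide apostrophes and hyphens that sit between two letters,
        // so the splitter below does not treat them as delimiters.
        String protectedText = text
            .replaceAll("(?<=\\p{L})'(?=\\p{L})", APOS)
            .replaceAll("(?<=\\p{L})-(?=\\p{L})", HYPHEN);

        // Step 2: split on delimiters, returning each delimiter as its own token.
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(protectedText, DELIMS, true);
        while (st.hasMoreTokens()) {
            // Step 3: restore the original characters inside each token.
            tokens.add(st.nextToken().replace(APOS, "'").replace(HYPHEN, "-"));
        }
        return tokens;
    }
}
```

With these assumed rules, PlaceholderTokenizerSketch.tokenize("ab-cd e'f.") yields the four tokens "ab-cd", " ", "e'f" and ".", while a hyphen or apostrophe that is not between two letters still becomes a token of its own.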