public class Tokenizer extends Object
Tokenizer supports abbreviations for English, French, Italian, and German.
If no language is set, all available abbreviations will be used.

| Modifier and Type | Field and Description |
|---|---|
| protected static String | F_CHAR |
| protected static String | P_CHAR |

| Constructor and Description |
|---|
| Tokenizer(): Initializes a new Tokenizer object. |

| Modifier and Type | Method and Description |
|---|---|
| void | addAbbreviation(com.neovisionaries.i18n.LanguageCode language, File abbreviationFile): Adds the content of the given file as a list of abbreviations to the internal map corresponding to the given language. |
| void | addAbbreviation(com.neovisionaries.i18n.LanguageCode language, HashSet<String> abbreviations): Adds the given set of abbreviations to the internal map corresponding to the given language. |
| void | addClitics(com.neovisionaries.i18n.LanguageCode language, Clitics clitics): Adds the given clitics to the internal map corresponding to the given language. |
| void | addClitics(com.neovisionaries.i18n.LanguageCode language, File cliticsFile): Adds the content of the given file as a set of clitics to the internal map corresponding to the given language. |
| static com.neovisionaries.i18n.LanguageCode | checkLanguage(String text): Tries to detect the language and returns its ISO 639-2 language code. |
| HashSet<String> | getAbbreviations(com.neovisionaries.i18n.LanguageCode language): Returns the abbreviations corresponding to the given language. |
| Clitics | getClitics(com.neovisionaries.i18n.LanguageCode language): Returns the clitics corresponding to the given language. |
| SDocumentGraph | getDocumentGraph() |
| static com.neovisionaries.i18n.LanguageCode | mapISOLanguageCode(String language): Maps the knallgrau TextCategorizer language description codes to ISO 639 codes. |
| void | setsDocumentGraph(SDocumentGraph sDocumentGraph) |
| List<SToken> | tokenize(STextualDS sTextualDSs): Sets the STextualDS to be tokenized. |
| List<SToken> | tokenize(STextualDS sTextualDSs, com.neovisionaries.i18n.LanguageCode language): Sets the STextualDS to be tokenized and the language of the text. |
| List<SToken> | tokenize(STextualDS sTextualDS, com.neovisionaries.i18n.LanguageCode language, Integer startPos, Integer endPos): Sets the STextualDS to be tokenized and the language of the text. |
| List<String> | tokenizeToString(String strInput, com.neovisionaries.i18n.LanguageCode language): The general task of this class is to tokenize a given text in the same order as the TreeTagger tool does. |
| List<SToken> | tokenizeToToken(STextualDS sTextualDS, com.neovisionaries.i18n.LanguageCode language, Integer startPos, Integer endPos): The general task of this class is to tokenize a given text in the same order as the TreeTagger tool does. |
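
The class description states that all available abbreviations are used when no language is set. The following is a minimal sketch of that fallback, not the library's implementation: `AbbreviationLookupSketch`, `add`, and `lookup` are hypothetical names, and plain `String` language codes stand in for `com.neovisionaries.i18n.LanguageCode` so that the example needs no external dependency.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a per-language abbreviation map; not part of the Tokenizer API.
class AbbreviationLookupSketch {
    private final Map<String, Set<String>> byLanguage = new HashMap<>();

    // Registers abbreviations for one language (assumed analogue of addAbbreviation).
    void add(String language, Set<String> abbreviations) {
        byLanguage.computeIfAbsent(language, k -> new HashSet<>()).addAll(abbreviations);
    }

    // With a language: only that language's abbreviations.
    // With null: the union of all registered abbreviations.
    Set<String> lookup(String language) {
        Set<String> result = new HashSet<>();
        if (language != null) {
            result.addAll(byLanguage.getOrDefault(language, Collections.<String>emptySet()));
        } else {
            for (Set<String> abbreviations : byLanguage.values()) {
                result.addAll(abbreviations);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        AbbreviationLookupSketch sketch = new AbbreviationLookupSketch();
        sketch.add("en", new HashSet<>(Arrays.asList("etc.", "Mr.")));
        sketch.add("de", new HashSet<>(Arrays.asList("z.B.")));
        System.out.println(sketch.lookup("en"));
        System.out.println(sketch.lookup(null));
    }
}
```

Returning a fresh set from `lookup` keeps callers from modifying the registered abbreviations by accident; whether the real class copies or exposes its internal set is not stated in this documentation.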
protected static final String P_CHAR
protected static final String F_CHAR
public void setsDocumentGraph(SDocumentGraph sDocumentGraph)
public SDocumentGraph getDocumentGraph()
public List<SToken> tokenize(STextualDS sTextualDSs)
Sets the STextualDS to be tokenized. Its language will be detected automatically if possible.
sTextualDSs -

public List<SToken> tokenize(STextualDS sTextualDSs, com.neovisionaries.i18n.LanguageCode language)
Sets the STextualDS to be tokenized and the language of the text. If language is null, it will be detected automatically if possible.
sTextualDSs -

public List<SToken> tokenize(STextualDS sTextualDS, com.neovisionaries.i18n.LanguageCode language, Integer startPos, Integer endPos)
Sets the STextualDS to be tokenized and the language of the text. If language is null, it will be detected automatically if possible.
sTextualDS - STextualDS object containing the text to be tokenized
language - language of the text; if null, the language will be detected automatically
startPos - start position, if the text to be tokenized is a subset (0 is assumed if set to null)
endPos - end position, if the text to be tokenized is a subset (the length of the text is assumed if set to null)

public static com.neovisionaries.i18n.LanguageCode checkLanguage(String text)
text -

public static com.neovisionaries.i18n.LanguageCode mapISOLanguageCode(String language)
Maps the knallgrau TextCategorizer language description codes to ISO 639 codes.

public void addAbbreviation(com.neovisionaries.i18n.LanguageCode language, HashSet<String> abbreviations)
language -
abbreviations -

public void addAbbreviation(com.neovisionaries.i18n.LanguageCode language, File abbreviationFile)
language -
abbreviationFile -

public HashSet<String> getAbbreviations(com.neovisionaries.i18n.LanguageCode language)
language -

public void addClitics(com.neovisionaries.i18n.LanguageCode language, Clitics clitics)
language -
clitics -

public void addClitics(com.neovisionaries.i18n.LanguageCode language, File cliticsFile)
The file must be structured so that the first line contains the regex for proclitics, and the second line the regex for enclitics, e.g.:
([dcjlmnstDCJLNMST]'|[Qq]u'|[Jj]usqu'|[Ll]orsqu')
(-t-elles?|-t-ils?|-t-on|-ce|-elles?|-ils?|-je|-la|-les?|-leur|-lui|-mêmes?|-m'|-moi|-nous|-on|-toi|-tu|-t'|-vous|-en|-y|-ci|-là)
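
The two-line layout described above can be illustrated with a small, self-contained sketch. It is not the library's parser: `CliticsFileSketch` and `split` are hypothetical names, the file content is passed as a string instead of a `File`, the enclitics expression in `main` is shortened for the example, and anchoring the proclitics at the start of a word and the enclitics at its end is an assumption about how such expressions would be applied.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical reader for the two-line clitics layout; not part of the Tokenizer API.
class CliticsFileSketch {
    final Pattern proclitics;
    final Pattern enclitics;

    // Line 1 holds the proclitics regex, line 2 the enclitics regex.
    CliticsFileSketch(String fileContent) {
        String[] lines = fileContent.split("\\R");
        if (lines.length < 2) {
            throw new IllegalArgumentException("expected two lines: proclitics, then enclitics");
        }
        // Assumption: proclitics attach to the start of a word, enclitics to its end.
        proclitics = Pattern.compile("^" + lines[0].trim());
        enclitics = Pattern.compile(lines[1].trim() + "$");
    }

    // Splits one whitespace-free word into [proclitic] core [enclitic].
    List<String> split(String word) {
        List<String> parts = new ArrayList<>();
        String rest = word;
        Matcher pro = proclitics.matcher(rest);
        if (pro.find() && pro.end() < rest.length()) {
            parts.add(pro.group());
            rest = rest.substring(pro.end());
        }
        String tail = null;
        Matcher en = enclitics.matcher(rest);
        if (en.find() && en.start() > 0) {
            tail = en.group();
            rest = rest.substring(0, en.start());
        }
        parts.add(rest);
        if (tail != null) {
            parts.add(tail);
        }
        return parts;
    }

    public static void main(String[] args) {
        String content = "([dcjlmnstDCJLNMST]'|[Qq]u'|[Jj]usqu'|[Ll]orsqu')\n"
                + "(-t-ils?|-moi|-vous)";
        CliticsFileSketch clitics = new CliticsFileSketch(content);
        System.out.println(clitics.split("l'homme"));
        System.out.println(clitics.split("donne-moi"));
    }
}
```

Under these assumptions, `l'homme` is separated into `l'` and `homme`, and `donne-moi` into `donne` and `-moi`.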
language -
cliticsFile -

public Clitics getClitics(com.neovisionaries.i18n.LanguageCode language)
language -

public List<SToken> tokenizeToToken(STextualDS sTextualDS, com.neovisionaries.i18n.LanguageCode language, Integer startPos, Integer endPos)
If the SDocumentGraph already contains tokens, they are preserved if they cover the same textual range as the new ones.
Otherwise, an SSpan is created corresponding to the existing token. The span then covers all new tokens and contains all
annotations the old token did. If the span would cover the same textual range as the old token did, no span is created.
strInput - original text

public List<String> tokenizeToString(String strInput, com.neovisionaries.i18n.LanguageCode language)
strInput - original text

Copyright © 2009–2020 Humboldt-Universität zu Berlin, INRIA. All rights reserved.