Type Parameters: O - the token type.
public class TreeTaggerWrapper<O>
extends java.lang.Object
setModel(String)). Otherwise, once started, the process keeps running
in the background, which saves a lot of time. While it is not used, the
process remains dormant and consumes some memory, but no CPU.
During analysis, two threads are used to communicate with the TreeTagger process. One thread writes tokens to the process, while the other receives the analyzed tokens.
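The writer/reader split described above can be illustrated with a minimal sketch. It assumes an in-memory pipe as a stand-in for the external process, so it only shows the threading pattern, not how this class actually talks to TreeTagger. The class name TwoThreadSketch and the method roundTrip are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PipedReader;
import java.io.PipedWriter;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: an in-memory pipe plays the role of the external process.
final class TwoThreadSketch {
    static List<String> roundTrip(List<String> tokens) {
        List<String> received = new ArrayList<>();
        try {
            PipedWriter toProcess = new PipedWriter();
            PipedReader fromProcess = new PipedReader(toProcess);

            // Writer thread: sends one token per line, then closes the stream.
            Thread writer = new Thread(() -> {
                try (PrintWriter out = new PrintWriter(toProcess)) {
                    for (String token : tokens) {
                        out.println(token);
                    }
                }
            });

            // Reader thread: collects lines until the end of the stream.
            Thread reader = new Thread(() -> {
                try (BufferedReader in = new BufferedReader(fromProcess)) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        received.add(line);
                    }
                } catch (IOException e) {
                    throw new IllegalStateException(e);
                }
            });

            reader.start();
            writer.start();
            writer.join();
            reader.join();
        } catch (IOException | InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return received;
    }
}
```

Reading and writing on separate threads means the writer can block on a full pipe buffer without stalling the reader, which is the usual reason for this design when driving an external process.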
For easy integration into applications, this class takes any object containing
token information and uses either its Object.toString() method or
a TokenAdapter set using setAdapter(TokenAdapter) to extract
the actual token. To receive the analyzed tokens, set a custom
TokenHandler using setHandler(TokenHandler).
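The choice between an adapter and Object.toString() can be sketched as follows. The TextExtractor interface and the Word class are hypothetical stand-ins defined here so the example is self-contained; they are not the library's TokenAdapter or any of its token types.

```java
import java.util.ArrayList;
import java.util.List;

final class AdapterSketch {
    // Hypothetical stand-in for a token adapter; not the library's interface.
    interface TextExtractor<O> {
        String getText(O token);
    }

    // A custom token type that carries more than just the surface form.
    static final class Word {
        final String text;
        final int offset;

        Word(String text, int offset) {
            this.text = text;
            this.offset = offset;
        }
    }

    // Use the extractor when one is given, otherwise fall back to toString().
    static <O> List<String> surfaceForms(List<O> tokens, TextExtractor<O> extractor) {
        List<String> forms = new ArrayList<>();
        for (O token : tokens) {
            forms.add(extractor != null ? extractor.getText(token) : token.toString());
        }
        return forms;
    }
}
```

With plain String tokens no extractor is needed, because toString() already returns the token text. A type such as Word needs one, unless it overrides toString() to return the text.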
By default, the TreeTagger executable is searched for in the directories
indicated by the system property treetagger.home and the
environment variables TREETAGGER_HOME and TAGDIR,
in this order. The setModel(String) method expects the full path to a
model file, optionally followed by a colon (:) and the model encoding.
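The lookup order and the path-plus-encoding form of the model name can be sketched as below. This is an illustration under stated assumptions, not the resolver code used by this class: the class LookupSketch and its methods are hypothetical, and the handling of colons that belong to the path (for example a Windows drive letter) is a guess at a sensible rule.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Properties;

final class LookupSketch {
    // Collect the configured locations in the documented order, skipping unset ones.
    static List<String> candidateHomes(Properties sysProps, Map<String, String> env) {
        List<String> homes = new ArrayList<>();
        addIfSet(homes, sysProps.getProperty("treetagger.home"));
        addIfSet(homes, env.get("TREETAGGER_HOME"));
        addIfSet(homes, env.get("TAGDIR"));
        return homes;
    }

    private static void addIfSet(List<String> homes, String value) {
        if (value != null && !value.isEmpty()) {
            homes.add(value);
        }
    }

    // Split "path[:encoding]" into {path, encoding}; encoding is null when absent.
    static String[] splitModelSpec(String spec) {
        int colon = spec.lastIndexOf(':');
        // Assumption: if a path separator follows the last colon, that colon
        // belongs to the path (e.g. "C:\models\english.par"), not to an encoding.
        if (colon < 0 || spec.indexOf('/', colon) >= 0 || spec.indexOf('\\', colon) >= 0) {
            return new String[] { spec, null };
        }
        return new String[] { spec.substring(0, colon), spec.substring(colon + 1) };
    }
}
```

In an application, the first method would typically be called as candidateHomes(System.getProperties(), System.getenv()).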
For additional flexibility, register a custom ExecutableResolver
using setExecutableProvider(ExecutableResolver) or a custom
ModelResolver using setModelProvider(ModelResolver). Custom
providers may extract models and executables from archives or download them
from some location and temporarily or permanently install them in the file
system. A custom model resolver may also be used to resolve a language code
(e.g. en) to a particular model.
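Resolving a language code to a model might look like the following minimal sketch. It works on plain strings and does not implement the ModelResolver interface, whose methods are not shown here; the class LanguageModelSketch and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: maps language codes to model names of the form "path[:encoding]".
final class LanguageModelSketch {
    private final Map<String, String> modelsByLanguage = new HashMap<>();

    void register(String languageCode, String modelName) {
        modelsByLanguage.put(languageCode, modelName);
    }

    // Known codes resolve to the registered model; anything else is passed through unchanged.
    String resolve(String nameOrCode) {
        String model = modelsByLanguage.get(nameOrCode);
        return model != null ? model : nameOrCode;
    }
}
```

Passing unknown names through unchanged lets callers keep using full model paths alongside language codes.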
A simple illustration of how to use this class:
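TreeTaggerWrapper<String> tt = new TreeTaggerWrapper<String>();
try {
    // Model path, followed by ":" and the model encoding.
    tt.setModel("/treetagger/models/english.par:iso8859-1");
    // Print each analyzed token with its part-of-speech tag and lemma.
    tt.setHandler(new TokenHandler<String>() {
        public void token(String token, String pos, String lemma) {
            System.out.println(token + "\t" + pos + "\t" + lemma);
        }
    });
    tt.process(java.util.Arrays.asList(new String[] {"This", "is", "a", "test", "."}));
}
finally {
    // Always stop the TreeTagger process, even if processing fails.
    tt.destroy();
}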
| Modifier and Type | Field and Description |
|---|---|
| static int | MAX_POSSIBLE_TOKEN_LENGTH: This is the maximal token size that TreeTagger on OS X supports (empirically determined). |
| static boolean | TRACE |
| Constructor and Description |
|---|
| TreeTaggerWrapper() |
| Modifier and Type | Method and Description |
|---|---|
| void | destroy(): Stop the TreeTagger process and clean up the model and executable. |
| protected void | finalize() |
| TokenAdapter<O> | getAdapter(): Get the current token adapter. |
| java.lang.String[] | getArguments() |
| java.lang.Double | getEpsilon(): Get minimal tag frequency. |
| ExecutableResolver | getExecutableProvider(): Get the current executable resolver. |
| TokenHandler<O> | getHandler(): Get the current token handler. |
| boolean | getHyphenHeuristics(): Get hyphen heuristics mode setting. |
| int | getMaximumTokenLength(): Get the maximum number of bytes allowed in a token. |
| Model | getModel(): Get the currently set model. |
| ModelResolver | getModelResolver(): Get the current model resolver. |
| boolean | getPerformanceMode(): Get performance mode state. |
| PlatformDetector | getPlatformDetector(): Get platform information. |
| java.lang.Double | getProbabilityThreshold() |
| int | getRestartCount(): Get the number of times a TreeTagger process was started. |
| java.lang.String | getStatus() |
| boolean | isStrictMode(): Get the strict mode state. |
| void | process(java.util.Collection<O> aTokenList): Process the given list of token objects. |
| void | process(O[] aTokenList): Process the given array of token objects. |
| protected java.util.Collection<O> | removeProblematicTokens(java.util.Collection<O> tokenList): Filter out tokens that cause problems when communicating with the TreeTagger process. |
| void | setAdapter(TokenAdapter<O> aAdapter): Set a TokenAdapter used to extract the token string from the token objects passed to process(Collection). |
| void | setArguments(java.lang.String[] aArgs): Set the arguments that are passed to the TreeTagger executable. |
| void | setEpsilon(java.lang.Double aEpsilon): Set minimal tag frequency to epsilon. |
| void | setExecutableProvider(ExecutableResolver aExeProvider): Set a custom executable resolver. |
| void | setHandler(TokenHandler<O> aHandler): Set a TokenHandler to receive the analyzed tokens. |
| void | setHyphenHeuristics(boolean hyphenHeuristics): Turn on the heuristics for guessing the parts of speech of unknown hyphenated words. |
| void | setMaximumTokenLength(int maximumTokenLength): Set the maximal number of bytes allowed in a token. |
| void | setModel(Model model): Load the given model. |
| void | setModel(java.lang.String modelName): Load the model with the given name. |
| void | setModelProvider(ModelResolver aModelProvider): Set a custom model resolver. |
| void | setPerformanceMode(boolean performanceMode): Disable some sanity checks. |
| void | setPlatformDetector(PlatformDetector aPlatform): Set platform information. |
| void | setProbabilityThreshold(java.lang.Double aThreshold): Print all tags of a word with a probability higher than X times the largest probability. |
| void | setStrictMode(boolean strictMode): Set the strict mode. |
public static boolean TRACE
public static final int MAX_POSSIBLE_TOKEN_LENGTH
public void setPerformanceMode(boolean performanceMode)
Parameters: performanceMode - on/off.
public boolean getPerformanceMode()
public void setMaximumTokenLength(int maximumTokenLength)
Set the maximal number of bytes allowed in a token. The length set is automatically capped to MAX_POSSIBLE_TOKEN_LENGTH. Note that this is the size in bytes, not the size in characters.
Parameters: maximumTokenLength - the maximal number of bytes allowed in a token.
public int getMaximumTokenLength()
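The difference between byte length and character length, and the capping of the limit, can be shown with a small sketch. The cap value used here is a placeholder chosen for illustration, not the actual value of MAX_POSSIBLE_TOKEN_LENGTH, and the class TokenLengthSketch is hypothetical.

```java
import java.nio.charset.Charset;

final class TokenLengthSketch {
    // Placeholder cap for illustration only; the real constant is MAX_POSSIBLE_TOKEN_LENGTH.
    static final int ASSUMED_CAP = 1000;

    // A requested limit above the cap is reduced to the cap.
    static int effectiveLimit(int requested) {
        return Math.min(requested, ASSUMED_CAP);
    }

    // Length of the token in bytes once encoded with the given encoding.
    static int byteLength(String token, String encoding) {
        return token.getBytes(Charset.forName(encoding)).length;
    }

    // True if the encoded token fits within the effective limit.
    static boolean fits(String token, String encoding, int requestedLimit) {
        return byteLength(token, encoding) <= effectiveLimit(requestedLimit);
    }
}
```

For example, the six-character token "Stra\u00dfe" occupies seven bytes in UTF-8 but six bytes in ISO-8859-1, so whether a token fits depends on the model encoding.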
public void setStrictMode(boolean strictMode)
Set the strict mode. In strict mode, an IllegalArgumentException is thrown when the
token sent to TreeTagger and the token returned from it are not equal. Since TreeTagger
returns "?" for characters it does not know, the question mark is interpreted as a
wildcard when testing for equality.
Parameters: strictMode - on/off.
public boolean isStrictMode()
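The wildcard comparison used in strict mode can be sketched as follows. It assumes that each "?" in the returned token stands for exactly one character of the sent token; the actual comparison in this class may differ, for instance for characters that are encoded as several bytes. The class StrictModeSketch and its methods are hypothetical.

```java
final class StrictModeSketch {
    // True if 'returned' equals 'sent' when each '?' in 'returned' matches any one character.
    static boolean matches(String sent, String returned) {
        if (sent.length() != returned.length()) {
            return false;
        }
        for (int i = 0; i < sent.length(); i++) {
            char r = returned.charAt(i);
            if (r != '?' && r != sent.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    // Mirrors the documented behavior: a mismatch is reported as an IllegalArgumentException.
    static void check(String sent, String returned) {
        if (!matches(sent, returned)) {
            throw new IllegalArgumentException(
                "Token mismatch: sent [" + sent + "] but received [" + returned + "]");
        }
    }
}
```

A consequence of the wildcard rule is that a token consisting only of question marks matches any sent token of the same length, so strict mode cannot detect every mix-up.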
public void setArguments(java.lang.String[] aArgs)
Parameters: aArgs - the arguments.
public java.lang.String[] getArguments()
public void setEpsilon(java.lang.Double aEpsilon)
Set minimal tag frequency to epsilon.
Parameters: aEpsilon - epsilon.
public java.lang.Double getEpsilon()
public java.lang.Double getProbabilityThreshold()
public void setProbabilityThreshold(java.lang.Double aThreshold)
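Print all tags of a word with a probability higher than X times the largest probability. Setting this to null or to a negative value disables the output of probabilities.
By default this is disabled.
Parameters: aThreshold - threshold X.
public void setHyphenHeuristics(boolean hyphenHeuristics)
Parameters: hyphenHeuristics - use hyphen heuristics.
public boolean getHyphenHeuristics()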
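The threshold rule documented for setProbabilityThreshold(Double) can be illustrated with a short sketch. The filtering itself is done by TreeTagger, so this only demonstrates the arithmetic; the class ThresholdSketch is hypothetical, and the tags and probabilities in the usage note are made-up values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ThresholdSketch {
    // Keep the tags whose probability is higher than threshold * (largest probability).
    static Map<String, Double> select(Map<String, Double> tagProbabilities, double threshold) {
        double largest = 0.0;
        for (double p : tagProbabilities.values()) {
            largest = Math.max(largest, p);
        }
        Map<String, Double> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Double> entry : tagProbabilities.entrySet()) {
            // For a threshold below 1, the most probable tag always passes this test.
            if (entry.getValue() > threshold * largest) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }
}
```

With probabilities NN = 0.7, VB = 0.2 and JJ = 0.05 and a threshold of 0.1, the cutoff is about 0.07, so NN and VB are kept and JJ is dropped.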
public void setModelProvider(ModelResolver aModelProvider)
Parameters: aModelProvider - a model resolver.
public ModelResolver getModelResolver()
public void setExecutableProvider(ExecutableResolver aExeProvider)
Parameters: aExeProvider - an executable resolver.
public ExecutableResolver getExecutableProvider()
public void setHandler(TokenHandler<O> aHandler)
Set a TokenHandler to receive the analyzed tokens.
Parameters: aHandler - a token handler.
public TokenHandler<O> getHandler()
public void setAdapter(TokenAdapter<O> aAdapter)
Set a TokenAdapter used to extract the token string from
the token objects passed to process(Collection). If no adapter
is set, the Object.toString() method is used.
Parameters: aAdapter - the adapter.
public TokenAdapter<O> getAdapter()
public void setPlatformDetector(PlatformDetector aPlatform)
Parameters: aPlatform - the platform information.
public PlatformDetector getPlatformDetector()
public void setModel(java.lang.String modelName)
throws java.io.IOException
Parameters: modelName - the name of the model.
Throws: java.io.IOException - if the model cannot be found.
public void setModel(Model model) throws java.io.IOException
Parameters: model - the model.
Throws: java.io.IOException - if the model cannot be found.
public Model getModel()
public void destroy()
protected void finalize()
throws java.lang.Throwable
Overrides: finalize in class java.lang.Object
Throws: java.lang.Throwable
public void process(O[] aTokenList) throws java.io.IOException, TreeTaggerException
Parameters: aTokenList - the token objects.
Throws: java.io.IOException - if there is a problem providing the model or executable.
Throws: TreeTaggerException - if there is a problem communicating with TreeTagger.
public void process(java.util.Collection<O> aTokenList) throws java.io.IOException, TreeTaggerException
Parameters: aTokenList - the token objects.
Throws: java.io.IOException - if there is a problem providing the model or executable.
Throws: TreeTaggerException - if there is a problem communicating with TreeTagger.
protected java.util.Collection<O> removeProblematicTokens(java.util.Collection<O> tokenList) throws java.io.UnsupportedEncodingException
Parameters: tokenList - the original list of tokens.
Throws: java.io.UnsupportedEncodingException - if the model specifies an unsupported encoding.
public java.lang.String getStatus()
public int getRestartCount()
Copyright © 2014. All Rights Reserved.