java.lang.Object
  org.annolab.tt4j.TreeTaggerWrapper<O>

Type Parameters: O - the token type.

public class TreeTaggerWrapper<O>
Main TreeTagger wrapper class. One TreeTagger process is created and
maintained for each instance of this class. The associated process is
terminated and restarted automatically if the model is changed
(setModel(String)). Otherwise, once started, the process keeps running
in the background, which saves a lot of time. While it is not used, the
process remains dormant and consumes some memory, but no CPU.
During analysis, two threads are used to communicate with the TreeTagger. One thread writes tokens to the TreeTagger process, while the other receives the analyzed tokens.
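The two-thread arrangement described above can be sketched with standard library classes only. This is an illustration of the general pattern, not the wrapper's actual implementation: the class name TwoThreadExchange, the fakeTagger stand-in (which replaces the real TreeTagger process) and the dummy "TAG" output are all assumptions made for this example. Writing and reading on separate threads keeps either side from blocking forever when a pipe buffer fills up.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class TwoThreadExchange {

    /** Hypothetical stand-in for the external tagger: answers every input line with a dummy tag. */
    static void fakeTagger(InputStream in, OutputStream out) throws IOException {
        try (BufferedReader r = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
             PrintWriter w = new PrintWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8))) {
            String line;
            while ((line = r.readLine()) != null) {
                w.println(line + "\tTAG");
                w.flush();
            }
        }
    }

    /** Sends the tokens on one thread while a second thread collects the results. */
    static List<String> exchange(List<String> tokens) {
        ExecutorService pool = Executors.newFixedThreadPool(3);
        try {
            PipedOutputStream toTagger = new PipedOutputStream();
            PipedInputStream taggerStdin = new PipedInputStream(toTagger);
            PipedOutputStream taggerStdout = new PipedOutputStream();
            PipedInputStream fromTagger = new PipedInputStream(taggerStdout);

            // The simulated process runs on its own thread.
            Future<?> tagger = pool.submit(() -> {
                fakeTagger(taggerStdin, taggerStdout);
                return null;
            });
            // Writer thread: one token per line, then close to signal end of input.
            Future<?> writer = pool.submit(() -> {
                try (PrintWriter w = new PrintWriter(new OutputStreamWriter(toTagger, StandardCharsets.UTF_8))) {
                    for (String token : tokens) {
                        w.println(token);
                    }
                }
                return null;
            });
            // Reader thread: collects analyzed lines until the output is closed.
            Future<List<String>> reader = pool.submit(() -> {
                List<String> lines = new ArrayList<>();
                try (BufferedReader r = new BufferedReader(new InputStreamReader(fromTagger, StandardCharsets.UTF_8))) {
                    String line;
                    while ((line = r.readLine()) != null) {
                        lines.add(line);
                    }
                }
                return lines;
            });
            writer.get();
            tagger.get();
            return reader.get();
        } catch (Exception e) {
            throw new IllegalStateException("exchange failed", e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(exchange(Arrays.asList("This", "is", "a", "test", ".")));
    }
}
```

With a real external process, the piped streams would be replaced by the input and output streams of a java.lang.Process.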
For easy integration into applications, this class takes any object containing
token information and uses either its Object.toString() method or
a TokenAdapter set using setAdapter(TokenAdapter) to extract
the actual token. To receive the analyzed tokens, set a custom
TokenHandler using setHandler(TokenHandler).
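The adapter-or-toString() fallback described above can be illustrated without the library. In this minimal sketch the TextAdapter interface, the Word class and the extractTexts helper are hypothetical stand-ins invented for the example; they are not the tt4j TokenAdapter interface or any code from the wrapper.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class TokenTextExtraction {

    /** Hypothetical stand-in for an adapter; not the tt4j TokenAdapter interface itself. */
    interface TextAdapter<T> {
        String textOf(T token);
    }

    /** Sample application token type that carries more than just the text. */
    static final class Word {
        final String text;
        final int offset;

        Word(String text, int offset) {
            this.text = text;
            this.offset = offset;
        }
    }

    /** Uses the adapter when one is given, otherwise falls back to toString(). */
    static <T> List<String> extractTexts(List<T> tokens, TextAdapter<T> adapter) {
        List<String> texts = new ArrayList<>();
        for (T token : tokens) {
            texts.add(adapter != null ? adapter.textOf(token) : token.toString());
        }
        return texts;
    }

    public static void main(String[] args) {
        List<Word> words = Arrays.asList(new Word("This", 0), new Word("test", 5));
        // With an adapter: the text field is used.
        System.out.println(extractTexts(words, w -> w.text));
        // Without an adapter: toString() is used, which suits plain strings.
        System.out.println(extractTexts(Arrays.asList("plain", "strings"), null));
    }
}
```

Keeping the original object type means a handler can later receive the same Word instance back together with its analysis, which is presumably why the wrapper is generic in the token type.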
By default, the TreeTagger executable is searched for in the directories
indicated by the system property treetagger.home and the
environment variables TREETAGGER_HOME and TAGDIR,
in this order. The setModel(String) method expects the full path to a
model file, optionally followed by a : and the model encoding.
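The lookup order for the executable can be written down as a small helper. This is only a sketch of the documented order: the class TreeTaggerHomeLookup and the method candidateDirs are hypothetical names, and skipping unset or empty entries is an assumption, not confirmed behavior of the default resolver.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

final class TreeTaggerHomeLookup {

    /** Returns the configured directories in the documented order, skipping unset entries. */
    static List<String> candidateDirs(Function<String, String> sysProps, Function<String, String> env) {
        String[] ordered = {
            sysProps.apply("treetagger.home"), // 1. system property
            env.apply("TREETAGGER_HOME"),      // 2. environment variable
            env.apply("TAGDIR")                // 3. environment variable
        };
        List<String> dirs = new ArrayList<>();
        for (String dir : ordered) {
            if (dir != null && !dir.isEmpty()) {
                dirs.add(dir);
            }
        }
        return dirs;
    }

    public static void main(String[] args) {
        // Reads the real system properties and environment of this JVM.
        System.out.println(candidateDirs(System::getProperty, System::getenv));
    }
}
```

Passing the two lookups in as functions keeps the helper testable with plain maps instead of the real process environment.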
For additional flexibility, register a custom ExecutableResolver
using setExecutableProvider(ExecutableResolver) or a custom
ModelResolver using setModelProvider(ModelResolver). Custom
providers may extract models and executables from archives or download them
from some location and install them in the file system, either temporarily
or permanently. A custom model resolver may also be used to resolve a
language code (e.g. en) to a particular model.
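The idea of resolving a language code to a model can be sketched as a plain lookup table. The class LanguageModelLookup below is hypothetical and deliberately does not implement the ModelResolver interface, whose methods are not shown on this page; the German model path is likewise an invented placeholder.

```java
import java.util.HashMap;
import java.util.Map;

final class LanguageModelLookup {

    private final Map<String, String> modelsByLanguage = new HashMap<>();

    /** Registers a model specification (path plus optional ":encoding") for a language code. */
    void register(String languageCode, String modelSpec) {
        modelsByLanguage.put(languageCode, modelSpec);
    }

    /** Returns the registered model for a known code; any other name is passed through unchanged. */
    String resolve(String nameOrCode) {
        return modelsByLanguage.getOrDefault(nameOrCode, nameOrCode);
    }

    public static void main(String[] args) {
        LanguageModelLookup lookup = new LanguageModelLookup();
        lookup.register("en", "/treetagger/models/english.par:iso8859-1");
        lookup.register("de", "/treetagger/models/german.par:iso8859-1"); // hypothetical path
        System.out.println(lookup.resolve("en"));
        System.out.println(lookup.resolve("/some/other/model.par"));
    }
}
```

A real resolver would wrap a lookup like this and turn the resulting path into the Model object the wrapper expects.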
A simple illustration of how to use this class:
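    // asList requires: import static java.util.Arrays.asList;
    TreeTaggerWrapper<String> tt = new TreeTaggerWrapper<String>();
    try {
        // Model path, followed by ":" and the model encoding.
        tt.setModel("/treetagger/models/english.par:iso8859-1");
        tt.setHandler(new TokenHandler<String>() {
            // Called for each analyzed token.
            public void token(String token, String pos, String lemma) {
                System.out.println(token + "\t" + pos + "\t" + lemma);
            }
        });
        tt.process(asList(new String[] {"This", "is", "a", "test", "."}));
    }
    finally {
        // Stop the TreeTagger process.
        tt.destroy();
    }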
| Field Summary | | |
|---|---|---|
| static int | MAX_POSSIBLE_TOKEN_LENGTH | This is the maximal token size that TreeTagger on OS X supports (empirically determined). |
| static boolean | TRACE | |
| Constructor Summary | |
|---|---|
| TreeTaggerWrapper() | |
| Method Summary | | |
|---|---|---|
| void | destroy() | Stop the TreeTagger process and clean up the model and executable. |
| protected void | finalize() | |
| TokenAdapter<O> | getAdapter() | Get the current token adapter. |
| String[] | getArguments() | |
| Double | getEpsilon() | Get the minimal tag frequency. |
| ExecutableResolver | getExecutableProvider() | Get the current executable resolver. |
| TokenHandler<O> | getHandler() | Get the current token handler. |
| boolean | getHyphenHeuristics() | Get hyphen heuristics mode setting. |
| int | getMaximumTokenLength() | Get the maximum number of bytes allowed in a token. |
| Model | getModel() | Get the currently set model. |
| ModelResolver | getModelResolver() | Get the current model resolver. |
| boolean | getPerformanceMode() | Get performance mode state. |
| PlatformDetector | getPlatformDetector() | Get platform information. |
| Double | getProbabilityThreshold() | |
| int | getRestartCount() | Get the number of times a TreeTagger process was started. |
| String | getStatus() | |
| boolean | isStrictMode() | Get the strict mode state. |
| void | process(Collection<O> aTokenList) | Process the given list of token objects. |
| void | process(O[] aTokenList) | Process the given array of token objects. |
| protected Collection<O> | removeProblematicTokens(Collection<O> tokenList) | Filter out tokens that cause problems when communicating with the TreeTagger process. |
| void | setAdapter(TokenAdapter<O> aAdapter) | Set a TokenAdapter used to extract the token string from the token objects passed to process(Collection). |
| void | setArguments(String[] aArgs) | Set the arguments that are passed to the TreeTagger executable. |
| void | setEpsilon(Double aEpsilon) | Set the minimal tag frequency to epsilon. |
| void | setExecutableProvider(ExecutableResolver aExeProvider) | Set a custom executable resolver. |
| void | setHandler(TokenHandler<O> aHandler) | Set a TokenHandler to receive the analyzed tokens. |
| void | setHyphenHeuristics(boolean hyphenHeuristics) | Turn on the heuristics for guessing the parts of speech of unknown hyphenated words. |
| void | setMaximumTokenLength(int maximumTokenLength) | Set the maximal number of bytes allowed in a token. |
| void | setModel(String modelName) | Load the model with the given name. |
| void | setModelProvider(ModelResolver aModelProvider) | Set a custom model resolver. |
| void | setPerformanceMode(boolean performanceMode) | Disable some sanity checks, e.g. whether tokens contain line breaks (which is not allowed). |
| void | setPlatformDetector(PlatformDetector aPlatform) | Set platform information. |
| void | setProbabilityThreshold(Double aThreshold) | Print all tags of a word with a probability higher than X times the largest probability. |
| void | setStrictMode(boolean strictMode) | Set the strict mode. |
| Methods inherited from class java.lang.Object |
|---|
| clone, equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Field Detail |
|---|
public static boolean TRACE
public static final int MAX_POSSIBLE_TOKEN_LENGTH
| Constructor Detail |
|---|
public TreeTaggerWrapper()
| Method Detail |
|---|
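public void setPerformanceMode(boolean performanceMode)
Disable some sanity checks, e.g. whether tokens contain line breaks (which is not allowed).
Parameters: performanceMode - on/off.

public boolean getPerformanceMode()
Get performance mode state.
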
public void setMaximumTokenLength(int maximumTokenLength)
Set the maximal number of bytes allowed in a token. The maximum is
MAX_POSSIBLE_TOKEN_LENGTH and the length set is automatically
capped to that number. Note that this is the size in bytes, not the size in characters.
Parameters: maximumTokenLength - the maximal number of bytes allowed in a token.

public int getMaximumTokenLength()
Get the maximum number of bytes allowed in a token.

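The difference between a token's size in bytes and its size in characters, and the capping of the configured limit, can be illustrated as follows. This is a sketch only: ASSUMED_MAX_POSSIBLE is a made-up placeholder because the real value of MAX_POSSIBLE_TOKEN_LENGTH is not given on this page, and the helper names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class TokenLengthLimit {

    /** Placeholder only; the actual MAX_POSSIBLE_TOKEN_LENGTH value is not stated in this document. */
    static final int ASSUMED_MAX_POSSIBLE = 1000;

    /** Caps a requested limit to the assumed maximum. */
    static int cap(int requestedLimit) {
        return Math.min(requestedLimit, ASSUMED_MAX_POSSIBLE);
    }

    /** Size of the token in bytes once encoded with the given charset. */
    static int encodedLength(String token, Charset charset) {
        return token.getBytes(charset).length;
    }

    /** True if the encoded token does not exceed the byte limit. */
    static boolean fits(String token, Charset charset, int maxBytes) {
        return encodedLength(token, charset) <= maxBytes;
    }

    public static void main(String[] args) {
        String token = "na\u00efve"; // "naive" with a diaeresis on the i
        System.out.println(token.length());                                      // 5 characters
        System.out.println(encodedLength(token, StandardCharsets.UTF_8));        // 6 bytes
        System.out.println(encodedLength(token, StandardCharsets.ISO_8859_1));   // 5 bytes
        System.out.println(cap(5000));
    }
}
```

The same token can therefore pass or fail a byte limit depending on the model encoding in use.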
public void setStrictMode(boolean strictMode)
Set the strict mode. In strict mode, an IllegalArgumentException is thrown when the
token sent to TreeTagger and the token returned from it are not equal. Since the TreeTagger
returns "?" for characters it does not know, the question mark is interpreted as a
wildcard when testing for equality.
Parameters: strictMode - on/off.

public boolean isStrictMode()
Get the strict mode state.

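The wildcard comparison used in strict mode might look roughly like the sketch below. The class StrictTokenCheck and its methods are hypothetical, and the assumption that each "?" stands for exactly one character (rather than one byte or a run of characters) is mine, not something this page states.

```java
final class StrictTokenCheck {

    /** Compares both tokens, treating '?' in the returned token as matching any single character. */
    static boolean matches(String sent, String returned) {
        if (sent.length() != returned.length()) {
            return false;
        }
        for (int i = 0; i < sent.length(); i++) {
            char r = returned.charAt(i);
            if (r != '?' && r != sent.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    /** Mirrors the documented strict-mode reaction to a mismatch. */
    static void check(String sent, String returned) {
        if (!matches(sent, returned)) {
            throw new IllegalArgumentException(
                "Token sent [" + sent + "] differs from token returned [" + returned + "]");
        }
    }

    public static void main(String[] args) {
        System.out.println(matches("caf\u00e9", "caf?")); // unknown character replaced by '?'
        System.out.println(matches("test", "text"));     // genuine mismatch
    }
}
```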
public void setArguments(String[] aArgs)
Set the arguments that are passed to the TreeTagger executable.
Parameters: aArgs - the arguments.

public String[] getArguments()

public void setEpsilon(Double aEpsilon)
Set the minimal tag frequency to epsilon.
Parameters: aEpsilon - epsilon.

public Double getEpsilon()
Get the minimal tag frequency.

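public Double getProbabilityThreshold()

public void setProbabilityThreshold(Double aThreshold)
Print all tags of a word with a probability higher than X times the largest probability.
Setting this to null or to a negative value disables the output of probabilities.
By default this is disabled.
Parameters: aThreshold - threshold X.

public void setHyphenHeuristics(boolean hyphenHeuristics)
Turn on the heuristics for guessing the parts of speech of unknown hyphenated words.
Parameters: hyphenHeuristics - use hyphen heuristics.

public boolean getHyphenHeuristics()
Get hyphen heuristics mode setting.
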
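The probability threshold rule ("higher than X times the largest probability") can be worked through on a small example. The sketch below models the rule as described for setProbabilityThreshold(Double); the selection itself happens inside TreeTagger, so the class ThresholdFilter, its methods and the sample probabilities are purely illustrative assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ThresholdFilter {

    /** Per the description, null or a negative value disables the probability output. */
    static boolean isEnabled(Double threshold) {
        return threshold != null && threshold >= 0;
    }

    /** Keeps the tags whose probability is higher than threshold times the largest probability. */
    static Map<String, Double> select(Map<String, Double> tagProbabilities, double threshold) {
        double largest = 0.0;
        for (double p : tagProbabilities.values()) {
            largest = Math.max(largest, p);
        }
        double cutoff = threshold * largest;
        Map<String, Double> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : tagProbabilities.entrySet()) {
            if (e.getValue() > cutoff) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Double> probs = new LinkedHashMap<>();
        probs.put("NN", 0.7); // made-up probabilities for one word
        probs.put("VB", 0.2);
        probs.put("JJ", 0.1);
        // Cutoff is 0.25 * 0.7 = 0.175, so NN and VB are kept and JJ is dropped.
        System.out.println(select(probs, 0.25).keySet());
    }
}
```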
public void setModelProvider(ModelResolver aModelProvider)
Set a custom model resolver.
Parameters: aModelProvider - a model resolver.

public ModelResolver getModelResolver()
Get the current model resolver.

public void setExecutableProvider(ExecutableResolver aExeProvider)
Set a custom executable resolver.
Parameters: aExeProvider - an executable resolver.

public ExecutableResolver getExecutableProvider()
Get the current executable resolver.

public void setHandler(TokenHandler<O> aHandler)
Set a TokenHandler to receive the analyzed tokens.
Parameters: aHandler - a token handler.

public TokenHandler<O> getHandler()
Get the current token handler.

public void setAdapter(TokenAdapter<O> aAdapter)
Set a TokenAdapter used to extract the token string from
the token objects passed to process(Collection). If no adapter
is set, the Object.toString() method is used.
Parameters: aAdapter - the adapter.

public TokenAdapter<O> getAdapter()
Get the current token adapter.

public void setPlatformDetector(PlatformDetector aPlatform)
Set platform information.
Parameters: aPlatform - the platform information.

public PlatformDetector getPlatformDetector()
Get platform information.

public void setModel(String modelName)
              throws IOException
Load the model with the given name.
Parameters: modelName - the name of the model.
Throws: IOException - if the model cannot be found.

public Model getModel()
Get the currently set model.

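The model name format accepted by setModel(String), a path optionally followed by ":" and an encoding, can be split as in the sketch below. How the library actually parses the name is not shown on this page; the class ModelSpec, splitting at the last colon, and ignoring a colon at index 0 or 1 (so that a Windows drive letter is not mistaken for the separator) are all assumptions of this example.

```java
final class ModelSpec {

    final String path;
    final String encoding; // null when the specification carries no encoding

    private ModelSpec(String path, String encoding) {
        this.path = path;
        this.encoding = encoding;
    }

    /** Splits "path:encoding" at the last colon; a colon at index 0 or 1 is not treated as a separator. */
    static ModelSpec parse(String spec) {
        int sep = spec.lastIndexOf(':');
        if (sep <= 1) {
            // No colon, or only a drive-letter colon as in C:\models\english.par
            return new ModelSpec(spec, null);
        }
        return new ModelSpec(spec.substring(0, sep), spec.substring(sep + 1));
    }

    public static void main(String[] args) {
        ModelSpec spec = parse("/treetagger/models/english.par:iso8859-1");
        System.out.println(spec.path);
        System.out.println(spec.encoding);
    }
}
```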
public void destroy()
Stop the TreeTagger process and clean up the model and executable.

protected void finalize()
                 throws Throwable
Overrides: finalize in class Object
Throws: Throwable

public void process(O[] aTokenList)
             throws IOException,
                    TreeTaggerException
Process the given array of token objects.
Parameters: aTokenList - the token objects.
Throws:
IOException - if there is a problem providing the model or executable.
TreeTaggerException - if there is a problem communicating with TreeTagger.

public void process(Collection<O> aTokenList)
             throws IOException,
                    TreeTaggerException
Process the given list of token objects.
Parameters: aTokenList - the token objects.
Throws:
IOException - if there is a problem providing the model or executable.
TreeTaggerException - if there is a problem communicating with TreeTagger.

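protected Collection<O> removeProblematicTokens(Collection<O> tokenList)
                                         throws UnsupportedEncodingException
Filter out tokens that cause problems when communicating with the TreeTagger process.
Parameters: tokenList - the original list of tokens.
Throws: UnsupportedEncodingException

public String getStatus()

public int getRestartCount()
Get the number of times a TreeTagger process was started.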