org.annolab.tt4j
Class TreeTaggerWrapper<O>

java.lang.Object
  extended by org.annolab.tt4j.TreeTaggerWrapper<O>
Type Parameters:
O - the token type.

public class TreeTaggerWrapper<O>
extends Object

Main TreeTagger wrapper class. One TreeTagger process is created and maintained for each instance of this class. The associated process is terminated and restarted automatically if the model is changed (setModel(String)). Otherwise, once started, the process keeps running in the background, which saves a lot of time. While it is not in use, the process remains dormant and consumes some memory but no CPU.

During analysis, two threads are used to communicate with the TreeTagger process. One thread writes tokens to the TreeTagger process, while the other receives the analyzed tokens.
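
The writer/reader arrangement described above can be illustrated with a minimal, self-contained sketch. It is not TT4J's implementation: the class name TwoThreadSketch and the method roundTrip are hypothetical, and a JDK pipe stands in for the external TreeTagger process, so the "analysis" simply returns each token unchanged.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class TwoThreadSketch {
    /** Sends tokens through a pipe on one thread and collects the lines on another. */
    static List<String> roundTrip(List<String> tokens) {
        List<String> received = Collections.synchronizedList(new ArrayList<String>());
        try {
            // Assumption: the pipe stands in for the external process and echoes its input.
            PipedOutputStream toProcess = new PipedOutputStream();
            PipedInputStream fromProcess = new PipedInputStream(toProcess);

            // Writer thread: one token per line, then close to signal end of input.
            Thread writer = new Thread(() -> {
                try (Writer out = new OutputStreamWriter(toProcess, StandardCharsets.UTF_8)) {
                    for (String token : tokens) {
                        out.write(token);
                        out.write('\n');
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });

            // Reader thread: consume lines until the writing side is closed.
            Thread reader = new Thread(() -> {
                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(fromProcess, StandardCharsets.UTF_8))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        received.add(line);
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });

            reader.start();
            writer.start();
            writer.join();
            reader.join();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return received;
    }
}
```

Reading on a separate thread means the writer can never block indefinitely on a full output buffer while nobody is consuming the results, which is the usual reason for splitting the two directions of communication.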

For easy integration into applications, this class takes any object containing token information and uses either its Object.toString() method or a TokenAdapter set using setAdapter(TokenAdapter) to extract the actual token. To receive the analyzed tokens, set a custom TokenHandler using setHandler(TokenHandler).

By default, the TreeTagger executable is searched for in the directories indicated by the system property treetagger.home and the environment variables TREETAGGER_HOME and TAGDIR, in this order. The setModel(String) method expects a full path to a model file, optionally followed by a : and the model encoding.
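
The lookup order for the TreeTagger directory can be sketched as follows. This is an illustration of the order stated above, not the resolver TT4J ships: the class TreeTaggerHomeSketch and the method resolveHome are hypothetical names, and the two maps stand in for the JVM system properties and the process environment so that the logic can be tried without changing real settings.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class TreeTaggerHomeSketch {
    /** Returns the first configured location, or null if none is set. */
    static String resolveHome(Map<String, String> systemProperties,
                              Map<String, String> environment) {
        // Candidates in the documented order of precedence.
        List<String> candidates = Arrays.asList(
                systemProperties.get("treetagger.home"),
                environment.get("TREETAGGER_HOME"),
                environment.get("TAGDIR"));
        for (String candidate : candidates) {
            if (candidate != null && !candidate.isEmpty()) {
                return candidate;
            }
        }
        return null;
    }
}
```

In an application, the same values would come from System.getProperty("treetagger.home") and System.getenv(), for example by launching the JVM with -Dtreetagger.home=... or exporting TREETAGGER_HOME beforehand.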

For additional flexibility, register a custom ExecutableResolver using setExecutableProvider(ExecutableResolver) or a custom ModelResolver using setModelProvider(ModelResolver). Custom resolvers may extract models and executables from archives or download them from some location, and install them temporarily or permanently in the file system. A custom model resolver may also be used to resolve a language code (e.g. en) to a particular model.

A simple illustration of how to use this class:

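 // Assumes: import static java.util.Arrays.asList;
 TreeTaggerWrapper<String> tt = new TreeTaggerWrapper<String>();
 try {
     tt.setModel("/treetagger/models/english.par:iso8859-1");
     tt.setHandler(new TokenHandler<String>() {
         // Called once for each analyzed token.
         public void token(String token, String pos, String lemma) {
             System.out.println(token + "\t" + pos + "\t" + lemma);
         }
     });
     tt.process(asList(new String[] {"This", "is", "a", "test", "."}));
 }
 finally {
     // Stop the TreeTagger process and clean up.
     tt.destroy();
 }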
 

Author:
Richard Eckart de Castilho

Field Summary
static int MAX_POSSIBLE_TOKEN_LENGTH
          This is the maximal token size that TreeTagger on OS X supports (empirically determined).
static boolean TRACE
           
 
Constructor Summary
TreeTaggerWrapper()
           
 
Method Summary
 void destroy()
          Stop the TreeTagger process and clean up the model and executable.
protected  void finalize()
           
 TokenAdapter<O> getAdapter()
          Get the current token adapter.
 String[] getArguments()
           
 Double getEpsilon()
          Get minimal tag frequency.
 ExecutableResolver getExecutableProvider()
          Get the current executable resolver.
 TokenHandler<O> getHandler()
          Get the current token handler.
 boolean getHyphenHeuristics()
          Get hyphen heuristics mode setting.
 int getMaximumTokenLength()
          Get the maximum number of bytes allowed in a token.
 Model getModel()
          Get the currently set model.
 ModelResolver getModelResolver()
          Get the current model resolver.
 boolean getPerformanceMode()
          Get performance mode state.
 PlatformDetector getPlatformDetector()
          Get platform information.
 Double getProbabilityThreshold()
           
 int getRestartCount()
          Get the number of times a TreeTagger process was started.
 String getStatus()
           
 boolean isStrictMode()
          Get the strict mode state.
 void process(Collection<O> aTokenList)
          Process the given list of token objects.
 void process(O[] aTokenList)
          Process the given array of token objects.
protected  Collection<O> removeProblematicTokens(Collection<O> tokenList)
          Filter out tokens that cause problems when communicating with the TreeTagger process.
 void setAdapter(TokenAdapter<O> aAdapter)
          Set a TokenAdapter used to extract the token string from the token objects passed to process(Collection).
 void setArguments(String[] aArgs)
          Set the arguments that are passed to the TreeTagger executable.
 void setEpsilon(Double aEpsilon)
          Set minimal tag frequency to epsilon.
 void setExecutableProvider(ExecutableResolver aExeProvider)
          Set a custom executable resolver.
 void setHandler(TokenHandler<O> aHandler)
          Set a TokenHandler to receive the analyzed tokens.
 void setHyphenHeuristics(boolean hyphenHeuristics)
          Turn on the heuristics for guessing the parts of speech of unknown hyphenated words.
 void setMaximumTokenLength(int maximumTokenLength)
          Set the maximal number of bytes allowed in a token.
 void setModel(String modelName)
          Load the model with the given name.
 void setModelProvider(ModelResolver aModelProvider)
          Set a custom model resolver.
 void setPerformanceMode(boolean performanceMode)
          Disable some sanity checks, e.g. whether tokens contain line breaks (which is not allowed).
 void setPlatformDetector(PlatformDetector aPlatform)
          Set platform information.
 void setProbabilityThreshold(Double aThreshold)
          Print all tags of a word with a probability higher than X times the largest probability.
 void setStrictMode(boolean strictMode)
          Set the strict mode.
 
Methods inherited from class java.lang.Object
clone, equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

TRACE

public static boolean TRACE

MAX_POSSIBLE_TOKEN_LENGTH

public static final int MAX_POSSIBLE_TOKEN_LENGTH
This is the maximal token size that TreeTagger on OS X supports (empirically determined).

See Also:
Constant Field Values
Constructor Detail

TreeTaggerWrapper

public TreeTaggerWrapper()
Method Detail

setPerformanceMode

public void setPerformanceMode(boolean performanceMode)
Disable some sanity checks, e.g. whether tokens contain line breaks (which is not allowed). Turning this on will increase your performance, but the wrapper may throw exceptions if illegal data is provided.

Parameters:
performanceMode - on/off.

getPerformanceMode

public boolean getPerformanceMode()
Get performance mode state.

Returns:
performance mode state.

setMaximumTokenLength

public void setMaximumTokenLength(int maximumTokenLength)
Set the maximal number of bytes allowed in a token. The maximal supported token length is determined by MAX_POSSIBLE_TOKEN_LENGTH, and the length set is automatically capped to that number. Note that this is the size in bytes, not the size in characters.

Parameters:
maximumTokenLength - the maximal number of bytes allowed in a token.
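
Because the limit is counted in bytes, the same token can fit or not fit depending on the encoding. The sketch below shows one way to measure this with plain JDK calls; the class TokenByteLengthSketch and its methods are hypothetical and are not how TT4J necessarily performs the check.

```java
import java.nio.charset.Charset;

public class TokenByteLengthSketch {
    /** Number of bytes the token occupies in the given encoding. */
    static int byteLength(String token, Charset encoding) {
        return token.getBytes(encoding).length;
    }

    /** True if the encoded token does not exceed the byte limit. */
    static boolean fits(String token, Charset encoding, int maxBytes) {
        return byteLength(token, encoding) <= maxBytes;
    }
}
```

For example, the five-character token "naïve" occupies 6 bytes in UTF-8 but 5 bytes in ISO-8859-1, so a limit chosen with characters in mind may be too small for a multi-byte encoding.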

getMaximumTokenLength

public int getMaximumTokenLength()
Get the maximum number of bytes allowed in a token.

Returns:
the maximum number of bytes allowed in a token.

setStrictMode

public void setStrictMode(boolean strictMode)
Set the strict mode. In this mode an IllegalArgumentException is thrown when the token sent to TreeTagger and the token returned from it are not equal. Since the TreeTagger returns "?" for characters it does not know, the question mark is interpreted as a wild card when testing for equality.

Parameters:
strictMode - on/off.
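
The wildcard comparison can be pictured with the following sketch. It is a simplified illustration only: the class StrictModeSketch and the method matches are hypothetical, and the code assumes that TreeTagger returns exactly one "?" per unknown character, which may not hold for every encoding.

```java
public class StrictModeSketch {
    /** Compares the sent and returned token, treating '?' in the returned token as a wild card. */
    static boolean matches(String sent, String returned) {
        if (sent.length() != returned.length()) {
            return false;
        }
        for (int i = 0; i < sent.length(); i++) {
            char r = returned.charAt(i);
            // A question mark matches any character; anything else must be identical.
            if (r != '?' && r != sent.charAt(i)) {
                return false;
            }
        }
        return true;
    }
}
```

Under this sketch, "café" sent and "caf?" returned count as equal, while "test" and "text" do not and would trigger the exception in strict mode.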

isStrictMode

public boolean isStrictMode()
Get the strict mode state.

Returns:
strict mode state.

setArguments

public void setArguments(String[] aArgs)
Set the arguments that are passed to the TreeTagger executable. A call to this method will cause a running TreeTagger process to be shut down and restarted with the new arguments. Using this method can cause TT4J to stop working: TT4J expects TreeTagger to print a set of lines, each containing three tokens separated by spaces.

Parameters:
aArgs - the arguments.

getArguments

public String[] getArguments()

setEpsilon

public void setEpsilon(Double aEpsilon)
Set minimal tag frequency to epsilon.

Parameters:
aEpsilon - epsilon.

getEpsilon

public Double getEpsilon()
Get minimal tag frequency.

Returns:
epsilon.

getProbabilityThreshold

public Double getProbabilityThreshold()

setProbabilityThreshold

public void setProbabilityThreshold(Double aThreshold)
Print all tags of a word with a probability higher than X times the largest probability. Setting this to null or to a negative value disables the output of probabilities. Per default this is disabled.

Parameters:
aThreshold - threshold X.
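
The effect of the threshold X can be illustrated with a small sketch. The filtering itself is done by TreeTagger, not by Java code; the class ProbabilityThresholdSketch and the method filter are hypothetical, and the strict "greater than" comparison is an assumption taken from the wording above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ProbabilityThresholdSketch {
    /** Keeps the tags whose probability is higher than x times the largest probability. */
    static Map<String, Double> filter(Map<String, Double> tagProbabilities, double x) {
        double largest = 0.0;
        for (double p : tagProbabilities.values()) {
            largest = Math.max(largest, p);
        }
        // Preserve the original order of the tags that pass the cutoff.
        Map<String, Double> kept = new LinkedHashMap<String, Double>();
        for (Map.Entry<String, Double> e : tagProbabilities.entrySet()) {
            if (e.getValue() > x * largest) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }
}
```

With probabilities NN = 0.7, VB = 0.25 and JJ = 0.05 and X = 0.1, the cutoff is 0.07, so NN and VB are kept and JJ is dropped.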

setHyphenHeuristics

public void setHyphenHeuristics(boolean hyphenHeuristics)
Turn on the heuristics for guessing the parts of speech of unknown hyphenated words.

Parameters:
hyphenHeuristics - use hyphen heuristics.

getHyphenHeuristics

public boolean getHyphenHeuristics()
Get hyphen heuristics mode setting.

Returns:
whether to use hyphen heuristics.

setModelProvider

public void setModelProvider(ModelResolver aModelProvider)
Set a custom model resolver.

Parameters:
aModelProvider - a model resolver.

getModelResolver

public ModelResolver getModelResolver()
Get the current model resolver.

Returns:
the current model resolver.

setExecutableProvider

public void setExecutableProvider(ExecutableResolver aExeProvider)
Set a custom executable resolver.

Parameters:
aExeProvider - an executable resolver.

getExecutableProvider

public ExecutableResolver getExecutableProvider()
Get the current executable resolver.

Returns:
the current executable resolver.

setHandler

public void setHandler(TokenHandler<O> aHandler)
Set a TokenHandler to receive the analyzed tokens.

Parameters:
aHandler - a token handler.

getHandler

public TokenHandler<O> getHandler()
Get the current token handler.

Returns:
current token handler.

setAdapter

public void setAdapter(TokenAdapter<O> aAdapter)
Set a TokenAdapter used to extract the token string from the token objects passed to process(Collection). If no adapter is set, the Object.toString() method is used.

Parameters:
aAdapter - the adapter.

getAdapter

public TokenAdapter<O> getAdapter()
Get the current token adapter.

Returns:
the current token adapter.

setPlatformDetector

public void setPlatformDetector(PlatformDetector aPlatform)
Set platform information. Also sets the platform information in the model resolver and the executable resolver.

Parameters:
aPlatform - the platform information.

getPlatformDetector

public PlatformDetector getPlatformDetector()
Get platform information.

Returns:
the platform information.

setModel

public void setModel(String modelName)
              throws IOException
Load the model with the given name.

Parameters:
modelName - the name of the model.
Throws:
IOException - if the model cannot be found.
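
With the default resolver, the model name is a path that may be followed by a : and the model encoding, as described in the class overview. The sketch below shows one way such a string could be taken apart. It is not TT4J's actual parsing: the class ModelSpecSketch and the method split are hypothetical, and the handling of colons that belong to the path (such as a Windows drive letter) is an assumption.

```java
public class ModelSpecSketch {
    /** Returns {path, encoding}; the encoding is null when the spec names none. */
    static String[] split(String spec) {
        int colon = spec.lastIndexOf(':');
        if (colon > 0 && colon < spec.length() - 1) {
            String suffix = spec.substring(colon + 1);
            // Assumption: a suffix containing a path separator is part of the path,
            // for example the remainder after a Windows drive letter.
            if (suffix.indexOf('/') < 0 && suffix.indexOf('\\') < 0) {
                return new String[] { spec.substring(0, colon), suffix };
            }
        }
        return new String[] { spec, null };
    }
}
```

Applied to "/treetagger/models/english.par:iso8859-1", the sketch yields the path "/treetagger/models/english.par" and the encoding "iso8859-1"; a spec without a trailing encoding is returned unchanged with a null encoding.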

getModel

public Model getModel()
Get the currently set model.

Returns:
the current model.

destroy

public void destroy()
Stop the TreeTagger process and clean up the model and executable.


finalize

protected void finalize()
                 throws Throwable
Overrides:
finalize in class Object
Throws:
Throwable

process

public void process(O[] aTokenList)
             throws IOException,
                    TreeTaggerException
Process the given array of token objects.

Parameters:
aTokenList - the token objects.
Throws:
IOException - if there is a problem providing the model or executable.
TreeTaggerException - if there is a problem communicating with TreeTagger.

process

public void process(Collection<O> aTokenList)
             throws IOException,
                    TreeTaggerException
Process the given list of token objects.

Parameters:
aTokenList - the token objects.
Throws:
IOException - if there is a problem providing the model or executable.
TreeTaggerException - if there is a problem communicating with TreeTagger.

removeProblematicTokens

protected Collection<O> removeProblematicTokens(Collection<O> tokenList)
                                         throws UnsupportedEncodingException
Filter out tokens that cause problems when communicating with the TreeTagger process.

Parameters:
tokenList - the original list of tokens.
Returns:
the filtered list of tokens.
Throws:
UnsupportedEncodingException

getStatus

public String getStatus()

getRestartCount

public int getRestartCount()
Get the number of times a TreeTagger process was started.

Returns:
the number of times a TreeTagger process was started.


Copyright © 2012. All Rights Reserved.