public class MaxEntNE extends TokenClassifier
A maximum entropy named entity tagger, invoked by MENameTagger to do token-level training and decoding.
Dictionary-based strategy: operating without a cut-off, so that every singleton word (and its context) becomes a separate feature, leads to a very large number of features; this makes training slow and, without smoothing, can produce poor performance. This tagger instead uses a cut-off of 4, so only words appearing 4 or more times in the training corpus become separate features.

So that information from words appearing fewer than 4 times is not lost, the tagger builds 3 dictionaries during training (WordType, WordTypeEvens, WordTypeOdds): one from all documents, one from even-numbered documents, and one from odd-numbered documents. This is done in a single pass over the documents. In a second pass, we train the MaxEnt model, using the information from WordTypeEvens as a feature for odd documents and the information from WordTypeOdds as a feature for even documents, so that the dictionary feature for a word never comes from the document currently being trained on.
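The even/odd dictionary scheme above can be sketched as follows. This is an illustrative, self-contained example, not Jet's actual implementation; the class and method names (WordTypeDictionaries, record, trainingFeature) are hypothetical.

```java
import java.util.*;

// Sketch of the even/odd dictionary strategy: one pass over the corpus
// fills three word-type dictionaries (all docs, even-numbered docs,
// odd-numbered docs). During model training, a word in an odd document
// is looked up in the dictionary built from even documents, and vice
// versa, so the dictionary feature never comes from the word's own document.
public class WordTypeDictionaries {
    final Map<String, Set<String>> wordTypeAll   = new HashMap<>();
    final Map<String, Set<String>> wordTypeEvens = new HashMap<>();
    final Map<String, Set<String>> wordTypeOdds  = new HashMap<>();

    /** Record that 'word' occurred with name type 'type' in document 'docNo'. */
    public void record(int docNo, String word, String type) {
        add(wordTypeAll, word, type);
        if (docNo % 2 == 0) add(wordTypeEvens, word, type);
        else                add(wordTypeOdds, word, type);
    }

    /** Dictionary feature for 'word' while training on document 'docNo':
     *  consult the dictionary built from documents of the other parity. */
    public Set<String> trainingFeature(int docNo, String word) {
        Map<String, Set<String>> other = (docNo % 2 == 0) ? wordTypeOdds : wordTypeEvens;
        return other.getOrDefault(word, Collections.emptySet());
    }

    private static void add(Map<String, Set<String>> dict, String word, String type) {
        dict.computeIfAbsent(word, w -> new HashSet<>()).add(type);
    }
}
```

A word seen only in even documents thus still contributes a feature when tagging odd documents, even if its total count falls below the cut-off of 4.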
| Modifier and Type | Field and Description |
|---|---|
| static double | otherOffset |
| static int | pass: pass number during training. |
| Constructor and Description |
|---|
| MaxEntNE(): create a new maximum entropy tagger. |
| Modifier and Type | Method and Description |
|---|---|
| void | createModel(): create a max ent model (at the end of training). |
| void | load(BufferedReader reader): load the information required for the MaxEntNE tagger from reader. |
| void | load(String fileName): load the information required for the MaxEntNE tagger from file fileName. |
| static void | loadWordClusters(String wordClusterFile): loads word clusters and builds a map from word to cluster. |
| void | newDocument() |
| void | resetForTraining(String featureFile): initializes the training process for the tagger. |
| String[] | simpleDecoder(Document doc, Annotation[] tokens): assign the best tag for each token using a simple deterministic left-to-right tagger (which may not find the most probable path). |
| void | store(BufferedWriter writer): write the information required for the MaxEntNE tagger to BufferedWriter writer. |
| void | store(String fileName): store the information required for the MaxEntNE tagger to file fileName. |
| void | train(Document doc, Annotation[] tokens, String[] tags): train the model on a sequence of words from Document doc. |
| String[] | viterbi(Document doc, Annotation[] tokens): assign the best tag for each token using a Viterbi decoder. |
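The contrast between simpleDecoder and viterbi can be made concrete with a small self-contained sketch (not Jet's implementation; the scoring tables are hypothetical). A greedy left-to-right pass commits to the best tag at each position given only the previous choice, while Viterbi keeps a score for every tag at every position and recovers the globally most probable path.

```java
// Toy tag decoders over hypothetical path scores.
// start[t]: score of tag t at position 0.
// trans[i][p][c]: score of tag c at position i+1 given tag p at position i.
public class DecoderSketch {
    /** Greedy left-to-right: keep only the single best tag so far
     *  (may miss the most probable path). */
    static int[] greedy(double[] start, double[][][] trans) {
        int n = trans.length + 1;
        int[] path = new int[n];
        path[0] = argmax(start);
        for (int i = 1; i < n; i++)
            path[i] = argmax(trans[i - 1][path[i - 1]]);
        return path;
    }

    /** Viterbi: track the best-scoring path into every tag at every
     *  position, then follow back-pointers from the best final tag. */
    static int[] viterbi(double[] start, double[][][] trans) {
        int n = trans.length + 1, k = start.length;
        double[][] best = new double[n][k];
        int[][] back = new int[n][k];
        best[0] = start.clone();
        for (int i = 1; i < n; i++)
            for (int cur = 0; cur < k; cur++) {
                best[i][cur] = Double.NEGATIVE_INFINITY;
                for (int prev = 0; prev < k; prev++) {
                    double s = best[i - 1][prev] * trans[i - 1][prev][cur];
                    if (s > best[i][cur]) { best[i][cur] = s; back[i][cur] = prev; }
                }
            }
        int[] path = new int[n];
        path[n - 1] = argmax(best[n - 1]);
        for (int i = n - 1; i > 0; i--) path[i - 1] = back[i][path[i]];
        return path;
    }

    static int argmax(double[] a) {
        int best = 0;
        for (int i = 1; i < a.length; i++) if (a[i] > a[best]) best = i;
        return best;
    }
}
```

With start scores {0.6, 0.4} and transition scores {{0.1, 0.1}, {0.9, 0.0}}, greedy picks tag 0 first and ends with path score 0.06, while Viterbi finds the path starting with tag 1, scoring 0.36.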
Methods inherited from class TokenClassifier: getLocalMargin, getMargin, getPathProbability, nextBest, recordMargin

Field Detail

public static int pass
pass number during training.

public static double otherOffset

Method Detail

public void resetForTraining(String featureFile)
initializes the training process for the tagger.
Parameters: featureFile - the file into which the features will be written

public void newDocument()
Overrides: newDocument in class TokenClassifier

public void train(Document doc, Annotation[] tokens, String[] tags)
train the model on a sequence of words from Document doc.
Overrides: train in class TokenClassifier
Parameters: doc - the document containing the word sequence; tokens - the token annotations for these words; tags - the token-level name tags for these words

public void createModel()
create a max ent model (at the end of training).
Overrides: createModel in class TokenClassifier

public void store(String fileName)
store the information required for the MaxEntNE tagger to file fileName. This information is the table of types for each word, and the parameters of the maximum entropy model.
Overrides: store in class TokenClassifier

public void store(BufferedWriter writer)
write the information required for the MaxEntNE tagger to BufferedWriter writer. This information is the table of types for each word, and the parameters of the maximum entropy model.

public void load(String fileName)
load the information required for the MaxEntNE tagger from file fileName. This information is the table of types for each word, and the parameters of the maximum entropy model.
Overrides: load in class TokenClassifier

public void load(BufferedReader reader)
load the information required for the MaxEntNE tagger from reader. This information is the table of types for each word, and the parameters of the maximum entropy model.

public String[] simpleDecoder(Document doc, Annotation[] tokens)
assign the best tag for each token using a simple deterministic left-to-right tagger (which may not find the most probable path).

public String[] viterbi(Document doc, Annotation[] tokens)
assign the best tag for each token using a Viterbi decoder.
Overrides: viterbi in class TokenClassifier

public static void loadWordClusters(String wordClusterFile) throws IOException
loads word clusters and builds a map from word to cluster.
Throws: IOException

Copyright © 2016 New York University. All rights reserved.