Used at the output layer when only some of the possible outputs will be needed; it exposes the penultimate layer so that its results can be cached elsewhere and passed back in, computing only selected cells of the output layer (activationsFromPenultimateDot).
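A minimal sketch of the idea, assuming the output layer is an affine map with weight matrix W and bias b (names and shapes here are illustrative, not the actual implementation): the cached penultimate activations are dotted against only the output rows that are actually needed.

```python
import numpy as np

def activations_from_penultimate(penultimate, W, b, indices):
    # Dot the cached penultimate activations against only the rows of the
    # output weight matrix we actually need, skipping all other outputs.
    return W[indices] @ penultimate + b[indices]

W = np.arange(12.0).reshape(4, 3)   # 4 possible outputs, 3 hidden units
b = np.zeros(4)
h = np.array([1.0, 0.0, 1.0])       # cached penultimate activations
partial = activations_from_penultimate(h, W, b, [0, 2])  # only outputs 0 and 2
```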
Used at the input layer to cache lookups and the result of applying the affine transform at the first layer of the network. This saves computation across repeated invocations of the network within a sentence.
Used at the input layer to cache lookups and to backprop into the embeddings.
Separate because I was having some issues...
A bit of a misnomer since this has been generalized to support linear functions as well...
Output embedding technique described in section 6 of http://www.eecs.berkeley.edu/~gdurrett/papers/durrett-klein-acl2015.pdf It learns a dictionary for the outputs as well as an affine transformation that produces the vector combined with the input in the final bilinear product.
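A rough sketch of the scoring computation, under the assumption that each output's dictionary vector is mapped through a learned affine transform (A, c) and then combined with the input representation h by a dot product; all names and shapes here are illustrative, not the paper's exact formulation.

```python
import numpy as np

def output_scores(h, D, A, c):
    # D: (num_outputs, dict_dim) learned output dictionary
    # A: (hidden_dim, dict_dim), c: (hidden_dim,) learned affine transform
    E = D @ A.T + c   # transformed output vectors, (num_outputs, hidden_dim)
    return E @ h      # bilinear combination with input h: one score per output

h = np.array([2.0, 3.0])
D = np.eye(2)                    # two outputs with one-hot dictionary vectors
A = np.eye(2)
c = np.zeros(2)
scores = output_scores(h, D, A, c)
```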
converter is used to map words into the word2vec vocabulary; this might include lowercasing, replacing numbers, converting bracket tokens like -LRB-, etc. See Word2Vec.convertWord
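A hypothetical sketch of this kind of normalization; the specific rules below (bracket mapping, digit collapsing, lowercasing) are assumptions for illustration, not the actual behavior of Word2Vec.convertWord.

```python
import re

def convert_word(word):
    # Map PTB-style bracket tokens back to plain brackets (assumed rule).
    if word == "-LRB-":
        return "("
    if word == "-RRB-":
        return ")"
    # Collapse digits to a canonical form and lowercase (assumed rules).
    word = re.sub(r"\d", "0", word)
    return word.lower()
```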
Implements batch normalization from http://arxiv.org/pdf/1502.03167v3.pdf Each unit's activations are shifted and rescaled per minibatch so that they have mean 0 and variance 1. This has been shown to help train deep networks, but doesn't seem to help here.
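A minimal sketch of the per-minibatch normalization, assuming a minibatch stored as a (batch_size, num_units) array; the learned scale (gamma) and shift (beta) parameters and the epsilon value follow the paper's formulation, but the names here are illustrative.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)                    # per-unit mean over the minibatch
    var = x.var(axis=0)                      # per-unit variance over the minibatch
    x_hat = (x - mean) / np.sqrt(var + eps)  # mean 0, variance ~1 per unit
    return gamma * x_hat + beta              # learned rescale and shift

x = np.array([[1.0, 2.0],
              [3.0, 6.0]])
out = batch_norm(x, gamma=np.ones(2), beta=np.zeros(2))
```

With gamma = 1 and beta = 0 this is a pure whitening of each unit's activations within the minibatch; the learned parameters let the network undo the normalization where that is useful.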