edu.washington.cs.knowitall.sequence
Class Encoder

java.lang.Object
  extended by edu.washington.cs.knowitall.sequence.Encoder

public class Encoder
extends Object

This class represents a table mapping tuples of strings to integer values. It is used by LayeredTokenPattern for matching patterns against LayeredSequence objects.

The core of this class is a mapping from string tuples of length n to integers 0 <= i < MAX_SIZE. The mapping is defined by a list of n sets of String symbols S_1, ..., S_n, and a special symbol UNK. The mapping assigns an integer value to each tuple (x_1, ..., x_n), where x_i is either in S_i or is the symbol UNK. For example, if n = 2 and S_1 = S_2 = {0,1}, then a possible mapping would be (0,0) => 0, (0,1) => 1, (0, UNK) => 2, (1,0) => 3, (1,1) => 4, (1,UNK) => 5, (UNK,0) => 6, (UNK,1) => 7, (UNK,UNK) => 8.

Given a String tuple (x_1, ..., x_n), it is mapped to an integer value as follows. First, it is mapped to an intermediate tuple (y_1, ..., y_n), where y_i = x_i if x_i is in S_i, otherwise y_i = UNK. Then the value of (y_1, ..., y_n) according to the mapping is returned. This procedure is implemented in the method encode(String[]), which represents tuples as String arrays.
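The two-step procedure above can be sketched as a mixed-radix number, where position i has |S_i| + 1 possible digits and UNK takes the last one. This is a minimal illustration under assumptions, not this class's implementation: the class name TupleEncodingSketch and its digit assignment are hypothetical, and the real Encoder makes no guarantee about which integer each tuple receives.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration of the tuple-to-integer mapping; not the library's Encoder. */
final class TupleEncodingSketch {
    // For position i, maps each symbol in S_i to a digit 0..|S_i|-1.
    // The digit |S_i| is reserved for UNK.
    private final List<Map<String, Integer>> digits = new ArrayList<>();

    TupleEncodingSketch(List<Set<String>> symbols) {
        for (Set<String> set : symbols) {
            Map<String, Integer> byDigit = new LinkedHashMap<>();
            for (String symbol : set) {
                byDigit.put(symbol, byDigit.size());
            }
            digits.add(byDigit);
        }
    }

    int encode(String[] tuple) {
        if (tuple.length != digits.size()) {
            throw new IllegalArgumentException("expected a tuple of length " + digits.size());
        }
        int code = 0;
        for (int i = 0; i < tuple.length; i++) {
            Map<String, Integer> byDigit = digits.get(i);
            int radix = byDigit.size() + 1;
            // Step 1: any x_i outside S_i collapses to UNK (the last digit).
            int digit = byDigit.getOrDefault(tuple[i], byDigit.size());
            // Step 2: accumulate the digits as a mixed-radix number.
            code = code * radix + digit;
        }
        return code;
    }
}
```

With two sorted sets {0, 1}, this sketch happens to reproduce the example mapping from the class description, e.g. (1, UNK) => 5 and (UNK, UNK) => 8. Any string not in S_i, such as "x", is treated the same as UNK.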

There is no guarantee on the actual integer values assigned to each tuple. The mapping cannot contain more than 2^16 entries, since each value is returned as a char. This means that the product (|S_1| + 1) * (|S_2| + 1) * ... * (|S_n| + 1) must be less than or equal to 2^16.
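The size constraint can be checked before building a table. The following is a minimal sketch, assuming the 2^16 bound stated above; TableSizeSketch, LIMIT, and the exception type are hypothetical names chosen for illustration (the real constructor reports this condition with a SequenceException).

```java
import java.util.List;
import java.util.Set;

/** Hypothetical helper that computes (|S_1|+1) * ... * (|S_n|+1) and enforces the bound. */
final class TableSizeSketch {
    static final int LIMIT = 1 << 16; // 2^16, the bound stated in the class description

    static int tableSize(List<Set<String>> symbols) {
        long product = 1; // long, so the running product cannot overflow before the check
        for (Set<String> set : symbols) {
            product *= set.size() + 1L; // +1 accounts for UNK at this position
            if (product > LIMIT) {
                throw new IllegalArgumentException(
                        "encoding table would exceed " + LIMIT + " entries");
            }
        }
        return (int) product;
    }
}
```

For the running example with S_1 = S_2 = {0,1}, the product is (2 + 1) * (2 + 1) = 9, which matches the nine tuples listed in the class description.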

Author:
afader

Field Summary
static int MAX_SIZE
          The maximum encoding size.
static String UNK
          The "unknown" symbol.
 
Constructor Summary
Encoder(List<Set<String>> symbols)
          Constructs a new encoding table using the given symbol sets.
 
Method Summary
 char encode(String[] tuple)
          Encodes the given tuple (represented as a String array) to its integer value, represented as a char.
 char[] encodeClass(int index, String value)
          Encodes a "class" of tuples that all have the symbol value in the given layer index.
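 int size()
          Returns the tuple length of this encoding table.
 int tableSize()
          Returns the number of keys in this encoding table.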
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

MAX_SIZE

public static final int MAX_SIZE
The maximum encoding size.

See Also:
Constant Field Values

UNK

public static final String UNK
The "unknown" symbol.

See Also:
Constant Field Values

Constructor Detail

Encoder

public Encoder(List<Set<String>> symbols)
        throws SequenceException
Constructs a new encoding table using the given symbol sets. These symbol sets should not contain the unknown symbol UNK.

Parameters:
symbols - the list of symbol sets S_1, ..., S_n, one set per tuple position
Throws:
SequenceException - if the symbol sets result in an encoding table larger than MAX_SIZE.

Method Detail

size

public int size()
Returns:
the tuple length of this encoding table

tableSize

public int tableSize()
Returns:
the number of keys in this encoding table

encode

public char encode(String[] tuple)
            throws SequenceException
Encodes the given tuple (represented as a String array) to its integer value, represented as a char.

Parameters:
tuple - the tuple to encode, represented as a String array
Returns:
the integer value of the array, represented as a char
Throws:
SequenceException - if unable to encode the tuple

encodeClass

public char[] encodeClass(int index,
                          String value)
                   throws SequenceException
Encodes a "class" of tuples that all have the symbol value at the given layer index. Using the example from the class description, if the encoding table contains the mappings (0,0) => 0, (0,1) => 1, (0,UNK) => 2, ..., then calling this method with index = 0 and value = 1 will return the encodings of (1,0), (1,1), and (1,UNK) as an array.

Parameters:
index - the position in the tuple (defined by the order of sets passed to the constructor)
value - the symbol that every tuple in the class has at position index
Returns:
the encodings of all tuples in the class, as a char array
Throws:
SequenceException - if the index is out of bounds, or if any of the resulting tuples cannot be encoded
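One way to picture a "class" of tuples is as every code whose digit at one position is fixed. The sketch below assumes a hypothetical mixed-radix layout in which position i has |S_i| + 1 digits and UNK is the last digit; the names EncodeClassSketch, radices, and digit are illustrative and not part of this class, whose actual integer assignment is unspecified.

```java
/** Hypothetical illustration of enumerating a class of tuples; not the library's Encoder. */
final class EncodeClassSketch {
    /**
     * radices[i] is |S_i| + 1. Returns, in ascending order, every code whose
     * digit at position index equals the given digit. Assumes the product of
     * the radices is at most 2^16, as required of an encoding table.
     */
    static int[] encodeClass(int[] radices, int index, int digit) {
        if (index < 0 || index >= radices.length) {
            throw new IndexOutOfBoundsException("index out of bounds: " + index);
        }
        if (digit < 0 || digit >= radices[index]) {
            throw new IllegalArgumentException("no such digit at position " + index);
        }
        int total = 1;
        for (int radix : radices) {
            total *= radix;
        }
        // stride = number of codes spanned by one step of the digit at `index`.
        int stride = 1;
        for (int i = index + 1; i < radices.length; i++) {
            stride *= radices[i];
        }
        int[] result = new int[total / radices[index]];
        int next = 0;
        for (int code = 0; code < total; code++) {
            if ((code / stride) % radices[index] == digit) {
                result[next++] = code;
            }
        }
        return result;
    }
}
```

Under this layout, with radices {3, 3}, index 0 and digit 1 select the codes 3, 4, and 5, which correspond to (1,0), (1,1), and (1,UNK) in the example from the class description.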


Copyright © 2010-2012 University of Washington CSE. All Rights Reserved.