morfologik.fsa
Class CFSA

java.lang.Object
  extended by morfologik.fsa.FSA
      extended by morfologik.fsa.CFSA
All Implemented Interfaces:
Iterable<ByteBuffer>

public final class CFSA
extends FSA

CFSA (Compact Finite State Automaton) binary format implementation. This is a slightly reorganized version of FSA5 that produces smaller automata at a (minor) performance penalty.

This automaton version is neither supported nor produced by the original fsa package. Use CFSAEncoder to convert FSA5 automata to this compact representation.

The encoding of the automaton body is as follows.

 ---- FSA header (standard)
 Byte                            Description 
       +-+-+-+-+-+-+-+-+\
     0 | | | | | | | | | +------ '\'
       +-+-+-+-+-+-+-+-+/
       +-+-+-+-+-+-+-+-+\
     1 | | | | | | | | | +------ 'f'
       +-+-+-+-+-+-+-+-+/
       +-+-+-+-+-+-+-+-+\
     2 | | | | | | | | | +------ 's'
       +-+-+-+-+-+-+-+-+/
       +-+-+-+-+-+-+-+-+\
     3 | | | | | | | | | +------ 'a'
       +-+-+-+-+-+-+-+-+/
       +-+-+-+-+-+-+-+-+\
     4 | | | | | | | | | +------ version (fixed 0xc5)
       +-+-+-+-+-+-+-+-+/
       +-+-+-+-+-+-+-+-+\
     5 | | | | | | | | | +------ filler character
       +-+-+-+-+-+-+-+-+/
       +-+-+-+-+-+-+-+-+\
     6 | | | | | | | | | +------ annot character
       +-+-+-+-+-+-+-+-+/
       +-+-+-+-+-+-+-+-+\
     7 |C|C|C|C|G|G|G|G| +------ C - node data size (ctl), G - address size (gotoLength)
       +-+-+-+-+-+-+-+-+/
       +-+-+-+-+-+-+-+-+\
  8-32 | | | | | | | | | +------ labels mapped for type (1) of arc encoding. 
       : : : : : : : : : |
       +-+-+-+-+-+-+-+-+/
 
 ---- Start of a node; only if automaton was compiled with NUMBERS option.
 
 Byte
        +-+-+-+-+-+-+-+-+\
      0 | | | | | | | | | \  LSB
        +-+-+-+-+-+-+-+-+  +
      1 | | | | | | | | |  |      number of strings recognized
        +-+-+-+-+-+-+-+-+  +----- by the automaton starting
        : : : : : : : : :  |      from this node.
        +-+-+-+-+-+-+-+-+  +
  ctl-1 | | | | | | | | | /  MSB
        +-+-+-+-+-+-+-+-+/
        
 ---- A vector of node's arcs. Conditional format, depending on flags.
 
 1) NEXT bit set, mapped arc label. 
 
                +--------------- arc's label mapped in M bits if M's field value > 0
                | +------------- node pointed to is next
                | | +----------- the last arc of the node
         _______| | | +--------- the arc is final
        /       | | | |
       +-+-+-+-+-+-+-+-+\
     0 |M|M|M|M|M|1|L|F| +------ flags + (M) index of the mapped label.
       +-+-+-+-+-+-+-+-+/
 
 2) NEXT bit set, label separate.
 
                +--------------- arc's label stored separately (M's field is zero).
                | +------------- node pointed to is next
                | | +----------- the last arc of the node
                | | | +--------- the arc is final
                | | | |
       +-+-+-+-+-+-+-+-+\
     0 |0|0|0|0|0|1|L|F| +------ flags
       +-+-+-+-+-+-+-+-+/
       +-+-+-+-+-+-+-+-+\
     1 | | | | | | | | | +------ label
       +-+-+-+-+-+-+-+-+/
 
 3) NEXT bit not set. Full arc.
 
                  +------------- node pointed to is next
                  | +----------- the last arc of the node
                  | | +--------- the arc is final
                  | | |
       +-+-+-+-+-+-+-+-+\
     0 |A|A|A|A|A|N|L|F| +------ flags + (A) address field, lower bits
       +-+-+-+-+-+-+-+-+/
       +-+-+-+-+-+-+-+-+\
     1 | | | | | | | | | +------ label
       +-+-+-+-+-+-+-+-+/
       : : : : : : : : :       
       +-+-+-+-+-+-+-+-+\
 gtl-1 |A|A|A|A|A|A|A|A| +------ address, continuation (MSB)
       +-+-+-+-+-+-+-+-+/
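
The three arc layouts above can be told apart from the flags byte alone. The following is a minimal sketch, assuming the leftmost cell of each diagram is the most significant bit, so that F, L and N occupy bits 0, 1 and 2 and the M/A field occupies the upper five bits. The class CfsaArcKind and its methods are hypothetical names used only for illustration, and the constant values are assumptions read off the diagrams rather than values confirmed by this page.

```java
/** Hypothetical helper: classifies an arc by its flags byte. */
final class CfsaArcKind {
    // Assumed bit positions (leftmost diagram cell = most significant bit).
    static final int BIT_FINAL_ARC = 1 << 0;   // F
    static final int BIT_LAST_ARC = 1 << 1;    // L
    static final int BIT_TARGET_NEXT = 1 << 2; // N

    /** Returns 1, 2 or 3: the layout number used in the description above. */
    static int kindOf(byte flags) {
        int f = flags & 0xff;
        if ((f & BIT_TARGET_NEXT) == 0) {
            return 3; // full arc: the upper five bits hold the low address bits
        }
        int mapped = f >>> 3; // five-bit M field
        return mapped > 0 ? 1 : 2; // zero means the label is stored in the next byte
    }

    static boolean isFinal(byte flags) {
        return (flags & BIT_FINAL_ARC) != 0;
    }

    static boolean isLast(byte flags) {
        return (flags & BIT_LAST_ARC) != 0;
    }

    public static void main(String[] args) {
        byte mappedArc = (byte) 0b00011101; // M = 3, N = 1, L = 0, F = 1
        System.out.println(kindOf(mappedArc));  // prints 1
        System.out.println(isFinal(mappedArc)); // prints true
        System.out.println(isLast(mappedArc));  // prints false
    }
}
```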
 


Field Summary
 byte[] arcs
          An array of bytes with the internal representation of the automaton.
static int BIT_FINAL_ARC
          Bitmask indicating that an arc corresponds to the last character of a sequence available when building the automaton.
static int BIT_LAST_ARC
          Bitmask indicating that an arc is the last one of the node's list and the following one belongs to another node.
static int BIT_TARGET_NEXT
          Bitmask indicating that the target node of this arc follows it in the compressed automaton structure (no goto field).
 int gtl
          Number of bytes each address takes in full, expanded form (goto length).
 byte[] labelMapping
          Label mapping for arcs of type (1) (see class documentation).
 int nodeDataLength
          The length of the node header structure (if the automaton was compiled with NUMBERS option).
static byte VERSION
          Automaton header version value.
 
Constructor Summary
CFSA(InputStream fsaStream)
          Creates a new automaton, reading it from an input stream in CFSA format (header version 0xc5).
 
Method Summary
 int getArc(int node, byte label)
          Returns the identifier of an arc leaving node and labeled with label.
 byte getArcLabel(int arc)
          Return the label associated with a given arc.
protected  int getDestinationNodeOffset(int arc)
          Returns the address of the node pointed to by this arc.
 int getEndNode(int arc)
          Return the end node pointed to by a given arc.
 int getFirstArc(int node)
          Returns the identifier of the first arc leaving node or 0 if the node has no outgoing arcs.
 Set<FSAFlags> getFlags()
          Returns a set of flags for this FSA instance.
 int getNextArc(int arc)
          Returns the identifier of the next arc after arc that leaves the same node.
 int getRootNode()
          Returns the start node of this automaton.
 boolean isArcFinal(int arc)
          Returns true if the destination node at the end of this arc corresponds to an input sequence created when building this automaton.
 boolean isArcLast(int arc)
          Returns true if this arc has the LAST bit set, that is, it is the last arc of its node.
 boolean isArcTerminal(int arc)
          Returns true if this arc does not have a terminating node (FSA.getEndNode(int) will throw an exception).
 boolean isLabelCompressed(int arc)
          Returns true if the label is compressed inside flags byte.
 boolean isNextSet(int arc)
          Returns true if this arc has the NEXT bit set, that is, its target node directly follows the arc.
 
Methods inherited from class morfologik.fsa.FSA
getInstance, getTraversalHelper, iterator
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

VERSION

public static final byte VERSION
Automaton header version value.

See Also:
Constant Field Values

BIT_FINAL_ARC

public static final int BIT_FINAL_ARC
Bitmask indicating that an arc corresponds to the last character of a sequence available when building the automaton.

See Also:
Constant Field Values

BIT_LAST_ARC

public static final int BIT_LAST_ARC
Bitmask indicating that an arc is the last one of the node's list and the following one belongs to another node.

See Also:
Constant Field Values

BIT_TARGET_NEXT

public static final int BIT_TARGET_NEXT
Bitmask indicating that the target node of this arc follows it in the compressed automaton structure (no goto field).

See Also:
Constant Field Values

arcs

public byte[] arcs
An array of bytes with the internal representation of the automaton. Please see the documentation of this class for more information on how this structure is organized.


nodeDataLength

public final int nodeDataLength
The length of the node header structure (if the automaton was compiled with NUMBERS option). Otherwise zero.
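
The class description shows the node header as nodeDataLength bytes stored least significant byte first. The sketch below reads such a count out of the arcs array. It assumes the count fits in an int (nodeDataLength of at most 4); NodeCount and stringCount are hypothetical names and not part of this class.

```java
/** Hypothetical helper: reads the per-node string count (NUMBERS option). */
final class NodeCount {
    /**
     * Decodes the little-endian count stored at nodeOffset.
     * Returns 0 when nodeDataLength is 0 (automaton built without NUMBERS).
     */
    static int stringCount(byte[] arcs, int nodeOffset, int nodeDataLength) {
        int count = 0;
        // Walk from the most significant byte (highest offset) down to the LSB.
        for (int i = nodeDataLength - 1; i >= 0; i--) {
            count = (count << 8) | (arcs[nodeOffset + i] & 0xff);
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] arcs = {0x2c, 0x01}; // 0x012c stored LSB first
        System.out.println(stringCount(arcs, 0, 2)); // prints 300
    }
}
```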


gtl

public final int gtl
Number of bytes each address takes in full, expanded form (goto length).
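
According to the header layout in the class description, nodeDataLength (C) and gtl (G) share byte 7 of the header. A minimal sketch of splitting that byte, assuming C is the high nibble and G the low nibble (leftmost diagram cell read as the most significant bit); CfsaHeaderByte is a hypothetical name.

```java
/** Hypothetical helper: splits header byte 7 into its two four-bit fields. */
final class CfsaHeaderByte {
    /** Upper four bits: node data size (ctl). */
    static int nodeDataLength(byte b) {
        return (b >>> 4) & 0x0f; // the mask discards sign-extension bits
    }

    /** Lower four bits: address size (goto length). */
    static int gotoLength(byte b) {
        return b & 0x0f;
    }

    public static void main(String[] args) {
        byte b = (byte) 0x93;
        System.out.println(nodeDataLength(b)); // prints 9
        System.out.println(gotoLength(b));     // prints 3
    }
}
```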


labelMapping

public final byte[] labelMapping
Label mapping for arcs of type (1) (see class documentation). The array is indexed by mapped label's value and contains the original label.
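
A label lookup that follows the arc layouts in the class description might look like the sketch below. It assumes the NEXT flag is bit 2 and the mapped index occupies the upper five bits of the flags byte; CfsaLabel and labelOf are hypothetical names, and this is an illustration of the documented layout rather than the library's own code.

```java
/** Hypothetical helper: resolves an arc's label from the raw arcs array. */
final class CfsaLabel {
    static final int BIT_TARGET_NEXT = 1 << 2; // assumed position of the NEXT flag

    static byte labelOf(byte[] arcs, int arc, byte[] labelMapping) {
        int flags = arcs[arc] & 0xff;
        int index = flags >>> 3; // five-bit mapped label index
        if ((flags & BIT_TARGET_NEXT) != 0 && index > 0) {
            return labelMapping[index]; // type (1): label compressed into the flags byte
        }
        return arcs[arc + 1]; // types (2) and (3): label stored in the following byte
    }

    public static void main(String[] args) {
        byte[] mapping = new byte[32];
        mapping[3] = 'a';
        byte[] arcs = {(byte) 0b00011101, (byte) 0b00000110, 'z'};
        System.out.println((char) labelOf(arcs, 0, mapping)); // prints a
        System.out.println((char) labelOf(arcs, 1, mapping)); // prints z
    }
}
```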

Constructor Detail

CFSA

public CFSA(InputStream fsaStream)
     throws IOException
Creates a new automaton, reading it from an input stream in CFSA format (header version 0xc5).

Throws:
IOException
Method Detail

getRootNode

public int getRootNode()
Returns the start node of this automaton. May return 0 if the start node is also an end node.

Specified by:
getRootNode in class FSA

getFirstArc

public final int getFirstArc(int node)
Returns the identifier of the first arc leaving node or 0 if the node has no outgoing arcs.

Specified by:
getFirstArc in class FSA

getNextArc

public final int getNextArc(int arc)
Returns the identifier of the next arc after arc that leaves the same node. Zero is returned if no more arcs are available for the node.

Specified by:
getNextArc in class FSA

getArc

public int getArc(int node,
                  byte label)
Returns the identifier of an arc leaving node and labeled with label. An identifier equal to 0 means the node has no outgoing arc labeled label.

Specified by:
getArc in class FSA

getEndNode

public int getEndNode(int arc)
Return the end node pointed to by a given arc. Terminal arcs (those that point to a terminal state) have no end node representation and throw a runtime exception.

Specified by:
getEndNode in class FSA

getArcLabel

public byte getArcLabel(int arc)
Return the label associated with a given arc.

Specified by:
getArcLabel in class FSA

isArcFinal

public boolean isArcFinal(int arc)
Returns true if the destination node at the end of this arc corresponds to an input sequence created when building this automaton.

Specified by:
isArcFinal in class FSA

isArcTerminal

public boolean isArcTerminal(int arc)
Returns true if this arc does not have a terminating node (FSA.getEndNode(int) will throw an exception). Implies FSA.isArcFinal(int).

Specified by:
isArcTerminal in class FSA
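
Taken together, getRootNode, getArc, isArcFinal, isArcTerminal and getEndNode are enough to test whether a byte sequence is in the automaton. The sketch below shows one way to combine them. To stay self-contained it does not use the morfologik classes: ArcLookup is a hypothetical interface that mirrors the documented method signatures, and ToyAutomaton is a made-up in-memory automaton for the sequences "a" and "ab", not the CFSA byte layout.

```java
import java.nio.charset.StandardCharsets;

final class CfsaWalk {
    /** Returns true if seq is one of the sequences stored in the automaton. */
    static boolean accepts(ArcLookup fsa, byte[] seq) {
        int node = fsa.getRootNode();
        for (int i = 0; i < seq.length; i++) {
            int arc = fsa.getArc(node, seq[i]);
            if (arc == 0) return false;                       // no arc with this label
            if (i == seq.length - 1) return fsa.isArcFinal(arc);
            if (fsa.isArcTerminal(arc)) return false;         // getEndNode would throw
            node = fsa.getEndNode(arc);
        }
        return false; // empty input
    }

    static boolean toyAccepts(String s) {
        return accepts(new ToyAutomaton(), s.getBytes(StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) {
        System.out.println(toyAccepts("ab"));  // prints true
        System.out.println(toyAccepts("abc")); // prints false
    }
}

/** Hypothetical mirror of the lookup methods documented on this page. */
interface ArcLookup {
    int getRootNode();
    int getArc(int node, byte label);
    int getEndNode(int arc);
    boolean isArcFinal(int arc);
    boolean isArcTerminal(int arc);
}

/** Toy automaton for {"a", "ab"}; a node is identified by its first arc. */
final class ToyAutomaton implements ArcLookup {
    // Index 0 is unused so that 0 can mean "no arc".
    private final byte[] label = {0, 'a', 'b'};
    private final int[] target = {0, 2, 0}; // 0 marks a terminal arc
    private final boolean[] fin = {false, true, true};
    private final boolean[] last = {false, true, true};

    public int getRootNode() { return 1; }

    public int getArc(int node, byte wanted) {
        for (int arc = node; arc != 0; arc++) {
            if (label[arc] == wanted) return arc;
            if (last[arc]) break; // end of this node's arc list
        }
        return 0;
    }

    public int getEndNode(int arc) {
        if (target[arc] == 0) throw new IllegalStateException("terminal arc");
        return target[arc];
    }

    public boolean isArcFinal(int arc) { return fin[arc]; }

    public boolean isArcTerminal(int arc) { return target[arc] == 0; }
}
```

The check for isArcTerminal before getEndNode follows the note above that terminal arcs have no end node representation.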

isArcLast

public boolean isArcLast(int arc)
Returns true if this arc has the LAST bit set, that is, it is the last arc of its node.

See Also:
BIT_LAST_ARC

isNextSet

public boolean isNextSet(int arc)
Returns true if this arc has the NEXT bit set, that is, its target node directly follows the arc.

See Also:
BIT_TARGET_NEXT

isLabelCompressed

public boolean isLabelCompressed(int arc)
Returns true if the label is compressed inside flags byte.


getFlags

public Set<FSAFlags> getFlags()
Returns a set of flags for this FSA instance.

For this automaton version, an additional FSAFlags.NUMBERS flag may be set to indicate the automaton contains extra fields for each node.

Specified by:
getFlags in class FSA

getDestinationNodeOffset

protected final int getDestinationNodeOffset(int arc)
Returns the address of the node pointed to by this arc.
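
For a full arc (layout 3 in the class description, NEXT bit not set), the address is split between the upper five bits of the flags byte and the bytes that follow the label. The sketch below reassembles it under these assumptions: the continuation bytes start right after the label byte and are stored least significant first, and the number of continuation bytes is passed in explicitly because its exact relation to gtl is not spelled out on this page. CfsaAddress and targetAddress are hypothetical names. Arcs with the NEXT bit set carry no address and are not handled here.

```java
/** Hypothetical helper: reassembles the target address of a full arc. */
final class CfsaAddress {
    static int targetAddress(byte[] arcs, int arc, int continuationBytes) {
        int r = 0;
        // Most significant continuation byte first, so the value builds top-down.
        for (int i = continuationBytes; i >= 1; i--) {
            r = (r << 8) | (arcs[arc + 1 + i] & 0xff);
        }
        r = (r << 8) | (arcs[arc] & 0xff); // flags byte carries the low five address bits
        return r >>> 3;                    // drop the N, L and F flag bits
    }

    public static void main(String[] args) {
        // flags = 10101 000 (low address bits 21), label 'x', one continuation byte 0x01
        byte[] arcs = {(byte) 0xA8, 'x', 0x01};
        System.out.println(targetAddress(arcs, 0, 1)); // prints 53
    }
}
```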



Copyright © 2010. All Rights Reserved.