Package edu.nyu.jet.tipster

The Tipster package provides the basic methods for recording information about documents. It is loosely based on the 'Tipster Architecture' developed by R. Grishman as part of the Government-sponsored Tipster program. The basic objects are Documents and Annotations: a Document is a container for the text of the document and for a set of Annotations on that text.


In the course of processing, the Jet system accumulates information about the words and phrases in a Document: simple things like parts of speech for individual words and type information (person/company/location) for names, as well as more complex things like phrases and clauses (with internal structure). We want a single class of object for capturing all of this information and associating it with a Document. The class we use for this purpose is the Annotation. An Annotation is associated with a Span (substring) of the text of a Document. The Annotation has a type and a set of features with values. For example, an Annotation can indicate that a portion of a document is a sentence, or is a token with a given part of speech. More complex structures can be built by having Annotations that point to other Annotations.
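
The relationship between a Document, a Span, and an Annotation can be sketched in a few small classes. This is only an illustration under stated assumptions: the names SketchSpan, SketchAnnotation, and SketchDocument, the half-open character offsets, and the with/annotate/textOf methods are hypothetical stand-ins chosen for this example, not the classes or signatures of this package.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-ins for illustration; not the Jet classes.

/** A half-open character range [start, end) within a document's text. */
final class SketchSpan {
    final int start;
    final int end;

    SketchSpan(int start, int end) {
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("invalid span: " + start + ".." + end);
        }
        this.start = start;
        this.end = end;
    }
}

/** A typed label on a span, carrying named feature values. */
final class SketchAnnotation {
    final String type;
    final SketchSpan span;
    // A feature value may be a plain string or another annotation;
    // the latter is how nested structures are represented here.
    final Map<String, Object> features = new LinkedHashMap<>();

    SketchAnnotation(String type, SketchSpan span) {
        this.type = type;
        this.span = span;
    }

    /** Sets one feature and returns this annotation so calls can be chained. */
    SketchAnnotation with(String feature, Object value) {
        features.put(feature, value);
        return this;
    }
}

/** The text of a document plus the annotations recorded on it. */
final class SketchDocument {
    final String text;
    final List<SketchAnnotation> annotations = new ArrayList<>();

    SketchDocument(String text) {
        this.text = text;
    }

    /** Records an annotation after checking that its span lies inside the text. */
    SketchAnnotation annotate(SketchAnnotation a) {
        if (a.span.end > text.length()) {
            throw new IllegalArgumentException("span extends past end of text");
        }
        annotations.add(a);
        return a;
    }

    /** Returns the substring of the text covered by the annotation. */
    String textOf(SketchAnnotation a) {
        return text.substring(a.span.start, a.span.end);
    }
}

class AnnotationSketchDemo {
    public static void main(String[] args) {
        SketchDocument doc = new SketchDocument("Fred Smith retired.");
        SketchAnnotation last = doc.annotate(
            new SketchAnnotation("token", new SketchSpan(5, 10)).with("pos", "NNP"));
        // The "name" annotation points to the token annotation through a feature.
        SketchAnnotation name = doc.annotate(
            new SketchAnnotation("name", new SketchSpan(0, 10))
                .with("type", "person")
                .with("head", last));
        System.out.println(name.type + ": " + doc.textOf(name));
    }
}
```

Storing feature values as Object is one simple way to let an annotation refer to another annotation; a real implementation may well use a dedicated feature-set type instead.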

A Document is processed in a series of stages, such as tokenization, sentence splitting, dictionary look-up, and pattern matching. Each stage uses the Annotations placed on the Document by previous stages, and adds its own Annotations to the Document.
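
The staged design can be illustrated with a two-stage toy pipeline in which the second stage reads what the first one wrote. Everything here is an assumption made for the example: the Ann, Doc, Stage, and Pipeline names, the whitespace tokenizer, and the punctuation-based sentence rule are hypothetical and far simpler than the processing Jet performs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical names throughout; a sketch of the staged design, not Jet's code.

/** A typed label on the character range [start, end). */
final class Ann {
    final String type;
    final int start;
    final int end;

    Ann(String type, int start, int end) {
        this.type = type;
        this.start = start;
        this.end = end;
    }
}

/** Text plus the annotations accumulated by the stages run so far. */
final class Doc {
    final String text;
    final List<Ann> anns = new ArrayList<>();

    Doc(String text) {
        this.text = text;
    }

    /** Returns a new list holding the annotations of the given type, in insertion order. */
    List<Ann> ofType(String type) {
        List<Ann> out = new ArrayList<>();
        for (Ann a : anns) {
            if (a.type.equals(type)) out.add(a);
        }
        return out;
    }
}

interface Stage {
    void process(Doc doc);
}

/** Stage 1: marks each maximal run of non-whitespace characters as a token. */
final class WhitespaceTokenizer implements Stage {
    public void process(Doc doc) {
        int i = 0;
        int n = doc.text.length();
        while (i < n) {
            while (i < n && Character.isWhitespace(doc.text.charAt(i))) i++;
            int start = i;
            while (i < n && !Character.isWhitespace(doc.text.charAt(i))) i++;
            if (i > start) doc.anns.add(new Ann("token", start, i));
        }
    }
}

/** Stage 2: reads token annotations and ends a sentence at each token ending in '.', '!' or '?'. */
final class SentenceSplitter implements Stage {
    public void process(Doc doc) {
        int sentenceStart = -1;
        int lastEnd = -1;
        for (Ann token : doc.ofType("token")) {
            if (sentenceStart < 0) sentenceStart = token.start;
            lastEnd = token.end;
            char last = doc.text.charAt(token.end - 1);
            if (last == '.' || last == '!' || last == '?') {
                doc.anns.add(new Ann("sentence", sentenceStart, token.end));
                sentenceStart = -1;
            }
        }
        // Tokens left over without closing punctuation form a final sentence.
        if (sentenceStart >= 0) doc.anns.add(new Ann("sentence", sentenceStart, lastEnd));
    }
}

final class Pipeline {
    /** Runs the stages in order; each one sees the annotations added before it. */
    static void run(Doc doc, List<Stage> stages) {
        for (Stage s : stages) s.process(doc);
    }
}

class PipelineSketchDemo {
    public static void main(String[] args) {
        Doc doc = new Doc("Jet runs. It works!");
        Pipeline.run(doc, Arrays.<Stage>asList(new WhitespaceTokenizer(), new SentenceSplitter()));
        for (Ann s : doc.ofType("sentence")) {
            System.out.println(doc.text.substring(s.start, s.end));
        }
    }
}
```

Because the sentence splitter depends only on "token" annotations and not on the tokenizer itself, a different tokenizer could be substituted without changing the later stage, which is the main benefit of passing information between stages through annotations.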

Annotations provide a mark-up capability very similar to that of SGML or XML (although Annotations do not have to be nested the way SGML/XML mark-up is). The Document class provides a method for converting selected Annotations on a Document to XML mark-up, and in the future will have a method for converting XML mark-up to Annotations. In addition, the Document class provides a method for viewing a Document and highlighting selected annotations (this is very primitive at present).
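
To make the correspondence with XML concrete, the following sketch writes the annotations of one selected type as inline tags around the text they cover. It is not the conversion method of the Document class: the Tagged and XmlSketch names are hypothetical, type names and feature names are assumed to be valid XML names, and, because annotations need not nest while XML elements must, this sketch simply rejects overlapping annotations of the selected type.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names; a sketch of annotation-to-XML conversion, not Jet's method.

/** A typed label on the character range [start, end), with string-valued features. */
final class Tagged {
    final String type;
    final int start;
    final int end;
    final Map<String, String> features = new LinkedHashMap<>();

    Tagged(String type, int start, int end) {
        this.type = type;
        this.start = start;
        this.end = end;
    }

    Tagged with(String feature, String value) {
        features.put(feature, value);
        return this;
    }
}

final class XmlSketch {
    /** Escapes the characters that are special in XML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Wraps each annotation of the selected type in a tag; other types are ignored. */
    static String toXml(String text, List<Tagged> annotations, String type) {
        List<Tagged> selected = new ArrayList<>();
        for (Tagged a : annotations) {
            if (a.type.equals(type)) selected.add(a);
        }
        selected.sort((x, y) -> Integer.compare(x.start, y.start));

        StringBuilder out = new StringBuilder();
        int pos = 0; // end of the text already written
        for (Tagged a : selected) {
            if (a.start < pos) {
                // Overlapping spans cannot be written as well-formed inline tags here.
                throw new IllegalArgumentException("overlapping annotations of type " + type);
            }
            out.append(escape(text.substring(pos, a.start)));
            out.append('<').append(a.type);
            for (Map.Entry<String, String> f : a.features.entrySet()) {
                out.append(' ').append(f.getKey())
                   .append("=\"").append(escape(f.getValue())).append('"');
            }
            out.append('>');
            out.append(escape(text.substring(a.start, a.end)));
            out.append("</").append(a.type).append('>');
            pos = a.end;
        }
        out.append(escape(text.substring(pos)));
        return out.toString();
    }
}

class XmlSketchDemo {
    public static void main(String[] args) {
        List<Tagged> anns = new ArrayList<>();
        anns.add(new Tagged("name", 0, 4).with("type", "company"));
        anns.add(new Tagged("token", 5, 10));
        anns.add(new Tagged("name", 11, 21).with("type", "person"));
        System.out.println(XmlSketch.toXml("AT&T hired Fred Smith.", anns, "name"));
    }
}
```

Selecting a single annotation type per conversion sidesteps most nesting conflicts; writing several types at once would require a policy for spans that cross each other.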

Copyright © 2016 New York University. All rights reserved.