Deploying and running an extractor

Text Analytics Module (TAM) files are created when a module that contains AQL files is built by the project builder. To analyze text on InfoSphere® BigInsights™ cluster nodes, export the TAM files from your Text Analytics project, deploy them to the cluster, and reference them from the InfoSphere BigInsights Jaql Text Analytics module.

About this task

For details about working with the Jaql Text Analytics module, writing Jaql programs, and running these programs on a cluster, see Text Analytics Module.

Procedure

  1. Export an extractor from the Eclipse workbench:
    1. Right-click the project that contains the modules, and select Export.
    2. In the Export wizard, select BigInsights > Text Analytics Extractor, and click Next. The Extractor Export Specification wizard page opens.
    3. Select Export dependent modules to also export the modules that the selected module depends on. In a modular project, you can choose which modules in the project to export.
    4. Click Browse workspace or Browse file system to select the directory to which you want to export the dependent modules.
    5. Optional: Select Export to destination directory to export the selected modules to the selected directory, or select Export to a jar or zip archive under the destination directory to bundle the selected modules into a JAR or ZIP archive.
    6. Click Finish to export the modules.
  2. Deploy the TAM files in one of two ways, depending on how you want to run the extractor:
    To run the extractor by using a Jaql module
    Upload the TAM files to the distributed file system (DFS) by using the InfoSphere BigInsights Console or the Hadoop command, as shown in the example that follows this list.
    To run the extractor by using an Oozie workflow
    Publish the TAM files by using the InfoSphere BigInsights Application Publish wizard.
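    For the Jaql module option, one way to upload an exported TAM file from the command line is the hadoop fs command, as in the following sketch. The directory and file names are examples only; use locations that are appropriate for your cluster.
    # Create a target directory on the DFS and copy the exported TAM file into it (example paths).
    hadoop fs -mkdir /user/tams
    hadoop fs -put extractor.tam /user/tams/extractor.tam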
  3. Run a deployed extractor:
    To run the extractor by using the SystemT Jaql module
    Use the annotateDocument() function and reference the uploaded TAM file by using the standard URL for a DFS file, for example: hdfs://server.company.com:9000/user/tams/extractor.tam. A sketch of such a call follows the parameter descriptions below.

    The annotateDocument() function requires the following parameters:

    document
    The DFS location of the input document.
    moduleNames
    A JSON array of module names.
    modulePath
    An array of paths to directories or JAR or ZIP files where the TAM files that are referenced in the input modules are found.

    The following parameters are optional:

    spanOutputPolicy
    The type of output. There are three options:
    toJsonSpan
    The result contains only span values, in other words, the beginning and ending offsets of annotations.
    toJsonString
    The result contains only the annotated texts.
    toJsonRecord
    The result contains the span values, annotated texts, and the original annotated document.
    tokenizer
    The tokenizer to use. The tokenizer can be Standard, Multilingual, or a custom tokenizer.
    language
    The language of the input document.
    externalDictionaries
    A JSON record with key-value pairs. The key is the name of the external dictionary that is qualified with the module name (<moduleName>.<dictionaryName>) and the value is an array of strings, where each string is an entry of the dictionary.
    For example:
    { "Person.Dict1": ["entry1", "entry2",...], "Person.Dict2": ["entry1", "entry2",...] }
    externalTables
    A JSON record with key-value pairs. The key is the name of the external table that is qualified with the module name (<moduleName>.<tableName>) and the value is the URI of a CSV file that contains the table contents.
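    For example (the module, table, and file names are hypothetical):
    { "Person.Table1": "hdfs://server.company.com:9000/user/tables/table1.csv" }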
    externalViews
    A JSON record with key-value pairs. The key is the name of the external view, and the value is an array of records, where each record encapsulates a tuple of the view.
    outputViews
    A JSON array of names of the views to output, or null to produce all outputs. The output view names must be fully qualified by module name.
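    The following Jaql fragment is a minimal sketch of such a call. The input path, the Person module name, and the TAM directory are hypothetical, and the import statement and exact argument form of annotateDocument() should be verified against the Text Analytics Module documentation; only the required parameters are shown.
    // Sketch only: the import name, the paths, and the "Person" module are assumptions.
    import systemT;

    // Annotate one input document with the deployed extractor. Optional
    // parameters (spanOutputPolicy, tokenizer, language, and so on) are omitted.
    systemT::annotateDocument(
        "hdfs://server.company.com:9000/user/docs/input.txt",    // document: DFS location of the input document
        ["Person"],                                               // moduleNames: modules to run
        ["hdfs://server.company.com:9000/user/tams/"]             // modulePath: directory that contains the TAM files
    );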
    To run the extractor by using an Oozie workflow
    Use the InfoSphere BigInsights Application Publish wizard to publish the TAM files. You can then use the InfoSphere BigInsights Console to deploy the workflow application and run it. The following information is required to run the extractor:
    Input Data
    The location of input data. It can be a file or a directory.
    Output View
    The output view to run. You can select ALL to run all of the output views.
    Output Directory
    The location to which the output is written. Optionally, you can select the Advance options check box for additional configuration settings.
    Input Format
    The format of the input data. A Text Analytics application can run only with data in these formats: text, CSV with header (the first row contains column names), newline-delimited, and JSON. For details on each format, see Data collection formats.
    Delimiter Character
    The delimiter character that is used for CSV types. For other types, this field is ignored.
    Output Type
    The type of output. There are three options:
    Span
    The result contains only span values, that is, the beginning and ending offsets of annotations.
    Text
    The result contains only the annotated texts.
    Span and Text
    The result contains span values, annotated texts, and the original annotated document.
    Language
    A two-character abbreviation that represents the language of the input document. The default value is en for English.

Results

At deployment time, information is produced about the external tables and dictionaries that the extractor uses.

The runtime component of the extractor follows this scenario:
  1. The AQL module is fed into the Text Analytics Optimizer, which compiles an execution plan for the views in the extractor.
  2. This execution plan is fed into the runtime component of the system. The Text Analytics runtime component has a document-at-a-time execution model.
  3. The runtime component receives a stream of documents, annotating each in turn and producing the relevant annotations as output tuples.
  4. For each document, the runtime component populates the Document relation with a single tuple that represents the fields of the document. The tuple has two fields: label (usually the name of the input document) and text (the textual content of the document).
  5. The runtime component evaluates all views that are necessary to produce the output views. The contents of the output views become the outputs of the extractor for the current document.
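
For reference, the following minimal AQL module illustrates how a view consumes the Document relation that is populated in step 4 and how an output view produces the results described in step 5. The module and view names are examples only and are not part of any particular deployed extractor.
  module Person;

  -- The Document view is populated by the runtime with one tuple per input
  -- document; its text field holds the document content (see step 4).
  create view CapitalizedWord as
    extract regex /[A-Z][a-z]+/ on D.text as word
    from Document D;

  -- Views that are marked as output become the extractor's results (see step 5).
  output view CapitalizedWord;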