Text Analytics Module (TAM) files are created when a module that contains AQL files is built by the project builder. To analyze text on InfoSphere® BigInsights™ cluster nodes, export the TAM files from your Text Analytics project, deploy them to the cluster, and reference them from the InfoSphere BigInsights Jaql Text Analytics module.
About this task
For details about working with the Jaql Text Analytics module, writing Jaql programs, and running these programs on a cluster, see Text Analytics Module.
Procedure
- Export an extractor from the Eclipse workbench:
  - Right-click the project that contains the modules, and select Export.
  - In the Export wizard, select the Text Analytics extractor export option, and click Next. The Extractor Export Specification wizard page opens.
  - Select Export dependent modules to export the dependent modules for the selected module. In a modular project, you can select which modules in the project to export.
  - Click Browse workspace or Browse file system to select the directory to which you want to export the modules.
  - Optional: Select Export to destination directory to export the selected modules to the selected directory, or select Export to a jar or zip archive under the destination directory to bundle the selected modules into a JAR or ZIP archive.
  - Click Finish to export the modules.
- Deploy the TAM files in one of two ways, depending on how you want to run the extractor:
  - To run the extractor by using a Jaql module, upload the TAM files to the distributed file system (DFS) by using the InfoSphere BigInsights Console or the Hadoop command.
  - To run the extractor by using an Oozie workflow, publish the TAM files by using the InfoSphere BigInsights Application Publish wizard.
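  For example, if you exported the modules as a ZIP archive, an upload with the Hadoop command might look like the following sketch. The local path, archive name, and DFS target directory are placeholders, not values defined by this documentation:

    # Placeholder paths: substitute your own export location and DFS target.
    hadoop fs -mkdir /user/tams
    hadoop fs -put /local/export/extractors.zip /user/tams
    hadoop fs -ls /user/tams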
- Run a deployed extractor in one of the following ways:
  - To run the extractor by using the SystemT Jaql module, use the annotateDocument() function and reference the uploaded TAM file by using the standard URL for a DFS file, for example: hdfs://server.company.com:9000/user/tams/extractor.tam. (A sketch of such a call follows this step.)

    The annotateDocument() function requires the following parameters:
    - document: The DFS location of the input document.
    - moduleNames: A JSON array of module names.
    - modulePath: An array of paths to the directories, JAR files, or ZIP files that contain the TAM files that are referenced by the input modules.

    The following parameters are optional:
    - spanOutputPolicy: The type of output. There are three options:
      - toJsonSpan: The result contains only span values, that is, the beginning and ending offsets of annotations.
      - toJsonString: The result contains only the annotated text.
      - toJsonRecord: The result contains the span values, the annotated text, and the original annotated document.
    - tokenizer: The tokenizer to use. The tokenizer can be Standard, Multilingual, or a custom tokenizer.
    - language: The language of the input document.
    - externalDictionaries: A JSON record of key-value pairs. The key is the name of the external dictionary, qualified with the module name (<moduleName>.<dictionaryName>), and the value is an array of strings, where each string is an entry of the dictionary. For example:
      { "Person.Dict1": ["entry1", "entry2",...], "Person.Dict2": ["entry1", "entry2",...] }
    - externalTables: A JSON record of key-value pairs. The key is the name of the external table, qualified with the module name (<moduleName>.<tableName>), and the value is the URI of a CSV file that contains the table contents.
    - externalViews: A JSON record of key-value pairs. The key is the name of the external view, and the value is an array of records, where each record encapsulates a tuple of the view.
    - outputViews: A JSON array of the names of the views to output, or null to produce all outputs. The output view names must be fully qualified by module name.
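    The exact form of an annotateDocument() call depends on your Jaql environment. The following minimal sketch assumes that the SystemT Jaql module is imported as systemT and that named arguments are supported; the module name Person, the view name Person.PersonName, and all paths are placeholders:

      // Sketch only: the import name, paths, module name, and view name
      // are illustrative assumptions, not values defined by this product.
      import systemT;

      systemT::annotateDocument(
        // Required parameters
        document    = "hdfs://server.company.com:9000/user/data/doc1.txt",  // DFS location of the input document
        moduleNames = ["Person"],                                           // JSON array of module names
        modulePath  = ["hdfs://server.company.com:9000/user/tams"],         // directories or archives that contain the TAM files
        // Optional parameters
        spanOutputPolicy = "toJsonRecord",        // spans, annotated text, and the original document
        language         = "en",
        outputViews      = ["Person.PersonName"]  // fully qualified view names, or null for all outputs
      );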
  - To run the extractor as a workflow, such as an Oozie workflow, use the InfoSphere BigInsights Application Publish wizard to publish the TAM files. You can then use the InfoSphere BigInsights Console to deploy the workflow application and run it. The following information is required to run the extractor:
    - Input Data: The location of the input data. It can be a file or a directory.
    - Output View: The output view to run. You can select ALL to run all of the output views.
    - Output Directory: The location to which the output result is written.

    Optionally, you can select the Advanced options check box for more configuration settings:
    - Input Format: The format of the input data. A Text Analytics application can run only with data in these formats: text, CSV with header (the first row contains column names), new-line delimited, and JSON. For details on each format, see Data collection formats.
    - Delimiter Character: The delimiter character that is used for CSV formats. For other formats, this field is ignored.
    - Output Type: The type of output. There are three options:
      - Span: The result contains only span values, that is, the beginning and ending offsets of annotations.
      - Text: The result contains only the annotated text.
      - Span and Text: The result contains the span values, the annotated text, and the original annotated document.
    - Language: A two-character abbreviation that represents the language of the input document. The default value is en (English).
Results
At deploy time, the extractor produces information about the external tables and dictionaries that it uses.
The run time component of the extractor follows this scenario:
- The AQL module is fed into the Text Analytics Optimizer, which compiles an execution plan for the views in the extractor.
- This execution plan is fed into the run time component of the system. The Text Analytics run time component has a document-at-a-time execution model.
- The run time component receives a stream of documents, annotating each in turn and producing the relevant annotations as output tuples.
- For each document, the run time component populates the Document relation with a single tuple that represents the fields of the document. The tuple has two fields: label (usually the name of the input document) and text (the textual content of the document). An illustrative tuple follows this list.
- The run time component evaluates all views that are necessary to produce the output views. The contents of the output views become the outputs of the extractor for the current document.
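For example, a single Document tuple, written as a Jaql record, might look like the following; the file name and text shown are illustrative:

  // Illustrative Document tuple: label holds the input document name,
  // text holds the textual content of that document.
  { label: "doc1.txt", text: "John Smith joined the company in May." }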