Creating and running launch configurations

You can use a launch configuration to run an AQL extractor against a data collection. If you want to run the extractor against more than one collection, you can create a launch configuration for each collection and save them for future use.

Before you begin

Procedure

  1. Right-click a Text Analytics project, and click Run As > Run Configurations.
  2. In the Run Configurations panel, right-click Text Analytics in the navigation pane, and click New. A new Text Analytics configuration is created and selected in the navigation pane, and the Main tab opens in the content pane. The Project Preferences tab is populated with the properties that are inherited from the project.
    1. If the project is a modular AQL project, select the list of modules to be included.
    2. Select the language of the data collection.
    3. Click Browse Workspace or Browse File System to select the location of the data collection. If one or more of the selected modules require external view data, choose a JSON file in a suitable format, as explained in Data collection formats.
      CAUTION:
      In the file selection window, if you select the Show All Files check box, ensure that you select a file that conforms to one of the supported formats. If you select a file that has an unsupported format, the extractor can return unexpected results.
      Valid formats for data collections are provided in Data collection formats.
  3. Optional: Pass data to the external artifacts that are declared in the extractor. The Text Analytics run configuration has two more tabs for this purpose: External Tables and External Dictionaries. If the project is a modular AQL project and contains external dictionaries or external tables, they are listed in these tabs. For more information, see the External tables and External dictionaries tabs in the Text Analytics run configuration.

Results

How the Text Analytics system stores extraction results in the file system

After an extractor runs, one result file is generated for each input document. All result files have the .strf file extension. Result files are stored in the file system so that results can be displayed in the Text Analytics tool.
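Because every result file carries the .strf extension, a short script can count or list the results of a run outside the tool. The following is a minimal sketch, not part of the product: the function name list_result_files is hypothetical, and you must supply the path to your own result directory.

```python
from pathlib import Path


def list_result_files(result_dir):
    """Return a sorted list of the .strf files found anywhere under result_dir.

    Hypothetical helper for illustration; result_dir is whatever directory
    holds the result files for your project.
    """
    return sorted(Path(result_dir).rglob("*.strf"))
```

Calling len(list_result_files(path)) gives the number of serialized results under that path. That number can be lower than the number of input documents when some inputs produce no annotations.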

Input files that do not have annotations are not serialized and not considered for processing in the InfoSphere BigInsights Tools for Eclipse.

If the input document is a file in a directory and its name contains no special characters, the result file (.strf) has the same name as the input file. However, for some data collections (such as a .zip file, a directory with a subdirectory, or a del file with internal labels that contain special characters), the name of the result file cannot be mapped directly to the input file name. This occurs because the label can contain special characters that are not allowed in a file name (such as '/', '?', '%', '*', ':', '|', '<', '>', '\', '"'). As a result, the Text Analytics system either flattens the hierarchy (for a .zip file or a subdirectory) or normalizes the file name by replacing the path separators and special characters with the character '~' (for example, for del files whose input document labels are URLs).

In rare situations, two input documents can have labels that map to the same result file name. In that case, the result files are differentiated by a version number, for example MyDoc.strf and MyDoc(1).strf.

Results are also saved in a result-<system-timestamp> subdirectory in the result directory for the project.

To compare the results of two runs, see Comparing the results of one extractor run to the results of another run. To evaluate the differences between results, see Evaluating the quality of an extractor by combining the labeled data collection with the Annotation Difference Viewer.
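The file name normalization and version numbering described in this section can be illustrated with a short sketch. This is not the tool's actual implementation: the function names normalize_label and result_file_names are hypothetical, and the exact rules (for example, how hierarchy is flattened or how input file extensions are handled) are assumptions based only on the description in this section.

```python
# Characters that this section lists as not allowed in a result file name.
FORBIDDEN = set('/?%*:|<>\\"')


def normalize_label(label):
    """Replace each forbidden character in a document label with '~'.

    Hypothetical helper; mirrors the replacement rule described in the text.
    """
    return "".join("~" if ch in FORBIDDEN else ch for ch in label)


def result_file_names(labels):
    """Map document labels to .strf names, adding (n) when names collide.

    Assumption: the first occurrence keeps the plain name and later
    occurrences get (1), (2), and so on, as in MyDoc.strf and MyDoc(1).strf.
    """
    seen = {}  # normalized name -> number of times it has occurred so far
    names = []
    for label in labels:
        base = normalize_label(label)
        count = seen.get(base, 0)
        seen[base] = count + 1
        suffix = "({})".format(count) if count else ""
        names.append(base + suffix + ".strf")
    return names
```

For example, under these assumptions the labels a/b and a?b both normalize to a~b, so the second result file would be named a~b(1).strf. This explains why a result file name cannot always be traced back to a single input label by name alone.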

Tip: To run a recently launched configuration again, select the configuration from the Run toolbar button menu.