Labeling the data

You start the extraction plan by labeling snippets of interest and their associated clues. Snippets are words or phrases that you judge to be helpful in producing the information that you need.

About this task

The two primary views of the workflow, Extraction Tasks and Extraction Plan, guide you through the process of creating your extractor and help you manage it. These views are part of the InfoSphere® BigInsights™ Text Analytics Workflow perspective. If you do not see these views, open each of them by clicking Window > Show View > Other > BigInsights > Extraction Tasks and Window > Show View > Other > BigInsights > Extraction Plan.

After you select the data in Step 1 of the Extraction Tasks wizard, open some of the data in the editor to begin mining it.

The labeling process is a way to provide clues and examples of text that are significant or that provide a pattern for what you want to accomplish with this extractor. Use the clues within and around the examples to support a positive or negative result. Labeling is an iterative process.

Note: When you import a project that was created before InfoSphere BigInsights Version 2.0, and open the Extraction Plan, the old extraction plan is converted to the current design automatically. The project itself remains in a non-modular state.

Procedure

  1. Select a document in the list under Step 1: Select Documents, and click Open. The document opens in the AQL Editor pane. By reading the document, or by searching for particular phrases, you create your list of examples and clues.
  2. Find a phrase that is meaningful to the goal and that offers an example of either a positive or negative result. For example, if your goal is to learn more about the financial growth of a company, you might search for the term revenue. When you find that term, highlight the term, right-click it, and click Add Example with New label.

    The root label and each sublabel contain four groups. Three of the groups are based on the main steps in building extractors (basic features, candidate generation, and filter and consolidate). The fourth is the special group finals, which contains final output and export views, tables, and dictionaries. You are encouraged to use this grouping to develop extractors.

    Each group of a root label is linked with a module that is created together with the root label. The first time that you create a root label (also called a top-level label), four modules are created automatically:
    • <root_label_name>_BasicFeatures
    • <root_label_name>_CandidateGeneration
    • <root_label_name>_FilterConsolidate
    • <root_label_name>_Finals
    You can see these default modules by opening the Package Explorer and browsing to the <project_name>\textAnalytics\src\ directory. You can create your own modules, but it is recommended that you use these default modules to create your AQL scripts.
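
The naming convention for the default modules can be sketched in a few lines of Python. This is only an illustration of the pattern described above, not part of the product: the function names, the project name `MyProject`, and the label `Revenue` are hypothetical, and the sketch assumes that each module appears as a directory of the same name under `textAnalytics\src\`.

```python
from pathlib import PureWindowsPath

# Suffixes of the four default modules, as listed in the text above.
GROUP_SUFFIXES = (
    "BasicFeatures",
    "CandidateGeneration",
    "FilterConsolidate",
    "Finals",
)


def default_module_names(root_label: str) -> list[str]:
    """Return the four module names that accompany a root label."""
    return [f"{root_label}_{suffix}" for suffix in GROUP_SUFFIXES]


def default_module_dirs(project_name: str, root_label: str) -> list[str]:
    """Return the assumed location of each module (hypothetical helper)."""
    src = PureWindowsPath(project_name) / "textAnalytics" / "src"
    return [str(src / name) for name in default_module_names(root_label)]


if __name__ == "__main__":
    # "MyProject" and "Revenue" are placeholder names for illustration.
    for path in default_module_dirs("MyProject", "Revenue"):
        print(path)
```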
  3. In the Add New Label window, type a name for the label. If this label is a root label, then you can click Finish. If this label is a sublabel, type or select the parent label name from the listed hierarchy of labels. This tree view of labels filters the labels by the text that you enter in the Add New Label window so that you can more easily determine the correct parent for your label grouping.
  4. Optional: Create a root label that is independent of the documents that you are editing. Right-click an empty space in the input document, and select New Label. The Create Label window opens.
    1. In the Enter the label field, type a label name.

      When you add a label this way, separately from any text in the document, the same structures are created in the project as for any other root label, such as <new_label>_BasicFeatures.

  5. For each label in the Extraction Plan view, you can right-click the label and do any of the following actions:
    Mark a label complete
    Marking a label complete is a workflow aid that shows when you are done adding examples for that label. It is a visual cue as to which labels might need more work and which labels are done.
    Create a new label
    You can add sublabels to any existing label to refine the groupings.
    Add, edit, or view a comment
    Document your decisions or process by adding comments to your labels. The comment is stored with the label and can be viewed later by right-clicking and selecting Edit/View comment.
    Add an AQL statement
    You can select a template for an AQL statement according to its level (Basic Features, Candidate Generation, Filter and Consolidate, Final).
    Go into a label
    Expand the label node, and see the hierarchy of its children or descendants.
  6. From the Extraction Plan, you can enhance an extractor with some utilities that are included in the menus.
    Generate regular expression from examples
    Examples are nested under the labels. Right-click one or more examples in a label, then click Generate Regular Expression. The Regular Expression Generator opens with the examples already included as a sample from which to generate a regular expression.
    Create dictionary from examples
    Right-click one or more examples in a label, and click Add to Dictionary. The Select Dictionaries window opens. Select a file to use as your dictionary or create a dictionary file by typing a valid file name in the Select a file field. The dictionary opens in the editor with the selected examples appended to the end of the file.
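
To make the two utilities above more concrete, the following Python sketch mimics their effect outside the tool. It is a rough approximation under stated assumptions, not the algorithm that the Regular Expression Generator uses: the function names are hypothetical, the regular expression is a plain alternation of the literal examples, and the dictionary is assumed to be a text file with one entry per line.

```python
import re
from pathlib import Path


def regex_from_examples(examples: list[str]) -> str:
    """Build an alternation pattern that matches exactly the given examples."""
    # Deduplicate, then put longer examples first so that "revenues"
    # is tried before its prefix "revenue".
    unique = sorted(set(examples), key=lambda s: (-len(s), s))
    return "(?:" + "|".join(re.escape(e) for e in unique) + ")"


def append_to_dictionary(dict_file: Path, examples: list[str]) -> None:
    """Append examples to the end of a dictionary file, creating it if needed.

    Assumes a one-entry-per-line text format.
    """
    existing = dict_file.read_text(encoding="utf-8") if dict_file.exists() else ""
    with dict_file.open("a", encoding="utf-8") as fh:
        # Make sure the new entries start on their own line.
        if existing and not existing.endswith("\n"):
            fh.write("\n")
        for example in examples:
            fh.write(example + "\n")


if __name__ == "__main__":
    # Placeholder examples, as if they had been labeled in a document.
    labeled = ["revenue", "revenues", "net revenue"]
    pattern = re.compile(regex_from_examples(labeled))
    print(pattern.findall("Net revenue grew; total revenues doubled."))
    append_to_dictionary(Path("revenue_clues.dict"), labeled)
```

A literal alternation like this one matches only the examples that it was built from. A generated expression is usually more useful once it is generalized by hand, for example by replacing literal digits with a character class.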

    Continue finding and adding examples and clues. This iterative labeling is an important part of document analysis: it builds your understanding of the documents that you are working with and of the clues that you can use in your AQL. When you are satisfied that you have the proper clues, you can begin writing the AQL statements that filter and extract the information that you need.

  7. It is recommended that you build your AQL from the bottom up: start with the lowest-level labels, and refine the extractor as you move up the label hierarchy.
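
One way to picture the bottom-up order is as a post-order walk of the label hierarchy, in which every sublabel is handled before its parent. The Python sketch below illustrates that ordering only. The `Label` class and the label names are hypothetical, and the tool itself does not prescribe or compute such an order.

```python
from dataclasses import dataclass, field


@dataclass
class Label:
    """A label in a hypothetical extraction plan hierarchy."""
    name: str
    children: list["Label"] = field(default_factory=list)


def bottom_up_order(label: Label) -> list[str]:
    """Return label names so that every sublabel precedes its parent."""
    order: list[str] = []
    for child in label.children:
        order.extend(bottom_up_order(child))
    order.append(label.name)
    return order


if __name__ == "__main__":
    # Placeholder hierarchy: a root label with two levels of sublabels.
    plan = Label("Revenue", [
        Label("Amount", [Label("Currency"), Label("Number")]),
        Label("RevenueClue"),
    ])
    print(bottom_up_order(plan))
    # prints ['Currency', 'Number', 'Amount', 'RevenueClue', 'Revenue']
```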