Creating and executing pattern-discovery configurations

You use a pattern-discovery configuration to identify themes and trends in a data collection. Pattern discovery identifies contextual clues within documents that help refine the accuracy and coverage of an extractor. Your configuration settings are saved between invocations of pattern discovery.

Procedure

  1. Right-click a Text Analytics project, and click Run As > Run Configurations.
  2. In the navigation pane of the Run Configurations window, right-click Pattern Discovery and click New to create a new configuration.
  3. In the Main tab, specify your settings:
    1. Select the Output view. The Output view field is pre-populated with the output views of the extractor.
    2. Specify a value for the Group on field. The Group on field is pre-populated with the attributes of type Span in the selected Output view. The attribute that you select defines the input contexts on which pattern discovery runs.
    3. In the Entities to Consider Type Only field, select the field names, if any, to include as additional information during pattern discovery. Before pattern discovery runs, all occurrences of these entities in the input snippets are replaced by the type of the entity.
      Note: You cannot select the same attribute as the one indicated in the Group on field. See Pattern discovery scenarios for an example pattern that is generated when you select this option.
    4. In the Snippet Field Name field, select the attribute that contains the contexts that you want to analyze.

      A snippet is a larger region of text that contains an input context. Snippets are displayed in the Expanded Pattern Context Viewer, where you can examine the input contexts of a particular pattern in a larger context. This field is populated automatically with all attributes in the AQL view that is used to define the input contexts, along with the special value Default_Snippet. Default_Snippet is the default selection and represents the 25 characters to the left and right of each input context, so by default the Expanded Pattern Context Viewer displays each input context together with the 25 characters that precede and follow it. To use a custom snippet value:

      1. Specify the intended value of the snippet as an additional attribute in the AQL view that is used to define the input contexts.
      2. Select that attribute in this field. For example, the attribute customSnippet in the following AQL view defines a custom snippet of 10 tokens to the left and right of the input context.
        create view PhoneCandidateContext as
        select LeftContextTok(P.num, 4) as context,
            CombineSpans(LeftContextTok(P.num, 10),
                RightContextTok(P.num, 10)) as customSnippet
        from PhoneCandidate P;

        output view PhoneCandidateContext;

      The selected attribute name must be different from the attributes that are selected in the Group on and Entities to Consider Type Only fields.

    5. Select the language of the data collection.
    6. Click Browse Workspace or Browse File System to select the location of the data collection.
      Tip: Select the Show All Files check box to display all files in the dialog, including those that have unsupported file extensions. For example, .avi is not a supported format for a data collection. However, if Show All Files is selected, AVI files or directories that contain .avi files are displayed in the dialog.
      Valid entries for the data collection field are provided in Data collection formats.
  4. If external dictionaries and tables are required by the extractor, click the corresponding tabs to configure them.
  5. Optional: Click the Advanced tab to further customize pattern discovery.
    Restriction: The pattern-discovery algorithm can only use the multilingual tokenizer. You cannot specify another tokenizer in the pattern-discovery configuration. For more information about the multilingual tokenizer, see Tokenization.

    To rerun a recently launched configuration, select the configuration from the Run toolbar button menu.
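
Example: a character-based custom snippet

As a variation on the custom snippet that is described for the Snippet Field Name field, a snippet can also be sized in characters rather than tokens. The following view is a minimal sketch, not a definitive definition. It assumes the same PhoneCandidate view as the earlier example and relies on the character-based AQL functions LeftContext and RightContext. The names PhoneCandidateCharContext and charSnippet are hypothetical. The view builds a window of 25 characters on either side of the phone number, which is comparable in size to the Default_Snippet window.

    -- Hypothetical view: the snippet window is measured in characters, not tokens.
    -- charSnippet spans from 25 characters before P.num to 25 characters after it.
    create view PhoneCandidateCharContext as
    select LeftContextTok(P.num, 4) as context,
        CombineSpans(LeftContext(P.num, 25),
            RightContext(P.num, 25)) as charSnippet
    from PhoneCandidate P;

    output view PhoneCandidateCharContext;

With a view like this one, context would be selected in the Group on field and charSnippet in the Snippet Field Name field, which keeps the two selections distinct as the procedure requires.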