You use a pattern-discovery configuration to
identify themes and trends in a data collection. Pattern discovery
identifies contextual clues within documents that help refine the
accuracy and coverage of an extractor. Your configuration settings
are saved between invocations of pattern discovery.
Procedure
- Right-click
a Text Analytics project, and click .
- In
the Run Configurations
window, right-click in the navigation pane to create a
new configuration.
- In
the Main tab, specify your
settings:
- Select
the Output view.
The Output view field is
pre-populated with the output views of the extractor.
- Specify
a value for the Group on field.
The Group on field is
pre-populated with the attributes of type Span in the selected
Output view. The attribute that you select for the Group on field defines the input
contexts on which pattern discovery runs.
- In
the Entities to Consider Type
Only field, select the field names, if any, to include as additional
information during pattern discovery.
Before pattern discovery runs, all occurrences of these entities
in the input snippets are replaced by the type of the entity.
Note: You cannot select the same
attribute as the one indicated in the Group on field. See
Pattern discovery scenarios for an example pattern that is generated
when you select this option.
- In
the Snippet Field Name field,
select the attribute that contains the contexts that you want to
analyze.
A snippet is a larger region of text that contains each of the input
contexts. Snippets are displayed in the Expanded Pattern Context
Viewer, which allows you to examine the input contexts of a
particular pattern in a larger context. This field is populated
automatically with all attributes in the AQL view that are used
to define the input contexts, along with a special value,
Default_Snippet. Default_Snippet
is the default selection, and represents 25 characters to the
left and right of each input context. Therefore, by default, the
Expanded Pattern Context Viewer displays snippets of text that
contain the 25 characters that precede and follow an input
context. To use a custom snippet value:
- Specify the intended value of the snippet as
an additional attribute in the AQL view that is used to define
the input contexts.
- Select that attribute in this field. For
example, the attribute customSnippet
in the following AQL view defines a custom snippet of 10 tokens
to the left and right of the input context.
create view PhoneCandidateContext as
select LeftContextTok(P.num, 4) as context,
CombineSpans(LeftContextTok(P.num, 10),
RightContextTok(P.num, 10)) as customSnippet
from PhoneCandidate P;
output view PhoneCandidateContext;
The selected attribute name must be different from the names selected
in the Group on and Entities to Consider Type Only
fields.
- Select
the language of the data collection.
- Click
Browse Workspace or Browse File System to select the
location of the data collection.
Tip: Select the Show All Files check box to display
all files in the dialog, including those that have unsupported
file extensions. For example, .avi
is not a supported format for a data collection. However, if Show All Files is selected, AVI files or directories that contain
.avi files are displayed in the
dialog.
Valid entries for the data collection field are provided in Data
collection formats.
- If
external dictionaries and tables are required by the extractor,
click the corresponding tabs to configure them.
- Optional: Click
the Advanced tab to further
customize pattern discovery.
Restriction: The
pattern-discovery algorithm can use only the multilingual
tokenizer. You cannot specify another tokenizer in the
pattern-discovery configuration. For more information about the
multilingual tokenizer, see Tokenization.
To rerun a recently launched configuration, select the
configuration from the Run
toolbar button menu.
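As a variation on the custom snippet step in the procedure, the following sketch defines a character-based snippet instead of a token-based one. It assumes the PhoneCandidate view from the earlier example and uses the AQL scalar functions LeftContext and RightContext, which take a character count rather than a token count. The view name PhoneCandidateCharContext, the attribute name charSnippet, and the width of 50 characters are illustrative choices, not values required by pattern discovery.

```
-- Sketch only: a character-based custom snippet.
-- Assumes the PhoneCandidate view from the earlier example.
create view PhoneCandidateCharContext as
select LeftContextTok(P.num, 4) as context,
       -- 50 characters on each side of the phone number
       CombineSpans(LeftContext(P.num, 50),
                    RightContext(P.num, 50)) as charSnippet
from PhoneCandidate P;

output view PhoneCandidateCharContext;
```

With a view like this, you would select context in the Group on field and charSnippet in the Snippet Field Name field, which keeps the two selections distinct as the procedure requires.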