Pattern-discovery configuration: Advanced settings

As you discover patterns in text input, you can also refine your pattern-discovery configuration to optimize results.

Input

Normalize whitespace
Remove any leading and trailing whitespace from the input contexts, and replace multiple consecutive internal whitespace characters with a single space, before processing.
Normalize new lines
Treat newline characters as whitespace. This option applies only if the Normalize whitespace option is enabled.
Case-insensitive analysis
Convert input contexts to lowercase characters before processing.
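
The three Input options can be pictured with the following minimal Python sketch. It is not the product's implementation: the function name normalize_context and its parameters are hypothetical, and the handling of whitespace when newlines are kept (each line is trimmed separately) is an assumption.

```python
import re

def normalize_context(text, normalize_whitespace=True,
                      normalize_newlines=False, case_insensitive=False):
    """Hypothetical sketch of the Input options; not the product's code."""
    if normalize_whitespace:
        if normalize_newlines:
            # Newlines count as whitespace: collapse every run to one space.
            text = re.sub(r"\s+", " ", text).strip()
        else:
            # Assumption: newlines are kept, and each line is trimmed
            # and collapsed on its own.
            text = "\n".join(
                re.sub(r"[^\S\n]+", " ", line).strip()
                for line in text.split("\n")
            )
    # Without normalize_whitespace, normalize_newlines has no effect.
    if case_insensitive:
        text = text.lower()
    return text
```

For instance, calling normalize_context("Can  be\n reached", normalize_newlines=True, case_insensitive=True) yields the single-line context "can be reached".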

Sequence Mining

These settings define what is considered a frequent sequence when you apply the Pattern Discovery algorithm. A sequence is a series of consecutive tokens that occur in the input contexts.
Minimum Sequence Length
The minimum length of a sequence, in tokens, that is considered by the algorithm when it determines the most frequent sequences.

For example, if the Minimum Sequence Length is 2, then in the context "can be reached" the algorithm considers only the following sequences: {can be; be reached; can be reached} (sequences are separated by ";"). The sequences {can}, {be}, and {reached} are not considered because their length is less than 2.

You can enter any integer value that is greater than 0. The default value is 2.
Maximum Sequence Length
The maximum length of a sequence, in tokens, that is considered by the algorithm when it determines the most frequent sequences.

For example, if the Maximum Sequence Length is 2 (and the Minimum Sequence Length is 1), then in the context "can be reached" the algorithm considers only the following sequences: {can; be; reached; can be; be reached} (sequences are separated by ";"). The sequence {can be reached} is not considered because its length is more than 2.

You can enter any integer value that is greater than 0. The default value is 5.
Minimum Sequence Frequency
The minimum number of times a sequence appears in the input contexts to be considered frequent.

For example, suppose that there are two sequences: {can} with a frequency of 15 and {he} with a frequency of 5. If the Minimum Sequence Frequency is 10, the second sequence, {he}, is disregarded.

You can enter any integer value. The following list contains the recommended values for various corpus sizes:
5
Recommended for a small corpus (approximately 100 entries)
10
Recommended for a medium corpus (approximately 5,000 entries)
15
Recommended for a large corpus (approximately 10,000 entries)
50
Recommended for a very large corpus (approximately 100,000 entries)

Sequence frequency (also called support) is computed across the entire corpus. The same frequent sequence can be distributed across different groups in the output. Therefore, the sum of the sizes of all groups that contain the same frequent sequence is greater than or equal to the Minimum Sequence Frequency, but the size of an individual group in the output might be smaller than the frequency of each individual sequence in the group.
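
The interaction of the three Sequence Mining settings can be illustrated with a short Python sketch. It is an illustration under stated assumptions, not the Pattern Discovery algorithm itself: the function names are hypothetical, tokens are obtained by splitting on whitespace, and every occurrence of a sequence counts toward its frequency.

```python
from collections import Counter

def candidate_sequences(tokens, min_len=2, max_len=5):
    """All runs of consecutive tokens whose length is within the limits."""
    n = len(tokens)
    return [
        tuple(tokens[i:i + k])
        for k in range(min_len, max_len + 1)
        for i in range(n - k + 1)  # empty when k exceeds the context length
    ]

def frequent_sequences(contexts, min_len=2, max_len=5, min_freq=5):
    """Count candidates across the whole corpus and keep the frequent ones."""
    counts = Counter()
    for context in contexts:
        # Assumption: whitespace tokenization; the product's tokenizer may differ.
        counts.update(candidate_sequences(context.split(), min_len, max_len))
    return {seq: c for seq, c in counts.items() if c >= min_freq}
```

With the defaults for length, candidate_sequences("can be reached".split()) returns the sequences (can, be), (be, reached), and (can, be, reached), which matches the Minimum Sequence Length example above.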

Rules

These settings control how statistics are computed from the frequent sequences to determine the final semantic patterns.

Sequence Correlation Measure Range
The Correlation Measure determines how similar two sequences are to each other and how important a sequence is within the entire corpus.
The range specifies when two sequences are considered highly correlated, and when one of them is disregarded.

Use the Rule History view to examine the sequences that were disregarded because they were highly correlated with other sequences.

You can enter any range between 0 and 1. The default range is between 0.2 and 1. The higher the value, the higher the threshold for two sequences to be considered correlated.
Example: Sequence Correlation Measure Range between 0.2 and 1
Correlation between {can} and {be} = 0.7 (considered as highly correlated)
Correlation between {can} and {reached} = 0.1

The algorithm considers disregarding one of {can} and {be}. It will not consider disregarding either of {can} and {reached}.
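
The range check in this example can be sketched in Python as follows. The correlation scores are taken as given input because this page does not define how the Correlation Measure is computed. The function name is hypothetical, and treating both ends of the range as inclusive is an assumption.

```python
def highly_correlated_pairs(correlations, low=0.2, high=1.0):
    """Return the pairs whose correlation score falls inside the range.

    correlations maps a pair of sequences to a precomputed score.
    Assumption: both bounds of the range are inclusive.
    """
    return [pair for pair, score in correlations.items()
            if low <= score <= high]

# Scores from the example above.
scores = {("can", "be"): 0.7, ("can", "reached"): 0.1}
candidates = highly_correlated_pairs(scores)
```

Here candidates contains only the pair (can, be), so one of those two sequences is a candidate for being disregarded, while {can} and {reached} are both left alone.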