Input
- Normalize whitespace
- Remove leading and trailing whitespace from each input context.
Runs of consecutive internal whitespace characters are replaced
by a single space before processing.
- Normalize new lines
- Treat new line characters as whitespace. This option applies only
if the Normalize whitespace option is in effect.
- Case-insensitive analysis
- Convert input contexts to lowercase characters before processing.
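The three Input options can be pictured with a minimal Python sketch. The function name `normalize_context` and its parameters are hypothetical and are not part of the product. The sketch assumes that newline characters are left untouched when Normalize new lines is off, which the text above does not state explicitly.

```python
import re

def normalize_context(text, normalize_whitespace=True,
                      normalize_newlines=False, case_insensitive=False):
    """Illustrative only: mimics the three Input options on one context."""
    if normalize_whitespace:
        if normalize_newlines:
            # Newlines count as whitespace: collapse every whitespace run.
            text = re.sub(r"\s+", " ", text).strip()
        else:
            # Assumption: newlines are preserved; collapse only runs of
            # other whitespace and trim spaces at both ends.
            text = re.sub(r"[^\S\r\n]+", " ", text).strip(" ")
    # Normalize new lines has no effect unless Normalize whitespace is on.
    if case_insensitive:
        text = text.lower()
    return text

print(normalize_context("  Can   be\treached  ", case_insensitive=True))
```

With both whitespace options enabled, a context that spans several lines is reduced to a single line of space-separated tokens.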
Sequence Mining
These settings define what is considered a frequent sequence when
you apply the Pattern Discovery algorithm. A sequence is a series of
consecutive tokens that occur in the input contexts.
- Minimum Sequence Length
- The minimum length of a sequence, in tokens, that is considered by
the algorithm when it determines the most frequent sequences.
For example, if the Minimum Sequence Length is 2, in the context
"can be reached" the algorithm considers only the following
sequences (separated by ";"): {can be; be reached; can be reached}.
The sequences can, be, and reached are not considered, since their
length is less than 2.
- You can enter any integer value that is greater than 0. The default
value is 2.
- Maximum Sequence Length
- The maximum length of a sequence, in tokens, that is considered by
the algorithm when it determines the most frequent sequences.
For example, if the Maximum Sequence Length is 2 (and the Minimum
Sequence Length is 1), in the context "can be reached" the algorithm
considers only the following sequences (separated by ";"):
{can; be; reached; can be; be reached}. The sequence can be reached
is not considered, since its length is more than 2.
- You can enter any integer value that is greater than 0. The default
value is 5.
- Minimum Sequence Frequency
- The minimum number of times a sequence must appear in the input
contexts to be considered frequent.
For example, suppose that there are two sequences, can with a
frequency of 15 and he with a frequency of 5. If the Minimum
Sequence Frequency is 10, the second sequence, he, is disregarded.
- You can enter any integer value. The following list contains the
recommended values for various corpus sizes:
- 5
- Recommended for a small corpus (approximately 100 entries)
- 10
- Recommended for a medium corpus (approximately 5000 entries)
- 15
- Recommended for a large corpus (approximately 10,000 entries)
- 50
- Recommended for a very large corpus (approximately 100,000 entries)
Sequence support (the frequency of a sequence) is computed across
the entire corpus. The same frequent sequence can be distributed
across different groups in the output. Therefore, the sum of the
sizes of all groups that contain the same frequent sequence is
greater than or equal to the minimum support, but the size of an
individual group in the output might be smaller than the frequency
of each individual sequence in the group.
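The way the three Sequence Mining settings interact can be sketched in Python. This is an illustration under stated assumptions, not the Pattern Discovery implementation: tokens are obtained by splitting on whitespace, every occurrence of a sequence is counted, and the names `candidate_sequences` and `frequent_sequences` are hypothetical.

```python
from collections import Counter

def candidate_sequences(context, min_len=2, max_len=5):
    """List every run of consecutive tokens whose length is in range."""
    tokens = context.split()  # assumption: whitespace tokenization
    sequences = []
    for n in range(min_len, max_len + 1):
        for start in range(len(tokens) - n + 1):
            sequences.append(" ".join(tokens[start:start + n]))
    return sequences

def frequent_sequences(contexts, min_len=2, max_len=5, min_freq=10):
    """Count candidates over the whole corpus, keep those at or above min_freq."""
    counts = Counter()
    for context in contexts:
        counts.update(candidate_sequences(context, min_len, max_len))
    return {seq: n for seq, n in counts.items() if n >= min_freq}

# Minimum Sequence Length 2 (default maximum of 5):
print(candidate_sequences("can be reached", min_len=2))
# Maximum Sequence Length 2 with a minimum of 1:
print(candidate_sequences("can be reached", min_len=1, max_len=2))
```

The two calls reproduce the sequence sets given in the Minimum Sequence Length and Maximum Sequence Length examples. Because the counts are taken over the entire corpus, a sequence can pass the frequency threshold even when no single output group contains that many entries.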
Rules
These settings control the statistics that are computed from the
frequent sequences to determine the final semantic patterns.
- Sequence Correlation Measure Range
- The Correlation Measure determines how similar two sequences are to
each other and how important a sequence is within the entire corpus.
- The range specifies when two sequences are considered highly
correlated, in which case one of them is disregarded.
Use the Rule History view to examine the sequences that were
disregarded because they were highly correlated with other
sequences.
- You can enter any range between 0 and 1. The default range is
between 0.2 and 1. The higher the value, the higher the threshold
for two sequences to be considered correlated.
- Example: Sequence Correlation Measure Range between 0.2 and 1
Correlation between {can} and {be} = 0.7 (inside the range, so the
two sequences are considered highly correlated)
Correlation between {can} and {reached} = 0.1 (outside the range)
The algorithm considers disregarding one of {can} and {be}. It does
not consider disregarding either {can} or {reached} on the basis of
their correlation with each other.
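The range check in the example can be written as a short Python sketch. The correlation values are taken as given, since this section does not define how the Correlation Measure is calculated. The function name `is_highly_correlated` is hypothetical, and treating both ends of the range as inclusive is an assumption.

```python
def is_highly_correlated(correlation, low=0.2, high=1.0):
    """True when a precomputed correlation falls inside the configured range.

    Assumption: both bounds are inclusive.
    """
    return low <= correlation <= high

# Correlation values from the example above.
pairs = {("can", "be"): 0.7, ("can", "reached"): 0.1}
for (first, second), value in pairs.items():
    verdict = "candidate for pruning" if is_highly_correlated(value) else "both kept"
    print(f"{{{first}}} / {{{second}}}: {value} -> {verdict}")
```

Raising the lower bound of the range makes fewer pairs qualify as highly correlated, so fewer sequences are disregarded.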