Evaluating the quality of an extractor by combining the labeled data collection with the Annotation Difference Viewer

You can evaluate the quality of an extractor by comparing the results of an extractor run to a labeled data collection. The comparison shows how accurate and complete the extractor is against a standard that you establish.

Procedure

  1. Run the extractor that you want to evaluate.

    The results of the run are saved in a new subdirectory named result-system-timestamp in the results directory.

  2. Under the Results directory, right-click the result-system-timestamp subdirectory, and select Compare Text Analytics Result With > Labeled Collection.
  3. Select the labeled data collection, and click OK.

    The comparison is shown in the Annotation Difference Viewer.

    Note: The comparison is shown only for those annotation types that are common between the results folder and the labeled collection (view name and attribute name). For example, if the results have annotation types Person.name and Phone.num, but the labeled data collection has only the type Person.name, the comparison is shown only for Person.name.
  4. Evaluate the quality of your extractor.
    1. Evaluate the accuracy of your extractor.

      The precision measure is a reflection of the accuracy of your extractor. Your goal is to create an extractor that is as precise as possible.

    2. Evaluate the completeness of your extractor.

      The recall measure is a reflection of the completeness, or coverage, of your extractor.
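As the note in step 3 indicates, only annotation types that both sides share are compared. That matching rule can be sketched as a set intersection; the representation below is illustrative only (a set of (view name, attribute name) pairs), not the product's internal data structure:

```python
# Illustrative only: represent each side's annotation types as a set of
# (view_name, attribute_name) pairs.
extractor_types = {("Person", "name"), ("Phone", "num")}
labeled_types = {("Person", "name")}

# Only types present in both the results folder and the labeled
# collection are compared.
common_types = extractor_types & labeled_types
print(sorted(common_types))  # [('Person', 'name')]
```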

    Labeled collection details apply only when you are comparing the results of an extractor run to a labeled data collection. The measures that are described in Annotation Difference Viewer are the standard measures that are used in natural language processing.

    Table 1. Labeled Collection measures

    Resource
        Name of the output view. Nested beneath it are the following rows:

        Exact
            The calculation of precision, recall, and F-measure when only exact matches count; overlapping results are counted as incorrect.
        Partial
            The calculation of precision, recall, and F-measure when overlapping results are counted as partially correct, in proportion to the overlap. For example, if 8 of the 13 characters in a result overlap, the result is counted as 8/13, or 0.62.
        Relaxed
            The calculation of precision, recall, and F-measure when overlapping results are counted as correct.

    Precision
        The percentage of the results that the extractor identified that are correct according to the labeled data collection. The higher the precision, the better the extractor is at extracting only results that are annotated in the labeled data collection. For example, if the extractor extracted five phone numbers from an input file but only four of them are correct based on the labeled data collection, the precision is 4/5, or 80%.

    Recall
        The percentage of the correct results in the labeled data collection that the extractor extracted. For example, according to the labeled data collection, there are 8 correct phone numbers. The extractor extracted 5 phone numbers, but only 4 of those 5 are correct. The recall is 4/8, or 50%.

    F-Measure
        The harmonic mean of the precision and recall measures, computed as 2 x Precision x Recall / (Precision + Recall).
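The measures above are straightforward to compute by hand. A minimal sketch in Python, using the counts from the phone-number example (the variable names here are illustrative, not part of the product):

```python
# Counts from the phone-number example: 4 correct results out of 5
# extracted, against 8 results in the labeled data collection.
true_positives = 4   # extracted results that match the labeled collection
extracted = 5        # total results the extractor produced
labeled = 8          # total correct results in the labeled data collection

precision = true_positives / extracted   # 4/5 = 0.80
recall = true_positives / labeled        # 4/8 = 0.50
f_measure = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f-measure={f_measure:.2f}")

# Under the Partial measure, an overlapping result contributes fractional
# credit instead of 0 or 1: 8 overlapping characters out of 13 count as
# 8/13, or about 0.62, toward the true-positive total.
partial_credit = 8 / 13
```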