Creating a labeled data collection

A labeled data collection is the standard way to evaluate the accuracy and completeness of an extractor.

Before you begin

Text Analytics supports labeled data collections for all data collection formats except CSV with header and JSON.

  1. Ensure that a BigInsights project exists.
  2. Right-click the project, and click Labeled Data Collection.

About this task

You can create a labeled data collection from the results of an extractor or from a data collection.

Procedure


To create a labeled data collection, perform the steps for the source that you want to use:
A data collection
  1. Click Import from data collection.
  2. Select the data collection that you want to import, and click OK.

    The documents are imported into the labeledCollections directory, which is part of the project. For every file that is imported, a corresponding LC file is created in a subdirectory of the labeledCollections directory.

  3. Configure the labeled data collection.

    When the import is complete, you can configure the labeled data collection immediately by opening the Configuration page from the message panel, or you can configure it later.

  4. Label the documents in the data collection.
Extracted results
  1. Click Import from extraction result.
  2. Select the extraction result from the project that you want to use.

    The results are imported into the labeledCollections directory. In addition to the annotated text, the annotation types and output views are also imported.

    Restriction: A result set that contains values of type SPAN over any field other than Document.text cannot be imported.
  3. Label the documents in the data collection.
Tip: If you run an extractor and it produces reasonably good results, consider creating a labeled data collection from the extracted results for later use.
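
To confirm outside the editor that an import produced the expected files, you can list the LC files in the project. The following Python sketch is illustrative only. It assumes that the labeledCollections directory sits directly under the project root and that LC files use the .lc extension. The function name list_lc_files and the sample file and directory names are hypothetical.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def list_lc_files(project_root: str) -> dict[str, list[str]]:
    """Group .lc files by their top-level subdirectory under labeledCollections."""
    # Assumption: labeledCollections is a direct child of the project root.
    base = Path(project_root) / "labeledCollections"
    groups: dict[str, list[str]] = {}
    if not base.is_dir():
        return groups
    for path in sorted(base.rglob("*.lc")):
        # The first path component below labeledCollections names the collection.
        collection = path.relative_to(base).parts[0]
        groups.setdefault(collection, []).append(path.name)
    return groups


# Demonstration against a throwaway directory tree with hypothetical names.
with tempfile.TemporaryDirectory() as root:
    sub = Path(root) / "labeledCollections" / "sampleCollection"
    sub.mkdir(parents=True)
    (sub / "doc1.txt.lc").write_text("")
    (sub / "doc2.txt.lc").write_text("")
    found = list_lc_files(root)

print(found)
```

Point the function at the root directory of your own project to see one entry per labeled data collection.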

Labeling documents in a data collection

You must label the text in the data collection so that the Text Analytics system can compare the labeled text with the extractor results. Based on this comparison, the Text Analytics system computes labeled collection measures, which you can use to evaluate the quality of your extractor.
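
The comparison can be pictured with a small Python sketch. This topic does not name the labeled collection measures or the matching rule, so the sketch assumes precision (accuracy), recall (completeness), and F1, computed with exact matching on annotation type and character offsets. The names Span and compare and the sample data are hypothetical and are not part of the product.

```python
from __future__ import annotations

from typing import NamedTuple


class Span(NamedTuple):
    """One annotation: its type name and character offsets in the document."""
    annotation_type: str
    begin: int
    end: int


def compare(labeled: set[Span], extracted: set[Span]) -> dict[str, float]:
    """Compare extractor output with hand labels (assumed exact span matching)."""
    matched = labeled & extracted
    # Precision: share of extracted spans that were also labeled by hand.
    precision = len(matched) / len(extracted) if extracted else 0.0
    # Recall: share of hand-labeled spans that the extractor found.
    recall = len(matched) / len(labeled) if labeled else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels and extractor results for a single document.
labeled = {Span("Person", 0, 10), Span("Person", 25, 33), Span("Phone", 40, 52)}
extracted = {Span("Person", 0, 10), Span("Phone", 40, 52), Span("Phone", 60, 72)}

measures = compare(labeled, extracted)
print(measures)
```

In this example, two of the three extracted spans match a label, and two of the three labels are found, so a missed label lowers recall and a spurious extraction lowers precision. This is why documents must be labeled completely before they are marked as complete.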

Procedure

  1. Open each document (.lc file) in the editor, and manually label the document.
    1. Select the text that you want to label.
    2. Right-click the selected text. From the Annotate As.. menu, choose the annotation type that you want to assign to the selected text.
    3. Repeat the previous steps until you label all the text that is of interest in the file.
      Restriction: If a view has multiple fields or attributes, you must:
      • Label every field in a tuple before you can label the next tuple.
      • For each view, label the fields in the same order in which they are defined on the Annotation types configuration page.

      The .lc file is saved automatically after the text is labeled.

      Tip:

      You can apply the default annotation type to a span of selected text by pressing Ctrl+Enter.

      To add a new annotation type to the labeled data collection, you must configure the labeled data collection.

  2. Mark the document as complete. In the Project Explorer, right-click the file, and choose Labeled Data Collection > Mark as Complete.

    You can also mark the document as complete by placing your cursor inside the Labeled Document editor, right-clicking, and choosing Mark as Complete.