Creating or modifying an extractor by using the Text Analytics Workflow perspective

Create or modify an extractor by using the Text Analytics Workflow perspective in your Eclipse development environment to extract information from unstructured and semistructured data. Extractors help you analyze large volumes of text and produce annotated documents that provide insight into unconventional data.

The included examples and code snippets are from the sampleTextAnalyticsExtractor_eclipse.zip project.

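The Text Analytics Workflow perspective has six steps:

  1. Select documents
  2. Label examples and clues
  3. Develop the extractor
  4. Test the extractor
  5. Evaluate and improve the performance of the extractor
  6. Export the extractor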

To improve the quality of the extractor, complete steps 2-4 iteratively. Each iteration refines your results.

This tutorial is based mainly on the InfoSphere® BigInsights™ version 2.x Eclipse development environment. For a comparison with older versions, see What's New in InfoSphere BigInsights.

Step 1: Select documents

Procedure

  1. Select a data collection. A data collection is the set of documents that you want to annotate or extract results from. For information about valid collection formats, see Data collection formats.
    1. Select a collection.
    2. Select the language for the collection.
  2. Select the files that you want to work with.
    1. Open the initial set of documents that you want to annotate. Select the files, and click Open.

Example: Select documents

Suppose that you want to identify financial metrics from fourth quarter results that are published in five files in the data/ibmQuarterlyReports directory. These files make up the data collection:
  • 4Q2006.txt: 2006 fourth quarter results
  • 4Q2007.txt: 2007 fourth quarter results
  • 4Q2008.txt: 2008 fourth quarter results
  • 4Q2009.txt: 2009 fourth quarter results
  • 4Q2010.txt: 2010 fourth quarter results

You want to create the extractor in the FinancialIndicator project.

  1. Click Browse Workspace > FinancialIndicator > data > ibmQuarterlyReports, and click OK.
  2. Click 4Q2010.txt, and click OK.

Step 2: Label examples and clues

A text snippet is a sentence or phrase. Label the text snippets to identify what to extract.

Labels are meaningful identifiers of text that you want to extract. Labels also categorize various clues, which are used when you develop an extractor. There are two types of labels:
Top-level
Top-level identifier of the text snippets that you want to extract. An example of a top-level label is Indicator, which marks financial indicators that consist of a financial metric and an amount.
Clues
Extraction clues are meaningful pieces of information that help you develop and organize your extractor. A clue label is nested inside a top-level label. Examples are the Metric and Amount clue labels.

Label snippets of interest

Before you begin

Ensure that you complete step 1 in the Text Analytics Workflow perspective and that the 4Q2010.txt file is open in the editor.

Procedure

  1. Right-click the text snippet of interest.
  2. To create a label for the snippet, choose Add Example with New Label, and type a label in the Label Name field. Leave the Parent Label field empty if the new label is a top-level label, or select an existing label as the parent. Click Finish.
  3. To add the snippet to an existing label instead, choose Label Example As, and select the label.

Example: Label snippets of interest

The 4Q2010.txt file contains 2010 fourth quarter results. Here is a portion of the file:
    Record revenue of $29.0 billion, up 7 percent as reported and adjusting for currency;
    Record net income of $5.3 billion, up 9 percent;
    Pre-tax income of $7 billion, up 9 percent;
    Gross profit margin of 49 percent, up 0.8 points;

To identify financial indicators from the fourth quarter results, examine the 4Q2010.txt file to determine the financial indicators that you want to extract.

  1. Select the following text, and right-click:
    revenue of $29.0 billion
  2. Choose Add Example with New Label and type Indicator in the Label Name field.

    The Indicator label is displayed to the right in the extraction plan.

  3. Select the following text, and right-click:
    net income of $5.3 billion
  4. Choose Label Example As, and select Indicator.

Label extraction clues

Before you begin

Ensure that you labeled the snippets of interest in the open document.

About this task

Extraction clues divide an extraction task into smaller subtasks in a top-down approach. Take the following indicator snippet from the sample project: revenue of $29.0 billion. One way to divide this snippet is into the following two parts, or clues:
  • revenue
  • $29.0 billion
Clues can be categorized across two dimensions:
Inside or outside
Categorizes whether the clue is inside or outside of the labeled text.
Inside
The clue is included in or part of the labeled text.
Suppose that you want to extract absolute dollar amounts from the following text snippets. In this example, $8.7 billion, $5.3 billion, 2 percent, and 11 percent are labeled as Amount.
    cash flow of $8.7 billion
    net income was $5.3 billion
    revenues increased 2 percent
    revenue excluding divested PLM operations up 11 percent
The following are inside clues:
  • Numbers: 8.7, 5.3, 2, and 11
  • Units: billion, percent
  • Currency: $
Outside
The clue is excluded from, or not part of, the labeled text.

In the preceding example, outside clues are words such as of, was, increased, and up.

Positive or negative
Categorizes whether the clue indicates a valid candidate for extraction or a candidate that must not be extracted.
Positive
The clue indicates a valid candidate for extraction.

In the preceding example, positive clues are the words: of and was. These words precede mentions of absolute dollar amounts.

Negative
The clue indicates a candidate that must not be extracted.

In the preceding example, negative clues are the words: increased and up. These words precede mentions of percentages, not absolute dollar amounts.

Suppose that you want to label the extraction clues in the text snippets that are displayed in the extraction plan. The following snippets are in the extraction plan:
  • Record revenue of $29.0 billion
  • Record net income of $5.3 billion

Procedure

  1. Select the word revenue, and right-click.
  2. Choose Add Example with New Label, type Metric in the Label Name field, and choose Indicator from the Parent Label menu. Click Finish.

    The label is displayed in the extraction plan.

  3. Select $29.0 billion, and right-click. Choose Add Example with New Label, type Amount in the Label Name field, and choose Indicator from the Parent Label menu. Click Finish.

Step 3: Develop the extractor

To develop an extractor, create AQL statements that extract mentions that are similar to the example text under the labels in the extraction plan.
Tip: Create AQL statements to extract the example text for the labels at the lowest level first (for example, the Metric and Amount clues). Then create AQL statements for the next highest level (for example, the Indicator label), and so on.

There are four categories of AQL statements. Create them in the following order. This order is a best practices approach to developing an extractor.

Creating basic feature rules
Commonly used statements that define regular expressions, dictionaries, and parts of speech, which are the basic features for extraction.
Creating candidate generation rules
Commonly used statements that combine basic features into complete candidates.
Creating filter and consolidation rules
Commonly used statements that filter and consolidate candidates to remove mistakes.
Creating other types of rules
Other types of rules can be created by adding the rule directly into the AQL files for the extractor.

Creating basic feature rules

Basic feature rules define the basic features of an extractor.

Procedure

  1. In the extraction plan, select the label for which you want to create the basic feature rule.
  2. Right-click the label, select New AQL Statement, and then select Basic Feature AQL Statement.
    Tip: Instead of right-clicking the label, you can right-click the BasicFeatures folder or its AQL Statement folder and select Basic Feature AQL Statement.
    Tip: Other types of statements can define basic features. See Creating other types of rules.
    1. In the View Name field, type the name of the view.
    2. In the AQL Module field, select the module.
      Tip: You do not need to create AQL modules. Four AQL modules are created automatically for each top-level label, one for each rule category.
    3. Select a file in the AQL Script field, or type the name of a new AQL script.
    4. Select Regular expression, Dictionary, or Part of speech from the Type menu.
    5. Optional: Select Output View to see the results of this rule. Select Export View to make the rule available for other modules.
    6. Click OK. A statement template is inserted in the selected AQL script in the selected AQL module.
    7. Complete the template.

    For more information, see the examples that follow.

Example: Creating a regular expression

Before you begin

Ensure that the AQL script is open in the editor.

About this task

Suppose that you want to identify numeric amounts such as 11, including an optional decimal part such as the .1 in 11.1.

Procedure

Assume that you added a regular expression statement named Number. Complete the regular expression template as in the following example:
    -- Match integers and decimals, for example 11 or 29.0.
    create view Number as
    extract regex /\d+(\.\d+)?/
    on R.text as match
    from Document R;

This rule identifies matches of the regular expression in the input text. For more information, see Regular expressions and Generating regular expressions.
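
To see which spans a rule like this would produce, the following Python sketch applies the same regular expression to two lines quoted earlier from 4Q2010.txt. It is not AQL and is not part of the sample project: Python's re module stands in for the AQL regex engine, which is an assumption made only for illustration, and the two engines can differ in details.

```python
import re

# Same pattern as in the Number view: digits with an optional decimal part.
NUMBER = re.compile(r"\d+(\.\d+)?")

# Two sample lines quoted earlier from 4Q2010.txt.
text = (
    "Record net income of $5.3 billion, up 9 percent;\n"
    "Gross profit margin of 49 percent, up 0.8 points;"
)

# Each match carries begin and end offsets, much like a span in a view.
matches = [(m.start(), m.end(), m.group(0)) for m in NUMBER.finditer(text)]

for begin, end, value in matches:
    print(f"[{begin}-{end}] {value}")
```

Note that the pattern matches every number, whether it belongs to a dollar amount or to a percentage. Separating the two is left to the later candidate generation and filter rules.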

Example: Creating a dictionary

Before you begin

Ensure that the AQL script is open in the editor.

About this task

Suppose that you want to identify monetary units such as million, billion, and so on.

Procedure

  1. Assume that you added a dictionary statement named Unit. Complete the dictionary statement template as in the following example.
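    -- Load the dictionary entries from a file.
    create dictionary UnitDict
    from file 'dictionaries/unit.dict'
    with language as 'en';

    -- Mark every occurrence of a dictionary entry in the document text.
    create view Unit as
    extract dictionary 'UnitDict'
        on R.text as match
    from Document R;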

    This rule defines a dictionary of terms and identifies all matches of that dictionary in the input text. For more information, see Create dictionary and Dictionaries.

  2. Create the dictionary file that the statement references (dictionaries/unit.dict in this example), with one entry per line. Include at least the following entries:
    million
    billion
  3. Save the dictionary file.
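
As a rough picture of what dictionary extraction does, the following Python sketch reads the entries and marks each whole-word occurrence in a sample line. It is not AQL. The in-memory file contents, the case-insensitive matching, and the word-boundary rule are assumptions of this sketch, not a description of how the Text Analytics runtime is implemented.

```python
import re

# Hypothetical in-memory copy of dictionaries/unit.dict, one entry per line.
unit_dict_file = "million\nbillion\n"
entries = [line.strip() for line in unit_dict_file.splitlines() if line.strip()]

# Match whole words only and ignore case (assumptions of this sketch).
unit_pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(entry) for entry in entries) + r")\b",
    re.IGNORECASE,
)

# Sample line quoted earlier from 4Q2010.txt.
text = "Record net income of $5.3 billion, up 9 percent;"
unit_matches = [(m.start(), m.end(), m.group(0)) for m in unit_pattern.finditer(text)]
print(unit_matches)
```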

Creating candidate generation rules

About this task

Create a candidate generation rule to combine basic features into complete candidates.

Procedure

  1. In the extraction plan, select the label for which you want to create the candidate generation rule.
  2. Right-click the label, select New AQL Statement, and then select Candidate Generation AQL Statement.
    Tip: Instead of right-clicking the label, you can right-click the CandidateGeneration folder or its AQL Statement folder and select Candidate Generation AQL Statement.
    1. In the View Name field, type the name of the view.
    2. In the AQL Module field, select the module.
    3. Select a file in the AQL Script field, or type the name of a new AQL script.
    4. Choose Pattern, Union All, Block, or Select from the Type menu.
    5. Optional: Select Output View to see the results of this rule. Select Export View to make the rule available for other modules.
    6. Click OK. A statement template is inserted in the selected AQL script in the selected AQL module.
    7. Complete the template.

    For more information, see the examples that follow.

Example: Creating a pattern candidate generation rule

About this task

Suppose that you want to identify absolute amounts such as $7 billion and $11.52.

Procedure

Assume that you added a pattern statement named AmountAbsolute and defined views named Number and Unit. Complete the template as follows.
    -- Match "$", then a Number, then a Unit, and return the whole sequence.
    create view AmountAbsolute as
    extract pattern /\$/ <N.match> <U.match>
    return group 0 as match
    from Number N, Unit U;

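This rule identifies a pattern that consists of the $ character followed by a number and a unit. Because the unit is required in this pattern, $7 billion matches, but an amount without a unit, such as $11.52, does not. For more information, including how to mark a pattern element as optional, see Sequence patterns.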
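
The following Python sketch imitates the three-part sequence with a single regular expression and checks it against amounts mentioned in this topic. It is not AQL. Allowing whitespace between the elements is an assumption that stands in for token-based matching, and the unit list repeats the two entries from the dictionary example.

```python
import re

NUMBER = r"\d+(?:\.\d+)?"       # mirrors the Number view
UNIT = r"(?:million|billion)"   # mirrors the Unit view (entries from unit.dict)

# "$", then a number, then a unit; all three elements are required.
AMOUNT_ABSOLUTE = re.compile(rf"\$\s*{NUMBER}\s+{UNIT}")

samples = ["$7 billion", "$11.52", "9 percent"]
results = {s: AMOUNT_ABSOLUTE.fullmatch(s) is not None for s in samples}
print(results)
```

In this sketch, only the first sample satisfies the pattern, because the other two lack either the unit or the $ character.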

Example: Creating a union candidate generation rule

About this task

Suppose that you want to combine all of the amount candidates, including absolute amounts and percentage amounts, into a single view.

Procedure

Assume that you added a union statement named AmountCandidate and defined views named AmountAbsolute and AmountPercent. Complete the rule template as follows.
    -- Combine the tuples of both views; the views must have the same schema.
    create view AmountCandidate as
    (select R.* from AmountAbsolute R)
    union all
    (select R.* from AmountPercent R);

This rule combines the mentions that are identified by other rules into one view. For more information, see create view statement.

Example: Creating a select candidate generation rule

About this task

Suppose that you want to select indicator candidates, where an indicator is formed by a metric that is followed by an amount within 0 to 10 tokens.

Procedure

Assume that you added a select rule named IndicatorCandidate and defined views named Metric and Amount. Complete the rule template as follows.
    -- Pair each metric with an amount that follows it within 0 to 10 tokens.
    create view IndicatorCandidate as
    select M.match as metric, A.match as amount, CombineSpans(M.match, A.match) as match
    from Metric M, Amount A
    where FollowsTok(M.match, A.match, 0, 10);

For more information, see select statement.
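
To make the roles of FollowsTok() and CombineSpans() concrete, the following Python sketch re-creates both on character-offset spans. It is not AQL. The tokenizer is a deliberate simplification and an assumption of this sketch (the real tokenizer may split text differently), and span_of is a hypothetical helper that exists only to build the example spans.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]  # (begin, end) character offsets into the text


def tokenize(text: str) -> List[Span]:
    """Simplified tokenizer (an assumption): numbers, words, or single symbols."""
    return [m.span() for m in re.finditer(r"\d+(?:\.\d+)?|\w+|[^\w\s]", text)]


def follows_tok(text: str, first: Span, second: Span, min_tok: int, max_tok: int) -> bool:
    """True if `second` begins after `first` ends, with min_tok..max_tok tokens between."""
    if second[0] < first[1]:
        return False
    between = [t for t in tokenize(text) if first[1] <= t[0] and t[1] <= second[0]]
    return min_tok <= len(between) <= max_tok


def combine_spans(first: Span, second: Span) -> Span:
    """Span from the start of `first` to the end of `second`."""
    return (first[0], second[1])


def span_of(text: str, phrase: str) -> Span:
    """Locate the first occurrence of `phrase` (helper for this example only)."""
    begin = text.index(phrase)
    return (begin, begin + len(phrase))


text = "Record revenue of $29.0 billion"
metric = span_of(text, "revenue")
amount = span_of(text, "$29.0 billion")

# One token ("of") separates the metric from the amount, so the pair qualifies.
if follows_tok(text, metric, amount, 0, 10):
    begin, end = combine_spans(metric, amount)
    print(text[begin:end])
```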

Creating filter and consolidation rules

You can create commonly used statements that filter and consolidate candidates to remove mistakes.

Procedure

  1. In the extraction plan, select the label for which you want to create the filter and consolidation rule.
  2. Right-click the label, select New AQL Statement, and then select Filter and Consolidate AQL Statement.
    Tip: Instead of right-clicking the label, you can right-click the FilterConsolidate folder or its AQL Statement folder and select Filter and Consolidate AQL Statement.
    1. In the View Name field, type the name of the view.
    2. Select the module in the AQL Module field.
    3. Select a file in the AQL Script field, or type in the name of a new AQL script.
    4. Select Predicate-based Filter, Set-based Filter, or Consolidate from the Type menu.
    5. Optional: Select Output View to see the results of this rule. Select Export View to make the rule available for other modules.
    6. Click OK. A statement template is inserted in the selected AQL script in the selected AQL module.
    7. Complete the template.

    For more information, see the examples that follow.

Example: Creating a predicate-based filter

About this task

Suppose that you want to remove relative amounts, such as percentages. Relative amounts are preceded by a negative clue. For more information about negative clues, see Label extraction clues.

Procedure

Add a rule that is similar to the following one:
    -- Keep candidates whose preceding token is not a negative clue.
    create view Amount as
    select R.match
    from AmountCandidate R
    where Not(MatchesDict('AmountNegativeClueDict', LeftContextTok(R.match, 1)));

This rule retains the Amount candidates whose immediate left context does not match a negative clue. The rule uses the LeftContextTok() built-in scalar function to compute the immediate left context of a span, and the MatchesDict() built-in predicate to verify whether that context matches an entry in the dictionary of negative clues. For more information, see Built-in functions.
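
The following Python sketch mirrors the logic of this filter on two snippets from the Label extraction clues section. It is not AQL. The dictionary contents are an assumption based on the negative clues named in that section, the two snippets are joined into one string for convenience, and whitespace splitting is a simplified stand-in for tokenization.

```python
# Hypothetical contents of AmountNegativeClueDict, taken from the negative
# clues named earlier in this topic.
NEGATIVE_CLUES = {"increased", "up"}


def left_context_tok(text: str, begin: int, n: int) -> str:
    """Return the last n whitespace-separated tokens before offset `begin`."""
    tokens = text[:begin].split()
    return " ".join(tokens[-n:]) if n > 0 else ""


def keep_amount(text: str, begin: int) -> bool:
    """Keep a candidate unless its one-token left context is a negative clue."""
    return left_context_tok(text, begin, 1).lower() not in NEGATIVE_CLUES


# Two snippets from this topic, joined into one string for the example.
text = "net income was $5.3 billion; revenues increased 2 percent"
candidates = ["$5.3 billion", "2 percent"]

amounts = [c for c in candidates if keep_amount(text, text.index(c))]
print(amounts)
```

The candidate 2 percent is dropped because the token before it, increased, is a negative clue. The candidate $5.3 billion is kept because it follows was.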

Example: Creating a set-based filter

About this task

Suppose that you want to remove invalid indicators, and assume that you defined a view IndicatorInvalid to capture all invalid Indicator candidates. For more information about negative clues, see Label extraction clues.

Procedure

Add a filter that is similar to the following one:
    -- Subtract the invalid candidates from all candidates.
    create view IndicatorAll as
    (select R.metric, R.amount, R.match from IndicatorCandidate R)
    minus
    (select R.metric, R.amount, R.match from IndicatorInvalid R);

This rule removes a set of invalid Indicator candidates from the set of all candidates. For more information, see Third form of the Create view statement (minus).
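
Conceptually, the minus operation is a set difference over tuples that share a schema. The following Python sketch shows the idea. It is not AQL, the tuples are made up for illustration and are not output from the sample project, and plain strings stand in for spans.

```python
from typing import NamedTuple


class Indicator(NamedTuple):
    metric: str
    amount: str
    match: str


# Made-up tuples for illustration only.
indicator_candidate = [
    Indicator("revenue", "$29.0 billion", "revenue of $29.0 billion"),
    Indicator("net income", "9 percent", "net income of $5.3 billion, up 9 percent"),
]
indicator_invalid = [
    Indicator("net income", "9 percent", "net income of $5.3 billion, up 9 percent"),
]

# minus: keep the candidate tuples that do not appear in the invalid set.
invalid = set(indicator_invalid)
indicator_all = [t for t in indicator_candidate if t not in invalid]
print(indicator_all)
```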

Example: Creating a consolidation rule

About this task

Suppose that you want to consolidate overlapping indicators and retain the indicators that are not contained in longer mentions.

Procedure

Add a rule that is similar to the following one:
    -- Drop any indicator whose span is contained within a longer indicator.
    create view Indicator as
    select R.metric, R.amount, R.match
    from IndicatorAll R
    consolidate on R.match using 'NotContainedWithin';

For more information, see the consolidate clause.
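
The following Python sketch illustrates the effect described above: spans that lie inside a longer span are dropped, while spans that merely overlap are kept. It is not AQL, and it is an approximation written for this topic rather than the runtime's actual consolidation algorithm. The offsets are made up for illustration.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (begin, end) character offsets


def not_contained_within(spans: List[Span]) -> List[Span]:
    """Drop exact duplicates and every span that lies inside a longer span."""
    unique = sorted(set(spans))
    kept = []
    for s in unique:
        contained = any(o != s and o[0] <= s[0] and s[1] <= o[1] for o in unique)
        if not contained:
            kept.append(s)
    return kept


# (7, 14) lies inside (7, 31), and (7, 31) appears twice.
consolidated = not_contained_within([(7, 31), (7, 14), (40, 60), (7, 31)])
print(consolidated)

# Overlapping spans that do not contain each other are both kept.
overlapping = not_contained_within([(0, 10), (5, 15)])
print(overlapping)
```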

Creating other types of rules

About this task

Other types of rules can be created by adding the rule directly into the AQL files for the extractor. See the AQL Reference for a list of all the rules.

Procedure

  1. Write the rule in the editor.
  2. Select the name of the view in the create view statement.
  3. Drag the name to the appropriate node in the extraction plan.

    You can add the rule as a basic feature, candidate generation, or filter and consolidation rule.

Step 4: Test the extractor

Before you begin

Ensure that you complete steps 1-3 in the Text Analytics Workflow perspective.

About this task

You can test your extractor on any of the following documents:
  • The entire data collection
  • Documents that you select in the Open a subset of documents to work with field
  • Labeled documents
You can test any module that the extraction plan creates for a top-level label on these sets of documents.

Procedure

  1. To run the extractor, right-click any node in the extraction plan, and select one of the run options:
    • Run the extraction plan on the entire data collection
    • Run the extraction plan on the set of selected documents
    • Run the extraction plan on the set of documents that are labeled

    Results are displayed in the Annotation Explorer.

    To run a single module, right-click a category node of a top-level label, which is named <label-name>_<category-name>, and choose a Run Module option:
    • Run module <module-name> on the entire data collection
    • Run module <module-name> on the set of selected documents
    • Run module <module-name> on the set of documents that are labeled
  2. Identify mistakes in the extracted results.
    Look for:
    False positives
    Results that are incorrectly identified by the extractor.

    Use the Provenance View to identify the rules that cause false positives. For more information, see Provenance View.

    False negatives
    Results that were not identified by the extractor.
  3. Go back to Step 2: Label examples and clues and Step 3: Develop the extractor in the Text Analytics Workflow perspective to write new rules or refine your existing ones.
  4. Repeat steps 2 through 4 of the workflow until you are satisfied with the results.
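
If you want to track mistakes from one iteration to the next, one option is to compare the extracted results with your labeled examples outside the tooling. The following Python sketch shows such a comparison. The extractor output is made up for illustration, and precision and recall are standard summary measures rather than values that this topic says the tooling reports.

```python
from typing import Dict, Set


def compare(extracted: Set[str], expected: Set[str]) -> Dict[str, object]:
    """Compare extractor output with hand-labeled examples."""
    true_positives = extracted & expected
    return {
        "false_positives": extracted - expected,
        "false_negatives": expected - extracted,
        "precision": len(true_positives) / len(extracted) if extracted else 0.0,
        "recall": len(true_positives) / len(expected) if expected else 0.0,
    }


# Examples labeled as Indicator in step 2 of this topic.
expected = {"revenue of $29.0 billion", "net income of $5.3 billion"}
# Hypothetical extractor output, made up for this illustration.
extracted = {"revenue of $29.0 billion", "margin of 49 percent"}

report = compare(extracted, expected)
print(report)
```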

Step 5: Evaluate and improve the performance of the extractor

Before you begin

Ensure that steps 1-4 in the Text Analytics Workflow perspective are complete.

About this task

After you are satisfied with the quality of your extractor, you can evaluate its performance. Use the Profiler View to evaluate and improve the runtime performance of the extractor.

Procedure

  1. Remove all output view statements added in step 3 for testing your extractor, and keep only those that are necessary to generate the final outputs of your extractor. Ideally, these output views exist only in the <label-name>_Finals modules.
  2. Evaluate the performance of the extractor. For more information about performance, see the Profiler View.
  3. If necessary, tune the extractor. See AQL Profiler reports for instructions about tuning the extractor.

Step 6: Export the extractor

Before you begin

Ensure that steps 1-5 in the Text Analytics Workflow perspective are complete.

About this task

After you are satisfied with the quality and performance of your extractor, you can export the extractor or its modules to a file or a directory so it can be used by other extractors.

Procedure

  1. Right-click the project of the extractor to be exported, and select Export.
  2. In the Export wizard, select BigInsights > Export Text Analytics Extractor > Next. The Extractor Export Specification wizard page opens.
  3. For a modular project, you can select the modules to be exported. Select Export dependent modules to also export the modules that the selected modules depend on. For a non-modular project, you can export only the entire extractor.
  4. Click Browse workspace or Browse file system to select the directory to which you want to export.
  5. Optional: Select Export to destination directory to export the selected modules to the selected directory, or select Export to a jar or zip archive under the destination directory to bundle the selected module into a JAR or ZIP archive.
  6. Click Finish to export the modules.