Use the Text Analytics Workflow perspective in your Eclipse development environment to create or modify an extractor that extracts information from unstructured and semistructured data. Extractors help you analyze large volumes of text and produce annotated documents that provide valuable insights into unconventional data.
The included examples and code snippets are from the sampleTextAnalyticsExtractor_eclipse.zip project.
The Text Analytics Workflow perspective has six steps:
To improve the quality of the extractor, complete steps 2-4 iteratively. Each iteration refines your results.
This tutorial is based mainly on the InfoSphere® BigInsights™ version 2.x Eclipse development environment. For a comparison with older versions, see What's New in InfoSphere BigInsights.
You want to create the extractor in the FinancialIndicator project.
A text snippet is a sentence or phrase. Label the text snippets to identify what to extract.
Ensure that you complete step 1 in the Text Analytics Workflow perspective and that the 4Q2010.txt file is open in the editor.
Record revenue of $29.0 billion, up 7 percent as reported and adjusting for currency;
Record net income of $5.3 billion, up 9 percent;
Pre-tax income of $7 billion, up 9 percent;
Gross profit margin of 49 percent, up 0.8 points;
To identify financial indicators from the fourth quarter results, examine the 4Q2010.txt file to determine the financial indicators that you want to extract.
Select the text revenue of $29.0 billion and right-click.
The Indicator label is displayed to the right in the extraction plan.
Select the text net income of $5.3 billion and right-click.
Ensure that you labeled the snippets of interest in the open document.
cash flow of $8.7 billion
net income was $5.3 billion
revenues increased 2 percent
revenue excluding divested PLM operations up 11 percent
Outside clues are words such as of, was, increased, and up.
In the preceding example, the positive clues are the words of and was. These words precede mentions of absolute dollar amounts.
In the preceding example, the negative clues are the words increased and up. These words precede mentions of percentages, not absolute dollar amounts.
There are four categories of AQL statements. Create them in the following order, which is a recommended practice for developing an extractor.
Basic feature rules define the basic features of an extractor.
Ensure that the AQL script is open in the editor.
Suppose that you want to identify numeric amounts such as 11, including an optional decimal part such as the .1 in 11.1.
create view Number as
extract regex /\d+(\.\d+)?/
on R.text as match
from Document R;
This rule identifies matches of the regular expression in the input text. For more information, see Regular expressions and Generating regular expressions.
Ensure that the AQL script is open in the editor.
Suppose that you want to identify monetary units such as million, billion, and so on.
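The statement for this rule is not shown here. A minimal sketch, assuming a hypothetical inline dictionary named UnitDict and a view named Unit, chosen so that its match column lines up with the U.match reference in the AmountAbsolute rule:

```aql
-- Hypothetical dictionary of monetary units (name and entries are assumptions).
create dictionary UnitDict as
  ('million', 'billion');

-- Mark every occurrence of a dictionary entry in the document text.
create view Unit as
  extract dictionary 'UnitDict'
    on R.text as match
  from Document R;
```

An inline dictionary keeps the sketch self-contained. A longer list of units might be easier to maintain in a dictionary file.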
Suppose that you want to identify absolute amounts such as $7 billion and $29.0 billion.
create view AmountAbsolute as
extract pattern /\$/ <N.match> <U.match>
return group 0 as match
from Number N, Unit U;
This rule identifies a pattern that consists of the $ character followed by a number and unit. For more information, see Sequence patterns.
Suppose that you want to combine all of the amount candidates, including absolute amounts and percentage amounts, into a single view.
create view AmountCandidate as
(select R.* from AmountAbsolute R)
union all
(select R.* from AmountPercent R);
This rule combines, by union, the mentions that are identified by other rules. For more information, see create view statement.
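The AmountPercent view that this rule reads from is not shown here. One possible sketch, assuming that a percentage amount is a number followed by the word percent (as in 7 percent) and that the view exposes a single match column so that its schema is compatible with AmountAbsolute in the union:

```aql
-- Assumed definition: a Number match followed by the token "percent".
create view AmountPercent as
  extract pattern <N.match> /percent/
  return group 0 as match
  from Number N;
```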
Suppose that you want to select indicator candidates, where an indicator is formed by a metric that is followed by an amount within 0 to 10 tokens.
create view IndicatorCandidate as
select M.match as metric, A.match as amount, CombineSpans(M.match, A.match) as match
from Metric M, Amount A
where FollowsTok(M.match, A.match, 0, 10);
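This rule pairs each metric with an amount that follows it within 0 to 10 tokens, and combines the two spans into a single match. For more information, see select statement.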
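The Metric view that this rule reads from is not shown here. A minimal sketch, assuming a hypothetical inline dictionary named MetricDict that is populated with the metric names found in the labeled snippets:

```aql
-- Hypothetical dictionary of financial metrics taken from the sample text.
create dictionary MetricDict as
  ('revenue', 'net income', 'pre-tax income',
   'gross profit margin', 'cash flow');

-- Mark every occurrence of a metric name in the document text.
create view Metric as
  extract dictionary 'MetricDict'
    on R.text as match
  from Document R;
```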
You can create commonly used statements that filter and consolidate candidates to remove mistakes.
Suppose that you want to remove relative amounts, such as percentages. Relative amounts are preceded by a negative clue. For more information about negative clues, see Label extraction clues.
create view Amount as
select R.match
from AmountCandidate R
where Not(MatchesDict('AmountNegativeClueDict', LeftContextTok(R.match,1)));
This rule retains the Amount candidates whose immediate left context is not a negative clue. The rule uses the LeftContextTok() built-in scalar function to compute the one-token left context of a span, and the MatchesDict() built-in predicate to test whether that context matches an entry in the AmountNegativeClueDict dictionary. For more information, see Built-in functions.
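The rule refers to a dictionary named AmountNegativeClueDict, whose definition is not shown here. A minimal sketch, assuming an inline dictionary that holds the two negative clues identified during labeling:

```aql
-- Negative clues from the labeling step; the entry list is an assumption
-- and can be extended as more clues are found.
create dictionary AmountNegativeClueDict as
  ('increased', 'up');
```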
create view IndicatorAll as
(select R.metric, R.amount, R.match from IndicatorCandidate R)
minus
(select R.metric, R.amount, R.match from IndicatorInvalid R);
This rule removes a set of invalid Indicator candidates from the set of all candidates. For more information, see Third form of the Create view statement (minus).
Suppose that you want to consolidate overlapping indicators and retain the indicators that are not contained in longer mentions.
create view Indicator as
select R.metric, R.amount, R.match
from IndicatorAll R
consolidate on R.match using 'NotContainedWithin';
For more information, see the consolidate clause.
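A view typically appears in the extractor results only when it is declared as an output view. A one-line sketch, assuming that Indicator is the final view that you want to inspect:

```aql
-- Expose the consolidated Indicator view in the extractor output.
output view Indicator;
```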
Other types of rules can be created by adding the rule directly into the AQL files for the extractor. See the AQL Reference for a list of all the rules.
Ensure that you complete steps 1-3 in the Text Analytics Workflow perspective.
Ensure that steps 1-4 in the Text Analytics Workflow perspective are complete.
After you are satisfied with the quality of your extractor, you can evaluate its performance. Use the Profiler View to evaluate and improve the runtime performance of the extractor.
Ensure that steps 1-5 in the Text Analytics Workflow perspective are complete.
After you are satisfied with the quality and performance of your extractor, you can export the extractor or its modules to a file or a directory so that other extractors can use it.