Given an input corpus and an initial extractor, Pattern Discovery, a feature included in the InfoSphere® BigInsights™ Eclipse tooling, groups input contexts with similar semantics and distills patterns from them. You can then decide which of the patterns need to be implemented in the extractor.
By using AQL, you can write rules to extract structured information from unstructured text. However, to identify rules that extract the information accurately, you must manually sift through a potentially large data collection, which is a time-consuming and error-prone process.
create view PhoneCandidate as
extract
regexes /\d{3}-\d{3}-\d{4}/
on D.text as num
from Document D;
output view PhoneCandidate;
This rule captures many valid telephone numbers that have the form xxx-xxx-xxxx. It also captures matches that are not telephone numbers, such as fax numbers, which are written in the same format. However, it does not capture many other phone numbers, such as international numbers or numbers with extensions.
'Call me at 555-123-4567'
'Fax#: 555-123-4567'
create dictionary FaxClueDict as (
'fax'
);
create view PhoneSimple as
extract
regexes /\+?\(\d{3}\) ?\d{3}-\d{4}/ and /\+?\d{3}-\d{3}-\d{4}/
on D.text as num
from Document D
having Not(ContainsDict('FaxClueDict', LeftContextTok(num, 4)));
These patterns indicate that the keyword phone, and the keywords call and at (in this order), commonly occur in the vicinity of a telephone number candidate.
create dictionary PhoneClueDict as (
'phone', 'call', 'at'
);
create view PhoneSimpleStrong as
select P.*
from PhoneSimple P
where ContainsDict('PhoneClueDict', LeftContextTok(P.num, 4));
create view PhoneClue as
extract dictionary 'PhoneClueDict' on D.text as clue
from Document D;
-- Right context of a phone clue containing digits
create view PhoneClueContextAll as
select P.clue as clue, RightContextTok(P.clue, 15) as rightCtx
from PhoneClue P
where ContainsRegex(/\d{3}/, RightContextTok(P.clue, 15));
-- Right context of a phone clue with a known phone candidate
create view PhoneClueContextKnownPhone as
select P.*
from PhoneClueContextAll P, PhoneSimple Ph
where Contains(P.rightCtx, Ph.num);
-- Right context of a phone clue without an already identified phone candidate
create view PhoneClueRightContext as
(select * from PhoneClueContextAll)
minus
(select * from PhoneClueContextKnownPhone);
output view PhoneClueRightContext;
create view PhoneSimple as
extract
regexes /\+?\(\d{3}\) ?\d{3}-\d{4}/ and /\+?\d{3}-\d{3}-\d{4}/ and /x-?\d{4,5}/
on D.text as num
from Document D;
This pattern indicates that a <PhoneSimple.num> match and the keyword phone commonly occur in the vicinity of a telephone number candidate.
This pattern is distilled from contexts such as 'Reach me at 555-123-4567 or ', where the occurrence of 555-123-4567 is replaced by the corresponding entity marker <PhoneSimple.num>.
When you develop an extractor, you might have access to a list of known instances of an entity. You can use Pattern Discovery to identify common patterns that help you write more general rules to capture matches of the entity. The following actions illustrate this scenario:
You are developing an organization extractor, and you have access to a list of Fortune 1000 organizations. By using Pattern Discovery, you can identify a list of common organization suffixes from the Fortune 1000 organization list, such as 'Co.' and 'Corporation'. Create a dictionary of these strong organization suffixes. Use that dictionary in a rule to identify candidate organizations as one or more capitalized tokens followed by a match for the suffix in the dictionary.
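The rule described above might be sketched as follows. This is only an illustration: the names OrgSuffixDict, OrgSuffix, CapsWord, and OrgCandidate are hypothetical, the dictionary holds just the two suffixes mentioned above, and "one or more capitalized tokens" is approximated as one to three because AQL sequence patterns take a bounded repetition count.

```aql
-- Hypothetical dictionary of strong organization suffixes
-- (entries taken from the examples in the text above)
create dictionary OrgSuffixDict as (
    'Co.', 'Corporation'
);

-- Matches of the suffix dictionary in the document text
create view OrgSuffix as
extract dictionary 'OrgSuffixDict' on D.text as suffix
from Document D;

-- Capitalized words (a deliberately simple, assumed definition)
create view CapsWord as
extract regex /\b[A-Z][a-z]+\b/ on D.text as word
from Document D;

-- Candidate organization: one to three capitalized words
-- followed by a suffix match (the upper bound of 3 is an assumption)
create view OrgCandidate as
extract pattern <C.word>{1,3} <S.suffix>
    return group 0 as org
from CapsWord C, OrgSuffix S;

output view OrgCandidate;
```

In practice, you would extend the dictionary with the other suffixes that Pattern Discovery surfaces and adjust the capitalized-word regular expression to suit your data.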
When you write an extractor to identify relationships between entities, you can use Pattern Discovery to identify strong clues that are indicative of relationships in the input data set. The following actions illustrate this scenario:
Assume that you are developing an extractor to extract relationships between Person and Organization mentions. By using Pattern Discovery on the context between two mentions of these entities, you can discover common patterns, such as 'works for' or 'is employed by'.
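One way to prepare the context between two mentions for Pattern Discovery is sketched below. It assumes that views named Person (with a span column name) and Organization (with a span column org) already exist in your extractor; those names, the view name PersonOrgContext, and the window of one to five tokens are illustrative assumptions.

```aql
-- Text between a Person mention and the Organization mention that follows it.
-- Person.name and Organization.org are assumed to be existing span columns.
create view PersonOrgContext as
select P.name as person,
       O.org as org,
       SpanBetween(P.name, O.org) as context
from Person P, Organization O
-- Keep only pairs separated by 1 to 5 tokens (assumed window size)
where FollowsTok(P.name, O.org, 1, 5);

output view PersonOrgContext;
```

Running Pattern Discovery on the context column of such a view would group recurring phrases between the two mentions, which you could then collect into a dictionary of relationship clues.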