Pattern discovery scenarios

Given an input corpus and an initial extractor, Pattern Discovery, a feature of the InfoSphere® BigInsights™ Eclipse tooling, groups input contexts with similar semantics and distills patterns from them. You can then decide which of the patterns need to be implemented in the extractor.

With AQL, you can write rules that extract structured information from unstructured text. However, to identify rules that extract the information accurately, you must manually sift through a potentially large data collection, which is a time-consuming and error-prone process. Pattern Discovery reduces this effort by surfacing candidate patterns for you to review.

Example of usage

Consider the following simple phone extractor:
    create view PhoneCandidate as
    extract
        regexes /\d{3}-\d{3}-\d{4}/
        on D.text as num
    from Document D;

    output view PhoneCandidate;

This rule captures many valid telephone numbers that have the form xxx-xxx-xxxx. It also captures invalid phone numbers, such as fax numbers, which are specified using the same format. However, it does not capture many other phone numbers, such as international numbers or extension numbers.

You can use Pattern discovery to find patterns in the vicinity of a telephone number candidate. Based on these patterns, you can then write additional rules to refine the phone extractor. You can see in the following example that the left context can usually reveal whether the candidate is valid or invalid:
Valid candidate (telephone number)
    'Call me at 555-123-4567'
Invalid candidate (fax number, not a telephone number)
    'Fax#: 555-123-4567'
If you run Pattern Discovery on these contexts, it automatically finds such positive and negative clues. The following examples are patterns that the algorithm might discover, along with suggestions for using these patterns to develop additional rules that improve the precision and recall of the phone extractor. In these examples, the Pattern Discovery algorithm is invoked on the four tokens that immediately precede a phone annotation.
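To inspect by hand the same contexts that the algorithm analyzes, a view along the following lines could pair each candidate with its four-token left context. This is a minimal sketch, not part of the tooling: the view name PhoneLeftContext and the column name leftCtx are hypothetical, and the sketch assumes the PhoneCandidate view defined earlier.

```aql
-- Hypothetical helper view: pairs each phone candidate with the
-- four tokens that immediately precede it.
create view PhoneLeftContext as
select P.num as num, LeftContextTok(P.num, 4) as leftCtx
from PhoneCandidate P;

output view PhoneLeftContext;
```

Scanning the leftCtx column of this output is the manual counterpart of what Pattern Discovery summarizes automatically.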
Example 1: 'fax'
This pattern indicates that the keyword fax commonly precedes a telephone number candidate. It is a negative clue that can be used to improve the precision of the phone extractor by filtering out invalid candidates, such as in the following example:
    create dictionary FaxClueDict as (
        'fax'
    );

    create view PhoneSimple as
    extract
        regexes /\+?\(\d{3}\) ?\d{3}-\d{4}/ and /\+?\d{3}-\d{3}-\d{4}/
        on D.text as num
    from Document D
    having Not(ContainsDict('FaxClueDict', LeftContextTok(num, 4)));
Example 2: 'phone' or 'call;at'

These patterns indicate that the keyword phone and the keywords call and at (in this order) commonly occur in the vicinity of a telephone number candidate.

They are positive clues that improve the precision of the PhoneSimple rule by returning only those matches for the regular expression that are preceded within a number of tokens by one of these clues:
    create dictionary PhoneClueDict as (
        'phone', 'call', 'at'
    );

    -- Keep only the candidates that have a clue in the four preceding tokens
    create view PhoneSimpleStrong as
    select P.*
    from PhoneSimple P
    where ContainsDict('PhoneClueDict', LeftContextTok(P.num, 4));
You can also use these positive clues to improve the recall of the phone extractor by discovering telephone number patterns that the PhoneSimple rule does not yet capture. For example, in the following AQL snippet, the view PhoneClueRightContext selects the context that immediately follows a telephone number clue identified by the Pattern Discovery algorithm (such as phone, call, or at), provided that the context contains at least three consecutive digits but no known telephone number candidate.
    create view PhoneClue as
    extract dictionary 'PhoneClueDict' on D.text as clue
    from Document D;

    -- Right context of a phone clue containing digits
    create view PhoneClueContextAll as
    select P.clue as clue, RightContextTok(P.clue, 15) as rightCtx
    from PhoneClue P
    where ContainsRegex(/\d{3}/, RightContextTok(P.clue, 15));

    -- Right context of a phone clue with a known phone candidate
    create view PhoneClueContextKnownPhone as
    select P.*
    from PhoneClueContextAll P, PhoneSimple Ph
    where Contains(P.rightCtx, Ph.num);

    -- Right context of a phone clue without an already identified phone candidate
    create view PhoneClueRightContext as
    (select A.* from PhoneClueContextAll A)
    minus
    (select K.* from PhoneClueContextKnownPhone K);

    output view PhoneClueRightContext;
Depending on the data collection, the view PhoneClueRightContext selects contexts that contain valid telephone numbers in other formats not yet captured by the PhoneSimple view, such as [Lorraine Becker at] x31680 or [Lorraine's phone] is x-1680. You can then select example telephone numbers and use them as input to the Regular Expression Generator component to generate one or more regular expressions and capture these new telephone number formats in the PhoneSimple view:
    create view PhoneSimple as
    extract
        regexes /\+?\(\d{3}\) ?\d{3}-\d{4}/ and /\+?\d{3}-\d{3}-\d{4}/ and /x-?\d{4,5}/
        on D.text as num
    from Document D;
Example 3: 'phone;<PhoneSimple.num>'

This pattern indicates that a <PhoneSimple.num> match and the keyword phone commonly occur in the vicinity of a telephone number candidate.

This pattern is distilled from contexts such as 'Reach me at 555-123-4567 or ' where the occurrence of 555-123-4567 is replaced by the corresponding entity marker <PhoneSimple.num>.

Other scenarios for Pattern Discovery

Pattern Discovery is also useful in these contexts:
Known Entity Summarization

When you develop an extractor, you might have access to a list of known instances of an entity. You can use Pattern Discovery to identify common patterns that help you write more general rules to capture matches of the entity. The following actions illustrate this scenario:

You are developing an organization extractor, and you have access to a list of Fortune 1000 organizations. By using Pattern Discovery, you can identify a list of common organization suffixes from the Fortune 1000 organization list, such as 'Co.' and 'Corporation'. Create a dictionary of these strong organization suffixes. Use that dictionary in a rule to identify candidate organizations as one or more capitalized tokens followed by a match for the suffix in the dictionary.
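The rule described above might be sketched in AQL as follows. This is an illustration under stated assumptions rather than a definitive implementation: the names OrgSuffixDict, OrgSuffix, CapsToken, and OrgCandidate are hypothetical, the capitalized-token regular expression is deliberately simple, and the upper bound of three capitalized tokens is an arbitrary choice that you would tune for your data.

```aql
-- Hypothetical dictionary holding the suffixes that Pattern Discovery surfaced
create dictionary OrgSuffixDict as (
    'Co.', 'Corporation'
);

create view OrgSuffix as
extract dictionary 'OrgSuffixDict'
    on D.text as suffix
from Document D;

-- A single capitalized token (simplified; assumes plain ASCII names)
create view CapsToken as
extract regex /[A-Z][a-z]+/
    on 1 token in D.text as word
from Document D;

-- One to three capitalized tokens followed directly by a strong suffix
-- (the bound of three is an assumption, not a value from the tooling)
create view OrgCandidate as
extract pattern <C.word>{1,3} <S.suffix>
    return group 0 as org
from CapsToken C, OrgSuffix S;

output view OrgCandidate;
```

Keeping the suffixes in a dictionary means that new suffixes found in later Pattern Discovery runs can be added without changing the rule itself.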

Relationship extractor

When you write an extractor to identify relationships between entities, you can use Pattern Discovery to identify strong clues that are indicative of relationships in the input data set. The following actions illustrate this scenario:

Assume that you are developing an extractor to extract relationships between Person and Organization mentions. By using Pattern Discovery on the context between two mentions of these entities, you can discover common patterns, such as 'works for' or 'is employed by'.
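Once such clues are known, they could feed a relationship rule like the following sketch. It assumes that views named Person and Organization already exist with span columns person and organization; those names, along with EmploymentClueDict, EmploymentClue, and PersonOrgEmployment, are hypothetical and chosen only for illustration.

```aql
-- Hypothetical dictionary of relationship clues found by Pattern Discovery
create dictionary EmploymentClueDict as (
    'works for', 'is employed by'
);

create view EmploymentClue as
extract dictionary 'EmploymentClueDict'
    on D.text as clue
from Document D;

-- Assumes existing views Person(person) and Organization(organization).
-- Keeps person/organization pairs where the clue is the entire context
-- between the two mentions (zero tokens on either side of the clue).
create view PersonOrgEmployment as
select P.person as person, E.clue as clue, O.organization as organization
from Person P, EmploymentClue E, Organization O
where FollowsTok(P.person, E.clue, 0, 0)
  and FollowsTok(E.clue, O.organization, 0, 0);

output view PersonOrgEmployment;
```

Widening the token distances in the FollowsTok predicates would trade precision for recall, in the same way as the phone clue examples earlier in this topic.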