You can create your AQL script from a label that you have already
identified. It is a good practice to start from a lower-level label
(or bottom-up). For example, if you have a label that is called
Amount and sublabels under that label called Currency, Number, and
Unit, then you start creating AQL from the Currency, Number, and Unit
labels.
About this task
Begin with simple dictionaries or regular
expressions to identify all instances of the basic features that you
are interested in, then you add context by using clues to generate
good candidates and exclude false positives. Then you consolidate to
achieve high-quality results.
By this time in your analysis, you have identified
instances of the keywords or subjects as features that you are
interested in. In the process of labeling, you selected examples
that might be positive or negative clues.
Begin your AQL script development by using simple rules to extract
instances of the keywords or basic features. This part of the
process is Step 3 in the Extraction Tasks, Develop the Extractor.
In modular InfoSphere® BigInsights™,
you package your extractor into a module that can be reused by other
modules that need your data.
Procedure
- Add AQL for the basic features:
- Right-click the root label (or a sublabel).
- Click . The Create AQL Statement window opens.
- Complete the fields.
- View Name
- Views are the primary data structures that are used with AQL. AQL
statements create views by selecting, extracting, and
transforming information from other views. AQL views are similar
to the views in a relational database. They have rows and
columns just like a database view and, by default, the views in
AQL are not materialized.
- You reference input data as a special view called Document with one
column called text. Each document in the set of
input data can be considered as one row in the Document view
with the document content mapped onto the text column.
- AQL Module
- If you created your own module, type that module name;
otherwise, use the default module name, <label_name>_BasicFeatures.
- AQL script
- The file name that identifies this script. The name must
have the .aql extension.
- Type
- You can use several techniques to extract text elements. For a
beginning script, the type is usually Dictionary
or Regular Expression.
- Regular expression
- Use a regular expression when you want to match text that is
based on a pattern.
A regular expression, also referred to as regex or regexp, provides a
concise and flexible means for matching strings of text, such
as particular characters, words, or patterns of characters. A
regular expression is written in a formal language that can
be interpreted by a regular expression processor, a program
that examines text and identifies parts that match the
provided specification. For more information, see generating
a regular expression and building a regular expression.
- Dictionary
- Dictionaries are the most efficient extraction technique. Use
a dictionary when you can match on defined words.
Dictionaries are lists or enumerations of terms. The template creates a
dictionary from an external file, but you can also code the
dictionary in-line, for example:
create dictionary MyDict as ('Finance');
This statement creates a dictionary with one
entry, ‘Finance’.
You can use an external dictionary file when you have many
entries. The external file makes it easier to add and change
entries without having to edit the code. For example, if you
are developing an extractor that extracts given names and
family names, you can collect given names and family names
and group them together in one or more dictionary (.dict) files. These dictionary
files can then be referenced in the extractor program to
identify occurrences of each of these entries in the input
documents.
By default, dictionaries are tokenized and internalized at
compile time, but you can use the create external dictionary
statement to switch dictionaries at run time.
- Part of speech
- You can identify locations of different
parts of speech across the input text.
- If you want to show the view, click Output View.
- If you want to export this view to make it available for other
modules, click Export View.
When you click OK, the AQL script
in which you created the view is opened in the editor pane. The
appropriate templates for the type of statement that you selected
are appended to the file. For example, if you selected the Dictionary type, you see
the create dictionary and create view from dictionary statement
templates:
create dictionary <same_name_as_viewDict>
from file '<path to your dictionary here>'
with language as 'en';
create view <view_name> as
extract dictionary '<same_name_as_viewDict>'
on R.<input column> as match
from <input view> R;
output view <view_name>;
The statements in the template do the following:
- create dictionary
- Creates a dictionary from a file.
- create view
- Creates a view that holds the matches between the dictionary and the
input data by using an extract statement.
- input column and input view
- The template contains <input column> and <input view> placeholders
that you must edit. When you work with text documents from
a file system, use the special view Document
to reference an input document. This view has a special column, text,
that references the text of the input documents. Use Document.text to refer
to the contents of any input documents.
- output view
- Materializes the view. By default, views are not materialized.
During development you might want to use this statement to look at
the contents of intermediate views for debugging purposes. You can
comment out or delete the output view
statements when they are no longer needed.
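As an illustration of how the edited templates might look for the Amount example, the following sketch extracts two basic features from the Document view: one with an in-line dictionary and one with a regular expression. The module name, view names, dictionary entries, and the regular expression are hypothetical choices made for this example, not values that the tooling generates.

```aql
-- Hypothetical module name that follows the <label_name>_BasicFeatures default.
module Amount_BasicFeatures;

-- In-line dictionary of currency terms (illustrative entries only).
create dictionary CurrencyDict as ('USD', 'EUR', 'dollars', 'euros');

-- Basic feature: every dictionary match in the document text.
create view CurrencyTerm as
  extract dictionary 'CurrencyDict'
    on D.text as match
  from Document D;

-- Basic feature: integers or decimals, matched with a regular expression.
create view NumberValue as
  extract regex /\d+(\.\d+)?/
    on D.text as match
  from Document D;

-- Materialize the views while you develop; comment these out later.
output view CurrencyTerm;
output view NumberValue;
```

Both views expose a single span column named match, which keeps later candidate rules uniform.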
- Optional: If your input documents are of XML or HTML type, you might need to
remove tags, in which case you can use the detag
AQL statement to leave only the bare text. Put this statement at
the top of your *.aql file, below
the module statement. If you do use
the detag statement, you detag from Document.text and put the
results in a detagged view, <file_that_is_detagged>.
Then you must change the AQL template to extract from the detagged
view.
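A minimal sketch of the detag step, assuming a hypothetical module named Amount_BasicFeatures and a hypothetical output name DetaggedDoc. The detag statement produces a view whose text column holds the content with markup removed, so the extraction reads from that view instead of from Document.

```aql
-- Hypothetical module name; the detag statement follows the module statement.
module Amount_BasicFeatures;

-- Strip HTML or XML markup; the bare text is available as DetaggedDoc.text.
detag Document.text as DetaggedDoc;

-- Illustrative in-line dictionary.
create dictionary CurrencyDict as ('USD', 'EUR');

-- The <input view> placeholder of the template now names the detagged view.
create view CurrencyTerm as
  extract dictionary 'CurrencyDict'
    on D.text as match
  from DetaggedDoc D;
```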
- Add AQL to generate candidates.
- Right-click the root label (or a sublabel).
- Click . The Create AQL Statement window opens.
- Complete the fields.
- View Name
- Views are the primary data structures used with AQL. AQL statements
create views by selecting, extracting, and transforming
information from other views. AQL views are similar to the views
in a relational database. They have rows and columns just like a
database view and, by default, the views in AQL are not
materialized.
- You reference input data as a special view called Document with
one column called text. Each
document in the set of input data can be considered as one row
in the Document view with the document content mapped onto the
text column.
- AQL Module
- If you created your own module, type that module name;
otherwise, use the default module name, <label_name>_CandidateFeatures.
- AQL Script
- The file name that identifies this script.
- Type
- The statement type, which is one of the following:
- Select
- Union All
- Block
- Pattern
For more information on the syntax, see the AQL Reference.
- If you want to show the view, click Output View.
- If you want to export this view to make it available to other
modules, click Export View.
The AQL script that you created (<file_name>.aql)
opens in the editor pane, with templates for the type that you
selected.
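To show how the Pattern and Union All types might combine basic features into candidates, here is a sketch for the Amount example. It assumes that two basic-feature views, hypothetically named NumberValue and CurrencyTerm and each with a span column named match, are defined in or imported into the same module. All names are illustrative.

```aql
-- Pattern: a number immediately followed by a currency term, for example "25 USD".
create view AmountNumberFirst as
  extract pattern <N.match> <C.match>
    return group 0 as amount
  from NumberValue N, CurrencyTerm C;

-- Pattern: a currency term immediately followed by a number, for example "USD 25".
create view AmountCurrencyFirst as
  extract pattern <C.match> <N.match>
    return group 0 as amount
  from CurrencyTerm C, NumberValue N;

-- Union All: merge both candidate sets into one view with the same schema.
create view AmountCandidate as
  (select A.amount as amount from AmountNumberFirst A)
  union all
  (select B.amount as amount from AmountCurrencyFirst B);
```

Keeping each pattern in its own view makes it easier to inspect which rule produced a false positive before the candidates are merged.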
- Add AQL statements to remove duplicates and refine the output.
- Right-click the root label (or a sublabel).
- Click . The Create AQL Statement window opens.
- Complete the fields.
- View Name
- Views are the primary data structures used
with AQL. AQL statements create views by selecting, extracting,
and transforming information from other views. AQL views are
similar to the views in a relational database. They have rows
and columns just like a database view and, by default, the views
in AQL are not materialized.
- You reference input data through a special view called Document with
one column called text. Each document in the set of
input data can be considered as one row in the Document view
with the document content mapped onto the text column.
- AQL Module
- If you created your own module, type that module name;
otherwise, use the default module name, <label_name>_CandidateFeatures.
- AQL script
- The file name that identifies this script.
- Type
- The statement type, which is one of the following:
- Consolidate
- Predicate-based Filter
- Set-based Filter
For more information on the syntax, see the AQL Reference.
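The following sketch shows one way that a predicate-based filter and a consolidate clause might refine candidates. It assumes a hypothetical candidate view named AmountCandidate with a span column named amount. The filter rule (discard candidates that span a line break) is only an example of a predicate, not a recommended rule for every data set.

```aql
-- Predicate-based filter: drop candidates whose span crosses a line break.
create view AmountFiltered as
  select C.amount as amount
  from AmountCandidate C
  where Not(ContainsRegex(/[\r\n]/, C.amount));

-- Consolidate: when one candidate span is contained within another,
-- keep only the containing span.
create view Amount as
  select F.amount as amount
  from AmountFiltered F
  consolidate on F.amount using 'ContainedWithin';
```

ContainedWithin is one of several consolidation policies; the AQL Reference lists the others.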
- Finalize AQL, and create the run configuration.
- Remove any output view statements from the *.aql
files that you included, either by commenting them out (--) or by deleting them.
At this level, you are building a module for others to use. They
can add output view statements to
their local code.
- Externalize any local dictionaries so that consumers of your module can customize
them by using their own terms. The best way to do this is
to put the external dictionary definition in a separate module
and *.aql file.
- Click
to create a module.
- In the New window, select AQL Module, and click Next.
- In the New AQL Module window, specify the project name and the module
name, and click Finish.
In the Project Explorer, you see an additional module in this
path: <project_name>\textAnalytics\src\<new_module>.
- Add a script to your new module to contain
the external dictionary declaration. In Project Explorer,
right-click the new module name, and select .
- In the New window, select AQL script, and click Next.
- In the New AQL Script window, specify the project name, the new module
name, and a name for the script (<new_name.aql>), then click Finish.
- Create the external dictionary and complete the AQL file with an export statement. The export statement
exports the dictionary so that it is visible outside the current
module. A dictionary file is associated with the external
dictionary at run time.
Return to your consolidated level of AQL, and edit the AQL file
to refer to the external dictionary.
import dictionary <external dictionary name> from module <module name that contains the external dictionary> as <some name that can be used as an alias in local aql files>;
- In the AQL file in the consolidate level, change any dictionary
references to the alias name that you are importing.
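The following sketch shows how the two modules might fit together. The module names, dictionary name, alias, and view name are hypothetical. The allow_empty and case clauses on the external dictionary, and the quoted alias in the extract statement, are written from memory of the AQL Reference, so verify them against the reference for your BigInsights version. The two parts belong in separate *.aql files, one in each module.

```aql
-- ---- File in the dictionary module (hypothetical name: Amount_Dictionaries) ----
module Amount_Dictionaries;

-- Declare the dictionary without entries; a file is bound to it at run time.
create external dictionary CurrencyExtDict
  allow_empty false
  with case insensitive;

-- Make the dictionary visible outside this module.
export dictionary CurrencyExtDict;

-- ---- File in the consolidated module (hypothetical name: Amount_Consolidated) ----
module Amount_Consolidated;

-- Import the external dictionary under a local alias.
import dictionary CurrencyExtDict from module Amount_Dictionaries as CurrencyDict;

-- Dictionary references now use the alias.
create view CurrencyTerm as
  extract dictionary 'CurrencyDict'
    on D.text as match
  from Document D;
```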
- Now create the run configuration.
- Export the module and set up a library of reusable modules.
- Create a run configuration to associate a
file with the external dictionary. From the File menu, click .
Right-click Text Analytics, and select New.
- In the Name field, specify a name for this configuration.
- In the Main page, specify the project name in the Project field.
- In the Select Modules field, you see all the modules that are created in the
current project. Select the module where you have your AQL
statements for the consolidation work.
- In the Location for the data collection field, browse the workspace to find the
data that you used in your project.
- Open the External Dictionaries page. Browse the workspace to find the original
dictionary file that you used in your AQL file, and click OK. This associates the dictionary
that contains your text with an external dictionary declaration.
- Click Run to process this configuration. You might get errors because
there are no enabled outputs, but this is acceptable for the
purposes of exporting a module.
- In Project Explorer, right-click your
project, and select Export.
- In the Export window, expand BigInsights,
click Export Text Analytics Extractor, and click Next.
- In the Export Extractor window, select the working module, and select the Export dependent modules check box
to ensure that you include the external dictionary module along
with your working module.
- In the Select the export destination directory field, browse the file system or
your workspace, and select a destination for the export.
- Specify whether to export to the destination
directory or to a jar or zip file in the destination directory.
- If you select the Export to a jar or zip archive under the destination directory radio
button, provide a file name for the archive.
- Click Finish.