Writing the AQL to extract your labeled examples

You can create your AQL script from a label that you have already identified. It is good practice to work bottom-up, starting from the lowest-level labels. For example, if you have a label called Amount with sublabels called Currency, Number, and Unit, start by creating AQL for the Currency, Number, and Unit labels.

About this task

Begin with simple dictionaries or regular expressions to identify all instances of the basic features that you are interested in. Then add context by using clues to generate good candidates and exclude false positives. Finally, consolidate the candidates to achieve high-quality results.

By this time in your analysis, you have identified instances of the keywords or subjects as features that you are interested in. In the process of labeling, you selected examples that might be positive or negative clues.

Begin your AQL script development by using simple rules to extract instances of the keywords or basic features. This part of the process is Step 3 in the Extraction Tasks, Develop the Extractor.

In modular InfoSphere® BigInsights™, you package your extractor into a module that can be reused by other modules that need your data.

Procedure

  1. Add AQL for the basic features:
    1. Right-click the root label (or a sublabel).
    2. Click New AQL Statement > Basic Feature AQL Statement. The Create AQL Statement window opens.
    3. Complete the fields.
      View Name
      Views are the primary data structures that are used with AQL. AQL statements create views by selecting, extracting, and transforming information from other views. AQL views are similar to the views in a relational database. They have rows and columns just like a database view and by default the views in AQL are not materialized.
      You reference input data as a special view called Document with one column called text. Each document in the set of input data can be considered as one row in the Document view with the document content mapped onto the text column.
      AQL Module
      If you created your own module, type that module name; otherwise, use the default module name, <label_name>_BasicFeatures.
      AQL script
      The file name that identifies this script. The name must have the .aql extension.
      Type
      You can use several techniques to extract text elements. As a beginning script, the type is usually Dictionary or Regular Expression.
      Regular expression
      Use a regular expression when you want to match text that is based on a pattern.

      A regular expression, also referred to as regex or regexp, provides a concise and flexible means for matching strings of text, such as particular characters, words, or patterns of characters. A regular expression is written in a formal language that can be interpreted by a regular expression processor, a program that examines text and identifies parts that match the provided specification. For more information, see generating a regular expression and building a regular expression.

      Dictionary
      Dictionaries are the most efficient extraction technique. Use a dictionary when you can match on defined words. Dictionaries are lists or enumerations of terms. The template creates a dictionary from an external file, but you can also code the dictionary in-line, for example:
        create dictionary MyDict as ('Finance');

      This statement creates a dictionary with one entry, ‘Finance’.

      You can use an external dictionary file when you have many entries. The external file makes it easier to add and change entries without having to edit the code. For example, if you are developing an extractor that extracts given names and family names, you can collect given names and family names and group them together in one or more dictionary (.dict) files. These dictionary files can then be referenced in the extractor program to identify occurrences of each of these entries in the input documents.

      By default, dictionaries are tokenized and internalized at compile time, but you can use the external dictionary statement to switch dictionaries at run time.

      Part of speech
      You can identify locations of different parts of speech across the input text.
    4. If you want to show the view, click Output View.
    5. If you want to export this view to make it available for other modules, click Export View.
    When you click OK, the AQL script in which you created the view is opened in the editor pane. The appropriate templates for the type of statement that you selected are appended to the file. For example, if you selected the Dictionary type, you see the create dictionary and create view from dictionary statement templates.
    create dictionary <same_name_as_viewDict>
    from file '<path to your dictionary here>'
    with language as 'en';

    create view <view_name> as
    extract dictionary '<same_name_as_viewDict>'
    on R.<input column> as match
    from <input view> R;

    output view <view_name>;

    The statements and placeholders in the preceding template are defined as follows.

    create dictionary
    Creates a dictionary from a file.
    create view
    Creates a view that contains the matches between the dictionary and the input data by using an extract statement.
    input column and input view
    The template contains <input column> and <input view> placeholders that you must edit. When you work with text documents from a file system, use the special view Document to reference an input document. This view has a special column, text, that holds the text of the input document. Use Document.text to refer to the contents of any input document.
    output view
    Materializes the view. By default, views are not materialized. During development you might want to use this statement to look at the contents of intermediate views for debugging purposes. You can comment out or delete the output view statements when they are no longer needed.
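
    As a minimal sketch of how the edited template might look for the Amount example, the following statements define two basic features over Document.text. The module name, the view names, the dictionary entries, and the regular expression are illustrative assumptions, not output of the wizard.

    ```aql
    module Amount_BasicFeatures;  -- hypothetical module name

    -- In-line dictionary of currency terms (entries are illustrative).
    create dictionary CurrencyDict as
      ('USD', 'EUR', 'dollars', 'euros');

    -- One row for each dictionary match in the document text.
    create view Currency as
    extract dictionary 'CurrencyDict'
      on D.text as match
    from Document D;

    -- Regular expression for integers and simple decimals, such as 25 or 3.50.
    create view Number as
    extract regex /\d+(\.\d+)?/
      on D.text as match
    from Document D;

    -- Development only: inspect the intermediate results.
    output view Currency;
    output view Number;
    ```

    Keeping each basic feature in its own small view makes it easier to check the matches for one sublabel before you combine them into candidates.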

  2. Optional: If your input documents are of XML or HTML type, you might need to remove tags. In that case, use the detag AQL statement to leave only the bare text. Put this statement at the top of your *.aql file, below the module statement. The detag statement reads from Document.text and puts the results in a new view (shown here as <file_that_is_detagged>). You must then change the AQL template to extract from that detagged view instead of from Document.
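
    The following sketch shows one way the detag step could look. DetaggedDoc is a hypothetical view name, and the Currency view and CurrencyDict dictionary are assumed to be defined as basic features elsewhere in the same file.

    ```aql
    -- Strip markup from the input and expose the bare text as a new view.
    -- DetaggedDoc is a hypothetical name; its text column holds the detagged content.
    detag Document.text as DetaggedDoc;

    -- Extract from the detagged text instead of from Document.text.
    create view Currency as
    extract dictionary 'CurrencyDict'
      on D.text as match
    from DetaggedDoc D;
    ```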
  3. Add AQL to generate candidates.
    1. Right-click the root label (or a sublabel).
    2. Click New AQL Statement > Candidate Generation AQL Statement. The Create AQL Statement window opens.
    3. Complete the fields.
      View Name
      Views are the primary data structures used with AQL. AQL statements create views by selecting, extracting, and transforming information from other views. AQL views are similar to the views in a relational database. They have rows and columns just like a database view and, by default, the views in AQL are not materialized.
      You reference input data as a special view called Document with one column called text. Each document in the set of input data can be considered as one row in the Document view with the document content mapped onto the text column.
      AQL Module
      If you created your own module, type that module name; otherwise, use the default module name, <label_name>_CandidateFeatures.
      AQL Script
      The file name that identifies this script.
      Type
      • Select
      • Union All
      • Block
      • Pattern
      For more information on the syntax, see the AQL Reference.
    4. If you want to show the view, click Output View.
    5. If you want to export this view to make it available to other modules, click Export View.
    The AQL script that you created (<file_name>.aql) opens in the editor pane, with templates for the type that you selected.
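
    To illustrate three of the candidate generation types, the following sketch combines basic features for the Amount example. It assumes that views named Currency, Number, and Unit already exist, each with a span column named match; all view names here are hypothetical.

    ```aql
    -- Select: a number immediately followed by a unit, for example "5 million".
    create view NumberUnit as
    select CombineSpans(N.match, U.match) as match
    from Number N, Unit U
    where FollowsTok(N.match, U.match, 0, 0);

    -- Pattern: a currency followed by a number, for example "USD 25".
    create view CurrencyNumber as
    extract pattern <C.match> <N.match>
      return group 0 as match
    from Currency C, Number N;

    -- Union All: pool the candidates that the two rules produce.
    create view AmountCandidate as
      (select R.match as match from NumberUnit R)
      union all
      (select R.match as match from CurrencyNumber R);
    ```

    The views that feed a union all must have the same schema, which is why both rules return a single column named match.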
  4. Add AQL statements to remove duplicates and refine the output.
    1. Right-click the root label (or a sublabel).
    2. Click New AQL Statement > Filter and Consolidate AQL Statement. The Create AQL Statement window opens.
    3. Complete the fields.
      View Name
      Views are the primary data structures used with AQL. AQL statements create views by selecting, extracting and transforming information from other views. AQL views are similar to the views in a relational database. They have rows and columns just like a database view and by default the views in AQL are not materialized.
      You reference input data through a special view called Document with one column called text. Each document in the set of input data can be considered as one row in the Document view with the document content mapped onto the text column.
      AQL Module
      If you created your own module, type that module name; otherwise, use the default module name, <label_name>_CandidateFeatures.
      AQL script
      The file name that is used to identify this script.
      Type
      • Consolidate
      • Predicate-based Filter
      • Set-based Filter
      For more information on the syntax, see the AQL Reference.
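
    The three filter and consolidate types can be sketched as follows. The example assumes a candidate view named AmountCandidate and a view of known false positives named DateCandidate, each with a span column named match; these names and the dictionary entries are hypothetical.

    ```aql
    -- Negative clues that make a preceding-context match suspicious (illustrative).
    create dictionary NegativeClueDict as ('page', 'version');

    -- Predicate-based filter: drop candidates with a negative clue
    -- in the two tokens to their left.
    create view AmountFiltered as
    select C.match as match
    from AmountCandidate C
    where Not(ContainsDict('NegativeClueDict', LeftContextTok(C.match, 2)));

    -- Set-based filter: remove rows that also appear in another view.
    create view AmountWithoutDates as
      (select C.match as match from AmountFiltered C)
      minus
      (select D.match as match from DateCandidate D);

    -- Consolidate: discard spans that are contained within a longer span.
    create view Amount as
    select C.match as match
    from AmountWithoutDates C
    consolidate on C.match using 'ContainedWithin';
    ```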
  5. Finalize AQL, and create the run configuration.
    1. Remove any output view statements from your *.aql files, either by commenting them out (--) or by deleting them. At this level, you are building a module for others to use. They can add output view statements to their local code.
    2. Externalize any local dictionaries so consumers of your module can customize this dictionary using their own terms. The best way to do this is to put the external dictionary definition in a separate module and *.aql file.
      1. Click File > New > Other to create a module.
      2. In the New window, select AQL Module, and click Next.
      3. In the New AQL Module window, specify the project name and the module name, and click Finish.

      In the Project Explorer, you see an additional module in this path: <project_name>\textAnalytics\src\<new_module> .

      4. Add a script to your new module to contain the external dictionary declaration. In Project Explorer, right-click the new module name, and select New > Other.
      5. In the New window, select AQL script, and select Next.
      6. In the New AQL Script window, specify the project name, the new module name, and a name for the script <new_name.aql>, then click Finish.
    3. Create the external dictionary and complete the AQL file with an export statement. The export statement exports the dictionary so that it is visible outside the current module. A dictionary file is associated with the external dictionary at run time.
      Return to your consolidated level of AQL, and edit the AQL file to refer to the external dictionary.
        import dictionary <external dictionary name> from module <module name that contains the external dictionary> as <some name that can be used as an alias in local aql files>;
    4. In the AQL file in the consolidate level, change any dictionary references to the alias name that you are importing.
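
    As a sketch of how the two modules might fit together, the first file declares and exports the external dictionary. The module name Amount_Dictionaries and the dictionary name CurrencyDict are hypothetical.

    ```aql
    module Amount_Dictionaries;  -- hypothetical dictionary module

    -- The entries are supplied from a dictionary file at run time.
    create external dictionary CurrencyDict
      allow_empty false
      with case insensitive;

    -- Make the dictionary visible to other modules.
    export dictionary CurrencyDict;
    ```

    The second file, in the consuming module, imports the dictionary under an alias and refers to that alias wherever the local dictionary was used before. The module name and the alias CurrencyTerms are also assumptions.

    ```aql
    module Amount_BasicFeatures;  -- hypothetical consuming module

    import dictionary CurrencyDict from module Amount_Dictionaries as CurrencyTerms;

    -- Same extraction as before, now driven by the imported dictionary.
    create view Currency as
    extract dictionary 'CurrencyTerms'
      on D.text as match
    from Document D;
    ```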
    5. Create the run configuration.
      1. Create a run configuration to associate a file with the external dictionary. From the main menu, click Run > Run Configurations. Right-click Text Analytics, and select New.
      2. In the Name field, specify a name for this configuration.
      3. In the Main page, specify the project name in the Project field.
      4. In the Select Modules field, you see all the modules that are created in the current project. Select the module where you have your AQL statements for the consolidation work.
      5. In the Location for the data collection field, browse the workspace to find the data that you used in your project.
      6. Open the External Dictionaries page. Browse the workspace to find the original dictionary file that you used in your AQL file, and click OK. This associates the dictionary file that contains your terms with the external dictionary declaration.
      7. Click Run to process this configuration. You might get errors because there are no enabled outputs, but this is acceptable for the purposes of exporting a module.
    6. Export the module and set up a library of reusable modules.
      1. In Project Explorer, right-click your project, and select Export.
      2. In the Export window, expand BigInsights, click Export Text Analytics Extractor, and click Next.
      3. In the Export Extractor window, select the working module, and select the Export dependent modules check box to ensure that the external dictionary module is included along with your working module.
      4. In the Select the export destination directory field, browse the file system or your workspace, and select a destination for the export.
      5. Specify whether to export to the destination directory or to a jar or zip file in the destination directory.
      6. If you select the Export to a jar or zip archive under the destination directory radio button, provide a file name for the archive.
      7. Click Finish.