Information Extraction

Introduction

Text data can be intelligently processed to extract and organise any required information. For instance, given text obtained from an invoice (e.g. using Actable AI’s Optical Character Recognition (OCR) analytic), details such as the invoice number, seller, item names, quantities, and prices could be retrieved. This organised information can then be used in a number of ways, such as compiling statistics on the items purchased or on how money was spent.

Actable AI uses Large Language Models (LLMs) provided by OpenAI to interpret the text and retrieve the desired information. However, an in-house model that does not require any data to be sent to OpenAI is currently in development.

Note

A video demonstration using the OCR and information extraction analytics may be viewed here.

Parameters

There are two tabs containing options that can be set, namely the Data tab and the Information Extraction tab, as follows:

  • Data tab:

    • Filters (optional): Set conditions on columns to be filtered in the original dataset. If selected, only a subset of the original data is used in the analytics.

    • Text column: The name of the column in the table containing the document text to be analyzed.

    • Document name column: The name of the column in the table corresponding to the document file name.

    Furthermore, if any features are related to time, they can also be used for visualization:

    • Time Column: The time-based feature to be used for visualization. An arbitrary expression which returns a DATETIME column in the table can also be defined, using format codes as defined here.

    • Time Range: The time range used for visualization, where the time is based on the Time Column. A number of pre-set values are provided, such as Last day, Last week, Last month, etc. Custom time ranges can also be provided. In any case, all relative times are evaluated on the server using the server’s local time, and all tooltips and placeholder times are expressed in UTC. If the start time and/or the end time is specified, the time zone can also be set using the ISO 8601 format.

  • Information Extraction tab:

    This tab contains all of the available model types organized in sections. Currently, only OpenAI-based models are available. However, more models will be added in the future.

    • Information to Extract: The names/keywords of the information to be extracted. For example, invoice_no will enable the model to search for the invoice number and assign it to a variable based on the provided JSON schema (described below). Note that the names do not need to be exact, since the LLMs can attempt to infer the meaning of the provided variable names in order to extract the correct information.

    • Model: The model to use for information extraction. Currently supported models include GPT 3.5 Turbo, GPT 3.5 Turbo 16K, GPT 4, and GPT 4 32K. More models will be added in the future.

    • Output JSON Schema: The schema (format) used for the output of the LLM. For each variable listed in the Information to Extract field, its type must be provided. These are typically str (for string/text-based variables), int (for variables containing integers), and float (for variables containing floating-point/decimal numbers). Lists containing additional variables may also be defined. More information is provided in the case study below.

    • Rate Limit Per Minute: The maximum rate, per minute, at which the OpenAI API is called.

    • OpenAI API Key: The key used to authenticate with the OpenAI API.
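To illustrate what a per-minute rate limit means in practice, the following is a minimal sketch of a sliding-window limiter. It is not Actable AI's implementation: the class name PerMinuteLimiter and its parameters are hypothetical, and it assumes the limit counts API requests (rather than, say, tokens).

```python
import time
from collections import deque


class PerMinuteLimiter:
    """Hypothetical helper: allow at most `max_calls` calls per `period` seconds."""

    def __init__(self, max_calls, period=60.0, clock=time.monotonic, sleep=time.sleep):
        if max_calls < 1:
            raise ValueError("max_calls must be at least 1")
        self.max_calls = max_calls
        self.period = period
        self.clock = clock      # injectable so the limiter can be tested without waiting
        self.sleep = sleep
        self.calls = deque()    # timestamps of the calls made within the current window

    def wait(self):
        """Block until another call is allowed, then record it."""
        while True:
            now = self.clock()
            # Forget calls that have fallen out of the window.
            while self.calls and now - self.calls[0] >= self.period:
                self.calls.popleft()
            if len(self.calls) < self.max_calls:
                self.calls.append(now)
                return
            # Window is full: sleep until the oldest call expires.
            self.sleep(self.period - (now - self.calls[0]))


# Usage: call wait() before each (assumed) API request.
limiter = PerMinuteLimiter(max_calls=3)
for document in ["invoice_a", "invoice_b", "invoice_c"]:
    limiter.wait()  # returns immediately here, since only 3 calls are made
```

A lower limit makes a large batch of documents take longer to process, but reduces the chance of requests being rejected for exceeding the account's quota.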

Case Study

Note

This example is available in the Actable AI web app and may be viewed here.

Suppose we have text from invoices, and would like to determine the invoice number, name of the seller, and information on the items purchased (such as their name, quantity, and price). We set our parameters as follows:

  • Data tab:

    _images/info_extraction_setup_data.png
  • Information Extraction tab:

    _images/info_extraction_setup_info_extraction.png

As can be observed, the following variables are defined:

  • invoice_no: The invoice number, set to int type (an integer/whole number).

  • seller_name: The name of the seller, set to str (a string, since names are typically in the form of text).

  • items: Since there may be multiple items listed in an invoice, this variable takes the form of a list that defines a set of variables for each item, namely:

    • name: The name of the item, set to str type (since names are text-based).

    • quantity: The quantity of each item, set to int (a whole number).

    • price: The price of one item, set to float (a decimal number).

    • total_price: The total price of the items (the unit price multiplied by the quantity), set to float.
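The structure described above can be sketched with Python type hints. This is only an illustration of the shape of the data, not the exact schema syntax expected by the Output JSON Schema field; the class names, the check_invoice helper, and the sample values are all hypothetical.

```python
from typing import List, TypedDict


class Item(TypedDict):
    name: str
    quantity: int
    price: float
    total_price: float


class Invoice(TypedDict):
    invoice_no: int
    seller_name: str
    items: List[Item]


def check_invoice(record: dict) -> list:
    """Return a list of type problems found in a record (empty if none)."""
    problems = []
    if not isinstance(record.get("invoice_no"), int):
        problems.append("invoice_no should be an int")
    if not isinstance(record.get("seller_name"), str):
        problems.append("seller_name should be a str")
    items = record.get("items")
    if not isinstance(items, list):
        problems.append("items should be a list")
        return problems
    expected_types = {
        "name": str,
        "quantity": int,
        "price": (int, float),        # accept whole-number prices as well
        "total_price": (int, float),
    }
    for i, item in enumerate(items):
        if not isinstance(item, dict):
            problems.append(f"items[{i}] should be an object")
            continue
        for key, expected in expected_types.items():
            if not isinstance(item.get(key), expected):
                problems.append(f"items[{i}].{key} has the wrong type")
    return problems


# Made-up record, for illustration only.
sample = {
    "invoice_no": 1001,
    "seller_name": "Example Seller Ltd",
    "items": [
        {"name": "Widget", "quantity": 2, "price": 4.5, "total_price": 9.0},
    ],
}
print(check_invoice(sample))  # []
```

A check of this kind can be useful because an LLM may occasionally return a value of a different type than the one requested, for example an invoice number containing letters.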

Running the analytic will attempt to retrieve the desired information and store it in the format defined by the JSON schema. An example of extracted information is shown below:

_images/results.png

As can be observed, the resulting table contains one row for each image. The extracted_data_raw column contains the extracted information according to the defined JSON schema.
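Once exported, the extracted data can be used to compile statistics such as the amount spent per seller. The sketch below assumes that each row is available as a dictionary and that extracted_data_raw holds a JSON string following the case-study schema; the function name, the document_name key, and the sample rows are hypothetical.

```python
import json
from collections import defaultdict


def spend_per_seller(rows):
    """Sum the total_price of all items, grouped by seller_name."""
    totals = defaultdict(float)
    for row in rows:
        # Assumption: the column stores the extracted data as a JSON string.
        data = json.loads(row["extracted_data_raw"])
        for item in data.get("items", []):
            totals[data["seller_name"]] += item["total_price"]
    return dict(totals)


# Made-up rows, for illustration only.
rows = [
    {
        "document_name": "invoice_a.png",
        "extracted_data_raw": json.dumps({
            "invoice_no": 1001,
            "seller_name": "Example Seller Ltd",
            "items": [
                {"name": "Widget", "quantity": 2, "price": 4.5, "total_price": 9.0},
                {"name": "Gadget", "quantity": 1, "price": 3.0, "total_price": 3.0},
            ],
        }),
    },
]
print(spend_per_seller(rows))  # {'Example Seller Ltd': 12.0}
```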