.. _counterfactual:

Counterfactual Analysis
=======================

Introduction
------------

*Counterfactual Analysis* helps to predict outcomes when one of the inputs is changed. For example, it can help predict customer demand if prices change, or predict disease outcomes if a treatment is taken. This is different from the 'What-if' analysis of predictive models, which generates results simply by predicting outcomes from a new input value.

To understand the difference between our counterfactual analysis and 'What-if' analysis, it is important to understand the difference between an *intervention* and an *observed value*: an *intervention* is an action taken that changes one of the input values, while an *observed value* is simply what is observed, with no knowledge of the source or cause of the change. To predict the outcome of an intervention, it is essential to estimate the *causal effect* of the intervention on the outcome. Hence, counterfactual analysis provides an understanding of *how* an outcome is affected by changes to the input values via an intervention, and similarly, of *what changes* to the input values would change a model's outcomes and decisions.

In order to avoid biases due to spurious correlations, *common causes* (also known as :ref:`confounders`, which affect both the :ref:`treatment` and :ref:`outcome` variables) are first disassociated from both the given :ref:`treatment` variables and the :ref:`outcome` variable. The analysis then finds the association between the disassociated :ref:`treatment` and :ref:`outcome` variables. This is done using :ref:`gls_double_ml`, which first fits two predictive models (one predicting the :ref:`outcome` from the :ref:`confounder` variables, and one predicting the :ref:`treatment` from the :ref:`confounders`) that are then combined to create a model of the treatment effects.

To run a Counterfactual Analysis, the following are required:

* A dataset that contains the outcome
* The values of the treatment variable
* The *new* values of the treatment variable
* The :ref:`common causes` that have causal effects on both the treatment and outcome

Parameters
----------

* **Target**: The variable on which we want to measure the effects of the intervention, a.k.a. the outcome. It can only be numerical.
* **Current Intervention**: The variable that causes the effect of interest, a.k.a. the current treatment affecting the outcome. It can be numeric, Boolean, or categorical.
* **New Intervention**: The variable containing the new values that will generate a different effect on the outcome, a.k.a. the new treatment affecting the outcome. It can be numeric, Boolean, or categorical, and must be of the same type as the **Current Intervention**.
* **Cross-validation**: Number of folds to use in cross-validation for the double machine learning algorithm. A larger value leads to better results but makes the analysis slower. The default value is 5.
* **Common causes**: Also known as :ref:`confounders`, these are variables that can affect both the treatment *and* the outcome. Selecting good common causes can help improve the results; the columns that have an effect on both the target and the current intervention should be included.
* **Alpha level**: Value used to calculate the lower and upper bounds of the effect. For example, if the alpha level is 0.05, the lower and upper bounds form a 95% confidence interval for the effect. The default is 0.05.
* **Filters** (optional): Set conditions on the columns to filter the original dataset. If selected, only the matching subset of the original data is used in the analysis.
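The core parameters above map directly onto columns of the input dataset. As a minimal illustration (all column names below are hypothetical, not required names), such a table could be assembled as follows:

.. code-block:: python

   # A hypothetical input table for a counterfactual analysis (illustration only).
   # Each required input listed above maps to one column.
   import pandas as pd

   data = pd.DataFrame({
       "sales":      [120, 95, 143, 88],        # Target (outcome), numerical
       "price":      [10.0, 12.0, 9.5, 11.0],   # Current Intervention (treatment)
       "new_price":  [9.0, 11.0, 10.5, 10.0],   # New Intervention (same type as current)
       "store_size": [250, 400, 180, 320],      # Common cause (confounder)
   })
   print(data.head())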
Furthermore, if any features are related to time, they can also be used for visualization:

* **Time Column**: The time-based feature to be used for visualization. An arbitrary expression that returns a DATETIME column in the table can also be defined, using format codes as defined `here `_.
* **Time Range**: The time range used for visualization, where the time is based on the **Time Column**. A number of pre-set values are provided, such as *Last day*, *Last week*, *Last month*, etc. Custom time ranges can also be provided. In any case, all relative times are evaluated on the server using the server's local time, and all tooltips and placeholder times are expressed in UTC. If the start and/or end time is specified, the time zone can also be set using the ISO 8601 format.

Result View
-----------

* **Table**: A table showing the counterfactual effect of the treatment on the outcome. The columns are the same as in the original dataset, but new columns are added to show the counterfactual effect:

  * The ``<target>_intervened`` column shows the outcome after the intervention, where ``<target>`` represents the name of the **Target** variable.
  * The ``<target>_intervened_high`` and ``<target>_intervened_low`` columns show the upper and lower bounds, respectively, of the counterfactual effect.
  * The ``intervention_effect`` column shows the difference between the intervened outcome and the original outcome.
  * The ``intervention_effect_high`` and ``intervention_effect_low`` columns show the upper and lower bounds, respectively, of this difference.

* **Validation**: Shows the validation results of the double machine learning algorithm:

  * The *treatment model* infers the treatment from the common causes.
  * The *outcome model* infers the outcome from the common causes.
  * A final model is then used on the residuals of the treatment and outcome models to infer the counterfactual effect (a minimal code sketch of this residualization step is given at the end of this section).
  * The *Feature-Importance* tables show the importance of each variable in the models.
  * The *residual plot* shows the residuals of the outcome model on the :math:`y`-axis, and the residuals of the treatment model on the :math:`x`-axis.

* **Causal graph**: A Directed Acyclic Graph (DAG) that visualizes the parameter choices based on the user's selection of treatment, outcome, and common causes. A DAG (an example of which is shown below) can be interpreted as follows:

  * An edge represents *causality* between variables.
  * Gray represents the *Common causes* variables.
  * White represents the *Current Intervention* variable.
  * Red represents the *Outcome* variable.

* **Plot**:

  * The first graph is the *intervention chart*, namely a plot of the original target on the :math:`x`-axis and the intervened target on the :math:`y`-axis. If the target increases, the point lies above the *identity line* (the line on which the original target and the new, intervened target are equal); if the target decreases, the point lies below the identity line. If the intervention is categorical, the points are colored according to each pair of categories (``current_intervention`` -> ``new_intervention``). If the intervention is numeric, each point is colored according to the difference of the interventions (``new_intervention`` - ``current_intervention``).
  * The second plot is the *effect plot*. If the intervention is numerical, the :math:`x`-axis is the difference of the interventions, whereas if it is categorical, the :math:`x`-axis contains each pair of categories. The :math:`y`-axis is the intervention effect (``new_target`` - ``original_target``).
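For intuition, below is a minimal sketch of the residualization idea behind these validation outputs, using scikit-learn. The file name, column names, and choice of regressors are hypothetical; this illustrates the general double machine learning recipe and is not the platform's actual implementation.

.. code-block:: python

   # Minimal sketch of the double machine learning residualization step
   # (illustration only; not the platform's exact implementation).
   import pandas as pd
   from sklearn.ensemble import RandomForestRegressor
   from sklearn.linear_model import LinearRegression
   from sklearn.model_selection import cross_val_predict

   df = pd.read_csv("data.csv")                    # hypothetical input dataset
   X = df[["common_cause_1", "common_cause_2"]]    # common causes (confounders)
   t = df["current_intervention"]                  # treatment
   y = df["target"]                                # outcome

   # Treatment model: predict the treatment from the common causes.
   t_hat = cross_val_predict(RandomForestRegressor(), X, t, cv=5)
   # Outcome model: predict the outcome from the common causes.
   y_hat = cross_val_predict(RandomForestRegressor(), X, y, cv=5)

   # The residuals are the parts of the treatment and the outcome that the
   # common causes cannot explain; these are what the residual plot shows.
   t_res = t - t_hat
   y_res = y - y_hat

   # Final model: regress the outcome residuals on the treatment residuals.
   # Its slope is an estimate of the effect of one unit of treatment.
   final = LinearRegression().fit(t_res.to_frame(), y_res)
   print("Estimated effect per unit of treatment:", final.coef_[0])

In this sketch, the ``cv=5`` argument plays the same role as the **Cross-validation** parameter described above, and the two cross-fitted models correspond to the *treatment model* and *outcome model* whose feature importances are shown in the Validation tab.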
Case Study
----------

.. note::

   This example is available in the Actable AI web app and may be viewed `here `_.

Suppose that we are a real estate agent and we would like to see how the price of a house changes if the number of rooms is increased or decreased. To do this, we use an existing dataset of houses and add a new column called ``new_number_of_rooms`` representing the number of rooms after the intervention, where our intervention is ``number_of_rooms + random.choice([-1, 1])`` (i.e. the number of rooms is randomly increased or decreased by 1). An example of the dataset is as follows:

+------------------+----------------------+-------+---------------+
| number_of_rooms  | new_number_of_rooms  | sqft  | rental_price  |
+==================+======================+=======+===============+
| 0                | 1                    | 484.8 | 2271          |
+------------------+----------------------+-------+---------------+
| 1                | 2                    | 674   | 2167          |
+------------------+----------------------+-------+---------------+
| 1                | 0                    | 554   | 1883          |
+------------------+----------------------+-------+---------------+
| 0                | 1                    | 529   | 2431          |
+------------------+----------------------+-------+---------------+
| 3                | 2                    | 1219  | 5510          |
+------------------+----------------------+-------+---------------+
| 1                | 2                    | 398   | 2272          |
+------------------+----------------------+-------+---------------+
| 3                | 2                    | 1190  | 4123.812      |
+------------------+----------------------+-------+---------------+

To run the counterfactual analysis, we need to specify the following parameters:

* The target is the column ``rental_price``.
* The current intervention is the column ``number_of_rooms``.
* The new intervention is the column ``new_number_of_rooms``.
* The common cause is the column ``sqft``.

.. image:: images/analytics/counterfactual/case_study_parameters.png
   :scale: 75%
   :align: center

After pressing the **Run** button, the following will be displayed:

Table
"""""

A table containing the counterfactual effect of the treatment on the outcome for each apartment.

.. image:: images/analytics/counterfactual/case_study_table.png
   :scale: 25%
   :align: center

This table can be downloaded by clicking on the **Download** button.

.. image:: images/analytics/counterfactual/case_study_download.png
   :scale: 100%
   :align: center

Validation
""""""""""

This tab shows the validation results for the double machine learning models:

.. image:: images/analytics/counterfactual/case_study_importance.png
   :scale: 100%
   :align: center

It can be observed that the treatment model and the outcome model are both able to infer their targets from the common cause, meaning that the rental price and the number of rooms can be predicted reasonably well with the ``sqft`` feature alone.

.. image:: images/analytics/counterfactual/case_study_residuals.png
   :scale: 40%
   :align: center

In the above image, the residuals of the two models can be observed.

Causal Graph
""""""""""""

.. image:: images/analytics/counterfactual/case_study_graph.png
   :scale: 40%
   :align: center

The causal graph helps to visualize the causal relationships between the variables. From the graph, we can see that ``sqft`` is a cause of both the ``rental_price`` and the ``number_of_rooms``, and that the ``number_of_rooms`` is also a cause of the ``rental_price``.
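The structure of this small DAG can also be written down programmatically, for example with ``networkx``. The snippet below is only an illustration of the graph described above; the web app generates the causal graph automatically.

.. code-block:: python

   # Sketch of the case-study causal graph (illustration only).
   import matplotlib.pyplot as plt
   import networkx as nx

   dag = nx.DiGraph()
   dag.add_edges_from([
       ("sqft", "number_of_rooms"),          # common cause -> current intervention
       ("sqft", "rental_price"),             # common cause -> outcome
       ("number_of_rooms", "rental_price"),  # current intervention -> outcome
   ])

   assert nx.is_directed_acyclic_graph(dag)  # the graph must contain no cycles
   nx.draw_networkx(dag, node_color="lightgray")
   plt.show()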
Intervention Plot
"""""""""""""""""

.. image:: images/analytics/counterfactual/case_study_effect.png
   :scale: 70%
   :align: center

The :math:`x`-axis represents the original rental price, while the :math:`y`-axis represents the predicted rental price after the intervention. The color of each point corresponds to the difference in the number of rooms: a red point indicates that the number of rooms was increased by 1, while a blue point indicates that the number of rooms was decreased by 1. It can be observed that the rental price is generally higher when the number of rooms increases, as could be expected.

.. image:: images/analytics/counterfactual/case_study_inteff.png
   :scale: 70%
   :align: center

In this plot, we can see the range of the effect of the intervention. The :math:`x`-axis contains the intervention difference ``(new_number_of_rooms - number_of_rooms)``, and the :math:`y`-axis represents the effect of the intervention ``(rental_price_intervened - rental_price)``. It can also be observed from this plot that decreasing the number of rooms always results in a decrease in the price, whereas increasing the number of rooms generally results in a higher price.
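As a rough sketch, and assuming the result table has been downloaded to a hypothetical file ``results.csv`` containing the columns described in the Result View section, a similar intervention chart could be reproduced as follows:

.. code-block:: python

   # Rough sketch of the intervention chart (file name "results.csv" is hypothetical).
   import matplotlib.pyplot as plt
   import pandas as pd

   df = pd.read_csv("results.csv")
   delta_rooms = df["new_number_of_rooms"] - df["number_of_rooms"]

   fig, ax = plt.subplots()
   points = ax.scatter(df["rental_price"], df["rental_price_intervened"],
                       c=delta_rooms, cmap="coolwarm")
   lims = [df["rental_price"].min(), df["rental_price"].max()]
   ax.plot(lims, lims, "k--", label="identity line")  # original == intervened
   ax.set_xlabel("rental_price (original)")
   ax.set_ylabel("rental_price_intervened")
   fig.colorbar(points, label="new_number_of_rooms - number_of_rooms")
   ax.legend()
   plt.show()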