Counterfactual Analysis

Introduction

Counterfactual Analysis predicts how an outcome would change when one of the inputs is changed by an intervention. For example, it can help predict customer demand if prices change, or a disease outcome if a treatment is taken. This differs from the ‘What-if’ analysis of predictive models, which simply predicts the outcome for a new input value.

To understand the difference between our counterfactual analysis and ‘What-if’ analysis, it is important to distinguish an intervention from an observed value. An intervention is an action that deliberately changes one of the input values, whereas an observed value is simply what was recorded, with no knowledge of the source or cause of any change. To predict the outcome of an intervention, it is essential to estimate the causal effect of the intervention on the outcome. Hence, counterfactual analysis provides an understanding of how an outcome is affected by changes to input values via an intervention, and similarly, what changes to the input values would change a model’s outcomes and decisions.

In order to avoid biases due to spurious correlations, common causes (also known as confounders, which affect both the treatment and outcome variables) are first disassociated from both the given treatment variables and the outcome variable. The analysis then finds the association between the disassociated treatment and outcome variables. This process is done using Double Machine Learning, which first generates two predictive models (one to predict the outcome from the confounders, and one to predict the treatment from the confounders) that are then combined to create a model of the treatment effects.
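The residual-on-residual idea behind Double Machine Learning can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used by the app: it uses a single confounder, plain least-squares lines in place of the machine learning models, synthetic data with a known effect of 2.0, and hypothetical helper names (fit_line, cross_fit_residuals, dml_effect).

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    return mean_y - slope * mean_x, slope


def cross_fit_residuals(confounder, target, n_folds=5):
    """Out-of-fold residuals of `target` after predicting it from `confounder`."""
    residuals = [0.0] * len(target)
    for fold in range(n_folds):
        train = [i for i in range(len(target)) if i % n_folds != fold]
        test = [i for i in range(len(target)) if i % n_folds == fold]
        intercept, slope = fit_line(
            [confounder[i] for i in train], [target[i] for i in train]
        )
        for i in test:
            residuals[i] = target[i] - (intercept + slope * confounder[i])
    return residuals


def dml_effect(confounder, treatment, outcome, n_folds=5):
    """Regress outcome residuals on treatment residuals (no intercept)."""
    t_res = cross_fit_residuals(confounder, treatment, n_folds)
    y_res = cross_fit_residuals(confounder, outcome, n_folds)
    return sum(t * y for t, y in zip(t_res, y_res)) / sum(t * t for t in t_res)


# Synthetic data (assumption): the confounder drives both treatment and outcome,
# and the true causal effect of the treatment on the outcome is 2.0.
rng = random.Random(0)
confounder = [rng.uniform(0, 10) for _ in range(2000)]
treatment = [0.5 * c + rng.gauss(0, 1) for c in confounder]
outcome = [2.0 * t + 3.0 * c + rng.gauss(0, 1) for t, c in zip(treatment, confounder)]

# Ignoring the confounder overstates the effect; residualizing recovers it.
naive_slope = fit_line(treatment, outcome)[1]
estimated_effect = dml_effect(confounder, treatment, outcome)
print(f"naive slope: {naive_slope:.2f}, DML estimate: {estimated_effect:.2f}")
```

In this sketch the naive regression of the outcome on the treatment absorbs the influence of the confounder, while the estimate obtained from the residuals should land close to the true effect.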

To run a Counterfactual Analysis, the following are required:

  • A dataset that contains the outcome

  • The values of the treatment variable

  • The new values of the treatment variable

  • The common causes that have causal effects on both the treatment and outcome

Parameters

  • Target: The variable on which we want to measure the effects of the intervention, a.k.a. the outcome. It can only be numerical.

  • Current Intervention: Variables that cause the effect of interest, a.k.a. the current treatment affecting the outcome. It can be either numeric, Boolean, or categorical.

  • New Intervention: New variable that will generate a different effect on the outcome, a.k.a. the new treatment affecting the outcome. It can be either numeric, Boolean, or categorical. It must be the same type as the Current Intervention.

  • Cross-validation: Number of folds to use in cross-validation for the double machine learning algorithm. A larger value leads to better results but makes the analysis slower. Default value is 5.

  • Common causes: Also known as confounders, these are variables that can affect both the treatment and the outcome. Selecting good common causes can help improve the results. The columns that have an effect on both the target and the current intervention should be included.

  • Alpha level: Value used to calculate the lower and upper bounds of the effect. For example, if the alpha level is 0.05, the lower bound and upper bound would create a 95% confidence interval for the effect. Default is 0.05.

  • Filters (optional): Set conditions on the columns to be filtered in the original dataset. If selected, only a subset of the original data will be used in the analysis.

Furthermore, if any features are related to time, they can also be used for visualization:

  • Time Column: The time-based feature to be used for visualization. An arbitrary expression which returns a DATETIME column in the table can also be defined, using format codes as defined here.

  • Time Range: The time range used for visualization, where the time is based on the Time Column. A number of pre-set values are provided, such as Last day, Last week, Last month, etc. Custom time ranges can also be provided. In any case, all relative times are evaluated on the server using the server’s local time, and all tooltips and placeholder times are expressed in UTC. If a start time and/or an end time is specified, the time-zone can also be set using the ISO 8601 format.
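To illustrate how the Alpha level parameter relates to the width of the reported bounds, the following sketch converts an alpha level into a two-sided interval. It assumes a normal approximation and a known standard error, which may differ from how the app actually computes its bounds; the function name effect_bounds and the numbers used are hypothetical.

```python
from statistics import NormalDist


def effect_bounds(effect, standard_error, alpha=0.05):
    """Two-sided (1 - alpha) interval under a normal approximation (assumption)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return effect - z * standard_error, effect + z * standard_error


# Hypothetical effect of 100 with a standard error of 10.
low_95, high_95 = effect_bounds(100.0, 10.0, alpha=0.05)  # 95% interval
low_90, high_90 = effect_bounds(100.0, 10.0, alpha=0.10)  # 90% interval, narrower
print(f"95%: [{low_95:.1f}, {high_95:.1f}]  90%: [{low_90:.1f}, {high_90:.1f}]")
```

A smaller alpha level gives a wider interval, since more confidence is demanded of the bounds.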

Result View

  • Table: A table showing the counterfactual effect of the treatment on the outcome. The columns are the same as the original data set but new columns are added to show the counterfactual effect:

    • The <target>_intervened column shows the outcome after the intervention, where <target> represents the name of the Target variable.

    • The <target>_intervened_high and <target>_intervened_low columns show the upper and lower bounds, respectively, of the counterfactual effect.

    • The intervention_effect column shows the difference between the intervened outcome and the original outcome.

    • The intervention_effect_high and intervention_effect_low columns show the upper and lower bounds, respectively, of the intervention effect.

  • Validation: Shows the validation results of the double machine learning algorithm:

    • The treatment model infers the treatment from the common causes.

    • The outcome model infers the outcome from the common causes.

    • A final model is then used on the residuals of the treatment and outcome models to infer the counterfactual effect.

    • The Feature-Importance tables show the importance of each variable in the models.

    • The residual plot shows the residuals of the outcome model on the \(y\)-axis, and the residuals of the treatment model on the \(x\)-axis.

  • Causal graph: A Directed Acyclic Graph (DAG) that visualizes the parameter choices based on user selection of treatment, outcome and common causes. A DAG (an example of which is shown below) can be interpreted as follows:

    • An edge represents causality between variables.

    • Gray color represents the Common causes variables.

    • White color represents the Current Intervention variable.

    • Red color represents the Outcome variable.

  • Plot:

    • The first graph depicts the intervention chart, namely a plot of the original target on the \(x\)-axis, and the intervened target on the \(y\)-axis. If the target increases, the point will be above the identity line (the line on which the original target and new intervened target are equal). If the target decreases, the point will be below the identity line. If the intervention is categorical, the points will have colors corresponding to each pair of categories (current_intervention -> new_intervention). If the intervention is numeric, the points will have colors corresponding to the difference of the interventions (new_intervention - current_intervention).

    • The second plot is the effect plot; if numerical, the \(x\)-axis is the difference of the interventions, whereas if categorical, the \(x\)-axis contains each pair of categories. The \(y\)-axis is the intervention effect (new_target - original_target).
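The relationship between the intervention effect and the position of a point in the intervention chart can be sketched as below. This is only an illustration of the definitions given above (effect = intervened target - original target); the function name describe_effect and the example values are hypothetical.

```python
def describe_effect(original_target, intervened_target):
    """Return the intervention effect and the point's position relative to the identity line."""
    effect = intervened_target - original_target
    if effect > 0:
        position = "above"  # the target increased
    elif effect < 0:
        position = "below"  # the target decreased
    else:
        position = "on"     # the target is unchanged
    return effect, position


# Hypothetical (original, intervened) pairs.
examples = [(2000.0, 2150.0), (2000.0, 1900.0), (2000.0, 2000.0)]
results = [describe_effect(original, intervened) for original, intervened in examples]
for (original, intervened), (effect, position) in zip(examples, results):
    print(f"{original} -> {intervened}: effect {effect:+.1f}, {position} the identity line")
```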

Case Study

Note

This example is available in the Actable AI web app and may be viewed here.

Suppose that we are a real estate agent and we would like to see how the price of a house changes if the number of rooms is increased or decreased. To do this, we use an existing dataset of houses and add a new column called new_number_of_rooms representing the number of rooms after the intervention, where our intervention is number_of_rooms + random.choice([-1, 1]) (i.e. the number of rooms is randomly increased or decreased by 1).
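One way to build such a column is sketched below. It is an illustration only: the helper add_new_intervention and the seed argument are hypothetical, the rows are represented as plain dictionaries, and no guard is included against the number of rooms becoming negative.

```python
import random


def add_new_intervention(rows, seed=None):
    """Return copies of the rows with new_number_of_rooms = number_of_rooms +/- 1."""
    rng = random.Random(seed)
    return [
        dict(row, new_number_of_rooms=row["number_of_rooms"] + rng.choice([-1, 1]))
        for row in rows
    ]


houses = [
    {"number_of_rooms": 1, "sqft": 674, "rental_price": 2167},
    {"number_of_rooms": 3, "sqft": 1219, "rental_price": 5510},
    {"number_of_rooms": 1, "sqft": 398, "rental_price": 2272},
]
with_intervention = add_new_intervention(houses, seed=0)
for row in with_intervention:
    print(row)
```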

An example of the dataset is as follows:

number_of_rooms    new_number_of_rooms    sqft      rental_price
0                  1                      484.8     2271
1                  2                      674       2167
1                  0                      554       1883
0                  1                      529       2431
3                  2                      1219      5510
1                  2                      398       2272
3                  2                      1190      4123.812

To run the counterfactual analysis, we need to specify the following parameters:

  • The target is the column rental_price.

  • The current intervention is the number_of_rooms.

  • The new intervention is the column new_number_of_rooms.

  • The common cause is the column sqft.

_images/case_study_parameters.png

After pressing the Run button, the following will be displayed:

Table

A table containing the counterfactual effect of the treatment on the outcome for each apartment.

_images/case_study_table.png

This table can be downloaded by clicking on the Download button.

_images/case_study_download.png

Validation

This tab shows the validation results for the double machine learning models:

_images/case_study_importance.png

It can be observed that the treatment model and the outcome model have both been able to infer their targets from the common cause, meaning that the rental price and the number of rooms can each be predicted well from the sqft feature alone.

_images/case_study_residuals.png

In the above image, the residuals of the two models can be observed.

Causal Graph

_images/case_study_graph.png

The causal graph helps to visualize the causal relationships between the variables. From the graph, we can see that sqft is a cause of the rental_price and number_of_rooms, and that number_of_rooms is also a cause of the rental_price.

Intervention Plot

_images/case_study_effect.png

The \(x\)-axis represents the original rental price, while the \(y\)-axis represents the predicted rental price after the intervention. The color of each point corresponds to the difference in the number of rooms: a red point indicates that the number of rooms was increased by 1, while a blue point indicates that the number of rooms was decreased by 1.

It can be observed that the rental price is generally higher when the number of rooms increases, as could be expected.

_images/case_study_inteff.png

In this plot, we can see the range of the effect of the intervention. The \(x\)-axis contains the intervention difference (new_number_of_rooms - number_of_rooms) and the \(y\)-axis represents the effect of the intervention (rental_price_intervened - rental_price). It can also be observed from this plot that decreasing the number of rooms always results in a decrease in the price, whereas increasing the number of rooms generally results in a higher price.