Counterfactual Analysis

Introduction

Counterfactual Analysis predicts how an outcome would change if we took action to change one of the inputs. For example, it can predict demand after a price change, or a disease outcome if a treatment is given. This differs from the What-if analysis of predictive models, which simply predicts the outcome for a new input value.

To understand the difference between our counterfactual analysis and What-if analysis, it is important to understand the difference between an intervention and an observed value. An intervention is an action that deliberately changes one of the input values, whereas an observed value is simply recorded, and we do not know what caused it to change. To predict the outcome of an intervention, it is essential to estimate its causal effect on the outcome.

To run a Counterfactual Analysis, you need a data set that contains the outcome, the treatment, the new treatment, and the common causes that have causal effects on both the treatment and the outcome.
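
The gap between an observed association and the effect of an intervention can be illustrated with a small simulation. The sketch below is only illustrative: the variable names, the structural equation, and the true effect of 2.0 are all assumptions made for the example, not part of the tool. A common cause drives both the treatment and the outcome, so the slope read off the observed data differs from what actually happens when the treatment is changed by one unit.

```python
import random

random.seed(0)

TRUE_EFFECT = 2.0  # assumed causal effect of the treatment on the outcome


def outcome(treatment, cause, noise):
    # Assumed structural equation: the common cause also drives the outcome.
    return TRUE_EFFECT * treatment + 5.0 * cause + noise


def slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


n = 5000
cause = [random.gauss(0, 1) for _ in range(n)]
treatment = [c + random.gauss(0, 1) for c in cause]  # the cause drives the treatment
noise = [random.gauss(0, 1) for _ in range(n)]
observed = [outcome(t, c, e) for t, c, e in zip(treatment, cause, noise)]

# Observed association: how the outcome varies with the treatment in the data.
naive_slope = slope(treatment, observed)

# Intervention: raise the treatment by 1 on every row, keeping cause and noise fixed.
intervened = [outcome(t + 1, c, e) for t, c, e in zip(treatment, cause, noise)]
interventional_effect = sum(i - o for i, o in zip(intervened, observed)) / n

print(f"naive slope:           {naive_slope:.2f}")
print(f"interventional effect: {interventional_effect:.2f}")
```

With these assumed coefficients the naive slope lands well above the true effect, because it also picks up the influence of the common cause, while the change under the intervention equals the true effect.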

Parameters

  • Target: Variable on which we want to measure the effect of the intervention (a.k.a. the outcome). It must be numeric.
  • Current Intervention: Variable that causes the effect of interest (a.k.a. the current treatment affecting the outcome). It can be numeric, boolean, or categorical.
  • New Intervention: New variable that would generate a different effect on the outcome (a.k.a. the new treatment affecting the outcome). It can be numeric, boolean, or categorical, and it must have the same type as the current intervention.
  • Cross-validation: Number of folds to use in cross-validation for the double machine learning algorithm. A higher value generally gives more reliable estimates but makes the analysis slower. Default is 5.
  • Common causes: Also known as confounders. These are variables that can affect both the treatment and the outcome. The choice of common causes strongly influences the results, so include the columns that have an effect on both the target and the current intervention.
  • Alpha Effect: Significance level used to calculate the low and high bounds of the effect. For example, if the alpha effect is 0.05, the low and high bounds form a 95% confidence interval for the effect. Default is 0.05.
  • Filters (optional): Conditions on columns used to filter the original dataset. If set, only the matching subset of the original data is used in the analysis.
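
As a worked example of the Alpha Effect parameter, the sketch below turns an alpha value into low and high bounds. It assumes a symmetric normal-approximation interval around a point estimate. The estimate and standard error are made-up numbers, the function name is hypothetical, and the tool may compute its bounds differently.

```python
from statistics import NormalDist


def effect_bounds(effect, std_error, alpha=0.05):
    """Return (low, high) bounds of a (1 - alpha) normal-approximation interval."""
    # Two-sided critical value, about 1.96 for alpha = 0.05.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return effect - z * std_error, effect + z * std_error


# Hypothetical point estimate and standard error, for illustration only.
low, high = effect_bounds(effect=120.0, std_error=15.0, alpha=0.05)
print(f"95% interval: [{low:.1f}, {high:.1f}]")

# A smaller alpha asks for more confidence and therefore widens the interval.
low_99, high_99 = effect_bounds(effect=120.0, std_error=15.0, alpha=0.01)
print(f"99% interval: [{low_99:.1f}, {high_99:.1f}]")
```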

Result View

  • Counterfactual: Table showing the counterfactual effect of the treatment on the outcome. It contains the columns of the original data set plus new columns for the counterfactual effect. The <target>_intervened column shows the outcome after the intervention. The intervention_effect column shows the difference between the intervened outcome and the original outcome. The <target>_intervened_[high|low] columns show the high and low bounds of the outcome after the intervention. The intervention_effect_[high|low] columns show the high and low bounds of the difference between the intervened outcome and the original outcome.
  • Validation: Shows the validation results of the double machine learning algorithm. The treatment model predicts the treatment from the common causes, and the outcome model predicts the outcome from the common causes. A final model is then fitted on the residuals of those two models to estimate the counterfactual effect. The feature importance shows the importance of each variable in the model. Finally, the residual plot shows the residuals of the outcome model and the residuals of the treatment model.
  • Causal Graph: A directed acyclic graph (DAG) that visualizes the user's selection of treatment, outcome, and common causes. Each edge represents a causal relationship between two variables. Yellow stands for the Treatment, white for the Common causes, and red for the Outcome.
  • Intervention Plot: The first plot shows the original target on the x axis and the intervened target on the y axis. A point above the identity line means the target increased; a point below it means the target decreased. If the intervention is categorical, points are colored by pair of categories (current_intervention -> new_intervention). If the intervention is numeric, points are colored by the difference between the interventions (new_intervention - current_intervention). The second plot is the effect plot. Its x axis is the difference between the interventions if the intervention is numeric, or each pair of categories if it is categorical. Its y axis is the intervention effect (new_target - original_target).
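
The double machine learning procedure described under Validation can be sketched in a few lines. This is a minimal illustration, not the tool's implementation: it assumes a single common cause, plain least-squares lines standing in for the treatment and outcome models, a constant effect, and simulated data with a true effect of 2.0. All function and variable names are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Least-squares intercept and slope for ys ~ xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var
    return mean_y - slope * mean_x, slope


def cross_fitted_residuals(cause, target, folds):
    """Residuals of target ~ cause, each row predicted by a model that never saw it."""
    n = len(cause)
    residuals = [0.0] * n
    for k in range(folds):
        # Interleaved fold assignment: row i belongs to fold i % folds.
        train = [i for i in range(n) if i % folds != k]
        intercept, slope = fit_line([cause[i] for i in train],
                                    [target[i] for i in train])
        for i in range(k, n, folds):  # held-out rows of fold k
            residuals[i] = target[i] - (intercept + slope * cause[i])
    return residuals


def dml_effect(cause, treatment, outcome, folds=5):
    """Estimate a constant treatment effect from residual-on-residual regression."""
    t_res = cross_fitted_residuals(cause, treatment, folds)  # treatment model
    y_res = cross_fitted_residuals(cause, outcome, folds)    # outcome model
    # Final model: slope of outcome residuals on treatment residuals.
    return sum(t * y for t, y in zip(t_res, y_res)) / sum(t * t for t in t_res)


# Simulated data (assumption): the common cause drives treatment and outcome.
random.seed(1)
n = 2000
cause = [random.gauss(0, 1) for _ in range(n)]
treatment = [c + random.gauss(0, 1) for c in cause]
outcome = [2.0 * t + 5.0 * c + random.gauss(0, 1)
           for t, c in zip(treatment, cause)]

effect = dml_effect(cause, treatment, outcome, folds=5)

# Counterfactual columns under the constant-effect assumption.
new_treatment = [t + 1 for t in treatment]
outcome_intervened = [y + effect * (nt - t)
                      for y, t, nt in zip(outcome, treatment, new_treatment)]
intervention_effect = [yi - y for yi, y in zip(outcome_intervened, outcome)]

print(f"estimated effect: {effect:.2f}")
```

Fitting each nuisance model on the other folds and predicting only the held-out rows is what the Cross-validation parameter controls in this sketch.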

Case Study

Imagine we are a real estate agent and we would like to see how the price of a house changes if the number of rooms is increased or decreased. To do this, we take an existing dataset of houses and add a new column called new_number_of_rooms, which holds the number of rooms after the intervention. Our intervention is number_of_rooms + random.choice([-1, 1]): the number of rooms is randomly increased or decreased by 1.

An example of the dataset could be:

number_of_rooms  new_number_of_rooms  sqft    rental_price
0                1                    484.8   2271
1                2                    674     2167
1                0                    554     1883
0                1                    529     2431
3                2                    1219    5510
1                2                    398     2272
3                2                    1190    4123.812
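
The new column can be produced with a few lines of Python. The sketch below uses the number_of_rooms, sqft and rental_price values from the example rows and draws new_number_of_rooms as described above. Because the draw is random, the generated values will not necessarily match the table. The list-of-dicts layout and the fixed seed are assumptions made for the example.

```python
import random

random.seed(42)  # fixed seed (assumption) so the example is reproducible

# Rows taken from the example dataset above.
houses = [
    {"number_of_rooms": 0, "sqft": 484.8, "rental_price": 2271},
    {"number_of_rooms": 1, "sqft": 674, "rental_price": 2167},
    {"number_of_rooms": 1, "sqft": 554, "rental_price": 1883},
    {"number_of_rooms": 0, "sqft": 529, "rental_price": 2431},
    {"number_of_rooms": 3, "sqft": 1219, "rental_price": 5510},
    {"number_of_rooms": 1, "sqft": 398, "rental_price": 2272},
    {"number_of_rooms": 3, "sqft": 1190, "rental_price": 4123.812},
]

# Intervention: randomly add or remove one room on every row.
for house in houses:
    house["new_number_of_rooms"] = house["number_of_rooms"] + random.choice([-1, 1])

for house in houses:
    print(house["number_of_rooms"], house["new_number_of_rooms"],
          house["sqft"], house["rental_price"])
```

A row with 0 rooms can draw -1 under this rule, so such rows may need to be filtered or clamped before running the analysis.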

To run the counterfactual analysis, we need to specify the following parameters:

  • The target is the column rental_price.
  • The current intervention is the number_of_rooms.
  • The new intervention is the column new_number_of_rooms.
  • The common cause is the column sqft.
_images/case_study_parameters.png

After pressing the Run button, the following will be displayed:

Table

A table containing the counterfactual effect of the treatment on the outcome for each apartment.

_images/case_study_table.png

This table can be downloaded by clicking on the Download button.

_images/case_study_download.png

Validation

The validation for the double machine learning models.

_images/case_study_importance.png

Here we can see that the treatment model and the outcome model were both able to learn from the common cause, which means that the rental price and the number of rooms can largely be predicted from sqft alone.

_images/case_study_residuals.png

Here we observe the residuals of the two models.

Causal Graph

_images/case_study_graph.png

The causal graph helps visualize the causal relationships between the variables. From the graph we can see that sqft is a cause of both rental_price and number_of_rooms, and that number_of_rooms is a cause of rental_price.
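
The graph in this case study can also be written down as a small adjacency mapping, which makes the role of each variable easy to check. The dictionary layout and the helper name below are assumptions made for illustration.

```python
# Each key is a cause; each value lists the variables it directly affects.
causal_graph = {
    "sqft": ["number_of_rooms", "rental_price"],
    "number_of_rooms": ["rental_price"],
    "rental_price": [],
}


def common_causes(graph, treatment, outcome):
    """Return variables with a direct edge into both the treatment and the outcome."""
    return [
        node for node, children in graph.items()
        if treatment in children and outcome in children
    ]


confounders = common_causes(causal_graph, "number_of_rooms", "rental_price")
print(confounders)  # prints ['sqft']
```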

Intervention Plot

_images/case_study_effect.png

The x axis is the original rental price and the y axis is the predicted rental price after the intervention. The color of each point corresponds to the difference in the number of rooms: a red point means we added 1 to the number of rooms, and a blue point means we subtracted 1 from it. In general, the predicted rental price is higher when the number of rooms increases, except when the rental price is originally low. We might hypothesize that a cheap apartment is too small to fit an additional room, so adding one decreases the price. We can also see that the effect of adding or removing a room is stronger when the original price is higher.

_images/case_study_inteff.png

On this plot we can see the range of the effect of the intervention. The x axis is the intervention difference (new_number_of_rooms - number_of_rooms) and the y axis is the effect of the intervention (rental_price_intervened - rental_price).