Causal estimation
by: Meera Machado & Valdemar Stentoft-Hansen
(Feyn version 3.1.0 or newer)
You ask: how do you capture the effect size of a given feature in your model, and how do you evaluate the significance of that effect?
In causal inference, one asks what would have happened to the gathered data had the experimental conditions been different. In practice that is unobservable: we cannot expose a group of individuals to a condition, measure the outcome, then go back in time, undo the exposure and measure the outcome again.
Feyn daringly enters the field of causal inference using Targeted Maximum Likelihood Estimation (TMLE) [1, 2].
TMLE allows machine learning models to specify the functional form of the problem, replacing the oversimplified linear model usually applied to this task. In its default form, TMLE applies an ensemble of learners to find the best fit to the data, thereby guarding against the model misspecification that tends to occur with a large number of variables and potentially complex relationships between them. However, leaving the model specification to a more or less black-box machine learning model can lead to unfortunate surprises.
The QLattice hits a sweet spot: instead of the ensemble of machine learning models that TMLE applies by default, it surfaces how the confounding mechanisms work by displaying the covariates' effects on both the outcome and the exposure variable. Showing the model dynamics adds confidence in the model, since standard sanity checks remain applicable.
A simple example
In this TMLE with QLattice guide, we will use the synthetic dataset presented in [3]. There are three confounders: `sex`, baseline psychosocial therapy for depression (`therapy`), and baseline antidepressant use (`antidepressant_use`). The exposure variable represents regular physical exercise (`exercise`). Lastly, the outcome variable represents depressive symptoms as measured by the Center for Epidemiologic Studies Depression Scale (`CES_D_score`). Below we show how the data is generated:
import numpy as np
import pandas as pd
import feyn
np.random.seed(1)
N = 1000 # Number of samples
# Confounding variables
x1 = np.random.binomial(1, 0.55, size=N) # Sex
x2 = np.random.binomial(1, 0.3, size=N) # Therapy
x3 = np.random.binomial(1, 0.25, size=N) # Antidepressant use
def logistic(x):
    y = 1 / (1 + np.exp(-x))
    return y
# Exposure variable: regular physical exercise
A = np.random.binomial(1, logistic(-0.5 + 0.75 * x1 + 1. * x2 + 1.5 * x3), size=N)
# Outcome variable: CES-D score
Y = 24. - 3 * A + 3 * x1 - 4 * x2 - 6 * x3 - 1.5 * A * x3 + np.random.normal(0, 4.5, size=N)
# Making a dataframe:
data = pd.DataFrame(np.column_stack([A, x1, x2, x3, Y]), columns=['exercise', 'sex', 'therapy', 'antidepressant_use', 'CES_D_score'])
data.head()
|   | exercise | sex | therapy | antidepressant_use | CES_D_score |
|---|----------|-----|---------|--------------------|-------------|
| 0 | 0.0      | 1.0 | 0.0     | 0.0                | 33.196669   |
| 1 | 1.0      | 0.0 | 1.0     | 1.0                | 3.811051    |
| 2 | 1.0      | 1.0 | 1.0     | 0.0                | 17.807860   |
| 3 | 0.0      | 1.0 | 1.0     | 0.0                | 23.313328   |
| 4 | 0.0      | 1.0 | 0.0     | 0.0                | 26.097713   |
In the current TMLE framework, we are interested in calculating the average difference in outcome (`CES_D_score`) between those who did exercise (`exercise` = 1) and those who didn't exercise (`exercise` = 0). This mean difference in `CES_D_score` is calculated as

$$\Psi = E_W\big[\,E[Y \mid A = 1, W] - E[Y \mid A = 0, W]\,\big]$$

where $Y$ represents the outcome variable `CES_D_score`, $A$ is the exposure variable `exercise` and $W$ the confounding variables `sex`, `antidepressant_use` and `therapy`.

You can interpret the statistical estimand $\Psi$ as the mean difference in outcome if you were able to measure it for a world where all individuals were exposed to a certain condition and a parallel world where these same individuals were not exposed.

If we can identify the exposure variable $A$ as having a causal effect on the outcome $Y$, then $\Psi$ is identifiable as the Average Treatment Effect (ATE).
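Because the dataset is synthetic, the true ATE is actually known: from the outcome equation it is −3 − 1.5 · E[x3] = −3 − 1.5 · 0.25 = −3.375. A counterfactual simulation, forcing exercise to 1 and to 0 for the same individuals, confirms this (pure NumPy, independent of Feyn):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200_000

# Confounders and noise from the same data-generating process
x1 = rng.binomial(1, 0.55, size=N)
x2 = rng.binomial(1, 0.3, size=N)
x3 = rng.binomial(1, 0.25, size=N)
noise = rng.normal(0, 4.5, size=N)

# Potential outcomes: the same individuals with exposure forced to 1 and to 0
Y1 = 24. - 3 * 1 + 3 * x1 - 4 * x2 - 6 * x3 - 1.5 * 1 * x3 + noise  # everyone exercises
Y0 = 24. - 3 * 0 + 3 * x1 - 4 * x2 - 6 * x3 - 1.5 * 0 * x3 + noise  # nobody exercises

# Shared terms and noise cancel, leaving -3 - 1.5 * x3 per individual
true_ate = (Y1 - Y0).mean()
print(f"True ATE: {true_ate:.3f}")  # about -3.375, up to Monte Carlo error in E[x3]
```

This is the number the TMLE estimate below should land near.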
Before running the TMLE algorithm and calculating the ATE, we need to define an outcome model and an exposure model. In other words, you need a graph that predicts `CES_D_score` given all the input variables cited above, and another that predicts `exercise` given the confounding variables `sex`, `antidepressant_use` and `therapy`. The idea is that the expected outcome values are corrected for individuals whose confounder values make them more likely to `exercise`.
Ideally, you would go through multiple iterations of training loops before choosing an outcome and an exposure model (see Using Feyn). For simplicity, we will run just a single training loop for each.
Outcome model
Let's quickly extract an outcome model:
train, test = feyn.tools.split(data, ratio=[0.7, 0.3], stratify='exercise', random_state=1)
stypes = {'exercise': 'c', 'sex': 'c', 'therapy': 'c', 'antidepressant_use': 'c'}
target = 'CES_D_score'
ql = feyn.QLattice()
models = ql.auto_run(
train,
target,
stypes=stypes,
max_complexity=7
)
outcome_model = models[0]
For a little bit of sanity checking, let's look at the R² score on the train and test sets:
print(outcome_model.r2_score(train))
print(outcome_model.r2_score(test))
0.4775462925880104
0.4163888995146031
Exposure model
Let's quickly extract an exposure model:
stypes = {'sex': 'c', 'therapy': 'c', 'antidepressant_use': 'c'}
target = 'exercise'
ql.reset()
models = ql.auto_run(
train[['sex', 'therapy', 'antidepressant_use', target]],
target,
'classification',
stypes=stypes,
max_complexity=7
)
exposure_model = models[0]
Again, we just check the AUC score for train and test sets:
print(exposure_model.roc_auc_score(train))
print(exposure_model.roc_auc_score(test))
0.6929030373831776
0.6744196908131335
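Besides discrimination (AUC), TMLE relies on the positivity assumption: every confounder combination must have an exposure probability bounded away from 0 and 1, otherwise some strata have no comparable exposed or unexposed individuals. Since we know the true exposure mechanism for this synthetic data, we can check it directly; this is a sketch outside the Feyn API:

```python
import numpy as np
from itertools import product

def logistic(x):
    return 1 / (1 + np.exp(-x))

# True propensity of exercise for each (sex, therapy, antidepressant_use) combination,
# taken from the data-generating process above
propensities = {
    (x1, x2, x3): logistic(-0.5 + 0.75 * x1 + 1. * x2 + 1.5 * x3)
    for x1, x2, x3 in product([0, 1], repeat=3)
}
for combo, p in propensities.items():
    print(combo, f"{p:.3f}")
# All values lie between ~0.38 and ~0.94, well inside (0, 1): positivity holds
```

With real data the same check would be run on the exposure model's predicted probabilities rather than the (unknown) true mechanism.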
Running TMLE
Now that we are equipped with both an outcome and an exposure model, we can finally run TMLE to calculate the ATE plus standard errors (SE) and 95% confidence intervals (CI). The confidence level is controlled by the `alpha` parameter.
from feyn.inference import TMLE
tmle = TMLE(data, outcome_model, exposure_model, alpha=0.05) # Initialising TMLE
tmle.ATE() # Calculating ATE, SE and CI
print(f"ATE: {tmle.ate:.3f}")
print(f"SE: {tmle.ate_se:.3f}")
print(f"CI: ({tmle.ate_ci[0]:.3f} {tmle.ate_ci[1]:.3f})")
ATE: -3.722
SE: 0.329
CI: (-4.367 -3.077)
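The reported interval is consistent with a standard two-sided Wald construction, ATE ± z(1 − α/2) · SE, under a normal approximation (the usual basis for TMLE inference). We can verify the arithmetic on the printed values using only the standard library:

```python
from statistics import NormalDist

# Values printed by tmle.ATE() above
ate, se, alpha = -3.722, 0.329, 0.05

z = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
ci = (ate - z * se, ate + z * se)
print(f"CI: ({ci[0]:.3f}, {ci[1]:.3f})")  # (-4.367, -3.077), matching the output above
```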
We can also call a `summary` function that outputs the chosen models and the confidence intervals.
tmle.summary()
======================================================================
TMLE w. QLattice
======================================================================
Targeted Maximum Likelihood applied with QLattice graphs.
Average Treatment Effect: -3.722
95.0% two-sided CI: (-4.367 , -3.077)
======================================================================
Outcome model
======================================================================
======================================================================
Exposure model
======================================================================
======================================================================
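As a rough cross-check on the magnitude, a plain plug-in (G-computation) estimate with a correctly specified linear outcome model lands in the same neighbourhood as the TMLE result. This NumPy sketch regenerates the data and fits ordinary least squares; it illustrates the plug-in step that TMLE's targeting refines, and is not Feyn's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1000

# Regenerate the synthetic data
x1 = rng.binomial(1, 0.55, size=N)
x2 = rng.binomial(1, 0.3, size=N)
x3 = rng.binomial(1, 0.25, size=N)
logit = -0.5 + 0.75 * x1 + 1. * x2 + 1.5 * x3
A = rng.binomial(1, 1 / (1 + np.exp(-logit)))
Y = 24. - 3 * A + 3 * x1 - 4 * x2 - 6 * x3 - 1.5 * A * x3 + rng.normal(0, 4.5, size=N)

# Fit OLS with the correct functional form (including the A * x3 interaction)
X = np.column_stack([np.ones(N), A, x1, x2, x3, A * x3])
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)

# G-computation: predict for everyone under A=1 and under A=0, average the difference
X1 = np.column_stack([np.ones(N), np.ones(N), x1, x2, x3, x3])
X0 = np.column_stack([np.ones(N), np.zeros(N), x1, x2, x3, np.zeros(N)])
plugin_ate = (X1 @ beta - X0 @ beta).mean()
print(f"Plug-in ATE: {plugin_ate:.3f}")  # near the true -3.375, like the TMLE estimate
```

Unlike this sketch, which only works because we fed it the correct functional form, the QLattice and TMLE do not need to be told the true model; that is precisely their appeal on real data.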
References:
[1] van der Laan M.J., Rose S., Targeted Learning: Causal Inference for Observational and Experimental Data. New York, NY: Springer-Verlag; 2011.
[2] van der Laan M.J., Rubin D.B., Targeted maximum likelihood learning. Int J Biostat. 2006;2(1):Article 11.
[3] Schuler M.S., Rose S., Targeted Maximum Likelihood Estimation for Causal Inference in Observational Studies. Am J Epidemiol. 2017;185(1):65-73.