Causal estimation
by: Meera Machado & Valdemar Stentoft-Hansen
(Feyn version 3.1.0 or newer)
You ask: how do you capture the effect size of a given feature in your model, and how do you evaluate the significance of that effect?
In causal inference, one asks what would have happened to the gathered data had the experimental conditions been different. In practice that is unobservable: we cannot expose a group of individuals to a condition, measure the outcome, then go back in time, undo the exposure and measure the outcome again.
Feyn daringly enters the field of causal inference using Targeted Maximum Likelihood Estimation (TMLE) [1, 2].
TMLE allows machine learning models to specify the functional form of the problem, replacing the oversimplified linear model usually applied to this task. In its default form, TMLE applies an ensemble of learners to find the best fit to the data, thereby guarding against the model misspecification that tends to occur with a large number of variables and potentially complex relationships between them. However, leaving the model specification to a more or less black-box machine learning model can lead to unfortunate surprises.
The QLattice hits a sweet spot: instead of the ensemble of machine learning models that TMLE applies by default, it surfaces how the confounding mechanisms work by displaying the covariates' effects on both the outcome and the exposure variable. Showing the model dynamics adds confidence in the model, since standard sanity checks remain applicable.
A simple example
In this TMLE with QLattice guide, we will use the synthetic dataset presented in [3]. There are three confounders: `sex`, baseline psychosocial therapy for depression (`therapy`), and baseline antidepressant use (`antidepressant_use`). The exposure variable represents regular physical exercise (`exercise`). Lastly, the outcome variable represents depressive symptoms as measured by the Center for Epidemiologic Studies Depression Scale (`CES_D_score`). Below we show how the data is generated:
import numpy as np
import pandas as pd
import feyn
np.random.seed(1)
N = 1000 # Number of samples
# Confounding variables
x1 = np.random.binomial(1, 0.55, size=N) # Sex
x2 = np.random.binomial(1, 0.3, size=N) # Therapy
x3 = np.random.binomial(1, 0.25, size=N) # Antidepressant use
def logistic(x):
    y = 1 / (1 + np.exp(-x))
    return y
# Exposure variable: regular physical exercise
A = np.random.binomial(1, logistic(-0.5 + 0.75 * x1 + 1. * x2 + 1.5 * x3), size=N)
# Outcome variable: CES-D score
Y = 24. - 3 * A + 3 * x1 - 4 * x2 - 6 * x3 - 1.5 * A * x3 + np.random.normal(0, 4.5, size=N)
# Making a dataframe:
data = pd.DataFrame(np.column_stack([A, x1, x2, x3, Y]), columns=['exercise', 'sex', 'therapy', 'antidepressant_use', 'CES_D_score'])
data.head()
|   | exercise | sex | therapy | antidepressant_use | CES_D_score |
|---|----------|-----|---------|--------------------|-------------|
| 0 | 0.0      | 1.0 | 0.0     | 0.0                | 33.196669   |
| 1 | 1.0      | 0.0 | 1.0     | 1.0                | 3.811051    |
| 2 | 1.0      | 1.0 | 1.0     | 0.0                | 17.807860   |
| 3 | 0.0      | 1.0 | 1.0     | 0.0                | 23.313328   |
| 4 | 0.0      | 1.0 | 0.0     | 0.0                | 26.097713   |
In the current TMLE framework, we are interested in calculating the average difference in outcome (`CES_D_score`) between those who did exercise (`exercise` = 1) and those who didn't exercise (`exercise` = 0). This mean difference in `CES_D_score` is calculated as

$$\Psi = E_W\big[\,E[Y \mid A = 1, W] - E[Y \mid A = 0, W]\,\big]$$

where $Y$ represents the outcome variable `CES_D_score`, $A$ is the exposure variable `exercise` and $W$ the confounding variables `sex`, `antidepressant_use` and `therapy`.

You can interpret the statistical estimand $\Psi$ as the mean difference in outcome if you were able to measure it for a world where all individuals were exposed to a certain condition and a parallel world where these same individuals were not exposed.

If we can identify the exposure variable $A$ as having a causal effect on the outcome $Y$, then $\Psi$ is identifiable as the Average Treatment Effect (ATE).
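Because the dataset is synthetic, the true ATE is actually known: from the outcome equation it is −3 − 1.5 · E[x3] = −3 − 1.5 · 0.25 = −3.375. A counterfactual simulation, forcing exercise to 1 and to 0 for the same individuals, confirms this (pure NumPy, independent of Feyn):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200_000

# Confounders and noise from the same data-generating process
x1 = rng.binomial(1, 0.55, size=N)
x2 = rng.binomial(1, 0.3, size=N)
x3 = rng.binomial(1, 0.25, size=N)
noise = rng.normal(0, 4.5, size=N)

# Potential outcomes: the same individuals with exposure forced to 1 and to 0
Y1 = 24. - 3 * 1 + 3 * x1 - 4 * x2 - 6 * x3 - 1.5 * 1 * x3 + noise  # everyone exercises
Y0 = 24. - 3 * 0 + 3 * x1 - 4 * x2 - 6 * x3 - 1.5 * 0 * x3 + noise  # nobody exercises

# Shared terms and noise cancel, leaving -3 - 1.5 * x3 per individual
true_ate = (Y1 - Y0).mean()
print(f"True ATE: {true_ate:.3f}")  # about -3.375, up to Monte Carlo error in E[x3]
```

This is the number the TMLE estimate below should land near.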
Before running the TMLE algorithm and calculating the ATE, we need to define an outcome model and an exposure model. In other words, you need a graph that predicts `CES_D_score` given all the input variables cited above, and another that predicts `exercise` given the confounding variables `sex`, `antidepressant_use` and `therapy`. The idea is that the expected outcome values are corrected for individuals whose confounder values make them more likely to `exercise`.
Ideally, you would go through multiple iterations of training loops before choosing an outcome and an exposure model (see Using Feyn). For simplicity, we will run just a single training loop for each.
Outcome model
Let's quickly extract an outcome model:
train, test = feyn.tools.split(data, ratio=[0.7, 0.3], stratify='exercise', random_state=1)
stypes = {'exercise': 'c', 'sex': 'c', 'therapy': 'c', 'antidepressant_use': 'c'}
target = 'CES_D_score'
ql = feyn.QLattice()
models = ql.auto_run(
train,
target,
stypes=stypes,
max_complexity=7
)
outcome_model = models[0]
For a little bit of sanity checking, let's look at the R² score on the train and test sets:
print(outcome_model.r2_score(train))
print(outcome_model.r2_score(test))
0.4775462925880104
0.4163888995146031
Exposure model
Let's quickly extract an exposure model:
stypes = {'sex': 'c', 'therapy': 'c', 'antidepressant_use': 'c'}
target = 'exercise'
ql.reset()
models = ql.auto_run(
train[['sex', 'therapy', 'antidepressant_use', target]],
target,
'classification',
stypes=stypes,
max_complexity=7
)
exposure_model = models[0]
Again, we just check the AUC score for train and test sets:
print(exposure_model.roc_auc_score(train))
print(exposure_model.roc_auc_score(test))
0.6929030373831776
0.6744196908131335
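Besides discrimination (AUC), TMLE relies on the positivity assumption: every confounder combination must have an exposure probability bounded away from 0 and 1, otherwise some strata have no comparable exposed or unexposed individuals. Since we know the true exposure mechanism for this synthetic data, we can check it directly; this is a sketch outside the Feyn API:

```python
import numpy as np
from itertools import product

def logistic(x):
    return 1 / (1 + np.exp(-x))

# True propensity of exercise for each (sex, therapy, antidepressant_use) combination,
# taken from the data-generating process above
propensities = {
    (x1, x2, x3): logistic(-0.5 + 0.75 * x1 + 1. * x2 + 1.5 * x3)
    for x1, x2, x3 in product([0, 1], repeat=3)
}
for combo, p in propensities.items():
    print(combo, f"{p:.3f}")
# All values lie between ~0.38 and ~0.94, well inside (0, 1): positivity holds
```

With real data the same check would be run on the exposure model's predicted probabilities rather than the (unknown) true mechanism.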
Running TMLE
Now that we are equipped with both an outcome and an exposure model, we can finally run TMLE to calculate the ATE plus standard errors (SE) and 95% confidence intervals (CI). The confidence level is controlled by the `alpha` parameter.
from feyn.inference import TMLE
tmle = TMLE(data, outcome_model, exposure_model, alpha=0.05) # Initialising TMLE
tmle.ATE() # Calculating ATE, SE and CI
print(f"ATE: {tmle.ate:.3f}")
print(f"SE: {tmle.ate_se:.3f}")
print(f"CI: ({tmle.ate_ci[0]:.3f} {tmle.ate_ci[1]:.3f})")
ATE: -3.722
SE: 0.329
CI: (-4.367 -3.077)
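The reported interval is consistent with a standard two-sided Wald construction, ATE ± z(1 − α/2) · SE, under a normal approximation (the usual basis for TMLE inference). We can verify the arithmetic on the printed values using only the standard library:

```python
from statistics import NormalDist

# Values printed by tmle.ATE() above
ate, se, alpha = -3.722, 0.329, 0.05

z = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
ci = (ate - z * se, ate + z * se)
print(f"CI: ({ci[0]:.3f}, {ci[1]:.3f})")  # (-4.367, -3.077), matching the output above
```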
We can also call a `summary` function that outputs the chosen models and the confidence intervals.
tmle.summary()
======================================================================
TMLE w. QLattice
======================================================================
Targeted Maximum Likelihood applied with QLattice graphs.
Average Treatment Effect: -3.722
95.0% two-sided CI: (-4.367 , -3.077)
======================================================================
Outcome model
======================================================================
======================================================================
Exposure model
======================================================================
======================================================================
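As a rough cross-check on the magnitude, a plain plug-in (G-computation) estimate with a correctly specified linear outcome model lands in the same neighbourhood as the TMLE result. This NumPy sketch regenerates the data and fits ordinary least squares; it illustrates the plug-in step that TMLE's targeting refines, and is not Feyn's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1000

# Regenerate the synthetic data
x1 = rng.binomial(1, 0.55, size=N)
x2 = rng.binomial(1, 0.3, size=N)
x3 = rng.binomial(1, 0.25, size=N)
logit = -0.5 + 0.75 * x1 + 1. * x2 + 1.5 * x3
A = rng.binomial(1, 1 / (1 + np.exp(-logit)))
Y = 24. - 3 * A + 3 * x1 - 4 * x2 - 6 * x3 - 1.5 * A * x3 + rng.normal(0, 4.5, size=N)

# Fit OLS with the correct functional form (including the A * x3 interaction)
X = np.column_stack([np.ones(N), A, x1, x2, x3, A * x3])
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)

# G-computation: predict for everyone under A=1 and under A=0, average the difference
X1 = np.column_stack([np.ones(N), np.ones(N), x1, x2, x3, x3])
X0 = np.column_stack([np.ones(N), np.zeros(N), x1, x2, x3, np.zeros(N)])
plugin_ate = (X1 @ beta - X0 @ beta).mean()
print(f"Plug-in ATE: {plugin_ate:.3f}")  # near the true -3.375, like the TMLE estimate
```

Unlike this sketch, which only works because we fed it the correct functional form, the QLattice and TMLE do not need to be told the true model; that is precisely their appeal on real data.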
References:
[1] van der Laan M.J., Rose S., Targeted Learning: Causal Inference for Observational and Experimental Data. New York, NY: Springer-Verlag; 2011.
[2] van der Laan M.J., Rubin D.B., Targeted maximum likelihood learning. Int J Biostat. 2006;2(1):Article 11.
[3] Schuler M.S., Rose S., Targeted Maximum Likelihood Estimation for Causal Inference in Observational Studies. Am J Epidemiol. 2017;185(1):65-73.