# Causal estimation

by: Meera Machado & Valdemar Stentoft-Hansen

(Feyn version 1.4.6 or newer)

You ask: How do you capture the effect size of a given feature in your model? And also how do you evaluate the significance of this effect?

In the field of **causal inference**, one is interested in what happens with the gathered data if experimental conditions would change. Practically, that's not possible since we can't expose a group of individuals to a certain condition, measure the outcome and go back in time to undo this process and measure the outcome again.

`Feyn`

daringly enters the field of **causal inference** using Targeted Maximum Likelihood Estimation (TMLE) [1, 2].

TMLE allows for machine learning models to specify the functional form of the problem replacing the oversimplificating linear model usually applied for this task. TMLE applies an ensemble of learners to find the best fit to the data thereby solving for any model misspecifications that will occur with a large number of variables and potentially complex replationships between them. However, leaving the model specification to a more or less black box machine learning model can lead to unfortunate surprises.

The QLattice hits the sweet spot by replacing the ensemble of machine learning models TMLE applies in its default form to instead surface how the confounding mechanisms work by displaying the covariates effect on both the outcome as well as the exposure variable. Showing the model dynamics adds confidence in the model since standard sanity checking is applicable.

## A simple example

In this TMLE with QLattice guide, we will use the synthetic dataset presented in [3]. There are three confounders: `sex`

, baseline psychosocial therapy for depression (`therapy`

), and baseline antidepressant use (`antidepressant_use`

). There is also an exposure variable which represents physical exercise (`exercise`

). Lastly, the outcome variable represents depressive symptoms as measured by the Center for Epidemiology Studies Depression Scale (`CES_D_score`

). Below we show how the data is generated:

```
import numpy as np
import pandas as pd
import feyn
np.random.seed(1)
N = 1000 # Number of samples
# Confounding variables
x1 = np.random.binomial(1, 0.55, size=N) # Sex
x2 = np.random.binomial(1, 0.3, size=N) # Therapy
x3 = np.random.binomial(1, 0.25, size=N) # Antidepressant use
def logistic(x):
y = 1 / (1 + np.exp(-x))
return y
# Exposure variable: regular physical exercise
A = np.random.binomial(1, logistic(-0.5 + 0.75 * x1 + 1. * x2 + 1.5 * x3), size=N)
# Outcome variable: CES-D score
Y = 24. - 3 * A + 3 * x1 - 4 * x2 - 6 * x3 - 1.5 * A * x3 + np.random.normal(0, 4.5, size=N)
# Making a dataframe:
data = pd.DataFrame(np.column_stack([A, x1, x2, x3, Y]), columns=['exercise', 'sex', 'therapy', 'antidepressant_use', 'CES_D_score'])
data.head()
```

exercise | sex | therapy | antidepressant_use | CES_D_score | |
---|---|---|---|---|---|

0 | 0.0 | 1.0 | 0.0 | 0.0 | 33.196669 |

1 | 1.0 | 0.0 | 1.0 | 1.0 | 3.811051 |

2 | 1.0 | 1.0 | 1.0 | 0.0 | 17.807860 |

3 | 0.0 | 1.0 | 1.0 | 0.0 | 23.313328 |

4 | 0.0 | 1.0 | 0.0 | 0.0 | 26.097713 |

In the current TMLE framework, we are interested in calculating the average difference in outcome (`CES_D_score`

) between those who did exercise (`exercise`

= 1) and those who didn't exercise (`exercise`

= 0). This mean difference in `CES_D_score`

is calculated as

$\psi = E_X[E[Y | A = 1, \mathbf{X}] - E[Y | A = 0, \mathbf{X}]],$

where $Y$ represents the outcome variable `CES_D_score`

, $A$ is the exposure variable `exercise`

and $\mathbf{X}$ the confounding variables `sex`

, `antidepressant_use`

and `therapy`

.

You can interpret the statistical estimand $\psi$ as the mean difference in outcome if you were able to measure it for a world where all individuals were exposed to a certain condition $A$ and a parallel world where these same individuals were not exposed.

If we can identify the exposure variable $A$ as having a causal effect on the outcome $Y$, then $\psi$ could be identifiable as the Average Treatment Effect ($ATE$).

Before running the TMLE algorithm and calculating the $ATE$ we need to define an **outcome model** and an **exposure model**. In other words, you need a graph that predicts `CES_D_score`

given all input variables cited above and another that predicts `exercise`

given the confounding variables `sex`

, `antidepressant_use`

and `therapy`

. The idea is that your expected outcome values $E[Y | A, \mathbf{X}]$ are corrected for people with certain $\mathbf{X}$ values who are more likely to `exercise`

.

Ideally, you would go through multiple iterations of training loops until choosing an outcome and an exposure model (see Using Feyn). For simplicity reasons, we will just run a single training loop to extract an outcome and an exposure model.

### Outcome model

Let's quickly extract an **outcome model**:

```
from sklearn.model_selection import train_test_split
train, test = train_test_split(data, test_size=0.3, stratify=data['exercise'], random_state=1)
stypes = {'exercise': 'c', 'sex': 'c', 'therapy': 'c', 'antidepressant_use': 'c'}
target = 'CES_D_score'
ql = feyn.connect_qlattice()
models = ql.auto_run(
train,
target,
stypes=stypes,
max_complexity=7
)
outcome_model = models[0]
```

For a little bit of sanity let's just check the R2 score on the **train** and **test** sets:

```
print(outcome_model.r2_score(train))
print(outcome_model.r2_score(test))
```

```
0.4775462925880104
0.4163888995146031
```

### Exposure model

Let's quickly extract an **exposure model**:

```
stypes = {'sex': 'c', 'therapy': 'c', 'antidepressant_use': 'c'}
target = 'exercise'
ql.reset()
models = ql.auto_run(
train[['sex', 'therapy', 'antidepressant_use', target]],
target,
'classification',
stypes=stypes,
max_complexity=7
)
exposure_model = models[0]
```

Again, we just check the AUC score for **train** and **test** sets:

```
print(exposure_model.roc_auc_score(train))
print(exposure_model.roc_auc_score(test))
```

```
0.6929030373831776
0.6744196908131335
```

### Running TMLE

Now that we are equipped with both an **outcome** and an **exposure** model, we can finally run TMLE to calculate the $ATE$ plus standard errors (SE) and 95% confidence intervals (CI). The latter is controlled by the `alpha`

parameter.

```
from feyn.inference import TMLE
tmle = TMLE(data, outcome_model, exposure_model, alpha=0.05) # Initialising TMLE
tmle.ATE() # Calculating ATE, SE and CI
print(f"ATE: {tmle.ate:.3f}")
print(f"SE: {tmle.ate_se:.3f}")
print(f"CI: ({tmle.ate_ci[0]:.3f} {tmle.ate_ci[1]:.3f})")
```

```
ATE: -3.722
SE: 0.329
CI: (-4.367 -3.077)
```

We can also call a `summary`

function that outputs the chosen models, $ATE$ and confidence intervals.

```
tmle.summary()
```

```
======================================================================
TMLE w. QLattice
======================================================================
Targeted Maximum Likelihood applied with QLattice graphs.
Average Treatment Effect: -3.722
95.0% two-sided CI: (-4.367 , -3.077)
======================================================================
Outcome model
======================================================================
```

```
======================================================================
Exposure model
======================================================================
```

```
======================================================================
```

References:

[1] van der Laan M.J., Rose S., *Targeted Learning: Causal Inference for Observational and Experimental Data*. New York, NY:Springer-Verlag New York; 2011.

[2] van der Laan M.J., Rubin D.B., Targeted maximum likelihood learning. *Int J Biostat.* 2006;2(1):Article 11.

[3] Schuler M.S., Sherri R., Targeted Maximum Likelihood Estimation for Causal Inference in Observational Studies. *Am J Epidemiol.* 2017;185(1):65-73.