Rewriting models with correlated inputs
by: Meera Machado
Feyn version: 2.1.1+
Last updated: 13/10/2021
In this tutorial we use the query language to generate models in which an input variable is substituted by another variable that is correlated with it.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import feyn
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
Dataset and motivation
When we consider the range of data gathered to understand and diagnose diseases, it is not surprising that some input variables are correlated with each other. After all, when studying a living being, we would expect many of its internal processes to be interconnected.
The QLattice often outputs models that do not contain all the input variables in a dataset. It is therefore likely that two correlated variables will not show up together in the same model, and that one will be chosen over the other. This raises the question: can we substitute one of a model's input variables with another that is correlated with it?
The Wisconsin Breast Cancer dataset is well suited to this input-swapping workflow because some of its variables are highly correlated.
data_dict = load_breast_cancer()
data = pd.DataFrame(data=data_dict['data'], columns=data_dict['feature_names'])
data['target'] = data_dict['target']
corr_matrix = data.drop('target', axis=1).corr()
fig, ax = plt.subplots(figsize=(12, 12))
im = ax.imshow(corr_matrix.values, cmap="feyn-diverging", vmin=-1, vmax=1)
plt.colorbar(im, ax=ax)
ax.set_xticks(np.arange(len(corr_matrix.columns)))
ax.set_yticks(np.arange(len(corr_matrix.index)))
ax.set_xticklabels(corr_matrix.columns, rotation=90)
ax.set_yticklabels(corr_matrix.index)
plt.title('Pearson correlation between input variables')
plt.show()
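A heatmap of 30 inputs can be hard to read pair by pair, so a ranked list of the most correlated pairs may help alongside it. The following is a minimal sketch on synthetic data: the helper `top_correlated_pairs` and the toy columns are hypothetical and not part of the tutorial or of Feyn.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, n=5):
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = df.corr()
    cols = corr.columns
    # Indices of the strict upper triangle, so each pair is listed once
    i, j = np.triu_indices(len(cols), k=1)
    pairs = pd.DataFrame({
        "input_a": cols[i],
        "input_b": cols[j],
        "corr": corr.values[i, j],
    })
    # Rank by magnitude so strong negative correlations are not overlooked
    order = pairs["corr"].abs().sort_values(ascending=False).index
    return pairs.loc[order].head(n).reset_index(drop=True)

# Toy data: "perimeter" is almost a linear function of "radius"
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "radius": x,
    "perimeter": 2 * np.pi * x + rng.normal(scale=0.1, size=200),
    "texture": rng.normal(size=200),
})
top_pairs = top_correlated_pairs(toy, n=2)
print(top_pairs)
```

On the tutorial's data, the same helper could be called as `top_correlated_pairs(data.drop('target', axis=1))`.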
Training session
Train-validation-holdout split
rseed = 6095
train, test = train_test_split(data, test_size=0.4, stratify=data['target'], random_state=rseed)
validation, holdout = train_test_split(test, test_size=0.5, stratify=test['target'], random_state=rseed)
Running QLattice
We run the QLattice as usual to obtain a model on which we will perform the input swap:
ql = feyn.connect_qlattice()
ql.reset(rseed)
models = ql.auto_run(train, 'target', kind='classification', n_epochs=20)
best_model = models[0]
Swapping inputs
Let's select the model input that has the highest absolute Pearson correlation with another input variable:
# Correlations for the inputs of interest (train set)
corr_matrix_inputs = train.drop('target', axis=1).corr()
corr_matrix_inputs = corr_matrix_inputs.loc[:, best_model.inputs]
# Self-correlations (an input paired with itself) are not interesting
for inp in best_model.inputs:
    corr_matrix_inputs.loc[inp, inp] = np.nan
Extracting the pair with highest Pearson correlation:
# Highest correlated pair for each input
highest_pairs = np.abs(corr_matrix_inputs).idxmax(axis=0)
# Getting the values themselves
highest_corr_values = []
for inp in best_model.inputs:
    candidate = highest_pairs.loc[inp]
    value = corr_matrix_inputs.loc[candidate, inp]
    highest_corr_values.append([inp, candidate, value])
highest_corr_values = pd.DataFrame(highest_corr_values, columns=['input', 'candidate', 'corr'])
# Extract the pair with the highest absolute Pearson correlation value
# (ranking by the signed value would miss strong negative correlations)
max_pair = highest_corr_values.loc[highest_corr_values['corr'].abs().idxmax()]
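Because Pearson correlations can be negative, ranking pairs by signed value and ranking them by magnitude can select different pairs. The sketch below walks through the same selection steps on a hypothetical three-variable correlation table; the names `a`, `b`, `c` and the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical correlations: all inputs as rows, model inputs as columns
corr = pd.DataFrame(
    {"a": [1.0, 0.80, -0.20], "b": [0.80, 1.0, -0.95]},
    index=["a", "b", "c"],
)
model_inputs = ["a", "b"]

# Mask self-correlations so an input is never its own candidate
for inp in model_inputs:
    corr.loc[inp, inp] = np.nan

# Best candidate per model input, by magnitude (idxmax skips NaN entries)
candidates = corr.abs().idxmax(axis=0)
rows = [[inp, candidates[inp], corr.loc[candidates[inp], inp]] for inp in model_inputs]
pairs = pd.DataFrame(rows, columns=["input", "candidate", "corr"])

# Two ways of picking the final pair
by_signed = pairs.sort_values("corr", ascending=False).iloc[0]
by_magnitude = pairs.loc[pairs["corr"].abs().idxmax()]

print(by_signed["input"], by_signed["candidate"])
print(by_magnitude["input"], by_magnitude["candidate"])
```

Here the signed ranking picks the pair with correlation 0.80, while the magnitude ranking picks the pair with correlation -0.95, which is the more strongly correlated of the two.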
Next we will substitute the selected input with its candidate in our best_model.
Model to query_string
A straightforward way to generate the same Model architecture as best_model while swapping one of its inputs for another variable is via the query language.
# First we go from the `Model` to its `query_string` representation:
bm_query_str = best_model.to_query_string()
# Then we swap the original input in `best_model` with the chosen candidate:
bm_query_str = bm_query_str.replace(max_pair['input'], max_pair['candidate'])
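A plain `str.replace` also matches the name wherever it occurs as part of a longer name, which could matter when one input name is contained in another. The following is a hedged sketch of a stricter swap. It assumes that input names appear in the query string wrapped in single or double quotes, which should be checked against the actual output of `to_query_string`; the helper `swap_quoted_input` and the toy query string are hypothetical.

```python
import re

def swap_quoted_input(query_string, old, new):
    """Replace `old` with `new` only where `old` is a complete quoted name."""
    # Match an opening quote, the exact name, then the same closing quote
    quote = r"""(["'])"""
    pattern = re.compile(quote + re.escape(old) + r"\1")
    swapped, n_hits = pattern.subn(
        lambda m: m.group(1) + new + m.group(1), query_string
    )
    if n_hits == 0:
        raise ValueError(f"{old!r} not found as a quoted input")
    return swapped

# Toy query string (made up): "area" is also part of the name "mean area"
toy_query = 'add("area", "mean area")'
naive = toy_query.replace("area", "perimeter")
strict = swap_quoted_input(toy_query, "area", "perimeter")
print(naive)
print(strict)
```

In this toy case the naive replacement also rewrites "mean area", while the stricter version only touches the standalone "area" input.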
Training again
Lastly, we generate and train new models with the substituted input by passing the query_string above:
models_subst = ql.auto_run(train, 'target', kind='classification', query_string=bm_query_str, n_epochs=2)
best_subst = models_subst[0]
Comparing the models
Was there an improvement over the original best_model?
from IPython.display import display
display(best_model.plot(train, validation))
display(best_subst.plot(train, validation))