Feyn Documentation

Rewriting models with correlated inputs

by: Meera Machado

Feyn version: 2.1.1+

Last updated: 13/10/2021

In this tutorial we use the query language to generate models in which an input variable is replaced by another variable that is correlated with it.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import feyn
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

Dataset and motivation

Given the breadth of data gathered to understand and diagnose diseases, it is not surprising that some input variables are correlated with each other. After all, when studying a living organism, we expect many of its internal processes to be interconnected.

The QLattice often outputs models that do not contain all the input variables in a dataset. It is therefore likely that two correlated variables will not show up together in the same model, and that one is chosen over the other. This raises the question: can we substitute one of the input variables in a model with another variable that is correlated with it?

The Wisconsin Breast Cancer dataset is well suited to this input-swapping workflow because several of its variables are highly correlated.

data_dict = load_breast_cancer()
data = pd.DataFrame(data=data_dict['data'], columns=data_dict['feature_names'])
data['target'] = data_dict['target']
corr_matrix = data.drop('target', axis=1).corr()

fig, ax = plt.subplots(figsize=(12, 12))
im = ax.imshow(corr_matrix.values, cmap="feyn-diverging", vmin=-1, vmax=1)
plt.colorbar(im, ax=ax)

ax.set_xticks(np.arange(len(corr_matrix.columns)))
ax.set_yticks(np.arange(len(corr_matrix.index)))

ax.set_xticklabels(corr_matrix.columns, rotation=90)
ax.set_yticklabels(corr_matrix.index)

plt.title('Pearson correlation between input variables')
plt.show()

[Figure: heatmap of the Pearson correlation between input variables]
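With 30 variables the heatmap can be hard to read, so it may help to list the strongest pairs numerically. The following is a minimal sketch that only uses pandas and scikit-learn; the variable names `upper`, `pairs` and `top_pairs` are illustrative and not part of the original tutorial.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

data_dict = load_breast_cancer()
features = pd.DataFrame(data_dict["data"], columns=data_dict["feature_names"])

corr = features.corr()

# Keep only the strict upper triangle so that each pair appears once
# and the trivial self-correlations on the diagonal are dropped.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Rank the pairs by the absolute value of their Pearson correlation.
order = pairs.abs().sort_values(ascending=False).index
top_pairs = pairs.reindex(order).head(5)
print(top_pairs)
```

Ranking by absolute value keeps strongly negative correlations in view as well, which matters later when choosing a substitute input.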

Training session

Train-validation-holdout split

rseed = 6095
train, test = train_test_split(data, test_size=0.4, stratify=data['target'], random_state=rseed)
validation, holdout = train_test_split(test, test_size=0.5, stratify=test['target'], random_state=rseed)
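As a quick sanity check, one could confirm that the three parts cover the whole dataset and that stratification kept the class balance similar in each. This is a sketch that repeats the split so it runs on its own; the `summary` table is an illustrative addition.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data_dict = load_breast_cancer()
data = pd.DataFrame(data=data_dict["data"], columns=data_dict["feature_names"])
data["target"] = data_dict["target"]

# Same two-step split as in the tutorial: 60% train, then the rest in halves.
rseed = 6095
train, test = train_test_split(data, test_size=0.4, stratify=data["target"], random_state=rseed)
validation, holdout = train_test_split(test, test_size=0.5, stratify=test["target"], random_state=rseed)

# Row counts and fraction of positive targets per part.
parts = {"train": train, "validation": validation, "holdout": holdout}
summary = pd.DataFrame(
    {name: {"rows": len(part), "positive_rate": part["target"].mean()} for name, part in parts.items()}
).T
print(summary)
```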

Running QLattice

We run the QLattice as usual to obtain a well-performing model on which we will perform the input swap:

ql = feyn.connect_qlattice()
ql.reset(rseed)
models = ql.auto_run(train, 'target', kind='classification', n_epochs=20) 
Loss: 2.47E-02
Epoch no. 20/20 - Tried 19819 models - Completed in 1m 24s.
[Model diagram: logistic output "target" built from the inputs worst radius, mean concavity, worst smoothness, mean compactness and mean texture, combined through add and gaussian nodes]
best_model = models[0]

Swapping inputs

Let's select the model input that has the highest absolute Pearson correlation with another input variable:

# Correlations for the inputs of interest (train set)
corr_matrix_inputs = train.drop('target', axis=1).corr()
corr_matrix_inputs = corr_matrix_inputs.loc[:, best_model.inputs]

# Same pairs are not interesting
for inp in best_model.inputs:
    corr_matrix_inputs.loc[inp, inp] = np.nan

Extracting the pair with the highest absolute Pearson correlation:

# Highest correlated pair for each input
highest_pairs = np.abs(corr_matrix_inputs).idxmax(axis=0)

# Getting the values themselves
highest_corr_values = []

for inp in best_model.inputs:
    candidate = highest_pairs.loc[inp]
    value = corr_matrix_inputs.loc[candidate, inp]
    highest_corr_values.append([inp, candidate, value])
    
highest_corr_values = pd.DataFrame(highest_corr_values, columns=['input', 'candidate', 'corr'])

# Extract the pair with the highest absolute Pearson correlation value
# (using the absolute value so that strong negative correlations are not missed)
max_pair = highest_corr_values.loc[highest_corr_values['corr'].abs().idxmax()]
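The selection steps above can also be wrapped in a single function. The sketch below is one possible variant, not the tutorial's own code: the helper name `most_correlated_candidate` is hypothetical, and it additionally excludes candidates that are already model inputs, so that a swap never duplicates a variable inside the model. It is shown on synthetic data so that it runs without a trained model.

```python
import numpy as np
import pandas as pd

def most_correlated_candidate(df, model_inputs):
    """Return (input, candidate, corr) for the strongest absolute Pearson
    correlation between a model input and a column outside the model."""
    corr = df.corr()
    candidates = [col for col in df.columns if col not in model_inputs]
    # Rows are possible substitutes, columns are the current model inputs.
    stacked = corr.loc[candidates, list(model_inputs)].stack().dropna()
    candidate, inp = stacked.abs().idxmax()
    return inp, candidate, stacked.loc[(candidate, inp)]

# Synthetic data: "b" is almost exactly -a, "d" is a noisy copy of "c".
rng = np.random.default_rng(0)
a = rng.normal(size=200)
c = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": -a + 0.01 * rng.normal(size=200),
    "c": c,
    "d": c + 0.5 * rng.normal(size=200),
})

inp, candidate, value = most_correlated_candidate(toy, ["a", "c"])
print(inp, candidate, value)
```

In the tutorial's setting, the call would presumably be `most_correlated_candidate(train.drop('target', axis=1), best_model.inputs)`.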

Next, we replace the input in max_pair with its candidate in best_model.

From Model to query_string

A straightforward way to generate the same Model architecture as best_model, while swapping one of its inputs for another variable, is via the query language.

# First we go from the `Model` to its `query_string` representation:
bm_query_str = best_model.to_query_string()

# Then we swap the original input in `best_model` with the chosen candidate:
bm_query_str = bm_query_str.replace(max_pair['input'], max_pair['candidate'])
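Note that a plain `str.replace` substitutes every occurrence of the substring, which could also alter another input whose name happens to contain the same text. A more defensive variant is sketched below, under the assumption that input names appear in the query string wrapped in quotes. The helper `swap_input`, the example query string and the input names in it are hypothetical and only serve as illustration.

```python
import re

def swap_input(query_string, old_name, new_name):
    """Replace only whole, quoted occurrences of old_name with new_name."""
    # Match the name only when it is enclosed in a matching pair of quotes.
    pattern = re.compile(r'(["\'])' + re.escape(old_name) + r'\1')
    swapped, n_swaps = pattern.subn(
        lambda m: f"{m.group(1)}{new_name}{m.group(1)}", query_string
    )
    if n_swaps == 0:
        raise ValueError(f"input {old_name!r} not found in query string")
    return swapped

# Hypothetical query string in which one input name contains the other.
example = 'add("radius", "worst radius")'

naive = example.replace("radius", "perimeter")
safe = swap_input(example, "radius", "perimeter")

print(naive)  # add("perimeter", "worst perimeter")
print(safe)   # add("perimeter", "worst radius")
```

Raising an error when nothing was swapped also guards against silently retraining the unchanged architecture.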

Training again

Lastly, we generate and train new models with the substituted input by passing the query_string above:

models_subst = ql.auto_run(train, 'target', kind='classification', query_string=bm_query_str, n_epochs=2)
Loss: 3.48E-02
Epoch no. 2/2 - Tried 973 models - Completed in 1s.
[Model diagram: logistic output "target" built from the inputs worst perimeter, mean concavity, worst smoothness, mean compactness and mean texture, combined through add and gaussian nodes]
best_subst = models_subst[0]

Comparing the models

Was there an improvement over the original best_model?

from IPython.display import display
display(best_model.plot(train, validation))
display(best_subst.plot(train, validation))
[Model diagram and metrics for best_model]
Inputs: mean compactness, mean concavity, mean texture, worst radius, worst smoothness

            Training   Test
Accuracy    0.997      0.947
AUC         0.999      0.971
Precision   0.995      0.971
Recall      1.0        0.944

[Model diagram and metrics for best_subst]
Inputs: mean compactness, mean concavity, mean texture, worst perimeter, worst smoothness

            Training   Test
Accuracy    0.991      0.939
AUC         0.999      0.972
Precision   0.995      0.971
Recall      0.991      0.93

The Test column refers to the validation set passed to plot. The two models perform almost identically there: the AUC is 0.972 for the substituted model versus 0.971 for the original, while accuracy drops slightly from 0.947 to 0.939. Swapping worst radius for worst perimeter therefore brings no real improvement, but it also costs very little performance.
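To compare the models numerically rather than by reading the plots, for instance on the holdout set that has not been used so far, a small helper along the following lines could be used. This is a sketch with scikit-learn metrics: the name `compare_models` is hypothetical, and it assumes each predictor is a callable that returns the predicted probability of the positive class. The demonstration uses toy scores so that it runs without a QLattice.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, roc_auc_score

def compare_models(predictors, data, target, threshold=0.5):
    """Tabulate accuracy and AUC for several predictors on the same data."""
    y_true = data[target]
    rows = {}
    for name, predict in predictors.items():
        proba = np.asarray(predict(data))
        rows[name] = {
            "accuracy": accuracy_score(y_true, (proba >= threshold).astype(int)),
            "auc": roc_auc_score(y_true, proba),
        }
    return pd.DataFrame.from_dict(rows, orient="index")

# Toy data standing in for model predictions.
toy = pd.DataFrame({
    "target": [0, 0, 1, 1],
    "score_a": [0.1, 0.4, 0.35, 0.8],
    "score_b": [0.1, 0.2, 0.7, 0.9],
})
result = compare_models(
    {
        "a": lambda d: d["score_a"].to_numpy(),
        "b": lambda d: d["score_b"].to_numpy(),
    },
    toy,
    "target",
)
print(result)
```

In the tutorial's setting, the call would presumably look like `compare_models({'original': best_model.predict, 'substituted': best_subst.predict}, holdout, 'target')`.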


Copyright © 2024 Abzu.ai - Feyn license: CC BY-NC-ND 4.0
Feyn®, QGraph®, and the QLattice® are registered trademarks of Abzu®