Feyn Documentation

Feyn Documentation

  • Learn
  • Guides
  • Tutorials
  • API Reference
  • FAQ

›Classification

Overview

  • Tutorials

Beginner

    Classification

    • Titanic survival
    • Pulsar stars
    • Poisonous Mushrooms

    Regression

    • Airbnb prices
    • Automobile MPG
    • Concrete strength

Advanced

    Regression

    • Wine Quality

Use cases

  • Rewriting models with correlated inputs
  • Complexity-Loss Trade-Off
  • Plotting the loss graph
  • Simple linear and logistic regression
  • Deploy a model for inference

Life Sciences

    Classification

    • Detecting Liver Cancer (HCC) in Plasma
    • Classifying toxicity of antisense oligonucleotides

    Regression

    • Covid-19 RNA vaccine degradation data set
    • Preventing the Honeybee Apocalypse (QSAR)

Interfacing with R

  • Classifying toxicity of antisense oligonucleotides

Archive

  • Covid-19 vaccination RNA dataset.

Titanic survival

by: Meera Machado & Chris Cave

Feyn version: 2.1.+

Last updated: 23/09/2021

In this tutorial, we'll be using Feyn and the QLattice to solve a binary classification problem by exploring models that aim to predict the probability of surviving the disaster of the RMS Titanic during her maiden voyage in April of 1912.

import numpy as np
import pandas as pd

import feyn
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
%matplotlib inline

Importing dataset and quick check-up

The Titanic passenger dataset was acquired through the data.world and Encyclopedia Titanica websites.

df = pd.read_csv('../data/titanic.csv')
df.head()
pclass survived name sex age sibsp parch ticket fare cabin embarked boat body home.dest
0 1 1 Allen, Miss. Elisabeth Walton female 29.0000 0 0 24160 211.3375 B5 S 2 NaN St Louis, MO
1 1 1 Allison, Master. Hudson Trevor male 0.9167 1 2 113781 151.5500 C22 C26 S 11 NaN Montreal, PQ / Chesterville, ON
2 1 0 Allison, Miss. Helen Loraine female 2.0000 1 2 113781 151.5500 C22 C26 S NaN NaN Montreal, PQ / Chesterville, ON
3 1 0 Allison, Mr. Hudson Joshua Creighton male 30.0000 1 2 113781 151.5500 C22 C26 S NaN 135.0 Montreal, PQ / Chesterville, ON
4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.0000 1 2 113781 151.5500 C22 C26 S NaN NaN Montreal, PQ / Chesterville, ON

Dealing with missing data

# Checking which columns have nan values:
df.columns[df.isna().any().values].to_list()
['age', 'cabin', 'boat', 'body', 'home.dest']

Among all the input features containing NaN values, age is the one of most interest.

df[df.age.isna()]
pclass survived name sex age sibsp parch ticket fare cabin embarked boat body home.dest
816 3 0 Gheorgheff, Mr. Stanio male NaN 0 0 349254 7.8958 NaN C NaN NaN NaN
940 3 0 Kraeff, Mr. Theodor male NaN 0 0 349253 7.8958 NaN C NaN NaN NaN

We take the following approach to guessing the missing ages: a random number between ⟨x⟩−σx\langle x \rangle - \sigma_x⟨x⟩−σx​ and ⟨x⟩+σx\langle x \rangle + \sigma_x⟨x⟩+σx​, where ⟨x⟩\langle x \rangle⟨x⟩ and σx\sigma_xσx​ are, respectively, the mean age and standard deviation of all people sharing the same feature values.

age_dist = df[(df.pclass == 3) & (df.embarked == 'C') & (df.sex == 'male') & 
              (df.sibsp == 0) & (df.parch == 0) & (df.survived == 0)].age.dropna()

mean_age = np.mean(age_dist)
std_age = np.std(age_dist)

np.random.seed(42)
age_guess = np.random.normal(mean_age, std_age, size=2)
# In a simple manner, we drop some features which could be irrelevant (at first look) 
df_mod = df.drop(['boat', 'body', 'home.dest', 'name', 'ticket', 'cabin'], axis=1)
df_mod.loc[df[df.age.isna()].index, 'age'] = age_guess

Training session

Splitting data in train, validation and holdout sets

We wish to make a prediction on the probability of surviving the Titanic sinking, so we set survived to be our output variable.

output = 'survived'

# Train and test
train, test = train_test_split(df_mod, test_size=0.4, random_state=42, stratify=df_mod[output])

# Validation and holdout:
valid, hold = train_test_split(test, test_size=0.4, stratify=test[output], random_state=42)

Connect to QLattice

First we connect to a QLattice. Here we connect to the community QLattice

ql = feyn.connect_qlattice() # Connecting

Specify categorical variables

stypes = {}

for col in train.columns:
    if train[col].dtype == 'O':
        stypes[col] = 'c'
        
stypes['pclass'] = 'c' # Making sure that pclass, which is of int type, is considered a categorical variable

Sample and fit models

This occurs in the following steps:

  1. Sample models from the QLattice;
  2. Fit the models by minimizing bic;
  3. Update the QLattice with the best models' structures;
  4. Repeat the process;

This is all captured within the auto_run function

# Resetting
ql.reset(random_seed=42)
models = ql.auto_run(train, output, kind='classification', stypes=stypes)
Loss: 4.66E-01Epoch no. 10/10 - Tried 18163 models - Completed in 39s.survived logistic: w=1.4986 bias=0.7931survived0outaddadd1multiplymultiply2addadd3pclass categorical with 3 values bias=-0.9196pclass4catsex categorical with 2 values bias=0.6423sex5catage linear: scale=0.025105 w=2.201962 bias=0.7612age6numembarked categorical with 3 values bias=0.0286embarked7cat

The model above is the one that returns the lowest bic during fitting.

best_model = models[0]

Model evaluation

best_model.plot(train, valid)
survived logistic: w=1.4986 bias=0.7931survived0outaddadd1multiplymultiply2addadd3pclass categorical with 3 values bias=-0.9196pclass4catsex categorical with 2 values bias=0.6423sex5catage linear: scale=0.025105 w=2.201962 bias=0.7612age6numembarked categorical with 3 values bias=0.0286embarked7catTraining MetricsAccuracy0.791AUC0.838Precision0.748Recall0.683Test0.7990.8350.7940.642Inputspclasssexageembarked

Training Metrics

Test

On the right we display a few classification metrics for train and valid ("Test").

Histograms of the probability scores

The idea here consists of checking the probability distribution scores for the positive (survivors) and negative (non-survivors) classes.

fig, ax = plt.subplots(1, 3, figsize=(20, 6))
best_model.plot_probability_scores(train, title='Total probability distribution', ax=ax[0])
best_model.plot_probability_scores(train.query('sex == "female"'), title='Probability distribution for females', ax=ax[1])
best_model.plot_probability_scores(train.query('sex == "male"'), title='Probability distribution for males', ax=ax[2])
plt.show()

png

Confusion matrix

fig, ax = plt.subplots(1, 3, figsize=(20, 6))
best_model.plot_confusion_matrix(valid, ax=ax[0], title='All')
best_model.plot_confusion_matrix(valid.query('sex == "female"'), ax=ax[1], title='Females')
best_model.plot_confusion_matrix(valid.query('sex == "male"'), ax=ax[2], title='Males')

png

Partial plots

Below are partial dependence plots (PDP) for females and males separately:

train_f = train.query('sex == "female"').copy()
best_model.plot_response_1d(train_f, by='age', input_constraints={'pclass': [1,2,3], 'sibsp': train_f['sibsp'].median()})

png

For women we see that the same behaviour is present for the 1st\mathrm{1^{st}}1st and 2nd\mathrm{2^{nd}}2nd classes: as age increases, so does their chance of survival. However, the opposite happens for women in the 3rd\mathrm{3^{rd}}3rd class! As their age increases, their chance of survival decreases.

train_m = train.query('sex == "male"').copy()
best_model.plot_response_1d(train_m, by='age', input_constraints={'sex': 'male', 'pclass': [1,2,3], 'sibsp': train_m['sibsp'].median()})

png

The trend for men remains the same throughout the classes: as age increases, the chance of survival decreases. However, male children have a probability score above 0.5.

← TutorialsPulsar stars →
  • Importing dataset and quick check-up
    • Dealing with missing data
  • Training session
    • Splitting data in train, validation and holdout sets
    • Connect to QLattice
    • Specify categorical variables
    • Sample and fit models
  • Model evaluation
    • Histograms of the probability scores
    • Confusion matrix
    • Partial plots

Subscribe to get news about Feyn and the QLattice.

You can opt out at any time, and you can read our privacy policy here.

Copyright © 2024 Abzu.ai - Feyn license: CC BY-NC-ND 4.0
Feyn®, QGraph®, and the QLattice® are registered trademarks of Abzu®