Titanic survival

Feyn version: 2.1.+

Last updated: 23/09/2021

In this tutorial, we'll be using Feyn and the QLattice to solve a binary classification problem by exploring models that aim to predict the probability of surviving the disaster of the RMS Titanic during her maiden voyage in April of 1912.

import numpy as np
import pandas as pd

import feyn
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
%matplotlib inline

Importing dataset and quick check-up

The Titanic passenger dataset was acquired through the data.world and Encyclopedia Titanica websites.

df = pd.read_csv('../data/titanic.csv')
df.head()

	pclass	survived	name	sex	age	sibsp	parch	ticket	fare	cabin	embarked	boat	body	home.dest
0	1	1	Allen, Miss. Elisabeth Walton	female	29.0000	0	0	24160	211.3375	B5	S	2	NaN	St Louis, MO
1	1	1	Allison, Master. Hudson Trevor	male	0.9167	1	2	113781	151.5500	C22 C26	S	11	NaN	Montreal, PQ / Chesterville, ON
2	1	0	Allison, Miss. Helen Loraine	female	2.0000	1	2	113781	151.5500	C22 C26	S	NaN	NaN	Montreal, PQ / Chesterville, ON
3	1	0	Allison, Mr. Hudson Joshua Creighton	male	30.0000	1	2	113781	151.5500	C22 C26	S	NaN	135.0	Montreal, PQ / Chesterville, ON
4	1	0	Allison, Mrs. Hudson J C (Bessie Waldo Daniels)	female	25.0000	1	2	113781	151.5500	C22 C26	S	NaN	NaN	Montreal, PQ / Chesterville, ON

Dealing with missing data

# Checking which columns have nan values:
df.columns[df.isna().any().values].to_list()

['age', 'cabin', 'boat', 'body', 'home.dest']

Among all the input features containing NaN values, age is the one of most interest.

df[df.age.isna()]

	pclass	survived	name	sex	age	sibsp	parch	ticket	fare	cabin	embarked	boat	body	home.dest
816	3	0	Gheorgheff, Mr. Stanio	male	NaN	0	0	349254	7.8958	NaN	C	NaN	NaN	NaN
940	3	0	Kraeff, Mr. Theodor	male	NaN	0	0	349253	7.8958	NaN	C	NaN	NaN	NaN

We take the following approach to guessing the missing ages: a random number between $\langle x \rangle - \sigma_x$ and $\langle x \rangle + \sigma_x$ , where $\langle x \rangle$ and $\sigma_x$ are, respectively, the mean age and standard deviation of all people sharing the same feature values.

age_dist = df[(df.pclass == 3) & (df.embarked == 'C') & (df.sex == 'male') & 
              (df.sibsp == 0) & (df.parch == 0) & (df.survived == 0)].age.dropna()

mean_age = np.mean(age_dist)
std_age = np.std(age_dist)

np.random.seed(42)
age_guess = np.random.normal(mean_age, std_age, size=2)

# In a simple manner, we drop some features which could be irrelevant (at first look) 
df_mod = df.drop(['boat', 'body', 'home.dest', 'name', 'ticket', 'cabin'], axis=1)
df_mod.loc[df[df.age.isna()].index, 'age'] = age_guess

Training session

Splitting data in train, validation and holdout sets

We wish to make a prediction on the probability of surviving the Titanic sinking, so we set survived to be our output variable.

output = 'survived'

# Train and test
train, test = train_test_split(df_mod, test_size=0.4, random_state=42, stratify=df_mod[output])

# Validation and holdout:
valid, hold = train_test_split(test, test_size=0.4, stratify=test[output], random_state=42)

Connect to QLattice

First we connect to a QLattice. Here we connect to the community QLattice

ql = feyn.connect_qlattice() # Connecting

Specify categorical variables

stypes = {}

for col in train.columns:
    if train[col].dtype == 'O':
        stypes[col] = 'c'
        
stypes['pclass'] = 'c' # Making sure that pclass, which is of int type, is considered a categorical variable

Sample and fit models

This occurs in the following steps:

Sample models from the QLattice;
Fit the models by minimizing bic;
Update the QLattice with the best models' structures;
Repeat the process;

This is all captured within the auto_run function

# Resetting
ql.reset(random_seed=42)

models = ql.auto_run(train, output, kind='classification', stypes=stypes)

The model above is the one that returns the lowest bic during fitting.

best_model = models[0]

Model evaluation

best_model.plot(train, valid)

Training Metrics

Test

On the right we display a few classification metrics for train and valid ("Test").

Histograms of the probability scores

The idea here consists of checking the probability distribution scores for the positive (survivors) and negative (non-survivors) classes.

fig, ax = plt.subplots(1, 3, figsize=(20, 6))
best_model.plot_probability_scores(train, title='Total probability distribution', ax=ax[0])
best_model.plot_probability_scores(train.query('sex == "female"'), title='Probability distribution for females', ax=ax[1])
best_model.plot_probability_scores(train.query('sex == "male"'), title='Probability distribution for males', ax=ax[2])
plt.show()

png

Confusion matrix

fig, ax = plt.subplots(1, 3, figsize=(20, 6))
best_model.plot_confusion_matrix(valid, ax=ax[0], title='All')
best_model.plot_confusion_matrix(valid.query('sex == "female"'), ax=ax[1], title='Females')
best_model.plot_confusion_matrix(valid.query('sex == "male"'), ax=ax[2], title='Males')

png

Partial plots

Below are partial dependence plots (PDP) for females and males separately:

train_f = train.query('sex == "female"').copy()
best_model.plot_response_1d(train_f, by='age', input_constraints={'pclass': [1,2,3], 'sibsp': train_f['sibsp'].median()})

png

For women we see that the same behaviour is present for the $\mathrm{1^{st}}$ and $\mathrm{2^{nd}}$ classes: as age increases, so does their chance of survival. However, the opposite happens for women in the $\mathrm{3^{rd}}$ class! As their age increases, their chance of survival decreases.

train_m = train.query('sex == "male"').copy()
best_model.plot_response_1d(train_m, by='age', input_constraints={'sex': 'male', 'pclass': [1,2,3], 'sibsp': train_m['sibsp'].median()})

png

The trend for men remains the same throughout the classes: as age increases, the chance of survival decreases. However, male children have a probability score above 0.5.

Feyn Documentation

Classification

Regression

Regression

Classification

Regression

Importing dataset and quick check-up

Dealing with missing data

Training session

Splitting data in train, validation and holdout sets

Connect to QLattice

Specify categorical variables

Sample and fit models

Model evaluation

Training Metrics

Test

Histograms of the probability scores

Confusion matrix

Partial plots

Subscribe to get news about Feyn and the QLattice.