# Titanic survival

by: Meera Machado & Chris Cave

Feyn version: 2.1.+

Last updated: 23/09/2021

In this tutorial, we'll be using `Feyn` and the `QLattice` to solve a binary classification problem. We explore models that aim to predict the probability of surviving the disaster of the RMS Titanic during her maiden voyage in April of 1912.

```
import numpy as np
import pandas as pd
import feyn
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
%matplotlib inline
```

## Importing dataset and quick check-up

The Titanic passenger dataset was acquired through the data.world and Encyclopedia Titanica websites.

```
df = pd.read_csv('../data/titanic.csv')
df.head()
```

| | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | NaN | St Louis, MO |
| 1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | NaN | Montreal, PQ / Chesterville, ON |
| 2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
| 3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | 135.0 | Montreal, PQ / Chesterville, ON |
| 4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |

### Dealing with missing data

```
# Checking which columns have nan values:
df.columns[df.isna().any().values].to_list()
```

```
['age', 'cabin', 'boat', 'body', 'home.dest']
```
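Knowing how many values are missing in each column, not only which columns are affected, helps decide whether to impute or drop. The following is a minimal sketch on a made-up three-column frame (`toy` and its values are illustrative, not the Titanic data):

```python
import numpy as np
import pandas as pd

# Made-up frame standing in for the passenger data
toy = pd.DataFrame({
    "age": [29.0, np.nan, 2.0, np.nan],
    "fare": [211.3, 7.9, 151.6, 7.9],
    "cabin": ["B5", None, "C22", None],
})

# Count and share of missing values per column
summary = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "share": toy.isna().mean(),
})

# Keep only the columns that actually contain missing values
summary = summary[summary.n_missing > 0]
print(summary)
```

Applied to `df`, the same two calls (`df.isna().sum()` and `df.isna().mean()`) would show how sparse each of the five columns listed above is.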

Among all the input features containing NaN values, `age` is the one of most interest.

```
df[df.age.isna()]
```

| | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 816 | 3 | 0 | Gheorgheff, Mr. Stanio | male | NaN | 0 | 0 | 349254 | 7.8958 | NaN | C | NaN | NaN | NaN |
| 940 | 3 | 0 | Kraeff, Mr. Theodor | male | NaN | 0 | 0 | 349253 | 7.8958 | NaN | C | NaN | NaN | NaN |

We take the following approach to guessing the missing ages: we draw a random number from a normal distribution with mean $\langle x \rangle$ and standard deviation $\sigma_x$, where $\langle x \rangle$ and $\sigma_x$ are, respectively, the mean and standard deviation of `age` among all passengers sharing the same feature values as the two passengers above.

```
age_dist = df[(df.pclass == 3) & (df.embarked == 'C') & (df.sex == 'male') &
(df.sibsp == 0) & (df.parch == 0) & (df.survived == 0)].age.dropna()
mean_age = np.mean(age_dist)
std_age = np.std(age_dist)
np.random.seed(42)
age_guess = np.random.normal(mean_age, std_age, size=2)
```
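There are two common ways to draw a replacement value around a group mean. A normal draw, as `np.random.normal` produces, is centred on the mean but unbounded. A uniform draw stays strictly within one standard deviation of the mean. The sketch below contrasts the two on made-up ages; the array and the use of `np.random.default_rng` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up ages standing in for the matching group of passengers
ages = np.array([20.0, 24.0, 28.0, 32.0, 36.0])
mean_age = ages.mean()
std_age = ages.std()  # population standard deviation, like np.std

# Normal draw: centred on the mean, can fall outside mean +/- std
normal_guess = rng.normal(mean_age, std_age, size=2)

# Uniform draw: always inside [mean - std, mean + std)
uniform_guess = rng.uniform(mean_age - std_age, mean_age + std_age, size=2)

print(normal_guess, uniform_guess)
```

With a normal draw and a small group, it is worth checking that the guessed ages are positive before writing them back.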

```
# In a simple manner, we drop some features which could be irrelevant (at first look)
df_mod = df.drop(['boat', 'body', 'home.dest', 'name', 'ticket', 'cabin'], axis=1)
df_mod.loc[df[df.age.isna()].index, 'age'] = age_guess
```

## Training session

### Splitting data into *train*, *validation* and *holdout* sets

We wish to make a prediction on the probability of surviving the Titanic sinking, so we set `survived` to be our **output** variable.

```
output = 'survived'
# Train and test
train, test = train_test_split(df_mod, test_size=0.4, random_state=42, stratify=df_mod[output])
# Validation and holdout:
valid, hold = train_test_split(test, test_size=0.4, stratify=test[output], random_state=42)
```
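Because the second split is applied to the 40% left over from the first, the final proportions are not obvious at a glance. The small helper below (`split_fractions` is a hypothetical name, not part of scikit-learn) works them out. The actual row counts will differ slightly, since the splitter has to round to whole rows.

```python
def split_fractions(first_test_size, second_test_size):
    """Approximate share of the full dataset in each of the three sets."""
    train = 1.0 - first_test_size
    valid = first_test_size * (1.0 - second_test_size)
    hold = first_test_size * second_test_size
    return {"train": train, "valid": valid, "hold": hold}

# Both splits above use test_size=0.4
fractions = split_fractions(0.4, 0.4)
for name, share in fractions.items():
    print(f"{name}: {share:.0%}")
```

This gives roughly 60% of the passengers for training, 24% for validation and 16% for the holdout set.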

### Connect to QLattice

First we connect to a `QLattice`. Here we connect to the community QLattice.

```
ql = feyn.connect_qlattice() # Connecting
```

### Specify categorical variables

```
stypes = {}
for col in train.columns:
    # Columns of object dtype (strings) are treated as categorical
    if train[col].dtype == 'O':
        stypes[col] = 'c'
stypes['pclass'] = 'c' # Making sure that pclass, which is of int type, is considered a categorical variable
```
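The same mapping can be built with a dictionary comprehension. The sketch below runs on a made-up frame (`toy_train` is illustrative) and swaps the `dtype == 'O'` comparison for `pandas.api.types.is_numeric_dtype`, on the assumption that every non-numeric column should be categorical. This check also covers columns stored with a dedicated string dtype.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Made-up frame with the same kinds of columns as the training set
toy_train = pd.DataFrame({
    "pclass": [1, 3, 2],
    "sex": ["female", "male", "male"],
    "age": [29.0, 30.0, 25.0],
    "embarked": ["S", "C", "S"],
    "survived": [1, 0, 0],
})

# Every non-numeric column is marked as categorical ('c')
stypes = {
    col: "c"
    for col in toy_train.columns
    if not is_numeric_dtype(toy_train[col])
}

# pclass is stored as an integer but is really a category
stypes["pclass"] = "c"
print(stypes)
```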

### Sample and fit models

This occurs in the following steps:

- Sample models from the `QLattice`;
- Fit the models by minimizing `bic`;
- Update the `QLattice` with the best models' structures;
- Repeat the process.

This is all captured within the `auto_run` function.
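To give a feel for what minimizing `bic` rewards, here is a minimal sketch of the textbook Bayesian information criterion for a binary classifier, $k \ln n - 2 \ln L$, where $k$ is the number of parameters, $n$ the number of samples and $L$ the likelihood. The function name and the clipping constant are assumptions, and Feyn's internal computation may differ in detail.

```python
import math

def bic(y_true, p_pred, n_params, eps=1e-12):
    """Textbook BIC for binary labels and predicted probabilities."""
    n = len(y_true)
    log_lik = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        log_lik += math.log(p) if y == 1 else math.log(1.0 - p)
    # Penalty for complexity minus twice the log-likelihood
    return n_params * math.log(n) - 2.0 * log_lik

y = [1, 0, 1, 1, 0, 0]
p = [0.9, 0.2, 0.6, 0.7, 0.4, 0.1]

# Same predictions, different number of parameters
print(bic(y, p, n_params=2), bic(y, p, n_params=5))
```

For identical predictions, the model with fewer parameters gets the lower score, which is why this criterion tends to favour simpler models.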

```
# Resetting
ql.reset(random_seed=42)
```

```
models = ql.auto_run(train, output, kind='classification', stypes=stypes)
```

The model displayed by `auto_run` is the one that returned the lowest `bic` during fitting.

```
best_model = models[0]
```

## Model evaluation

```
best_model.plot(train, valid)
```

On the right of the plot we display a few classification metrics for `train` (panel "Training Metrics") and `valid` (panel "Test").

### Histograms of the probability scores

The idea here is to check how the predicted probability scores are distributed for the positive (survivors) and negative (non-survivors) classes.

```
fig, ax = plt.subplots(1, 3, figsize=(20, 6))
best_model.plot_probability_scores(train, title='Total probability distribution', ax=ax[0])
best_model.plot_probability_scores(train.query('sex == "female"'), title='Probability distribution for females', ax=ax[1])
best_model.plot_probability_scores(train.query('sex == "male"'), title='Probability distribution for males', ax=ax[2])
plt.show()
```

### Confusion matrix

```
fig, ax = plt.subplots(1, 3, figsize=(20, 6))
best_model.plot_confusion_matrix(valid, ax=ax[0], title='All')
best_model.plot_confusion_matrix(valid.query('sex == "female"'), ax=ax[1], title='Females')
best_model.plot_confusion_matrix(valid.query('sex == "male"'), ax=ax[2], title='Males')
```
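The counts in a confusion matrix come from thresholding the probability scores and comparing the result with the true labels. The sketch below does this by hand on made-up labels and scores, assuming a threshold of 0.5; `confusion_counts` is a hypothetical helper, not a Feyn function.

```python
def confusion_counts(y_true, scores, threshold=0.5):
    """Return true/false positive/negative counts for binary labels."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for y, s in zip(y_true, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1:
            counts["tp" if y == 1 else "fp"] += 1
        else:
            counts["tn" if y == 0 else "fn"] += 1
    return counts

# Made-up labels (1 = survived) and probability scores
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1]

counts = confusion_counts(y_true, scores)
accuracy = (counts["tp"] + counts["tn"]) / len(y_true)
print(counts, round(accuracy, 3))
```

Running the same helper on the female and male subsets separately would mirror the three panels plotted above.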

### Partial plots

Below are partial dependence plots (PDP) for females and males separately:

```
train_f = train.query('sex == "female"').copy()
best_model.plot_response_1d(train_f, by='age', input_constraints={'pclass': [1,2,3], 'sibsp': train_f['sibsp'].median()})
```

For women we see that the same behaviour is present for the $\mathrm{1^{st}}$ and $\mathrm{2^{nd}}$ classes: as age increases, so does their chance of survival. However, the opposite happens for women in the $\mathrm{3^{rd}}$ class! As their age increases, their chance of survival decreases.

```
train_m = train.query('sex == "male"').copy()
best_model.plot_response_1d(train_m, by='age', input_constraints={'sex': 'male', 'pclass': [1,2,3], 'sibsp': train_m['sibsp'].median()})
```

The trend for men remains the same throughout the classes: as age increases, the chance of survival decreases. However, male children have a probability score above 0.5.
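The idea behind these response plots can be sketched with a toy model: vary one input (`age`) over a grid while holding the others fixed, and record the predicted probability. The logistic function and its coefficients below are invented purely for illustration and are not the fitted QLattice model.

```python
import math

def toy_survival_probability(age, pclass, is_male):
    """Invented logistic model; the coefficients are illustrative only."""
    logit = 1.5 - 0.03 * age - 0.6 * (pclass - 1) - 2.0 * is_male
    return 1.0 / (1.0 + math.exp(-logit))

def response_1d(ages, pclass, is_male):
    """Predicted probability along age with the other inputs held fixed."""
    return [toy_survival_probability(a, pclass, is_male) for a in ages]

ages = list(range(0, 81, 10))

# One curve per passenger class, here for male passengers
curves = {pclass: response_1d(ages, pclass, is_male=1) for pclass in (1, 2, 3)}
for pclass, curve in curves.items():
    print(pclass, [round(prob, 2) for prob in curve])
```

Each list corresponds to one line in a response plot. In this toy model all three curves fall with age by construction, whereas the fitted model is free to learn class-dependent trends such as the one seen for women.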