# A binary classification case

by: Meera Machado

In this tutorial, we'll be using *Feyn* and the *QLattice* to solve a binary classification problem, exploring models that aim to predict the probability of surviving the disaster of the RMS Titanic during her maiden voyage in April of 1912.

```
import numpy as np
import pandas as pd
import feyn
from feyn.tools import plot_confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt
%matplotlib inline
```

## 1. Importing dataset and quick check-up

The Titanic passenger dataset was acquired through the data.world and Encyclopedia Titanica websites.

```
df = pd.read_csv('titanic.csv')
df.head()
```

| | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | NaN | St Louis, MO |
| 1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | NaN | Montreal, PQ / Chesterville, ON |
| 2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
| 3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | 135.0 | Montreal, PQ / Chesterville, ON |
| 4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |

### 1.1 Dealing with missing data

```
# Checking which columns have nan values:
df.isna().any()
```

```
pclass False
survived False
name False
sex False
age True
sibsp False
parch False
ticket False
fare False
cabin True
embarked False
boat True
body True
home.dest True
dtype: bool
```
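Beyond flagging columns with `.any()`, `.isna().sum()` gives the number of missing entries per column. A minimal sketch on a toy frame (the values below are illustrative, not the Titanic data):

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in two columns
toy = pd.DataFrame({
    'age': [29.0, np.nan, 2.0, np.nan],
    'cabin': ['B5', None, None, 'C22'],
    'fare': [211.3, 151.5, 151.5, 7.9],
})

missing_counts = toy.isna().sum()  # per-column count of NaN/None entries
print(missing_counts)
```

This is handy when deciding whether to impute a column (few gaps) or drop it entirely (mostly empty, like `cabin` in the full dataset).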

Among all the features containing NaN values, `age` is the one of most interest.

```
df[df.age.isna()]
```

| | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 816 | 3 | 0 | Gheorgheff, Mr. Stanio | male | NaN | 0 | 0 | 349254 | 7.8958 | NaN | C | NaN | NaN | NaN |
| 940 | 3 | 0 | Kraeff, Mr. Theodor | male | NaN | 0 | 0 | 349253 | 7.8958 | NaN | C | NaN | NaN | NaN |

Note that the only gentlemen whose ages are missing share the same feature values (with the exception of `ticket`). In this case, we take a simple approach to guessing their ages: a random number between $\langle x \rangle - \sigma_x$ and $\langle x \rangle + \sigma_x$, where $\langle x \rangle$ and $\sigma_x$ are, respectively, the mean and standard deviation of the `age` of all passengers sharing the same feature values.

```
age_dist = df[(df.pclass == 3) & (df.embarked == 'C') & (df.sex == 'male') &
              (df.sibsp == 0) & (df.parch == 0) & (df.survived == 0)].age.dropna()
mean_age = np.mean(age_dist)
std_age = np.std(age_dist)

np.random.seed(42)
age_guess = np.random.uniform(mean_age - std_age, mean_age + std_age, size=2)
```

```
# Dropping features that look irrelevant at first glance
df_mod = df.drop(['boat', 'body', 'home.dest', 'name', 'ticket', 'cabin'], axis=1)
df_mod.loc[df[df.age.isna()].index, 'age'] = age_guess
```
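The same mean ± std imputation can be sanity-checked in isolation. A sketch on a toy column (illustrative values, not the Titanic ages), verifying that the draws land inside the intended interval and that no gaps remain:

```python
import numpy as np
import pandas as pd

# Toy column with two missing ages, imputed with uniform draws
# between mean - std and mean + std of the observed values
ages = pd.Series([22.0, 30.0, 26.0, np.nan, np.nan])
observed = ages.dropna()
mean_age, std_age = observed.mean(), observed.std()

rng = np.random.default_rng(42)
guesses = rng.uniform(mean_age - std_age, mean_age + std_age,
                      size=ages.isna().sum())
ages.loc[ages.isna()] = guesses

print(ages)
```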

## 2. Training session

### 2.1 Splitting data into *train*, *validation* and *holdout* sets

We wish to predict the probability of surviving the Titanic sinking, so we set `survived` as our *target* variable.

```
target = 'survived'
# Train and test
train, test = train_test_split(df_mod, test_size=0.4, random_state=42, stratify=df_mod[target])
# Validation and holdout:
valid, hold = train_test_split(test, test_size=0.4, stratify=test[target], random_state=42)
```
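With `test_size=0.4` applied twice, the three sets end up at roughly 60% / 24% / 16% of the data, with class proportions preserved by `stratify`. A minimal sketch on synthetic labels (not the Titanic rows):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# 100 synthetic rows with a 60/40 class balance
data = pd.DataFrame({'survived': [0] * 60 + [1] * 40, 'x': range(100)})

train_s, test_s = train_test_split(data, test_size=0.4, random_state=42,
                                   stratify=data['survived'])
valid_s, hold_s = train_test_split(test_s, test_size=0.4, random_state=42,
                                   stratify=test_s['survived'])

print(len(train_s), len(valid_s), len(hold_s))  # 60 24 16
```

Stratifying matters here because the survivor class is the minority: without it, a small holdout set could end up with a noticeably skewed survival rate.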

Here we make sure that `pclass` ends up labelled as a categorical variable in every split (working on explicit copies also avoids pandas' `SettingWithCopyWarning`):

```
train, valid, hold = train.copy(), valid.copy(), hold.copy()
for split in (train, valid, hold):
    split['pclass'] = split.pclass.astype(str)
```

### 2.2 Pre-training time

First we connect to a *QLattice* through its unique URL and API token. Since this one runs locally, no URL or token is necessary. A *QLattice* can also be reset, wiping everything it has learned, if the user wishes to begin from a clean slate.

```
ql = feyn.QLattice() # Connecting
ql.reset() # Resetting
```

Each feature from the dataset, including the target variable, interacts with the *QLattice* through *registers*. These are divided into `input` and `output`, and accommodate `numerical` and `categorical` types of variables.

In the following cell, we assign the features of `train` to input and output *registers* by passing their names as strings to *get_register*. For the input *registers*, one should indicate the variable type via *register_type*: `'fixed'` for numerical or `'cat'` for categorical. The default is `'fixed'`.

```
in_regs = []  # Begins the list of input registers
print('Categorical variables:')
for var in train.columns:
    if type(train[var].iloc[0]) is str:  # Hence the need to set 'pclass' as str type
        print(var)
        in_regs.append(ql.get_register(var, register_type='cat'))
    else:
        in_regs.append(ql.get_register(var, register_type='fixed'))

# The output register only allows the continuous type, since the output is a probability
out_reg = ql.get_register(target)
```

```
Categorical variables:
pclass
sex
embarked
```

After naming the *registers* it is possible to extract a *QGraph* from the *QLattice*. A *QGraph* has the input and output *registers* as parameters along with the *max_depth* option. The latter determines the maximum size of a *graph* inside the *QGraph*.

```
qgraph = ql.get_qgraph(in_regs, out_reg, max_depth=3)
```

Simply put, a *QGraph* is a collection of *graphs*:

```
qgraph.head(3)
```

And each of these *graphs* is in fact a possible model for the problem at hand.

It should be stressed that no training has taken place yet. As one may have noticed, aside from the column names, no data has actually been fed into the *QLattice*!

### 2.3 Actual training time

Training with the *QLattice* occurs in the following steps:

1. Extract a *QGraph* (as exemplified above);
2. Fit the training data to the *graphs* from the *QGraph*. One may:
   - specify the number of epochs each *graph* should be trained for (forward propagation and gradient descent);
   - choose a loss function among *mean_squared_error*, *mean_absolute_error* and *categorical_cross_entropy*;
   - set a number of threads for fitting the *graphs*;
   - choose what should be displayed while training;
3. Select the *graph* that performs best on a chosen dataset according to a loss function and other optional criteria, such as depth;
4. Update the *QLattice* with the *graph* selected above;
5. Repeat the process from step 1: `get_qgraph` should now return a *QGraph* with the chosen best *graph* and others with similar architecture.

```
# Defining the number of loops/updates and epochs
nloops = 5
nepochs = 10

# And training finally begins
for loop in range(nloops):  # (5)
    qgraph = ql.get_qgraph(in_regs, out_reg, max_depth=4)  # (1)
    qgraph.fit(train, epochs=nepochs, loss_function='mean_squared_error', threads=4, show='graph')  # (2)
    best = qgraph.select(train, loss_function='mean_squared_error')[0]  # (3)
    ql.update(best)  # (4)
```

Note: the text "Examined n of N." indicates the number of *graphs* (n) examined out of their total number (N) in the *QGraph*.

The model above is the one that returns the smallest mean squared error loss when `train` is fit to the *graphs* in the *QGraph*.

Below we see the top 3 *graphs* following the same criteria. Note that they may have different sizes, distinct activation functions in their nodes and even different selections of input features!

```
qgraph.head(3)
```

## 3. Model evaluation

It is now time to verify the survival probabilities predicted by the top *graph* above. The *graph* object has a function called *predict*, which we call with the validation set `valid` as a parameter. One could also call it with `train` or `hold`.

```
best_graph = qgraph.select(train)[0]
df_pred = valid.copy()
df_pred['predicted'] = best_graph.predict(valid)
```

We then employ a set of metrics to evaluate the chosen model:

`Confusion matrix`

How many people were correctly classified as survivors (`True`

) and victims (`False`

)?

```
threshold = 0.5
y_pred = np.where(df_pred.predicted > threshold, True, False)
y_true = df_pred[target].astype(bool)
plt.figure(figsize=(8, 5))
plot_confusion_matrix(y_true, y_pred)
```
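The four cells that `plot_confusion_matrix` draws can also be tallied directly with numpy, which makes the thresholding explicit. A sketch on hand-made scores and labels (illustrative, not the Titanic predictions):

```python
import numpy as np

# Hand-made scores and ground truth, just to illustrate the four cells
scores = np.array([0.9, 0.8, 0.3, 0.6, 0.2, 0.4])
y_true = np.array([True, True, False, False, False, True])
y_pred = scores > 0.5

tp = np.sum(y_pred & y_true)     # predicted survivor, actually survived
tn = np.sum(~y_pred & ~y_true)   # predicted victim, actually died
fp = np.sum(y_pred & ~y_true)    # predicted survivor, actually died
fn = np.sum(~y_pred & y_true)    # predicted victim, actually survived

accuracy = (tp + tn) / len(y_true)
print(tp, tn, fp, fn, accuracy)  # 2 2 1 1 0.666...
```

Raising the threshold trades false positives for false negatives; the ROC curve further below visualises exactly that trade-off.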

```
train_pred = best_graph.predict(train)
print('Overall training accuracy: %.4f' % np.mean(np.round(train_pred) == train[target]))
print('Overall validation accuracy: %.4f' % np.mean(y_pred == y_true))
```

```
Overall training accuracy: 0.8038
Overall validation accuracy: 0.7803
```

`Histograms of the probability scores`

The idea here is to check the probability score distributions for the positive (survivors) and negative (non-survivors) classes.

```
def pos_neg_classes(y_true, y_pred):
    """Finds the probability distributions
    of the positive and negative classes"""
    # Hits and non-hits
    hits = y_pred[np.round(y_pred) == y_true]      # Predictions the model gets right: TP and TN
    non_hits = y_pred[np.round(y_pred) != y_true]  # Predictions the model gets wrong: FP and FN

    # From the whole set of predictions, what actually belongs to the positive and negative classes?
    pos_class = np.append(hits[np.round(hits) == 1], non_hits[np.round(non_hits) == 0])  # TP and FN
    neg_class = np.append(hits[np.round(hits) == 0], non_hits[np.round(non_hits) == 1])  # TN and FP
    return pos_class, neg_class


def plot_hist(pos_class, neg_class, title):
    """Plots the desired histograms"""
    plt.hist(neg_class, label='Negative Class', color='teal', ec='darkslategrey', lw=1.5)
    plt.hist(pos_class, label='Positive Class', color='crimson', ec='darkred', lw=1.5, alpha=0.7)
    plt.legend(loc='upper center', fontsize=12)
    plt.ylabel('Number of occurrences', fontsize=14)
    plt.xlabel('Probability Score', fontsize=14)
    plt.title(title, fontsize=14)


plt.figure(figsize=(18, 4))

# For all passengers
plt.subplot(131)
pos_class, neg_class = pos_neg_classes(df_pred.survived, df_pred.predicted)
plot_hist(pos_class, neg_class, title='Total probability distribution')

# For men only
plt.subplot(132)
pos_class, neg_class = pos_neg_classes(df_pred[df_pred.sex == 'male'].survived,
                                       df_pred[df_pred.sex == 'male'].predicted)
plot_hist(pos_class, neg_class, title='Probability distribution for males')

# For women only
plt.subplot(133)
pos_class, neg_class = pos_neg_classes(df_pred[df_pred.sex == 'female'].survived,
                                       df_pred[df_pred.sex == 'female'].predicted)
plot_hist(pos_class, neg_class, title='Probability distribution for females')

plt.show()
```

The histograms above give an initial idea of model performance through the probability score distributions. For instance, a naive approach to this problem would be to predict all women as survivors and all men as victims. Does that happen in this case?
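That naive "women survive, men don't" rule is easy to score as a baseline. A sketch on a toy frame (illustrative rows, not the validation set):

```python
import pandas as pd

# Toy passengers: the naive rule predicts survival from sex alone
toy = pd.DataFrame({
    'sex':      ['female', 'male', 'female', 'male', 'male', 'female'],
    'survived': [1,        0,      1,        1,      0,      0],
})

naive_pred = (toy['sex'] == 'female').astype(int)
naive_acc = (naive_pred == toy['survived']).mean()
print(naive_acc)
```

A trained model is only interesting to the extent that it beats this kind of single-feature baseline on held-out data.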

`Receiver Operating Characteristic (ROC) curve`

Yet another common metric for evaluating binary classification models, it helps with setting an optimal threshold for the probability scores. Additionally, it can serve as an auxiliary tool in model selection.
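One common heuristic for picking a threshold from the ROC output is maximising Youden's J statistic (tpr − fpr); this is one option among several, not the tutorial's own method. Sketched below on hand-made scores:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hand-made scores where higher values lean towards the positive class
y_true  = np.array([0, 0, 0, 1, 0, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.35, 0.4, 0.45, 0.6, 0.8, 0.9])

fpr, tpr, threshs = roc_curve(y_true, y_score)
best_idx = np.argmax(tpr - fpr)   # Youden's J = tpr - fpr
best_threshold = threshs[best_idx]
print(best_threshold)
```

Other criteria (e.g. fixing a maximum acceptable false positive rate) lead to different thresholds; which one is right depends on the relative cost of the two error types.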

```
fpr, tpr, threshs = roc_curve(df_pred.survived, df_pred.predicted)
```

```
plt.figure(figsize=(8, 5))
plt.plot(fpr, tpr, color='navy', lw=1.7)
plt.plot(np.linspace(0, 1, 100), np.linspace(0, 1, 100), 'r--', lw=1.7)
plt.xlabel('False Positive rate', fontsize=14)
plt.ylabel('True Positive rate', fontsize=14)
plt.title('ROC curve', fontsize=14)
plt.show()
```

It is interesting to note from the Titanic dataset that no matter how much one trains a model, there is always a factor of luck which cannot be predicted. From a 3rd class man who pretended to be a woman to get onto a lifeboat to a 1st class woman who chose to stay with her husband aboard the Titanic: humans are fully capable of bending their fate.

### Suggestions for next steps:

Now that we have gone through this first application of *Feyn* to a binary classification problem, there are a few extra things one could try:

- changing the number of updating loops and epochs;
- fitting and selecting *graphs* according to different loss functions;
- setting the *graphs'* *max_depth* to higher or smaller values, such as 5 or 3;
- selecting the best *graph* according to the validation set (`valid`) instead of `train`.

A summary of the full set of steps is depicted below: from resetting the *QLattice* to naming the *registers* and the training itself. Feel free to explore and have fun changing this notebook as you wish! :D

```
# Resetting the QLattice
ql.reset()

# Assigning registers
in_regs = []
for var in train.columns:
    if type(train[var].iloc[0]) is str:
        in_regs.append(ql.get_register(var, register_type='cat'))
    else:
        in_regs.append(ql.get_register(var, register_type='fixed'))
out_reg = ql.get_register(target)

# Defining the number of loops/updates and epochs
nloops = 5
nepochs = 10

# Training
for loop in range(nloops):  # (5)
    qgraph = ql.get_qgraph(in_regs, out_reg, max_depth=4)  # (1)
    qgraph.fit(train, epochs=nepochs, loss_function='mean_squared_error', threads=4, show='graph')  # (2)
    best = qgraph.select(train, loss_function='mean_squared_error')[0]  # (3)
    ql.update(best)  # (4)
```