# A binary classification case

by: Meera Machado

Feyn version: 1.2.+

In this tutorial, we'll be using *Feyn* and the *QLattice* to solve a binary classification problem by exploring models that aim to predict the probability of surviving the disaster of the RMS Titanic during her maiden voyage in April of 1912.

```
import numpy as np
import pandas as pd
import feyn
from feyn.tools import plot_confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt
%matplotlib inline
```

`1.` Importing dataset and quick check-up

The Titanic passenger dataset was acquired through the data.world and Encyclopedia Titanica websites.

```
df = pd.read_csv('titanic.csv')
df.head()
```

| | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | NaN | St Louis, MO |
| 1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | NaN | Montreal, PQ / Chesterville, ON |
| 2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
| 3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | 135.0 | Montreal, PQ / Chesterville, ON |
| 4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |

`1.1` Dealing with missing data

```
# Checking which columns have nan values:
df.isna().any()
```

```
pclass False
survived False
name False
sex False
age True
sibsp False
parch False
ticket False
fare False
cabin True
embarked False
boat True
body True
home.dest True
dtype: bool
```
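To see how many values are missing in each column, rather than only whether any are missing, one can count them. The sketch below is a minimal illustration on a made-up frame (the name `toy` and its values are assumptions, not the Titanic data); the same two calls should work on `df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are illustrative only)
toy = pd.DataFrame({
    'age': [29.0, np.nan, 2.0, np.nan],
    'fare': [211.3, 7.9, 151.6, 7.9],
    'cabin': ['B5', None, 'C22', None],
})

# Number of missing entries per column, largest first
missing_counts = toy.isna().sum().sort_values(ascending=False)

# Fraction of rows that are missing in each column
missing_fraction = toy.isna().mean()

print(missing_counts)
print(missing_fraction)
```

Counts help decide whether a column is worth imputing (a few gaps) or dropping (mostly empty).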

Among all the features containing NaN values, `age` is the one of most interest.

```
df[df.age.isna()]
```

| | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 816 | 3 | 0 | Gheorgheff, Mr. Stanio | male | NaN | 0 | 0 | 349254 | 7.8958 | NaN | C | NaN | NaN | NaN |
| 940 | 3 | 0 | Kraeff, Mr. Theodor | male | NaN | 0 | 0 | 349253 | 7.8958 | NaN | C | NaN | NaN | NaN |

Note that the two gentlemen whose ages are missing share the same feature values (with the exception of `ticket`). In this case, we shall take a simple approach to guessing their ages: a random number between $\langle x \rangle - \sigma_x$ and $\langle x \rangle + \sigma_x$, where $\langle x \rangle$ and $\sigma_x$ are, respectively, the mean and standard deviation of `age` among all passengers sharing the same feature values.

```
age_dist = df[(df.pclass == 3) & (df.embarked == 'C') & (df.sex == 'male') &
(df.sibsp == 0) & (df.parch == 0) & (df.survived == 0)].age.dropna()
mean_age = np.mean(age_dist)
std_age = np.std(age_dist)
np.random.seed(42)
age_guess = np.random.uniform(mean_age - std_age, mean_age + std_age, size=2)
```

```
# In a simple manner, we drop some features which could be irrelevant (at first look)
df_mod = df.drop(['boat', 'body', 'home.dest', 'name', 'ticket', 'cabin'], axis=1)
df_mod.loc[df[df.age.isna()].index, 'age'] = age_guess
```
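The imputation idea used here (a uniform draw within one standard deviation of the group mean) can be wrapped in a small reusable function. The sketch below is only an illustration: `guess_within_one_std` is a hypothetical helper, the ages are made up, and it uses `np.random.default_rng` instead of the global `np.random.seed` call from the cell above.

```python
import numpy as np

def guess_within_one_std(values, size, seed=42):
    """Hypothetical helper: draw `size` values uniformly from [mean - std, mean + std]."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]          # ignore missing entries
    mean, std = values.mean(), values.std()     # population std, as np.std does by default
    rng = np.random.default_rng(seed)           # local generator, leaves global state alone
    return rng.uniform(mean - std, mean + std, size=size)

# Made-up ages for a group of comparable passengers
ages_toy = [22.0, 30.0, np.nan, 41.0, 27.0]
guesses = guess_within_one_std(ages_toy, size=2)
print(guesses)
```

A local generator keeps the draw reproducible without affecting other code that relies on NumPy's global random state.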

`2.` Training session

`2.1` Splitting data into *train*, *validation* and *holdout* sets:

We wish to make a prediction on the probability of surviving the Titanic sinking, so we set `survived` to be our *target* variable.

```
target = 'survived'
# Train and test
train, test = train_test_split(df_mod, test_size=0.4, random_state=42, stratify=df_mod[target])
# Validation and holdout:
valid, hold = train_test_split(test, test_size=0.4, stratify=test[target], random_state=42)
```
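Two consecutive splits with `test_size=0.4` leave roughly 60% of the rows for training, 24% for validation and 16% for holdout, and `stratify` keeps the survival rate about the same in each part. The sketch below checks this on a synthetic frame (the names `toy`, `train_t`, `valid_t` and `hold_t` and the data are assumptions made for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for df_mod: 1000 rows, 40% in the positive class
toy = pd.DataFrame({
    'feature': np.arange(1000),
    'survived': np.repeat([0, 1], [600, 400]),
})

# Same two-stage split as in the tutorial
train_t, test_t = train_test_split(toy, test_size=0.4, random_state=42,
                                   stratify=toy['survived'])
valid_t, hold_t = train_test_split(test_t, test_size=0.4, random_state=42,
                                   stratify=test_t['survived'])

# Share of the full data and positive-class rate in each part
for name, part in [('train', train_t), ('valid', valid_t), ('hold', hold_t)]:
    print(name, len(part) / len(toy), part['survived'].mean())
```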

`2.2` Pre-training time:

First we connect to a *QLattice* through its unique URL and API token. Since this is a local domain, neither a URL nor a token is necessary. A *QLattice* can also be reset, discarding everything it has learned, if the user wishes to begin from a clean slate.

```
ql = feyn.QLattice() # Connecting
ql.reset() # Resetting
```

Each feature from the dataset, including the target variable, interacts with the *QLattice* through *registers*. These are divided into `input` and `output` and accommodate `numerical` and `categorical` types of variables.

In the following cell, we assign the categorical features of `train` to what we call `semantic types`, or `stypes`. This mapping between the columns and their types tells the QLattice to use a categorical register. The default mapping is numerical and doesn't need to be stated, so only the categorical ones (`categorical`, `cat` or `c`) need to be specified.

```
stypes = {
    'pclass': 'c',
    'sex': 'c',
    'embarked': 'c'
}
```
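For wider datasets, writing this dictionary by hand gets tedious. A possible shortcut, sketched below with pandas only, is to mark every non-numeric column as categorical and add integer-coded categories such as `pclass` explicitly. The helper `infer_stypes` and the frame `toy` are hypothetical and not part of Feyn.

```python
import pandas as pd

# Toy frame with the tutorial's column names (values are illustrative only)
toy = pd.DataFrame({
    'pclass': [1, 3, 2],
    'sex': ['female', 'male', 'male'],
    'age': [29.0, 30.0, 25.0],
    'embarked': ['S', 'C', 'S'],
    'survived': [1, 0, 0],
})

def infer_stypes(frame, extra_categorical=()):
    """Hypothetical helper: map non-numeric columns, plus named extras, to 'c'."""
    stypes = {}
    for col in frame.columns:
        is_numeric = pd.api.types.is_numeric_dtype(frame[col])
        if not is_numeric or col in extra_categorical:
            stypes[col] = 'c'
    return stypes

# pclass is stored as an integer but is really a category, so it is named explicitly
stypes_toy = infer_stypes(toy, extra_categorical=('pclass',))
print(stypes_toy)
```

The explicit `extra_categorical` argument matters because a dtype check alone cannot tell an ordinal code like passenger class from a genuine number.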

After naming the *registers* it is possible to extract a *QGraph* from the *QLattice*. A *QGraph* has the input and output *registers* as parameters along with the *max_depth* option. The latter determines the maximum size of a *graph* inside the *QGraph*.

```
qgraph = ql.get_qgraph(train.columns, target, max_depth=3, stypes=stypes)
```

Simply put, a *QGraph* is a collection of *graphs*:

```
qgraph.head(3)
```

And each of these *graphs* is in fact a possible model for the problem at hand.

It should be stressed that no training has taken place yet. As one may have noticed, aside from the column names, no data has actually been fed into the *QLattice*!

`2.3` Actual training time

Training with the *QLattice* occurs in the following steps:

1. Extract a *QGraph* (as exemplified above);
2. Fit the training data to the *graphs* from the *QGraph*. One may:
   - Specify the number of epochs each *graph* should be trained for (using `n_samples` and choosing a sample size larger than the number of samples in the dataset);
   - Choose a loss function between *mean_squared_error*, *mean_absolute_error* and *categorical_cross_entropy*;
   - Set a number of threads for fitting the *graphs*;
   - Choose what should be displayed while training;
3. Sort the *QGraph* to extract the *graph* that performs best on a chosen dataset according to a loss function;
4. Update the *QLattice* with the *graph* selected above;
5. Repeat the process from step 1: now `get_qgraph` should return a *QGraph* with the chosen best *graph* and others with similar architecture.

```
# Defining the number of loops/updates and epochs
nloops = 5
nepochs = 10
qgraph = ql.get_qgraph(train.columns, target, max_depth=4, stypes=stypes) # (1)
# And training finally begins
for loop in range(nloops): # (5)
    qgraph.fit(train, n_samples=len(train)*nepochs, loss_function='mean_squared_error', threads=4, show='graph') # (2)
    best = qgraph.sort(train, loss_function='mean_squared_error')[0] # (3)
    ql.update(best) # (4)
```

The model above is the one that returns the smallest mean squared error loss on `train` among the fitted *graphs* in the *QGraph*.

Below we see the top 3 *graphs* that follow the same criteria. Note that they may have different sizes, distinct activation functions in their nodes and even different selection of input features!

```
qgraph.head(3)
```

`3.` Model evaluation

It is now time to verify the survival probabilities predicted by the top *graph* above. The *graph* object has a function called *predict*, which we call with the validation set `valid` as parameter. One could also call it with `train` or `hold`.

```
best_graph = qgraph.sort(train)[0]
df_pred = valid.copy()
df_pred['predicted'] = best_graph.predict(valid)
```

We then employ a set of metrics to evaluate the chosen model:

`Confusion matrix`

How many people were correctly classified as survivors (`True`) and victims (`False`)?

```
threshold = 0.5
y_pred = np.where(df_pred.predicted > threshold, True, False)
y_true = df_pred[target].astype(bool)
plt.figure(figsize=(8, 5))
plot_confusion_matrix(y_true, y_pred)
```
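The four counts behind a confusion matrix can also be computed directly, which makes it easy to derive accuracy, precision and recall. The sketch below uses scikit-learn's `confusion_matrix` on made-up labels and scores (the arrays `y_true_toy` and `scores_toy` are assumptions, not model output from this tutorial).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative labels and probability scores, not taken from the Titanic data
y_true_toy = np.array([1, 0, 1, 1, 0, 0, 1, 0])
scores_toy = np.array([0.9, 0.2, 0.4, 0.7, 0.6, 0.1, 0.8, 0.3])

# Same thresholding rule as in the tutorial
y_pred_toy = (scores_toy > 0.5).astype(int)

# With labels=[0, 1], rows are true classes and columns are predicted classes,
# so ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1]).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of those predicted to survive, how many did
recall = tp / (tp + fn)      # of those who survived, how many were found

print(tn, fp, fn, tp, accuracy, precision, recall)
```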

```
train_pred = best_graph.predict(train)
print('Overall training accuracy: %.4f' %np.mean(np.round(train_pred) == train[target]))
print('Overall validation accuracy: %.4f' %np.mean(y_pred == y_true))
```

```
Overall training accuracy: 0.7873
Overall validation accuracy: 0.8057
```

`Histograms of the probability scores`

The idea here consists of checking the probability distribution scores for the positive (survivors) and negative (non-survivors) classes.

```
def pos_neg_classes(y_true, y_pred):
    """Finds the probability distribution
    of the positive and negative classes"""
    # Hits and non-hits
    hits = y_pred[np.round(y_pred) == y_true]  # Scores where the prediction gets it right: TP and TN
    non_hits = y_pred[np.round(y_pred) != y_true]  # Scores where the prediction gets it wrong: FP and FN
    # From the whole set of predictions, what is actually from the positive and negative classes?
    pos_class = np.append(hits[np.round(hits) == 1], non_hits[np.round(non_hits) == 0])  # TP and FN
    neg_class = np.append(hits[np.round(hits) == 0], non_hits[np.round(non_hits) == 1])  # TN and FP
    return pos_class, neg_class


def plot_hist(pos_class, neg_class, title):
    """Plot the desired histograms"""
    plt.hist(neg_class, label='Negative Class', color='teal', ec='darkslategrey', lw=1.5)
    plt.hist(pos_class, label='Positive Class', color='crimson', ec='darkred', lw=1.5, alpha=0.7)
    plt.legend(loc='upper center', fontsize=12)
    plt.ylabel('Number of occurrences', fontsize=14)
    plt.xlabel('Probability Score', fontsize=14)
    plt.title(title, fontsize=14)


plt.figure(figsize=(18, 4))
plt.subplot(131)
# For all passengers
pos_class, neg_class = pos_neg_classes(df_pred.survived, df_pred.predicted)
plot_hist(pos_class, neg_class, title='Total probability distribution')
plt.subplot(132)
# For men only
pos_class, neg_class = pos_neg_classes(df_pred[df_pred.sex == 'male'].survived,
                                       df_pred[df_pred.sex == 'male'].predicted)
plot_hist(pos_class, neg_class, title='Probability distribution for males')
plt.subplot(133)
# For women only
pos_class, neg_class = pos_neg_classes(df_pred[df_pred.sex == 'female'].survived,
                                       df_pred[df_pred.sex == 'female'].predicted)
plot_hist(pos_class, neg_class, title='Probability distribution for females')
plt.show()
```

The histograms above give an initial idea of model performance through the probability score distributions. For instance, a naive approach to this problem would be to predict all women as survivors and all men as victims. Does that happen in this case?
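One way to make that comparison concrete is to score the naive rule itself and use its accuracy as a baseline. The sketch below does this on a small made-up table (the frame `toy` is an assumption, not the Titanic data); applied to `df_pred`, the same two lines would give a number to compare against the validation accuracy above.

```python
import pandas as pd

# Made-up passengers, for illustration only
toy = pd.DataFrame({
    'sex': ['female', 'male', 'male', 'female', 'male', 'female'],
    'survived': [1, 0, 0, 1, 1, 0],
})

# Naive rule: every woman survives, every man does not
baseline_pred = (toy['sex'] == 'female').astype(int)

# Fraction of passengers for whom the naive rule is right
baseline_accuracy = (baseline_pred == toy['survived']).mean()
print(baseline_accuracy)
```

A model whose accuracy barely exceeds this baseline has likely learned little beyond the `sex` feature.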

`Receiver Operating Characteristic (ROC) curve`

The ROC curve is yet another common tool for evaluating binary classification models. It helps in setting an optimal threshold for the probability scores. Additionally, it can be an auxiliary tool in model selection.

```
fpr, tpr, threshs = roc_curve(df_pred.survived, df_pred.predicted)
```

```
plt.figure(figsize=(8, 5))
plt.plot(fpr, tpr, color='navy', lw=1.7)
plt.plot(np.linspace(0, 1, 100), np.linspace(0, 1, 100), 'r--', lw=1.7)
plt.xlabel('False Positive rate', fontsize=14)
plt.ylabel('True Positive rate', fontsize=14)
plt.title('ROC curve', fontsize=14)
plt.show()
```
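As one possible way to turn the ROC curve into a threshold choice, the sketch below picks the threshold that maximises Youden's J statistic (true positive rate minus false positive rate) and also reports the area under the curve. This criterion is an assumption made for illustration, since the tutorial does not prescribe one, and the labels and scores are made up.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Illustrative labels and probability scores, not model output from this tutorial
y_true_toy = np.array([0, 0, 1, 1, 0, 1, 1, 0])
scores_toy = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.5])

fpr_t, tpr_t, thr_t = roc_curve(y_true_toy, scores_toy)

# Youden's J: vertical distance between the ROC curve and the diagonal
j_stat = tpr_t - fpr_t
best_idx = np.argmax(j_stat)
best_threshold = thr_t[best_idx]

# Area under the ROC curve: 0.5 is chance level, 1.0 is a perfect ranking
auc = roc_auc_score(y_true_toy, scores_toy)

print(best_threshold, auc)
```

Other criteria, such as fixing a maximum false positive rate, may suit better when the two kinds of error have different costs.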

It is interesting to note from the Titanic dataset that no matter how much one trains a model, there is always a factor of luck which cannot be predicted. From a 3rd class man who pretended to be a woman to get onto a lifeboat to a 1st class woman who chose to stay with her husband aboard the Titanic: humans are fully capable of bending their fate.

### Suggestions for next steps:

Now that we have gone through a first application of *Feyn* to a binary classification problem, there are a few extra things one could try:

- Changing the number of updating loops and `n_samples`;
- Fitting and selecting *graphs* according to different loss functions;
- Setting the *graphs*' `max_depth` to higher or smaller values, such as 5 or 3;
- Sorting the *QGraph* to get the best *graph* according to the validation set (`valid`) instead of `train`.
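To get a feel for how the choice of loss function changes what counts as a good prediction, the sketch below computes textbook versions of three losses with NumPy on made-up scores. These are generic definitions written for illustration, with binary cross-entropy standing in for the cross-entropy option; they are not Feyn's internal implementation.

```python
import numpy as np

# Illustrative labels and predicted probabilities
y = np.array([1, 0, 1, 0])
p = np.array([0.9, 0.1, 0.6, 0.4])

def mean_squared_error(y, p):
    return np.mean((y - p) ** 2)

def mean_absolute_error(y, p):
    return np.mean(np.abs(y - p))

def binary_cross_entropy(y, p, eps=1e-12):
    # Clip to avoid log(0) for scores of exactly 0 or 1
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

mse = mean_squared_error(y, p)
mae = mean_absolute_error(y, p)
bce = binary_cross_entropy(y, p)
print(mse, mae, bce)
```

Cross-entropy penalises confident mistakes much more heavily than the squared or absolute error, so it tends to favour models with well-calibrated probabilities.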

A summary of the full set of steps is shown below: from resetting the QLattice, to naming the registers and the training itself. Feel free to explore and have fun by changing this notebook as you wish! :D

```
# Resetting QLattice
ql.reset()
# Mark categorical registers
stypes = {
    'pclass': 'c',
    'sex': 'c',
    'embarked': 'c'
}
# Defining the number of loops/updates and epochs
nloops = 5
nepochs = 10
# Training
for loop in range(nloops): # (5)
    qgraph = ql.get_qgraph(train.columns, target, max_depth=4, stypes=stypes) # (1)
    qgraph.fit(train, n_samples=len(train)*nepochs, loss_function='mean_squared_error', threads=4, show='graph') # (2)
    best = qgraph.sort(train, loss_function='mean_squared_error')[0] # (3)
    ql.update(best) # (4)
```