by: Meera Machado
In this tutorial, we'll be using `Feyn` and the QLattice to solve a binary classification problem, exploring models that aim to predict the probability of surviving the disaster of the RMS Titanic during her maiden voyage in April 1912.
```python
import numpy as np
import pandas as pd
import feyn
from feyn.tools import plot_confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt
%matplotlib inline
```
1. Importing dataset and quick check-up
```python
df = pd.read_csv('titanic.csv')
df.head()
```
|   | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|--------|----------|------|-----|-----|-------|-------|--------|------|-------|----------|------|------|-----------|
| 0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | NaN | St Louis, MO |
| 1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | NaN | Montreal, PQ / Chesterville, ON |
| 2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
| 3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | 135.0 | Montreal, PQ / Chesterville, ON |
| 4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
1.1 Dealing with missing data
```python
# Checking which columns have NaN values:
df.isna().any()
```
```
pclass       False
survived     False
name         False
sex          False
age           True
sibsp        False
parch        False
ticket       False
fare         False
cabin         True
embarked     False
boat          True
body          True
home.dest     True
dtype: bool
```
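If you want counts rather than booleans, `isna().sum()` gives the number of missing values per column. A minimal sketch on a toy frame (the toy data is an assumption, not the Titanic dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing age
toy = pd.DataFrame({'age': [29.0, np.nan, 2.0], 'fare': [211.3, 151.5, 151.5]})
print(toy.isna().sum())  # number of NaN values per column
```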
Among all the features containing NaN values, `age` is the one of most interest. Let's take a look at the rows where it is missing:
|     | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|-----|--------|----------|------|-----|-----|-------|-------|--------|------|-------|----------|------|------|-----------|
| 816 | 3 | 0 | Gheorgheff, Mr. Stanio | male | NaN | 0 | 0 | 349254 | 7.8958 | NaN | C | NaN | NaN | NaN |
| 940 | 3 | 0 | Kraeff, Mr. Theodor | male | NaN | 0 | 0 | 349253 | 7.8958 | NaN | C | NaN | NaN | NaN |
Note that the only gentlemen whose ages are missing share the same feature values (with the exception of `ticket`). In this case, we shall take a simple approach to guessing their ages: a random number between μ − σ and μ + σ, where μ and σ are, respectively, the mean age and standard deviation of all people sharing the same feature values.
```python
age_dist = df[(df.pclass == 3) & (df.embarked == 'C') & (df.sex == 'male')
              & (df.sibsp == 0) & (df.parch == 0) & (df.survived == 0)].age.dropna()
mean_age = np.mean(age_dist)
std_age = np.std(age_dist)

np.random.seed(42)
age_guess = np.random.uniform(mean_age - std_age, mean_age + std_age, size=2)
```
```python
# In a simple manner, we drop some features which could be irrelevant (at first look)
df_mod = df.drop(['boat', 'body', 'home.dest', 'name', 'ticket', 'cabin'], axis=1)
df_mod.loc[df[df.age.isna()].index, 'age'] = age_guess
```
2. Training session
2.1 Splitting data into train, validation and holdout sets:
We wish to predict the probability of surviving the Titanic sinking, so we set `survived` as our target variable.
```python
target = 'survived'

# Train and test
train, test = train_test_split(df_mod, test_size=0.4, random_state=42, stratify=df_mod[target])

# Validation and holdout:
valid, hold = train_test_split(test, test_size=0.4, stratify=test[target], random_state=42)
```
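The `stratify` argument keeps the class balance of the target (roughly) equal across the splits. A small self-contained check with toy labels (an assumption, not the Titanic data), chosen so the proportions divide evenly:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labels: 50% positives
toy = pd.DataFrame({'survived': [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]})
tr, te = train_test_split(toy, test_size=0.4, random_state=0, stratify=toy['survived'])

# Both splits preserve the 50% survival rate
print(tr.survived.mean(), te.survived.mean())
```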
Here we make sure that `pclass` ends up labelled as a categorical variable:
```python
train['pclass'] = train.pclass.astype(str)
```
```
/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
```
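The SettingWithCopyWarning comes from pandas' view-versus-copy semantics when assigning into a slice of another DataFrame; it is harmless here, but one way to avoid it is to take an explicit `.copy()` before assigning. A minimal sketch with toy data (an assumption, not the tutorial's frame):

```python
import pandas as pd

# Hypothetical toy frame
df_toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# An explicit .copy() makes the later assignment unambiguous,
# so pandas raises no SettingWithCopyWarning
sub = df_toy[df_toy.a > 1].copy()
sub['a'] = sub.a.astype(str)
print(sub.a.tolist())  # ['2', '3']
```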
2.2 Pre-training time:
First we connect to a QLattice through its unique URL and API token. Since this is a local domain, no URL or token is necessary. A QLattice can also be reset of all its learnings, if the user wishes to begin from a clean slate.
```python
ql = feyn.QLattice()  # Connecting
ql.reset()            # Resetting
```
Each feature from the dataset, including the target variable, interacts with the QLattice through registers. These are divided into input and output registers and accommodate both numerical and categorical types of variables.
In the following cell, we assign the features of `train` to input and output registers by passing their name strings as a parameter to the function `get_register`. For the input registers, one should indicate the variable type as `register_type='fixed'` for numerical or `'cat'` for categorical; the default is `'fixed'`.
```python
in_regs = []  # Begins the list of input registers

print('Categorical variables:')
for var in train.columns:
    if type(train[var].iloc[0]) is str:  # Hence the need to set 'pclass' as str type
        print(var)
        in_regs.append(ql.get_register(var, register_type='cat'))
    else:
        in_regs.append(ql.get_register(var, register_type='fixed'))

# The output register only allows for the continuous type, since the output is a probability
out_reg = ql.get_register(target)
```
```
Categorical variables:
pclass
sex
embarked
```
After naming the registers it is possible to extract a QGraph from the QLattice. A QGraph has the input and output registers as parameters along with the max_depth option. The latter determines the maximum size of a graph inside the QGraph.
```python
qgraph = ql.get_qgraph(in_regs, out_reg, max_depth=3)
```
Simply put, a QGraph is a collection of graphs:
And each of these graphs is in fact a possible model for the problem at hand.
It should be stressed that no training has taken place yet. As one may have noticed, aside from the column names, no data has actually been fed into the QLattice!
2.3 Actual training time
Training with the QLattice occurs in the following steps:
- Extract a QGraph (as exemplified above);
- Fit the training data to the graphs from the QGraph. One may:
- Specify the number of epochs each graph should be trained for (forward propagation and gradient descent);
- Choose a loss function among mean_squared_error, mean_absolute_error and categorical_cross_entropy;
- Set a number of threads for fitting the graphs;
- Choose what should be displayed while training;
- Select the graph that performs best on a chosen dataset according to a loss function and other optional criteria, such as depth;
- Update the QLattice with the graph selected above;
- Repeat the process from step 1: now `get_qgraph` should return a QGraph with the chosen best graph and others with similar architecture.
```python
# Defining the number of loops/updates and epochs
nloops = 5
nepochs = 10

# And training finally begins
for loop in range(nloops):                                         # (5)
    qgraph = ql.get_qgraph(in_regs, out_reg, max_depth=4)          # (1)
    qgraph.fit(train, epochs=nepochs, loss_function='mean_squared_error',
               threads=4, show='graph')                            # (2)
    best = qgraph.select(train, loss_function='mean_squared_error')  # (3)
    ql.update(best)                                                # (4)
```
Note: the text "Examined n of N." indicates the number of graphs (n) examined out of their total number (N) in the QGraph.
The model above is the one that returns the smallest mean squared error loss when `train` is fit to the graphs in the QGraph.
Below we see the top 3 graphs that follow the same criteria. Note that they may have different sizes, distinct activation functions in their nodes and even different selection of input features!
3. Model evaluation
It is now time to verify the survival probabilities predicted by the top graph above. The graph object has a function called `predict`, which we call with the validation set `valid` as parameter. One could also call it with the holdout set `hold`.
```python
best_graph = qgraph.select(train)

df_pred = valid.copy()
df_pred['predicted'] = best_graph.predict(valid)
```
We then employ a set of metrics to evaluate the chosen model:
How many people were correctly classified as survivors (`True`) and victims (`False`)?
```python
threshold = 0.5
y_pred = np.where(df_pred.predicted > threshold, True, False)
y_true = df_pred[target].astype(bool)

plt.figure(figsize=(8, 5))
plot_confusion_matrix(y_true, y_pred)
```
```python
train_pred = best_graph.predict(train)
print('Overall training accuracy: %.4f' % np.mean(np.round(train_pred) == train[target]))
print('Overall validation accuracy: %.4f' % np.mean(y_pred == y_true))
```
```
Overall training accuracy: 0.8038
Overall validation accuracy: 0.7803
```
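Overall accuracy can hide per-class differences; scikit-learn's `classification_report` breaks precision and recall down per class. A toy sketch (the labels below are made up for illustration, not the model's output):

```python
from sklearn.metrics import classification_report

# Hypothetical true labels and thresholded predictions
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]
print(classification_report(y_true, y_pred, target_names=['victim', 'survivor']))
```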
Histograms of the probability scores
The idea here consists of checking the probability distribution scores for the positive (survivors) and negative (non-survivors) classes.
```python
def pos_neg_classes(y_true, y_pred):
    """Finds the probability distribution of the positive and negative classes"""
    # Hits and non-hits
    hits = y_pred[np.round(y_pred) == y_true]      # Where the prediction gets it right: TP and TN
    non_hits = y_pred[np.round(y_pred) != y_true]  # Where the prediction gets it wrong: FP and FN

    # From the whole set of predictions, what actually belongs to the positive and negative classes?
    pos_class = np.append(hits[np.round(hits) == 1], non_hits[np.round(non_hits) == 0])  # TP and FN
    neg_class = np.append(hits[np.round(hits) == 0], non_hits[np.round(non_hits) == 1])  # TN and FP

    return pos_class, neg_class


def plot_hist(pos_class, neg_class, title):
    """Plots the desired histograms"""
    plt.hist(neg_class, label='Negative Class', color='teal', ec='darkslategrey', lw=1.5)
    plt.hist(pos_class, label='Positive Class', color='crimson', ec='darkred', lw=1.5, alpha=0.7)
    plt.legend(loc='upper center', fontsize=12)
    plt.ylabel('Number of occurrences', fontsize=14)
    plt.xlabel('Probability Score', fontsize=14)
    plt.title(title, fontsize=14)


plt.figure(figsize=(18, 4))

plt.subplot(131)  # For all passengers
pos_class, neg_class = pos_neg_classes(df_pred.survived, df_pred.predicted)
plot_hist(pos_class, neg_class, title='Total probability distribution')

plt.subplot(132)  # For men only
pos_class, neg_class = pos_neg_classes(df_pred[df_pred.sex == 'male'].survived,
                                       df_pred[df_pred.sex == 'male'].predicted)
plot_hist(pos_class, neg_class, title='Probability distribution for males')

plt.subplot(133)  # For women only
pos_class, neg_class = pos_neg_classes(df_pred[df_pred.sex == 'female'].survived,
                                       df_pred[df_pred.sex == 'female'].predicted)
plot_hist(pos_class, neg_class, title='Probability distribution for females')

plt.show()
```
The histograms above give an initial idea of model performance through the probability score distributions. For instance, a naive approach to this problem would be to predict all women as survivors and all men as victims. Does that happen in this case?
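The naive sex-based baseline mentioned above is easy to compute. A minimal sketch with made-up rows (the real check would use `df_pred`):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the prediction frame (values are assumptions, not the tutorial's data)
toy = pd.DataFrame({
    'sex': ['male', 'female', 'female', 'male', 'female'],
    'survived': [0, 1, 0, 1, 1],
})

naive_pred = (toy.sex == 'female')  # predict: all women survive, all men perish
naive_acc = np.mean(naive_pred == toy.survived.astype(bool))
print(f'Naive baseline accuracy: {naive_acc:.4f}')  # 0.6000 on this toy data
```

Comparing this baseline against the model's validation accuracy tells you how much the model has learned beyond the sex feature alone.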
Receiver Operating Characteristic (ROC) curve
Yet another common metric for evaluating binary classification models. It helps in setting an optimal threshold for the probability scores and can also serve as an auxiliary tool in model selection.
```python
fpr, tpr, threshs = roc_curve(df_pred.survived, df_pred.predicted)

plt.figure(figsize=(8, 5))
plt.plot(fpr, tpr, color='navy', lw=1.7)
plt.plot(np.linspace(0, 1, 100), np.linspace(0, 1, 100), 'r--', lw=1.7)
plt.xlabel('False Positive rate', fontsize=14)
plt.ylabel('True Positive rate', fontsize=14)
plt.title('ROC curve', fontsize=14)
plt.show()
```
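One common way to read a threshold off the ROC curve is Youden's J statistic, which picks the point maximising `tpr - fpr`. A self-contained sketch with toy scores (in the tutorial one would pass `df_pred.survived` and `df_pred.predicted` instead):

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical labels and probability scores
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, threshs = roc_curve(y_true, scores)
best_idx = np.argmax(tpr - fpr)  # Youden's J statistic (first maximiser in case of ties)
print('Optimal threshold:', threshs[best_idx])
```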
It is interesting to note from the Titanic dataset that no matter how much one trains a model, there is always a factor of luck which cannot be predicted. From a 3rd class man who pretended to be a woman to get onto a lifeboat to a 1st class woman who chose to stay with her husband aboard the Titanic: humans are fully capable of bending their fate.
Suggestions for next steps:
Now that we've gone through a first exploration of `Feyn` applied to a binary classification problem, there are a few extra things one could try:
- Changing the number of updating loops and epochs;
- Fitting and selecting graphs according to different loss functions;
- Setting the graphs' max_depth to larger or smaller values, such as 5 or 3;
- Selecting the best graph according to the validation set (`valid`) instead of the training set (`train`).
A summary of the full set of steps is depicted below: from resetting the QLattice, to naming the registers and the training itself. Feel free to explore and have fun by changing this notebook as you wish! :D
```python
# Resetting QLattice
ql.reset()

# Assigning registers
in_regs = []
for var in train.columns:
    if type(train[var].iloc[0]) is str:
        in_regs.append(ql.get_register(var, register_type='cat'))
    else:
        in_regs.append(ql.get_register(var, register_type='fixed'))
out_reg = ql.get_register(target)

# Defining the number of loops/updates and epochs
nloops = 5
nepochs = 10

# Training
for loop in range(nloops):                                         # (5)
    qgraph = ql.get_qgraph(in_regs, out_reg, max_depth=4)          # (1)
    qgraph.fit(train, epochs=nepochs, loss_function='mean_squared_error',
               threads=4, show='graph')                            # (2)
    best = qgraph.select(train, loss_function='mean_squared_error')  # (3)
    ql.update(best)                                                # (4)
```