Analysing and selecting hypotheses
by: Chris Cave & Meera Machado
In this section we lay out the set of tools to analyse and aid in the selection of the most interesting hypotheses. On the previous page we posed the question of how mean area combines with mean concave points to predict the target variable. We then generated a list of hypotheses, the QGraph, that could answer that question. Lastly, we selected qgraph[0] as the hypothesis to investigate further. This process is shown below:
import sklearn.datasets
import pandas as pd
import feyn
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
breast_cancer = sklearn.datasets.load_breast_cancer()
input_columns = breast_cancer.feature_names
# Load into a pandas dataframe
data = pd.DataFrame(breast_cancer.data, columns=input_columns)
data['target'] = pd.Series(breast_cancer.target)
# Split into train, validation and holdout
train, valid = train_test_split(data, test_size = 0.4, stratify = data['target'], random_state = 42)
valid, holdo = train_test_split(valid, test_size = 0.5, stratify = valid['target'], random_state = 42)
# Pose a question to QGraph
ql = feyn.QLattice()
qgraph = ql.get_classifier(['mean area', 'mean concave points'], output='target', max_depth=1)
# Fit the QGraph and update the QLattice with the best graph found so far
for _ in range(20):
    qgraph.fit(train, threads=4)
    ql.update(qgraph.best())
# Select the top-ranked hypothesis for further investigation
hypo_mean_area_conc_points = qgraph[0]
Our hypothesis states that the target variable is a linear function of mean area and mean concave points. We can convert this graph to a mathematical equation using SymPy:
hypo_mean_area_conc_points.sympify(signif = 3)
As this is a classification problem, the linear relationship is passed through a logistic function to obtain values between 0 and 1. This represents the probability of a tumor being benign.
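To make that mapping explicit, here is a minimal sketch of the functional form: a logistic (sigmoid) applied to a weighted sum of the two features. The coefficients w0, w1 and w2 below are placeholders, not the values fitted by the graph; the fitted values can be read off the sympified expression.
import numpy as np

def logistic(z):
    # Maps any real number to the interval (0, 1)
    return 1 / (1 + np.exp(-z))

# Placeholder coefficients -- substitute the values from the sympified expression
w0, w1, w2 = 0.0, -1.0, -1.0

def predict_benign_probability(mean_area, mean_concave_points):
    # Linear combination of the two features passed through the logistic function
    return logistic(w0 + w1 * mean_area + w2 * mean_concave_points)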
Analysing a hypothesis
Let's check the performance of the hypothesis above by plotting the ROC curve on the train and validation sets. This can also tell us about overfitting, especially if their AUC scores differ significantly.
hypo_mean_area_conc_points.plot_roc_curve(train, label = 'train')
hypo_mean_area_conc_points.plot_roc_curve(valid, label = 'valid')
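To put numbers on that comparison, we can compute the AUC scores directly with scikit-learn. This is only a sketch; it assumes that predict returns probability scores for the positive class.
from sklearn.metrics import roc_auc_score

# A large gap between train and validation AUC would suggest overfitting
auc_train = roc_auc_score(train['target'], hypo_mean_area_conc_points.predict(train))
auc_valid = roc_auc_score(valid['target'], hypo_mean_area_conc_points.predict(valid))
print(f"AUC train: {auc_train:.3f}, AUC valid: {auc_valid:.3f}")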
The ROC curve above tells us that the hypothesis' predictions are not just the result of random guessing. A ROC curve can be complemented by plotting the probability scores, i.e. the values predicted by our current hypothesis.
from feyn.__future__.contrib.inspection import plot_probability_scores
# Observed targets and predicted probability scores on the training set
y_train_true = train['target'].copy()
y_train_pred = hypo_mean_area_conc_points.predict(train)
plot_probability_scores(y_train_true, y_train_pred, title='training set')
The higher the AUC score, the easier it will be to separate the negative (target = 0) and positive (target = 1) classes in the plot below.
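If plot_probability_scores is unavailable in your feyn version, a similar view can be sketched directly with matplotlib by histogramming the predicted scores per class. Again, this assumes y_train_pred holds probabilities for the positive class.
# Split the predicted probability scores by observed class
mask_pos = (y_train_true == 1).to_numpy()
plt.hist(y_train_pred[~mask_pos], bins=20, range=(0, 1), alpha=0.6, label='target = 0 (malignant)')
plt.hist(y_train_pred[mask_pos], bins=20, range=(0, 1), alpha=0.6, label='target = 1 (benign)')
plt.xlabel('predicted probability of benign')
plt.ylabel('count')
plt.title('Probability scores, training set')
plt.legend()
plt.show()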
We can also visualise the hypothesis with a two-dimensional partial plot:
hypo_mean_area_conc_points.plot_partial2d(train)
The red dots above represent the positive class, benign tumors, while the blue ones represent the negative class, malignant tumors. The background colours represent what our hypothesis predicts: the yellow regions as 1 and the grey regions as 0. Since the hypothesis is a linear function, it forms a straight boundary between the red and blue dots, and the ROC curve above shows that this is a good separation.
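For reference, a similar picture can be sketched by hand: evaluate the hypothesis on a grid over the two features and overlay the training points. This sketch assumes that predict accepts a DataFrame containing only the two input columns and returns an array of probabilities.
import numpy as np

# Grid covering the observed range of the two features
xs = np.linspace(train['mean area'].min(), train['mean area'].max(), 200)
ys = np.linspace(train['mean concave points'].min(), train['mean concave points'].max(), 200)
xx, yy = np.meshgrid(xs, ys)
grid = pd.DataFrame({'mean area': xx.ravel(), 'mean concave points': yy.ravel()})

# Predicted probability of the benign class over the grid
zz = hypo_mean_area_conc_points.predict(grid).reshape(xx.shape)

plt.contourf(xx, yy, zz, levels=20, alpha=0.7)
plt.scatter(train['mean area'], train['mean concave points'], c=train['target'], cmap='coolwarm', s=10)
plt.xlabel('mean area')
plt.ylabel('mean concave points')
plt.colorbar(label='predicted probability of benign')
plt.show()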
How can we point to the feature values the model has difficulty classifying? This is where we use the plot_segmented_loss method.
# Average loss segmented by each of the two input features
fig = plt.figure(figsize=(20, 6))
ax = fig.add_subplot(121)
hypo_mean_area_conc_points.plot_segmented_loss(train, by='mean area', ax=ax)
ax = fig.add_subplot(122)
hypo_mean_area_conc_points.plot_segmented_loss(train, by='mean concave points', ax=ax)
In the left diagram the blue bars are the histogram of mean area. On top of the histogram is a pink curve where each point is the average loss across that bin. For mean area values greater than 1250 the average loss is very close to zero, while the loss is much higher between the values 500 and 750. This means the hypothesis yields predictions closer to the observed data when mean area is greater than 1250 than when it lies between 500 and 750. We can say something similar about mean concave points.
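To see roughly where these per-bin numbers come from, the same idea can be sketched with pandas: compute a per-sample loss and average it within bins of mean area. The binary cross-entropy loss and the number of bins below are assumptions; plot_segmented_loss may use a different loss and binning.
import numpy as np

# Per-sample binary cross-entropy between observed target and predicted probability
eps = 1e-12
p = np.clip(hypo_mean_area_conc_points.predict(train), eps, 1 - eps)
loss = -(train['target'] * np.log(p) + (1 - train['target']) * np.log(1 - p))

# Average the loss within bins of 'mean area'
bins = pd.cut(train['mean area'], bins=20)
print(loss.groupby(bins).mean())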
Concluding remarks
Let's go back to our overarching question: 'What is evidence for a malignant or benign tumor?'. Throughout the sessions, we refined it to a question of how mean area and mean concave points predict the target variable. At last we reached the hypothesis that the probability of having a benign tumor is given by the linear function of mean area and mean concave points shown above, passed through a logistic function.
In addition to testing this hypothesis with more data, such as the holdout set or newly gathered data, we can refine the question, come up with new hypotheses, and so on. In other words, our investigation of this dataset doesn't need to stop here. In the next session, we discuss possible next steps.