In this section we lay out the tools for analysing hypotheses and selecting the most interesting ones. On the previous page we posed the question of how mean area combines with mean concave points to predict the target variable. We then generated a list of hypotheses, the QGraph, that could answer that question. Lastly, we selected qgraph as the hypothesis to investigate further. This process is shown below:
import sklearn.datasets
import pandas as pd
import feyn
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

breast_cancer = sklearn.datasets.load_breast_cancer()
input_columns = breast_cancer.feature_names

# Load into a pandas dataframe
data = pd.DataFrame(breast_cancer.data, columns=input_columns)
data['target'] = pd.Series(breast_cancer.target)

# Split into train, validation and holdout
train, valid = train_test_split(data, test_size = 0.4, stratify = data['target'], random_state = 42)
valid, holdo = train_test_split(valid, test_size = 0.5, stratify = valid['target'], random_state = 42)

# Pose a question to QGraph
ql = feyn.QLattice()
qgraph = ql.get_classifier(['mean area', 'mean concave points'], output='target', max_depth=1)

for _ in range(20):
    qgraph.fit(train, threads=4)
    ql.update(qgraph.best())

hypo_mean_area_conc_points = qgraph
Our hypothesis states that the target variable is a linear function of mean area and mean concave points. We can convert this graph to a mathematical equation using:
hypo_mean_area_conc_points.sympify(signif = 3)
As this is a classification problem, the linear relationship is passed to a logistic function to obtain values between 0 and 1. This value represents the probability of a tumor being benign.
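To make this last step concrete, here is a minimal sketch of a linear function of the two features passed through a logistic function. The coefficients are made-up placeholders chosen for illustration, not the values of the fitted hypothesis, and predict_probability is a hypothetical helper name rather than part of feyn.

```python
import math


def logistic(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(mean_area, mean_concave_points, w_area, w_concave, bias):
    """Linear combination of the two features, squashed by the logistic function."""
    linear_term = w_area * mean_area + w_concave * mean_concave_points + bias
    return logistic(linear_term)


# Hypothetical coefficients, chosen only for illustration (NOT the fitted values).
W_AREA, W_CONCAVE, BIAS = -0.01, -60.0, 10.0

p_small = predict_probability(400.0, 0.02, W_AREA, W_CONCAVE, BIAS)
p_large = predict_probability(1300.0, 0.12, W_AREA, W_CONCAVE, BIAS)
print(f"small tumor: {p_small:.3f}, large tumor: {p_large:.3f}")
```

With these placeholder signs, larger values of either feature push the linear term down and therefore lower the probability of the benign class. Whatever the coefficients, the output always stays strictly between 0 and 1.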
Analysing a hypothesis
Let's check the performance of the hypothesis above by plotting the ROC curve on the train and validation sets. This can also reveal overfitting, especially if the two AUC scores differ significantly.
hypo_mean_area_conc_points.plot_roc_curve(train, label = 'train')
hypo_mean_area_conc_points.plot_roc_curve(valid, label = 'valid')
The ROC curve above tells us that the hypothesis' predictions are not just the result of random guessing. A ROC curve can be complemented by plotting the probability scores, i.e. the values predicted by our current hypothesis.
from feyn.__future__.contrib.inspection import plot_probability_scores

y_train_true = train['target'].copy()
y_train_pred = hypo_mean_area_conc_points.predict(train)

plot_probability_scores(y_train_true, y_train_pred, title='training set')
The higher the AUC score, the easier it will be to separate the negative (target = 0) and positive (target = 1) classes in the plot below.
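The link between AUC and class separation can be illustrated with a small sketch. It relies on the standard interpretation of AUC as the fraction of (positive, negative) pairs in which the positive sample receives the higher score. The labels and scores below are toy values invented for illustration, and auc_score is a hypothetical helper, not a feyn function.

```python
def auc_score(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    if not positives or not negatives:
        raise ValueError("Both classes must be present to compute the AUC.")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Toy labels and probability scores, invented for illustration.
y_true = [0, 0, 0, 1, 1, 1]
well_separated = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
overlapping = [0.1, 0.6, 0.7, 0.4, 0.5, 0.9]

print(auc_score(y_true, well_separated))
print(auc_score(y_true, overlapping))
```

When the two score distributions do not overlap, every pair is ranked correctly and the AUC reaches its maximum. The more the distributions overlap, the closer the AUC falls towards 0.5, the value expected from random guessing.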
We can also visualise the hypothesis with a two-dimensional partial plot:
The red dots above represent the positive class, benign tumors, while the blue ones represent the negative class, malignant tumors. The background colours represent what our hypothesis predicts: the yellow regions as 1, and the grey regions as 0. Since the hypothesis is a linear function, it forms a straight boundary between the red and blue dots, and the ROC curve above shows that this is a good separation.
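The straight boundary follows directly from the form of the hypothesis. As a short derivation, write the hypothesis as a logistic function of a linear term, where x_1 is mean area, x_2 is mean concave points, and w_1, w_2 and b stand in for the fitted coefficients. These symbols are notation introduced here, and the 0.5 threshold is an assumption about how probabilities are turned into class predictions:

```latex
\begin{align}
p(x_1, x_2) &= \frac{1}{1 + e^{-(w_1 x_1 + w_2 x_2 + b)}} \\
p(x_1, x_2) = \tfrac{1}{2} \;&\iff\; w_1 x_1 + w_2 x_2 + b = 0
\end{align}
```

The second line is the equation of a straight line in the (x_1, x_2) plane. Points on one side of it receive a probability above 0.5 and points on the other side a probability below 0.5, which is why the two background regions meet along a line.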
How can we pinpoint the feature values the model has difficulty classifying? This is where we use the segmented loss plot:
fig = plt.figure(figsize=(20,6))

ax = fig.add_subplot(121)
hypo_mean_area_conc_points.plot_segmented_loss(train, by='mean area', ax=ax)

ax = fig.add_subplot(122)
hypo_mean_area_conc_points.plot_segmented_loss(train, by='mean concave points', ax=ax)
In the left diagram, the blue bars are the histogram of mean area. On top of the histogram is a pink curve where each point is the average loss across that bin. For mean area values greater than 1250, the average loss is very close to zero. Meanwhile, the loss is much higher between the values 500 and 750. This means the hypothesis yields predictions closer to the observed data when mean area is greater than 1250 than when mean area lies between 500 and 750. We can say something similar about mean concave points.
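The idea behind the pink curve can be sketched in a few lines: bin the samples by one feature, then average the loss inside each bin. The sketch assumes binary cross-entropy as the loss, which may differ from what plot_segmented_loss uses internally. The data, the bin edges, and the helper names are all invented for illustration.

```python
import math


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Log loss of one prediction; eps guards against log(0)."""
    p = min(max(p_pred, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


def segmented_loss(feature, y_true, p_pred, bin_edges):
    """Average loss per bin; bin i covers [bin_edges[i], bin_edges[i + 1])."""
    totals = [0.0] * (len(bin_edges) - 1)
    counts = [0] * (len(bin_edges) - 1)
    for x, y, p in zip(feature, y_true, p_pred):
        for i in range(len(bin_edges) - 1):
            if bin_edges[i] <= x < bin_edges[i + 1]:
                totals[i] += binary_cross_entropy(y, p)
                counts[i] += 1
                break
    # Empty bins have no defined average, so they are reported as None.
    return [t / c if c else None for t, c in zip(totals, counts)]


# Toy data, invented for illustration: uncertain predictions in the middle bin.
mean_area = [300.0, 450.0, 550.0, 700.0, 1300.0, 1500.0]
target = [1, 1, 1, 0, 0, 0]
predicted = [0.95, 0.90, 0.55, 0.45, 0.02, 0.01]
edges = [250.0, 500.0, 750.0, 1250.0, 2000.0]

losses = segmented_loss(mean_area, target, predicted, edges)
for low, high, loss in zip(edges[:-1], edges[1:], losses):
    print(f"[{low:.0f}, {high:.0f}): {loss}")
```

In this toy example the bin between 500 and 750 has the highest average loss because the predicted probabilities there sit close to 0.5, while the confident and correct predictions above 1250 give a loss near zero.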
Let's go back to our overarching question: 'What is evidence for a malignant or benign tumor?'. Throughout the sessions, we refined it to a question of how mean area and mean concave points predict the target variable. At last, we reached the hypothesis that the probability of having a benign tumor is a logistic function of the linear combination of mean area and mean concave points shown above.
In addition to testing this hypothesis with more data, such as the holdout set or newly gathered data, we can refine the question, come up with new hypotheses, and so on. In other words, our investigation of this dataset doesn't need to stop now. In the next session, we discuss possible next steps.