Feyn

Analysing and selecting hypotheses

by: Chris Cave & Meera Machado

In this section we lay out the tools for analysing hypotheses and selecting the most interesting ones. On the previous page we asked how mean area combines with mean concave points to predict the target variable. We then generated a list of hypotheses, the QGraph, that could answer that question. Lastly, we selected qgraph[0] as the hypothesis to investigate further. This process is shown below:

import sklearn.datasets
import pandas as pd
import feyn
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

breast_cancer = sklearn.datasets.load_breast_cancer()
input_columns = breast_cancer.feature_names

# Load into a pandas dataframe
data = pd.DataFrame(breast_cancer.data, columns=input_columns)
data['target'] = pd.Series(breast_cancer.target)

# Split into train, validation and holdout
train, valid = train_test_split(data, test_size = 0.4, stratify = data['target'], random_state = 42)
valid, holdo = train_test_split(valid, test_size = 0.5, stratify = valid['target'], random_state = 42)

# Pose a question to the QLattice and get a QGraph of hypotheses back
ql = feyn.QLattice()
qgraph = ql.get_classifier(['mean area', 'mean concave points'], output='target', max_depth=1)

for _ in range(20):
    qgraph.fit(train, threads=4)
    ql.update(qgraph.best())

hypo_mean_area_conc_points = qgraph[0]

Hypothesis1

Our hypothesis states that the target variable is a linear function of mean area and mean concave points. We can convert this graph to a mathematical equation using SymPy:

hypo_mean_area_conc_points.sympify(signif = 3)

Sympify

As this is a classification problem, the linear relationship is passed to a logistic function to obtain values between 0 and 1. The output represents the probability of a tumor being benign.
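To make the role of the logistic function concrete, here is a minimal sketch in plain Python. The coefficients W_AREA, W_CONCAVE and BIAS are hypothetical values chosen for illustration; they are not the ones found by the QLattice, and the helper names are our own rather than part of Feyn.

```python
import math


def logistic(z):
    """Map any real number to the open interval (0, 1), avoiding overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_probability(mean_area, mean_concave_points, w_area, w_concave, bias):
    """Linear combination of the two features, squashed by the logistic function."""
    linear = w_area * mean_area + w_concave * mean_concave_points + bias
    return logistic(linear)


# Hypothetical coefficients for illustration only (NOT the fitted graph's values).
W_AREA = -0.01
W_CONCAVE = -80.0
BIAS = 10.0

# A small, smooth tumor versus a large, strongly concave one.
p_small = predict_probability(400.0, 0.02, W_AREA, W_CONCAVE, BIAS)
p_large = predict_probability(1300.0, 0.10, W_AREA, W_CONCAVE, BIAS)

print(f"P(benign | small tumor) = {p_small:.4f}")
print(f"P(benign | large tumor) = {p_large:.6f}")
```

Whatever the coefficients are, the output always stays strictly between 0 and 1, which is what lets us read it as a probability.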

Analysing a hypothesis

Let's check the performance of the hypothesis above by plotting the ROC curve on the train and validation sets. Comparing the two can also reveal overfitting, especially if their AUC scores differ significantly.

hypo_mean_area_conc_points.plot_roc_curve(train, label = 'train')
hypo_mean_area_conc_points.plot_roc_curve(valid, label = 'valid')

Roc curve 1
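If you want a number to go with the curves, the AUC can be computed directly from labels and predicted probabilities. The sketch below uses the pairwise definition of AUC (the fraction of positive/negative pairs that are ranked correctly, with ties counted as half). The labels and scores are made up for illustration, and roc_auc is our own helper, not a Feyn function.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("Both classes must be present to compute AUC")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities, for illustration only.
train_true = [1, 1, 1, 0, 0, 0]
train_score = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
valid_true = [1, 1, 1, 0, 0, 0]
valid_score = [0.9, 0.4, 0.7, 0.5, 0.2, 0.1]

auc_train = roc_auc(train_true, train_score)
auc_valid = roc_auc(valid_true, valid_score)
auc_gap = auc_train - auc_valid

print(f"train AUC = {auc_train:.3f}, valid AUC = {auc_valid:.3f}, gap = {auc_gap:.3f}")
```

A large positive gap between the train and validation AUC is a hint of overfitting. An AUC of 0.5 corresponds to random guessing.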

The ROC curve above tells us that the hypothesis' predictions are not just the result of random guessing. A ROC curve can be complemented by plotting the probability scores, i.e. the values predicted by our current hypothesis.

from feyn.__future__.contrib.inspection import plot_probability_scores

y_train_true = train['target'].copy()
y_train_pred = hypo_mean_area_conc_points.predict(train)
plot_probability_scores(y_train_true, y_train_pred, title='training set')

The higher the AUC score, the easier it will be to separate the negative (target = 0) and positive (target = 1) classes in the plot below.

Probability scores

We can also visualise the hypothesis with a two-dimensional partial plot:

hypo_mean_area_conc_points.plot_partial2d(train)

Partial plot 2d

The red dots above represent the positive class, benign tumors, while the blue ones represent the negative class, malignant tumors. The background colours represent what our hypothesis predicts: the yellow regions as 1, and the grey regions as 0. Since the hypothesis is a linear function, it forms a straight boundary between the red and blue dots, and the ROC curve above shows that this is a good separation.
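To see why the boundary is a straight line, note that the logistic function returns exactly 0.5 when its input is 0. Assuming the predicted class switches at a probability of 0.5 (an assumption on our part), the boundary is the set of points where the linear combination equals zero. The sketch below solves that equation for mean concave points, again with hypothetical coefficients that do not come from the fitted graph.

```python
# Hypothetical coefficients for illustration only (NOT the fitted graph's values).
W_AREA = -0.01
W_CONCAVE = -80.0
BIAS = 10.0


def boundary_concave_points(mean_area):
    """Value of mean concave points on the boundary for a given mean area.

    Solves W_AREA * mean_area + W_CONCAVE * y + BIAS = 0 for y.
    """
    return -(W_AREA * mean_area + BIAS) / W_CONCAVE


def predicted_class(mean_area, mean_concave_points):
    """Return 1 when the linear combination is positive (probability above 0.5), else 0."""
    linear = W_AREA * mean_area + W_CONCAVE * mean_concave_points + BIAS
    return 1 if linear > 0 else 0


threshold_at_500 = boundary_concave_points(500.0)
below_boundary = predicted_class(500.0, 0.03)
above_boundary = predicted_class(500.0, 0.09)

print(f"boundary at mean area 500: mean concave points = {threshold_at_500:.4f}")
print(f"class below the boundary: {below_boundary}, above it: {above_boundary}")
```

Because the boundary equation is linear in both features, every mean area maps to exactly one boundary value of mean concave points, which traces out a straight line in the plot.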

How can we identify the feature values the model has difficulty classifying? This is where the plot_segmented_loss method comes in.

fig = plt.figure(figsize=(20,6))
ax = fig.add_subplot(121)
hypo_mean_area_conc_points.plot_segmented_loss(train, by='mean area', ax=ax)
ax = fig.add_subplot(122)
hypo_mean_area_conc_points.plot_segmented_loss(train, by='mean concave points', ax=ax)

Segmented loss

In the left diagram the blue bars are the histogram of mean area. On top of the histogram is a pink curve where each point is the average loss across that bin. For mean area values greater than 1250 the average loss is very close to zero. Meanwhile the loss is much higher between the values 500 and 750. This means the hypothesis yields predictions closer to the observed data when mean area > 1250 than when 500 < mean area < 750. We can say something similar about mean concave points.
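The idea behind the pink curve can be sketched in a few lines: bin the samples by a feature, then average the loss within each bin. The sketch below assumes the loss is binary cross-entropy, which may differ from what plot_segmented_loss actually uses. The data, the bin edges and the helper names are made up for illustration.

```python
import math


def log_loss(y_true, p, eps=1e-12):
    """Binary cross-entropy for one sample, with p clipped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


def segmented_loss(feature, y_true, y_pred, bin_edges):
    """Average loss per bin [edge_i, edge_{i+1}).

    Returns a list of (low, high, count, mean_loss); mean_loss is None for empty bins.
    """
    n_bins = len(bin_edges) - 1
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y, p in zip(feature, y_true, y_pred):
        for i in range(n_bins):
            if bin_edges[i] <= x < bin_edges[i + 1]:
                sums[i] += log_loss(y, p)
                counts[i] += 1
                break

    result = []
    for i in range(n_bins):
        mean = sums[i] / counts[i] if counts[i] else None
        result.append((bin_edges[i], bin_edges[i + 1], counts[i], mean))
    return result


# Made-up feature values, labels and predicted probabilities.
mean_area = [300.0, 450.0, 550.0, 600.0, 700.0, 1300.0, 1500.0]
target = [1, 1, 1, 0, 0, 0, 0]
predicted = [0.95, 0.90, 0.60, 0.55, 0.40, 0.02, 0.01]
edges = [0.0, 500.0, 750.0, 1250.0, 2000.0]

segments = segmented_loss(mean_area, target, predicted, edges)
for low, high, count, mean in segments:
    shown = "n/a" if mean is None else f"{mean:.3f}"
    print(f"[{low:>6.0f}, {high:>6.0f}): n = {count}, mean loss = {shown}")
```

In this toy data the bin between 500 and 750 has the highest average loss because the predicted probabilities there sit close to 0.5, while the bin above 1250 has a loss close to zero.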

Concluding remarks

Let's go back to our overarching question: 'What is evidence for a malignant or benign tumor?'. Throughout the sessions, we refined it to the question of how mean area and mean concave points predict the target variable. At last we reached the hypothesis that the probability of having a benign tumor is the logistic function of the linear combination of mean area and mean concave points shown above.

In addition to testing this hypothesis with more data, like the holdout set or newly gathered data, we can refine the question, come up with new hypotheses, and so on. In other words, our investigation of this dataset doesn't need to stop now. In the next session, we discuss possible next steps.

Copyright © 2021 Abzu.ai