What comes next?

by: Meera Machado & Chris Cave

The hypothesis in the previous section combined mean area and mean concave points to differentiate between malignant and benign tumors. However, there could be other features and other relationships to explore. So the question really comes down to:

Are we satisfied with the hypothesis we've chosen?

One guideline for answering this question is to assess whether the hypothesis points towards something we didn't know before.

If we are not satisfied, then we can go back and refine the questions we've asked based on our analysis so far. For example, we could ask how much mean concave points explains the target variable on its own, or whether it combines well with other features that do not correlate so heavily with it (check the Pearson correlation between mean area and mean concave points), as illustrated in the sketch below.
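As a concrete illustration of such a refinement, the sketch below poses a narrower, single-feature question and checks the correlation between the two features. It is only a sketch: it assumes the `ql` connection and the `train` split that are set up in the code cell further down this page, and the single-feature question is our own illustrative addition, not part of the original analysis.

# Check how strongly the two candidate features correlate (Pearson)
print(train[['mean area', 'mean concave points']].corr(method='pearson'))

# Ask a narrower question: how much does mean concave points explain on its own?
qgraph_single = ql.get_classifier(['mean concave points'], output='target', max_depth=1)

for _ in range(20):
    qgraph_single.fit(train, threads=4)
    ql.update(qgraph_single.best())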

On the other hand, if we are satisfied, we should check whether other sets of observations agree or conflict with the hypothesis. For instance, we could test our hypothesis on women from different parts of the world, collecting data on mean area and mean concave points that were not part of our initial observations.

Sometimes it is not possible to perform another experiment. The next best thing is to test the hypothesis on the holdout set. The disadvantage of this is that the data comes from the same place as the initial training data and so it inherits the same observational bias.

import sklearn.datasets
import pandas as pd
import feyn
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

breast_cancer = sklearn.datasets.load_breast_cancer()
input_columns = breast_cancer.feature_names

# Load into a pandas dataframe
data = pd.DataFrame(breast_cancer.data, columns=input_columns)
data['target'] = pd.Series(breast_cancer.target)

# Split into train, validation and holdout
train, valid = train_test_split(data, test_size = 0.4, stratify = data['target'], random_state = 42)
valid, holdo = train_test_split(valid, test_size = 0.5, stratify = valid['target'], random_state = 42)

# Connecting to QLattice
ql = feyn.QLattice()

# Pose a question to QGraph [*]
qgraph = ql.get_classifier(['mean area', 'mean concave points'], output='target', max_depth=1)

for _ in range(20):
    qgraph.fit(train, threads=4)
    ql.update(qgraph.best())

# Selecting hypothesis [*]
hypo_mean_area_conc_points = qgraph[0]

# Analyse hypothesis [*] --> refer to previous section

Notice the [*] symbols in the cell above. They mark the iteration points in the question-hypothesis process: the places where we pose questions, extract hypotheses and analyse them in order to further refine the questions and obtain more robust hypotheses.

Suppose we are satisfied with the hypothesis that the target variable is a linear function of mean area and mean concave points. Let's see how it performs on the holdout set:

# Testing hypothesis on unseen data (holdout)
hypo_mean_area_conc_points.plot_roc_curve(holdo)

[ROC curve with AUC score for the hypothesis on the holdout set]

From the AUC score and the ROC curve, we clearly see that this simple hypothesis generalises to the holdout set.
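If a single number is preferred alongside the plot, the AUC can also be computed directly from the graph's predictions. The sketch below uses scikit-learn's roc_auc_score together with the graph's predict method; it is an optional check on the same holdout data, not part of the original walkthrough.

from sklearn.metrics import roc_auc_score

# Predictions from the chosen graph on the holdout set
holdout_pred = hypo_mean_area_conc_points.predict(holdo)

# AUC on unseen data, to complement the ROC plot above
print(roc_auc_score(holdo['target'], holdout_pred))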

If the hypothesis had not generalised to the holdout set, then we would be in trouble: our holdout set would be contaminated, i.e. the knowledge of its performance would bias further investigations. Basically, the holdout set would have become another validation set. Ideally, we should get more unseen data. If that is not possible, then passing another seed to the train/valid/holdo split is another option, though it should be taken with a grain of salt, as sketched below.
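For completeness, re-splitting with a different seed could look like the sketch below. It mirrors the split from the cell above; the new random_state value is arbitrary and only illustrates the idea.

# Re-split with a different seed to obtain a fresh train/valid/holdout partition
train2, valid2 = train_test_split(data, test_size = 0.4, stratify = data['target'], random_state = 7)
valid2, holdo2 = train_test_split(valid2, test_size = 0.5, stratify = valid2['target'], random_state = 7)

# Repeat the question-hypothesis loop on train2, keeping holdo2 untouched
# until a final check of the refined hypothesis.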
