On the previous page we started asking questions about our data. Our overarching question is: 'What is the evidence for a malignant or benign tumor?'. Throughout these guides we will pose more specific questions and hypotheses that will help us answer this main question.
We now want to use the QLattice to help us find good hypotheses for these questions.
From questions to hypotheses
First we need to translate the question into a QGraph. Take for instance the question 'Is area indicative of a tumor being benign or malignant?'; the corresponding QGraph would look like the following:
```python
import sklearn.datasets
import pandas as pd
import feyn
from sklearn.model_selection import train_test_split

breast_cancer = sklearn.datasets.load_breast_cancer()
input_columns = breast_cancer.feature_names

# Load into a pandas dataframe
data = pd.DataFrame(breast_cancer.data, columns=input_columns)
data['target'] = pd.Series(breast_cancer.target)

# Split into train, validation and holdout
train, valid = train_test_split(data, test_size=0.4, stratify=data['target'], random_state=42)
valid, holdout = train_test_split(valid, test_size=0.5, stratify=valid['target'], random_state=42)
```
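Since the split sizes matter for the rest of the walkthrough, here is a quick sanity check (my addition, assuming scikit-learn and pandas are installed) that the two nested splits really produce a roughly 60/20/20 train/validation/holdout division and that stratification preserves the class balance:

```python
import sklearn.datasets
import pandas as pd
from sklearn.model_selection import train_test_split

breast_cancer = sklearn.datasets.load_breast_cancer()
data = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
data['target'] = pd.Series(breast_cancer.target)

# First split off 40%, then cut that 40% in half: 60/20/20 overall
train, valid = train_test_split(data, test_size=0.4, stratify=data['target'], random_state=42)
valid, holdout = train_test_split(valid, test_size=0.5, stratify=valid['target'], random_state=42)

# Each piece keeps (almost exactly) the overall fraction of benign cases
print(len(train), len(valid), len(holdout))
print(data['target'].mean(), train['target'].mean())
```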
Throughout our investigation we will fix a train/validation/holdout split. We will generate hypotheses based on the train set, and analyse and select them on the valid set. Lastly, when we settle on a hypothesis we are satisfied with, we will test it on the holdout set.
```python
# Pose the question to a QGraph
ql = feyn.QLattice()
qgraph = ql.get_classifier(['mean area'], 'target', max_depth=1)
```
We have told the QGraph to take mean area as an input feature and to map it to the output, target. Setting max_depth=1 ensures that the input feature enters a single interaction cell, as we will soon show.
A QGraph is a list of potential hypotheses to the question. At the moment, the QGraph has no knowledge of the data, so any hypothesis in the QGraph will be random. In order to change that, we need to fit the QGraph to the data.
```python
# Fitting loop
for _ in range(20):
    qgraph.fit(train, threads=4)
    ql.update(qgraph.best())
```
We can find the best hypotheses in the QGraph by calling qgraph.best(). This usually returns a list of the three or four top hypotheses, ranked by how closely they fit the data, i.e. by a loss function.
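The feyn documentation is the authority on which loss function is used; purely as an illustration of how a loss ranks hypotheses, here is a minimal NumPy sketch of a typical classification loss (binary cross-entropy), not feyn's internal code:

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy: lower means the predictions fit the data better."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)))

y_true = np.array([1, 0, 1, 1])
good = binary_cross_entropy(y_true, np.array([0.9, 0.1, 0.8, 0.7]))  # confident, mostly right
bad = binary_cross_entropy(y_true, np.array([0.5, 0.5, 0.5, 0.5]))   # uninformative coin flip
print(good < bad)  # the closer fit has the lower loss
```

Ranking hypotheses by such a loss is what lets qgraph.best() pick out the top candidates.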
The QLattice is an environment that searches all possible hypotheses to the questions posed to it. In order to refine its search, we need to tell the QLattice the best hypotheses we have seen so far. We do this by calling ql.update(qgraph.best()).
Note that you can choose the number of threads to allow for parallel QGraph fitting, which accelerates the process.
After this fitting loop we want to take a look at what the QLattice came up with:
The graphs above are the hypotheses. They illustrate what we said about mean area entering a single interaction cell to predict the target.
Let's increase complexity by posing another question to the QGraph: 'What feature could mean area combine with to predict the target?'
```python
qgraph = ql.get_classifier(
    ['mean texture', 'mean area', 'mean smoothness', 'mean compactness',
     'mean concavity', 'mean concave points', 'mean symmetry', 'mean fractal dimension'],
    output='target',
    max_depth=1
).filter(feyn.filters.Contains('mean area'))

for _ in range(50):
    qgraph.fit(train, threads=4)
    ql.update(qgraph.best())
```
We want to explore the possible features mean area could combine with to predict the target, so we include a list of potential ones. We choose the means because we are trying to find a simple set of explanations to the problem. The features mean radius and mean perimeter were excluded since they correlate heavily with mean area.
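That correlation claim is easy to verify directly from the dataset; here is a quick check (my addition, using pandas) of the Pearson correlations of the excluded features with mean area:

```python
import sklearn.datasets
import pandas as pd

breast_cancer = sklearn.datasets.load_breast_cancer()
data = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)

# Pearson correlation of 'mean radius' and 'mean perimeter' with 'mean area'
corr = data[['mean radius', 'mean perimeter', 'mean area']].corr()['mean area']
print(corr.round(3))
```

Both correlations come out very close to 1, which is expected: radius, perimeter and area all measure the size of the same nuclei.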
We use the filters module to ensure that only hypotheses that correspond to the question being asked are generated. The QGraph will then only show hypotheses that satisfy the conditions imposed by the filters. In the case above, feyn.filters.Contains('mean area') ensures that mean area is included in every hypothesis in the QGraph.
In the example above, mean concave points was the most prevalent feature, suggesting that it is one of the features that best combines with mean area. Note how the loss has decreased compared with the previous figure. This leads us to refine our previous question: 'How do mean area and mean concave points combine to predict the target variable?'
```python
qgraph = ql.get_classifier(['mean area', 'mean concave points'], output='target', max_depth=1)

for _ in range(20):
    qgraph.fit(train, threads=4)
    ql.update(qgraph.best())
```
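As a rough, optional sanity check (my addition, not part of the QLattice workflow), a plain scikit-learn logistic regression on the same two features shows they already carry substantial signal on their own:

```python
import sklearn.datasets
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

breast_cancer = sklearn.datasets.load_breast_cancer()
data = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
data['target'] = pd.Series(breast_cancer.target)
train, valid = train_test_split(data, test_size=0.4, stratify=data['target'], random_state=42)

# Baseline: linear combination of the two candidate features
features = ['mean area', 'mean concave points']
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(train[features], train['target'])
print(model.score(valid[features], valid['target']))  # validation accuracy
```

A logistic regression can only combine the features linearly; the point of the QGraph search is to also explore non-linear interactions between them.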
The next step consists in selecting a hypothesis and diving deeper into it. The QGraph is sorted by loss, so qgraph[0] is the hypothesis that best fits the data passed to the QGraph. Most likely this will be the hypothesis we explore further.