by: Kevin Broløs
(Feyn version 1.4.7 or newer)
Many other ML tools support calculating SHAP values, and feyn is no different. It's often useful for data scientists to use SHAP to estimate feature importances, or to get an overview of which features contribute where, and when.
The graphs in feyn are easily translatable to mathematical functions and as such are immediately inspectable. You can also use, for instance, the partial plots or the signal summary to get an idea of how data points move through the graph and how they affect the outcome.
Sometimes, you'd like to know more about which features contribute to a decision, and you might have a handful of decisions you want explanations for. That's where SHAP comes in useful - especially our flavour, which relates the values directly to the samples themselves.
Because the underlying engine here is SHAP, it's important to understand the uncertainties involved: it's just an estimate of how the graph acts under different inputs (and it might even explore areas of the data space that are not strictly possible).
If you want to learn more about this, we suggest reading up on what SHAP is and how it works - and which assumptions it makes about your data (for instance, it assumes that your features are independent).
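As a quick illustration of the core idea (a sketch with made-up numbers, not feyn's implementation): SHAP assigns each feature a contribution such that the contributions sum to the difference between the model's prediction for a sample and the base value (the mean prediction over the background data).

```python
import numpy as np

# Hypothetical numbers for a single sample with three features
base_value = 0.37                             # mean prediction over the background data
shap_values = np.array([0.21, -0.05, 0.12])   # one contribution per feature

# The defining (additivity) property: the contributions sum to the
# difference between the sample's prediction and the base value
prediction = base_value + shap_values.sum()
print(round(prediction, 2))  # 0.65
```

This additivity is what makes SHAP values readable as "how much each feature pushed the prediction up or down" for a single sample.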
What would you like to know?
Before you use this, it often makes sense to consider what kind of question you're asking. For instance, you might have a bunch of false positives in your graph that you'd like to understand better. You might also notice that some samples in your regression graph are badly mispredicted, and you want to understand why.
In this example, we'll use a simple classification problem that we deliberately can't fully fit with our chosen graph size, and filter our data down to the question we want to know more about - the false positives.
We're not going to worry about fitting multiple times or updating the QLattice in this instance - we just fit once and select the best graph of a maximum depth of 1, which allows it to use at most two features as inputs.
from feyn import QLattice
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import pandas as pd

breast_cancer = load_breast_cancer()
input_columns = breast_cancer.feature_names

# Load into a pandas dataframe
data = pd.DataFrame(breast_cancer.data, columns=input_columns)
data['target'] = pd.Series(breast_cancer.target)

train, test = train_test_split(data, stratify=data.target)

# Fitting
ql = QLattice()
ql.reset()
qgraph = ql.get_classifier(train, 'target', max_depth=1)
qgraph.fit(train, threads=4)

# The QGraph is sorted by loss, so the first graph is the best one
best = qgraph[0]
best
After fitting, we evaluate the graph on the test set and dive deeper into the data points we'd like to know more about - simulating a real-life scenario where a running model produces a wrong prediction you'd like to be able to explain.
As we hoped, this graph was too restricted to fit the problem, and we have some false positives to inspect.
def filter_FP(prediction_series, truth_series, truth_df):
    pred = prediction_series > 0.5
    return truth_df.query('@truth_series != @pred and @truth_series == False')

predictions = best.predict(test)
false_positives = filter_FP(predictions, test.target, test)
false_positives
This gives us a list of all the samples in the test set that the model predicted to be True, but were supposed to be False.
Of course, all of this can also be done with a dictionary of numpy arrays if you prefer that over pandas - but the preparation code will have to be adjusted to work with that.
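As a sketch of what that adjustment could look like (the function name, keys, and toy data here are hypothetical, not part of feyn's API), the same false-positive filter can be written against a dictionary of numpy arrays:

```python
import numpy as np

# A possible numpy-dict equivalent of the pandas filter_FP above
def filter_FP_numpy(predictions, data, target_key='target', threshold=0.5):
    pred = predictions > threshold
    truth = data[target_key].astype(bool)
    mask = (truth != pred) & ~truth  # predicted True, actually False
    return {key: values[mask] for key, values in data.items()}

# Toy data: samples at indices 1 and 3 are false positives
data = {
    'mean radius': np.array([14.2, 17.9, 12.1, 20.3]),
    'target':      np.array([1, 0, 1, 0]),
}
predictions = np.array([0.9, 0.8, 0.2, 0.7])

false_positives = filter_FP_numpy(predictions, data)
print(false_positives['mean radius'])  # [17.9 20.3]
```

The boolean mask is applied to every array in the dictionary, so all columns stay aligned with the filtered samples.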
Now that we know what we want to learn more about, it's simpler to use and understand the outputs of the tool. The following code calculates the SHAP values for our filtered data - the false positives - and uses the training set as the reference background data. When in doubt, you should always use the dataset your model was trained on as the reference, since it describes the model's behaviour best.
In this case, the graph ended up using very few of the many features in our table. We can make the output a bit more sensible by filtering the data down to the graph's features (but you don't have to):
from feyn.insights import FeatureImportanceTable

features = best.features
table = FeatureImportanceTable(best, false_positives[features], bg_data=train[features])
table
You'll notice from this table that it renders each sample in a dataframe and plots a bar chart in each column. This shows the degree to which each feature contributed in the negative (False) or positive (True) direction for the prediction. For a regression case this would be the same, but centered around the base value (the mean prediction).
You can also choose to render it as a heatmap instead, if you prefer that to reading it as a chart. You can do this by adjusting the styling property on the returned table object, or by passing it as an argument.
table.styling.style = 'fill'
table
is equivalent to:
table = FeatureImportanceTable(best, false_positives[features], bg_data=train[features], style='fill')
table
but saves recomputing the values.
What about normal SHAP visualizations?
You can use the internals of this function as input to your favourite SHAP plotting library, or you can even use our SHAP implementation directly in feyn.insights.KernelShap. We won't go into detail on how to use it, as it's similar in nature to the process above and to the other SHAP implementations out there. You can check the API Reference for more info.
You can get the SHAP importance values from the internals of the table like this:
importances = table.importance_values
array([[-0.00136194,  0.14264939],
       [ 0.00926901,  0.02617117],
       [ 0.01086387,  0.12908358],
       [ 0.0172759 ,  0.10280851],
       [-0.00977484,  0.21837655],
       [ 0.01167485,  0.23540937],
       [ 0.00294184, -0.03993147],
       [ 0.00752346,  0.37022395],
       [ 0.0003935 , -0.04824079],
       [-0.00235883,  0.14267193]])
And you can access the importances for the background data using
bg_importances = table.bg_importance
array([[ 2.91090801e-03,  3.64154272e-01],
       [-5.14067832e-03,  3.46074125e-01],
       [ 2.75646539e-03,  3.77098630e-01],
       (...)
       [ 1.10805174e-03, -6.12986538e-01],
       [ 6.66648740e-03,  9.49804651e-02],
       [-1.39316220e-02,  8.02177086e-02]])
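With raw importance arrays like these in hand (one row per sample, one column per feature), a common summary - not a feyn API, just a widespread SHAP convention - is the mean absolute SHAP value per feature, which gives a simple global ranking of how much each feature contributed on average. A sketch with made-up numbers in the same shape:

```python
import numpy as np

# Importance values in the same shape as table.importance_values above:
# one row per sample, one column per feature (made-up numbers)
importances = np.array([
    [-0.0014, 0.1426],
    [ 0.0093, 0.0262],
    [ 0.0109, 0.1291],
])

# Mean absolute SHAP value per feature: a simple global ranking
mean_abs = np.abs(importances).mean(axis=0)
print(mean_abs)  # roughly [0.0072 0.0993]
```

Here the second feature dominates on average, even though individual samples can still disagree in sign and magnitude.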
Hopefully, this was a helpful introduction to how you can use your existing feyn skillset to get some estimates of feature importances.