# Interpreting a graph

Feyn version: 1.4.+

In this tutorial we will take a look at how one can go about interpreting a graph. This involves making visualisations of each interaction in a graph. If you're unsure what a graph or an interaction is, it's recommended to look at the guide on graphs before moving on.

The data set we are focusing on contains medical bills of outpatients across the United States. The dataset can be found on Kaggle.

A little bit about this dataset: 1338 patients have had the following information recorded:

- age
- gender
- bmi
- number of children
- smoking or non-smoking
- region of the US
- their medical costs

The idea is to find out what drives their medical costs from this information.

The goal of this tutorial is to investigate what relations the graph has found between the features and the target variable (`out_charges`). We can somewhat think of `out_charges` as a proxy variable capturing how healthy someone is. Typically higher medical bills point towards someone who is less healthy than someone with lower medical bills. Of course this is just a rough interpretation of this variable, as there are likely a lot of other nuances to take into account.

We're not going to train a QGraph in this tutorial. If you want to learn how a QGraph is fitted then check out the other tutorials: the airbnb tutorial for regression, and the titanic tutorial for classification. Instead we will load a pretrained one and demonstrate an approach one might take to extract interesting insights from the data using the graph.

```
import feyn
# This is the feyn.__future__ package where new tools are being added to be tested out
from feyn.__future__.contrib.inspection import plot_interaction, plot_categories, get_activations_df
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
```

```
# This is the train and test split we used to train the model.
train = pd.read_csv('insurance-train.csv',index_col=0)
test = pd.read_csv('insurance-test.csv',index_col=0)
train.head()
```

|  | age | sex | bmi | children | smoker | region | out_charges |
|---|---|---|---|---|---|---|---|
| 340 | 24 | female | 27.60 | 0 | no | southwest | 18955.22017 |
| 532 | 59 | male | 29.70 | 2 | no | southeast | 12925.88600 |
| 518 | 35 | female | 31.00 | 1 | no | southwest | 5240.76500 |
| 1203 | 51 | male | 32.30 | 1 | no | northeast | 9964.06000 |
| 336 | 60 | male | 25.74 | 0 | no | southeast | 12142.57860 |

## The pretrained graph

Here's what we made earlier.

```
graph = feyn.Graph.load('insurance-graph-smoker-bmi-age')
graph
```

### Benchmarking

First of all, let's see how well it does.

```
graph.r2_score(train)
```

```
0.7770082766961321
```

```
graph.r2_score(test)
```

```
0.7501348493587797
```
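The `r2_score` here is the usual coefficient of determination: one minus the ratio of the residual sum of squares to the total sum of squares. A self-contained sketch with toy numbers (the arrays below are made up for illustration and are not output from the graph):

```python
import numpy as np

def r2_score(y_true, y_pred):
    # R^2 = 1 - (residual sum of squares / total sum of squares)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

# Toy example: predictions close to the targets give an R^2 near 1
y_true = np.array([18955.2, 12925.9, 5240.8, 9964.1, 12142.6])
y_pred = np.array([17000.0, 13000.0, 6000.0, 10000.0, 12500.0])
print(round(r2_score(y_true, y_pred), 3))
```

An R² of roughly 0.75-0.78, as we got above, means the graph explains about three quarters of the variance in `out_charges`.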

Not bad at all. This simple graph is capturing a lot of the variance. Let's take a closer look.

```
graph.plot_regression_metrics(train)
```

```
graph.plot_regression_metrics(test)
```

These regression metrics show that the graph is good at predicting low and high `out_charges` but less sure in the middle. We can also see this when we segment the loss by `out_charges`.

```
graph.plot_segmented_loss(train,by='out_charges')
```

There's some sense to this. It's easy to predict when someone is healthy and when someone is very unhealthy, but it's harder for the people in between.
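Segmenting the loss amounts to binning the data by the target and averaging the error in each bin. A minimal pandas sketch on synthetic data (the arrays below stand in for `out_charges` and the graph's predictions; they are not real model output):

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for out_charges and the graph's predictions
rng = np.random.default_rng(0)
y_true = rng.uniform(1_000, 60_000, size=200)
y_pred = y_true + rng.normal(0, 3_000, size=200)

df = pd.DataFrame({'out_charges': y_true, 'pred': y_pred})
df['sq_error'] = (df['out_charges'] - df['pred']) ** 2

# Segment the squared error by the target, mirroring plot_segmented_loss
df['bin'] = pd.cut(df['out_charges'], bins=5)
segmented = df.groupby('bin', observed=False)['sq_error'].mean()
print(segmented)
```

A bin with a noticeably higher mean squared error is exactly the kind of "unsure in the middle" region the plot above reveals.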

## Interpreting the model

The advantage of feyn is that it opens up the innards of the model to show what's driving predictions.

First let's see how the signal flows through the graph.

```
graph.plot_summary(train)
```

We can see that `age` on its own is a strong indicator for `out_charges`. As features on their own, `bmi` and `smoker` do not capture much of the signal. However, the biggest contribution comes when we combine all three features in the `add` interaction.

We are going to investigate what each interaction in the above graph does. We want to answer what correlations the graph has found between `smoker`, `age`, `bmi` and the target.

In the plots below we show the output of each interaction for each data point in the `train` dataset.

### Smoker

The `smoker` feature is a category with two categories: `yes` and `no`. These are converted into two weights which are learnt during the fit process. We plot these weights below:


```
fig = plot_categories(graph[1])
fig.show(renderer = 'svg')
```
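Conceptually, a categorical register works like a lookup table from category to learned weight. A minimal sketch with made-up weight values (the real weights live inside the graph and are what `plot_categories` displays):

```python
# Hypothetical learned weights for the smoker category register
smoker_weights = {'yes': 1.8, 'no': -0.3}

def smoker_register(value):
    # A categorical register simply looks up the weight for its category
    return smoker_weights[value]

print(smoker_register('yes'), smoker_register('no'))
```

The fit process adjusts these two numbers just like any other weight in the graph.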

### Multiplier between smoker and bmi

Next is the `multiply` interaction between `smoker` and `bmi`. There's a bit of explaining to do here. First, the feature `bmi` is linearly scaled, with parameters that the graph has learnt. Then it is multiplied with the weight of the category in `smoker`.
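This scale-then-multiply computation can be sketched in plain Python. The parameter values below are invented for illustration; the actual learned values live inside the graph:

```python
# Hypothetical learned parameters: a linear scaling of bmi and the
# per-category smoker weights (not the graph's actual values)
scale, offset = 0.05, -1.4
smoker_weights = {'yes': 2.0, 'no': -0.3}

def multiply_interaction(bmi, smoker):
    scaled_bmi = scale * bmi + offset          # linear scaling of bmi
    return scaled_bmi * smoker_weights[smoker]  # multiplied by the category weight

# A low and a high bmi for each group
for smoker in ('no', 'yes'):
    print(smoker, multiply_interaction(20, smoker), multiply_interaction(40, smoker))
```

The key point is that the category weight rescales the whole `bmi` axis differently per group, which is what produces the two distinct gradients in the plot below.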

```
fig = plot_interaction(graph,graph[3],train)
fig.show(renderer='svg')
```

There's a bit more to say about this plot. The **colour** of each dot corresponds to the output of the `multiply` interaction, so the range of the `multiply` output is approximately between -0.5 and 3.

Look at the difference between gradients of smokers and non-smokers. The gradient on the non-smokers shows that the output is between -0.5 and 0 while on the smokers the output is between 1 and 3.

The `bmi` is adjusting the gradient within each group. A low `bmi` gives values around -0.5 for non-smokers and between 1 and 2 for smokers. Likewise a high `bmi` gives values around 0 for non-smokers and greater than 2 for smokers.

We will be thinking about these values in the next plot.

### Addition interaction

Let's take a look at the `add` interaction, which combines the `multiply` output with `age`.

```
fig = plot_interaction(graph,graph[4],train)
fig.show(renderer='svg')
```

As we saw in the previous plot, the `multiply` output for non-smokers is between -0.5 and 0 while the output for smokers is spread between 1 and 3. That is why we see one very narrow strip and one broad strip, corresponding to non-smokers and smokers respectively.

The colour corresponds to the predicted `out_charges`. We can see a colour gradient within each strip as well, which shows that the older the patient, the higher the predicted `out_charges`. However the variance in colour is larger in the smoker strip than in the non-smoker strip. This is because, in the previous plot, the `bmi` gave a broader spread in the `multiply` output for smokers than for non-smokers.
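The `add` step itself is simple: a linearly scaled `age` is added to the `multiply` output. A sketch with hypothetical scaling parameters (invented for illustration, not the graph's learned values):

```python
# Hypothetical scaling of age, added to the multiply output
def add_interaction(multiply_out, age, age_scale=0.03, age_offset=-1.0):
    scaled_age = age_scale * age + age_offset
    return multiply_out + scaled_age

# Same age, but a non-smoker (narrow multiply band) vs a smoker (broad band)
print(add_interaction(-0.3, 50), add_interaction(2.5, 50))
```

Because the smoker band enters this sum with a much wider range of values, the same spread of ages produces a much wider spread of predictions for smokers.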

### Conclusion

In summary, being a smoker and one's age have an effect on `out_charges`. However, the interesting point is that `bmi` has a larger effect on `out_charges` for smokers than for non-smokers. This is demonstrated by the two different-sized strips in the plot above, and by the larger colour variance in the smoker strip compared with the non-smoker strip.