# Preventing the Honeybee Apocalypse (QSAR)

by: Niels Johan Christensen and Samuel Demharter

Feyn version: 2.0

Honey bees are at risk!

Many insectides are based on similar chemical scaffolds as nerve agents. Exposure to these agents leads to lethal interruption of nerve transmission by inhibition of the enzyme (acetylcholinesterase) that degrades the neurotransmitter **acetylcholine**.

Can we use the `QLattice`

to make interpretable and generalisable **Quantitative Structure Activity Relationship (QSAR)** predictions?
I.e. in this case, can we find the combination of chemical descriptors that best predict **honey bee toxicity**? This model would be helpful for chemists who want to develop honey-bee-friendly pesticides.

- The target variable for this regression exercise is
**-log(LD50)**. This is a measure of toxicity of the compounds. - The dataset we'll be looking at was published (find the model here) and included 16 chemical descriptors. Can we improve on the published model?
*Note, the published prediction is an ensemble average of models and thus is not very interpretable.*

Reference: Dulin, F.; Halm-Lemeille, M.-P.; Lozano, S.; Lepailleur, A.; Sopkova-de Oliveira Santos, J.; Rault, S.; Bureau, R. Interpretation of honeybees contact toxicity associated to acetylcholinesterase inhibitors. Ecotoxicol. Environ. Saf. 2012, 79, 13–21. http://dx.doi.org/10.1016/j.ecoenv.2012.01.007

### Let's import some libraries

```
import pandas as pd
import feyn
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')
```

```
# Read the data
# Note, the column names are abbreviated for convenience.
df = pd.read_csv("../data/honeybee.csv", index_col=0)
```

```
df.head()
```

Meaninformationindex1 | PresenceabsenceofSPa2 | PresenceabsenceofCCa3 | Lowesteigenvalue3ofB4 | RCXXRCXXCX5 | ComplementaryInforma6 | Dipolemomentcomponen7 | Gearyautocorrelation8 | Gearyautocorrelation9 | GhoseViswanadhanWend10 | Energyofthelowestuno11 | Moranautocorrelation12 | Shadowindexcomponent13 | Shadowindexcomponent14 | StructuralInformatio15 | Estatetopologicalpar16 | 48hHoneybeetoxicitya17 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

Compound Id | |||||||||||||||||

S1 | 2.023 | 1 | 1 | 1.317 | 1 | 0.713 | -2.54264 | 1.103 | 0.000 | 0 | -0.10516 | 0.213 | 0.66986 | 0.63133 | 0.835 | 30.447 | 2.188 |

S2 | 1.744 | 0 | 1 | 1.693 | 2 | 0.890 | -1.60728 | 1.025 | 0.997 | 0 | -0.07309 | -0.069 | 0.66613 | 0.71244 | 0.843 | 79.169 | 2.773 |

S3 | 1.741 | 0 | 1 | 1.293 | 0 | 1.039 | 0.19352 | 1.195 | 1.286 | 0 | -0.07899 | -0.226 | 0.54579 | 0.62523 | 0.779 | 27.796 | 2.830 |

S4 | 1.533 | 0 | 1 | 1.544 | 0 | 1.292 | -1.12375 | 1.045 | 1.944 | 0 | -0.02942 | -0.264 | 0.74537 | 0.79094 | 0.739 | 21.392 | 3.239 |

S5 | 2.099 | 1 | 1 | 1.584 | 1 | 0.985 | -1.33811 | 0.451 | 0.720 | 1 | -0.13295 | 0.583 | 0.54227 | 0.66500 | 0.803 | 38.281 | 2.881 |

### Split the data into train and test

We will use the same train/test split as in the published QSAR model

```
test = df.loc[['S2','S7','S25','S41','S42','S44']]
train = df.drop(index=['S2','S7','S25','S41','S42','S44'])
target = df.columns[-1]
```

## Strategy 1: Create a linear QSAR model

Typically, QSAR models are linear. Only recently has the exploration of non-linear QSAR modelling gained traction.
Luckily we can do both with the `QLattice`

. Let's start building a linear model and then continue with non-linear models.

To this end we:

- only allow "add" (linear) functions
- run 20 fitting loops or "epochs"
- use the bic criterion (bayesian information criterion) to keep things simple

### Initialize a Community QLattice

```
ql = feyn.connect_qlattice()
ql.reset(random_seed=42)
```

### Run the QLattice over 20 epochs

```
models = ql.auto_run(train[train.columns], target, n_epochs=20, criterion='bic',function_names = ["add"] )
```

### Evaluate the performance on the training and test data

```
models[0].plot(train, test)
```

```
# Regression plot for the training set
models[0].plot_regression(train)
```

```
# Regression plot for the test set
models[0].plot_regression(test)
```

Nice! The performance of this single linear model is comparable to the performance of the ensembl model presented in the literature.

## Strategy 2: Create a simple and non-linear QSAR model

Now let's do something non linear.

- We'll start out with just 3 epochs of symbolic regression while also keeping the graph complexity low (
**max_complexity = 3**). - We'll run the
`QLattice`

with the full function repertoire (in addition to "add" there are currently 9 other functions. See them listed under "interactions" here).

```
ql.reset(random_seed=42)
models = ql.auto_run(train[train.columns], target, max_complexity = 3, n_epochs=3)
```

### The QLattice arrives at two-descriptor only model. The test metrics look really good:

```
models[0].plot(train, test)
```

We got a decent simple model that can easily be interpreted. Two features interacting in an elegant non-linear relationship.

### How do our predictions compare to the actual toxicity?

```
models[0].plot_regression(train)
```

```
models[0].plot_regression(test)
```

This simple QSAR model is based on only two descriptors, namely:

**Presence/absence of S–P at topological distance 1**: This descriptor encodes the presence (=1) or absence (=0) of a sulfur atom directly bound to phosphorus**Geary autocorrelation of lag 6 weighted by Sanderson electronegativity**: This particular descriptor "sums the products of atom weights of the terminal atoms of all the paths of the path length = 6 (the lag) weighted by electronegativity- A summary of this and other autocorrelaton descriptors is found here (http://www.vcclab.org/lab/indexhlp/2dautoc.html)

### Interpret the simple non-linear model

The **plot_partial2d** plot is useful for elucidating the interplay between molecular descriptors.

```
models[0].plot_partial2d(train)
```

It is clear from the partial2d plots below that

- (i) increasing values of
**Gearyautocolation9**is associated with increased toxicity - (ii) that the
**presence of a sulfur atom**directly attached to phosphorous amplifies this toxicity.

```
models[0].plot_partial2d(test)
```

### Show the model as a mathematical expression

```
sympy_model = models[0].sympify(signif=3)
sympy_model.as_expr()
```

$\displaystyle 3.36 - 6.99 e^{- 0.561 \left(0.991 Gearyautocorrelation_{9} + 1\right)^{2} - 0.312 \left(- PresenceabsenceofSPa_{2} - 0.979\right)^{2}}$

This is the model. Nothing is hidden away. You could copy and paste this formula into a excel worksheet and work with it. That's one of the benefits of symbolic regression.

## Strategy 3: Full-on QLattice modelling

Can we further improve the predictive performance of our model? Let's set the `QLattice`

loose and see what happens.

- Run the
`QLattice`

for 10 epochs using the**bic**criterion (see here for details) to keep things simple. - Keep
**default model complexity**(max 10 edges)

```
ql.reset(random_seed=42)
models = ql.auto_run(train[train.columns], target, n_epochs=10, criterion='bic')
```

```
models[0].plot(train,test)
```

Nice R2! This model outperforms the ensemble predictions in the literature.

```
models[0].plot_regression(train)
```

```
models[0].plot_regression(test)
```

The model captures the actual toxicity surprisingly well. Note, to fully validate this model we would want to test it on another set of compounds. Also, even though the model itself is explainable, the features it includes (autocorrelation, complementary information) are hardly intuitive. More work has to be done to come up with meaningful interpretable features that also hold predictive power.

### Conclusion

In this tutorial we have built **linear** and **non-linear** models using the `QLattice`

on a typical QSAR dataset.
The smart feature selection coupled with the simplicity of the models enables researchers to rapidly explore potential mechanisms underlying the data.

This example was based on a relatively **small dataset** (where simple models excel). But the `QLattice`

can readily be applied to large and wide datasets. We're looking into it ;)