Feyn Documentation

Feyn Documentation

  • Learn
  • Guides
  • Tutorials
  • API Reference
  • FAQ

›Regression

Overview

  • Tutorials

Beginner

    Classification

    • Titanic survival
    • Pulsar stars
    • Poisonous Mushrooms

    Regression

    • Airbnb prices
    • Automobile MPG
    • Concrete strength

Advanced

    Regression

    • Wine Quality

Use cases

  • Rewriting models with correlated inputs
  • Complexity-Loss Trade-Off
  • Plotting the loss graph
  • Simple linear and logistic regression
  • Deploy a model for inference

Life Sciences

    Classification

    • Detecting Liver Cancer (HCC) in Plasma
    • Classifying toxicity of antisense oligonucleotides

    Regression

    • Covid-19 RNA vaccine degradation data set
    • Preventing the Honeybee Apocalypse (QSAR)

Interfacing with R

  • Classifying toxicity of antisense oligonucleotides

Archive

  • Covid-19 vaccination RNA dataset.

Preventing the Honeybee Apocalypse (QSAR)

by: Niels Johan Christensen and Samuel Demharter

Feyn version: 2.0

Honey bees are at risk!

Many insectides are based on similar chemical scaffolds as nerve agents. Exposure to these agents leads to lethal interruption of nerve transmission by inhibition of the enzyme (acetylcholinesterase) that degrades the neurotransmitter acetylcholine.

Can we use the QLattice to make interpretable and generalisable Quantitative Structure Activity Relationship (QSAR) predictions? I.e. in this case, can we find the combination of chemical descriptors that best predict honey bee toxicity? This model would be helpful for chemists who want to develop honey-bee-friendly pesticides.

  • The target variable for this regression exercise is -log(LD50). This is a measure of toxicity of the compounds.
  • The dataset we'll be looking at was published (find the model here) and included 16 chemical descriptors. Can we improve on the published model? Note, the published prediction is an ensemble average of models and thus is not very interpretable.

Reference: Dulin, F.; Halm-Lemeille, M.-P.; Lozano, S.; Lepailleur, A.; Sopkova-de Oliveira Santos, J.; Rault, S.; Bureau, R. Interpretation of honeybees contact toxicity associated to acetylcholinesterase inhibitors. Ecotoxicol. Environ. Saf. 2012, 79, 13–21. http://dx.doi.org/10.1016/j.ecoenv.2012.01.007

Let's import some libraries

import pandas as pd
import feyn
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')
# Read the data
# Note, the column names are abbreviated for convenience.
df = pd.read_csv("../data/honeybee.csv", index_col=0)
df.head()
Meaninformationindex1 PresenceabsenceofSPa2 PresenceabsenceofCCa3 Lowesteigenvalue3ofB4 RCXXRCXXCX5 ComplementaryInforma6 Dipolemomentcomponen7 Gearyautocorrelation8 Gearyautocorrelation9 GhoseViswanadhanWend10 Energyofthelowestuno11 Moranautocorrelation12 Shadowindexcomponent13 Shadowindexcomponent14 StructuralInformatio15 Estatetopologicalpar16 48hHoneybeetoxicitya17
Compound Id
S1 2.023 1 1 1.317 1 0.713 -2.54264 1.103 0.000 0 -0.10516 0.213 0.66986 0.63133 0.835 30.447 2.188
S2 1.744 0 1 1.693 2 0.890 -1.60728 1.025 0.997 0 -0.07309 -0.069 0.66613 0.71244 0.843 79.169 2.773
S3 1.741 0 1 1.293 0 1.039 0.19352 1.195 1.286 0 -0.07899 -0.226 0.54579 0.62523 0.779 27.796 2.830
S4 1.533 0 1 1.544 0 1.292 -1.12375 1.045 1.944 0 -0.02942 -0.264 0.74537 0.79094 0.739 21.392 3.239
S5 2.099 1 1 1.584 1 0.985 -1.33811 0.451 0.720 1 -0.13295 0.583 0.54227 0.66500 0.803 38.281 2.881

Split the data into train and test

We will use the same train/test split as in the published QSAR model

test = df.loc[['S2','S7','S25','S41','S42','S44']]
train = df.drop(index=['S2','S7','S25','S41','S42','S44'])
target = df.columns[-1]

Strategy 1: Create a linear QSAR model

Typically, QSAR models are linear. Only recently has the exploration of non-linear QSAR modelling gained traction. Luckily we can do both with the QLattice. Let's start building a linear model and then continue with non-linear models.

To this end we:

  • only allow "add" (linear) functions
  • run 20 fitting loops or "epochs"
  • use the bic criterion (bayesian information criterion) to keep things simple

Initialize a Community QLattice

ql = feyn.connect_qlattice()
ql.reset(random_seed=42)

Run the QLattice over 20 epochs

models = ql.auto_run(train[train.columns], target, n_epochs=20, criterion='bic',function_names = ["add"] )
Loss: 1.42E-01Epoch no. 20 - Tried 14520 models - Best loss: 1.42e-01(0, -1, -1) Meaninformationindex1 linear: scale=2.111932 w=-0.445662 bias=0.1356Meaninfo..0num(6, -1, -1) Dipolemomentcomponen7 linear: scale=0.247374 w=-1.013435 bias=1.5029Dipolemo..1num(2, -1, -1) PresenceabsenceofCCa3 linear: scale=2.000000 w=-0.473960 bias=-0.0057Presence..2num(35, 2, 1) cell:add(i,i)->iadd3x0x1(14, -1, -1) StructuralInformatio15 linear: scale=6.825939 w=-0.956530 bias=0.7992Structur..4num(36, 1, 0) cell:add(i,i)->iadd5x0x1(37, 0, 7) cell:add(i,i)->iadd6x0x1(8, -1, -1) Gearyautocorrelation9 linear: scale=1.028807 w=-0.765692 bias=1.4403Gearyaut..7num(38, 0, 6) cell:add(i,i)->iadd8x0x1(39, 0, 5) 48hHoneybeetoxicitya17 linear: scale=1.778500 w=-0.5163 bias=-0.754348hHoney..9out

Evaluate the performance on the training and test data

models[0].plot(train, test)
(0, -1, -1) Meaninformationindex1 linear: scale=2.111932 w=-0.445662 bias=0.1356Meaninfo..0num(6, -1, -1) Dipolemomentcomponen7 linear: scale=0.247374 w=-1.013435 bias=1.5029Dipolemo..1num(2, -1, -1) PresenceabsenceofCCa3 linear: scale=2.000000 w=-0.473960 bias=-0.0057Presence..2num(35, 2, 1) cell:add(i,i)->iadd3x0x1(14, -1, -1) StructuralInformatio15 linear: scale=6.825939 w=-0.956530 bias=0.7992Structur..4num(36, 1, 0) cell:add(i,i)->iadd5x0x1(37, 0, 7) cell:add(i,i)->iadd6x0x1(8, -1, -1) Gearyautocorrelation9 linear: scale=1.028807 w=-0.765692 bias=1.4403Gearyaut..7num(38, 0, 6) cell:add(i,i)->iadd8x0x1(39, 0, 5) 48hHoneybeetoxicitya17 linear: scale=1.778500 w=-0.5163 bias=-0.754348hHoney..9out-0.1-0.39-0.42-0.4-0.46-0.54-0.75-0.52-0.870.87-10+1Pearson correlationMetricsR20.762RMSE0.377MAE0.3Test MetricsR20.829RMSE0.19MAE0.17
# Regression plot for the training set
models[0].plot_regression(train)

png

# Regression plot for the test set
models[0].plot_regression(test)

png

Nice! The performance of this single linear model is comparable to the performance of the ensembl model presented in the literature.

Strategy 2: Create a simple and non-linear QSAR model

Now let's do something non linear.

  • We'll start out with just 3 epochs of symbolic regression while also keeping the graph complexity low (max_complexity = 3).
  • We'll run the QLattice with the full function repertoire (in addition to "add" there are currently 9 other functions. See them listed under "interactions" here).
ql.reset(random_seed=42)
models = ql.auto_run(train[train.columns], target, max_complexity = 3, n_epochs=3)
Loss: 2.90E-01Epoch no. 3 - Tried 1738 models - Best loss: 2.90e-01(1, -1, -1) PresenceabsenceofSPa2 linear: scale=2.000000 w=-0.197341 bias=-0.3865Presence..0num(8, -1, -1) Gearyautocorrelation9 linear: scale=1.028807 w=0.510162 bias=0.5297Gearyaut..1num(47, 3, 5) cell:gaussian(i,i)->igaussian2x0x1(48, 4, 6) 48hHoneybeetoxicitya17 linear: scale=1.778500 w=-3.9283 bias=1.889048hHoney..3out

The QLattice arrives at two-descriptor only model. The test metrics look really good:

models[0].plot(train, test)
(1, -1, -1) PresenceabsenceofSPa2 linear: scale=2.000000 w=-0.197341 bias=-0.3865Presence..0num(8, -1, -1) Gearyautocorrelation9 linear: scale=1.028807 w=0.510162 bias=0.5297Gearyaut..1num(47, 3, 5) cell:gaussian(i,i)->igaussian2x0x1(48, 4, 6) 48hHoneybeetoxicitya17 linear: scale=1.778500 w=-3.9283 bias=1.889048hHoney..3out-0.180.52-0.720.72-10+1Pearson correlationMetricsR20.515RMSE0.538MAE0.405Test MetricsR20.789RMSE0.212MAE0.159

We got a decent simple model that can easily be interpreted. Two features interacting in an elegant non-linear relationship.

How do our predictions compare to the actual toxicity?

models[0].plot_regression(train)

png

models[0].plot_regression(test)

png

This simple QSAR model is based on only two descriptors, namely:

  • Presence/absence of S–P at topological distance 1: This descriptor encodes the presence (=1) or absence (=0) of a sulfur atom directly bound to phosphorus
  • Geary autocorrelation of lag 6 weighted by Sanderson electronegativity: This particular descriptor "sums the products of atom weights of the terminal atoms of all the paths of the path length = 6 (the lag) weighted by electronegativity
  • A summary of this and other autocorrelaton descriptors is found here (http://www.vcclab.org/lab/indexhlp/2dautoc.html)

Interpret the simple non-linear model

The plot_partial2d plot is useful for elucidating the interplay between molecular descriptors.

models[0].plot_partial2d(train)

png

It is clear from the partial2d plots below that

  • (i) increasing values of Gearyautocolation9 is associated with increased toxicity
  • (ii) that the presence of a sulfur atom directly attached to phosphorous amplifies this toxicity.
models[0].plot_partial2d(test)

png

Show the model as a mathematical expression

sympy_model = models[0].sympify(signif=3)
sympy_model.as_expr()

3.36−6.99e−0.561(0.991Gearyautocorrelation9+1)2−0.312(−PresenceabsenceofSPa2−0.979)2\displaystyle 3.36 - 6.99 e^{- 0.561 \left(0.991 Gearyautocorrelation_{9} + 1\right)^{2} - 0.312 \left(- PresenceabsenceofSPa_{2} - 0.979\right)^{2}}3.36−6.99e−0.561(0.991Gearyautocorrelation9​+1)2−0.312(−PresenceabsenceofSPa2​−0.979)2

This is the model. Nothing is hidden away. You could copy and paste this formula into a excel worksheet and work with it. That's one of the benefits of symbolic regression.

Strategy 3: Full-on QLattice modelling

Can we further improve the predictive performance of our model? Let's set the QLattice loose and see what happens.

  • Run the QLattice for 10 epochs using the bic criterion (see here for details) to keep things simple.
  • Keep default model complexity (max 10 edges)
ql.reset(random_seed=42)
models = ql.auto_run(train[train.columns], target, n_epochs=10, criterion='bic')
Loss: 1.44E-01Epoch no. 10 - Tried 21223 models - Best loss: 1.44e-01(6, -1, -1) Dipolemomentcomponen7 linear: scale=0.247374 w=-0.789432 bias=-2.0718Dipolemo..0num(8, -1, -1) Gearyautocorrelation9 linear: scale=1.028807 w=-0.662792 bias=0.7095Gearyaut..1num(5, -1, -1) ComplementaryInforma6 linear: scale=1.175088 w=0.695922 bias=-0.2594Compleme..2num(11, -1, -1) Moranautocorrelation12 linear: scale=1.886792 w=-0.244107 bias=1.5043Moranaut..3num(12, 1, 1) cell:add(i,i)->iadd4x0x1(12, 7, 7) cell:add(i,i)->iadd5x0x1(13, 0, 0) cell:add(i,i)->iadd6x0x1(14, 1, 1) cell:exp(i)->iexp7x0(15, 1, 2) 48hHoneybeetoxicitya17 linear: scale=1.778500 w=-0.4244 bias=2.148348hHoney..8out
models[0].plot(train,test)
(6, -1, -1) Dipolemomentcomponen7 linear: scale=0.247374 w=-0.789432 bias=-2.0718Dipolemo..0num(8, -1, -1) Gearyautocorrelation9 linear: scale=1.028807 w=-0.662792 bias=0.7095Gearyaut..1num(5, -1, -1) ComplementaryInforma6 linear: scale=1.175088 w=0.695922 bias=-0.2594Compleme..2num(11, -1, -1) Moranautocorrelation12 linear: scale=1.886792 w=-0.244107 bias=1.5043Moranaut..3num(12, 1, 1) cell:add(i,i)->iadd4x0x1(12, 7, 7) cell:add(i,i)->iadd5x0x1(13, 0, 0) cell:add(i,i)->iadd6x0x1(14, 1, 1) cell:exp(i)->iexp7x0(15, 1, 2) 48hHoneybeetoxicitya17 linear: scale=1.778500 w=-0.4244 bias=2.148348hHoney..8out-0.39-0.52-0.39-0.27-0.68-0.53-0.84-0.870.87-10+1Pearson correlationMetricsR20.759RMSE0.379MAE0.287Test MetricsR20.928RMSE0.124MAE0.109

Nice R2! This model outperforms the ensemble predictions in the literature.

models[0].plot_regression(train)

png

models[0].plot_regression(test)

png

The model captures the actual toxicity surprisingly well. Note, to fully validate this model we would want to test it on another set of compounds. Also, even though the model itself is explainable, the features it includes (autocorrelation, complementary information) are hardly intuitive. More work has to be done to come up with meaningful interpretable features that also hold predictive power.

Conclusion

In this tutorial we have built linear and non-linear models using the QLattice on a typical QSAR dataset. The smart feature selection coupled with the simplicity of the models enables researchers to rapidly explore potential mechanisms underlying the data.

This example was based on a relatively small dataset (where simple models excel). But the QLattice can readily be applied to large and wide datasets. We're looking into it ;)

← Covid-19 RNA vaccine degradation data setClassifying toxicity of antisense oligonucleotides →
  • Strategy 1: Create a linear QSAR model
    • Initialize a Community QLattice
    • Run the QLattice over 20 epochs
    • Evaluate the performance on the training and test data
  • Strategy 2: Create a simple and non-linear QSAR model
    • The QLattice arrives at two-descriptor only model. The test metrics look really good:
    • How do our predictions compare to the actual toxicity?
    • Interpret the simple non-linear model
    • Show the model as a mathematical expression
  • Strategy 3: Full-on QLattice modelling
    • Conclusion

Subscribe to get news about Feyn and the QLattice.

You can opt out at any time, and you can read our privacy policy here.

Copyright © 2024 Abzu.ai - Feyn license: CC BY-NC-ND 4.0
Feyn®, QGraph®, and the QLattice® are registered trademarks of Abzu®