Covid-19 vaccination RNA dataset.

by: Meera Machado & Chris Cave

Feyn version: 1.4.+

In this tutorial we will walk through a typical QLattice workflow by performing a simple analysis of the OpenVaccine: COVID-19 mRNA Vaccine Degradation Prediction dataset from Kaggle.

Some of the potential Covid-19 vaccines are mRNA-based. However, due to the unstable nature of mRNA, these vaccines must be stored at extremely low temperatures, which makes distribution problematic.

The raw dataset (available on the Kaggle competition page) consists of 2400 mRNA samples. Each mRNA consists of 107 nucleotides, and various measurements were performed on the first 68 nucleotides: reactivity, degradation at pH10 with and without magnesium, and degradation at 50 °C with and without magnesium.

In our analysis, we have simplified the dataset. For each mRNA, we counted the occurrences of each nucleotide and each base pair, along with the total number of pairs in the molecule. We then took the average reactivity and degradation measurements across the nucleotides of each mRNA sample.
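For the curious, here is a rough sketch of how such features could be derived from the raw Kaggle files. This is illustrative only: the column names (sequence, structure, deg_Mg_pH10, SN_filter) follow the raw competition data, the file path is an assumption, and the exact recipe behind covid_simple.csv may differ in detail.

import numpy as np
import pandas as pd

# Hypothetical reconstruction of the feature engineering; not the exact
# script that produced covid_simple.csv.
raw = pd.read_json('../data/train.json', lines=True)  # path is an assumption

def paired_fraction(structure):
    # In dot-bracket notation, '(' and ')' mark paired nucleotides.
    return sum(c in '()' for c in structure) / len(structure)

features = pd.DataFrame({
    **{f'{b}_percent': raw['sequence'].apply(lambda s, b=b: s.count(b) / len(s))
       for b in 'AGCU'},
    'pairs_rate': raw['structure'].apply(paired_fraction),
    'SN_filter': raw['SN_filter'],
    'mean_deg_Mg_pH10': raw['deg_Mg_pH10'].apply(np.mean),
})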

Let's take a look!

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import feyn

from sklearn.model_selection import train_test_split

%matplotlib inline

The scientific method is core to the workflow of this notebook. We will make observations, pose questions and form hypotheses.

# Here is our engineered dataset

eng_data = pd.read_csv('../data/covid_simple.csv')
eng_data.head()
A_percent G_percent C_percent U_percent U-G C-G U-A G-C A-U G-U pairs_rate SN_filter mean_reactivity mean_deg_Mg_pH10 mean_deg_pH10 mean_deg_Mg_50C mean_deg_50C
0 0.420561 0.177570 0.214953 0.186916 0.086957 0.130435 0.260870 0.347826 0.173913 0.000000 0.429907 1 0.502631 0.559628 0.503501 0.538540 0.585893
1 0.233645 0.308411 0.299065 0.158879 0.030303 0.363636 0.030303 0.393939 0.121212 0.060606 0.616822 0 0.411401 0.698354 0.846041 0.930103 0.865125
2 0.401869 0.224299 0.186916 0.186916 0.041667 0.208333 0.208333 0.291667 0.125000 0.125000 0.448598 1 0.433966 0.578362 0.548174 0.570284 0.575172
3 0.261682 0.327103 0.327103 0.084112 0.000000 0.437500 0.062500 0.406250 0.031250 0.062500 0.598131 0 0.329059 0.826485 1.051551 1.221297 0.696476
4 0.542056 0.056075 0.056075 0.345794 0.000000 0.000000 0.457143 0.000000 0.542857 0.000000 0.654206 0 0.282204 0.469587 0.384826 0.393072 0.440115

There's one feature we haven't talked about yet: SN_filter. This is the signal-to-noise filter capturing which RNA molecules passed the evaluation criteria defined by the Stanford researchers. We will therefore drop the rows with SN_filter = 0:

eng_data_SN1 = eng_data.query('SN_filter == 1')

So now we want to use the QLattice! How do we start?

First we need to choose a target variable. We choose degradation with magnesium at pH10. The higher its value, the higher the chance of the mRNA "breaking" at that nucleotide location. In eng_data we simply took the mean across the nucleotides of each mRNA sample.

This notebook can easily be modified for the different target variables. Please get your hands dirty and mess around with this.

target = 'mean_deg_Mg_pH10'

We have chosen a target variable. What do we do next? This is the moment we part ways with traditional machine learning techniques.

We need to take a step back and think about what we want to ask about our data, i.e. what observations can we make? Let's start with baby steps:

Can a single feature already explain some of the output?

plt.figure(figsize=(21,21))

for ix, col in enumerate(eng_data_SN1.columns[:11]):
    plt.subplot(3,4, ix+1)
    plt.scatter(eng_data_SN1[col], eng_data_SN1[target])
    plt.xlabel(col)

[Figure: scatter plots of the 11 features against mean_deg_Mg_pH10]

Observation: note how U_percent seems to be linearly correlated with mean_deg_Mg_pH10.

Let's calculate the Pearson correlations:

drop_cols = ['SN_filter'] + [out for out in eng_data_SN1.columns if 'mean' in out]
for feat in eng_data_SN1.drop(drop_cols, axis=1).columns:
    print(feat, feyn.metrics.calculate_pc(eng_data_SN1[feat], eng_data_SN1['mean_deg_Mg_pH10']))
A_percent -0.06920561730299386
G_percent -0.15027559191692702
C_percent -0.10138997239078872
U_percent 0.3623464717168081
U-G -0.11540344284319401
C-G -0.006336444226142389
U-A 0.14079264385434193
G-C -0.10153195407020321
A-U 0.07401507079974223
G-U -0.01572146918393174
pairs_rate 0.04032118343747456

As we can see, U_percent is certainly the one with highest linear correlation!

Before we move on to the first question, let's split the data into train, validation and holdout sets:

train, test = train_test_split(eng_data_SN1, test_size=0.4, random_state=527140)
valid, holdo = train_test_split(test, test_size=0.5, random_state=5271740)

ql = feyn.QLattice()

Question 1: How does U_percent relate to the target variable?

Always have a question in mind before you start working with the QLattice.

We use this question to define the types of graphs we want to extract from the QLattice: the question you pose shapes the hypotheses you can form about it. Throughout this notebook, think of the graphs generated by the QLattice as hypotheses answering your question.

You translate your question into the QLattice by defining the QGraph as a regressor or classifier, setting input and output features, fixing the graphs' maximum depth and number of edges, selecting types of functions, etc. Check out the filters page for more information on this.

The QGraph is the list of hypotheses generated by the QLattice!

In the current case, we extract a regressor with a single input feature, U_percent, and limit the graphs to depth one. In other words, U_percent is mapped to mean_deg_Mg_pH10 with a single function. This will generate hypotheses of the type "U_percent has an f(x) relationship to mean_deg_Mg_pH10", which provides answers to the question above.

ql.reset()
qgraph = ql.get_regressor(['U_percent'], target, max_depth=1)

In our limited problem, the number of possible graph architectures is not large. This means 20 iterations should be enough to explore the possibilities and pick out the best graph.

for _ in range(20):
    qgraph.fit(train, threads=6)
    ql.update(qgraph.best())
Loss: 7.67E-03 — Fitting 20: 100% completed. Best loss so far: 0.007668
[Graph: U_percent → gaussian → mean_deg_Mg_pH10]

You can peer into the QGraph by calling its head. It's a good way to check the architectural diversity of the graphs, i.e. the range of hypotheses that the QLattice has generated.

qgraph.head()
[QGraph head: five graphs, all with loss 7.67E-03, each mapping U_percent through a gaussian cell to mean_deg_Mg_pH10 with slightly different weights]

Usually what we see when we run this fitting loop is a gaussian or a sine function. Try running it a few times, with different numbers of fitting loops, to see if you get the same results as us.

This differs from our expectation that U_percent predicts the target variable through a linear function. Here the sine and gaussian functions demonstrate that there could be a non-linear relation between U_percent and mean_deg_Mg_pH10. This leads us to our first hypothesis:

Hypothesis: U_percent has this gaussian / sine relation to mean_deg_Mg_pH10

We test this hypothesis. We use the R2 score to verify whether predictions that come from this hypothesis are better than random guessing. The R2 score is one of many metrics one can use for this; other examples are the root mean squared error and a Student's t-test on the distribution of the residuals.
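To make the metric concrete, here is what the R2 score computes under the hood. The r2 helper below is our own illustration, and the predict call assumes the Graph.predict method from Feyn 1.x.

import numpy as np

def r2(y_true, y_pred):
    # 1 - (residual sum of squares) / (total sum of squares):
    # 0 means no better than predicting the mean, 1 means a perfect fit.
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

# Should closely match best_U_perc.r2_score(valid) below:
# r2(valid[target], best_U_perc.predict(valid))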

best_U_perc = qgraph[0]
best_U_perc.r2_score(train)
0.19256211963333614
best_U_perc.r2_score(valid)
0.19674207680952693

Note how the R2 scores are similar on the train and validation sets, which means we are not overfitting.

We can also consider the R2 scores above as a baseline for further questions and QLattice explorations.

Let's take a look at our prediction of mean_deg_Mg_pH10 vs U_percent:

best_U_perc.plot_partial(train, by='U_percent')
[Figure: partial plot of mean_deg_Mg_pH10 vs U_percent, showing the fitted bell-shaped curve]

What is this curve capturing that a linear function cannot? This leads us to a sub-question:

Question 1b: How does U_percent linearly relate to the target variable and how does it compare to the above relation?

ql.reset()
qgraph = ql.get_regressor(['U_percent'], target, max_depth=1)\
.filter(feyn.filters.Functions('linear'))
for _ in range(20):
    qgraph.fit(train, threads=6)
    ql.update(qgraph.best())
Loss: 8.35E-03 — Fitting 20: 100% completed. Best loss so far: 0.008352
[Graph: U_percent → linear → mean_deg_Mg_pH10]

Now we have adjusted the QLattice to form an alternative hypothesis to the previous question: U_percent has a linear relationship to mean_deg_Mg_pH10.

Again we use the R2 score to compare both hypotheses.

best_U_lin = qgraph[0]
best_U_lin.r2_score(train)
0.11813109310478409
best_U_lin.r2_score(valid)
0.14485858904785553

So we can already see that the R2 scores are not as good as with the gaussian relation. This makes us favour the gaussian / sine hypothesis over the linear one.

Let's plot the line!

best_U_lin.plot_partial(train,by='U_percent')
[Figure: partial plot of mean_deg_Mg_pH10 vs U_percent, showing the fitted line]

Here we can see that there are a lot of points dragging the gradient down, which affects the predicted mean degradation for the small and the large values of U_percent. The curve from the non-linear relation, on the other hand, has managed to capture the lower mean_deg_Mg_pH10 values at the low and high ends of U_percent. This explains the higher R2 scores for the gaussian.

This indicates that there's more here than meets the eye. There are non-linear relationships to be explored between the features and the output.
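As a quick cross-check of the two hypotheses, one could also compare their root mean squared errors on the validation set. The snippet below is a sketch; it assumes the Graph.predict method from Feyn 1.x.

import numpy as np

# RMSE of each hypothesis on the validation set (lower is better).
for graph, name in [(best_U_perc, 'gaussian/sine'), (best_U_lin, 'linear')]:
    residuals = valid[target] - graph.predict(valid)
    print(name, np.sqrt(np.mean(residuals ** 2)))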

From the Eterna game, we can assume that forming pairs is important. We don't know how but that can be the subject of our next question.

Question 2: How does pairs_rate relate to the target variable?

ql.reset()
qgraph = ql.get_regressor(['pairs_rate'], target, max_depth=1)
for _ in range(20):
    qgraph.fit(train, threads=6)
    ql.update(qgraph.best())
Loss: 6.50E-03 — Fitting 20: 100% completed. Best loss so far: 0.006503
[Graph: pairs_rate → gaussian → mean_deg_Mg_pH10]

Why not take a look at the head yourself and see how uniform the QGraph is.

Hypothesis: pairs_rate has this gaussian relationship to mean_deg_Mg_pH10

best_pairs = qgraph[0]
best_pairs.r2_score(train)
0.31523686520756944
best_pairs.r2_score(valid)
0.2867096734133877

See how the R2 scores have increased compared to just U_percent.

Let's plot this!

best_pairs.plot_partial(train, by='pairs_rate')
[Figure: partial plot of mean_deg_Mg_pH10 vs pairs_rate, showing the fitted gaussian]

This is another gaussian :p.

Let's get our feet a little wet and dabble in a little complexity.

Question 3: How do U_percent and pairs_rate combine to predict the target variable?

ql.reset()
qgraph = ql.get_regressor(['U_percent', 'pairs_rate'], target, max_depth=1)
for _ in range(30):
    qgraph.fit(train, threads=6)
    ql.update(qgraph.best())
Loss: 5.04E-03 — Fitting 30: 100% completed. Best loss so far: 0.005037
[Graph: U_percent and pairs_rate → gaussian → mean_deg_Mg_pH10]

Notice here we've increased the number of loops to 30. This is because the search space has increased. There are graphs with two features and also graphs with a single feature. They are competing against each other so we need to allow the QLattice more time to find the strongest graphs.

Hypothesis: U_percent and pairs_rate have this gaussian relationship to mean_deg_Mg_pH10

best_U_perc_pairs = qgraph[0]
best_U_perc_pairs.r2_score(train)
0.46917256079886294
best_U_perc_pairs.r2_score(valid)
0.45529212123027185

This is a big jump! Let's investigate further.

best_U_perc_pairs.plot_partial(train,by='U_percent', fixed = {'pairs_rate':[0.2,0.35,0.5,0.6]})
best_U_perc_pairs.plot_partial(train,by='pairs_rate', fixed = {'U_percent' : train['U_percent'].mean()})

[Figures: partial plots of mean_deg_Mg_pH10 vs U_percent at several fixed pairs_rate values, and vs pairs_rate at the mean U_percent]

This partial plot is a little more complex because our graph is 2D. What we are seeing is how the target varies as both pairs_rate and U_percent change.

Interestingly, the shape of the relation on U_percent has changed significantly compared to our first plot: U_percent now follows an upwards trend. Remember from the beginning that U_percent by itself fitted a gaussian much better than a linear function. With the inclusion of pairs_rate, however, U_percent behaves much more like a linear function than a gaussian. By conditioning on different pairs_rate values we have effectively sectioned U_percent and allowed multiple lines to be fitted to it. In other words, instead of fitting a single gaussian, we have different upwards trends for different values of pairs_rate, which captures mean degradation better than before.

Look at the top plot with U_percent as the x-axis: notice how the corresponding gaussian moves up when pairs_rate changes from 0.2 to 0.35. Then it moves down when pairs_rate changes from 0.35 to 0.6. This traces out the gaussian in the bottom plot with pairs_rate as the x-axis.
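You can make this tracing explicit by sweeping pairs_rate over a finer grid of fixed values (the particular values below are an illustrative choice of ours):

import numpy as np

# Sweep pairs_rate to watch the U_percent curve shift up and then down,
# tracing out the gaussian on pairs_rate.
best_U_perc_pairs.plot_partial(train, by='U_percent',
                               fixed={'pairs_rate': list(np.linspace(0.2, 0.6, 5))})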

Let's have a look at something a little more complex.

Question 4: Given U_percent and pairs_rate, is there a third feature that improves the explanation of the target variable?

Here we will make sure each graph will contain U_percent and pairs_rate. We will let the other features compete for the last spot. This means we need to increase the amount of fitting loops.

train.drop(drop_cols, axis=1)
A_percent G_percent C_percent U_percent U-G C-G U-A G-C A-U G-U pairs_rate
1201 0.401869 0.205607 0.224299 0.168224 0.043478 0.217391 0.217391 0.304348 0.130435 0.086957 0.429907
1138 0.485981 0.196262 0.205607 0.112150 0.000000 0.307692 0.153846 0.307692 0.230769 0.000000 0.485981
56 0.392523 0.242991 0.261682 0.102804 0.000000 0.354839 0.193548 0.354839 0.096774 0.000000 0.579439
1183 0.224299 0.336449 0.280374 0.158879 0.000000 0.416667 0.083333 0.333333 0.166667 0.000000 0.448598
1316 0.700935 0.158879 0.074766 0.065421 0.000000 0.285714 0.285714 0.000000 0.428571 0.000000 0.130841
... ... ... ... ... ... ... ... ... ... ...

953 rows × 11 columns

ql.reset()
qgraph = ql.get_regressor(train.drop(drop_cols, axis=1).columns.to_list(), target, max_depth=2)\
.filter(feyn.filters.MaxEdges(6))\
.filter(feyn.filters.Contains(['U_percent', 'pairs_rate']))
for _ in range(50):
    qgraph.fit(train, threads=6)
    ql.update(qgraph.best())
Loss: 4.67E-03 — Fitting 50: 100% completed. Best loss so far: 0.004667
[Graph: A_percent, pairs_rate and U_percent combined through gaussian and multiply cells → mean_deg_Mg_pH10]

Since the diversity of graphs has increased, there are many competing solutions to the question. We sometimes see A-U or U-G taking the third spot, and sometimes no third feature is chosen at all.

However if you run the above three cells several times then you will see that A_percent and G_percent tend to be the two best candidates for the third feature.

This means we don't have a fixed hypothesis this time which leads us to reformulate the previous question into two new ones.

Question 4b: How do U_percent, pairs_rate and G_percent combine to explain the target variable?

ql.reset()
qgraph = ql.get_regressor(['U_percent', 'pairs_rate', 'G_percent'], target, max_depth=2)\
.filter(feyn.filters.MaxEdges(6))
for _ in range(50):
    qgraph.fit(train, threads=6)
    ql.update(qgraph.best())
Loss: 4.87E-03 — Fitting 50: 100% completed. Best loss so far: 0.004873
[Graph: gaussian of pairs_rate and G_percent, multiplied with U_percent → mean_deg_Mg_pH10]
best_U_pr_G_perc = qgraph[0]

Hypothesis: mean_deg_Mg_pH10 follows the above relation between G_percent, U_percent and pairs_rate

Usually we see a three-dimensional gaussian in this hypothesis, though sometimes we see variants of it. In all the runs of this notebook so far, this hypothesis has been the winner. Remember that the graphs are the hypotheses and they compete against each other; depending on the data, the top graph may differ between QLattice runs.

best_U_pr_G_perc.r2_score(train)
0.489082790675745
best_U_pr_G_perc.r2_score(valid)
0.4801340928388709

Question 4c: How do U_percent, pairs_rate and A_percent combine to explain the target variable?

ql.reset()
qgraph = ql.get_regressor(['U_percent', 'pairs_rate', 'A_percent'], target, max_depth=2)\
.filter(feyn.filters.MaxEdges(6))
for _ in range(50):
    qgraph.fit(train, threads=6)
    ql.update(qgraph.best())
Loss: 4.61E-03 — Fitting 50: 100% completed. Best loss so far: 0.004614
[Graph: gaussian of pairs_rate and A_percent, multiplied with U_percent → mean_deg_Mg_pH10]
best_U_pr_A_perc = qgraph[0]

Hypothesis: mean_deg_Mg_pH10 follows the above relation between A_percent, U_percent and pairs_rate

Usually we see a three-dimensional gaussian in this hypothesis, though sometimes we see variants of it. Do your results agree? Again, run the above cells many times and see which hypothesis tends to occur the most.

best_U_pr_A_perc.r2_score(train)
0.5134878825008996
best_U_pr_A_perc.r2_score(valid)
0.5079327377293561

Here we can see that A_percent performs better than G_percent in terms of R2 score on train and validation sets. This means we favour the hypothesis that takes A_percent as the third feature over G_percent.

Let's explore the model we just got a little further:

best_U_pr_A_perc.sympify(signif=3)

$$-0.156 \left(-5.57\, U_{percent} - 1.61\right) e^{-7.81 \left(0.417 - A_{percent}\right)^{2} - 7.79 \left(pairs_{rate} - 0.357\right)^{2}} + 0.443$$

From resetting the QLattice and running the loop a few times, we usually see that pairs_rate and A_percent go into a gaussian while U_percent enters an ascending function (sometimes a gaussian, sometimes an exp). These then interact in a multiplicative fashion.
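To play with this closed form directly, you can transcribe it into a plain function. The coefficients below are copied from the sympified expression above; your own run will likely produce slightly different values, and the graph's predictions (via Graph.predict, assumed from Feyn 1.x) should be near-identical.

import numpy as np

# The fitted expression as a plain numpy function (coefficients from above).
def deg_hat(U_percent, A_percent, pairs_rate):
    return (-0.156 * (-5.57 * U_percent - 1.61)
            * np.exp(-7.81 * (0.417 - A_percent) ** 2
                     - 7.79 * (pairs_rate - 0.357) ** 2)
            + 0.443)

# Compare with the graph's own predictions on the training rows:
# deg_hat(train['U_percent'], train['A_percent'], train['pairs_rate']).head()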

Let's check the partial plots

best_U_pr_A_perc.plot_partial(train,by='A_percent', fixed = {'pairs_rate':[0.2,0.35,0.5,0.6], 'U_percent' : train['U_percent'].mean()})
best_U_pr_A_perc.plot_partial(train,by='pairs_rate', fixed = {'U_percent' : train['U_percent'].mean(), 'A_percent' : train['A_percent'].mean()})
best_U_pr_A_perc.plot_partial(train,by='U_percent', fixed = {'pairs_rate' : [0.2,0.35,0.5,0.6], 'A_percent' : train['A_percent'].mean()})

[Figures: partial plots of mean_deg_Mg_pH10 vs A_percent, vs pairs_rate, and vs U_percent]

We can see that A_percent has an island of points at high values but with low mean_deg_Mg_pH10, which is why the QLattice chose the gaussian to match it. Observe that pairs_rate behaves similarly to the graph from Question 3.

Concluding remarks

The graph that contains the most signal in terms of R2 score takes U_percent, pairs_rate and A_percent. We usually see that A_percent and pairs_rate fit to gaussians. This means that at lower and higher values of these features, mean_deg_Mg_pH10 is quite low, while mean degradation increases as the values move towards the middle. In other words, there is a hot spot where mean degradation is at its highest: for A_percent this is usually around 0.42, and for pairs_rate around 0.36 (the sketch after the expression below checks this numerically). In contrast to A_percent and pairs_rate, we do not have samples where U_percent goes above 0.35. Since U_percent follows an upwards trend in this region, 0.35 would yield the highest mean degradation.

Let's settle on this hypothesis that A_percent, U_percent and pairs_rate explain the mean degradation of the mRNA samples according to the expression below:

best_U_pr_A_perc.sympify(signif=2)

$$-0.16 \left(-5.6\, U_{percent} - 1.6\right) e^{-7.8 \left(0.42 - A_{percent}\right)^{2} - 7.8 \left(pairs_{rate} - 0.36\right)^{2}} + 0.44$$
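As a numerical sanity check of the hot spot mentioned above, we can evaluate the transcribed deg_hat function from earlier on a grid. This is an illustrative check of ours, not part of the original analysis:

import numpy as np

# Locate the (A_percent, pairs_rate) hot spot of the fitted surface,
# holding U_percent at its maximum observed value of about 0.35.
A, P = np.meshgrid(np.linspace(0, 0.8, 201), np.linspace(0, 0.8, 201))
Z = deg_hat(0.35, A, P)
i, j = np.unravel_index(Z.argmax(), Z.shape)
print(A[i, j], P[i, j])  # roughly 0.42 and 0.36, matching the remarks above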

Ideally we want to test this hypothesis on data gathered from a new experiment. If we can't do that then let's use the holdout set to mimic this.

best_U_pr_A_perc.r2_score(holdo)
0.41346295608630135

See how the R2 score is lower than on our training and validation sets. Since our hypothesis did not perform as expected in the "real world", we need to go back, formulate new questions to the data and extract new hypotheses. Notice how this R2 score is similar to when we had only U_percent and pairs_rate; this could be a point to return to.

A big warning here concerns reusing your holdout set. If the hypothesis we settled on performs similarly on train, validation and holdout, it does not necessarily mean it will generalise to new data. The holdout set is now contaminated, since we have biased ourselves towards it, so performing well on it means less.

A simple suggestion, if the data allows it, is to have multiple holdout sets. This mimics performing multiple experiments. Another suggestion is to reseed your train/valid/holdo split. You should still take this with a grain of salt, however, because biases can persist.
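Here is a minimal sketch of the multiple-holdout idea, splitting the existing holdout into a few smaller ones. The number of splits and the shuffling seed are arbitrary choices of ours:

import numpy as np

# Split the untouched holdout data into three smaller holdout sets and
# score the chosen hypothesis on each, mimicking repeated experiments.
holdout_sets = np.array_split(holdo.sample(frac=1, random_state=0), 3)
for i, h in enumerate(holdout_sets):
    print(f'holdout {i}: R2 = {best_U_pr_A_perc.r2_score(h):.3f}')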

Suggestions for next steps

Well done on completing this tutorial! Here are some suggestions you can do for a bit of extra fun:

  1. Try increasing the complexity a little more. See if you get a higher R2 score on the train and validation sets when you include both A_percent and G_percent along with U_percent and pairs_rate. You do this by removing the MaxEdges filter and keeping max_depth at two (see the sketch after this list).
  2. Try this out for the other average measurement values: mean_reactivity, mean_deg_50C, etc. Do we obtain the same gaussian behaviour as we do with mean_deg_Mg_pH10?
  3. The QLattice hasn't picked up on the specific pairs. What happens if you force the QLattice to include them?
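For suggestion 1, the setup could look like the following. This uses the same API calls as the cells above; the number of fitting loops is a guess on our part:

ql.reset()
qgraph = ql.get_regressor(['U_percent', 'pairs_rate', 'A_percent', 'G_percent'],
                          target, max_depth=2)
for _ in range(50):
    qgraph.fit(train, threads=6)
    ql.update(qgraph.best())

print(qgraph[0].r2_score(train), qgraph[0].r2_score(valid))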

Remember to ALWAYS ask a question before extracting hypotheses (QGraph) from the QLattice. The scientific method is your guiding principle when using the QLattice in your quest for truth!
