# COVID-19 vaccination RNA dataset

by: Meera Machado & Chris Cave

Feyn version: 1.4.+

In this tutorial we are going to go through a typical `QLattice` workflow. We perform a simple analysis on the OpenVaccine: COVID-19 mRNA Vaccine Degradation Prediction dataset from Kaggle.

Some of the potential COVID-19 vaccines are mRNA based. However, due to the unstable nature of mRNA, they must be refrigerated under extreme conditions, which makes distribution of the vaccine problematic.

The raw dataset (which can be found on the Kaggle competition page) consists of 2400 mRNA samples. Each mRNA consists of 107 nucleotides, and various measurements were performed on the first 68 of them: reactivity, degradation at pH 10 with and without magnesium, and degradation at $50^\circ \mathrm{C}$ with and without magnesium.

In our analysis, we have simplified the dataset. For each mRNA, we counted each nucleotide and each pair type, along with the total number of pairs in the molecule. We took the average reactivity and degradation measurements across the nucleotides of each mRNA sample.
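As a reference for how such a table can be built, here is a minimal sketch under stated assumptions: the column names follow the raw Kaggle OpenVaccine file, but the raw path and the exact feature definitions (e.g. of `pairs_rate`) are our guesses, not necessarily the recipe behind `covid_simple.csv`:

```
import numpy as np
import pandas as pd

# Sketch only: assumed location/format of the raw Kaggle file
raw = pd.read_json('../data/train.json', lines=True)

eng = pd.DataFrame()
seqs = raw['sequence']
for nuc in 'AGCU':
    eng[nuc + '_percent'] = seqs.str.count(nuc) / seqs.str.len()

# One plausible definition of pairs_rate: the fraction of bases that are
# paired (i.e. not '.') in the dot-bracket secondary structure
struct = raw['structure']
eng['pairs_rate'] = 1 - struct.str.count(r'\.') / struct.str.len()

# The pair-type fractions (U-G, C-G, ...) additionally require matching each
# '(' to its ')' partner in the structure; that bookkeeping is omitted here

eng['SN_filter'] = raw['SN_filter']
for col in ['reactivity', 'deg_Mg_pH10', 'deg_pH10', 'deg_Mg_50C', 'deg_50C']:
    eng['mean_' + col] = raw[col].apply(np.mean)  # average over measured bases
```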

Let's take a look!

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import feyn
from sklearn.model_selection import train_test_split
%matplotlib inline
```

The scientific method is core to the workflow of this notebook. We will make **observations**, pose **questions** and form **hypotheses**.

```
# Here is our engineered dataset
eng_data = pd.read_csv('../data/covid_simple.csv')
eng_data.head()
```

| | A_percent | G_percent | C_percent | U_percent | U-G | C-G | U-A | G-C | A-U | G-U | pairs_rate | SN_filter | mean_reactivity | mean_deg_Mg_pH10 | mean_deg_pH10 | mean_deg_Mg_50C | mean_deg_50C |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.420561 | 0.177570 | 0.214953 | 0.186916 | 0.086957 | 0.130435 | 0.260870 | 0.347826 | 0.173913 | 0.000000 | 0.429907 | 1 | 0.502631 | 0.559628 | 0.503501 | 0.538540 | 0.585893 |
| 1 | 0.233645 | 0.308411 | 0.299065 | 0.158879 | 0.030303 | 0.363636 | 0.030303 | 0.393939 | 0.121212 | 0.060606 | 0.616822 | 0 | 0.411401 | 0.698354 | 0.846041 | 0.930103 | 0.865125 |
| 2 | 0.401869 | 0.224299 | 0.186916 | 0.186916 | 0.041667 | 0.208333 | 0.208333 | 0.291667 | 0.125000 | 0.125000 | 0.448598 | 1 | 0.433966 | 0.578362 | 0.548174 | 0.570284 | 0.575172 |
| 3 | 0.261682 | 0.327103 | 0.327103 | 0.084112 | 0.000000 | 0.437500 | 0.062500 | 0.406250 | 0.031250 | 0.062500 | 0.598131 | 0 | 0.329059 | 0.826485 | 1.051551 | 1.221297 | 0.696476 |
| 4 | 0.542056 | 0.056075 | 0.056075 | 0.345794 | 0.000000 | 0.000000 | 0.457143 | 0.000000 | 0.542857 | 0.000000 | 0.654206 | 0 | 0.282204 | 0.469587 | 0.384826 | 0.393072 | 0.440115 |

There's a feature we haven't talked about: the `SN_filter`. This is the signal-to-noise filter capturing which RNA molecules passed the evaluation criteria defined by the Stanford researchers. We will drop the rows with `SN_filter = 0`.

```
eng_data_SN1 = eng_data.query('SN_filter == 1')
```

So now we want to use the `QLattice`! How do we start?

First we need to choose a target variable. We choose degradation with magnesium at pH 10. The higher its value, the higher the chance of the mRNA "breaking" at that nucleotide location. In `eng_data` we simply took the mean across the nucleotides of each mRNA sample.

This notebook can easily be modified for the other target variables. Please get your hands dirty and mess around with it.
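For reference, the averaged targets available in the engineered table are the `mean_*` columns; you can list them directly:

```
# The candidate target columns in the engineered dataset
[c for c in eng_data.columns if c.startswith('mean_')]
# ['mean_reactivity', 'mean_deg_Mg_pH10', 'mean_deg_pH10',
#  'mean_deg_Mg_50C', 'mean_deg_50C']
```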

```
target = 'mean_deg_Mg_pH10'
```

We have chosen a target variable. What do we do next? This is the moment where we differ from traditional machine learning techniques.

We need to take a step back and think about what we want to ask about our data, i.e. what **observations** can we make? Let's start with baby steps:

**Can a single feature already explain some of the output?**

```
plt.figure(figsize=(21, 21))
for ix, col in enumerate(eng_data_SN1.columns[:11]):
    plt.subplot(3, 4, ix + 1)
    plt.scatter(eng_data_SN1[col], eng_data_SN1[target])
    plt.xlabel(col)
```

**Observation**: note how `U_percent` seems to be linearly correlated with `mean_deg_Mg_pH10`.

Let's calculate their Pearson correlations:

```
drop_cols = ['SN_filter'] + [out for out in eng_data_SN1.columns if 'mean' in out]
```

```
for feat in eng_data_SN1.drop(drop_cols, axis=1).columns:
    print(feat, feyn.metrics.calculate_pc(eng_data_SN1[feat], eng_data_SN1['mean_deg_Mg_pH10']))
```

```
A_percent -0.06920561730299386
G_percent -0.15027559191692702
C_percent -0.10138997239078872
U_percent 0.3623464717168081
U-G -0.11540344284319401
C-G -0.006336444226142389
U-A 0.14079264385434193
G-C -0.10153195407020321
A-U 0.07401507079974223
G-U -0.01572146918393174
pairs_rate 0.04032118343747456
```

As we can see, `U_percent` is certainly the one with the highest **linear correlation**!
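As a cross-check, the same numbers can be obtained with pandas alone; `DataFrame.corrwith` computes Pearson correlations by default:

```
# Should match the feyn.metrics.calculate_pc values above
eng_data_SN1.drop(drop_cols, axis=1).corrwith(eng_data_SN1[target])
```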

Before we move on to the question, let's split the data into **train**, **validation** and **holdout** sets:

```
train, test = train_test_split(eng_data_SN1, test_size=0.4, random_state=527140)
valid, holdo = train_test_split(test, test_size=0.5, random_state=5271740)
ql = feyn.QLattice()
```

## Question 1: How does `U_percent` relate to the target variable?

**Always have a question in mind before you start working with the QLattice.**

We want to use this **question** to define the types of graphs we want to extract from the `QLattice`, i.e. the **question** you pose determines the **hypotheses** you can form about it. Throughout this notebook, think of graphs generated by the `QLattice` as **hypotheses** answering your **question**.

You translate your question into the `QLattice` by defining the `QGraph` as a regressor or classifier, setting the input and output features, fixing the graphs' maximum depth and number of edges, selecting the types of functions, etc. Check out the filters page for more information.
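As a quick illustration of chaining such constraints (the individual filters are the ones used later in this notebook; this particular combination is only an example):

```
# Example only: restrict the QGraph to linear functions, at most 6 edges,
# and graphs that must contain U_percent
qgraph = ql.get_regressor(['U_percent', 'pairs_rate'], target, max_depth=2)\
    .filter(feyn.filters.Functions('linear'))\
    .filter(feyn.filters.MaxEdges(6))\
    .filter(feyn.filters.Contains(['U_percent']))
```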

The `QGraph` is the **list of hypotheses** generated by the `QLattice`!

In the current case, we extract a `regressor` with a single input feature, `U_percent`, and limit the graphs to depth one. In other words, `U_percent` is mapped to `mean_deg_Mg_pH10` with a single function. This will generate **hypotheses** of the type "`U_percent` has an $f(x)$ relationship to `mean_deg_Mg_pH10`", providing answers to the question above.

```
ql.reset()
qgraph = ql.get_regressor(['U_percent'], target, max_depth=1)
```

In our limited problem, the number of possible graph architectures is small. This means 20 iterations should be enough to explore the possibilities and pick out the best graph.

```
for _ in range(20):
    qgraph.fit(train, threads=6)
    ql.update(qgraph.best())
```

You can peer into the `QGraph` by calling its head. It's a good way to check the architecture diversity of the graphs, i.e. the range of hypotheses that the `QLattice` has generated.

```
qgraph.head()
```

Usually what we see when we run this fitting loop is a **gaussian** or a **sine** function. Try running it a few times, with different numbers of fitting loops, to see if you get the same results as us.

This is different from our expectation that `U_percent` predicts the target variable through a linear function. Here the `sine` and `gaussian` functions demonstrate that there could be a non-linear relation between `U_percent` and `mean_deg_Mg_pH10`. This leads us to our first hypothesis:

**Hypothesis**: `U_percent` has this **gaussian** / **sine** relation to `mean_deg_Mg_pH10`.

We now test this hypothesis. We use the R2 score to verify whether predictions that come from it are better than random guessing. The R2 score is one of many metrics one can use for this; other examples are the root mean squared error and Student's t-test on the distribution of the residuals.

```
best_U_perc = qgraph[0]
```

```
best_U_perc.r2_score(train)
```

```
0.19256211963333614
```

```
best_U_perc.r2_score(valid)
```

```
0.19674207680952693
```

Note how the R2 scores are similar on the train and validation sets, which means we are not overfitting.
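The other metrics mentioned above are quick to check as well. Here is a minimal sketch, using the graph's `predict` method and scipy's one-sample t-test on the residuals:

```
from scipy import stats

preds = best_U_perc.predict(valid)
residuals = valid[target].values - preds

rmse = np.sqrt(np.mean(residuals ** 2))            # root mean squared error
t_stat, p_value = stats.ttest_1samp(residuals, 0)  # are residuals centred on 0?
print(f"RMSE: {rmse:.4f}, t = {t_stat:.2f}, p = {p_value:.3f}")
```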

We can also consider the R2 scores above as a baseline for further questions and `QLattice` explorations.

Let's take a look at our prediction of `mean_deg_Mg_pH10` vs `U_percent`:

```
best_U_perc.plot_partial(train, by='U_percent')
```


What is this curve capturing that a linear function cannot? This leads us to a sub-question.

## Question 1b: How does `U_percent` linearly relate to the target variable and how does it compare to the above relation?

```
ql.reset()
qgraph = ql.get_regressor(['U_percent'], target, max_depth=1)\
    .filter(feyn.filters.Functions('linear'))
```

```
for _ in range(20):
    qgraph.fit(train, threads=6)
    ql.update(qgraph.best())
```

Now we have adjusted the `QLattice` to form an alternative hypothesis to the previous question: `U_percent` has a linear relationship to `mean_deg_Mg_pH10`.

Again we use the R2 score to compare both **hypotheses**.

```
best_U_lin = qgraph[0]
```

```
best_U_lin.r2_score(train)
```

```
0.11813109310478409
```

```
best_U_lin.r2_score(valid)
```

```
0.14485858904785553
```

So we can already see that the R2 scores are not as good as with the gaussian relation. This makes us favour the **gaussian** / **sine** hypothesis over the **linear** one.

Let's plot the line!

```
best_U_lin.plot_partial(train,by='U_percent')
```


Here we can see that there are a lot of points dragging the gradient down. This affects the predicted mean degradation for small and large values of `U_percent`. The curve from the non-linear relation, on the other hand, has managed to capture the lower `mean_deg_Mg_pH10` values at the low and high ends of `U_percent`. This explains the higher R2 scores for the gaussian.

This indicates that there's more here than meets the eye. There are non-linear relationships to be explored between the features and the output.

From the Eterna game, we can assume that forming pairs is important. We don't know how, but that can be the subject of our next question.

## Question 2: How does `pairs_rate` relate to the target variable?

```
ql.reset()
qgraph = ql.get_regressor(['pairs_rate'], target, max_depth=1)
```

```
for _ in range(20):
    qgraph.fit(train, threads=6)
    ql.update(qgraph.best())
```

Why not take a look at the head yourself and see how uniform the `QGraph` is?

**Hypothesis**: `pairs_rate` has this **gaussian** relationship to `mean_deg_Mg_pH10`.

```
best_pairs = qgraph[0]
```

```
best_pairs.r2_score(train)
```

```
0.31523686520756944
```

```
best_pairs.r2_score(valid)
```

```
0.2867096734133877
```

See how the R2 scores have increased compared to just `U_percent`.

Let's plot this!

```
best_pairs.plot_partial(train, by='pairs_rate')
```


This is another gaussian :p.

Let's get our feet wet and dabble in a little complexity.

## Question 3: How do `U_percent` and `pairs_rate` combine to predict the target variable?

```
ql.reset()
qgraph = ql.get_regressor(['U_percent', 'pairs_rate'], target, max_depth=1)
```

```
for _ in range(30):
    qgraph.fit(train, threads=6)
    ql.update(qgraph.best())
```

Notice that we've increased the number of loops to 30. This is because the search space has grown: there are graphs with two features as well as graphs with a single feature. They compete against each other, so we need to allow the `QLattice` more time to find the strongest graphs.

**Hypothesis**: `U_percent` and `pairs_rate` have this **gaussian** relationship to `mean_deg_Mg_pH10`.

```
best_U_perc_pairs = qgraph[0]
```

```
best_U_perc_pairs.r2_score(train)
```

```
0.46917256079886294
```

```
best_U_perc_pairs.r2_score(valid)
```

```
0.45529212123027185
```

This is a big jump! Let's investigate further.

```
best_U_perc_pairs.plot_partial(train,by='U_percent', fixed = {'pairs_rate':[0.2,0.35,0.5,0.6]})
best_U_perc_pairs.plot_partial(train,by='pairs_rate', fixed = {'U_percent' : train['U_percent'].mean()})
```

This partial plot is a little more complex because our graph is 2D. What we are seeing is how the target varies as *both* `pairs_rate` and `U_percent` change.

Interestingly, the **gaussian** on `U_percent` has changed significantly compared to our first plot: `U_percent` now follows an upwards trend. Remember from the beginning that `U_percent` by itself fitted a **gaussian** much better than a **linear** function. However, with the inclusion of `pairs_rate`, `U_percent` behaves much more like a **linear** function than a **gaussian**. This is because the inclusion of `pairs_rate` sections `U_percent` into different `pairs_rate` values, allowing multiple lines to be fitted to it. In other words, instead of fitting a single **gaussian**, we have different upwards trends for different values of `pairs_rate`, which captures mean degradation better than before.

Look at the top plot with `U_percent` as the x-axis: notice how the corresponding gaussian moves up when `pairs_rate` changes from 0.2 to 0.35, then moves down when `pairs_rate` changes from 0.35 to 0.6. This traces out the gaussian in the bottom plot with `pairs_rate` as the x-axis.

Let's have a look at something a little more complex.

## Question 4: Given `U_percent` and `pairs_rate`, is there a third feature that improves the explanation of the target variable?

Here we will make sure each graph contains `U_percent` and `pairs_rate`, and we will let the other features compete for the last spot. This means we need to increase the number of fitting loops.

```
train.drop(drop_cols, axis=1)
```

| | A_percent | G_percent | C_percent | U_percent | U-G | C-G | U-A | G-C | A-U | G-U | pairs_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1201 | 0.401869 | 0.205607 | 0.224299 | 0.168224 | 0.043478 | 0.217391 | 0.217391 | 0.304348 | 0.130435 | 0.086957 | 0.429907 |
| 1138 | 0.485981 | 0.196262 | 0.205607 | 0.112150 | 0.000000 | 0.307692 | 0.153846 | 0.307692 | 0.230769 | 0.000000 | 0.485981 |
| 56 | 0.392523 | 0.242991 | 0.261682 | 0.102804 | 0.000000 | 0.354839 | 0.193548 | 0.354839 | 0.096774 | 0.000000 | 0.579439 |
| 1183 | 0.224299 | 0.336449 | 0.280374 | 0.158879 | 0.000000 | 0.416667 | 0.083333 | 0.333333 | 0.166667 | 0.000000 | 0.448598 |
| 1316 | 0.700935 | 0.158879 | 0.074766 | 0.065421 | 0.000000 | 0.285714 | 0.285714 | 0.000000 | 0.428571 | 0.000000 | 0.130841 |
| 2084 | 0.598131 | 0.074766 | 0.093458 | 0.233645 | 0.000000 | 0.074074 | 0.407407 | 0.074074 | 0.444444 | 0.000000 | 0.504673 |
| 2316 | 0.411215 | 0.252336 | 0.196262 | 0.140187 | 0.045455 | 0.272727 | 0.136364 | 0.318182 | 0.227273 | 0.000000 | 0.411215 |
| 40 | 0.373832 | 0.271028 | 0.271028 | 0.084112 | 0.000000 | 0.407407 | 0.074074 | 0.407407 | 0.111111 | 0.000000 | 0.504673 |
| 1476 | 0.457944 | 0.242991 | 0.214953 | 0.084112 | 0.000000 | 0.500000 | 0.181818 | 0.181818 | 0.136364 | 0.000000 | 0.411215 |
| 1619 | 0.355140 | 0.271028 | 0.196262 | 0.177570 | 0.066667 | 0.266667 | 0.166667 | 0.233333 | 0.233333 | 0.033333 | 0.560748 |
| 2117 | 0.392523 | 0.214953 | 0.196262 | 0.196262 | 0.058824 | 0.117647 | 0.235294 | 0.323529 | 0.235294 | 0.029412 | 0.635514 |
| 1843 | 0.327103 | 0.280374 | 0.261682 | 0.130841 | 0.035714 | 0.357143 | 0.142857 | 0.285714 | 0.107143 | 0.071429 | 0.523364 |
| 2342 | 0.485981 | 0.168224 | 0.140187 | 0.205607 | 0.000000 | 0.259259 | 0.407407 | 0.074074 | 0.259259 | 0.000000 | 0.504673 |
| 1550 | 0.242991 | 0.336449 | 0.327103 | 0.093458 | 0.032258 | 0.483871 | 0.096774 | 0.322581 | 0.000000 | 0.064516 | 0.579439 |
| 1392 | 0.317757 | 0.261682 | 0.130841 | 0.289720 | 0.216216 | 0.081081 | 0.189189 | 0.135135 | 0.162162 | 0.216216 | 0.691589 |
| 1932 | 0.392523 | 0.196262 | 0.205607 | 0.205607 | 0.037037 | 0.074074 | 0.222222 | 0.370370 | 0.259259 | 0.037037 | 0.504673 |
| 281 | 0.308411 | 0.289720 | 0.242991 | 0.158879 | 0.071429 | 0.321429 | 0.214286 | 0.321429 | 0.000000 | 0.071429 | 0.523364 |
| 529 | 0.392523 | 0.233645 | 0.280374 | 0.093458 | 0.043478 | 0.391304 | 0.173913 | 0.347826 | 0.043478 | 0.000000 | 0.429907 |
| 948 | 0.280374 | 0.336449 | 0.299065 | 0.084112 | 0.030303 | 0.393939 | 0.060606 | 0.393939 | 0.090909 | 0.030303 | 0.616822 |
| 2075 | 0.392523 | 0.196262 | 0.205607 | 0.205607 | 0.035714 | 0.214286 | 0.285714 | 0.214286 | 0.250000 | 0.000000 | 0.523364 |
| 360 | 0.476636 | 0.186916 | 0.205607 | 0.130841 | 0.000000 | 0.307692 | 0.192308 | 0.269231 | 0.230769 | 0.000000 | 0.485981 |
| 2140 | 0.355140 | 0.242991 | 0.242991 | 0.158879 | 0.000000 | 0.206897 | 0.275862 | 0.344828 | 0.172414 | 0.000000 | 0.542056 |
| 443 | 0.429907 | 0.196262 | 0.196262 | 0.177570 | 0.000000 | 0.172414 | 0.241379 | 0.344828 | 0.241379 | 0.000000 | 0.542056 |
| 2293 | 0.271028 | 0.299065 | 0.224299 | 0.205607 | 0.142857 | 0.250000 | 0.071429 | 0.214286 | 0.107143 | 0.214286 | 0.523364 |
| 389 | 0.327103 | 0.252336 | 0.130841 | 0.289720 | 0.216216 | 0.081081 | 0.162162 | 0.135135 | 0.216216 | 0.189189 | 0.691589 |
| 2299 | 0.411215 | 0.224299 | 0.158879 | 0.205607 | 0.071429 | 0.142857 | 0.285714 | 0.214286 | 0.250000 | 0.035714 | 0.523364 |
| 1675 | 0.383178 | 0.252336 | 0.233645 | 0.130841 | 0.038462 | 0.307692 | 0.076923 | 0.307692 | 0.269231 | 0.000000 | 0.485981 |
| 1294 | 0.373832 | 0.289720 | 0.261682 | 0.074766 | 0.000000 | 0.409091 | 0.090909 | 0.409091 | 0.045455 | 0.045455 | 0.411215 |
| 1727 | 0.476636 | 0.186916 | 0.158879 | 0.177570 | 0.000000 | 0.214286 | 0.357143 | 0.178571 | 0.214286 | 0.035714 | 0.523364 |
| 1677 | 0.401869 | 0.242991 | 0.233645 | 0.121495 | 0.033333 | 0.233333 | 0.066667 | 0.400000 | 0.266667 | 0.000000 | 0.560748 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2127 | 0.364486 | 0.271028 | 0.242991 | 0.121495 | 0.045455 | 0.409091 | 0.136364 | 0.318182 | 0.045455 | 0.045455 | 0.411215 |
| 385 | 0.467290 | 0.186916 | 0.205607 | 0.140187 | 0.000000 | 0.285714 | 0.178571 | 0.285714 | 0.250000 | 0.000000 | 0.523364 |
| 782 | 0.317757 | 0.224299 | 0.261682 | 0.196262 | 0.000000 | 0.333333 | 0.133333 | 0.200000 | 0.266667 | 0.066667 | 0.560748 |
| 1058 | 0.401869 | 0.205607 | 0.205607 | 0.186916 | 0.035714 | 0.178571 | 0.285714 | 0.214286 | 0.214286 | 0.071429 | 0.523364 |
| 1511 | 0.271028 | 0.383178 | 0.327103 | 0.018692 | 0.000000 | 0.500000 | 0.000000 | 0.500000 | 0.000000 | 0.000000 | 0.485981 |
| 2239 | 0.476636 | 0.224299 | 0.149533 | 0.149533 | 0.045455 | 0.227273 | 0.181818 | 0.227273 | 0.227273 | 0.090909 | 0.411215 |
| 934 | 0.327103 | 0.261682 | 0.271028 | 0.140187 | 0.000000 | 0.354839 | 0.129032 | 0.322581 | 0.193548 | 0.000000 | 0.579439 |
| 902 | 0.439252 | 0.196262 | 0.196262 | 0.168224 | 0.000000 | 0.200000 | 0.266667 | 0.266667 | 0.066667 | 0.200000 | 0.280374 |
| 737 | 0.345794 | 0.224299 | 0.280374 | 0.149533 | 0.000000 | 0.225806 | 0.161290 | 0.354839 | 0.258065 | 0.000000 | 0.579439 |
| 346 | 0.336449 | 0.280374 | 0.205607 | 0.177570 | 0.034483 | 0.206897 | 0.172414 | 0.310345 | 0.172414 | 0.103448 | 0.542056 |
| 457 | 0.700935 | 0.056075 | 0.168224 | 0.074766 | 0.142857 | 0.142857 | 0.428571 | 0.000000 | 0.285714 | 0.000000 | 0.130841 |
| 1973 | 0.429907 | 0.214953 | 0.252336 | 0.102804 | 0.000000 | 0.363636 | 0.181818 | 0.227273 | 0.136364 | 0.090909 | 0.411215 |
| 2384 | 0.411215 | 0.214953 | 0.186916 | 0.186916 | 0.000000 | 0.291667 | 0.208333 | 0.208333 | 0.166667 | 0.125000 | 0.448598 |
| 409 | 0.280374 | 0.214953 | 0.280374 | 0.224299 | 0.055556 | 0.277778 | 0.111111 | 0.444444 | 0.111111 | 0.000000 | 0.336449 |
| 1992 | 0.401869 | 0.252336 | 0.186916 | 0.158879 | 0.000000 | 0.333333 | 0.185185 | 0.148148 | 0.296296 | 0.037037 | 0.504673 |
| 1919 | 0.429907 | 0.196262 | 0.186916 | 0.186916 | 0.000000 | 0.233333 | 0.233333 | 0.233333 | 0.300000 | 0.000000 | 0.560748 |
| 1990 | 0.336449 | 0.271028 | 0.196262 | 0.196262 | 0.031250 | 0.187500 | 0.218750 | 0.250000 | 0.281250 | 0.031250 | 0.598131 |
| 1684 | 0.327103 | 0.280374 | 0.242991 | 0.149533 | 0.034483 | 0.206897 | 0.172414 | 0.413793 | 0.137931 | 0.034483 | 0.542056 |
| 1118 | 0.289720 | 0.233645 | 0.336449 | 0.140187 | 0.000000 | 0.290323 | 0.161290 | 0.387097 | 0.161290 | 0.000000 | 0.579439 |
| 311 | 0.308411 | 0.252336 | 0.196262 | 0.242991 | 0.090909 | 0.242424 | 0.242424 | 0.181818 | 0.151515 | 0.090909 | 0.616822 |
| 1487 | 0.392523 | 0.242991 | 0.242991 | 0.121495 | 0.000000 | 0.310345 | 0.137931 | 0.379310 | 0.172414 | 0.000000 | 0.542056 |
| 252 | 0.626168 | 0.205607 | 0.093458 | 0.074766 | 0.142857 | 0.428571 | 0.000000 | 0.142857 | 0.142857 | 0.142857 | 0.130841 |
| 2348 | 0.691589 | 0.168224 | 0.084112 | 0.056075 | 0.000000 | 0.000000 | 0.285714 | 0.428571 | 0.285714 | 0.000000 | 0.130841 |
| 155 | 0.373832 | 0.261682 | 0.196262 | 0.168224 | 0.000000 | 0.360000 | 0.280000 | 0.240000 | 0.120000 | 0.000000 | 0.467290 |
| 1388 | 0.392523 | 0.233645 | 0.252336 | 0.121495 | 0.033333 | 0.366667 | 0.200000 | 0.266667 | 0.133333 | 0.000000 | 0.560748 |
| 1507 | 0.439252 | 0.345794 | 0.177570 | 0.037383 | 0.000000 | 0.333333 | 0.000000 | 0.444444 | 0.222222 | 0.000000 | 0.168224 |
| 649 | 0.345794 | 0.224299 | 0.280374 | 0.149533 | 0.000000 | 0.258065 | 0.193548 | 0.387097 | 0.161290 | 0.000000 | 0.579439 |
| 1800 | 0.383178 | 0.252336 | 0.271028 | 0.093458 | 0.000000 | 0.450000 | 0.000000 | 0.550000 | 0.000000 | 0.000000 | 0.373832 |
| 1540 | 0.308411 | 0.252336 | 0.252336 | 0.186916 | 0.064516 | 0.258065 | 0.129032 | 0.322581 | 0.129032 | 0.096774 | 0.579439 |
| 1873 | 0.429907 | 0.214953 | 0.233645 | 0.121495 | 0.000000 | 0.250000 | 0.250000 | 0.375000 | 0.125000 | 0.000000 | 0.448598 |

953 rows × 11 columns

```
ql.reset()
qgraph = ql.get_regressor(train.drop(drop_cols, axis=1).columns.to_list(), target, max_depth=2)\
    .filter(feyn.filters.MaxEdges(6))\
    .filter(feyn.filters.Contains(['U_percent', 'pairs_rate']))
```

```
for _ in range(50):
    qgraph.fit(train, threads=6)
    ql.update(qgraph.best())
```

Since the diversity of graphs has increased, there are many competing solutions to the question. We sometimes see `A-U` or `U-G` taking the third spot, and sometimes no third feature is chosen at all.

However, if you run the above three cells several times, you will see that `A_percent` and `G_percent` tend to be the two best candidates for the third feature.
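To make this less anecdotal, here is a sketch that repeats the search a few times and counts which extra feature the best graph ends up using, reading feature names off the sympy expression. Note this refits 5 × 50 loops, so it is slow:

```
from collections import Counter

extra_features = Counter()
for _ in range(5):
    ql.reset()
    qgraph = ql.get_regressor(train.drop(drop_cols, axis=1).columns.to_list(),
                              target, max_depth=2)\
        .filter(feyn.filters.MaxEdges(6))\
        .filter(feyn.filters.Contains(['U_percent', 'pairs_rate']))
    for _ in range(50):
        qgraph.fit(train, threads=6)
        ql.update(qgraph.best())
    # Which features appear in the best graph besides the two fixed ones?
    names = {str(s) for s in qgraph[0].sympify(signif=3).free_symbols}
    extra_features.update(names - {'U_percent', 'pairs_rate'})
print(extra_features)
```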

This means we don't have a fixed hypothesis this time, which leads us to reformulate the previous question into two new ones.

## Question 4b: How do `U_percent`, `pairs_rate` and `G_percent` combine to explain the target variable?

```
ql.reset()
qgraph = ql.get_regressor(['U_percent', 'pairs_rate', 'G_percent'], target, max_depth=2)\
    .filter(feyn.filters.MaxEdges(6))
```

```
for _ in range(50):
    qgraph.fit(train, threads=6)
    ql.update(qgraph.best())
```

```
best_U_pr_G_perc = qgraph[0]
```

**Hypothesis**: `mean_deg_Mg_pH10` has the above relation between `G_percent`, `U_percent`, and `pairs_rate`.

Usually we see a three-dimensional gaussian in this hypothesis, though sometimes we see variants of it. In all the runs of this notebook so far, this hypothesis has been the winner. Remember that the graphs are the hypotheses and they compete against each other, so depending on the data, the top graph may differ between `QLattice` runs.

```
best_U_pr_G_perc.r2_score(train)
```

```
0.489082790675745
```

```
best_U_pr_G_perc.r2_score(valid)
```

```
0.4801340928388709
```

## Question 4c: How do `U_percent`, `pairs_rate` and `A_percent` combine to explain the target variable?

```
ql.reset()
qgraph = ql.get_regressor(['U_percent', 'pairs_rate', 'A_percent'], target, max_depth=2)\
    .filter(feyn.filters.MaxEdges(6))
```

```
for _ in range(50):
    qgraph.fit(train, threads=6)
    ql.update(qgraph.best())
```

```
best_U_pr_A_perc = qgraph[0]
```

**Hypothesis**: `mean_deg_Mg_pH10` has the above relation between `A_percent`, `U_percent`, and `pairs_rate`.

Usually we see a three-dimensional gaussian in this hypothesis, though sometimes we see variants of it. Do your results agree? Again, run the above cells several times and see which hypothesis occurs most often.

```
best_U_pr_A_perc.r2_score(train)
```

```
0.5134878825008996
```

```
best_U_pr_A_perc.r2_score(valid)
```

```
0.5079327377293561
```

Here we can see that `A_percent` performs better than `G_percent` in terms of R2 score on the train and validation sets. This means we favour the hypothesis that takes `A_percent` as the third feature over the one with `G_percent`.

Let's explore the model we just got a little further:

```
best_U_pr_A_perc.sympify(signif=3)
```

$\displaystyle - 0.156 \left(- 5.57 U_{percent} - 1.61\right) e^{- 7.81 \left(0.417 - A_{percent}\right)^{2} - 7.79 \left(pairs_{rate} - 0.357\right)^{2}} + 0.443$

From resetting the `QLattice` and running the loop a few times, we usually see that `pairs_rate` and `A_percent` go into a **gaussian**, while `U_percent` enters an ascending function (sometimes a **gaussian**, sometimes an **exp**). Afterwards they interact in a multiplicative fashion.
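As a sanity check, we can evaluate the rounded symbolic expression for a single sample with sympy and compare it to the graph's own prediction; small differences are expected since `sympify` rounds the coefficients:

```
# Evaluate the printed expression for the first training sample
expr = best_U_pr_A_perc.sympify(signif=3)
row = train.iloc[0]
manual = float(expr.evalf(subs={s: row[str(s)] for s in expr.free_symbols}))

print(manual, best_U_pr_A_perc.predict(train.iloc[:1])[0])
```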

Let's check the partial plots:

```
best_U_pr_A_perc.plot_partial(train,by='A_percent', fixed = {'pairs_rate':[0.2,0.35,0.5,0.6], 'U_percent' : train['U_percent'].mean()})
best_U_pr_A_perc.plot_partial(train,by='pairs_rate', fixed = {'U_percent' : train['U_percent'].mean(), 'A_percent' : train['A_percent'].mean()})
best_U_pr_A_perc.plot_partial(train,by='U_percent', fixed = {'pairs_rate' : [0.2,0.35,0.5,0.6], 'A_percent' : train['A_percent'].mean()})
```

We can see that `A_percent` has an island of points at high values but with low `mean_deg_Mg_pH10`. This is why the `QLattice` chose the **gaussian** to match it. Observe that `pairs_rate` behaves similarly to the graph from Question 3.

# Concluding remarks

The graph that contains the most signal in terms of R2 score takes `U_percent`, `pairs_rate` and `A_percent`. We usually see that `A_percent` and `pairs_rate` fit to **gaussians**. This means that at the lower and higher values of these features, `mean_deg_Mg_pH10` is quite low, while mean degradation increases as the values move towards the middle. In other words, there is a hot spot where mean degradation is at its highest: for `A_percent` this is usually around 0.42, and for `pairs_rate` around 0.36. In contrast to `A_percent` and `pairs_rate`, we do not have samples where `U_percent` goes above 0.35. Since `U_percent` follows an upwards trend in this region, 0.35 would yield the highest mean degradation.

Let's settle on the hypothesis that `A_percent`, `U_percent` and `pairs_rate` explain the mean degradation of the mRNA samples according to the expression below:

```
best_U_pr_A_perc.sympify(signif=2)
```

$\displaystyle - 0.16 \left(- 5.6 U_{percent} - 1.6\right) e^{- 7.8 \left(0.42 - A_{percent}\right)^{2} - 7.8 \left(pairs_{rate} - 0.36\right)^{2}} + 0.44$

Ideally we would test this hypothesis on data gathered from a new experiment. Since we can't do that here, let's use the holdout set to mimic it.

```
best_U_pr_A_perc.r2_score(holdo)
```

```
0.41346295608630135
```

See how the R2 score is lower than on our training and validation sets. Since our hypothesis did not perform as expected in the "real world", we need to go back, formulate new questions to the data and extract new hypotheses. Notice how this R2 score is similar to when we had only `U_percent` and `pairs_rate`; this could be a point to return to.

A big warning here is about reusing your holdout set. If the hypothesis we settled on performs similarly on train, validation and holdout, it does not necessarily mean it will generalise to new data. The holdout set is now contaminated, since we are biased towards it, so performing well on it means less than before.

A simple suggestion, if the data allows it, is to have multiple holdout sets. This mimics performing multiple experiments. Another suggestion is to reseed your train/valid/holdout split. However, you should still take this with a grain of salt, because you can still carry biases over.
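A minimal sketch of the multiple-holdout idea: shuffle the holdout set and split it into a few disjoint chunks, scoring each one as if it were a separate experiment:

```
# Three disjoint mini-holdouts carved from the original holdout set
chunks = np.array_split(holdo.sample(frac=1, random_state=42), 3)
scores = [best_U_pr_A_perc.r2_score(chunk) for chunk in chunks]
print(scores, np.mean(scores))
```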

# Suggestions for next steps

Well done on completing this tutorial! Here are some suggestions for a bit of extra fun:

- Try increasing the complexity a little more. See if you get a higher R2 score on the train and validation sets when you include both `A_percent` and `G_percent` along with `U_percent` and `pairs_rate`. You do this by removing the `MaxEdges` filter and keeping `max_depth` at two.
- Try this out for the other average measurement values: `mean_reactivity`, `mean_deg_50C`, etc. Do we obtain the same **gaussian** behaviour as we do with `mean_deg_Mg_pH10`?
- The `QLattice` hasn't picked up on the specific pairs. What happens if you force the `QLattice` to include them?

Remember to ALWAYS ask a question before extracting hypotheses (`QGraph`) from the `QLattice`. The scientific method is your guiding principle when using the `QLattice` in your quest for truth!