# COVID-19 vaccination RNA dataset

by: Meera Machado & Chris Cave

Feyn version: 1.4.+

In this tutorial we are going to go through a typical `QLattice` workflow. We perform a simple analysis on the OpenVaccine: COVID-19 mRNA Vaccine Degradation Prediction dataset from Kaggle.

Some of the potential COVID-19 vaccines are mRNA based. However, due to the unstable nature of mRNA, they must be refrigerated under extreme conditions, which makes distribution of the vaccine problematic.

The raw dataset (which can be found on the Kaggle competition page) consists of 2400 mRNA samples. Each mRNA consists of 107 nucleotides, and various measurements were performed on the first 68 of them: reactivity, degradation at pH 10 with and without magnesium, and degradation at $50^\circ \mathrm{C}$ with and without magnesium.

In our analysis, we have simplified the dataset. For each mRNA, we counted each nucleotide and each pair type, along with the total number of pairs in the molecule. We took the average reactivity and degradation measurements across the nucleotides of each mRNA sample.
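As a reference for how such a table can be built, here is a minimal sketch under stated assumptions: the column names follow the raw Kaggle OpenVaccine file, but the raw path and the exact feature definitions (e.g. of `pairs_rate`) are our guesses, not necessarily the recipe behind `covid_simple.csv`:

```
import numpy as np
import pandas as pd

# Sketch only: assumed location/format of the raw Kaggle file
raw = pd.read_json('../data/train.json', lines=True)

eng = pd.DataFrame()
seqs = raw['sequence']
for nuc in 'AGCU':
    eng[nuc + '_percent'] = seqs.str.count(nuc) / seqs.str.len()

# One plausible definition of pairs_rate: the fraction of bases that are
# paired (i.e. not '.') in the dot-bracket secondary structure
struct = raw['structure']
eng['pairs_rate'] = 1 - struct.str.count(r'\.') / struct.str.len()

# The pair-type fractions (U-G, C-G, ...) additionally require matching each
# '(' to its ')' partner in the structure; that bookkeeping is omitted here

eng['SN_filter'] = raw['SN_filter']
for col in ['reactivity', 'deg_Mg_pH10', 'deg_pH10', 'deg_Mg_50C', 'deg_50C']:
    eng['mean_' + col] = raw[col].apply(np.mean)  # average over measured bases
```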

Let's take a look!

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import feyn
from sklearn.model_selection import train_test_split
%matplotlib inline
```

The scientific method is core to the workflow of this notebook. We will make **observations**, pose **questions** and form **hypotheses**.

```
# Here is our engineered dataset
eng_data = pd.read_csv('../data/covid_simple.csv')
eng_data.head()
```

| | A_percent | G_percent | C_percent | U_percent | U-G | C-G | U-A | G-C | A-U | G-U | pairs_rate | SN_filter | mean_reactivity | mean_deg_Mg_pH10 | mean_deg_pH10 | mean_deg_Mg_50C | mean_deg_50C |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.420561 | 0.177570 | 0.214953 | 0.186916 | 0.086957 | 0.130435 | 0.260870 | 0.347826 | 0.173913 | 0.000000 | 0.429907 | 1 | 0.502631 | 0.559628 | 0.503501 | 0.538540 | 0.585893 |
| 1 | 0.233645 | 0.308411 | 0.299065 | 0.158879 | 0.030303 | 0.363636 | 0.030303 | 0.393939 | 0.121212 | 0.060606 | 0.616822 | 0 | 0.411401 | 0.698354 | 0.846041 | 0.930103 | 0.865125 |
| 2 | 0.401869 | 0.224299 | 0.186916 | 0.186916 | 0.041667 | 0.208333 | 0.208333 | 0.291667 | 0.125000 | 0.125000 | 0.448598 | 1 | 0.433966 | 0.578362 | 0.548174 | 0.570284 | 0.575172 |
| 3 | 0.261682 | 0.327103 | 0.327103 | 0.084112 | 0.000000 | 0.437500 | 0.062500 | 0.406250 | 0.031250 | 0.062500 | 0.598131 | 0 | 0.329059 | 0.826485 | 1.051551 | 1.221297 | 0.696476 |
| 4 | 0.542056 | 0.056075 | 0.056075 | 0.345794 | 0.000000 | 0.000000 | 0.457143 | 0.000000 | 0.542857 | 0.000000 | 0.654206 | 0 | 0.282204 | 0.469587 | 0.384826 | 0.393072 | 0.440115 |

There's a feature we haven't talked about: the `SN_filter`. This is the signal-to-noise filter capturing which RNA molecules passed the evaluation criteria defined by the Stanford researchers. We will drop the rows with `SN_filter = 0`.

```
eng_data_SN1 = eng_data.query('SN_filter == 1')
```

So now we want to use the `QLattice`! How do we start?

First we need to choose a target variable. We choose degradation with magnesium at pH 10. The higher its value, the higher the chance of the mRNA "breaking" at that nucleotide location. In `eng_data` we simply took the mean across the nucleotides of each mRNA sample.

This notebook can easily be modified for the other target variables. Please get your hands dirty and mess around with it.
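For reference, the averaged targets available in the engineered table are the `mean_*` columns; you can list them directly:

```
# The candidate target columns in the engineered dataset
[c for c in eng_data.columns if c.startswith('mean_')]
# ['mean_reactivity', 'mean_deg_Mg_pH10', 'mean_deg_pH10',
#  'mean_deg_Mg_50C', 'mean_deg_50C']
```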

```
target = 'mean_deg_Mg_pH10'
```

We have chosen a target variable. What do we do next? This is the moment where we differ from traditional machine learning techniques.

We need to take a step back and think about what we want to ask about our data, i.e. what **observations** can we make? Let's start with baby steps:

**Can a single feature already explain some of the output?**

```
plt.figure(figsize=(21, 21))
for ix, col in enumerate(eng_data_SN1.columns[:11]):
    plt.subplot(3, 4, ix + 1)
    plt.scatter(eng_data_SN1[col], eng_data_SN1[target])
    plt.xlabel(col)
```

**Observation**: note how `U_percent` seems to be linearly correlated with `mean_deg_Mg_pH10`.

Let's calculate their Pearson correlations:

```
drop_cols = ['SN_filter'] + [out for out in eng_data_SN1.columns if 'mean' in out]
```

```
for feat in eng_data_SN1.drop(drop_cols, axis=1).columns:
    print(feat, feyn.metrics.calculate_pc(eng_data_SN1[feat], eng_data_SN1['mean_deg_Mg_pH10']))
```

```
A_percent -0.06920561730299386
G_percent -0.15027559191692702
C_percent -0.10138997239078872
U_percent 0.3623464717168081
U-G -0.11540344284319401
C-G -0.006336444226142389
U-A 0.14079264385434193
G-C -0.10153195407020321
A-U 0.07401507079974223
G-U -0.01572146918393174
pairs_rate 0.04032118343747456
```

As we can see, `U_percent` is certainly the one with the highest **linear correlation**!
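As a cross-check, the same numbers can be obtained with pandas alone; `DataFrame.corrwith` computes Pearson correlations by default:

```
# Should match the feyn.metrics.calculate_pc values above
eng_data_SN1.drop(drop_cols, axis=1).corrwith(eng_data_SN1[target])
```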

Before we move on to the question, let's split the data into **train**, **validation** and **holdout** sets:

```
train, test = train_test_split(eng_data_SN1, test_size=0.4, random_state=527140)
valid, holdo = train_test_split(test, test_size=0.5, random_state=5271740)
ql = feyn.QLattice()
```

## Question 1: How does `U_percent` relate to the target variable?

**Always have a question in mind before you start working with the QLattice.**

We want to use this **question** to define the types of graphs we want to extract from the `QLattice`, i.e. the **question** you pose determines the **hypotheses** you can form about it. Throughout this notebook, think of graphs generated by the `QLattice` as **hypotheses** answering your **question**.

You translate your question into the `QLattice` by defining the `QGraph` as a regressor or classifier, setting the input and output features, fixing the graphs' maximum depth and number of edges, selecting the types of functions, etc. Check out the filters page for more information.
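As a quick illustration of chaining such constraints (the individual filters are the ones used later in this notebook; this particular combination is only an example):

```
# Example only: restrict the QGraph to linear functions, at most 6 edges,
# and graphs that must contain U_percent
qgraph = ql.get_regressor(['U_percent', 'pairs_rate'], target, max_depth=2)\
    .filter(feyn.filters.Functions('linear'))\
    .filter(feyn.filters.MaxEdges(6))\
    .filter(feyn.filters.Contains(['U_percent']))
```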

The `QGraph` is the **list of hypotheses** generated by the `QLattice`!

In the current case, we extract a `regressor` with a single input feature, `U_percent`, and limit the graphs to depth one. In other words, `U_percent` is mapped to `mean_deg_Mg_pH10` with a single function. This will generate **hypotheses** of the type "`U_percent` has an $f(x)$ relationship to `mean_deg_Mg_pH10`", providing answers to the question above.

```
ql.reset()
qgraph = ql.get_regressor(['U_percent'], target, max_depth=1)
```

In our limited problem, the number of possible graph architectures is small. This means 20 iterations should be enough to explore the possibilities and pick out the best graph.

```
for _ in range(20):
    qgraph.fit(train, threads=6)
    ql.update(qgraph.best())
```

You can peer into the `QGraph` by calling its head. It's a good way to check the architecture diversity of the graphs, i.e. the range of hypotheses that the `QLattice` has generated.

```
qgraph.head()
```

Usually what we see when we run this fitting loop is a **gaussian** or a **sine** function. Try running it a few times, with different numbers of fitting loops, to see if you get the same results as us.

This is different from our expectation that `U_percent` predicts the target variable through a linear function. Here the `sine` and `gaussian` functions demonstrate that there could be a non-linear relation between `U_percent` and `mean_deg_Mg_pH10`. This leads us to our first hypothesis:

**Hypothesis**: `U_percent` has this **gaussian** / **sine** relation to `mean_deg_Mg_pH10`.

We now test this hypothesis. We use the R2 score to verify whether predictions that come from it are better than random guessing. The R2 score is one of many metrics one can use for this; other examples are the root mean squared error and Student's t-test on the distribution of the residuals.

```
best_U_perc = qgraph[0]
```

```
best_U_perc.r2_score(train)
```

```
0.19256211963333614
```

```
best_U_perc.r2_score(valid)
```

```
0.19674207680952693
```

Note how the R2 scores are similar on the train and validation sets, which means we are not overfitting.
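The other metrics mentioned above are quick to check as well. Here is a minimal sketch, using the graph's `predict` method and scipy's one-sample t-test on the residuals:

```
from scipy import stats

preds = best_U_perc.predict(valid)
residuals = valid[target].values - preds

rmse = np.sqrt(np.mean(residuals ** 2))            # root mean squared error
t_stat, p_value = stats.ttest_1samp(residuals, 0)  # are residuals centred on 0?
print(f"RMSE: {rmse:.4f}, t = {t_stat:.2f}, p = {p_value:.3f}")
```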

We can also consider the R2 scores above as a baseline for further questions and `QLattice` explorations.

Let's take a look at our prediction of `mean_deg_Mg_pH10` vs `U_percent`:

```
best_U_perc.plot_partial(train, by='U_percent')
```


What is this curve capturing that a linear function cannot? This leads us to a sub-question.

## Question 1b: How does `U_percent` linearly relate to the target variable and how does it compare to the above relation?

```
ql.reset()
qgraph = ql.get_regressor(['U_percent'], target, max_depth=1)\
    .filter(feyn.filters.Functions('linear'))
```

```
for _ in range(20):
    qgraph.fit(train, threads=6)
    ql.update(qgraph.best())
```

Now we have adjusted the `QLattice` to form an alternative hypothesis to the previous question: `U_percent` has a linear relationship to `mean_deg_Mg_pH10`.

Again we use the R2 score to compare both **hypotheses**.

```
best_U_lin = qgraph[0]
```

```
best_U_lin.r2_score(train)
```

```
0.11813109310478409
```

```
best_U_lin.r2_score(valid)
```

```
0.14485858904785553
```

So we can already see that the R2 scores are not as good as with the gaussian relation. This makes us favour the **gaussian** / **sine** hypothesis over the **linear** one.

Let's plot the line!

```
best_U_lin.plot_partial(train,by='U_percent')
```


Here we can see that there are a lot of points dragging the gradient down. This affects the predicted mean degradation for small and large values of `U_percent`. The curve from the non-linear relation, on the other hand, has managed to capture the lower `mean_deg_Mg_pH10` values at the low and high ends of `U_percent`. This explains the higher R2 scores for the gaussian.

This indicates that there's more here than meets the eye. There are non-linear relationships to be explored between the features and the output.

From the Eterna game, we can assume that forming pairs is important. We don't know how, but that can be the subject of our next question.

## Question 2: How does `pairs_rate` relate to the target variable?

```
ql.reset()
qgraph = ql.get_regressor(['pairs_rate'], target, max_depth=1)
```

```
for _ in range(20):
    qgraph.fit(train, threads=6)
    ql.update(qgraph.best())
```

Why not take a look at the head yourself and see how uniform the `QGraph` is?

**Hypothesis**: `pairs_rate` has this **gaussian** relationship to `mean_deg_Mg_pH10`.

```
best_pairs = qgraph[0]
```

```
best_pairs.r2_score(train)
```

```
0.31523686520756944
```

```
best_pairs.r2_score(valid)
```

```
0.2867096734133877
```

See how the R2 scores have increased compared to just `U_percent`.

Let's plot this!

```
best_pairs.plot_partial(train, by='pairs_rate')
```


This is another gaussian :p.

Let's get our feet wet and dabble in a little complexity.

## Question 3: How do `U_percent` and `pairs_rate` combine to predict the target variable?

```
ql.reset()
qgraph = ql.get_regressor(['U_percent', 'pairs_rate'], target, max_depth=1)
```

```
for _ in range(30):
    qgraph.fit(train, threads=6)
    ql.update(qgraph.best())
```

Notice that we've increased the number of loops to 30. This is because the search space has grown: there are graphs with two features as well as graphs with a single feature. They compete against each other, so we need to allow the `QLattice` more time to find the strongest graphs.

**Hypothesis**: `U_percent` and `pairs_rate` have this **gaussian** relationship to `mean_deg_Mg_pH10`.

```
best_U_perc_pairs = qgraph[0]
```

```
best_U_perc_pairs.r2_score(train)
```

```
0.46917256079886294
```

```
best_U_perc_pairs.r2_score(valid)
```

```
0.45529212123027185
```

This is a big jump! Let's investigate further.

```
best_U_perc_pairs.plot_partial(train,by='U_percent', fixed = {'pairs_rate':[0.2,0.35,0.5,0.6]})
best_U_perc_pairs.plot_partial(train,by='pairs_rate', fixed = {'U_percent' : train['U_percent'].mean()})
```

This partial plot is a little more complex because our graph is 2D. What we are seeing is how the target varies as *both* `pairs_rate` and `U_percent` change.

Interestingly, the **gaussian** on `U_percent` has changed significantly compared to our first plot: `U_percent` now follows an upwards trend. Remember from the beginning that `U_percent` by itself fitted a **gaussian** much better than a **linear** function. However, with the inclusion of `pairs_rate`, `U_percent` behaves much more like a **linear** function than a **gaussian**. This is because the inclusion of `pairs_rate` sections `U_percent` into different `pairs_rate` values, allowing multiple lines to be fitted to it. In other words, instead of fitting a single **gaussian**, we have different upwards trends for different values of `pairs_rate`, which captures mean degradation better than before.

Look at the top plot with `U_percent` as the x-axis: notice how the corresponding gaussian moves up when `pairs_rate` changes from 0.2 to 0.35, then moves down when `pairs_rate` changes from 0.35 to 0.6. This traces out the gaussian in the bottom plot with `pairs_rate` as the x-axis.

Let's have a look at something a little more complex.

## Question 4: Given `U_percent` and `pairs_rate`, is there a third feature that improves the explanation of the target variable?

Here we will make sure each graph contains `U_percent` and `pairs_rate`, and we will let the other features compete for the last spot. This means we need to increase the number of fitting loops.

```
train.drop(drop_cols, axis=1)
```

| | A_percent | G_percent | C_percent | U_percent | U-G | C-G | U-A | G-C | A-U | G-U | pairs_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1201 | 0.401869 | 0.205607 | 0.224299 | 0.168224 | 0.043478 | 0.217391 | 0.217391 | 0.304348 | 0.130435 | 0.086957 | 0.429907 |
| 1138 | 0.485981 | 0.196262 | 0.205607 | 0.112150 | 0.000000 | 0.307692 | 0.153846 | 0.307692 | 0.230769 | 0.000000 | 0.485981 |
| 56 | 0.392523 | 0.242991 | 0.261682 | 0.102804 | 0.000000 | 0.354839 | 0.193548 | 0.354839 | 0.096774 | 0.000000 | 0.579439 |
| 1183 | 0.224299 | 0.336449 | 0.280374 | 0.158879 | 0.000000 | 0.416667 | 0.083333 | 0.333333 | 0.166667 | 0.000000 | 0.448598 |
| 1316 | 0.700935 | 0.158879 | 0.074766 | 0.065421 | 0.000000 | 0.285714 | 0.285714 | 0.000000 | 0.428571 | 0.000000 | 0.130841 |
| 2084 | 0.598131 | 0.074766 | 0.093458 | 0.233645 | 0.000000 | 0.074074 | 0.407407 | 0.074074 | 0.444444 | 0.000000 | 0.504673 |
| 2316 | 0.411215 | 0.252336 | 0.196262 | 0.140187 | 0.045455 | 0.272727 | 0.136364 | 0.318182 | 0.227273 | 0.000000 | 0.411215 |
| 40 | 0.373832 | 0.271028 | 0.271028 | 0.084112 | 0.000000 | 0.407407 | 0.074074 | 0.407407 | 0.111111 | 0.000000 | 0.504673 |
| 1476 | 0.457944 | 0.242991 | 0.214953 | 0.084112 | 0.000000 | 0.500000 | 0.181818 | 0.181818 | 0.136364 | 0.000000 | 0.411215 |
| 1619 | 0.355140 | 0.271028 | 0.196262 | 0.177570 | 0.066667 | 0.266667 | 0.166667 | 0.233333 | 0.233333 | 0.033333 | 0.560748 |
| 2117 | 0.392523 | 0.214953 | 0.196262 | 0.196262 | 0.058824 | 0.117647 | 0.235294 | 0.323529 | 0.235294 | 0.029412 | 0.635514 |
| 1843 | 0.327103 | 0.280374 | 0.261682 | 0.130841 | 0.035714 | 0.357143 | 0.142857 | 0.285714 | 0.107143 | 0.071429 | 0.523364 |
| 2342 | 0.485981 | 0.168224 | 0.140187 | 0.205607 | 0.000000 | 0.259259 | 0.407407 | 0.074074 | 0.259259 | 0.000000 | 0.504673 |
| 1550 | 0.242991 | 0.336449 | 0.327103 | 0.093458 | 0.032258 | 0.483871 | 0.096774 | 0.322581 | 0.000000 | 0.064516 | 0.579439 |
| 1392 | 0.317757 | 0.261682 | 0.130841 | 0.289720 | 0.216216 | 0.081081 | 0.189189 | 0.135135 | 0.162162 | 0.216216 | 0.691589 |
| 1932 | 0.392523 | 0.196262 | 0.205607 | 0.205607 | 0.037037 | 0.074074 | 0.222222 | 0.370370 | 0.259259 | 0.037037 | 0.504673 |
| 281 | 0.308411 | 0.289720 | 0.242991 | 0.158879 | 0.071429 | 0.321429 | 0.214286 | 0.321429 | 0.000000 | 0.071429 | 0.523364 |
| 529 | 0.392523 | 0.233645 | 0.280374 | 0.093458 | 0.043478 | 0.391304 | 0.173913 | 0.347826 | 0.043478 | 0.000000 | 0.429907 |
| 948 | 0.280374 | 0.336449 | 0.299065 | 0.084112 | 0.030303 | 0.393939 | 0.060606 | 0.393939 | 0.090909 | 0.030303 | 0.616822 |
| 2075 | 0.392523 | 0.196262 | 0.205607 | 0.205607 | 0.035714 | 0.214286 | 0.285714 | 0.214286 | 0.250000 | 0.000000 | 0.523364 |
| 360 | 0.476636 | 0.186916 | 0.205607 | 0.130841 | 0.000000 | 0.307692 | 0.192308 | 0.269231 | 0.230769 | 0.000000 | 0.485981 |
| 2140 | 0.355140 | 0.242991 | 0.242991 | 0.158879 | 0.000000 | 0.206897 | 0.275862 | 0.344828 | 0.172414 | 0.000000 | 0.542056 |
| 443 | 0.429907 | 0.196262 | 0.196262 | 0.177570 | 0.000000 | 0.172414 | 0.241379 | 0.344828 | 0.241379 | 0.000000 | 0.542056 |
| 2293 | 0.271028 | 0.299065 | 0.224299 | 0.205607 | 0.142857 | 0.250000 | 0.071429 | 0.214286 | 0.107143 | 0.214286 | 0.523364 |
| 389 | 0.327103 | 0.252336 | 0.130841 | 0.289720 | 0.216216 | 0.081081 | 0.162162 | 0.135135 | 0.216216 | 0.189189 | 0.691589 |
| 2299 | 0.411215 | 0.224299 | 0.158879 | 0.205607 | 0.071429 | 0.142857 | 0.285714 | 0.214286 | 0.250000 | 0.035714 | 0.523364 |
| 1675 | 0.383178 | 0.252336 | 0.233645 | 0.130841 | 0.038462 | 0.307692 | 0.076923 | 0.307692 | 0.269231 | 0.000000 | 0.485981 |
| 1294 | 0.373832 | 0.289720 | 0.261682 | 0.074766 | 0.000000 | 0.409091 | 0.090909 | 0.409091 | 0.045455 | 0.045455 | 0.411215 |
| 1727 | 0.476636 | 0.186916 | 0.158879 | 0.177570 | 0.000000 | 0.214286 | 0.357143 | 0.178571 | 0.214286 | 0.035714 | 0.523364 |
| 1677 | 0.401869 | 0.242991 | 0.233645 | 0.121495 | 0.033333 | 0.233333 | 0.066667 | 0.400000 | 0.266667 | 0.000000 | 0.560748 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2127 | 0.364486 | 0.271028 | 0.242991 | 0.121495 | 0.045455 | 0.409091 | 0.136364 | 0.318182 | 0.045455 | 0.045455 | 0.411215 |
| 385 | 0.467290 | 0.186916 | 0.205607 | 0.140187 | 0.000000 | 0.285714 | 0.178571 | 0.285714 | 0.250000 | 0.000000 | 0.523364 |
| 782 | 0.317757 | 0.224299 | 0.261682 | 0.196262 | 0.000000 | 0.333333 | 0.133333 | 0.200000 | 0.266667 | 0.066667 | 0.560748 |
| 1058 | 0.401869 | 0.205607 | 0.205607 | 0.186916 | 0.035714 | 0.178571 | 0.285714 | 0.214286 | 0.214286 | 0.071429 | 0.523364 |
| 1511 | 0.271028 | 0.383178 | 0.327103 | 0.018692 | 0.000000 | 0.500000 | 0.000000 | 0.500000 | 0.000000 | 0.000000 | 0.485981 |
| 2239 | 0.476636 | 0.224299 | 0.149533 | 0.149533 | 0.045455 | 0.227273 | 0.181818 | 0.227273 | 0.227273 | 0.090909 | 0.411215 |
| 934 | 0.327103 | 0.261682 | 0.271028 | 0.140187 | 0.000000 | 0.354839 | 0.129032 | 0.322581 | 0.193548 | 0.000000 | 0.579439 |
| 902 | 0.439252 | 0.196262 | 0.196262 | 0.168224 | 0.000000 | 0.200000 | 0.266667 | 0.266667 | 0.066667 | 0.200000 | 0.280374 |
| 737 | 0.345794 | 0.224299 | 0.280374 | 0.149533 | 0.000000 | 0.225806 | 0.161290 | 0.354839 | 0.258065 | 0.000000 | 0.579439 |
| 346 | 0.336449 | 0.280374 | 0.205607 | 0.177570 | 0.034483 | 0.206897 | 0.172414 | 0.310345 | 0.172414 | 0.103448 | 0.542056 |
| 457 | 0.700935 | 0.056075 | 0.168224 | 0.074766 | 0.142857 | 0.142857 | 0.428571 | 0.000000 | 0.285714 | 0.000000 | 0.130841 |
| 1973 | 0.429907 | 0.214953 | 0.252336 | 0.102804 | 0.000000 | 0.363636 | 0.181818 | 0.227273 | 0.136364 | 0.090909 | 0.411215 |
| 2384 | 0.411215 | 0.214953 | 0.186916 | 0.186916 | 0.000000 | 0.291667 | 0.208333 | 0.208333 | 0.166667 | 0.125000 | 0.448598 |
| 409 | 0.280374 | 0.214953 | 0.280374 | 0.224299 | 0.055556 | 0.277778 | 0.111111 | 0.444444 | 0.111111 | 0.000000 | 0.336449 |
| 1992 | 0.401869 | 0.252336 | 0.186916 | 0.158879 | 0.000000 | 0.333333 | 0.185185 | 0.148148 | 0.296296 | 0.037037 | 0.504673 |
| 1919 | 0.429907 | 0.196262 | 0.186916 | 0.186916 | 0.000000 | 0.233333 | 0.233333 | 0.233333 | 0.300000 | 0.000000 | 0.560748 |
| 1990 | 0.336449 | 0.271028 | 0.196262 | 0.196262 | 0.031250 | 0.187500 | 0.218750 | 0.250000 | 0.281250 | 0.031250 | 0.598131 |
| 1684 | 0.327103 | 0.280374 | 0.242991 | 0.149533 | 0.034483 | 0.206897 | 0.172414 | 0.413793 | 0.137931 | 0.034483 | 0.542056 |
| 1118 | 0.289720 | 0.233645 | 0.336449 | 0.140187 | 0.000000 | 0.290323 | 0.161290 | 0.387097 | 0.161290 | 0.000000 | 0.579439 |
| 311 | 0.308411 | 0.252336 | 0.196262 | 0.242991 | 0.090909 | 0.242424 | 0.242424 | 0.181818 | 0.151515 | 0.090909 | 0.616822 |
| 1487 | 0.392523 | 0.242991 | 0.242991 | 0.121495 | 0.000000 | 0.310345 | 0.137931 | 0.379310 | 0.172414 | 0.000000 | 0.542056 |
| 252 | 0.626168 | 0.205607 | 0.093458 | 0.074766 | 0.142857 | 0.428571 | 0.000000 | 0.142857 | 0.142857 | 0.142857 | 0.130841 |
| 2348 | 0.691589 | 0.168224 | 0.084112 | 0.056075 | 0.000000 | 0.000000 | 0.285714 | 0.428571 | 0.285714 | 0.000000 | 0.130841 |
| 155 | 0.373832 | 0.261682 | 0.196262 | 0.168224 | 0.000000 | 0.360000 | 0.280000 | 0.240000 | 0.120000 | 0.000000 | 0.467290 |
| 1388 | 0.392523 | 0.233645 | 0.252336 | 0.121495 | 0.033333 | 0.366667 | 0.200000 | 0.266667 | 0.133333 | 0.000000 | 0.560748 |
| 1507 | 0.439252 | 0.345794 | 0.177570 | 0.037383 | 0.000000 | 0.333333 | 0.000000 | 0.444444 | 0.222222 | 0.000000 | 0.168224 |
| 649 | 0.345794 | 0.224299 | 0.280374 | 0.149533 | 0.000000 | 0.258065 | 0.193548 | 0.387097 | 0.161290 | 0.000000 | 0.579439 |
| 1800 | 0.383178 | 0.252336 | 0.271028 | 0.093458 | 0.000000 | 0.450000 | 0.000000 | 0.550000 | 0.000000 | 0.000000 | 0.373832 |
| 1540 | 0.308411 | 0.252336 | 0.252336 | 0.186916 | 0.064516 | 0.258065 | 0.129032 | 0.322581 | 0.129032 | 0.096774 | 0.579439 |
| 1873 | 0.429907 | 0.214953 | 0.233645 | 0.121495 | 0.000000 | 0.250000 | 0.250000 | 0.375000 | 0.125000 | 0.000000 | 0.448598 |

953 rows × 11 columns

```
ql.reset()
qgraph = ql.get_regressor(train.drop(drop_cols, axis=1).columns.to_list(), target, max_depth=2)\
    .filter(feyn.filters.MaxEdges(6))\
    .filter(feyn.filters.Contains(['U_percent', 'pairs_rate']))
```

```
for _ in range(50):
    qgraph.fit(train, threads=6)
    ql.update(qgraph.best())
```

Since the diversity of graphs has increased, there are many competing solutions to the question. We sometimes see `A-U` or `U-G` taking the third spot, and sometimes no third feature is chosen at all.

However, if you run the above three cells several times, you will see that `A_percent` and `G_percent` tend to be the two best candidates for the third feature.
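To make this less anecdotal, here is a sketch that repeats the search a few times and counts which extra feature the best graph ends up using, reading feature names off the sympy expression. Note this refits 5 × 50 loops, so it is slow:

```
from collections import Counter

extra_features = Counter()
for _ in range(5):
    ql.reset()
    qgraph = ql.get_regressor(train.drop(drop_cols, axis=1).columns.to_list(),
                              target, max_depth=2)\
        .filter(feyn.filters.MaxEdges(6))\
        .filter(feyn.filters.Contains(['U_percent', 'pairs_rate']))
    for _ in range(50):
        qgraph.fit(train, threads=6)
        ql.update(qgraph.best())
    # Which features appear in the best graph besides the two fixed ones?
    names = {str(s) for s in qgraph[0].sympify(signif=3).free_symbols}
    extra_features.update(names - {'U_percent', 'pairs_rate'})
print(extra_features)
```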

This means we don't have a fixed hypothesis this time, which leads us to reformulate the previous question into two new ones.

## Question 4b: How do `U_percent`, `pairs_rate` and `G_percent` combine to explain the target variable?

```
ql.reset()
qgraph = ql.get_regressor(['U_percent', 'pairs_rate', 'G_percent'], target, max_depth=2)\
    .filter(feyn.filters.MaxEdges(6))
```

```
for _ in range(50):
    qgraph.fit(train, threads=6)
    ql.update(qgraph.best())
```

```
best_U_pr_G_perc = qgraph[0]
```

**Hypothesis**: `mean_deg_Mg_pH10` has the above relation between `G_percent`, `U_percent`, and `pairs_rate`.

Usually we see a three-dimensional gaussian in this hypothesis, though sometimes we see variants of it. In all the runs of this notebook so far, this hypothesis has been the winner. Remember that the graphs are the hypotheses and they compete against each other, so depending on the data, the top graph may differ between `QLattice` runs.

```
best_U_pr_G_perc.r2_score(train)
```

```
0.489082790675745
```

```
best_U_pr_G_perc.r2_score(valid)
```

```
0.4801340928388709
```

## Question 4c: How do `U_percent`, `pairs_rate` and `A_percent` combine to explain the target variable?

```
ql.reset()
qgraph = ql.get_regressor(['U_percent', 'pairs_rate', 'A_percent'], target, max_depth=2)\
    .filter(feyn.filters.MaxEdges(6))
```

```
for _ in range(50):
    qgraph.fit(train, threads=6)
    ql.update(qgraph.best())
```

```
best_U_pr_A_perc = qgraph[0]
```

**Hypothesis**: `mean_deg_Mg_pH10` has the above relation between `A_percent`, `U_percent`, and `pairs_rate`.

Usually we see a three-dimensional gaussian in this hypothesis, though sometimes we see variants of it. Do your results agree? Again, run the above cells several times and see which hypothesis occurs most often.

```
best_U_pr_A_perc.r2_score(train)
```

```
0.5134878825008996
```

```
best_U_pr_A_perc.r2_score(valid)
```

```
0.5079327377293561
```

Here we can see that `A_percent` performs better than `G_percent` in terms of R2 score on the train and validation sets. This means we favour the hypothesis that takes `A_percent` as the third feature over the one with `G_percent`.

Let's explore the model we just got a little further:

```
best_U_pr_A_perc.sympify(signif=3)
```

$\displaystyle - 0.156 \left(- 5.57 U_{percent} - 1.61\right) e^{- 7.81 \left(0.417 - A_{percent}\right)^{2} - 7.79 \left(pairs_{rate} - 0.357\right)^{2}} + 0.443$

From resetting the `QLattice` and running the loop a few times, we usually see that `pairs_rate` and `A_percent` go into a **gaussian**, while `U_percent` enters an ascending function (sometimes a **gaussian**, sometimes an **exp**). Afterwards they interact in a multiplicative fashion.
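As a sanity check, we can evaluate the rounded symbolic expression for a single sample with sympy and compare it to the graph's own prediction; small differences are expected since `sympify` rounds the coefficients:

```
# Evaluate the printed expression for the first training sample
expr = best_U_pr_A_perc.sympify(signif=3)
row = train.iloc[0]
manual = float(expr.evalf(subs={s: row[str(s)] for s in expr.free_symbols}))

print(manual, best_U_pr_A_perc.predict(train.iloc[:1])[0])
```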

Let's check the partial plots:

```
best_U_pr_A_perc.plot_partial(train,by='A_percent', fixed = {'pairs_rate':[0.2,0.35,0.5,0.6], 'U_percent' : train['U_percent'].mean()})
best_U_pr_A_perc.plot_partial(train,by='pairs_rate', fixed = {'U_percent' : train['U_percent'].mean(), 'A_percent' : train['A_percent'].mean()})
best_U_pr_A_perc.plot_partial(train,by='U_percent', fixed = {'pairs_rate' : [0.2,0.35,0.5,0.6], 'A_percent' : train['A_percent'].mean()})
```

We can see that `A_percent` has an island of points at high values but with low `mean_deg_Mg_pH10`. This is why the `QLattice` chose the **gaussian** to match it. Observe that `pairs_rate` behaves similarly to the graph from Question 3.

# Concluding remarks

The graph that contains the most signal in terms of R2 score takes `U_percent`, `pairs_rate` and `A_percent`. We usually see that `A_percent` and `pairs_rate` fit to **gaussians**. This means that at the lower and higher values of these features, `mean_deg_Mg_pH10` is quite low, while mean degradation increases as the values move towards the middle. In other words, there is a hot spot where mean degradation is at its highest: for `A_percent` this is usually around 0.42, and for `pairs_rate` around 0.36. In contrast to `A_percent` and `pairs_rate`, we do not have samples where `U_percent` goes above 0.35. Since `U_percent` follows an upwards trend in this region, 0.35 would yield the highest mean degradation.

Let's settle on the hypothesis that `A_percent`, `U_percent` and `pairs_rate` explain the mean degradation of the mRNA samples according to the expression below:

```
best_U_pr_A_perc.sympify(signif=2)
```

$\displaystyle - 0.16 \left(- 5.6 U_{percent} - 1.6\right) e^{- 7.8 \left(0.42 - A_{percent}\right)^{2} - 7.8 \left(pairs_{rate} - 0.36\right)^{2}} + 0.44$

Ideally we would test this hypothesis on data gathered from a new experiment. Since we can't do that here, let's use the holdout set to mimic it.

```
best_U_pr_A_perc.r2_score(holdo)
```

```
0.41346295608630135
```

See how the R2 score is lower than on our training and validation sets. Since our hypothesis did not perform as expected in the "real world", we need to go back, formulate new questions to the data and extract new hypotheses. Notice how this R2 score is similar to when we had only `U_percent` and `pairs_rate`; this could be a point to return to.

A big warning here is about reusing your holdout set. If the hypothesis we settled on performs similarly on train, validation and holdout, it does not necessarily mean it will generalise to new data. The holdout set is now contaminated, since we are biased towards it, so performing well on it means less than before.

A simple suggestion, if the data allows it, is to have multiple holdout sets. This mimics performing multiple experiments. Another suggestion is to reseed your train/valid/holdout split. However, you should still take this with a grain of salt, because you can still carry biases over.
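A minimal sketch of the multiple-holdout idea: shuffle the holdout set and split it into a few disjoint chunks, scoring each one as if it were a separate experiment:

```
# Three disjoint mini-holdouts carved from the original holdout set
chunks = np.array_split(holdo.sample(frac=1, random_state=42), 3)
scores = [best_U_pr_A_perc.r2_score(chunk) for chunk in chunks]
print(scores, np.mean(scores))
```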

# Suggestions for next steps

Well done on completing this tutorial! Here are some suggestions for a bit of extra fun:

- Try increasing the complexity a little more. See if you get a higher R2 score on the train and validation sets when you include both `A_percent` and `G_percent` along with `U_percent` and `pairs_rate`. You do this by removing the `MaxEdges` filter and keeping `max_depth` at two.
- Try this out for the other average measurement values: `mean_reactivity`, `mean_deg_50C`, etc. Do we obtain the same **gaussian** behaviour as we do with `mean_deg_Mg_pH10`?
- The `QLattice` hasn't picked up on the specific pairs. What happens if you force the `QLattice` to include them?

Remember to ALWAYS ask a question before extracting hypotheses (`QGraph`) from the `QLattice`. The scientific method is your guiding principle when using the `QLattice` in your quest for truth!