COVID-19 vaccination RNA dataset
by: Meera Machado & Chris Cave
Feyn version: 2.0 or newer
In this tutorial we are going to perform a simple analysis of the OpenVaccine: COVID-19 mRNA Vaccine Degradation Prediction dataset from Kaggle.
Some of the COVID-19 vaccines are mRNA based. However, due to the unstable nature of mRNA, they must be stored at extremely low temperatures, which makes distribution of the vaccine problematic.
The raw dataset (which can be found on the Kaggle competition page) consists of 2400 mRNA samples. Each mRNA consists of 107 nucleotides, and various measurements were performed on the first 68 of them: reactivity, degradation at pH 10 with and without magnesium, and degradation at 50°C with and without magnesium.
In our analysis, we have simplified the dataset. For each mRNA, we computed the fraction of each nucleotide and of each base-pair type, along with the overall rate of base pairing in the molecule (`pairs_rate`). We then took the average reactivity and degradation measurements across the scored nucleotides of each mRNA sample.
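The preprocessing script itself is not part of this tutorial, but for the curious, here is a minimal sketch of how such features could be derived from the raw Kaggle columns (`sequence`, `structure` in dot-bracket notation, and the per-nucleotide `deg_Mg_pH10` list). The `summarise_mrna` helper and the exact normalisation conventions are our illustration, not the script that produced `covid_simple.csv`:

```python
import numpy as np
import pandas as pd

def summarise_mrna(sequence: str, structure: str, deg_mg_ph10: list) -> dict:
    """Collapse one mRNA sample into per-molecule features (illustrative)."""
    n = len(sequence)
    # Fraction of each nucleotide in the full sequence
    feats = {f"{base}_percent": sequence.count(base) / n for base in "AGCU"}

    # Recover base pairs by matching '(' with ')' in the dot-bracket structure
    stack, pairs = [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs.append((stack.pop(), i))

    # Fraction of nucleotides that are paired at all
    feats["pairs_rate"] = 2 * len(pairs) / n

    # Relative frequency of each pair type (e.g. "G-C") among all pairs
    for i, j in pairs:
        key = f"{sequence[i]}-{sequence[j]}"
        feats[key] = feats.get(key, 0.0) + 1 / len(pairs)

    # Average the degradation measurement across the scored nucleotides
    feats["mean_deg_Mg_pH10"] = float(np.mean(deg_mg_ph10))
    return feats

# Possible usage on the raw Kaggle data (train.json, one record per line):
# raw = pd.read_json("train.json", lines=True)
# data = pd.DataFrame([
#     summarise_mrna(r.sequence, r.structure, r.deg_Mg_pH10)
#     for r in raw.itertuples()
# ]).fillna(0.0)
```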
Let's take a look!
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import feyn
from sklearn.model_selection import train_test_split
data = pd.read_csv('../data/covid_simple.csv')
data.head()
| | A_percent | G_percent | C_percent | U_percent | U-G | C-G | U-A | G-C | A-U | G-U | pairs_rate | mean_deg_Mg_pH10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.420561 | 0.177570 | 0.214953 | 0.186916 | 0.086957 | 0.130435 | 0.260870 | 0.347826 | 0.173913 | 0.000 | 0.429907 | 0.559628 |
| 1 | 0.401869 | 0.224299 | 0.186916 | 0.186916 | 0.041667 | 0.208333 | 0.208333 | 0.291667 | 0.125000 | 0.125 | 0.448598 | 0.578362 |
| 2 | 0.317757 | 0.308411 | 0.308411 | 0.065421 | 0.040000 | 0.320000 | 0.040000 | 0.480000 | 0.120000 | 0.000 | 0.467290 | 0.411699 |
| 3 | 0.383178 | 0.224299 | 0.224299 | 0.168224 | 0.000000 | 0.370370 | 0.148148 | 0.222222 | 0.259259 | 0.000 | 0.504673 | 0.353104 |
| 4 | 0.345794 | 0.261682 | 0.252336 | 0.140187 | 0.032258 | 0.322581 | 0.161290 | 0.258065 | 0.225806 | 0.000 | 0.579439 | 0.443834 |
train, test = train_test_split(data, random_state=42)
output_name = 'mean_deg_Mg_pH10'
We're now going to train the `QLattice` on this problem, using the `auto_run` function to search for models automatically. A short, cursory run is enough to get indications we can learn from.
ql = feyn.connect_qlattice()
ql.reset(random_seed=42)

models = ql.auto_run(
    train,
    output_name=output_name,
    n_epochs=10
)
The `Model` on display is the best of all 20,000 models the `QLattice` has tried. We take a closer look with the `plot_signal` function to inspect this model's metrics and check whether any of its inputs are redundant.
best = models[0]
best.plot_signal(train)
This plot displays, at each step of the model, how well the signal correlates with the output. Reading from left to right, you can see how the interactions progressively increase the resemblance between the two distributions. Reading from right to left, the drop in correlation shows how much each subtree contributes.
Here we can see that a lot of signal is captured in the `gaussian` interaction between `A_percent` and `pairs_rate`. Adding `U_percent` increases the signal further; however, `U-G` does not seem to add anything extra to the `Model` at all.
Based on these observations, we will do another run, this time trying to reduce redundancies by lowering `max_complexity` to 5. We will let the model above compete with the models from the new run by passing it to the `starting_models` parameter, and we will train for a little longer to give the `QLattice` time to search the space and fit the models to completion.
models = ql.auto_run(
    train,
    output_name=output_name,
    max_complexity=5,
    n_epochs=20,
    starting_models=models
)
best = models[0]
best.plot(train, test)
(Output: a summary plot of the model, showing training metrics and test metrics.)
This `Model` overtook the previous one. It is much slimmer and really captures the important features and interactions: the `gaussian` of `pairs_rate` and `A_percent`, and the multiplication with `U_percent`.
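To put numbers on this comparison, you can also score the hold-out set directly from the model's predictions. Here is a small sketch using scikit-learn's metrics (already a dependency via `train_test_split`) rather than any Feyn-specific plotting helper:

```python
from sklearn.metrics import mean_squared_error, r2_score

# Score the best model on the hold-out set
preds = best.predict(test)
print("Test R2:  ", r2_score(test[output_name], preds))
print("Test RMSE:", mean_squared_error(test[output_name], preds) ** 0.5)
```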
So you can see that this simple workflow gives you a lot of power in finding the features and transformations that really matter to the output. Having this power lets you find much more interpretable models than many other machine learning methods.
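One way to make that interpretability concrete (assuming Feyn 2.x, where models expose a `sympify` method) is to read the fitted model as a formula:

```python
# Extract the fitted model as a closed-form symbolic expression
expression = best.sympify()
expression
```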