COVID-19 vaccination RNA dataset.
by: Meera Machado & Chris Cave
Feyn version: 2.0 or newer
In this tutorial we are going to perform a simple analysis on the OpenVaccine: COVID-19 mRNA Vaccine Degradation Prediction dataset from Kaggle.
Some of the COVID-19 vaccines are mRNA based. However, due to the unstable nature of mRNA, they must be stored at extremely low temperatures, which makes distribution of the vaccine problematic.
The raw dataset (which can be found on the Kaggle competition page) consists of 2400 mRNA samples. Each mRNA consists of 107 nucleotides, and various measurements were performed on the first 68 nucleotides: reactivity, degradation at pH 10 with and without magnesium, and degradation at 50°C with and without magnesium.
In our analysis, we have simplified the dataset. For each mRNA, we counted the occurrences of each nucleotide and of each nucleotide pair, along with the total number of pairs in the molecule. We also took the average of the reactivity and degradation measurements across the nucleotides of each mRNA sample.
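The preprocessing script itself isn't shown here, but the nucleotide-counting part of the feature construction can be sketched roughly as follows (the helper and the `_percent` column names are assumptions, not the actual code used to build the CSV):

```python
def nucleotide_features(sequence):
    """Count each nucleotide in an RNA sequence and return its
    share of the total length (assumed feature style)."""
    length = len(sequence)
    return {
        f"{base}_percent": sequence.count(base) / length
        for base in "ACGU"
    }

# Toy example on a short made-up sequence
features = nucleotide_features("GGAAUUCCGU")
```

The pair counts and averaged measurements would be derived in a similar per-sample fashion.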
Let's take a look!
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import feyn
from sklearn.model_selection import train_test_split
data = pd.read_csv('../data/covid_simple.csv')
data.head()
train, test = train_test_split(data, random_state=42)
output_name = 'mean_deg_Mg_pH10'
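Before training, it can help to glance at the target's distribution with pandas. A minimal sketch, shown here on a made-up stand-in frame since the CSV isn't bundled with this text:

```python
import pandas as pd

# Stand-in for the real dataset; only the output column matters here
df = pd.DataFrame({"mean_deg_Mg_pH10": [0.31, 0.45, 0.52, 0.38, 0.41]})

# Summary statistics of the target column
summary = df["mean_deg_Mg_pH10"].describe()
print(summary[["mean", "std", "min", "max"]])
```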
We're now going to train the QLattice on this problem. We will use the auto_run function to run an automatic simulation; this is just a cursory run to get indications we can learn from.
ql = feyn.connect_qlattice()
ql.reset(random_seed=42)
models = ql.auto_run(
    train,
    output_name=output_name,
    n_epochs=10
)
The model on display is the best out of all 20000 models that the QLattice has tried. We take a closer look with the plot_signal function to find the metrics of this model and see whether there are any redundancies.
best = models[0]
best.plot_signal(train)
This plot displays the correlation at each step of the model. Reading from left to right, you can see how each interaction increases the similarity between the signal and the output distribution. Reading from right to left, you can see from the drop in correlation how much each subtree contributes.
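The numbers on such a plot are essentially Pearson correlations between each node's activation and the target. A toy illustration with numpy, on made-up arrays rather than actual Feyn internals:

```python
import numpy as np

# Pretend activations of one model node and the corresponding targets
activation = np.array([0.1, 0.4, 0.35, 0.8, 0.65])
target = np.array([0.15, 0.38, 0.30, 0.75, 0.70])

# Pearson correlation between the node's output and the target
corr = np.corrcoef(activation, target)[0, 1]
```

A node whose output correlates strongly with the target carries a lot of signal; a node that barely moves the correlation is a candidate redundancy.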
Here we can see that a lot of signal is captured in the gaussian interaction. There is also a signal increase from adding U_percent; the U-G pair count, however, does not seem to add anything extra to the model at all.
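As a rough intuition for what a gaussian cell does: it responds most strongly when its inputs fall near a learned center and decays away from it, like a bell curve. A toy two-input version (illustrative only; the exact parametrization Feyn learns may differ):

```python
import numpy as np

def gaussian_cell(x, y, cx=0.0, cy=0.0, w=1.0):
    """Toy two-input gaussian: peaks at (cx, cy) and decays with
    distance from it. Not Feyn's internal formula."""
    return np.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * w ** 2))

peak = gaussian_cell(0.0, 0.0)   # response at the center
off = gaussian_cell(2.0, 2.0)    # response far from the center
```

This shape is what lets the model pick out a "sweet spot" in a feature rather than a purely monotonic trend.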
Based on these observations, we will do another run, but this time we will try to reduce the amount of redundancy by lowering max_complexity to 5. We will also let the model above compete with the new simulation runs by passing it into the starting_models parameter. We will let this train for a little longer, to allow the QLattice to search the space and give it time to fit the models to completion.
models = ql.auto_run(
    train,
    output_name=output_name,
    max_complexity=5,
    n_epochs=20,
    starting_models=models
)
best = models[0]
best.plot_signal(train)
This model overtook the previous one. It is much slimmer and really captures the important features and interactions: the gaussian with A_percent and the multiplication function.
So you can see that using this simple workflow gives you a lot of power in finding the features and transformations that really matter to the output. It also allows you to find a much more interpretable model compared to other machine learning methods.
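As a point of comparison, the same feature table could be fed to a less interpretable baseline such as a random forest. A minimal sketch on synthetic stand-in data (in practice you would use the columns from the CSV above):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Stand-in feature matrix and target; replace with the real columns,
# e.g. the nucleotide percentages and pair counts
X = rng.uniform(0, 1, size=(200, 4))
y = np.exp(-10 * (X[:, 0] - 0.5) ** 2) + 0.1 * rng.normal(size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)
rf = RandomForestRegressor(random_state=42).fit(X_tr, y_tr)
score = rf.score(X_te, y_te)   # R^2 on the hold-out set
```

The forest may score well, but unlike the QLattice model it won't hand you a compact expression telling you *which* interactions drive the prediction.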