COVID-19 vaccination RNA dataset
by: Meera Machado & Chris Cave
Feyn version: 2.0 or newer
In this tutorial we are going to perform a simple analysis of the OpenVaccine: COVID-19 mRNA Vaccine Degradation Prediction dataset from Kaggle.
Some of the COVID-19 vaccines are mRNA based. However, due to the unstable nature of mRNA, they must be stored at extremely low temperatures, which makes distribution of the vaccine problematic.
The raw dataset (which can be found on the Kaggle competition page) consists of 2400 mRNA samples. Each mRNA consists of 107 nucleotides, and various measurements were performed on the first 68 of them: reactivity, degradation at pH 10 with and without magnesium, and degradation at 50°C with and without magnesium.
In our analysis, we have simplified the dataset. For each mRNA, we computed the fraction of each nucleotide and of each base-pair type, along with the overall rate of base pairing in the molecule (`pairs_rate`). We then took the average reactivity and degradation measurements across the scored nucleotides of each mRNA sample.
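The preprocessing script itself is not part of this tutorial, but for the curious, here is a minimal sketch of how such features could be derived from the raw Kaggle columns (`sequence`, `structure` in dot-bracket notation, and the per-nucleotide `deg_Mg_pH10` list). The `summarise_mrna` helper and the exact normalisation conventions are our illustration, not the script that produced `covid_simple.csv`:

```python
import numpy as np
import pandas as pd

def summarise_mrna(sequence: str, structure: str, deg_mg_ph10: list) -> dict:
    """Collapse one mRNA sample into per-molecule features (illustrative)."""
    n = len(sequence)
    # Fraction of each nucleotide in the full sequence
    feats = {f"{base}_percent": sequence.count(base) / n for base in "AGCU"}

    # Recover base pairs by matching '(' with ')' in the dot-bracket structure
    stack, pairs = [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs.append((stack.pop(), i))

    # Fraction of nucleotides that are paired at all
    feats["pairs_rate"] = 2 * len(pairs) / n

    # Relative frequency of each pair type (e.g. "G-C") among all pairs
    for i, j in pairs:
        key = f"{sequence[i]}-{sequence[j]}"
        feats[key] = feats.get(key, 0.0) + 1 / len(pairs)

    # Average the degradation measurement across the scored nucleotides
    feats["mean_deg_Mg_pH10"] = float(np.mean(deg_mg_ph10))
    return feats

# Possible usage on the raw Kaggle data (train.json, one record per line):
# raw = pd.read_json("train.json", lines=True)
# data = pd.DataFrame([
#     summarise_mrna(r.sequence, r.structure, r.deg_Mg_pH10)
#     for r in raw.itertuples()
# ]).fillna(0.0)
```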
Let's take a look!
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import feyn
from sklearn.model_selection import train_test_split
data = pd.read_csv('../data/covid_simple.csv')
data.head()
| | A_percent | G_percent | C_percent | U_percent | U-G | C-G | U-A | G-C | A-U | G-U | pairs_rate | mean_deg_Mg_pH10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.420561 | 0.177570 | 0.214953 | 0.186916 | 0.086957 | 0.130435 | 0.260870 | 0.347826 | 0.173913 | 0.000 | 0.429907 | 0.559628 |
| 1 | 0.401869 | 0.224299 | 0.186916 | 0.186916 | 0.041667 | 0.208333 | 0.208333 | 0.291667 | 0.125000 | 0.125 | 0.448598 | 0.578362 |
| 2 | 0.317757 | 0.308411 | 0.308411 | 0.065421 | 0.040000 | 0.320000 | 0.040000 | 0.480000 | 0.120000 | 0.000 | 0.467290 | 0.411699 |
| 3 | 0.383178 | 0.224299 | 0.224299 | 0.168224 | 0.000000 | 0.370370 | 0.148148 | 0.222222 | 0.259259 | 0.000 | 0.504673 | 0.353104 |
| 4 | 0.345794 | 0.261682 | 0.252336 | 0.140187 | 0.032258 | 0.322581 | 0.161290 | 0.258065 | 0.225806 | 0.000 | 0.579439 | 0.443834 |
train, test = train_test_split(data, random_state=42)
output_name = 'mean_deg_Mg_pH10'
We're now going to train the `QLattice` on this problem, using the `auto_run` function to search for models automatically. A short, cursory run is enough to get indications we can learn from.
ql = feyn.connect_qlattice()
ql.reset(random_seed=42)

models = ql.auto_run(
    train,
    output_name=output_name,
    n_epochs=10
)
The `Model` on display is the best of all 20,000 models the `QLattice` has tried. We take a closer look with the `plot_signal` function to inspect this model's metrics and check whether any of its inputs are redundant.
best = models[0]
best.plot_signal(train)
This plot displays, at each step of the model, how well the signal correlates with the output. Reading from left to right, you can see how the interactions progressively increase the resemblance between the two distributions. Reading from right to left, the drop in correlation shows how much each subtree contributes.
Here we can see that a lot of signal is captured in the `gaussian` interaction between `A_percent` and `pairs_rate`. Adding `U_percent` increases the signal further; however, `U-G` does not seem to add anything extra to the `Model` at all.
Based on these observations, we will do another run, this time trying to reduce redundancies by lowering `max_complexity` to 5. We will let the model above compete with the models from the new run by passing it to the `starting_models` parameter, and we will train for a little longer to give the `QLattice` time to search the space and fit the models to completion.
models = ql.auto_run(
    train,
    output_name=output_name,
    max_complexity=5,
    n_epochs=20,
    starting_models=models
)
best = models[0]
best.plot(train, test)
(Output: a summary plot of the model, showing training metrics and test metrics.)
This `Model` overtook the previous one. It is much slimmer and really captures the important features and interactions: the `gaussian` of `pairs_rate` and `A_percent`, and the multiplication with `U_percent`.
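To put numbers on this comparison, you can also score the hold-out set directly from the model's predictions. Here is a small sketch using scikit-learn's metrics (already a dependency via `train_test_split`) rather than any Feyn-specific plotting helper:

```python
from sklearn.metrics import mean_squared_error, r2_score

# Score the best model on the hold-out set
preds = best.predict(test)
print("Test R2:  ", r2_score(test[output_name], preds))
print("Test RMSE:", mean_squared_error(test[output_name], preds) ** 0.5)
```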
So you can see that this simple workflow gives you a lot of power in finding the features and transformations that really matter to the output. Having this power lets you find much more interpretable models than many other machine learning methods.
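One way to make that interpretability concrete (assuming Feyn 2.x, where models expose a `sympify` method) is to read the fitted model as a formula:

```python
# Extract the fitted model as a closed-form symbolic expression
expression = best.sympify()
expression
```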