Feyn version: 2.0 or newer
In this tutorial we are going to perform a simple analysis of the OpenVaccine: COVID-19 mRNA Vaccine Degradation Prediction dataset from Kaggle.
Some of the COVID-19 vaccines are mRNA based. However, due to the unstable nature of mRNA, they must be refrigerated at extremely low temperatures, which makes distribution of the vaccine problematic.
The raw dataset (which can be found on the Kaggle competition page) consists of 2400 mRNA samples. Each mRNA consists of 107 nucleotides, and various measurements were performed on the first 68 nucleotides: reactivity, degradation at pH 10 with and without magnesium, and degradation at 50°C with and without magnesium.
In our analysis, we have simplified the dataset. For each mRNA, we counted each nucleotide and their pairs, along with the total number of pairs in the molecule. We took the average reactivity and degradation measurements across the nucleotides of each mRNA sample.
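To make the simplification concrete, here is a rough sketch of how the per-mRNA summary features could be computed. The helper name, the example strings, and the column names (`A_percent`, `pairs_rate`, etc.) are illustrative assumptions; the actual preprocessing code is not part of this tutorial.

```python
def summarise_mrna(sequence, structure):
    """Count nucleotide fractions and the fraction of paired bases.

    `sequence` is the mRNA as a string over A/C/G/U; `structure` is its
    secondary structure in dot-bracket notation, where '(' and ')' mark
    paired bases.
    """
    n = len(sequence)
    counts = {base: sequence.count(base) / n for base in "ACGU"}
    pairs_rate = sum(structure.count(c) for c in "()") / n
    return {
        "A_percent": counts["A"],
        "C_percent": counts["C"],
        "G_percent": counts["G"],
        "U_percent": counts["U"],
        "pairs_rate": pairs_rate,
    }

# Tiny made-up example: 10 bases, of which 4 are paired.
features = summarise_mrna("GGAAACUUCG", "((....))..")
print(features)  # A_percent 0.3, pairs_rate 0.4, ...
```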
Let's take a look!
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import feyn
from sklearn.model_selection import train_test_split
```
```python
data = pd.read_csv('../data/covid_simple.csv')
data.head()
```
```python
train, test = train_test_split(data, test_size=0.4, random_state=42)
valid, holdout = train_test_split(test, test_size=0.5, random_state=42)
```
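The two-step split above yields a 60/20/20 train/validation/holdout partition. A minimal standalone check of the resulting sizes, using a synthetic stand-in frame rather than the real dataset:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# 100 dummy rows just to illustrate the partition proportions.
data = pd.DataFrame({"x": range(100)})

# First carve off 40% as a temporary test set, then split that in half.
train, test = train_test_split(data, test_size=0.4, random_state=42)
valid, holdout = train_test_split(test, test_size=0.5, random_state=42)

print(len(train), len(valid), len(holdout))  # 60 20 20
```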
```python
output_name = 'mean_deg_Mg_pH10'
```
We're now going to train the `QLattice` on this problem. We will use the `auto_run` function to run an automatic simulation.
```python
ql = feyn.connect_qlattice()
ql.reset(random_seed=123)
```
```python
models = ql.auto_run(
    train,
    output_name=output_name,
)
```
The model on display is the best out of all 20000 models that the `QLattice` has tried. We take a closer look with the `plot` function to find the metrics of this model and to check whether there are any redundancies.
```python
best = models[0]
best.plot(train, valid)
```
Here we can see that there's a lot of importance in the `gaussian` interaction with `pairs_rate`. There's also importance in `U_percent`; however, `G_percent` does not seem to be adding anything extra to the model.
We will do another run, but this time we will try to reduce the number of redundancies by using the Bayesian information criterion (`bic`) as the criterion and reducing `max_complexity` to 5. Then we will let the model above compete with the new simulation runs by passing it into the `starting_models` parameter.
```python
models = ql.auto_run(
    train,
    output_name=output_name,
    criterion='bic',
    max_complexity=5,
    n_epochs=50,
    starting_models=models
)
```
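As a side note on why `bic` helps prune redundancies: BIC trades goodness of fit against the number of parameters, so a slightly worse-fitting but simpler model can still win. A minimal standalone sketch with made-up numbers, using the Gaussian-error form of BIC:

```python
import numpy as np

def bic(rss, n, k):
    """Bayesian information criterion for a Gaussian error model.

    Lower is better; the k * ln(n) term penalises extra parameters.
    rss: residual sum of squares, n: number of samples, k: parameter count.
    """
    return n * np.log(rss / n) + k * np.log(n)

n = 68  # e.g. one observation per measured nucleotide

# A complex model fits slightly better (lower RSS) but pays a larger
# complexity penalty, so BIC can still prefer the simpler model.
complex_bic = bic(rss=10.0, n=n, k=8)
simple_bic = bic(rss=10.5, n=n, k=3)
print(complex_bic, simple_bic)
```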
```python
best = models[0]
best.plot(train, valid)
```
This model overtook the previous one. It is much slimmer and really captures the important features and interactions: the `gaussian` with `A_percent` and the `multiplication` function.
So you can see that using this simple workflow gives you a lot of power in finding the features and transformations that really matter to the output. It also lets you find a much more interpretable model compared to other machine learning methods.