Covid-19 vaccination RNA dataset.

by: Meera Machado & Chris Cave

Feyn version: 2.0 or newer

In this tutorial we are going to perform a simple analysis on the OpenVaccine: COVID-19 mRNA Vaccine Degradation Prediction dataset from Kaggle.

Some of the Covid-19 vaccines are mRNA-based. However, due to the unstable nature of mRNA, they must be stored at extremely low temperatures, which makes distribution of the vaccine problematic.

The raw dataset (which can be found on the Kaggle competition page) consists of 2400 mRNA samples. Each mRNA consists of 107 nucleotides, and various measurements were performed on the first 68 of them: reactivity, degradation at pH 10 with and without magnesium, and degradation at 50 °C with and without magnesium.

In our analysis, we have simplified the dataset. For each mRNA we computed the fraction of each nucleotide and of each base pair, along with the overall pairing rate of the molecule. We then took the average reactivity and degradation measurements across the measured nucleotides of each sample, as sketched below.
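To make that preprocessing concrete, here is a minimal sketch of such a simplification. It assumes raw columns named sequence (the nucleotide string) and deg_Mg_pH10 (a per-position list of measurements), as described in the Kaggle data documentation; the pair features, which require the secondary structure of the molecule, are omitted for brevity. This is illustrative, not the exact script that produced covid_simple.csv.

import numpy as np
import pandas as pd

def simplify(row, n_measured=68):
    """Turn one raw mRNA record into aggregated features.

    Assumes a 'sequence' string and a per-position list
    'deg_Mg_pH10'; column names follow the Kaggle data description.
    """
    seq = row["sequence"]
    # Fraction of each nucleotide in the sequence.
    feats = {f"{nuc}_percent": seq.count(nuc) / len(seq) for nuc in "AGCU"}
    # Average the per-position measurement over the measured nucleotides.
    feats["mean_deg_Mg_pH10"] = float(np.mean(row["deg_Mg_pH10"][:n_measured]))
    return pd.Series(feats)

# simple = raw.apply(simplify, axis=1)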

Let's take a look!

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import feyn

from sklearn.model_selection import train_test_split

# Load the simplified dataset and inspect the first rows.
data = pd.read_csv('../data/covid_simple.csv')
data.head()
A_percent G_percent C_percent U_percent U-G C-G U-A G-C A-U G-U pairs_rate mean_deg_Mg_pH10
0 0.420561 0.177570 0.214953 0.186916 0.086957 0.130435 0.260870 0.347826 0.173913 0.000 0.429907 0.559628
1 0.401869 0.224299 0.186916 0.186916 0.041667 0.208333 0.208333 0.291667 0.125000 0.125 0.448598 0.578362
2 0.317757 0.308411 0.308411 0.065421 0.040000 0.320000 0.040000 0.480000 0.120000 0.000 0.467290 0.411699
3 0.383178 0.224299 0.224299 0.168224 0.000000 0.370370 0.148148 0.222222 0.259259 0.000 0.504673 0.353104
4 0.345794 0.261682 0.252336 0.140187 0.032258 0.322581 0.161290 0.258065 0.225806 0.000 0.579439 0.443834
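As a quick sanity check, the four nucleotide fractions should sum to one for each sample, as they do in the rows above:

# Each sample's nucleotide fractions should add up to 1.
nuc_cols = ['A_percent', 'G_percent', 'C_percent', 'U_percent']
assert np.allclose(data[nuc_cols].sum(axis=1), 1.0)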

# Hold out a test set and name the output column we want to predict.
train, test = train_test_split(data, random_state=42)
output_name = 'mean_deg_Mg_pH10'

We're now going to train the QLattice on this problem. We use the auto_run function, which runs the whole simulation automatically; a short, cursory run is enough to get indications we can learn from.

ql = feyn.connect_qlattice()
ql.reset(random_seed=42)
models = ql.auto_run(
    train,
    output_name=output_name,
    n_epochs=10
    )
[Output: Loss: 4.91E-03 — Epoch no. 10/10 — Tried 17742 models — Completed in 18s. Model diagram: mean_deg_Mg_pH10 ← add(U_percent, multiply(gaussian(pairs_rate, A_percent), U-G)).]

The Model on display is the best of the 17742 Models the QLattice tried. We take a closer look with the plot_signal function to inspect the Model's metrics and check whether it contains redundancies.

best = models[0]
best.plot_signal(train)
[plot_signal output: the model diagram coloured by Pearson correlation with the output at each node — output 0.70, add 0.70, U_percent 0.38, multiply 0.61, gaussian 0.60, pairs_rate -0.03, A_percent 0.06, U-G -0.11; colour scale from -1 to +1.]

This plot displays the correlation of each step of the model with the output. Reading from left to right, you can see how each interaction increases the correlation between its output and the target. Reading from right to left, you can see from the drop in correlation how much each subtree contributes.

Here we can see that a lot of the signal is captured in the gaussian interaction between A_percent and pairs_rate. Adding U_percent also increases the signal; however, U-G does not seem to add anything extra to the Model at all.

Based on these observations, we will do another run, but this time we will try to remove the redundancy by reducing max_complexity to 5.

Then we will let the model above compete with the models of the new run by passing it in through the starting_models parameter. We will let this train a little longer, to give the QLattice time to search the space and fit the models to completion.

models = ql.auto_run(
    train,
    output_name=output_name,
    max_complexity=5,
    n_epochs=20,
    starting_models=models
    )
[Output: Loss: 4.74E-03 — Epoch no. 20/20 — Tried 26412 models — Completed in 32s. Model diagram: mean_deg_Mg_pH10 ← multiply(gaussian(A_percent, pairs_rate), U_percent).]
best = models[0]
best.plot(train,test)
[plot output: the model diagram (inputs: A_percent, pairs_rate, U_percent) together with its summary metrics.]

            R2       RMSE     MAE
Training    0.504    0.0689   0.0537
Test        0.469    0.0672   0.0506
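If you want these numbers programmatically rather than from the plot, here is a minimal sketch using the Model's predict method and scikit-learn's r2_score (the scikit-learn import is our addition, not part of the tutorial):

from sklearn.metrics import r2_score

# Predict on the hold-out set and score the predictions ourselves.
y_pred = best.predict(test)
print(r2_score(test[output_name], y_pred))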

This Model overtook the previous one. It is much slimmer and captures the important features and interactions: the gaussian of pairs_rate and A_percent, and the multiplication with U_percent.

So you can see that this simple workflow gives you a lot of power in finding the features and transformations that really matter to the output. It also yields a much more interpretable model compared to other machine learning methods.
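To see that interpretability in practice, you can write the fitted Model out as a closed-form expression. A small sketch, assuming the sympify method available on Feyn Models in version 2.x:

# Express the fitted model as a symbolic formula (SymPy expression),
# rounded to 3 significant digits.
expr = best.sympify(signif=3)
print(expr)

Printing this expression gives the full formula of the Model in terms of its three inputs, something a black-box model cannot offer.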


