Finding AD biomarkers in proteomics data

by: Samuel Demharter and Meera Machado

Feyn version: 2.1+

Last updated: 23/09/2021

Can the QLattice deal with omics data that is noisy and contains thousands of features? It certainly can!

Omics data typically contains hundreds to thousands of features (proteins, transcripts, methylated DNA, etc.) measured in samples derived from sources such as blood, tissue or cell culture. Such data is often used for exploratory analysis, e.g. in biomarker discovery or in understanding the mechanism of action of a drug. The analysis often resembles a "fishing exercise".

Thus, there is a need to quickly and reliably identify the most important features and their interactions that contribute to a certain signal (e.g. disease state, cell-type identity, cancer detection).

In this tutorial we present a brief workflow for building simple and interpretable models for proteomics data. This specific example is taken from a study by Bader & Geyer et al. 2020 (Mann group) and contains samples taken from the cerebrospinal fluid of Alzheimer's disease (AD) patients and non-AD patients. We will show you how to build a QLattice model that classifies people as AD or non-AD according to their proteomic profiles.

The dataset contains over a thousand features, each describing the intensity of a protein measured by mass spectrometry.

import numpy as np
import pandas as pd
import feyn

from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

Load the data

Note that the data has been preprocessed and missing values have been imputed. It contains 1166 proteins measured in 137 subjects: 88 non-AD and 49 AD.

data = pd.read_csv("../data/ad_omics.csv")

# Record which columns are categorical (features are treated as numerical by default).
stypes = {}
for f in data.columns:
    if data[f].dtype == 'object':
        stypes[f] = 'c'
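
The same kind of mapping can be built without checking for the `object` dtype by name. This is a minimal sketch on a toy DataFrame: the column names `sex` and `site` are hypothetical and not taken from the dataset, and treating every non-numeric column as categorical is an assumption that may not suit every table.

```python
import pandas as pd

# Toy stand-in for the proteomics table; the column names are hypothetical.
toy = pd.DataFrame({
    "MAPT": [1.2e4, 3.4e4, 2.2e4],
    "sex": ["f", "m", "f"],
    "site": ["A", "B", "A"],
})

# Mark every non-numeric column as categorical ('c').
toy_stypes = {
    col: "c"
    for col in toy.columns
    if not pd.api.types.is_numeric_dtype(toy[col])
}
print(toy_stypes)  # {'sex': 'c', 'site': 'c'}
```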

Split dataset into train and test set

# Set random seed for reproducibility
random_seed = 42

# Define the target variable
target = "_clinical AD diagnosis"

# Split
train, test = train_test_split(data, test_size=0.33, stratify=data[target], random_state=random_seed)
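
To see what `stratify` does, the sketch below repeats the split on a toy table that holds only labels, using the class counts quoted earlier (88 non-AD, 49 AD). The column name `label` and the 0/1 encoding are assumptions made for illustration; no real measurements are involved.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labels with the class counts from the text: 0 = non-AD, 1 = AD.
toy = pd.DataFrame({"label": [0] * 88 + [1] * 49})

toy_train, toy_test = train_test_split(
    toy, test_size=0.33, stratify=toy["label"], random_state=42
)

# With stratification the AD fraction is nearly the same in every subset.
for name, part in [("all", toy), ("train", toy_train), ("test", toy_test)]:
    print(name, len(part), round(part["label"].mean(), 3))
```

Without `stratify`, a small dataset like this one can end up with noticeably different class balances in the two subsets, which makes the test metrics harder to compare with the training metrics.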

Train the QLattice

ql = feyn.connect_qlattice()
ql.reset(random_seed=random_seed)
models = ql.auto_run(
    data=train,
    output_name=target,
    kind='classification',
    stypes=stypes,
    n_epochs=20
    )
Loss: 1.19E-01
Epoch no. 20/20 - Tried 43648 models - Completed in 1m 45s.

[Model graph]
0  out       _clinical AD diagnosis  logistic: w=4.9938 bias=-0.9558
1  add
2  num       MAPT                    linear: scale=0.000038 w=2.596965 bias=-0.3944
3  multiply
4  num       LILRA2                  linear: scale=0.000067 w=0.967831 bias=0.6829
5  add
6  num       ENOPH1                  linear: scale=0.000068 w=1.062997 bias=-1.6156
7  num       AP2B1                   linear: scale=0.000006 w=0.181887 bias=-1.0588

Inspect the top model

best = models[0]
best.plot(train, test)
[Figure: the model graph shown after training, followed by summary metrics. Inputs: MAPT, LILRA2, ENOPH1, AP2B1]

            Training   Test
Accuracy    0.956      0.935
AUC         0.993      0.96
Precision   0.939      0.842
Recall      0.939      1.0
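
For reference, accuracy, precision and recall follow from confusion-matrix counts as sketched below. The counts are hypothetical: they were chosen to be consistent with the reported test metrics, assuming a 46-sample test set (33% of 137 subjects, rounded up), and the actual confusion matrix is not given here. AUC is left out because it depends on the ranked scores rather than on the counts.

```python
# Hypothetical confusion-matrix counts (not taken from the tutorial's output).
tp, fp, fn, tn = 16, 3, 0, 27

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)   # share of positive predictions that are correct
recall = tp / (tp + fn)      # share of true AD cases that are found

print(round(accuracy, 3), round(precision, 3), round(recall, 3))  # 0.935 0.842 1.0
```

A recall of 1.0 together with a precision below 1 means that no AD case in the test set is missed, at the cost of a few false positives.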

best.plot_signal(train)
[Figure: the model graph annotated with Pearson correlation values on a colour scale from -1 to +1. Values displayed, in order: 0.93, 0.79, 0.76, 0.5, -0.27, 0.46, 0.42, 0.29]
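
The Pearson correlation on this scale can be computed for any pair of arrays with NumPy. The sketch below uses made-up numbers rather than the model's node outputs, and it is an assumption that `plot_signal` correlates each node's output with the target in this way.

```python
import numpy as np

# Made-up values: a node's output and a binary target for six samples.
node_output = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.9])
target = np.array([0, 0, 0, 1, 1, 1])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r.
r = np.corrcoef(node_output, target)[0, 1]

# The same value from the definition: covariance over the product of spreads.
dx = node_output - node_output.mean()
dy = target - target.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(round(r, 2), round(r_manual, 2))
```

Values close to +1 or -1 indicate a node whose output tracks the target closely, while values near 0 indicate little linear association.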

As expected, MAPT (i.e. Tau) seems to be driving most of the signal here. Let's investigate further.

Explore features

Let's look at how the different features play together.

show_quantiles = 'ENOPH1'
fixed = {}
fixed[show_quantiles] = [
    train[show_quantiles].quantile(q=0.25),
    train[show_quantiles].quantile(q=0.5),
    train[show_quantiles].quantile(q=0.75)
]

best.plot_response_1d(train, by="MAPT", input_constraints=fixed)
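
To see what ends up in `fixed`, here is a minimal sketch of the same quantile computation on a toy series (the values are made up). Passing a list to `quantile` returns all three quartiles in one call.

```python
import pandas as pd

# Made-up intensities standing in for train['ENOPH1'].
toy_enoph1 = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# One call returns the 25th, 50th and 75th percentiles.
quartiles = toy_enoph1.quantile(q=[0.25, 0.5, 0.75]).tolist()
toy_fixed = {"ENOPH1": quartiles}
print(toy_fixed)  # {'ENOPH1': [2.0, 3.0, 4.0]}
```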

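[Figure: 1D response plot of the model output against MAPT, with ENOPH1 fixed at its 25th, 50th and 75th percentiles]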

This response plot shows how higher ENOPH1 levels shift the MAPT curve to the left. In other words, the higher your ENOPH1 level, the lower your MAPT level has to be for a positive AD prediction.
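
The leftward shift can be illustrated with a toy logistic model with two additive inputs. This is not the fitted QLattice model: the functional form and the coefficients below are invented for illustration only.

```python
import math

def toy_probability(mapt, enoph1, a=2.0, b=1.0, c=-3.0):
    """Toy model p = sigmoid(a*mapt + b*enoph1 + c); coefficients are invented."""
    return 1.0 / (1.0 + math.exp(-(a * mapt + b * enoph1 + c)))

def mapt_threshold(enoph1, a=2.0, b=1.0, c=-3.0):
    """MAPT value at which p crosses 0.5, i.e. where a*mapt + b*enoph1 + c = 0."""
    return -(b * enoph1 + c) / a

# Raising the second input lowers the MAPT value needed to reach p = 0.5.
for enoph1 in (0.0, 1.0, 2.0):
    print(enoph1, mapt_threshold(enoph1))
```

In this toy setting the thresholds are 1.5, 1.0 and 0.5: each increase in the second input moves the point where the curve crosses 0.5 further to the left.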

There is much more to be explored. Watch this space and get in touch if you'd like to give it a shot.

Copyright © 2022 Abzu.ai
Feyn®, QGraph®, and the QLattice® are registered trademarks of Abzu®