Finding AD biomarkers in proteomics data
by: Samuel Demharter and Meera Machado
Feyn version: 2.1+
Last updated: 23/09/2021
Can the QLattice deal with omics data that is noisy and contains thousands of features? It certainly can!
Omics data typically contains hundreds to thousands of features (proteins, transcripts, methylated DNA etc.) that are measured in samples derived from sources such as blood, tissue or cell culture. These approaches are often used for exploratory analysis, e.g. in biomarker discovery or in understanding the mechanism of action of a certain drug. Such analysis often resembles a "fishing exercise".
Thus, there is a need to quickly and reliably identify the most important features and their interactions that contribute to a certain signal (e.g. disease state, cell-type identity, cancer detection).
In this tutorial we present a brief workflow for building simple and interpretable models for proteomics data. This specific example is taken from a study by Bader & Geyer et al. 2020 (Mann group) and contains samples taken from the cerebrospinal fluid of Alzheimer's disease (AD) patients and non-AD patients. We will show you how to build a QLattice model that can classify people into AD and non-AD according to their proteomic profiles.
The dataset contains over a thousand features (features in this example describe the intensity of different proteins measured by mass spectrometry).
import numpy as np
import pandas as pd
import feyn
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
Load the data
Note, the data has been preprocessed and missing values have been imputed. It contains 1166 proteins and 88 non-AD and 49 AD subjects.
data = pd.read_csv("../data/ad_omics.csv")
# Let's record the categorical data types in our dataset (note features will be treated as numerical by default).
stypes = {}
for f in data.columns:
    if data[f].dtype == 'object':
        stypes[f] = 'c'
Split dataset into train and test set
# Set random seed for reproducibility
random_seed = 42
# Define the target variable
target = "_clinical AD diagnosis"
# Split
train, test = train_test_split(data, test_size=0.33, stratify=data[target], random_state=random_seed)
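To illustrate what the stratify argument is for, here is a minimal hand-rolled sketch of a stratified split using only the Python standard library. It is not scikit-learn's implementation, so exact counts may differ slightly from what train_test_split produces. The function name stratified_split and the 0/1 label encoding are assumptions made for this illustration; only the class counts (88 non-AD, 49 AD) come from the dataset description above.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx) with similar class proportions in both parts.

    Hypothetical helper for illustration only, not scikit-learn's algorithm.
    """
    rng = random.Random(seed)
    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Toy labels mirroring the class counts of the dataset: 0 = non-AD, 1 = AD (assumed encoding).
labels = [0] * 88 + [1] * 49
train_idx, test_idx = stratified_split(labels, test_size=0.33, seed=42)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
train_ad_fraction = train_counts[1] / len(train_idx)
test_ad_fraction = test_counts[1] / len(test_idx)
print("AD fraction in train:", round(train_ad_fraction, 3))
print("AD fraction in test: ", round(test_ad_fraction, 3))
```

Both fractions stay close to the overall AD fraction of 49/137. With only 49 AD subjects, an unstratified split could by chance leave the test set with noticeably fewer or more of them.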
Train the QLattice
ql = feyn.connect_qlattice()
ql.reset(random_seed=random_seed)
models = ql.auto_run(
    data=train,
    output_name=target,
    kind='classification',
    stypes=stypes,
    n_epochs=20
)
Inspect the top model
best = models[0]
best.plot(train, test)
[Figure: model summary plot with training and test metrics]
best.plot_signal(train)
As expected, MAPT (i.e. Tau) seems to be driving most of the signal here. Let's investigate further.
Explore features
Let's look at how the different features play together.
show_quantiles = 'ENOPH1'
fixed = {}
fixed[show_quantiles] = [
    train[show_quantiles].quantile(q=0.25),
    train[show_quantiles].quantile(q=0.5),
    train[show_quantiles].quantile(q=0.75),
]
best.plot_response_1d(train, by="MAPT", input_constraints=fixed)
This response plot shows how higher ENOPH1 levels shift the MAPT curve to the left. In other words, the higher your ENOPH1 levels, the lower your MAPT levels have to be for a positive AD prediction.
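The leftward shift can be reproduced with a toy two-feature logistic model. This is only a sketch: the coefficients, the ENOPH1 values and the helper names below are made up for illustration and are not taken from the fitted QLattice model, which may well be non-linear. The quartiles are computed with the standard library, using the inclusive method so that it interpolates linearly between data points.

```python
import math
from statistics import quantiles

# Hypothetical coefficients for a toy logistic model; NOT the fitted QLattice model.
W_MAPT, W_ENOPH1, BIAS = 2.0, 1.0, -5.0


def predict_proba(mapt, enoph1):
    """Toy probability of a positive AD prediction from two protein intensities."""
    z = W_MAPT * mapt + W_ENOPH1 * enoph1 + BIAS
    return 1.0 / (1.0 + math.exp(-z))


def mapt_threshold(enoph1):
    """MAPT level at which the toy model's probability crosses 0.5."""
    return -(W_ENOPH1 * enoph1 + BIAS) / W_MAPT


# Made-up ENOPH1 intensities standing in for the training column.
enoph1_values = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
# Quartiles with linear interpolation between data points.
q25, q50, q75 = quantiles(enoph1_values, n=4, method="inclusive")

# One decision threshold per fixed ENOPH1 level, like one curve per quantile.
thresholds = [mapt_threshold(q) for q in (q25, q50, q75)]
for q, t in zip((q25, q50, q75), thresholds):
    print(f"ENOPH1 = {q:.2f} -> MAPT threshold = {t:.2f}")
```

Because both toy coefficients are positive, each higher ENOPH1 quartile lowers the MAPT level needed to cross a probability of 0.5, which is the same qualitative pattern as in the response plot.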
There is much more to be explored. Watch this space and get in touch if you'd like to give it a shot.