Detecting Liver Cancer (HCC) in Plasma

by: Samuel Demharter & Valdemar Stentoft-Hansen

Feyn version: 2.1+

Last updated: 23/09/2021

How well does the QLattice deal with highly correlated features in omics data?

In this tutorial we will be exploring this question with a problem and a dataset from the field of liver cancer diagnosis. Namely, classifying Hepatocellular Carcinoma patients using methylation biomarkers as features. This specific example is taken from a study by Wen, Li, Guo, Liu et al. and contains data generated by methylated CpG tandems amplification and sequencing (MCTA-Seq), a powerful method that can detect thousands of hypermethylated CpG islands (CGIs) simultaneously in circulating cell-free DNA (ccfDNA).

The original dataset contains thousands of features (CGIs). However, as you will see the model will achieve very high performance with only a handful of them.

import numpy as np
import pandas as pd
import feyn
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import seaborn as sns

Load the data

Note, the data has been preprocessed and only includes features with high variance. The final curated dataset contains 1712 CGI features with a binary target of 55 cancer-free (0) and 36 cancer (1) individuals. The CGI-features cover the methylated alleles per million mapped reads.

data = pd.read_csv("../data/cancer_hcc.csv")

data.head()

	chr10_102882977_102883551	chr10_102982302_102983600	chr10_105344173_105345039	chr10_106028542_106029047	chr10_112835990_112839303	chr10_11386302_11386875	chr10_115803358_115805468	chr10_11784407_11784937	chr10_11911554_11912432	chr10_124907283_124911035	...	chrX_99661653_99665358	chrX_9982513_9984583	chrY_10032933_10034686	chrY_13470631_13471733	chrY_13488004_13489328	chrY_13585173_13586108	chrY_14100002_14100593	chrY_15863345_15864834	chrY_28555535_28555932	target
0	26	37	56	68	123	316	46	14	306	53	...	25	0	690	184	210	176	169	187	44	1
1	90	42	51	113	70	174	50	5	146	53	...	15	0	346	166	82	162	140	52	15	1
2	10	32	34	4	192	242	29	73	103	76	...	221	95	382	92	260	0	0	0	0	1
3	15	30	75	32	234	342	36	101	373	41	...	15	0	421	291	292	195	162	298	53	1
4	54	105	107	151	199	252	129	136	310	34	...	103	1	488	312	282	296	262	188	46	1

5 rows × 1713 columns

Split dataset into train and test set

random_seed = 42
train, test = train_test_split(data, test_size = 0.33, stratify=data["target"], random_state=random_seed)

Train the QLattice

ql = feyn.connect_qlattice()
ql.reset(random_seed = 42)

models = ql.auto_run(
    data=train,
    output_name='target',
    kind='classification',
    criterion='bic',
    max_complexity=4
    )

From here we choose the best fitting graph on the training set as per the bayesian information criterion stated in the ql.auto_run()-call. The bayesian information criterion is used as a balancing measure between fitness and complexity, penalising more complex models. We are looking for simple models because of their tendency to generalise better to unseen data – following the Occam's razor principles.

# Choose the best model
best = models[0]

Evaluating graph performance

best.plot(train, test)

Training Metrics

Test

This two-feature model is the top performer and delivers an almost perfect AUC-score on our holdout-set. Notice that in this process we are boiling 1712 features down to two, drastically removing redundant variance.

In our terminology, model is just another word for mathematical formula. We can print our model as a mathematical function with the sympify call as shown below.

best.sympify(signif = 2)

$\displaystyle \operatorname{logreg}{\left(0.17 chr_{175947306059483266} - 0.063 chr_{97213122172132174} + 7.5 \right)}$

Why would our graph pick up these two features?

Let's have a look - the seaborn pairplot is a useful tool for this. The pairplot shows the density and scatter plots revealing the relationships between the features.

# Filter only graph features and target
features_data = train[best.features + ["target"]]

# Pairplot with target coloring
sns.pairplot(features_data, hue = 'target')

<seaborn.axisgrid.PairGrid at 0x7ff85a06f220>

png

The chromosome 17 feature ('chr17...') is the primary separator out of our two features. Non-cancer individuals have a stable low level values methylated alleles ppm whereas cancer individuals generally have higher, more varying levels of this feature. However, this is not the whole story. We observe that some cancer individuals have low levels of 'chr17...'. To identify cancer individuals with low level of 'chr17...' our model's second feature 'chr9...' offers a solution: There is tendency that lower levels of 'chr9...' identify cancer individuals. This dynamic is captured well in the 2d partial plot shown below:

# Plot a 2-dimensional partial plot
best.plot_response_2d(train)

png

The partial plot displays the decision boundary determined by our model. If you were to draw the decision boundary yourself it would not look much different from what our model has produced. This is a fine sanity check of your model.

Exploring the multicollinearity

There might be competing models with similar performance metrics – especially in a case like this, where there are many features with potentially high correlation between them. This phenomenon is also known as multicollinearity.

In the following code we will plot a correlation heatmap of the data in an attempt to understand how the QLattice is finding high performance with only two features.

# Take a random subset of 100 features
data_wo_target = data.drop('target', axis = 1)
sampled_features = np.unique(list(np.random.choice(data_wo_target.columns, 100, replace=False)) + best.features)
sample_data = data[sampled_features]

# Prepare for labelling the two model features in the heatmap
label_feature = list()

for x, i in enumerate(sampled_features):
    if i in best.features:
        label_feature.append(1)
    else: label_feature.append(0)

# Assign colors to chosen features
lut = dict({0: 'w',
           1: 'b'})

# Map colors to correlation data
row_coloring = pd.Series(label_feature, index = sample_data.corr().index).map(lut)

p = sns.clustermap(sample_data.corr(), method="ward", cmap='feyn-diverging', row_colors = row_coloring,
               vmin=-1, vmax=1, figsize=(20,15))

png

Above we are displaying pairwise correlations between a random subset of 100 features including the two model features (blue bars on the left). The heavy green colors indicate high positive correlation between the features; multicollinearity confirmed. The plot clusters the features based on similarity – in this case using pearson correlation – and sorts the features accordingly. We find two main groups of linear feature variance as displayed by the top branches in the dendrogram. You could suspect the QLattice digging for signal in the available variance across these groups, and it turns out that QLattice does exactly that by picking one feature from each main group.

Conclusion

In this tutorial we have gone through the QLattice workflow for exploring the relationships in our omics data. In 10 minutes of modelling time we have gone from 1712 features to pinpointing two features that in conjunction provide a competetive and interpretable model.

We have made use of a range of in-built plotting functions that have helped in interpreting our model and the relationships between features. We are now in the position to build on our findings to further improve our working theory and ultimately our predictions.

At Abzu we are exploring similar use cases to help researchers understand the underlying mechanisms in their omics data. If you are dealing with similar challenges why not reach out and/or sign up for a QLattice trial?

Feyn Documentation

Classification

Regression

Regression

Classification

Regression

Detecting Liver Cancer (HCC) in Plasma

Load the data

Split dataset into train and test set

Train the QLattice

Evaluating graph performance

Training Metrics

Test

Why would our graph pick up these two features?

Exploring the multicollinearity

Conclusion

Subscribe to get news about Feyn and the QLattice.

	chr10_102882977_102883551	chr10_102982302_102983600	chr10_105344173_105345039	chr10_106028542_106029047	chr10_112835990_112839303	chr10_11386302_11386875	chr10_115803358_115805468	chr10_11784407_11784937	chr10_11911554_11912432	chr10_124907283_124911035	...	chrX_99661653_99665358	chrX_9982513_9984583	chrY_10032933_10034686	chrY_13470631_13471733	chrY_13488004_13489328	chrY_13585173_13586108	chrY_14100002_14100593	chrY_15863345_15864834	chrY_28555535_28555932	target
0	26	37	56	68	123	316	46	14	306	53	...	25	0	690	184	210	176	169	187	44	1
1	90	42	51	113	70	174	50	5	146	53	...	15	0	346	166	82	162	140	52	15	1
2	10	32	34	4	192	242	29	73	103	76	...	221	95	382	92	260	0	0	0	0	1
3	15	30	75	32	234	342	36	101	373	41	...	15	0	421	291	292	195	162	298	53	1
4	54	105	107	151	199	252	129	136	310	34	...	103	1	488	312	282	296	262	188	46	1

	chr10_102882977_102883551	chr10_102982302_102983600	chr10_105344173_105345039	chr10_106028542_106029047	chr10_112835990_112839303	chr10_11386302_11386875	chr10_115803358_115805468	chr10_11784407_11784937	chr10_11911554_11912432	chr10_124907283_124911035	...	chrX_99661653_99665358	chrX_9982513_9984583	chrY_10032933_10034686	chrY_13470631_13471733	chrY_13488004_13489328	chrY_13585173_13586108	chrY_14100002_14100593	chrY_15863345_15864834	chrY_28555535_28555932	target
0	26	37	56	68	123	316	46	14	306	53	...	25	0	690	184	210	176	169	187	44	1
1	90	42	51	113	70	174	50	5	146	53	...	15	0	346	166	82	162	140	52	15	1
2	10	32	34	4	192	242	29	73	103	76	...	221	95	382	92	260	0	0	0	0	1
3	15	30	75	32	234	342	36	101	373	41	...	15	0	421	291	292	195	162	298	53	1
4	54	105	107	151	199	252	129	136	310	34	...	103	1	488	312	282	296	262	188	46	1

	chr10_102882977_102883551	chr10_102982302_102983600	chr10_105344173_105345039	chr10_106028542_106029047	chr10_112835990_112839303	chr10_11386302_11386875	chr10_115803358_115805468	chr10_11784407_11784937	chr10_11911554_11912432	chr10_124907283_124911035	...	chrX_99661653_99665358	chrX_9982513_9984583	chrY_10032933_10034686	chrY_13470631_13471733	chrY_13488004_13489328	chrY_13585173_13586108	chrY_14100002_14100593	chrY_15863345_15864834	chrY_28555535_28555932	target
0	26	37	56	68	123	316	46	14	306	53	...	25	0	690	184	210	176	169	187	44	1
1	90	42	51	113	70	174	50	5	146	53	...	15	0	346	166	82	162	140	52	15	1
2	10	32	34	4	192	242	29	73	103	76	...	221	95	382	92	260	0	0	0	0	1
3	15	30	75	32	234	342	36	101	373	41	...	15	0	421	291	292	195	162	298	53	1
4	54	105	107	151	199	252	129	136	310	34	...	103	1	488	312	282	296	262	188	46	1