Utilizing deep learning to predict gene expression


Deep Learning in Gene Expression

Characterizing the gene expression patterns of cells in various biological states has long been a fundamental problem in molecular biology, and gene expression profiling has been widely adopted as a way to capture cellular responses to diseases, genetic perturbations, and drug treatments.

This project was motivated by the CMap project, the LINCS program, and the work presented by Chen et al.

From the CMap and LINCS programs, it was observed that the expression profiles of the large number of genes (~22,000) across the whole human genome are highly correlated; specifically, a set of ~1,000 carefully chosen genes can approximately capture 80% of the information in the CMap data.
The L1000 Luminex bead technology thus measures the expression profiles of these ~1,000 genes, called landmark genes, and the expression of the remaining target genes is inferred computationally.

The LINCS program currently uses linear regression as its inference method, training regression models independently for each target gene. Linear regression inevitably disregards the nonlinearities present in gene expression profiles. We thus saw a need to improve on the current baselines and methods for predicting gene expression; a minimal sketch contrasting the two approaches follows.
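To make the contrast concrete, here is a toy sketch on synthetic data, assuming scikit-learn and dimensions far smaller than the real ~1,000 landmark and ~21,000 target genes: a per-target-gene linear baseline in the spirit of LINCS, versus a single multilayer perceptron in the spirit of D-GEX that predicts all targets jointly and can capture nonlinearities. This is not the actual LINCS or D-GEX code.

```python
# A toy sketch, not the LINCS/D-GEX pipelines. Dimensions, data, and the
# choice of scikit-learn are assumptions made for illustration only.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
n_profiles, n_landmark, n_target = 500, 100, 20   # toy sizes

X = rng.normal(size=(n_profiles, n_landmark))     # landmark gene expression
W = rng.normal(size=(n_landmark, n_target)) / np.sqrt(n_landmark)
Y = np.tanh(X @ W)                                # synthetic nonlinear targets

X_train, X_test = X[:400], X[400:]
Y_train, Y_test = Y[:400], Y[400:]

# LINCS-style baseline: an independent linear model per target gene.
linear_preds = np.column_stack([
    LinearRegression().fit(X_train, Y_train[:, j]).predict(X_test)
    for j in range(n_target)
])

# D-GEX-style alternative: one MLP predicting all target genes jointly.
mlp = MLPRegressor(hidden_layer_sizes=(200, 200), max_iter=1000, random_state=0)
mlp.fit(X_train, Y_train)

print("linear MAE:", mean_absolute_error(Y_test, linear_preds))
print("MLP MAE:   ", mean_absolute_error(Y_test, mlp.predict(X_test)))
```

The actual D-GEX networks are far larger (hidden layers of several thousand units, trained with dropout), but the structure of the comparison is the same.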

Data

We utilized 3 datasets (a hypothetical loading sketch follows the list):

  1. Gene Expression Omnibus (GEO)
  2. Genotype-Tissue Expression (GTEx)
  3. 1000 Genomes expression data
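
Before training, each dataset has to be reduced to a landmark-input / target-output matrix pair. The sketch below is purely hypothetical: the file names, CSV layout (genes as rows, profiles as columns), and the per-gene standardization step are illustrative assumptions, not the actual GEO, GTEx, or 1000 Genomes formats.

```python
# A hypothetical loading/splitting sketch; file names and layout are
# assumptions for illustration, not the real dataset formats.
import pandas as pd

expr = pd.read_csv("expression_matrix.csv", index_col=0)   # genes x profiles
landmarks = pd.read_csv("landmark_genes.txt", header=None)[0].tolist()

is_landmark = expr.index.isin(landmarks)
X = expr.loc[is_landmark].T.to_numpy()    # profiles x ~1,000 landmark genes
Y = expr.loc[~is_landmark].T.to_numpy()   # profiles x target genes

# Standardize each gene to zero mean and unit variance (a common step).
X = (X - X.mean(axis=0)) / X.std(axis=0)
Y = (Y - Y.mean(axis=0)) / Y.std(axis=0)
```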

Interpreting the results

While the deep learning model we used clearly outperforms linear regression, the interpretability of neural networks on gene expression data remains uncertain. A linear model such as linear regression is far more straightforward to interpret: larger coefficients indicate stronger associations between a landmark gene and a target gene, as sketched below.
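As a concrete illustration of that point, the toy sketch below fits a linear model for a single target gene and ranks the landmark genes by coefficient magnitude; all data and names here are synthetic.

```python
# A minimal interpretability sketch on synthetic data: for a linear model
# of one target gene, coefficient magnitudes rank landmark genes by influence.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))        # 500 profiles x 100 landmark genes
y = X @ rng.normal(size=100)           # one target gene, linear response

model = LinearRegression().fit(X, y)
top5 = np.argsort(-np.abs(model.coef_))[:5]   # most influential landmarks first
print("landmark indices with largest |coefficient|:", top5)
```

No comparably direct readout exists for the weights of a deep network, which is what makes the neural model harder to interpret.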

The biggest drawback we experienced was a lack of computational power. We used Google Cloud for data preprocessing and trained on a smaller subset of the dataset, obtaining accuracy values similar to those reported for D-GEX in the paper. Even so, current deep learning methods for predicting gene expression still produce an imputation error of approximately 33%.