Among all medical data, genetic information plays a special role. It is a text, digital by nature. It can be measured almost perfectly and is cheap to obtain. The inherited genome, or germline genome, remains constant throughout an individual's life. It can be a cause but not a consequence of disease. Causal predictions can be validated experimentally with genome editing technologies. Furthermore, the genome encodes the molecules of life (RNAs and proteins), offering a valuable entry point for understanding cellular mechanisms and designing pharmaceutical interventions. So far, however, much of the human genetic sequence is not understood.
Our goal is an improved understanding of the genetic basis of gene regulation and its implications for disease. To this end, we employ statistical modeling of 'omics data and work in close collaboration with experimentalists.
Quantitative modeling of gene regulation
The control of gene expression, i.e. how much of a gene product (RNA or protein) is available in the cell, is essential to cell biology. However, no computational model is yet able to predict gene expression levels in a given cell type from genomic sequence. Consequently, most of the variants associated with common diseases, which are non-coding, remain poorly interpreted. Likewise, no genetic diagnosis can be provided for the majority of patients with a rare disease who show no obvious disease-causing coding variant.
To address this need, we are developing approaches to delineate the quantitative effects of any genetic variation on gene expression. Our studies help not only in deciphering the genetic regulatory code (Eser, Wachutka et al., MSB, 2016; Cheng et al., RNA, 2017) but also in understanding gene regulatory mechanisms, such as the role of regulatory feedback (Bader et al., MSB, 2015). At the heart of this endeavour lie the development of machine learning algorithms (Avsec et al., Bioinformatics, 2017) and tight interactions with experimentalists to design novel experimental approaches, notably with Patrick Cramer (Schwalb et al., Science, 2016, press release).
Systems genetics: DNA, molecular profiles, and diseases
We develop approaches to pinpoint molecular causes of diseases by integrating 'multi-omics' data (genomic, transcriptomic, proteomic, ...). In this context, we have shown how exploiting environmental perturbations greatly helps in distinguishing causal from merely correlative associations (Gagneur, Stegle et al., PLoS Genet., 2013).
Together with the Prokisch group at the Institute of Human Genetics in Munich, we have pioneered an effective approach in which we sequence not only the DNA but also the RNA of patients suspected of having a genetic disorder (Kremer, Bader et al., Nat. Commun., 2017, press release). This led to a boost in diagnosis, pinpointing the disease-causing gene for, by now, 15% of unsolved cases. We are developing this approach further, using improved algorithms, larger datasets and more data modalities (proteomics, phenotypic descriptors, etc.).
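To give a flavour of how RNA sequencing can point to a candidate gene, here is a minimal sketch of one simple idea: flagging samples whose expression of a given gene deviates strongly from the rest of the cohort. It is an illustration only and not the pipeline of the cited study. The function names, the log2(count + 1) transform and the z-score threshold of 3 are all assumptions made for this example.

```python
import math
from statistics import mean, stdev


def expression_zscores(counts):
    """Return a z-score per sample for one gene.

    counts: dict mapping sample name -> read count for that gene.
    Counts are log2(count + 1)-transformed before standardising
    (an assumption of this sketch, chosen to tame the skew of count data).
    """
    logs = {sample: math.log2(c + 1) for sample, c in counts.items()}
    mu = mean(logs.values())
    sd = stdev(logs.values())
    if sd == 0:
        # All samples identical: nothing can be called an outlier.
        return {sample: 0.0 for sample in logs}
    return {sample: (value - mu) / sd for sample, value in logs.items()}


def flag_outliers(counts, threshold=3.0):
    """Return the sorted names of samples with |z-score| >= threshold."""
    z = expression_zscores(counts)
    return sorted(sample for sample, value in z.items() if abs(value) >= threshold)


# Hypothetical cohort: 20 controls with similar counts, one patient with very low expression.
cohort = {f"control_{i}": 1000 + 10 * i for i in range(20)}
cohort["patient"] = 50
outliers = flag_outliers(cohort)
```

With a plain z-score the largest attainable value is bounded by the cohort size, so a sketch like this only becomes informative with a reasonably large number of samples.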
Statistical methods and software
Our research leads to the development of statistical methods and software. We implement our methods in well-documented, open-source scientific software (R/Bioconductor or Python). Examples include model-based gene set analysis (MGSA), a Bayesian approach to gene set enrichment analysis (Bauer et al., Bioinformatics, 2011); GenoGAM, a flexible framework for testing differential ChIP-seq occupancy in factorial experimental designs (Stricker et al., Bioinformatics, 2017); and STAN, a hidden Markov model with emission probabilities adapted to model ChIP-seq data (negative binomial counts). STAN allowed us to entirely re-annotate, with improved accuracy, promoters and enhancers in the human genome for 127 cell types and tissues by integrating genome-wide maps of chromatin marks from the ENCODE and Roadmap Epigenomics compendium (Zacher et al., PLoS One, 2017).
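The idea of a hidden Markov model with negative binomial emissions can be sketched in a few lines. The code below is not STAN's implementation or API: it is a simplified, single-track illustration that computes the log-likelihood of a sequence of read counts with the forward algorithm. The function names and the (mean, size) parameterisation of the negative binomial are assumptions of this sketch, and all start and transition probabilities are assumed to be strictly positive.

```python
import math


def nb_logpmf(k, mu, size):
    """Log-probability of count k under a negative binomial with mean mu
    and dispersion parameter size (variance = mu + mu**2 / size)."""
    return (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
            + size * math.log(size / (size + mu))
            + k * math.log(mu / (size + mu)))


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_loglik(counts, start, trans, emissions):
    """Log-likelihood of a count sequence under the HMM (forward algorithm).

    start:     list of initial state probabilities
    trans:     trans[p][s] = probability of moving from state p to state s
    emissions: list of (mu, size) pairs, one per state
    """
    n = len(start)
    # Initialise with the first observation.
    alpha = [math.log(start[s]) + nb_logpmf(counts[0], *emissions[s])
             for s in range(n)]
    # Recursively fold in each subsequent observation, in log space.
    for k in counts[1:]:
        alpha = [logsumexp([alpha[p] + math.log(trans[p][s]) for p in range(n)])
                 + nb_logpmf(k, *emissions[s])
                 for s in range(n)]
    return logsumexp(alpha)


# Hypothetical two-state model: "background" (low counts) and "enriched" (high counts).
loglik = forward_loglik(
    counts=[0, 1, 0, 12, 15, 9, 1, 0],
    start=[0.9, 0.1],
    trans=[[0.9, 0.1], [0.2, 0.8]],
    emissions=[(1.0, 2.0), (10.0, 2.0)],
)
```

Working in log space avoids numerical underflow on long genomic tracks; a genome-wide annotation would additionally require parameter learning and state decoding, which this sketch leaves out.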