DECONVersation: Single-Cell Foundation Model-Derived Embeddings for Robust Cell Type Deconvolution of Bulk RNA-seq Data
Understanding the composition of the tumor microenvironment is critical for cancer research, as a single biopsy contains a complex mixture of malignant cells, immune cells, and structural (stromal) cells. To estimate these cell-type proportions from standard bulk RNA sequencing data, researchers use a process called deconvolution. This computational technique unmixes the aggregate gene expression signal to infer the abundance of each cell type. However, traditional approaches are often hindered by technical noise, substantial inter-patient variability, and biases introduced by manually selecting and weighting marker genes.
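To make the unmixing idea concrete, the following is a minimal sketch of classical signature-based deconvolution, not the DECONVersation method itself. It assumes a bulk expression vector can be approximated as a non-negative mixture of per-cell-type reference profiles, and it solves the resulting non-negative least squares problem with projected gradient descent. The function name `deconvolve_nnls`, the toy signature matrix, and the mixing fractions are all hypothetical values chosen for illustration.

```python
import numpy as np


def deconvolve_nnls(signature, bulk, n_iter=5000):
    """Estimate cell-type fractions p >= 0 that minimize ||signature @ p - bulk||^2.

    signature: (genes x cell types) reference expression profiles (assumed known).
    bulk:      (genes,) bulk expression vector to unmix.
    Solved by projected gradient descent; this is an illustrative solver,
    not the algorithm used by any particular published tool.
    """
    S = np.asarray(signature, dtype=float)
    b = np.asarray(bulk, dtype=float)

    # Step size 1/L, where L = (largest singular value of S)^2 is the
    # Lipschitz constant of the gradient of 0.5 * ||S p - b||^2.
    step = 1.0 / np.linalg.norm(S, 2) ** 2

    # Start from a uniform mixture.
    p = np.full(S.shape[1], 1.0 / S.shape[1])
    for _ in range(n_iter):
        grad = S.T @ (S @ p - b)
        # Gradient step, then projection onto the non-negative orthant.
        p = np.maximum(p - step * grad, 0.0)

    # Rescale so the estimates can be read as proportions.
    total = p.sum()
    return p / total if total > 0 else p


# Toy example: 4 genes, 3 cell types (hypothetical values).
signature = np.array([
    [10.0, 1.0, 0.5],
    [0.5, 8.0, 1.0],
    [1.0, 0.5, 12.0],
    [5.0, 5.0, 5.0],
])
true_fractions = np.array([0.6, 0.3, 0.1])
bulk = signature @ true_fractions  # noise-free synthetic bulk sample

estimated = deconvolve_nnls(signature, bulk)
print(estimated)
```

In this noise-free toy setting the estimate should closely recover the mixing fractions. With real data, the choice of genes in the signature matrix and their relative weighting strongly affect the result, which is the source of the curation bias noted above.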
Here, we introduce DECONVersation, which leverages single-cell foundation models such as Geneformer and Cell2Sentence to address these challenges. By using embeddings derived from these models, the method captures complex gene relationships while reducing the impact of the experimental noise that typically confounds traditional deconvolution tools. It also removes the need for manual curation of gene lists. Benchmarking against established methods such as MuSiC and BayesPrism shows that this approach achieves comparable or superior accuracy across diverse datasets. Notably, the models generalize well to tissues not included in pre-training. We also show that tissue-specific fine-tuning further improves performance.
As the field continues to evaluate the utility of large-scale AI models in biology, deconvolution can serve as a valuable benchmarking task that captures both gene network learning and batch integration.