Language May Be All Omics Needs: Harmonizing Multimodal Data for Omics Understanding With CellHermes

Abstract

Decoding cellular systems requires integrating diverse omics data, yet most models are trained from scratch on a single modality, restricting generalization. Here we present CellHermes, a biological language model that leverages pretrained large language models (LLMs) to integrate multimodal omics data, such as transcriptomic profiles and PPI networks, through natural language for better omics understanding. By reformulating these datasets into question-answer pairs, CellHermes emulates multiple self-supervised learning paradigms within a single, universal, language-based framework via LoRA fine-tuning of an existing natural-language LLM, achieving comparable or even better performance than current single-cell foundation models trained from scratch. Within this framework, CellHermes functions as an encoder, a predictor, and an explainer, supporting a range of downstream tasks. As an encoder, CellHermes represents biological entities with embeddings that enable accurate network discovery and cross-dataset generalization. As a predictor, it unifies heterogeneous downstream tasks by translating them into natural-language question-answer pairs, allowing a single model to perform multi-task prediction through instruction fine-tuning. As an explainer, it elucidates molecular mechanisms by combining attention analysis with text-based reasoning, leveraging the interpretability and reasoning capabilities of LLMs. We evaluated CellHermes as an encoder for representing genes and cells on 5 gene-level downstream tasks and 5 diverse single-cell datasets across different tissues, comparing it with other single-cell foundation models. We also evaluated CellHermes as a predictor on BioUniBench, a new benchmark for LLMs comprising 10 tasks across 7 databases, including perturbation response, cell fitness estimation, gene-disease association, and cell type identification.
All benchmarks suggest that CellHermes can unify multiple biological tasks within one model without compromising performance. Applying CellHermes as an explainer to a melanoma patient dataset, we uncover potential key genes in tumor-reactive T cells. Together, CellHermes establishes natural language as a unifying medium for omics, offering a foundation that may ultimately enable a more integrated and interpretable research loop of biological representation, prediction, and interpretation.

Preprint

Nov. 2025

Authors

Y. Gao • W. Wang • Y. Zhao • K. Dong • C. Shan • W. Zheng • T. Richter • Z. Li • S. Chen • F. J. Theis • Q. Liu

Links

DOI

Research Area

 C2 | Biology

BibTeXKey: GWZ+25
