
Transformer Model for Genome Sequence Analysis

MCML Authors

Abstract

One major challenge of applying machine learning in genomics is the scarcity of labeled data, which often requires expensive and time-consuming physical experimentation under laboratory conditions to obtain. However, the advent of high-throughput sequencing has made large quantities of unlabeled genome data available. This can be used to apply semi-supervised learning methods through representation learning. In this paper, we investigate the impact of a popular and well-established language model, namely BERT [Devlin et al., 2018], on genome sequence analysis. Specifically, we adapt DNABERT [Ji et al., 2021] to GenomeNet-BERT in order to produce useful representations for downstream tasks such as classification and semi-supervised learning. We explore different pretraining setups and compare their performance on a virus genome classification task against strictly supervised training and baselines across different training set sizes. The conducted experiments show that this architecture provides an increase in performance compared to existing methods at the cost of more resource-intensive training.
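As a rough illustration of the workflow described in the abstract, the sketch below shows how a DNABERT-style model could be fine-tuned for genome sequence classification using overlapping k-mer tokens. The checkpoint path, k-mer size, number of labels, and example sequence are illustrative assumptions and are not taken from the paper.

```python
# Hedged sketch: fine-tuning a BERT-style genome language model for sequence
# classification. Checkpoint, k-mer size, and labels are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def kmerize(sequence: str, k: int = 6) -> str:
    """Split a DNA sequence into overlapping k-mers, the token unit used by
    DNABERT-style models (k = 6 is a common choice, assumed here)."""
    return " ".join(sequence[i:i + k] for i in range(len(sequence) - k + 1))

# Hypothetical checkpoint identifier; substitute a pretrained genome language model.
checkpoint = "path/to/pretrained-genome-bert"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Toy example: score a short sequence for a binary label (e.g. viral vs. non-viral).
inputs = tokenizer(kmerize("ATGCGTACGTTAGCCGTAACGT"), return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
predicted_class = logits.argmax(dim=-1).item()
print(predicted_class)
```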

inproceedings


LMRL @NeurIPS 2022

Workshop on Learning Meaningful Representations of Life at the 36th Conference on Neural Information Processing Systems. New Orleans, LA, USA, Nov 28-Dec 09, 2022.

Authors

N. Hurmer • X.-Y. To • M. Binder • H. A. Gündüz • P. C. Münch • R. Mreches • A. C. McHardy • B. Bischl • M. Rezaei

Links

URL

Research Area

 A1 | Statistical Foundations & Explainability

BibTeX Key: HTB+22
