Exploring the Language of Microbial Genomes
- 14:00 23rd May 2025 (week 4, Trinity Term 2025) (this is a virtual seminar)
Revealing the function of uncharacterized genes is a fundamental challenge in an era of ever-increasing volumes of sequencing data. The recent revolution in natural language processing (NLP) has led to the application of language models to various types of biological data to address different tasks. In this talk, I will discuss a new concept for modeling microbial genomics using learning methodologies adapted from NLP. We use complete genes as tokens and model their "semantics" based on a biological corpus of more than 360 million microbial genes within their genomic context. We demonstrate our model's ability to infer gene function correctly. We then systematically evaluate the "discovery potential" of different functional categories, pinpointing those with the most genes yet to be characterized. Our approach has revealed systems associated with microbial interaction and defense. Overall, our methodology highlights the merit of modeling microbial genomes at different levels of abstraction to uncover new gene functions in microbes.