Discrete diffusion for biomolecules
PhD Projects in Artificial Intelligence

Project Summary
Diffusion models have become an established approach for modelling natural images, audio and video, and have enabled significant advances in generative biology, particularly in protein design and docking. Applying diffusion models to sequences with long-range dependencies, such as those that arise in natural language, poses new challenges, primarily due to limitations of diffusion models in training efficiency and inference speed, properties that are essential for natural language generation.
Recently, discrete diffusion models, such as masked diffusion, have demonstrated promising results in language generation, while other efforts have achieved performance parity with autoregressive models on reasoning tasks without compromising inference speed.
Diffusion models have significant benefits that make them suitable for modelling biomolecular sequences, including the ability to predict multiple tokens simultaneously, leverage bidirectional context during inference and revise model predictions, i.e. iteratively refine the predicted tokens across diffusion steps. However, they remain underexplored for sequence modelling in the presence of long-range dependencies, with several outstanding challenges currently limiting their applicability. In this project, the student will explore data-efficient and scalable architectures for discrete diffusion models, with a focus on biomolecular modelling that requires capturing long-range dependencies (e.g. on the order of millions of nucleotides). The project will explore hierarchical approaches as well as different ways to incorporate conditioning information into discrete diffusion models.
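To make the properties above concrete, the following is a minimal sketch of confidence-based masked diffusion sampling: the sequence starts fully masked, a bidirectional model scores every position at each step, and the most confident masked positions are committed in parallel. The `toy_model` function is a hypothetical stand-in (a real denoiser would be a transformer attending to the whole partially masked sequence), and the mask token id, vocabulary size and unmasking schedule are illustrative assumptions, not a specific published method.

```python
import numpy as np

MASK = -1  # hypothetical mask token id (assumption, not from a specific model)
VOCAB = 4  # toy vocabulary, e.g. nucleotides A/C/G/T

def toy_model(x):
    """Stand-in for a bidirectional denoiser: returns per-position logits
    over the vocabulary given the partially masked sequence x."""
    rng = np.random.default_rng(abs(hash(x.tobytes())) % (2**32))
    return rng.normal(size=(len(x), VOCAB))

def masked_diffusion_sample(length, steps, model=toy_model):
    x = np.full(length, MASK)
    for t in range(steps):
        logits = model(x)  # bidirectional: conditions on all positions at once
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        pred = probs.argmax(-1)   # predicted token at every position
        conf = probs.max(-1)      # model confidence at every position
        masked = np.where(x == MASK)[0]
        if len(masked) == 0:
            break
        # commit the most confident fraction of still-masked positions,
        # so many tokens are generated per diffusion step
        k = max(1, int(np.ceil(len(masked) / (steps - t))))
        chosen = masked[np.argsort(-conf[masked])[:k]]
        x[chosen] = pred[chosen]
    return x

seq = masked_diffusion_sample(length=16, steps=4)
```

Remasking-based variants additionally allow already-committed tokens to be revised in later steps; this sketch shows only the parallel, bidirectional decoding aspect.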
Potential Supervisors
- Dr Ira Ktena (Research Scientist, EIT)
- Dr Ben Chamberlain (Research Scientist, EIT)
- Dr Liam Atkinson (Research Engineer, EIT)
- Additional Supervisor(s) from the University of Oxford
Skills Recommended
- Strong background in linear algebra, probability and optimisation
- Experience with probabilistic modelling or generative models
- Experience with deep learning (PyTorch or JAX)
Skills to be Developed
- Designing novel diffusion architectures
- Performing analyses of model scaling and data-efficiency
- Domain knowledge in biology
- Developing large scale multi-GPU training and inference codebases
- Communicating results via open-source releases
University DPhil Courses
Relevant Background Reading
- Large Language Diffusion Models
- Simple and Effective Masked Diffusion Language Models
- Diffusion-LM Improves Controllable Text Generation
- Scalable Diffusion Models with Transformers
- Fast Training of Diffusion Models with Masked Transformers
- Classifier-Free Diffusion Guidance