Principled Generalisation for Scientific ML

PhD Projects in Artificial Intelligence

Project Summary

Modern machine-learning systems often fail precisely where they would be most valuable to science: under distribution shift. New molecular scaffolds, unseen protein families, and altered experimental regimes routinely confound model predictions. This project will develop a principled framework to measure, improve, and validate out-of-distribution (OOD) generalisation for scientific ML.

  1. Establish a shift-aware evaluation framework and benchmark suite for out-of-distribution generalisation in scientific ML.
  2. Develop architectural priors that encode domain symmetries and physical constraints, paired with targeted data augmentation and counterfactual stress tests.
  3. Prospectively validate these methods on real-world data from two to three domains (e.g. protein-ligand binding, enzyme function prediction, molecular property prediction). Where appropriate, this may include experimental validation at EIT.
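
As a rough illustration of what "shift-aware evaluation" in aim 1 can mean in practice, the sketch below holds out whole groups (for example, molecular scaffolds or protein families) so that no group appears in both the training and test sets. This is a minimal, generic group-holdout split, not the project's benchmark design; the function name, the group labels, and the default test fraction are all hypothetical.

```python
import random
from collections import defaultdict


def group_holdout_split(groups, test_fraction=0.2, seed=0):
    """Split sample indices so that no group is shared between train and test.

    `groups` holds one (hypothetical) group label per sample, such as a
    scaffold ID or a protein family name.
    """
    # Collect the sample indices that belong to each group.
    by_group = defaultdict(list)
    for idx, label in enumerate(groups):
        by_group[label].append(idx)

    # Shuffle the group labels reproducibly.
    group_ids = sorted(by_group)
    random.Random(seed).shuffle(group_ids)

    target = test_fraction * len(groups)
    train_idx, test_idx = [], []
    for label in group_ids:
        # Assign whole groups to the test set until it reaches the target size.
        if len(test_idx) < target:
            test_idx.extend(by_group[label])
        else:
            train_idx.extend(by_group[label])
    return sorted(train_idx), sorted(test_idx)


# Toy example: ten samples drawn from five made-up families.
families = ["A", "A", "B", "B", "C", "C", "D", "D", "E", "E"]
train_idx, test_idx = group_holdout_split(families, test_fraction=0.2)
train_families = {families[i] for i in train_idx}
test_families = {families[i] for i in test_idx}
```

A model scored on `test_idx` is then judged only on families it never saw during training, which is usually a harder and more informative test than a random split of the same data.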

The student will help define how the field measures and achieves useful scientific generalisation, with opportunities to stress-test models on real data and (where feasible) close the loop with experimental validation at EIT.

Potential Supervisors

  • Dr Liam Atkinson (Research Engineer, EIT)
  • Dr Ira Ktena (Research Scientist, EIT)
  • Dr Ben Chamberlain (Research Scientist, EIT)
  • Additional Supervisor(s) from the University of Oxford.

Skills Recommended

  • A strong quantitative background
  • Proficiency in Python and scientific computing (NumPy/Pandas)
  • Solid experience with machine learning and deep learning (e.g., PyTorch or JAX)
  • Familiarity with at least one relevant scientific domain (e.g., computational chemistry, structural biology, bioinformatics, cheminformatics) or a strong interest in learning it.

Skills to be Developed

  • Strong theoretical understanding of the limitations and guiding principles of machine learning
  • Designing novel architectures (e.g. geometric / structured modelling)
  • Domain knowledge (e.g. protein/ligand co-folding)
  • Benchmarking & evaluation for OOD data.

University DPhil Courses 

Relevant Background Reading