Bayesian Preference Learning for Active Learning and Robust Optimization in LLMs
Supervisors
Suitable for
Abstract
Prerequisites:
• Familiarity with large language models (LLMs)
• Knowledge of probabilistic machine learning
• Familiarity with deep Bayesian methods and uncertainty quantification
• Programming experience with Python, PyTorch, and the Hugging Face ecosystem
Background
Preference learning is a fundamental problem in machine learning, with applications ranging from recommendation systems to reinforcement learning and AI alignment. Large language models (LLMs) increasingly rely on preference learning to refine their outputs via reinforcement learning from human feedback (RLHF). However, current methods often lack uncertainty quantification, leading to suboptimal learning and potentially biased models.
Bayesian approaches offer principled uncertainty estimation, enabling more robust decision-making. This project will explore Bayesian preference learning in LLMs, focusing on quantifying uncertainty in preference models and leveraging this uncertainty in active learning and robust policy optimization. Active learning can help prioritize queries that maximize information gain, improving data efficiency. Robust policy optimization ensures that learned policies generalize well under distributional shifts, leading to more reliable model behavior. Recent work, such as Melo et al. (2024) [1] on Bayesian active learning, provides a strong foundation for this research.
[1] Melo et al. Deep Bayesian Active Learning for Preference Modelling in Large Language Models. NeurIPS, 2024.
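One concrete instantiation of a Bayesian preference model is a Bradley-Terry likelihood over pairwise comparisons, with an approximate posterior over a reward head (for example via Monte Carlo dropout). The snippet below is a minimal, illustrative sketch of this idea in PyTorch; it is not the method of [1] nor the project's eventual implementation, and all names (e.g. MCDropoutRewardHead), dimensions, and hyperparameters are assumptions for illustration.

```python
# Minimal sketch: Bradley-Terry pairwise preference model with MC-dropout
# over a reward head. All names and sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MCDropoutRewardHead(nn.Module):
    """Maps a (frozen) LLM embedding of a response to a scalar reward.
    Dropout is kept active at inference time to approximate a posterior."""
    def __init__(self, embed_dim: int = 768, hidden_dim: int = 256, p_drop: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.net(emb).squeeze(-1)  # shape: (batch,)

def bradley_terry_nll(head, emb_chosen, emb_rejected):
    """Negative log-likelihood that the 'chosen' response beats the 'rejected' one."""
    margin = head(emb_chosen) - head(emb_rejected)
    return -F.logsigmoid(margin).mean()

@torch.no_grad()
def preference_probability_samples(head, emb_a, emb_b, n_samples: int = 20):
    """Monte Carlo samples of P(a preferred over b); dropout stays on."""
    head.train()  # keep dropout active so each pass is a posterior sample
    probs = torch.stack([
        torch.sigmoid(head(emb_a) - head(emb_b)) for _ in range(n_samples)
    ])
    return probs  # shape: (n_samples, batch); spread reflects epistemic uncertainty
```

The spread of the Monte Carlo samples of P(a preferred over b) is the epistemic uncertainty that the active learning and robust optimization components of the project would exploit.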
Focus
This project aims to investigate the following research questions:
• How can Bayesian methods improve uncertainty quantification in preference models for LLMs?
• How can uncertainty-aware preference models be used to enhance active learning strategies?
• How can Bayesian preference learning contribute to robust policy optimization in RLHF and other decision-making tasks?
The expected contributions include a novel framework for Bayesian preference learning in LLMs, an empirical analysis of its benefits, and potential applications in AI alignment, active learning, and robust policy optimization.
Method
To achieve these goals, the project will:
• Build on prior work from Melo et al. (2024) on Bayesian active learning.
• Utilize the act-pm library (internal) for active preference modeling.
• Implement Bayesian preference models for LLM fine-tuning.
• Develop active learning strategies that prioritize informative preference queries.
• Investigate robust policy optimization techniques to address reward over-optimization (reward hacking); an illustrative sketch of both ideas follows this list.
• Evaluate the framework on synthetic and real-world preference datasets.
• Compare Bayesian methods against standard preference learning baselines.
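To make the active learning and robustness steps concrete, the sketch below illustrates a BALD-style (mutual information) acquisition score for selecting preference queries, together with a simple uncertainty-penalized ("pessimistic") reward that could help mitigate reward over-optimization. It is illustrative only: it is not the act-pm API, and it assumes a function like preference_probability_samples from the earlier sketch that returns Monte Carlo samples of pairwise preference probabilities.

```python
# Illustrative sketch (not the act-pm API): BALD-style acquisition over candidate
# preference pairs, plus an uncertainty-penalized reward. Assumes
# `preference_probability_samples` from the earlier sketch.
import torch

def binary_entropy(p: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Entropy of a Bernoulli with success probability p."""
    p = p.clamp(eps, 1.0 - eps)
    return -(p * p.log() + (1 - p) * (1 - p).log())

def bald_scores(prob_samples: torch.Tensor) -> torch.Tensor:
    """BALD = H[mean p] - mean H[p]; prob_samples has shape (n_samples, n_candidates)."""
    mean_p = prob_samples.mean(dim=0)
    expected_entropy = binary_entropy(prob_samples).mean(dim=0)
    return binary_entropy(mean_p) - expected_entropy  # high = informative query

def select_queries(prob_samples: torch.Tensor, k: int = 16) -> torch.Tensor:
    """Pick the k candidate pairs with the highest mutual information."""
    return bald_scores(prob_samples).topk(k).indices

def pessimistic_reward(reward_samples: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    """Lower-confidence-bound reward: mean minus beta * std over MC samples.
    reward_samples has shape (n_samples, batch)."""
    return reward_samples.mean(dim=0) - beta * reward_samples.std(dim=0)
```

In this sketch, select_queries would be run over a pool of unlabeled candidate pairs to decide which comparisons to send to annotators, while pessimistic_reward (or a similar lower-confidence-bound objective) could replace the point-estimate reward during policy optimization to discourage exploiting regions where the preference model is uncertain.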