Achieving Superalignment through Weak-to-Strong Generalization

Supervisors

Suitable for

MSc in Advanced Computer Science
Mathematics and Computer Science, Part C
Computer Science and Philosophy, Part C
Computer Science, Part C
Computer Science, Part B

Abstract

The capabilities of AI systems have grown significantly in a short time. Advances in synthetic data generation, self-play, and other techniques are set to improve their performance further, potentially resulting in superhuman capabilities. This poses an important safety question: how can humans supervise future systems that are much smarter than they are? Whereas current systems are aligned using human data, tasks that are too hard for humans to solve will require progress in the nascent field of superalignment [1]. In this project, we will critically examine existing superalignment frameworks [2] and leverage the latest work in semi-supervised learning and other fields to advance our understanding of superalignment. This project is designed to lead to publication. We are looking for a highly motivated student.

For this project, we will be able to receive advice from Collin Burns (OpenAI).

[1] https://openai.com/blog/introducing-superalignment

[2] https://cdn.openai.com/papers/weak-to-strong-generalization.pdf