Enhancing Worst-Case Safety in Large Language Models through Influence Functions and Backdoor Detection

Supervisors

Suitable for

MSc in Advanced Computer Science
Mathematics and Computer Science, Part C
Computer Science and Philosophy, Part C
Computer Science, Part C
Computer Science, Part B

Abstract

Large language models have advanced rapidly across many domains, but concerns about their safety in worst-case scenarios persist [1]. Interpretability methods are used to understand these models' decisions, but they have limitations: they focus mainly on average-case behaviour and lack ground truth [2]. This project aims to overcome these limitations by injecting backdoors into language models to provide ground truth and by developing an efficient influence-function-based method for backdoor detection. In doing so, we will build on recent progress in scaling influence functions to large language models [3]. The project is designed to lead to publication, and we are looking for a highly motivated student. We will be able to collaborate with Dr. Ameya Prabhu (Tuebingen University) on this project.
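To make the detection idea concrete, the sketch below computes first-order influence scores (gradient dot products, with the inverse-Hessian factor of the EK-FAC approach in Grosse et al. [3] replaced by the identity) on a toy classifier with a planted trigger. The model, synthetic data, and trigger are illustrative assumptions only, not the project's actual setup or a definitive implementation.

```python
# Minimal sketch: first-order influence scores as a proxy for
# influence functions in backdoor detection. All names and the toy
# setup below are illustrative assumptions.
import torch

torch.manual_seed(0)

# Synthetic training set: 200 examples, 10 features, binary labels.
# A "backdoor" is planted in the last 5 examples: feature 0 is set
# high (the trigger) and the label is flipped to the target class 1.
X = torch.randn(200, 10)
y = (X[:, 1] > 0).long()
X[-5:, 0] = 5.0
y[-5:] = 1

model = torch.nn.Linear(10, 2)
loss_fn = torch.nn.CrossEntropyLoss()

# Train briefly so the model absorbs both the clean signal and the backdoor.
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(200):
    opt.zero_grad()
    loss_fn(model(X), y).backward()
    opt.step()

def flat_grad(loss):
    """Flatten the gradient of `loss` w.r.t. all model parameters."""
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

# Query: an input carrying the trigger, labelled with the backdoor target.
x_query = torch.randn(1, 10)
x_query[0, 0] = 5.0
g_query = flat_grad(loss_fn(model(x_query), torch.tensor([1])))

# First-order influence: dot product of the query gradient with each
# training example's gradient (the inverse-Hessian factor is dropped).
scores = torch.stack([
    g_query @ flat_grad(loss_fn(model(X[i:i+1]), y[i:i+1]))
    for i in range(len(X))
])

# If the backdoor shapes the query prediction, the poisoned examples
# (indices 195-199) should rank among the most influential.
print(torch.topk(scores, k=5).indices.tolist())
```

In the project itself, the training examples would be language-model training documents and the query a backdoor-triggered prompt; the scoring would use a scalable curvature approximation rather than the identity used in this toy example.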

[1] Carlini et al., "Are aligned neural networks adversarially aligned?", June 2023. arXiv:2306.15447 [cs]. http://arxiv.org/abs/2306.15447

[2] Conmy et al., "Towards Automated Circuit Discovery for Mechanistic Interpretability", October 2023. arXiv:2304.14997 [cs]. http://arxiv.org/abs/2304.14997

[3] Grosse et al., "Studying Large Language Model Generalization with Influence Functions", August 2023. arXiv:2308.03296 [cs.LG]. https://arxiv.org/abs/2308.03296