Enhancing Worst-Case Safety in Large Language Models through Influence Functions and Backdoor Detection
Supervisors
Suitable for
Abstract
Large language models have advanced rapidly across many domains, but concerns about their safety in worst-case scenarios persist [1]. Interpretability methods are used to understand these models' decisions but have important limitations: they mainly focus on average-case behavior and lack ground truth [2]. This proposal aims to overcome these limitations by injecting backdoors into language models to provide ground truth, and by developing an efficient influence-function-based method for backdoor detection. In doing so, we will leverage recent progress in scaling influence functions to large language models [3]. This project is designed to lead to a publication. We are looking for a highly motivated student. We will collaborate with Dr. Ameya Prabhu (Tuebingen University) on this project.
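To illustrate the core idea, here is a minimal sketch in a deliberately simplified setting: instead of a language model, we plant a "backdoor" trigger feature in the training data of a logistic-regression model, then score each training example with the classical influence-function approximation I(z_i, z_test) = -g_test^T H^{-1} g_i. This toy setup, the trigger construction, and all variable names are illustrative assumptions, not the proposed method for LLMs (which would rely on scalable approximations such as those in [3]); the point is only that poisoned examples should surface as the most influential on a triggered query's backdoor behavior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a backdoored model: the last feature acts as a trigger.
# Poisoned points carry the trigger and are relabeled to the backdoor target.
n_clean, n_poison, d = 200, 8, 3

X_clean = rng.normal(size=(n_clean, d))
X_clean[:, -1] = 0.0                         # trigger off for clean data
y_clean = (X_clean[:, 0] > 0).astype(float)  # true task: sign of feature 0

X_poison = rng.normal(size=(n_poison, d))
X_poison[:, -1] = 1.0                        # trigger on
y_poison = np.ones(n_poison)                 # backdoor target class

X = np.vstack([X_clean, X_poison])
y = np.concatenate([y_clean, y_poison])
n = len(y)

# Train logistic regression by full-batch gradient descent.
w = np.zeros(d)
for _ in range(2000):
    p = 1 / (1 + np.exp(-X @ w))
    w -= 0.5 * X.T @ (p - y) / n

# Influence functions: I(z_i, z_test) = -g_test^T H^{-1} g_i,
# with a small damping term added to the Hessian for numerical stability.
p = 1 / (1 + np.exp(-X @ w))
H = (X * (p * (1 - p))[:, None]).T @ X / n + 1e-3 * np.eye(d)
G = X * (p - y)[:, None]                     # per-example loss gradients

# Triggered query: clean label would be 0, but the trigger is active.
x_test = np.array([-1.5, 0.0, 1.0])
y_target = 1.0                               # the backdoor target label
p_test = 1 / (1 + np.exp(-x_test @ w))
g_test = (p_test - y_target) * x_test

influence = -G @ np.linalg.solve(H, g_test)

# Poisoned examples should most strongly *reduce* the loss on the backdoor
# target, i.e. have the most negative influence scores.
suspects = np.argsort(influence)[:n_poison]
print(sorted(suspects))
```

In this sketch the injected backdoor provides exactly the kind of ground truth the proposal calls for: since we know which indices were poisoned, we can measure how well the influence ranking recovers them.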
[1] Carlini et al., Are aligned neural networks adversarially aligned?, June 2023. http://arxiv.org/abs/2306.15447. arXiv:2306.15447 [cs]
[2] Conmy et al., Towards Automated Circuit Discovery for Mechanistic Interpretability, October 2023. http://arxiv.org/abs/2304.14997. arXiv:2304.14997 [cs]
[3] Grosse et al., Studying Large Language Model Generalization with Influence Functions, August 2023. https://arxiv.org/abs/2308.03296. arXiv:2308.03296 [cs.LG]