Data Leakage in Large Language Models through Prompting
Supervisors
Suitable for
Abstract
Many online services rely on large language models fine-tuned with concealed system prompts detailing operational behavior. Adversaries may attempt to synthesize input to uncover these hidden prompts and manipulate the model, posing security risks. Instances of online games and competitions already exist, attempting to reveal hidden passwords in system prompts. The project is about formalizing this problem and developing systematic approaches to generating adversarial prompts that reveal vulnerabilities. This includes optimizing prompts to elicit affirmative responses indicative of data leakage. The goal is to identify minimal adversarial prompts that can bypass restrictions on prompt queries, enhancing security.