
AI Safety and Theoretical Computer Science

Scott Aaronson (University of Texas at Austin & OpenAI)

Title: AI Safety and Theoretical Computer Science

Abstract: Progress on AI safety and alignment, like the current AI revolution more generally, has been almost entirely empirical.  In this talk, however, I'll survey a few areas where I think theoretical computer science can contribute to AI safety, including:

- How can we robustly watermark the outputs of Large Language Models and other generative AI systems, to help identify academic cheating, deepfakes, and AI-enabled fraud?  I'll explain my proposal and its basic mathematical properties, as well as what remains to be done (an illustrative sketch follows this list).

- Can one insert undetectable cryptographic backdoors into neural nets, for good or ill?  In what senses can those backdoors also be unremovable?  How robust are they against fine-tuning?

- Should we expect neural nets to be "generically" interpretable?  I'll discuss a beautiful formalization of that question due to Paul Christiano, along with some initial progress on it, and an unexpected connection to quantum computing.
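To make the watermarking bullet concrete, here is a minimal, hypothetical sketch of a pseudorandom watermarking scheme of the kind the talk discusses: a secret-keyed pseudorandom function biases which token the model emits in a way that, in expectation over the key, leaves the output distribution unchanged, while a detector who knows the key can later score a text for that bias.  The names (prf, watermarked_sample, detection_score), the hash-based PRF, and the context-window parameter are illustrative assumptions, not details of the actual proposal.

import hashlib
import math
from typing import Sequence

def prf(key: bytes, context: Sequence[int], token_id: int) -> float:
    # Hypothetical PRF: map (secret key, recent context, candidate token)
    # to a deterministic pseudorandom value strictly inside (0, 1).
    data = key + repr((tuple(context), token_id)).encode("utf-8")
    digest = hashlib.sha256(data).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / (2**64 + 2)

def watermarked_sample(probs: dict, key: bytes, context: Sequence[int]) -> int:
    # Pick the token t maximizing r_t ** (1 / p_t), where r_t = prf(...) and
    # p_t is the model's probability for t.  Over a random key, each token is
    # still chosen with probability p_t, so the output distribution is
    # unchanged in expectation, yet the choice is strongly correlated with r_t.
    return max(probs, key=lambda t: prf(key, context, t) ** (1.0 / probs[t]))

def detection_score(tokens: Sequence[int], key: bytes, window: int = 4) -> float:
    # Detector's statistic: sum of -ln(1 - r_t) over the observed tokens.
    # For text generated without the key this behaves like a sum of Exp(1)
    # variables (mean about 1 per token); watermarked text scores higher.
    score = 0.0
    for i, t in enumerate(tokens):
        r = prf(key, tokens[max(0, i - window):i], t)
        score += -math.log(1.0 - r)
    return score

# Toy usage with a made-up 3-token vocabulary and fixed next-token probabilities.
if __name__ == "__main__":
    key = b"secret-detector-key"
    probs = {0: 0.5, 1: 0.3, 2: 0.2}
    generated = []
    for _ in range(50):
        generated.append(watermarked_sample(probs, key, generated[-4:]))
    print("detection score:", detection_score(generated, key))

In this toy run, the detector's score on the watermarked token sequence should sit well above the roughly 50 expected for unwatermarked text of the same length; a real deployment would compare the score against a calibrated threshold to control false positives.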

