AI Safety and Synthetic Data
Abstract
Synthetic data is data generated artificially or algorithmically, as opposed to real data produced by real-world events. This project will explore the opportunities and risks of synthetic data, focusing on the safety of AI systems that use synthetic data in their development or evaluation. We are interested in a range of features (reasoning, alignment, factuality, etc.) and a range of application domains (WWW, social sciences/economics, health, finance, education, and other critical sectors). The student will perform empirical research by collecting novel data, conducting experiments to understand the effects of synthetic data, and proposing and evaluating solutions that address its unintended effects and weaknesses. Interested students are welcome to contact Naman Goel (naman.goel@cs.ox.ac.uk) to discuss the project or propose their own ideas related to the above.
Prerequisites: Good understanding of and hands-on experience with machine learning, mathematical maturity, and proficiency in Python. Experience with large language models (Hugging Face, running local models, using APIs of closed models, etc.), or the ability to learn quickly. An understanding of potential real-world harms will be a big plus.
References:
1. Liu, Ruibo, et al. "Best practices and lessons learned on synthetic data." First Conference on Language Modeling. 2024. https://openreview.net/pdf?id=OJaWBhh61C