New research warns of potential ‘collapse’ of machine learning models
Posted: 25th July 2024
A new study published in Nature by a research group headed by Associate Professor Yarin Gal warns of a major roadblock for the machine learning models of the future, in which errors build over time and ultimately lead to a phenomenon the researchers term ‘model collapse’.
Machine learning models such as Google’s Gemini and OpenAI’s ChatGPT have paved the way for a wide range of applications and tools that have revolutionised daily tasks, from generating poetry to helping draft emails. The rapidly advancing performance of these models is generally attributed to efficient hardware and high-quality training data. Yet the study found that, while hardware is improving, the same cannot be said of data: the widespread use of machine learning models means that an increasing amount of AI-generated data is produced without human oversight, fundamentally altering the way the models learn.
Led by Dr Ilia Shumailov, Senior Research Fellow in Associate Professor Gal’s Oxford Applied and Theoretical Machine Learning (OATML) group, the study found that training AI models on data generated by previous models leads to long-term learning problems. Models degrade in quality and can ultimately fail when trained on recursively generated data, that is, when they ingest data that earlier models produced themselves. The study attributes this to the gradual build-up of the minor errors and misconceptions that are inherent to machine learning models: each newly trained model reproduces the errors of its predecessors and adds slight errors of its own, and over successive generations these accumulate until the models collapse.
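The dynamic can be illustrated with a toy simulation. The sketch below is an illustrative assumption, not the experimental setup reported in the paper (which trained language models): each ‘generation’ fits a simple Gaussian model to data sampled from the previous generation’s fitted model, so finite-sample estimation errors compound and the learned distribution gradually drifts away from the original data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "real" human data drawn from the true distribution.
true_mean, true_std = 0.0, 1.0
data = rng.normal(true_mean, true_std, size=100)

# Each generation fits a model (here just a Gaussian) to the previous
# generation's output, then produces the next generation's training data
# by sampling from that fitted model. Small estimation errors compound,
# so the estimated distribution drifts over successive generations.
for generation in range(1, 21):
    est_mean, est_std = data.mean(), data.std()      # "train" on current data
    data = rng.normal(est_mean, est_std, size=100)   # generate the next dataset
    print(f"generation {generation:2d}: mean={est_mean:+.3f}, std={est_std:.3f}")
```

Running this for a few dozen generations typically shows the estimated mean and standard deviation wandering away from the true values, a toy analogue of how models trained on their own output drift from the original data distribution.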
“Model collapse is the AI equivalent of a feedback loop gone wrong. The more models feed on their own output, the further they drift from reality. Model collapse threatens to create an AI echo chamber,” said Associate Professor Yarin Gal.
The study has significant implications for the future of these rapidly evolving and widely used AI models, raising questions about their robustness and efficacy. It highlights the importance of maintaining access to original, human-created data for training future machine learning models in order to guard against collapse. It also emphasises the need for attribution and provenance of data, particularly as it becomes increasingly difficult to distinguish genuine human-created data from LLM-generated online content.
Co-authors of the study, entitled AI models collapse when trained on recursively generated data, include Associate Professor Gal at the Department of Computer Science, Zakhar Shumaylov and Professor Ross Anderson at the University of Cambridge, Yiren Zhao at Imperial College London, and Nicolas Papernot, Associate Professor at the University of Toronto.
Read the full study at: https://www.nature.com/articles/s41586-024-07566-y