DPhil the Future: Mathematical capabilities of ChatGPT - one year on

Final-year doctoral student Frieder Simon shares his evolving research into the impact and performance of large language models on the first anniversary of ChatGPT.  
 
Since its inception, ChatGPT has had a major impact around the globe, and in particular on various scientific domains. For mathematics, the paper "Mathematical Capabilities of ChatGPT" was the first and, to date, still one of the most extensive evaluations of ChatGPT (and also GPT-4) on advanced mathematics.  

The paper, which the author wrote together with an international team he assembled for the task, found that ChatGPT's mathematical abilities are significantly below those of an average mathematics graduate student.

"Our results show that ChatGPT often understands the question but fails to provide correct solutions." – DPhil student Frieder Simon

Evolution of AI research 

So how did this paper unfold and what new avenues of research connecting mathematics with AI have opened up since then? 

The initial research had impact, ranging from being included in a position paper of the Committee for Education, Research, and Technology Assessment by the German government, to being reported in various news outlets. It has been cited over 200 times to date and was one of the first papers to pave the way for subsequent studies on large language models (LLMs) that offer a broader analysis. 

Beyond simply assessing an ephemeral LLM that will soon be replaced, the original research paper aimed to: 

1) outline what a better benchmarking system for mathematics should look like; and  
2) show how to structure a diverse dataset that measures LLM performance along a number of axes, going beyond standard mathematical difficulty. 
 
Research at pace 

In hindsight, the paper's inception owed much to lucky timing: the author had previously worked on integrating mathematics with AI and holds double degrees (in pure mathematics and computer science), an ideal backdrop for writing the paper in a very short timeframe. ChatGPT was released on 30 November 2022, became known in expert circles in December 2022, and in January 2023 we published the preprint. This efficient process, spanning merely three weeks, encompassed everything from the initial idea to the formation of a collaborative team for human evaluation of the LLM's outputs, task delegation, writing, and finally submission to arXiv. 
 
However, the original paper had certain limitations, some of which were addressed in later works; for others, the work is ongoing. In the follow-up study, which the author co-wrote with another research group, "Evaluating Language Models for Mathematics through Interactions" (currently under review), multiple interactions with ChatGPT on the same topic were explored. By contrast, the original paper evaluated the performance of ChatGPT solely in a single-shot setting, where ChatGPT only had one chance to get it right. 
 
In "Large Language Models for Mathematicians" (accepted for publication), a subset of the authors of the original paper revisited it in a write-up aimed at an audience of mathematicians (rather than an audience of machine learning researchers). It provides more concrete examples of use cases in daily mathematical practice, a short survey of LLMs, as well as a formal write-up of one of the currently most widely used free LLMs, Llama 2, released by Meta this summer, abstractly describing how transformers process data. 
 
The author's most recent publication, "LLM vs. ITP", compares the mathematical knowledge of two critical pieces of technology that can help advance automation in mathematics: LLMs and ITPs. The latter, interactive theorem provers, are systems that allow mathematics to be written as "mathematical code" and checked for correctness. Since both LLMs and ITPs encode mathematics, although in different ways, such a comparison is intrinsically important. It is also of interest for tasks such as autoformalisation and formal proof generation, which apply LLMs in the context of ITPs, because it helps disentangle the difficulty of generating mathematical code, including ITP-specific formalisms, from the difficulty of the mathematics being formalised. 
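To make the idea of "mathematical code" concrete, here is a minimal sketch in Lean 4 (one widely used ITP; the theorem name is chosen for illustration). A mathematical statement and its proof are written as code, and the system's kernel checks the proof for correctness rather than a human referee:

```lean
-- Statement: addition of natural numbers is commutative.
-- The proof appeals to the library lemma Nat.add_comm;
-- Lean's kernel verifies that it really establishes the claim.
theorem my_add_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

Autoformalisation, in this framing, is the task of translating an informal pen-and-paper statement or proof into code of this kind, which is where LLMs enter the picture.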
 
One of the particularly tedious parts of writing the "Mathematical Capabilities of ChatGPT" paper was the human evaluation of over 1,600 outputs from various LLMs. One of the author's future research directions is reducing the human cost of such evaluations, as well as bridging graduate-level mathematics and machine learning. 

The future of AI 

More generally, the intersection of mathematics and AI is growing quickly: the number of entries to the NeurIPS '23 MATH-AI workshop doubled compared to the previous year.  

Recently, the AI-MO prize was announced for an AI capable of solving Olympiad-level mathematics problems. After hundreds of years of doing mathematical research in largely the same way (pen and paper, chalk and blackboard), the moment when digitalisation enters mathematical research seems to be just around the corner. Exciting times await mathematicians and machine learning researchers alike.