The Emerging Science of Benchmarks: Moritz Hardt on AI’s Open-Ended Progress
Artificial intelligence has always been about experimentation and pushing boundaries, guided by the idea that “anything goes.” But in a field that moves so fast and has so many branches, how do we really know we’re making progress? That’s where benchmarks come in, helping to guide researchers and keep us all on the same page.
At the recent Hi! PARIS Summer School, Moritz Hardt from the Max Planck Institute for Intelligent Systems explained the role of benchmarks in AI research.
“The only principle that does not inhibit progress is: anything goes,” Hardt said, quoting the philosopher of science Paul Feyerabend. The line captures the spirit of AI research, where open-ended exploration is the norm. But, as he pointed out, this freedom also creates a problem: how can we measure the success of new models without some shared structure? That’s the job of benchmarks.
During his talk, Hardt broke down how benchmarks have brought some order to this “anything goes” mindset. He introduced the “iron rule” of competitive empirical testing as the standard method for settling debates and tracking progress. The concept is simple: agree on a metric, pick a dataset, and let the models compete. This approach has driven AI forward for years, from the days of the MNIST dataset to the rise of ImageNet.
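To make the "iron rule" concrete, here is a minimal sketch of what competitive empirical testing boils down to in code: fix a test set, agree on a metric, score every contestant on the same data, and rank them. The models, data, and helper names below are illustrative stand-ins, not any real benchmark's API.

```python
# Minimal sketch of competitive empirical testing: one fixed dataset,
# one agreed-upon metric, and a ranking of competing models by score.

def accuracy(predictions, labels):
    """Agreed-upon metric: fraction of correct predictions."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def run_benchmark(models, inputs, labels):
    """Score every model on the same fixed test set and sort by the metric."""
    scores = {name: accuracy([model(x) for x in inputs], labels)
              for name, model in models.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical contestants: trivial classifiers standing in for real models.
models = {
    "always_zero": lambda x: 0,
    "threshold":   lambda x: int(x > 0.5),
}
inputs = [0.1, 0.4, 0.6, 0.9]
labels = [0, 0, 1, 1]

for rank, (name, score) in enumerate(run_benchmark(models, inputs, labels), start=1):
    print(f"{rank}. {name}: {score:.2f}")
```

Everything interesting about a benchmark, from MNIST to ImageNet, lives in the two things this loop takes as given: which dataset and which metric the community agrees to hold fixed.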
But benchmarks have changed over time, too. What started as single-task benchmarks has now grown into multi-task evaluations like MMLU and BIG-bench. Hardt raised some concerns about these newer benchmarks, questioning whether model rankings produced under one set of evaluation conditions carry over to others. It’s a crucial point, because it asks how reliable our benchmark rankings really are.
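One way to make that ranking-stability question tangible is to compare the leaderboards produced by two evaluation conditions (say, two task mixtures or prompt formats) and measure how much they agree. The toy scores below are made up purely for illustration; only the comparison logic matters.

```python
# Toy illustration of ranking stability: do two evaluation conditions
# order the same models the same way? Agreement of 1.0 means identical
# rankings, -1.0 means fully reversed rankings.

from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall rank correlation between two score dicts over the same models."""
    concordant = discordant = 0
    for m1, m2 in combinations(list(scores_a), 2):
        a = scores_a[m1] - scores_a[m2]
        b = scores_b[m1] - scores_b[m2]
        if a * b > 0:
            concordant += 1
        elif a * b < 0:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical leaderboard scores under two evaluation setups.
condition_1 = {"model_a": 0.71, "model_b": 0.69, "model_c": 0.55}
condition_2 = {"model_a": 0.64, "model_b": 0.70, "model_c": 0.52}

print(f"Rank agreement (Kendall's tau): {kendall_tau(condition_1, condition_2):.2f}")
```

If small, reasonable changes to the evaluation setup reshuffle the leaderboard, the benchmark is telling us less about the models than it appears to.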
Despite these concerns, Hardt emphasized that benchmarks still play a key role in advancing AI research. He called on the community to continue refining these tools, balancing the freedom of exploration with a need for structured evaluation.
Moritz Hardt (Max Planck Institute for Intelligent Systems) delivering a keynote at the Hi! PARIS Summer School 2024.
So, why do benchmarks matter?
They matter because they give us a way to measure where we are in AI research. They promote both collaboration and competition, pushing researchers to improve their models. More importantly, they help us understand the strengths and weaknesses of AI technologies, steering the field toward real-world applications.
As we move into an era of more complex, adaptable models, benchmarks will keep evolving. But their role as a guiding framework stays the same: they help us figure out what works and where to go next.
Moritz Hardt’s talk was a strong reminder that while AI research thrives on freedom and creativity, benchmarks are what keep us on track.
Stay tuned for more insights from Hi! PARIS Summer School 2024 and the latest in AI research!