“The only principle that does not inhibit progress is: anything goes,” Hardt said. This captures the spirit of AI research, where open-ended exploration is the norm. But, as he pointed out, this freedom also creates a problem: how can we measure the success of new models without some structure? That’s the job of benchmarks.
During his talk, Hardt broke down how benchmarks have brought some order to this “anything goes” mindset. He introduced the “iron rule” of competitive empirical testing as the standard method for settling debates and tracking progress. The concept is simple: agree on a metric, pick a dataset, and let the models compete. This approach has driven AI forward for years, from the days of the MNIST dataset to the rise of ImageNet.
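To make the "iron rule" concrete, here is a minimal sketch of a benchmark in code: a fixed dataset, a fixed metric, and two models competing under identical conditions. The scikit-learn digits dataset and the two classifiers are illustrative placeholders, not anything from Hardt's talk.

```python
# A minimal sketch of the "iron rule": fix a dataset and a metric,
# then let the models compete on identical terms.
# Dataset, models, and split are placeholder choices for illustration.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Step 1: agree on a dataset (a small MNIST-like digits set).
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Step 2: agree on a metric (held-out accuracy).
metric = accuracy_score

# Step 3: let the models compete under identical conditions.
contenders = {
    "logreg": LogisticRegression(max_iter=5000),
    "knn": KNeighborsClassifier(n_neighbors=3),
}
for name, model in contenders.items():
    model.fit(X_train, y_train)
    score = metric(y_test, model.predict(X_test))
    print(f"{name}: {score:.3f}")
```

The point of the setup is that disagreement gets settled by the leaderboard, not by argument: whichever model scores higher on the agreed metric wins the round.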
But benchmarks have changed over time, too. What started as single-task benchmarks has grown into multi-task evaluations such as MMLU and BigBench. Hardt raised concerns about these newer benchmarks, questioning whether model rankings produced under one set of conditions hold up under different conditions. It's a crucial point, because it challenges how reliable our benchmarks really are.
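One way to make that ranking-stability concern concrete is to compare how a set of models is ordered under two different evaluation conditions, for example with a rank correlation. The sketch below uses made-up scores purely for illustration; the numbers and conditions are not from the talk.

```python
# A sketch of the ranking-stability question: if we score the same models
# under two evaluation conditions, do the resulting rankings agree?
# All scores below are hypothetical placeholders.
from scipy.stats import kendalltau

models = ["model_a", "model_b", "model_c", "model_d"]

# Hypothetical accuracies under two evaluation setups
# (e.g., different prompts or different task subsets of a multi-task suite).
scores_condition_1 = [0.72, 0.68, 0.81, 0.60]
scores_condition_2 = [0.70, 0.74, 0.79, 0.63]

# Rank the models within each condition (higher score = better rank).
rank_1 = sorted(models, key=lambda m: -scores_condition_1[models.index(m)])
rank_2 = sorted(models, key=lambda m: -scores_condition_2[models.index(m)])

# Kendall's tau measures how well the two orderings agree:
# 1.0 means identical rankings; values near 0 mean the ranking did not hold up.
tau, _ = kendalltau(scores_condition_1, scores_condition_2)
print("ranking under condition 1:", rank_1)
print("ranking under condition 2:", rank_2)
print(f"Kendall's tau = {tau:.2f}")
```

If small changes in the evaluation setup reorder the leaderboard, the ranking tells us less about the models and more about the conditions we happened to choose, which is exactly the worry Hardt raised.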
Despite these concerns, Hardt emphasized that benchmarks still play a key role in advancing AI research. He called on the community to continue refining these tools, balancing the freedom of exploration with the need for structured evaluation.