Experts Warn of Serious Flaws in Crowdsourced AI Benchmarks
AI companies such as OpenAI, Google, and Meta increasingly rely on crowdsourced platforms like Chatbot Arena to gauge the strengths and weaknesses of their new models. Some experts, however, see serious flaws in these benchmarks, raising questions about the accuracy and reliability of the process.
These experts argue that the method is neither fully scientific nor ethical. Emily Bender, a linguistics professor at the University of Washington and co-author of the book “The AI Con”, is particularly concerned. Platforms like Chatbot Arena, she says, cannot measure a model’s real understanding, because verdicts rest solely on users’ preferences rather than on any fixed standard.
Asmelash Hadgu, co-founder of an AI company named Lesan, says that many AI labs use these benchmark platforms to exaggerate their achievements. He pointed to Meta’s Llama 4 Maverick model: the company withheld the version that scored well and released a weaker version to the public instead.
Hadgu believes that benchmarking should not be static or controlled from a single place. It should be developed in collaboration with a range of institutions and universities so that it can better serve specific domains such as education and healthcare.
Both Christine Gloria, a former technology expert at the Aspen Institute, and Hadgu believe that the people who test models should be compensated for their work. Gloria added that AI companies should learn from the mistakes of the data-labeling industry, where workers are often exploited.
Gloria also considers crowdsourced benchmarking a good step, but cautions that it should not become the sole criterion for evaluating a model: technology is changing so fast that today’s benchmarks may prove useless tomorrow.
Gray Swan AI CEO Matt Fredrickson says that volunteers come to his platform not only for rewards but also to learn something new and sharpen their skills. Still, he conceded that public benchmarks can never replace professional private testing.
He believes developers should also have their models evaluated by professional red teams and domain experts, and, most importantly, should communicate whatever results emerge clearly to the public.
OpenRouter CEO Alex Atallah and UC Berkeley researcher Wei-Lin Chiang agree that open testing alone is not enough. Chiang, a co-founder of LMArena (the team that operates Chatbot Arena), says its aim is to provide an open, trusted space where the community’s opinions can surface.
On the Meta controversy, he said the incident was not a design flaw in Chatbot Arena; rather, AI labs had misread its policy. LMArena has since updated its policies to prevent such missteps in the future.
Finally, Chiang said: “Our community is not just volunteers or model testers. People come here to understand AI and provide feedback. As long as our leaderboard reflects the real opinions of the community, we welcome it wholeheartedly.”