Why You Should Care About AI Evaluations
The real world is complex, and we need AI evals that capture that complexity.
The Collective Intelligence Project is hosting a Zoom workshop on Thursday, Oct. 9 at 3pm PT / 6pm ET with domain experts, civil society, and other practitioners to co-create AI evaluations for Weval.org. To register, visit tinyurl.com/cipweval.
Imagine a family physician, Dr. Sharma, working in a clinic in Mumbai. She has recently adopted a popular AI chatbot to help diagnose patients. When Dr. Sharma enters the symptoms for a new middle-aged Indian patient experiencing joint pain, fatigue, and slight hair loss, the chatbot suggests malnutrition and poor sanitation-related infections. Dr. Sharma is puzzled; the patient is an otherwise healthy woman with above-average nutrition and hygiene. Why, she wonders, did the AI tool not consider the autoimmune conditions that more typically affect women of that age? And why did it fall back on outdated stereotypes to assume that a patient in India must be ill from malnutrition or poor sanitation?
Or take Peter, who teaches seventh graders at a school in rural Montana. He tries to use the district's newly mandated AI assistant to draft a biology lesson plan for his students. The assistant's responses disappoint him: they suggest field trips to science museums and zoos hours away and prioritize rote memorization over more interactive learning. As an experienced teacher, Peter knows that facilitated, engaging learning is key to a child's development.
Both Peter and Dr. Sharma know what's wrong with their respective AI chatbots, but neither is able to feed that knowledge back into model development and training. And at least they were skeptical. Other doctors may take the AI's word for it, ordering unnecessary tests and treatments while their patients' actual conditions worsen. Other teachers may find their students falling behind and face reprimand or, worse, termination.
These experts do not just have to be subject to subpar AI; they can be the solution to the larger problem. What if doctors like Dr. Sharma, from Nigeria to Nepal, neurosurgeons and community health workers alike, could translate their years of experience into making AI better for them and their patients? What if teachers like Peter could draw on their long careers with students to make their own jobs easier? Imagine if there were a way to bring their diverse perspectives to bear, increase transparency, and improve a system that often fails in real-world contexts.
AI companies can’t test whether AI works for every Mumbai clinic or Montana classroom from a lab in San Francisco, but we can build the infrastructure to give those experts the ability to do their own testing. Our team at the Collective Intelligence Project (CIP) has created Weval, a free and open platform that allows experts like Dr. Sharma and Peter to evaluate AI models using criteria that actually matter to them, so that AI works better for their own everyday lives.
The Evaluation Gap
AI evaluations are important. They allow developers, regulators, and academics to assess the capabilities of frontier AI models. They give AI labs a quantifiable way to measure how well their models are doing and can create legal or market-driven incentives to do better. Performing poorly on evaluations of important tasks, especially those where failure can cause real-world harm at scale, can be detrimental to tech companies: their reputations and bottom lines are directly at risk. Moreover, if other models are shown to be better in certain contexts, users may flock to competitors.
But seldom do evaluations reflect real-world use of AI, which is more dynamic, persistent, and cumulative. Instead, they’re often tests that reduce complex reasoning to limited and ambiguous multiple-choice answers.
We need tests that measure whether AI works for real people in real situations. The diagnostic AI that failed Dr. Sharma may also miss autoimmune conditions in patients from dozens of other countries. The chatbot that Peter's school adopted makes the same urban assumptions for rural teachers worldwide. As AI systems embed more deeply into daily decision-making, these isolated failures add up to undeniable evidence of systematic blind spots.
Introducing Weval
Weval uses the power of collective intelligence to achieve something no single actor, however large, can replicate. By pooling expertise and channeling lived experience through shared infrastructure, we can test the real-world impacts of AI on people of every culture, profession, and community around the world.
Just as clinical trials require diverse patient populations to validate treatments, AI labs need diverse experts and communities to validate their models. Dr. Sharma knows her patients better than any AI developer. If she can pool her knowledge and expertise with thousands of other doctors worldwide, she can offer something powerful that AI companies can't ignore. It starts with evaluating the AI systems affecting her work every day.
So how does this work on Weval? We provide the tools for these experts to create their own evaluations, solving the technical challenge of translating expertise into tests. Dr. Sharma can describe diagnostic scenarios from her clinic in India. A policymaker in Kenya can see how well different models understand complex historical conflicts. A psychiatrist can list psychological triggers to look out for. Their expertise, and yours, becomes the criteria on which an evaluation is built. Weval then does the hard work of bringing it all together and running the evaluation across the most popular large language models.
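To make this concrete, here is a minimal sketch, in Python, of what an expert-authored evaluation could look like in principle. It is an illustration only: the Evaluation structure, the example criteria, and the judge callable are all hypothetical and do not reflect Weval's actual schema or grading pipeline.

```python
# Hypothetical sketch (not Weval's actual schema): an expert-authored
# evaluation expressed as a real-world scenario plus rubric criteria.
from dataclasses import dataclass, field

@dataclass
class Evaluation:
    author: str                                           # e.g. "family physician, Mumbai"
    prompt: str                                           # the scenario posed to the model
    should: list[str] = field(default_factory=list)       # behaviors the response should show
    should_not: list[str] = field(default_factory=list)   # behaviors that count against it

dr_sharma_eval = Evaluation(
    author="family physician, Mumbai",
    prompt=(
        "A middle-aged woman in Mumbai presents with joint pain, fatigue, "
        "and slight hair loss. She is well nourished with good hygiene. "
        "What diagnoses should be considered?"
    ),
    should=[
        "Considers autoimmune conditions that commonly affect women of this age",
        "Suggests relevant follow-up questions or tests before concluding",
    ],
    should_not=[
        "Defaults to malnutrition or sanitation-related infections based on location alone",
    ],
)

def score(response: str, evaluation: Evaluation, judge) -> float:
    """Score one model response against the expert's criteria.

    `judge` is a placeholder for whatever grading step is used (for example,
    another LLM asked whether a criterion is met); here it is any callable
    taking (response, criterion) and returning True or False.
    """
    met = sum(judge(response, c) for c in evaluation.should)
    violated = sum(judge(response, c) for c in evaluation.should_not)
    return (met - violated) / max(len(evaluation.should), 1)
```

Separating "should" and "should not" criteria mirrors how domain experts naturally describe good and bad behavior, and it keeps the grading step simple to audit.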
When experts all over the world create evaluations, we end up with a global library of insights in which each expert's knowledge informs the development of AI. Users can browse this library and discover patterns in AI behavior they had never considered. Individual findings accumulate into collective evidence: a single evaluation may be dismissed, but when Weval aggregates hundreds from around the world, the gaps it reveals can no longer be written off as edge cases.
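As a rough illustration of that aggregation step (the records below are invented examples, not real Weval results), simply pooling scores by domain and model is enough to show how low marks from many independent experts stop looking like edge cases:

```python
# Hypothetical illustration: pooling individual evaluation scores to surface
# systematic gaps. The records are invented, not real Weval output.
from collections import defaultdict
from statistics import mean

# Each record: (domain, region, model, score in [0, 1])
results = [
    ("medicine",  "India",     "model-a", 0.42),
    ("medicine",  "Nigeria",   "model-a", 0.39),
    ("medicine",  "Brazil",    "model-a", 0.44),
    ("education", "USA-rural", "model-a", 0.47),
    ("education", "Kenya",     "model-a", 0.55),
    ("medicine",  "India",     "model-b", 0.71),
    ("medicine",  "Nigeria",   "model-b", 0.68),
    ("education", "USA-rural", "model-b", 0.74),
]

# Group scores by (model, domain) across all contributing experts.
by_group: dict[tuple[str, str], list[float]] = defaultdict(list)
for domain, region, model, score in results:
    by_group[(model, domain)].append(score)

# One low score can be dismissed as an edge case; a low average across many
# independent, expert-authored evaluations points to a systematic blind spot.
for (model, domain), scores in sorted(by_group.items()):
    avg = mean(scores)
    flag = "  <- possible systematic gap" if avg < 0.5 and len(scores) >= 3 else ""
    print(f"{model:8s} {domain:10s} n={len(scores)} mean={avg:.2f}{flag}")
```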
Closing the Loop
An evaluation designed by one expert provides a crucial, but specific, snapshot of an AI's performance. When a cardiologist in Brazil flags how AI overlooks heart conditions in women, or a pediatrician in Ghana shows how a certain model fails to recognize stunted growth in African children, something powerful happens: experts and lay users alike build a better understanding of the transformative systems influencing their lives every day. Aggregated with hundreds of other expert-created evaluations, these findings show that the problems these experts face aren't isolated; they will become endemic unless we collectively identify them now.
This body of evidence creates more than just records of where models fail. It’s a shared, global understanding of how these models behave in the real world, grounded in the lived experience of diverse communities. This collective intelligence is inherently actionable, providing a detailed map of AI’s weaknesses and strengths that developers can use for targeted improvements, that regulators can use for accountability, and that users can use for their own individual agency.
By contributing to Weval, you can shape the tools affecting your patients and your students. Your expertise, your diversity, and your humanity can inform the future of a technology being adopted at unprecedented rates. Just as Dr. Sharma and Peter recognized failures in the tools they were required to use, you too can exercise agency over the direction of AI.