Notes on building collective intelligence into evals
Incorporating collective intelligence into context-specific evaluation
Evaluations are quietly shaping AI. Results can move billions in investment, shape regulation, and influence public trust. Yet most evals tell us little about how AI systems perform in and impact the real world. At CIP we are exploring ways that collective input (public, domain expert, and regional) can help close that gap. Rough thoughts below.
1. Evaluation needs to be highly context-specific, which is hard. Labs have built challenging benchmarks for reasoning and generalization (ARC-AGI, GPQA, etc.), but most still focus on decontextualized problems. What they miss is how models perform in situated use: sustaining multi-hour therapy conversations, tutoring children across languages around the world, mediating policy deliberations, and shaping political discourse in real time. These contexts redefine what ‘good performance’ means.
2. Technical details can swing results. Prompt phrasing, temperature settings, even the way answer options are enumerated can cause substantial performance variations. Major investment and governance decisions are being made on measurements that are acutely sensitive to implementation details. We’ve previously written about some of these challenges and ways to address them.
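To make this concrete, here is a minimal sketch of the kind of sensitivity check we mean: score the same items under a few prompt templates and temperatures and look at the spread. The `ask` callable, templates, and settings are illustrative assumptions, not a reference implementation.

```python
# Minimal sensitivity sketch: score identical items under several prompt
# templates and temperatures, then report the spread. `ask` stands in for
# whatever model client you actually use.
from itertools import product
from statistics import mean
from typing import Callable, Iterable, Tuple

TEMPLATES = [
    "Question: {q}\nAnswer with the letter only.",
    "{q}\n\nChoose one option (A-D).",
]
TEMPERATURES = [0.0, 0.7]

def accuracy(items: Iterable[Tuple[str, str]],
             ask: Callable[[str, float], str],
             template: str,
             temperature: float) -> float:
    """Fraction of (question, gold letter) items answered correctly under one setting."""
    items = list(items)
    correct = sum(
        ask(template.format(q=q), temperature).strip().upper().startswith(gold)
        for q, gold in items
    )
    return correct / len(items)

def sensitivity_report(items, ask) -> dict:
    """Score every template/temperature combination and print the spread."""
    results = {
        (tpl, temp): accuracy(items, ask, tpl, temp)
        for tpl, temp in product(TEMPLATES, TEMPERATURES)
    }
    scores = list(results.values())
    # If this spread rivals the gap between two models, the ranking is an artifact.
    print(f"mean={mean(scores):.3f}  spread={max(scores) - min(scores):.3f}")
    return results
```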
3. Fruitful comparison is almost impossible. Model cards list hundreds of evaluations, but without standardized documentation of the prompts, parameters, and procedures behind them, comparing across models is scientifically questionable. We can’t distinguish genuine differences from evaluation artifacts.
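One lightweight fix is to attach the full recipe to every reported number. A sketch of what such a record could contain; the field names are our own illustration, not an existing standard.

```python
# A minimal record of "how this number was produced", serialized alongside the score.
from dataclasses import dataclass, field, asdict
import json

@dataclass
class EvalRecord:
    benchmark: str              # e.g. "GPQA"
    model: str                  # model identifier and version
    prompt_template: str        # the exact template, not a paraphrase
    few_shot_examples: int
    temperature: float
    max_tokens: int
    scoring_procedure: str      # e.g. "exact match on extracted letter"
    num_items: int
    score: float
    extra: dict = field(default_factory=dict)  # anything else that could move the number

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)
```

With records like this attached to every reported result, two model cards can at least be compared on whether they measured the same thing.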
4. Evals are fragmented, and no single entity is positioned to solve this. Labs run proprietary internal evals, but they can’t build evals for every possible context and domain worldwide. Academic benchmarks are often static, quickly outdated, and buried in research papers and GitHub repos. Third-party evaluators only measure what they’re hired to measure. In practice, we can think of evals in three categories:
Capability evals (reasoning, coding, math), which measure raw problem-solving.
Risk evals (jailbreaks, alignment, misuse), which probe safety and misuse potential.
Contextual evals (domain- or culture-specific), which test performance in particular settings.
The first two categories are advancing quickly, but the third is still underdeveloped.
5. Communities need to be able to test models for their own use cases. Civic activists in Taiwan want to know whether a model can summarize policy proposals without partisan or geopolitically messy tilt. City builders in Bhutan need to test whether AI can help plan infrastructure that respects local culture and constraints. Domestic violence support networks in Mexico need to know whether a system can sustain trauma-informed counseling in Spanish. Current benchmarks don’t answer these questions. Communities need the tools and infrastructure to create their own evaluations.
6. This is a collective intelligence problem. Smart people in a few places can’t create comprehensive evals for every use case. The solution is generative: give people the tools to create their own tests, and aggregate them so that narrow insights become broad-based understanding.
7. The latent knowledge exists; we just need to get to it. We’ve spoken to individuals and organizations across the world, and they hold a huge amount of tacit knowledge about deployment successes and failures. We need a better way to harness it. The pieces are there, and CIP is putting them together.
Every month, AI systems become more embedded in people’s lives, regardless of whether we understand them. At CIP we’re building Weval.org, an open infrastructure for contextual evaluations. Current blueprints already cover domains like mental health safety and global nuance, ambiguous question summarization, and regional legal reasoning, each with rubrics, prompt templates, and validation protocols designed to capture real-world stakes. By aggregating these contributions into composite leaderboards, Weval transforms scattered local insights into a living, auditable evaluation ecosystem that can evolve alongside the models themselves.
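For illustration only (this is not Weval’s actual schema or aggregation method), here is one way scattered, domain-specific results could roll up into a composite per-model view:

```python
# Illustrative composite-leaderboard aggregation: many narrow, community-authored
# evals roll up into one score per model. Data and weighting scheme are made up.
from collections import defaultdict
from statistics import mean

# Each contributed result: (model, domain, score in [0, 1]).
results = [
    ("model-a", "mental_health_safety", 0.82),
    ("model-a", "regional_legal_reasoning", 0.64),
    ("model-b", "mental_health_safety", 0.77),
    ("model-b", "regional_legal_reasoning", 0.71),
]

def composite_leaderboard(rows, weights=None):
    """Average per-domain scores per model, optionally weighting domains."""
    per_model = defaultdict(lambda: defaultdict(list))
    for model, domain, score in rows:
        per_model[model][domain].append(score)
    board = {}
    for model, domains in per_model.items():
        domain_means = {d: mean(scores) for d, scores in domains.items()}
        w = weights or {}
        total = sum(w.get(d, 1.0) for d in domain_means)
        board[model] = sum(w.get(d, 1.0) * m for d, m in domain_means.items()) / total
    # Highest composite score first.
    return dict(sorted(board.items(), key=lambda kv: kv[1], reverse=True))

print(composite_leaderboard(results))
```

How domains are weighted is itself a collective intelligence question: the communities contributing the evals are best placed to say how much each context should count.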