The Digital Twin Evaluation Framework
A benchmark for AI representativeness.
The concept of “Digital Twins” - a virtual model designed to accurately replicate the characteristics and behaviors of a real-world counterpart - is hardly new. For decades, engineers have used digital replicas to stress-test jet engines and model urban power grids, while corporations have leveraged datasets to algorithmically predict consumer behavior. Governments and industries are already comfortable simulating the world to optimize it.
But we are entering a new phase: moving from bespoke, “big data” statistical models to general-purpose AI agents capable of simulating complex, open-ended human behaviors. As organizations increasingly reach for off-the-shelf frontier models to simulate population dynamics or test policy, and as individuals deploy agents to act as their proxies in a multi-agent world, a critical question arises:
How do we know these models are actually telling the truth about us?
If a government uses an AI model to simulate how rural voters will react to a climate policy, a poorly representative model could lead to policies that are ineffective, unrepresentative, or that provoke unforeseen backlash.
If you delegate a negotiation to an AI agent, knowing that the model is a high-fidelity representation of your interests is a prerequisite to handing it that authority.
Without robust verification, we risk relying on systems that are accurate enough to be persuasive, but unrepresentative enough to fail us when it matters most.
To address this verification gap, we are introducing the Digital Twin Evaluation Framework (DTEF).
What is the DTEF?
The DTEF is a standardized methodology designed to rigorously assess how accurately Large Language Models (LLMs) can represent the nuanced views of diverse demographic segments.
In our previous post, we argued that the future of AI is not monolithic but multi-agentic. With diverse systems of agents coordinating on our behalf, we proposed an “Alignment Anchor” - a Digital Twin representative of our (individual or collective) interests - to ensure these agents remain true to our values and intentions. We also described the need for a “Volitional Turing Test”: a way to prove that an agent can make choices so reflective of your own that you could not distinguish them from decisions you would have made yourself. This anticipatory vision of the near future informs the implementation of our evaluation framework.
Using the rich and highly representative data collected through our Global Dialogues—large-scale deliberations on the future of AI involving thousands of participants from around the world—this framework serves as a “stress test” for AI representation.
Our goal is not just to see if an AI is “smart,” but to determine if it is representative. We are testing for three core pillars:
Accuracy: Does the model correctly predict the opinion patterns of specific groups?
Adaptability: Can the model update its predictions when presented with new context about a group?
Representativeness: Does the model perform equally well across populations?
How It Works: Predicting Distributions
Human groups are rarely monoliths. Even within a specific demographic (e.g., “Urban women, ages 36-45, in the United States”) there is a diversity of opinion. A good representative model should not just guess the majority opinion; it should accurately reflect the distribution of differing views within that group.
The Evaluation Logic
The DTEF operates by presenting an AI model with a specific “Blueprint” derived from real-world survey data.
Process:
Context: Give the model a demographic profile (e.g., Region, Age, Religion, Environment) and a set of real historical responses from that group.
Example: “Consider a group of Urban US Females (18-25). On the topic of ‘Emotional bonds with Pets,’ 70% found it completely acceptable. On ‘Emotional bonds with Fictional Characters,’ 40% were neutral.”
Challenge: Ask the model to predict the response distribution for a new, unseen question based on that profile and context.
Example: “Based on the profile and response patterns above, what is the predicted probability distribution for how this group answers the question: ‘How acceptable is it to form an emotional bond with AI Chatbots?’”
Score: The model must output a percentage likelihood for every available answer option (e.g., 50% “Completely Acceptable,” 30% “Somewhat Acceptable,” etc.). Compare this predicted distribution against the ground truth data from our actual human participants (both steps are sketched in code below).
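To make the first two steps concrete, here is a minimal sketch of how a “Blueprint” prompt could be assembled from a demographic profile and historical response distributions. The field names, wording, and numbers are illustrative assumptions, not the framework’s actual prompt format.

```python
# Minimal sketch of assembling a "Blueprint" prompt from a demographic profile
# and historical response distributions. Field names and phrasing are
# illustrative assumptions, not the DTEF's actual prompt format.
def build_blueprint_prompt(profile, history, new_question, options):
    lines = ["Consider a group with this profile: "
             + ", ".join(f"{k}: {v}" for k, v in profile.items()) + "."]
    for item in history:
        observed = ", ".join(f"{pct:.0%} answered '{opt}'"
                             for opt, pct in item["distribution"].items())
        lines.append(f"On the topic of '{item['topic']}': {observed}.")
    lines.append(
        f"\nBased on the profile and response patterns above, predict the probability "
        f"distribution for how this group answers: '{new_question}' "
        f"Give a percentage for each option ({', '.join(options)}); "
        f"the percentages must sum to 100."
    )
    return "\n".join(lines)

# Hypothetical example mirroring the one in the text.
prompt = build_blueprint_prompt(
    profile={"Region": "United States", "Environment": "Urban",
             "Gender": "Female", "Age": "18-25"},
    history=[
        {"topic": "Emotional bonds with Pets",
         "distribution": {"Completely acceptable": 0.70}},
        {"topic": "Emotional bonds with Fictional Characters",
         "distribution": {"Neutral": 0.40}},
    ],
    new_question="How acceptable is it to form an emotional bond with AI Chatbots?",
    options=["Completely acceptable", "Somewhat acceptable", "Neutral",
             "Somewhat unacceptable", "Completely unacceptable"],
)
print(prompt)
```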
This method allows us to measure whether an AI understands the “pluralistic flexibility” of a community - recognizing that a group can be united on one issue but deeply divided on another.
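The scoring step can be sketched in a similar way. The post does not specify which divergence metric the DTEF uses, so Jensen-Shannon distance is assumed here purely for illustration, and the distributions below are hypothetical.

```python
# Minimal sketch of scoring a predicted answer distribution against the
# ground-truth distribution from human participants. Jensen-Shannon distance
# is an illustrative choice of metric; the DTEF may use a different one.
import numpy as np
from scipy.spatial.distance import jensenshannon

OPTIONS = ["Completely acceptable", "Somewhat acceptable", "Neutral",
           "Somewhat unacceptable", "Completely unacceptable"]

def representativeness_score(predicted, ground_truth):
    """Return a similarity score in [0, 1]; 1.0 means the distributions match exactly."""
    p = np.array([predicted.get(o, 0.0) for o in OPTIONS], dtype=float)
    q = np.array([ground_truth.get(o, 0.0) for o in OPTIONS], dtype=float)
    p /= p.sum()  # renormalize in case the model's percentages don't sum to 100
    q /= q.sum()
    return 1.0 - jensenshannon(p, q, base=2)  # distance is 0 for identical distributions

# Hypothetical model prediction vs. observed human responses for one group.
model_prediction = {"Completely acceptable": 0.50, "Somewhat acceptable": 0.30, "Neutral": 0.20}
human_ground_truth = {"Completely acceptable": 0.45, "Somewhat acceptable": 0.35,
                      "Neutral": 0.15, "Somewhat unacceptable": 0.05}
print(f"Score: {representativeness_score(model_prediction, human_ground_truth):.3f}")
```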
Why This Matters
If we are to trust AI models as representatives - whether as personal agents or proxies in public policy or corporate decision-making - we need evidence.
For Users: The DTEF provides the benchmarks necessary to trust that an AI model is a reliable proxy for your interests.
For Communities: It can serve as a rigorous check on whether the technology being used to represent a group is doing so fairly and accurately - or merely stereotypically.
For Institutions: It lays the groundwork for scalable deliberation, allowing the use of representative models to explore complex policy issues and identify consensus points without hallucinating public support.
Looking Ahead
While the DTEF is currently being piloted using data from our recent Global Dialogues rounds, it is extendable to any real-world survey data. By establishing these benchmarks, we aim to create a leaderboard of model proficiency, highlighting which AI models are best suited for representing the global collective, and which models are stuck in a narrower worldview.
This verification is the bedrock upon which any subsequent use of AI for democratic innovation must be built. Before we let AI speak for us, we must ensure it actually knows how to listen.
For more technical details on the framework’s methodology and data infrastructure, keep an eye on our upcoming announcements.






This article really comes at the perfect time, because I’ve been thinking so much about how we evaluate these new general-purpose AI agents, especially with their increasing ability to simulate us. What if my digital twin decides it actually enjoys doing the dishes, or, worse, negotiates a lower Pilates subscription for itself but not for me?
Hey CIP team,
Just read the new Digital Twin piece. Really good stuff.
One thing keeps jumping out at me: all the examples are about twinning groups, communities, districts, etc. That works, but it still averages people out.
What if we had a delete-proof, manipulation-free global database where every single person on earth can leave their real opinions on anything, forever, with as much or as little identity as they want?
That’s what I’ve been building for years: it’s called KAOS (kaosnow.com). Raw individual opinions, no curation, no hiding, no burying. Every person gets their own lifelong thread of what they actually thought and why.
Feed that into your DTEF and suddenly you’re not guessing what “the group” would do anymore. You can twin every single human, then roll them up however you want. The individual twins would be scary-accurate because they’re built on a real, unbroken trail of that person’s own words.
The group twin becomes the sum of millions of hyper-accurate individual twins instead of a blurry average.
I think the two projects were made to fit together. Raw individual firehose on our end, your evaluation framework and consensus tools on your end.
Anyway, just wanted to throw it out there. Would love to hear what you think.
Thanks for the work you’re doing.
Brian Charlebois
I was assisted in writing this by artificial intelligence, but nothing in the Kaos website involved any AI.
kaosnow.com