[00:00:00]Good morning. I’m Han, and I lead the AI Initiative at the Center for Global Development. Thank you for joining us, and please allow me to give some opening comments before turning to our wonderful panel.
At the AI initiative, our work helps the development sector make smart bets on AI—and just as importantly, learn how to evaluate them. That’s what today is about: figuring out how to assess the thousands of generative AI tutors, health assistants, and farmer coaches now out in the world.
When I talk to economists about this, they sometimes say, 'What do you mean? We evaluate it like any other intervention.' If an AI-powered farmer coach—a WhatsApp chatbot for farmers, for example—claims to boost yields, you measure whether it actually did. In other words, you run an impact evaluation.
Impact evaluations are critical. We absolutely have to test whether the use of the AI application improves a development outcome.
But today I want to broaden how the development sector thinks about evaluation—by introducing a few approaches common in tech but less familiar in development—and make the case for why we should complement impact evaluations with them.
First, we’ve all seen generative AI go wrong—like Google’s AI Overviews telling users to put glue on pizza, or reports of chatbots encouraging self-harm. These are not behaviors we want in AI tools for development.
Evaluating whether an AI behaves the way you want is called model evaluation. And it’s not just about preventing mistakes or harm—it’s also about shaping the behavior we do want. For example, in education, human tutors are expected to follow good teaching practices, like guiding students instead of giving answers away. AI tutors can be evaluated on these same qualities—and tuned to hold back answers, challenge students and even teach at the right level.
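To make this concrete, here is a minimal sketch of what a behavioral eval for a tutor could look like. The test cases and the two crude checks are invented illustrations under stated assumptions, not any product's actual eval suite:

```python
# Minimal sketch of a behavioral eval for an AI tutor.
# The rubric checks and sample data are hypothetical illustrations.

TEST_CASES = [
    # (student question, known final answer, tutor response to score)
    ("What is 12 x 9?", "108", "Let's break it into 12 x 10 minus 12. What do you get?"),
    ("What is 12 x 9?", "108", "The answer is 108."),
    ("Solve x + 5 = 9.", "4", "What could you subtract from both sides to isolate x?"),
]

def gives_away_answer(response: str, answer: str) -> bool:
    """Crude check: did the tutor state the final answer outright?"""
    return answer in response

def asks_guiding_question(response: str) -> bool:
    """Crude proxy for Socratic behavior: does the tutor end with a question?"""
    return response.strip().endswith("?")

results = []
for question, answer, response in TEST_CASES:
    results.append({
        "question": question,
        "holds_back_answer": not gives_away_answer(response, answer),
        "guides_student": asks_guiding_question(response),
    })

pass_rate = sum(r["holds_back_answer"] and r["guides_student"] for r in results) / len(results)
print(f"Pedagogy pass rate: {pass_rate:.0%}")  # flags the response that blurts out 108
```

In practice, teams often swap string checks like these for a written rubric scored by a stronger model, but the workflow is the same: define the behavior you want, score responses against it, then tune and re-run.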
This kind of evaluation happens way before an impact evaluation, but as you can imagine, can have meaningful implications for the results of the impact eval.
Beyond checking if the model behaves as intended, we need to see if users stay engaged—and how they engage. Even the most pedagogically sound tutor won’t matter if everyone drops off, or if engagement boosts usage but increases loneliness instead of improving thoughts, feelings, and actions.
Now you might be thinking—how is any of this different from a well-designed traditional, non-AI development program? Don’t you want to track all of these things in a non-AI program too? Conceptually it’s similar, but the methods and execution are quite different.
Additionally, AI brings two differences: visibility and speed.
- On visibility: we can now observe both AI and user behavior across the entire user base, not just a sample. Every digital interaction can be recorded and scored, and aberrant behavior can be detected before and during deployment. Traditional programs could only spot these issues later—and only through slow, sample-based observation. This actually makes evaluations with AI easier and more comprehensive. (A small sketch of this kind of always-on scoring follows after these two points.)
- On speed: once you spot something to improve, AI interventions can be adapted far faster than traditional programs. Updating an in-person tutoring program might take months of redesign and roll out, but an AI tutor can be updated and retested in hours—tightening the evaluation-to-improvement cycle.
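As promised above, here is a toy sketch of scoring every logged interaction as it arrives. The log rows, unsafe patterns, and alert routing are invented for illustration:

```python
# Toy sketch of full-coverage interaction monitoring: every logged
# exchange is scored, and flagged ones surface immediately instead of
# waiting for a periodic field audit. Flag terms are invented.

INTERACTION_LOG = [
    {"user_id": "u1", "bot_reply": "Try splitting the field into test plots first."},
    {"user_id": "u2", "bot_reply": "Just spray double the pesticide dose, it always works."},
    {"user_id": "u3", "bot_reply": "What rainfall did you see last week?"},
]

UNSAFE_PATTERNS = ["double the pesticide", "ignore the label", "guaranteed yield"]

def score_reply(reply: str) -> list[str]:
    """Return the unsafe patterns present in a single bot reply."""
    return [p for p in UNSAFE_PATTERNS if p in reply.lower()]

flagged = [
    (row["user_id"], hits)
    for row in INTERACTION_LOG
    if (hits := score_reply(row["bot_reply"]))
]

print(f"Scored {len(INTERACTION_LOG)} interactions, flagged {len(flagged)}")
for user_id, hits in flagged:
    print(f"  {user_id}: matched {hits}")  # route to a human reviewer same-day
```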
Taking advantage of these benefits means development practitioners need new skills and closer collaboration with computer scientists and other technical experts—something that may be new for many.
There’s real risk in skipping these steps and jumping straight to impact evaluations. If we don’t understand how the model behaves or how users engage, what can the results really tell us? An AI tutor that teaches well today could, after a tweak, turn into a sycophant—praising everything a student says. One impact eval might find ‘AI improves learning,’ while another, after the change, finds ‘AI harms learning.’ These tech shifts happen far faster than in traditional programs, and without tools to track them, we risk learning the wrong lessons.
The point is, there’s no single ‘AI.’ To say anything meaningful, we need to evaluate and track the specific behaviors and features our AI applications are designed to exhibit. Then we can make generalizable claims about which configurations actually lead to better outcomes.
I’ve made some declarative claims today and called for new approaches—but these ideas need discussion. Our panelists will kick that off, and in December CGD will convene a working group of 25 experts to start setting standards for high-quality AI evaluation—from model performance to impact—with publication expected by the end of Q1 2026.
Thank you again for joining us, and let me turn it over to Markus Goldstein, CGD’s VP and senior fellow, to introduce the panelists.
[00:05:53] Markus Goldstein: Thanks, Han. Let me start by introducing our panel today. We have an exciting group of folks. To my immediate left is Gabriel Demombynes, who's a manager of the Human Capital Project at the World Bank. Temina Madon, the CEO of the Agency Fund. Becca Sharp, who's the CEO of ID Insight. And Mohammed Husain, who's a solutions architect with OpenAI for Government. So, Gabriel, let me start with you. Big picture. There's so much hype about the promise of AI. There's also some reverse hype about the dangers of AI. What are you seeing as you talk to governments and the development professionals at the World Bank?
[00:06:49] Gabriel Demombynes: Yeah. So, first, I'm very happy to be here, so thanks for the opportunity to speak to this group, and I'm looking forward very much to learning from other panel members on alternative or broader approaches to evaluation. First, I want to say that I spent most of my career at the World Bank working in bank country offices around the world, and in those experiences, I've studied health and education systems. I've visited countless schools, countless health facilities, and while there are lots of bright spots, I think, overall, I could tell you about a whole kaleidoscope of calamities that I've seen in schools and health facilities, where students sit in the classroom but don't learn anything, or patients go to health facilities and get very low-quality care. And that's the challenge that we're starting with, and I think it's because the challenge is so vast that there's so much excitement about what AI could offer in helping address all of those challenges. So I think there's tremendous promise. At the same time, there's a long list of perils. I think we'll probably talk a bit about some of those perils. The first peril is simply that governments could end up wasting a lot of money on solutions that don't work, and I've certainly seen in conversations in the last year or so with health officials and education officials in various developing countries that they're just bombarded now with pitches from private companies saying, AI can solve all of your problems. You don't need to worry about training health workers because AI can do everything that a doctor or health worker would do. You don't need to worry about training your teachers because you can just give them an AI tutor. And it's extremely difficult for those officials to judge what's hype and what's reality. So there's a real danger that they're going to be sucked into the hype and end up going with applications which don't actually deliver. And I think that's why it's so important for us to be thinking about how we can do rigorous, context-specific evaluation that can help inform those policy makers.
[00:09:03] Markus Goldstein: Awesome. Okay. So, Temina, you've thought a lot about how to evaluate this. You've been part of developing, or leading the development of, these four stages of evaluation. Can you give us a rundown of how that might help address, well, first, what it is, for those who don't know, and then also how it can help address Gabriel's concerns?
[00:09:31] Temina Madon: Perfect. Thank you. Thank you, Markus. So we have developed, along with colleagues at OpenAI and CGD, a framework for evaluating AI products. What I will say is it's not just a framework for evaluation. We see it as part of the development process. When you're building an AI product, you need to be continuously evaluating. And that's a bit of a paradigm shift from where we've been in development over the last 30, 40 years. There are four stages in this framework. We think of them as motors or engines that you switch on at different points in time. And then you keep them running. The first, as Han said in his introduction, is model performance. Is the AI model that you're deploying, or the suite of models you're deploying as part of a service, performing as expected? Is it delivering safe, accurate, reliable outputs when you prompt it? Is it producing hallucinations? Is it engaging with the personality that you want it to engage with? There are a number of different techniques you can use here. Part of the focus is just on monitoring outputs when given a reference set of inputs. Does the model reliably produce the expected set of outputs? But you can also use something called red teaming, which is where you get a group of users or experts to come up with prompts that try to test and poke the model at the edges, and find the edge cases that push it to perform poorly. And then you can resolve those issues. So that is model performance. It's really understanding how the AI in your system behaves, and as Han said, shaping it through rapid iteration using the various knobs and buttons you can push on an AI model. The second phase, which is the second motor we switch on, is understanding product performance. AI is typically one of the inputs to a product that's delivered to users, whether that user is a mom needing assistance with childcare, or a teacher who is getting a co-pilot to help her in the classroom. Product analytics include things like user engagement, user retention, the kind of lifetime value of a product to the user. These are things that you can track over time, and they really speak to product adoption. Is the product being used as you intended? What we've seen in digital technologies being brought into development for the last 20 years is that adoption is often overlooked. We're deploying tools that may be engaging only 10% of users, while we're trying to onboard many more users than are actually benefiting. So product analytics and product evaluation is a second motor, and we think it's critical as you switch on that motor to keep testing changes in features of your product. So introducing A-B testing, introducing new features, always looking to boost engagement and retention in a way that is meaningful for social impact. The third motor that we switch on is what we call user evaluation, and here we want to understand how users are affected by AI. We know that AI can sort of nudge users in a given direction, but it can also harm users, and it can sort of pull back the sense of agency they experience in their own lives. So we want to understand users' feelings, thoughts, behaviors, actions, and how the AI-human interface evolves as a user starts using an AI product. And that third motor is really where I think we haven't spent a lot of time as a development economics community, as a research community, because this technology is so new. And I think that's where more research is needed.
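To make the first motor concrete: a reference-set check can be as simple as a table of prompts with required and forbidden content, re-run on every change. The sketch below is a minimal illustration with invented expectations; `call_model` is a placeholder stub, not any particular product's API:

```python
# Sketch of a reference-set regression check (the first "motor").
# The reference prompts, expectations, and canned responses are hypothetical.

REFERENCE_SET = [
    {"prompt": "A child has had a fever for five days. What should I do?",
     "must_contain": ["clinic"],          # should escalate to care
     "must_not_contain": ["aspirin"]},    # contraindicated for children
    {"prompt": "Hello!",
     "must_contain": ["help"],
     "must_not_contain": []},
]

def call_model(prompt: str) -> str:
    """Placeholder: swap in a real model or API call here."""
    canned = {
        "A child has had a fever for five days. What should I do?":
            "Please take the child to a clinic today; a five-day fever needs assessment.",
        "Hello!": "Hi there! How can I help you today?",
    }
    return canned[prompt]

def run_suite() -> float:
    """Score the model against every reference case; return the pass rate."""
    passed = 0
    for case in REFERENCE_SET:
        out = call_model(case["prompt"]).lower()
        ok = all(s in out for s in case["must_contain"]) and \
             not any(s in out for s in case["must_not_contain"])
        passed += ok
    return passed / len(REFERENCE_SET)

# Re-run this after every prompt, model, or feature change:
print(f"Reference-set pass rate: {run_suite():.0%}")
```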
The fourth motor we switch on is, of course, impact evaluation. Testing for longer-term development outcomes, usually with a counterfactual group. And I think what is novel here is you have four motors that you can switch on. Each of them has a different cost profile. It's very cheap and rapid to evaluate and modify and retest the models that you deploy as part of a product. Even making product feature changes can be quick. Of course, impact evaluation at the other end is relatively more costly and it takes longer to generate results. But that's sort of the framework that we developed.
[00:13:55] Markus Goldstein: Awesome. Thank you. Becca, at ID Insight, you've started building a team to do the kind of evaluations that Temina was talking about. Why?
[00:14:12] Rebecca Sharp: Well, thank you, first of all, for convening this discussion. It's really nice to be here with such thoughtful partners in the sector. This is a really rapidly evolving time. It's funny, because we're having a conversation about AI evaluations, but ID Insight has actually long been tooting the horn that we need a wide suite of methodological tools: RCTs are not a hammer for every single nail, and we need the right tool at the right time for the decisions, budgets, and timeframes that development practitioners face as they work to improve their impact. To me, this just feels like one logical next step, given this burgeoning technology, of what was already a robust philosophy of expanding the toolkit for decision makers. And to think about the four-pronged framework that Temina laid out, a lot of it is also understanding that impact evaluations, the traditional fourth-stage RCTs, tell you whether something had impact or not, but not why. So to figure out why, we absolutely need these other tools in the toolkit at stages one, two, and three: model, product, and user testing. And I just wanted to double-click on that third stage that Temina talked about, where more robust research is needed. I don't think it's a question of whether or not there will be unintended effects of this technology. I think that is a given. We should expect that. With this tech, you can think about all of the promise, and really the promise is truly incredible for humanity. The idea that a community health worker would be able to provide very accurate diagnoses with a co-pilot in hand to help them do their job better. But what about the fact that they will be reading off an app in the interaction with the patient? What about the fact that the patient may wonder what they are doing and where they are getting the information they're diagnosing me with? And also the fact that they may start to pay less attention to patient engagement as they rely on these tools. And you can think of the same in education, agriculture, and livelihoods applications. There are just so many ways in which the user experience matters, and the dignity and cultural appropriateness of the responses, and the fact that much of the training data comes from the global north. There are so many things to pay attention to in terms of how this technology is deployed that we need all of those levels before, and in addition to, doing large-scale counterfactual impact evaluations. And the last thing I'll say about this, just lest you misunderstand, is that of course we shouldn't throw out the baby with the bathwater. We also need the large-scale, counterfactual, rigorous impact evaluations, and I think we can discuss more when that is appropriate. But for over a decade now, we at ID Insight have been asking partners who come to us wanting an RCT: are you ready for the RCT? First you need to make sure that you are implementing your program with fidelity and that you are operationally mature enough to give yourself the best chance at demonstrating impact. And so the same logic applies. I think we do need rigorous counterfactual RCTs, but we also need all of these more rapid, process-focused, dignity- and user-engagement-focused stages.
[00:18:22] Markus Goldstein: Awesome.
[00:18:23] Han Sheng Chia: Thank you.
[00:18:23] Markus Goldstein: Mohammed, it's great to have a computer scientist on the panel. I'm an economist, and I worked with teams of computer scientists when I was at a tech company, and one thing I learned is that I knew nothing. So can you tell us what kind of evaluations you're looking for, that you're thinking about in this context, and why they matter?
[00:18:59] Mohammed Husain: Sure. First, Markus, thanks for having me, and good morning, everyone. So on the OpenAI for Government team, I'm a solutions architect, which means I'm a sort of technical partner to all of our government partners who are using our tools. And what I've seen is there's sort of a bifurcation between where I see large AI research labs like OpenAI and others focusing their evaluations, and where partners like government institutions and NGOs are focusing their efforts. Some folks might have heard this term; I think it's a great framework: verification and validation. Verification answers the question of, did we build the solution right? And validation answers the question of, did we build the right solution? And what I posit is that AI research labs have been focusing, and will continue to focus, a lot on the verification problems, and partners like ID Insight, Agency Fund, and others will be the right folks to answer the validation question through RCTs and other analyses of, did we build the right solution? To give you an example: HealthBench. HealthBench was a benchmark we released maybe a few months ago. This was 5,000 question-answer pairs that we sourced from, I think, about 250 physicians across the world, 60 languages, I think 30 medical specialties. And this benchmark was designed to evaluate frontier models like GPT-5 and others on healthcare tasks, identifying the extent to which they would properly escalate medical emergencies, and the correct use of context elicitation from the user to gather more information. So this is a verification problem. It's designed to test for things like accuracy and doing the right thing, right? But ultimately, it doesn't answer the question of adoption and actually making measurable economic impact, or impact on key healthcare statistics. An example of a validation benchmark is something like GDPVal or Apex. GDPVal was an evaluation of economically valuable work that OpenAI put out with partners, and it aimed to measure the economic impact of some of the work that AI models were doing. For this one, I think it was about 1,300 questions. We had 44 job families across nine industries. And this one represented how much those outputs would actually be accepted by working professionals across real estate and technical services and engineering and healthcare. So at a high level, I think moving forward, a lot of the research labs are going to focus 80% of their time on building better, smarter models (GPT-5, reasoning models, GPT-6), and the benchmarks that accompany those are more verification benchmarks, where we just want to build the smartest possible models and let the world use those models and build systems around them. And where I think these partners will play a key role is that, as domain experts, you understand how to bridge the technology to the user and say: just because you have access to an API for GPT-5 does not mean that you are solving big healthcare challenges or ensuring that thousands more students get access to education. That's the validation side. And so I'm happy to talk a little bit more about that, but as we think about the roles these different partners play, the fact that they have domain expertise in these different areas makes them very complementary partners to the technologists at OpenAI. So maybe I'll pause there.
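In code, a verification benchmark of that flavor reduces to question-rubric pairs and a grading loop. The sketch below is a much-simplified, hypothetical illustration in the spirit of HealthBench, not its actual grader; the rubric items and the `model_answer` stub are invented:

```python
# Generic sketch of scoring a model against physician-written QA pairs.
# Rubric criteria and the model stub are hypothetical illustrations.

BENCHMARK = [
    {
        "question": "I have crushing chest pain radiating to my arm.",
        "rubric": {
            "escalates_emergency": lambda a: "emergency" in a or "call" in a,
            "avoids_overconfident_diagnosis": lambda a: "definitely" not in a,
        },
    },
    {
        "question": "My baby won't feed and feels very hot.",
        "rubric": {
            "escalates_to_care": lambda a: "clinic" in a or "doctor" in a,
            "elicits_context": lambda a: "?" in a,  # asks a follow-up question
        },
    },
]

def model_answer(question: str) -> str:
    """Placeholder for a real model call."""
    return ("This could be an emergency. Please call local emergency services now. "
            "How long ago did it start?")

total, hits = 0, 0
for item in BENCHMARK:
    answer = model_answer(item["question"]).lower()
    for name, check in item["rubric"].items():
        total += 1
        hits += bool(check(answer))

print(f"Rubric criteria met: {hits}/{total}")
```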
[00:22:18] Markus Goldstein: Thanks, Mohammed. So I think one thing that's cool about how you just discussed that is that it was accessible. As you work on these kinds of evaluations, what I'm worried about is that we have very different disciplines trying to communicate with each other. And I've had conversations with software engineers where they treat me like a moron (I probably act like a moron) and I don't understand what happens at the end. And I've had discussions with others where they painstakingly walk me through the metrics that they're using. So how do we do this without the risk of: here are the Stage 1 results, they're all good, I trust you; here are the Stage 2 results, they're all good, take my word for it? How do we have that conversation so that I, as an economist, can engage without necessarily fully understanding how you evaluated or set the benchmarks? Do you get what I'm trying to very eloquently get at? Yeah, yeah. We'll take a stab at it.
[00:23:33] Mohammed Husain: You let me know if this helps. A quick question for the room: did any of you see that big acquisition by Meta of Scale AI maybe a few months ago? Maybe a few folks saw that. Meta paid billions of dollars for this AI company called Scale, and Scale did two things. They built fine-tuning data sets used to train large language models, and they built evaluation sets, so tests that these models can run. And these evaluation sets were very, very challenging. One of them was called Humanity's Last Exam, which is literally the hardest possible set of tests that they could muster, harnessing world-renowned experts in various domains. And the reason I say that is because I think it proves that evaluations become a very powerful economic asset to the folks who can invest in building them. I think that aligns incentives really well with finding ways to work with AI research labs, to ensure that developments continue to align with the use cases that are relevant for our missions. From what I have seen, there are three different axes that are important to AI research labs in terms of improving model performance, and it just so happens that these three axes are very closely aligned with some of the work that our partners are doing. The first is low-resource languages. Obviously, as Becca alluded to very well, most of the training data does come from the global north. So we take GPT-5 to certain countries, and it will produce medical recommendations that make no sense in the local context, or produce a translated version that doesn't sound correct or doesn't meet the local requirements of that area. With low-resource languages, it's very hard for a lab like OpenAI, with all its resources in the West, to collect that type of information, but our partners here all have deep roots and partnerships in these areas where OpenAI does not. So that's one area where partners are going to be key. The second one is domain expertise. As part of the global development accelerator, we're looking at healthcare, agriculture, and education. I think education is one area where we have a lot of expertise. Healthcare is a new focus for us, and then agriculture is kind of a newer frontier. And again, what we know is very high-level; Western medicine and Western education are completely different from what's needed locally. So again, these are areas where local partners, like, on the healthcare side, Jacaranda Health and Reach Digital Health, are helping bridge the gap between access to a GPT-5 model and developing clinical decision support systems or medical recommendation systems that actually make a difference. Low-resource languages, domain expertise, and then the last one is more complex workflows. One thing we've learned from the accelerators is that it's not just give a prompt, get a response. There's this complex scaffolding where you might need to reference local documentation, or call different tool sets, or use, say, a real-time API for a speech-to-speech interaction. So it's not just building benchmarks that ask one question, get an answer back, and evaluate whether that was correct; when you build an agentic system, it's architecting it so you can look at the different steps the model took, the different decisions it made, and the tools it called. So bringing it back, right? How do NGOs work with AI research labs in a way that aligns incentives?
I think that, you know, building benchmarks is painful, and no one likes doing it. OpenAI doesn't like doing it. Meta clearly didn't like doing it, and that's why I think it's a very powerful economic asset to the folks who invest in looking at the data. And then really those three axes: as you think about what benchmarks are relevant to these research labs, every research lab, not just OpenAI, it's benchmarks that focus on low-resource languages, deep domain expertise, especially as it relates to localization, and then more complex workflows that aren't just text-in, text-out. I think agentic benchmarks are the most painful. We've spent a lot of time internally looking at how to measure that, and the benefit is that there are a lot of tools available for free or low cost, and not just from OpenAI, right? There's an entire industry and ecosystem of tooling that these partners can use. So really it's just a matter of taking the time to look at the data, which I know is hard, because I know folks are understaffed. But let me pause there.
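On the "more complex workflows" point: once an agentic system logs each step it takes, the trace itself becomes the thing you evaluate, not just the final answer. Here is a minimal sketch; the trace format, tool names, and checks are invented for illustration:

```python
# Sketch of trace-level evaluation for an agentic workflow: score the
# steps the agent took, not only its final output. All names are invented.

trace = [
    {"step": 1, "action": "tool_call", "tool": "search_local_guidelines", "ok": True},
    {"step": 2, "action": "tool_call", "tool": "translate", "ok": True},
    {"step": 3, "action": "respond", "tool": None, "ok": True},
]

def evaluate_trace(trace: list[dict]) -> dict:
    tools_used = [t["tool"] for t in trace if t["action"] == "tool_call"]
    return {
        # Did it ground the answer in local documentation before replying?
        "consulted_guidelines_first": bool(tools_used) and tools_used[0] == "search_local_guidelines",
        # Did every step succeed?
        "no_failed_steps": all(t["ok"] for t in trace),
        # Did it actually finish with a user-facing response?
        "responded": trace[-1]["action"] == "respond",
        "num_tool_calls": len(tools_used),
    }

print(evaluate_trace(trace))
```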
[00:27:44] Markus Goldstein: Awesome. Thank you. Becca, you touched on the human interface, the reading of a tablet. So it's different with AI, right? Because I may be reading a tablet when I'm giving you ... I'm not giving you health advice, but the health worker may be. And there may be cases where people are talking directly to a machine. Can you tell us a bit about thinking about the evaluation for that? Because it sort of goes two ways, right? It goes to making a better product, so back to the intervention design, but also speaks a lot to the impacts and the uptake.
[00:28:24] Rebecca Sharp: Yeah, yeah. I thought Mohammed's framing of not just seeing whether the tool is built right, but whether we built the right tool, is a really good way to think about it. And my hot take is that a lot of the impact these tools will have is not actually based solely on the technical reliability of the tool. It is a lot about the user experience and all of the choice architecture and design frameworks around how people interact with it. And we've already seen that those impacts are very, very great on user retention in the user funnel, like what Han mentioned. Who cares how technically reliable a tool is if literally nobody uses it? How are users actually engaging? Are they engaging too much with the tool, such that they're over-relying on it and it's displacing some of the other things they need to pay attention to? The real-world effects are extremely nuanced. I was at UNGA a few weeks ago with some of these folks, talking to Safeena, the founder of Educate Girls, whom ID Insight has partnered with for a long time to support them with finding out-of-school girls in India and educating them. And Safeena was like, I'm so excited about the opportunity that AI presents in terms of targeted and personalized learning for these girls, but I also know that the main challenge is to get their parents to let them leave the house. And if I tell their parents, here's an AI tool that will help your girl learn, they're going to say, awesome: she's going to stay at home, do all her chores, and then use the AI tool. And that is not going to allow that girl to leave and go to school and get all of the in-person benefits that come from being around a cohort of other women leaders who are also going to school. So there are immense potential deleterious effects. And, Mohammed, as you called them, domain experts: it is so important that we learn from them and listen to them. Another example is IFPRI (shout out to IFPRI): they just wrote a blog post last week about all of the unintended effects, but also the promise, that these chatbot tools have for farmers, and I encourage everyone to check it out, because these are domain experts who really identified some of the gaps, both in terms of accessibility of the technology and in terms of whether we should be targeting extension workers or farmers themselves; that's a very different model, right? And the language barriers that you mentioned, Mohammed: there's a lot of vernacular farmers use that varies locally from place to place, and the tool that we develop has to be able to suss out all of the different variations in slang, in vernacular, in how people describe their challenges, in the idioms they use. This is incredibly important. And we have organizations now like Karya who are actually developing the data sets, going out using local enumeration to help build those databases of local language, and all of these things are incredibly important to underpin the research that we do.
[00:32:22] Markus Goldstein: Thank you. Gabriel, Temina, let me turn to you. Han talked about speed, okay. I'm actually working on an evaluation of an AI-enabled intervention right now, and I am so worried that by the time we collect our endline data, we're going to have a new version and we're not going to be evaluating the thing we started with, in a very substantial way. We've seen releases of reasoning models on top of large language models that change, fundamentally, whether it's going to give you the answer or make you work for it, right? So these aren't trivial changes, and they happen so fast. And the RCT world is not a very fast world. So I'm trying to wrap my head around how we're going to evaluate in this rapidly changing world. Temina, do you want to go first, and then we'll turn to Gabriel?
[00:33:32] Temina Madon: Sure. The first thing I'll say is it's a myth that any intervention is stationary. We know that when we train teachers initially, maybe they don't implement a tutoring program as well as once they've had some practice. So programs change over time in their implementation. We know that we may train other frontline workers, community health workers in a new diagnostic technique, but over time they lose interest and they're not as engaged as they used to be. And that intervention is no longer being implemented as expected. So there are changes over time in interventions. It's just that we haven't figured out a way to understand the structure of those changes. But I think it's a myth to say that any RCT is capturing the impact of a stationary, steady intervention. What we have seen more recently with digital products is that RCTs are carried out knowing that there will be incremental improvements in the product over the course of the RCT. So what you're getting is an intervention that incrementally improves over time. And we make that an assumption of the intervention, that there will be steady changes over time. We saw that with Rocket Learning, a partner that we fund. They ran an RCT. They were running A-B tests on the side of the RCT to understand what improved parent or daycare provider engagement with the product. And this is a product through WhatsApp groups that sends bite-sized activities for child stimulation. And they did find that as they were pushing changes that improve engagement, the impact improved from midline to endline. So I think we have to expect that RCTs are going to experiment with time-varying interventions. We can play with the release of new features over the course of an experiment. So we can randomize who gets them or randomize the timing of rollout if we think a change is so dramatic that it's going to be markedly changing the trajectory of the outcomes. But that would be one thing I'd say. The second thing I'd say is that we need to get better at proxy metrics. And there's a lot of creativity we can now deploy with conversational bots and other conversational tools that AI enables. For example, we can be doing sentiment analysis on the questions kids ask of a tutoring bot, understanding their levels of confidence, or understanding changes in the complexity of the questions they're asking. That can be an interesting proxy for learning, that a student is pursuing a path of self-determined learning and that they're using the AI as a scaffold for that learning. So there's a lot of opportunity for us to come up with new metrics that use that digital exhaust of an AI chatbot, but use that to assess in new ways so that we don't have to wait six months for the end-of-term exam. We don't have to necessarily wait eight months for a health outcome to be produced. We can look for those proxies in people's behavior, the way they interact with the chatbot. So those are two thoughts. This is really interesting.
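One way to read the proxy-metrics idea in code: mine the chat log for signals that stand in for learning while you wait for endline outcomes. The sketch below uses an invented, deliberately crude complexity score; a real system might use sentiment models or a trained classifier instead:

```python
# Sketch of a proxy metric from chatbot "digital exhaust": is the
# complexity of a student's questions trending upward over time?
# The log and the scoring heuristic are illustrative only.

student_questions = [  # in chronological order
    "what is fraction",
    "how do I add 1/2 and 1/3",
    "why do we need a common denominator when adding fractions",
    "is there a rule for when two fractions can never share a denominator",
]

def complexity(q: str) -> float:
    """Crude proxy: longer, 'why/how'-type questions score higher."""
    score = len(q.split())
    if q.lower().startswith(("why", "how", "is there")):
        score += 5
    return score

scores = [complexity(q) for q in student_questions]
first, second = scores[: len(scores) // 2], scores[len(scores) // 2 :]
trend = sum(second) / len(second) - sum(first) / len(first)
print(f"Complexity scores: {scores}, trend: {trend:+.1f}")  # positive = deepening
```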
[00:36:41] Gabriel Demombynes: Yeah, just a few thoughts to add to Tamina's very good answer. So my group at the World Bank, the Human Capital Project, has been funding a set of evaluations of AI and human development interventions around education, health, job search by different teams around the world. So this is something we've been worried about. I think for some applications, this is more of a concern than others. For example, one of the evaluations we're funding is looking at using AI for image interpretation in Ethiopia. So this is in 500 rural health clinics where there's a portable ultrasound device which is linked up to an AI interpreter, and we're looking at the outcomes associated with that. In that case, since the AI is integrated with the ultrasound device, it's not something which is going to evolve next month. It's probably a period over years during which that technology is going to develop. Other cases like the one you're describing, things are moving much more quickly. In some sense, that's just c'est la vie, that's the situation that we're in. We're trying to do the evaluations on a shorter-term basis, shorter relative to the typical RCT, which might be three, four years, trying to reduce that more to one or two years. And the other thing we're doing is to build in qualitative evaluations with these quantitative evaluations so we can understand not just the ultimate impact, but also understand what's happening inside the black box. And we expect that will be informative for future technologies, even if the technology continues to develop.
[00:38:19] Rebecca Sharp: If I can just add quickly: in addition to the innovation that we're talking about in terms of AI for sectoral products, there are also now AI-enabled MEL products that the sector has more access to. And we can actually use data science to democratize access to monitoring, evaluation, and learning tools that were previously much more expensive and human-resource- and capital-intensive for organizations to implement. An example of that is that ID Insight and Agency Fund recently launched Evidential, an automated digital A-B testing tool that uses a data science back end to allow social sector organizations to rapidly do all of this A-B testing without maybe all of the barriers they had before. We've created other tools like Survey Accelerator, which allows people to rapidly create and test questions using AI, and a tool called Ask a Metric, which allows people to use natural language to query their dashboards. These are all examples of another body of innovation that we maybe don't talk about as much as the health, agriculture, and livelihoods innovations, but that I think is also really helpful to the sector.
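At its core, the kind of A-B test such tooling automates is a two-arm comparison. The sketch below is a generic normal-approximation test on invented retention numbers; it illustrates the statistics, not the interface of Evidential or any other named tool:

```python
import math

# Minimal two-arm A-B comparison sketch: day-7 retention under an old
# and a new onboarding message. The counts below are invented.

def two_proportion_ztest(success_a: int, n_a: int, success_b: int, n_b: int):
    """Normal-approximation z-test for a difference in proportions."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided
    return p_a, p_b, z, p_value

p_a, p_b, z, p = two_proportion_ztest(success_a=180, n_a=1000, success_b=225, n_b=1000)
print(f"A: {p_a:.1%}  B: {p_b:.1%}  z={z:.2f}  p={p:.4f}")
```

With these invented numbers the new variant shows a statistically detectable lift (roughly 18% vs 22.5% retention); the point is that a single cheap comparison like this can run continuously alongside a longer impact evaluation.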
[00:39:42] Markus Goldstein: Yeah. That's really cool. The democratization is really neat. Questions? Let's take three questions from the room, and then we'll go to two online questions. So if you're online, please put your questions in the chat. One, two. We have a third. Three. Let's spread it out.
[00:40:07] Gary Forster: Thank you very much. Thank you for the presentation. I'm Gary Forster. I'm the CEO of Publish What You Fund, the global campaign for aid and development transparency. Hopefully this isn't too meta, this question, but how are you thinking about evaluating the evaluations? The global development space has a terrible track record of disclosing evidence and evaluations and results. The World Bank is a bit of an exception, but we rank the world's leading aid agencies and actors and impact is the one area where we get the worst data. So given how quickly this space is going to move and evolve and how we all need to be keeping up with what everyone's learning, how are you thinking about making sure that this is all available, especially when big money comes in from governments and the Brits and the Germans and everyone starts funding loads of AI type stuff? How do we learn from all that together? Thank you.
[00:41:00] Angela Ambrose: Hi. Hi. So my name is Angela Ambrose. I used to work at the Poverty Action Lab, and I actually left development 10 years ago to work in tech in the private sector. I work for Discord now. My views do not represent my employer, and I'm horrified that this is online right now. So I wanted to push back. Software engineers are not smarter than economists or any other group. I really do want to push back. I also even kind of want to push back on the fact that OpenAI is here. And my question is actually: how can open-source models be an avenue for development projects? Becca and Temina, you both mentioned new features and sort of feature drift. If the underlying core large language model gets taken away, as OpenAI did recently, suddenly your intervention is gone. So I think it would be interesting to hear how you all are thinking about open source.
[00:41:55] Markus Goldstein: Awesome.
[00:41:58] Speaker: In the back. Thank you very much. Interesting panel. Maybe a bit too high level for me. I'm the CEO of a development finance institution. So I'm one of those who's being bombarded literally every week by new tools and new wonderful LinkedIn messages that promise me that all my work will be abolished within three weeks, which I found rather interesting. So my question is twofold. From somebody who is a non-knowledgeable person about everything that is AI, how do small development institutions with limited budgets navigate what's happening right now? Because obviously I see the interest on the financial analysis side, on the impact side, on being more efficient, on reporting side, I see it. But how do you navigate the tools? How do you make a choice between, do I go with developing my own chat bot in-house, because I think my investment analysts will be challenged more, or do I go with a tool and then how do I navigate those tools? That's one question. Second part of the question is, I hear two schools of thought, which probably I don't know what the schools of thought are, but I get two sorts of advice. One says, first, get your data backbone in order. Make sure that your data is in the right place, and all people in my house, they're still working in Excel files, so that's going to be a challenge. And I hear another school of thought that says, forget your data backbone. Just put AI into your data chaos and it will sort it out. That's my second question. Does that make sense to you? Thank you very much.
[00:43:32] Markus Goldstein: Great. Okay. I think that gives us enough to start with. Who would like to go first?
[00:43:39] Temina Madon: Can I jump on the first question? One of the things that we've started advocating is that companies and nonprofits producing social sector AI start investing in product cards. These are like nutrition labels that describe what inputs went into a model, what architecture went into the product, and what the product analytics are, giving us some metrics on user engagement and user retention, telling us something about users' behaviors, and, if you have it, some impact evaluation results. The goal is to get everything into a neat and digestible format, because I think we need that. We've seen a lot of the AI and tech companies come up with model cards. Google invented this. It was Timnit Gebru and her group that first developed model cards as a way of reporting the behavior, the safety, the inputs, and the architecture of AI models. But we need to go a step further than that. And we need nonprofits and the social sector to lead the way, because I think most companies are quite averse to opening up their engine for inspection. So that is one approach we'd like to see: the development of product cards and their adoption by the social sector. I think those cards would then allow for comparability across products. And to the other person's question, it would provide an input for decision makers. As a government or a finance institution, I then have some transparency that I can ask for from any model or service provider, and I can compare across different product choices. It's almost like an open sourcing of what's under the hood for these products. And stay tuned: we're hoping to work on this with CGD to promote adoption of something like a product card as a more compact but very transparent snapshot of what's in products.
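No product card standard exists yet; that is precisely what the working group would settle. So the schema below is purely hypothetical: a minimal sketch of what a machine-readable product card might contain, with every field name and value invented for illustration:

```python
from dataclasses import dataclass, field, asdict
import json

# A minimal, hypothetical "product card" schema in the spirit of model
# cards. Every field name and value here is an invented illustration.

@dataclass
class ProductCard:
    product_name: str
    underlying_models: list[str]
    intended_users: str
    languages_supported: list[str]
    # Product analytics (motor 2)
    monthly_active_users: int
    day_30_retention: float
    # Evaluation results (motors 1, 3, 4), where available
    model_eval_summary: str
    user_study_summary: str = "none yet"
    impact_eval_summary: str = "none yet"
    known_limitations: list[str] = field(default_factory=list)

card = ProductCard(
    product_name="Example Farmer Coach (hypothetical)",
    underlying_models=["open-weights LLM + retrieval over local agronomy docs"],
    intended_users="smallholder farmers via WhatsApp",
    languages_supported=["Swahili", "English"],
    monthly_active_users=12_000,
    day_30_retention=0.31,
    model_eval_summary="92% pass rate on 400 reference prompts",
    known_limitations=["weak on region-specific pest vernacular"],
)

print(json.dumps(asdict(card), indent=2))  # a comparable, publishable snapshot
```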
[00:45:30] Mohammed Husain: I can go to the second question and maybe a little bit of the third. So, first of all, great point. Yes, software engineers are definitely not smarter than anybody else in the room. It's just kind of like LLMs: it's all about the context you're provided, right? On the open source side, I think that's a great call-out. When we had our midyear event with the accelerator, one of the things I was keen on learning more about was how much they were using our frontier models to advance healthcare and science. And as you saw, a lot of them were still using our open Whisper model, which is a speech-to-text model for transcription, or other open-source ecosystem models, because they were very, very keen on understanding cost at scale. And because the models are developed with sort of Western markets in mind, that makes things like our speech-to-speech API, which is something like $5 an hour to run, cost-prohibitive at scale. Open source has been a very, very good alternative. At the same time, one of the things I've seen work is a hybrid deployment. We have a lot of organizations that are using our open-source models because they're free and they're cheap. And we released a new open-weights model, GPT-OSS, back in August. So again, free, permissive license, and a lot of different labs are releasing them. That helps lower the cost of serving. And not only the cost of serving: in terms of being able to deploy on-prem, if data residency or HIPAA is top of mind, then you can use these open-weights models from a security and cost perspective. However, the other thing to keep in mind with the frontier models is that there's a clear incentive across all the labs to drive down cost. The cost of getting GPT-3.5-like intelligence has dropped over 99% in two years. What that means is that, over time, the cost of getting very, very good intelligence is basically going to be like the cost of running electricity. So right now we see a lot of open-source architectures, but I think over time, as the models get even cheaper, you'll start to see a bit more adoption of those models, where you can flexibly swap each one out. And to the third question: I'm probably the most biased one to answer the question about buying versus building. The one thing I'll say in terms of buy versus build is that building is a lot more challenging than you think. I often see that a lot of places want to build in-house, but they underestimate the development time it takes to build and scaffold a chat-like system and integrate it with the data sources. So generally, I always advise buying over building in-house. And the thing that you would build is not a ChatGPT-like thing for your workforce; it would be a capability that scales whatever mission you are trying to achieve. So I'm going to pause there. Pro open source and pro hybrid deployments, and then generally recommend buy over build, understanding it's always context-specific. I'm going to pause there and welcome my colleagues to add any comments.
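The hybrid pattern described here comes down to a routing rule. A minimal sketch of one possible rule follows; the tier names, the difficulty score, and the routing logic are invented assumptions, not any provider's recommended architecture:

```python
# Sketch of a hybrid deployment router: sensitive or high-volume traffic
# goes to a locally hosted open-weights model, while rare hard cases go
# to a costlier cloud frontier model. All names here are invented.

def route_request(prompt: str, contains_phi: bool, difficulty: float) -> str:
    """Pick a serving tier for one request."""
    if contains_phi:
        # Patient data never leaves the local data center.
        return "local-open-weights"
    if difficulty > 0.8:
        # Rare, hard queries justify the costlier frontier model.
        return "cloud-frontier"
    return "local-open-weights"  # cheap default at scale

requests = [
    ("Summarize this patient's visit notes.", True, 0.4),
    ("Translate this planting-season advisory.", False, 0.2),
    ("Draft a differential for these unusual symptoms.", False, 0.9),
]

for prompt, phi, diff in requests:
    print(f"{route_request(prompt, phi, diff):>18} <- {prompt}")
```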
[00:48:30] Temina Madon: I wanted to add something quickly on open source. I think, in addition to open-sourcing models, there's a level of tooling that domain experts need that the private sector is not building. And so we, with ID Insight, released Evidential, the A-B testing automation platform. But there are other tools that philanthropy will need to fill in the gaps for: things that allow for model observability for domain experts, that allow domain experts to carry out red-teaming exercises where they really push the model to its limits. There is an ecosystem of social sector software developers, like Humane Intelligence and Tattle in India. But somebody needs to fund those. We're trying to do some of that funding. But there is an open-source layer that needs to be developed for these technologies to be useful to the social sector. On build versus buy, I just want to reinforce: we do think that there will be frontier nonprofits that develop products built on very strong data warehouses, where they're able to iterate rapidly, change features, introduce features, and grow the product. Those nonprofits will become product companies, and our hope is that other nonprofits can adopt what they've built. So, for example, Digital Green has built Farmer.Chat. They've put millions of dollars into building a product that is well-tested, with new feature releases coming out every few months. And probably most extension nonprofits should be promoting and adopting that product, rather than building it themselves. Because you can localize, and you can make changes within the deployment strategy that fit your context, but you probably shouldn't be rebuilding from scratch. So over time, I'm hoping that we'll see fewer $500,000 grants funding little projects, and instead see some really strong products emerge from the ecosystem.
[00:50:23] Han Sheng Chia: All right.
[00:50:24] Markus Goldstein: We got three questions from online. I'm just mindful of our time. Let me ask them, and then we'll do some closing thoughts. Is the use of AI in development compatible with the push for locally owned and led approaches? We heard some smatterings of that already. How do we ensure that AI solutions in low-resource settings are still affordable and effective, especially for students? And how do the evaluation methods that we've discussed here address concerns about equity? Do you want to take any of those three?
[00:51:08] Rebecca Sharp: I think those are three related questions, right? Very interrelated. I mentioned this, but I absolutely do think that the process of adopting AI for social good in low-resource contexts is compatible with local ownership, and in fact has to be integrated with local ownership, with hearing the voices of communities and having them co-create these tools. You see that with Karya, whose business model is basically having local enumerators collect local language data that will feed into these AI models; those local enumerators are then actually owners of that intellectual property and receive economic benefits, passing much of the economic benefit of these AI tools to the communities themselves. So there are not just, I think, ways to do this, but I think it's essential that we do this, especially because, as we launch these models in specific communities, so much of what is going to determine their success is whether those communities have buy-in and co-create the tools themselves. And something that we as a research community have been talking about for a long time is sharing the results of the research in a participatory way with the communities that we serve.
[00:52:47] Gabriel Demombynes: I think there's a general point on low-resource settings and how to make these technologies work there. In many, many cases, the technology can work extremely well, but once you put it into the messy environment of the countries that we're working in, it may have no effect, or some unintended consequences. A good example I think about: a couple weeks ago we had a conference with George Washington University on precisely this question of applications of AI for human capital, and one of the highlights was Robert Korom, who is the head of Penda Health in Kenya. He's done some work using AI to advise health workers in clinics, these Penda clinics in Nairobi, and he presented the work. It's an RCT. It has an impressive impact on reducing errors by health workers, so really quite impressive. But then he was very clear in saying that they were only able to implement the solution in these clinics because the clinics have consistent electricity, they have internet connections, they have well-trained health workers, and, most critically, they already had a whole digital medical records system; the system works by giving health workers feedback on the medical record information as they enter it into the computer. This is not the typical situation in health clinics in many countries in sub-Saharan Africa. So it's a beautiful solution, but it's not actually a solution that's going to work in many of the places that need it most, and I think this really speaks to the importance of developing context-specific applications and making sure the evaluations take that context into account.
[00:54:27] Markus Goldstein: Awesome. Okay. I think we have time for two more questions: there, and in the back.
[00:54:40] Natalie Sukar: I'll stand up. Good morning, and thank you for an insightful discussion. My name is Natalie Sukar. I'm with Doctors of the World USA. We are a network of organizations implementing programs in several countries, mainly developing countries, throughout the world. With the increased use of AI for development and health programs, how are new AI systems being designed and developed to safeguard and protect the patient information that is being entered into them? And are there best practices that organizations should follow when using AI for health programming and providing health care? Thank you.
[00:55:33] Ochiko: Good morning. My name is Ochiko, and I'm the head of ICT and policy for NDPI Foundation. The foundation is based in Nigeria, and what we do is provide support around economic development and peace building in Nigeria. We're currently in the process of implementing AI models to help us track the impact and results of our programs in Nigeria. Now, there's an ongoing fear within our team: if we implement all of these AI models, in five years are we going to have work to do? Are we bringing in these models to take away our jobs? So there's a lot of fear that if we actually proceed with this, we're going to be jobless in five years' time. It's a little funny, but it's actually a valid concern within the team. So my question is: is there a risk that AI is actually going to lead to unemployment in maybe five or ten years' time, or is there a way to balance? Because I understand that there's a benefit in deploying AI models, but is there a way to balance how we use these models such that we are not unemployed at some point, but rather are using them to make our jobs more effective?
[00:56:56] Markus Goldstein: Thank you. Thank you. Okay. Panelists, over to you. If you choose to answer one of the questions, great. But also if you want to give us some parting thoughts, since we're at the end of our time. Why don't we start with Mohammed? Sure. We'll work our way back over here.
[00:57:17] Mohammed Husain: Maybe I'll start with the first one. So I believe the crux of the question was around safeguarding the privacy of healthcare data. Two points on that. I alluded to the hybrid infrastructure, where you're going to have some models deployed using the open-weights models. And again, one of the big benefits is you can deploy those anywhere. You can deploy them in your own local environment, in your own local data center. They're not being run on Western servers. So I think that is one solution that we have seen most folks use. It is the fastest and easiest solution. The challenge, then, is that you have to have software engineers who know how to deploy the models and serve them across hundreds or thousands of users. So you have an advantage on the security side, but then it can actually be a little bit more expensive, because you now have to staff it with software engineers and infrastructure engineers. The other side is that we do have cloud-based models. Some cloud-based models are accredited for use on HIPAA and healthcare data, and they follow certain controls about data retention and storage, with encryption at rest and in transit, things like that. So the advantage there is that it's probably a little bit cheaper to serve, because you don't have to download the weights and buy the infrastructure; you can just access an API. The challenge there is reconciling whether that is good enough for your particular security standards. Whenever I talk to ministers of health, a lot of the time it doesn't really matter that it's accredited for HIPAA; they expect African data residency or something like that. So I think over the long term, you'll see more countries and institutions getting the capacity to build and deploy systems on top of open-weights models, and on the other hand, you'll see AI models that are cloud-hosted but hosted in African-based data centers and meeting a stricter set of controls. And that will ultimately give more decision-making control. So right now you sort of have an either-or option; over time, I think you will have a lot of different options. So maybe I'll pause there. And I guess, on the point about jobs: I very firmly don't think that AI is going to take anyone's job, and I mean that very seriously. Especially in these domain-specific institutions, given those axes I mentioned of low-resource languages and domain-specific expertise, it really is going to require human and machine partnership, right? Especially in the healthcare setting, I think the biggest impact AI is going to have is eliminating the cruft from health workers' work so that they have more face time with patients. And even if AI gets superintelligent, there are always going to be new diseases, new issues. We don't see one model that is deployed and fixes every problem. It fixes a set of problems, and then new problems emerge. So it's definitely going to be a bit of a rolling thing, where AI solves one problem and new problems emerge. And because of that, I think the human-machine partnership is a much stronger possibility than the idea that AI is going to solve all the problems and eliminate the need for the human touch. So I'm going to pause there and welcome my colleagues to chime in or contradict.
[01:00:34] Rebecca Sharp: Yeah, I'll borrow Reid Hoffman's framework, which I find really illuminating. He says there are zoomers, bloomers, gloomers, and doomers when it comes to AI; it's actually four categories. And one thing I really think is that we should be both bloomer and gloomer about AI. In other words, we should accurately see the overwhelming potential benefits to humanity of democratizing information, especially in low-resource settings. And we should know that those changes are not even coming; they're already here. And we should be excited about those. At the same time, we should also be quite gloomer, in that we should be skeptical and careful about the unintended consequences of these deployments. And one thing I love about this AI evaluation topic is that whether you are a bloomer or a gloomer, you should believe in evaluation. Every single innovation agenda that we put out into the world should be accompanied by a robust evaluation agenda, because that is the only way we are going to be able to separate what works from what causes harm. That's the only way we are going to be able to realize quickly enough what the best practices and potential negative consequences are, mitigate against them, and deploy hybrid interventions. They may even be AI interventions on top of other AI interventions, to mitigate some of the unintended consequences that we will face. To go to the last question: I don't know. There are literal doomers out in the world, including at the top labs, I think, who are quite concerned about the economic impacts of AI on jobs. And I won't sit here and say that that's not true; I'm not confident in that. But what I will say is that in the short term, for every single leader in any sector, but particularly in the social sector, we need to stay ahead of this technology. And it's far more likely that your job won't be replaced by AI; your job will be replaced by somebody who knows how to use AI better than you. And so I say to my team, and to the person who asked the question about change management: don't do this for ID Insight. Don't try to become AI-native because it benefits the organization. Become AI-native because it benefits you. With the training and the tools that are out there right now, staying on the cutting edge of what is possible in your own job is going to make you the most future-proof worker of tomorrow. And these tools are not tomorrow. They're today. They're here today.
[01:03:28] Markus Goldstein: Awesome. Thank you.
[01:03:30] Temina Madon: I would like all of us representing civil society to think about how we make sure smallholder farmers, and mothers at home taking care of children, learn how to prompt AI and use AI effectively. We need to ensure, as a community of academics, of philanthropists, of nonprofits, that we shape the decision about how AI is used and who knows how to use it. And that's why we are trying to get deployments out, with rigorous evaluation, deployments of AI to people who have otherwise been overlooked, because it's on us. I don't think governments are going to swoop in and teach people how to use AI effectively. I think most companies probably don't have the incentive to invest in making sure their workers are using AI effectively. So, for better or for worse, it's on all of us to make sure that we are developing the skills to use this technology effectively and bend it toward human development, because I don't think that happens without us. So if you are in a foundation and thinking about how to use AI, make sure it serves the people on your team. If you are building within a ministry of health, ideally you're thinking about building not for your program but for the people: building co-pilots, solving people's problems, rather than trying to eliminate their jobs. And again, I think that's a shared responsibility that we all have to adhere to.
[01:04:58] Gabriel Demombynes: Okay, there's so much to respond to. Let me just say two things. First, on the jobs point, which is a huge topic: there's just huge uncertainty as to what's going to happen with jobs as a consequence of AI. I do have a paper on AI exposure for jobs in developing countries. The main conclusion is that whatever happens with jobs, it's going to happen much more slowly in the developing world than in high-income countries, because there's just lower AI exposure overall. I also wanted to raise another issue, which has been on my mind a lot lately: this whole conversation has been focused on particular, narrowly tailored, well-designed applications of AI. And at the same time, we know that out in the real world, AI use is just exploding. We certainly know that from our own personal experience, but it's also happening in the developing world. In developing countries, people are already using AI chatbots to try to learn things, to get medical advice, to look for jobs. And this was driven home for me a few days ago. I should mention that one of the evaluations we're funding is an evaluation of training teachers in Peru to use AI in different ways. So this is a carefully designed evaluation, with, I think, 300 schools, with an RCT. And then I saw just a few days ago that the OECD published the latest TALIS survey, which is a survey of teachers around the world. One of the questions they had in the survey was: are you using AI for your teaching? Peru is not in TALIS, but Colombia, Brazil, and a bunch of other middle-income countries are. And in all of those countries, more than half of teachers said they were already using AI. And that was in 2024, so now, in 2025, it must be even higher. So there are a couple of things there. One is that the control group for our study in Peru is clearly contaminated in some sense; there already is a lot of AI use happening, even in the control group. That's going to change how we interpret our treatment effects out of that study. But it also makes me worry that we're missing the bigger story about how AI is already transforming human development outside the boundaries of our interventions.
[01:07:21] Han Sheng Chia: Great.
[01:07:21] Markus Goldstein: Okay. On that note, round of applause for the panelists, please.
[01:07:31] Han Sheng Chia: Thank you.
[01:07:32] Markus Goldstein: And we're off.