
Cutting Through the Noise: Early Insights from the Frontier of Nonprofit AI Use

From March 3-5, over two dozen nonprofits and funders convened in Bengaluru to kick off the AI for Global Development Accelerator, which funds and provides technical support to nonprofits using AI to increase the scale and cost-effectiveness of evidence-based, agency-enhancing interventions. The convening was organized by the Agency Fund and Project Tech4Dev, in collaboration with OpenAI and experts at the Center for Global Development.

While the week was anticipated to be highly technical, I arrived in Bengaluru with the lens of a former policymaker and funder, hoping to gain insight into big questions like, “How likely are some of these nonprofits to scale with AI? Is the technology helping them meet the needs of their beneficiaries? How worried should we be about AI hallucinations and other errant behavior? How far away are they from assessing cost-effectiveness with an impact evaluation?” This blog identifies key insights for funders and policymakers from three days of technical presentations by program developers.

1. While there’s growing debate in the US about whether the world is nearing AI that surpasses human intelligence, many nonprofits are already delivering services at scale with current AI technology.

Technology companies have long claimed that artificial general intelligence (AGI)—systems that match human cognitive capabilities—is imminent. Recently, influential figures like Ezra Klein and Biden White House AI Advisor Ben Buchanan have echoed this view. Yet while AGI debates continue in the US, nonprofits are already delivering meaningful AI-powered services without it. For instance, Accelerator grantee Jacaranda Health, which delivers digital coaching to expecting and new mothers, is already handling approximately 12,000 inquiries daily. Another grantee, Digital Green, is currently supporting close to 150,000 farmers through digital extension services. While AGI could dramatically improve these services, meaningful AI-powered development interventions are already being deployed at scale.

Figure 1. Jacaranda Health aims to use AI to screen users' needs and escalate to a human-run help desk during an emergency.

Graph showing the frequency of questions from mothers

Reproduced with permission of Jacaranda Health. Diagram credit: Emmanuel Olang, Jacaranda Health

As exciting as many of these scaled deployments are, these nonprofits are still refining their retention strategies. Designers of digital services know that initial adoption is distinct from sustained user engagement. What will the retention funnel look like, and what benchmarks should be applied—metrics from comparable digital apps or from the non-AI versions of these programs? Even if these AI-powered applications achieve user retention at scale, will they outperform non-AI versions in rigorous impact evaluations? Will future, more advanced models further enhance impact? These remain critical open questions, but early data supports the hypothesis that AI applications can be used at scale despite constraints like limited internet connectivity, limited smartphone access, and nascent AI literacy among users.

2. Accelerator participants are betting AI-driven personalization can boost impact.

REACH, another Accelerator participant working on maternal health, provides real-time pregnancy support and health information to women via WhatsApp/SMS. Even before introducing AI, their service allowed users to browse over 340 pregnancy-related topics, receive antenatal appointment reminders, and connect with support agents. This product is already deployed at scale, reaching 63 percent of all pregnant women attending their first antenatal appointment in South Africa. REACH hypothesizes that integrating generative AI will enable more dynamic, personalized conversations, leading users to find the service more valuable and, in turn, increasing engagement and the adoption of healthier behaviors at scale.

Figure 2. REACH hypothesizes that more personalized communication with users will lead to greater retention on their platform and, ultimately, higher adoption of health-seeking practices.

Line graph showing correlation between the number of users and number of days on the platform

Diagram reproduced with the permission of REACH. Diagram credit: Daniel Futerman

The ambition for increased personalization is common across multiple use cases. For instance, in the agriculture sector, Precision Development and Digital Green are using generative AI to provide tailored agricultural advice to smallholder farmers, while education nonprofit Youth Impact aims to deliver targeted numeracy assessments and instruction by phone. Rocket Learning, an early childhood development nonprofit, expects, based on a previous study, that personalized nudges can boost parental engagement with a child's homework by 5 percent compared to uniform nudging, an expectation that's broadly in line with the nudge literature. Though development funding often operates in sector silos, this Accelerator may identify common AI-driven personalization strategies and products that cut across sectors.

3. Policymakers considering AI-powered interventions worry about inconsistent AI behavior, but techniques exist to measure and manage such unpredictability.

Application developers I met were clear-eyed that AI technology can behave inconsistently—generative AI models rarely meet intended behaviors "out of the box." The critical question for developers isn't whether the technology deviates from the designer's intent, but whether their mitigation strategies sufficiently ensure quality outcomes, particularly for vulnerable populations. For example, a healthcare nonprofit building an AI-driven hotline must ensure the application accurately distinguishes between casual conversation and multiple types of medical inquiries in order to provide an appropriate response. They must therefore consider how effectively AI systems classify user intent, and how concerned they should be about potential aberrant behavior; a minimal sketch of what such intent screening might look like follows Figure 3.

Figure 3. Jacaranda Health's approach to integrating generative AI to screen user questions and provide timely assistance over voice and text. Reproduced with permission of Jacaranda Health.

Illustration showing Jacaranda Health's approach to integrating generative AI

Diagram credit: Emmanuel Olang, Jacaranda Health
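
To make this concrete, here is a minimal sketch of how an application might screen user messages by intent with a general-purpose LLM API. The category labels, prompt, and model name are illustrative assumptions for this post, not Jacaranda Health's actual implementation:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical triage categories for illustration only.
CATEGORIES = ["small_talk", "routine_question", "urgent_medical"]

SYSTEM_PROMPT = (
    "You are a triage classifier for a maternal health helpline. "
    "Classify the user's message as exactly one of: "
    + ", ".join(CATEGORIES)
    + ". Respond with the category name only."
)

def classify_intent(message: str) -> str:
    """Return one of CATEGORIES for a single user message."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # illustrative model choice
        temperature=0,         # reduce output variability for classification
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": message},
        ],
    )
    label = response.choices[0].message.content.strip()
    # Route anything off-menu to a human rather than guessing.
    return label if label in CATEGORIES else "needs_human_review"

if __name__ == "__main__":
    print(classify_intent("My baby has a high fever and won't feed."))
```

In practice, deployed systems layer further safeguards on top of a classifier like this, such as escalating any ambiguous or off-menu answer to a human help desk.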

At the Bengaluru workshop, I saw developers applying, and evaluating the effectiveness of, methods to constrain unintended AI behavior: fine-tuning, retrieval-augmented generation (which lets AI models search curated knowledge bases), and function calling. Several collaborators and I will detail these approaches and how to evaluate them in an upcoming blog. My key takeaway is that evaluations quantifying how frequently models deviate from intended behaviors in specific contexts can give policymakers greater comfort. While details and context matter, policymakers may find these error rates acceptable, especially given AI's speed and cost advantages, and particularly if baseline human performance was already limited. For example, an AI hotline distinguishing small talk from genuine medical inquiries might show error rates comparable to those of call center staff operating in resource-constrained settings.
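
As an illustration of what such an evaluation might look like, the sketch below measures how often a classifier (here reusing the hypothetical classify_intent from the sketch above) deviates from human-assigned labels on a held-out set. The labeled examples are invented; a real evaluation would use hundreds of annotated conversations:

```python
from collections import Counter

def evaluate_classifier(classify, labeled_examples):
    """Measure how often a classifier deviates from human-assigned labels.

    labeled_examples: (message, expected_label) pairs, e.g. a held-out
    set of real conversations annotated by clinical staff.
    """
    confusion = Counter()
    mistakes = 0
    for message, expected in labeled_examples:
        predicted = classify(message)
        if predicted != expected:
            mistakes += 1
            confusion[(expected, predicted)] += 1
    error_rate = mistakes / len(labeled_examples) if labeled_examples else 0.0
    return error_rate, confusion

# Hypothetical labeled set; a real evaluation would use hundreds of examples.
examples = [
    ("Hello, how are you today?", "small_talk"),
    ("When is my next clinic visit?", "routine_question"),
    ("I am bleeding heavily, what should I do?", "urgent_medical"),
]

# `classify_intent` is the sketch above; any callable would work here.
rate, confusion = evaluate_classifier(classify_intent, examples)
print(f"Deviation rate: {rate:.1%}")
for (expected, predicted), n in confusion.most_common():
    print(f"  expected {expected}, got {predicted}: {n} time(s)")
```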

The common narrative describes AI as a “black box” whose behavior we don't fully understand. While broadly accurate, this framing obscures the fact that developers can still evaluate how often and under what conditions an AI application meets intended performance. Given such visibility, policymakers and funders could start requiring vendors to provide model evaluation metrics highlighting rates of behavior deviation. This would shift discussions beyond general fears about AI hallucination and inconsistency, enabling clearer quantification of tradeoffs.

Figure 4. Volume of queries by agricultural topic and month on Digital Green's Farmer.Chat application.


 [Source]

Additionally, the digital nature of many of these interventions means that the program is highly observable in real time. This can mean greater auditability of interactions between users and service providers. It also means that program implementers and policymakers can have greater visibility into user needs and can deliver more timely interventions. For example, Figure 4 above shows how interactions on Digital Green's digital platform can be quickly coded to reveal that farmers and extension workers in Kenya engage with the tool most in March, during peak growing season, and are most concerned about pests and disease during that period. A simple sketch of this kind of topic coding follows.
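
Here is a minimal sketch of how query logs might be coded by topic and aggregated by month to produce a view like Figure 4. The query log, topic labels, and keyword rules are all hypothetical; a production system might use an LLM or a trained classifier to assign topics instead:

```python
from collections import Counter
from datetime import datetime

# Hypothetical query log; a real platform would pull this from its database.
query_log = [
    ("2024-03-04", "How do I treat fall armyworm on my maize?"),
    ("2024-03-11", "Which fertilizer should I apply at planting?"),
    ("2024-07-02", "When should I harvest my beans?"),
]

# Illustrative keyword-to-topic rules for this sketch only.
TOPIC_KEYWORDS = {
    "pests_and_disease": ("armyworm", "pest", "disease", "blight"),
    "fertilizer": ("fertilizer", "manure"),
    "harvest": ("harvest",),
}

def code_topic(question: str) -> str:
    """Assign a coarse topic label to a single query."""
    q = question.lower()
    for topic, keywords in TOPIC_KEYWORDS.items():
        if any(k in q for k in keywords):
            return topic
    return "other"

# Aggregate query counts by (month, topic), as in Figure 4.
counts = Counter()
for date_str, question in query_log:
    month = datetime.strptime(date_str, "%Y-%m-%d").strftime("%Y-%m")
    counts[(month, code_topic(question))] += 1

for (month, topic), n in sorted(counts.items()):
    print(f"{month}  {topic}: {n}")
```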

Accelerator participants are exploring whether generative AI can enhance the cost-effectiveness of their interventions at scale. Though it's too early to measure effects on development outcomes, they hypothesize that AI-driven personalization can boost user retention and, ultimately, impact. Additionally, despite the potential for erratic AI behavior, organizations are implementing monitoring and correction tools to manage and mitigate such risks.

Disclaimer

CGD blog posts reflect the views of the authors, drawing on prior research and experience in their areas of expertise. CGD is a nonpartisan, independent organization and does not take institutional positions.


Image credit for social media/web: Poco_bw / Adobe Stock