
Can We Benchmark Development Agencies on Impact?

With development budgets under pressure worldwide, ensuring aid has impact has never been more important, and significant advances have been made in understanding which aid approaches work best. But a crucial question remains: Are agencies actually incorporating these insights into their programs?

Here, we put forward a potential framework for measuring the extent to which development agencies are using high-impact approaches—moving beyond process evaluations toward outcome measurement. By examining how agencies' largest programs align with evidence-based "smart buys," this approach could inform more targeted aid allocation, both within agencies and in choices between multilaterals. We propose piloting this methodology in selected sectors to test its feasibility before considering broader application, and we're actively seeking feedback and partners interested in collaborating on this initiative.

Why does this matter?

Understanding which agencies use evidence-based approaches matters for multiple stakeholders. For bilateral agencies, it provides critical feedback on their performance and resource allocation. And for partner countries and ultimate beneficiaries, it could mean the difference between programs that transform lives and those that merely look good on paper.

The difference between effective and ineffective approaches can be stark. In primary education, the Global Education Evidence Advisory Panel's "Smart Buys" report reveals dramatic variations in return on investment. "Structured pedagogy" programs that provide teacher guides, student materials, and proper training can yield learning gains equivalent to 3-4 years of high-quality schooling per $100 spent. In contrast, investments in computer hardware without complementary software or teacher training show virtually no impact on learning despite costing substantially more—sometimes exceeding $100 per student annually with no measurable benefits. This represents an enormous opportunity cost that education systems can ill afford.

What do we know already?

Existing benchmarks for development agencies—such as OECD peer reviews, MOPAN assessments, and CGD's QuODA (Quality of Official Development Assistance)—provide valuable insights and have driven important reforms in aid practice. But these assessments often rely on processes rather than outcomes. They evaluate factors like allocation of aid to countries most in need, knowledge management systems, and financial resource efficiency—all necessary but perhaps insufficient elements of effective aid management.

The OECD's Development Assistance Committee (DAC) Peer Reviews examine members every 5-7 years, focusing mainly on strategy alignment, management processes, and policy coherence rather than development results. MOPAN (Multilateral Organisation Performance Assessment Network) evaluates multilateral organizations through stakeholder surveys and document reviews, but emphasizes organizational systems over concrete impacts. QuODA compares donors on quantitative indicators such as aid transparency, priority-setting, and country ownership—valuable metrics, but not direct measures of development impact.

But a theory of change requires a clear explanation of how activities lead to outcomes and, ultimately, long-term impact. These frameworks do not assess that critical element. While evidence exists at project and program levels through impact evaluations and value for money assessments, no comprehensive framework benchmarks agencies on their actual development impact.

How much do we know about what works?

"Smart buy" lists that identify cost-effective interventions are increasingly available across various sectors. These typically build on systematic reviews that factor in both outcome effects and costs. Organizations like WHO, World Bank, 3ie, and J-PAL regularly produce such evidence.

In some sectors, comprehensive processes have already generated authoritative smart buy lists—such as the Disease Control Priorities (DCP-3 and DCP-4) in health and the Global Education Evidence Advisory Panel (GEEAP) in education. In other sectors, the evidence exists but remains less systematically compiled. We encourage donors to close these gaps by building on existing resources like 3ie evidence gap maps to create comprehensive smart buy resources. We include some specific examples of high-impact approaches at the end of this blog.

While evidence-based "smart buys" provide valuable guidance, three important limitations warrant attention:

  1. Evidence-based lists are no substitute for country ownership: interventions should begin with partner country priorities, particularly where they relate to government services.
  2. The value of any "smart buy" depends critically on contextual gaps; tuberculosis screening that shows remarkable cost-effectiveness in one setting may be redundant in another where such services are already adequately provided.
  3. We acknowledge the inherent "measurability bias" in evidence-based approaches. Encouragingly, even traditionally qualitative sectors such as governance and female empowerment now benefit from quantitative assessment frameworks.

How could we assess whether agencies are using these approaches?

The fundamental question for an impact benchmark is straightforward: to what degree does an agency support programs that rank among the most cost-effective interventions (smart buys) in their respective fields?

Unlike process-oriented assessments such as MOPAN and QuODA, which can leverage existing databases (such as OECD DAC systems), an impact benchmark would need to start from scratch. This presents challenges in both identifying definitive smart buys lists and assessing programs against them. CGD colleagues have been working on an initial attempt to assess providers' use of cost-effective health aid, and others are exploring its feasibility in education.

Given the complexity, a focused approach is essential. This could include the following tactics:

  • Focusing initially on the largest development agencies
  • Targeting only the thematic sectors receiving the highest ODA volume (beginning with a pilot in one or two domains)
  • Examining (initially) only the top ~20 programs funded per agency by volume

Each selected sector would be evaluated by a panel of experts with deep knowledge of the impact evaluation literature. The assessment could employ a Delphi methodology (a structured method to elicit expert opinion), heavily weighted toward existing systematic reviews based on robust impact evaluations. This approach ensures that recommendations are grounded in comprehensive evidence while minimizing new research requirements. If program details are lacking, those projects could be excluded; alternatively, we could assume that programs that don't mention a high-impact approach are unlikely to be using one.
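
As a toy illustration of the Delphi element, here is a minimal Python sketch of how one round of expert ratings might be aggregated; the interventions, scores, and convergence logic are all invented.

    from statistics import median, quantiles

    # Hypothetical ratings (1-10 cost-effectiveness scores) from a panel of
    # five experts for two candidate interventions; names are illustrative.
    ratings = {
        "structured_pedagogy": [9, 8, 9, 7, 9],
        "hardware_only_ict": [2, 3, 1, 2, 4],
    }

    for intervention, scores in ratings.items():
        mid = median(scores)
        q1, _, q3 = quantiles(scores, n=4)  # interquartile range gauges consensus
        print(f"{intervention}: median={mid}, IQR={q3 - q1:.1f}")
        # In a Delphi process, medians and IQRs are fed back to the panel,
        # which re-rates until the spread falls below a convergence threshold.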

Key evidence sources would include the World Bank’s Smart Buys in Education, Disease Control Priorities (DCP-3), 3ie systematic reviews, VoxDev Literature Reviews, and J-PAL policy insights. Other “best buy” papers exist—at the UK’s FCDO, for example—but are largely unpublished. The work could also integrate existing studies of how development programs compare against a cash benchmark, where interventions achieving higher cash multiples are identified as leading programs (as employed at USAID, for example).

An initial analysis identifies two viable implementation pathways, each with distinct advantages. The first, which we recommend, prioritizes simplicity and expert assessment, while the second leverages machine learning to expand analytical scope.

Approach 1: Expert assessments

Our preferred approach would be to systematically evaluate each donor's largest interventions. Evaluators would first identify the 10-20 largest programs for each donor using the International Aid Transparency Initiative (IATI) registry, which standardizes reporting across major donors. These programs typically represent a significant portion of a donor's portfolio and offer meaningful insight into their strategic priorities and evidence-based programming approach.
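
As a rough illustration, the program-selection step might look like the following sketch. It assumes a local CSV export of IATI activity data with hypothetical column names (reporting_org, title, total_commitment_usd); the actual IATI schema would need to be mapped to these fields.

    import csv
    from collections import defaultdict

    # Rank each donor's largest programs from a local export of IATI
    # activity data. The file name and column names are assumptions,
    # not the actual IATI schema.
    TOP_N = 20

    programs_by_donor = defaultdict(list)
    with open("iati_activities.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            programs_by_donor[row["reporting_org"]].append(
                (row["title"], float(row["total_commitment_usd"]))
            )

    for donor, programs in programs_by_donor.items():
        largest = sorted(programs, key=lambda p: p[1], reverse=True)[:TOP_N]
        print(donor, [title for title, _ in largest])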

Experts would compare each identified program against the top 10 most impactful interventions in the relevant sector, as determined through the Delphi consultation. This comparison would use a nuanced grading system that recognizes varying degrees of alignment with evidence-based practices. Programs precisely matching proven interventions would receive 100 percent alignment scores, while those incorporating key elements but differing in some aspects might receive 75 percent or 50 percent scores, depending on the extent of alignment. (A practical limitation here is that IATI data may not show what share of a program's budget goes to smart buys when programs include multiple elements. Additional data—perhaps from agency databases—would be needed to fill this gap.)
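
A minimal sketch of this grading scheme follows; the category names are illustrative, while the weights mirror the percentages above.

    # Graded alignment scoring as described above.
    ALIGNMENT_WEIGHTS = {
        "exact_match": 1.00,     # precisely matches a proven intervention
        "strong_overlap": 0.75,  # key elements present, differs in some aspects
        "partial_overlap": 0.50,
        "no_match": 0.00,
    }

    def alignment_score(expert_grade: str) -> float:
        """Translate an expert's categorical grade into a numeric score."""
        return ALIGNMENT_WEIGHTS[expert_grade]

    grades = ["exact_match", "no_match", "strong_overlap"]  # one per program
    scores = [alignment_score(g) for g in grades]
    print(f"mean alignment: {sum(scores) / len(scores):.2f}")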

The final output would quantify smart buys adoption for each major donor (e.g., "20 percent of the largest programs qualify as smart buys"). To address political sensitivities around potentially low adoption rates, the final presentation could translate these percentages into categories like "very good," "good," etc., providing a more positive framing while maintaining accountability. These category thresholds could adjust over time as smart buys adoption increases.
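
The roll-up from per-program scores to a donor-level rating could then be as simple as the following sketch; the qualification threshold and category bands are placeholders, not proposed values.

    # Roll up per-program scores into a donor-level adoption rate and a
    # categorical rating. Thresholds here are placeholders.
    def adoption_rate(program_scores: list[float], threshold: float = 0.75) -> float:
        qualifying = sum(1 for s in program_scores if s >= threshold)
        return qualifying / len(program_scores)

    def rating(rate: float) -> str:
        bands = [(0.40, "very good"), (0.25, "good"), (0.10, "fair")]
        return next((label for cutoff, label in bands if rate >= cutoff),
                    "needs improvement")

    scores = [1.0, 0.0, 0.75, 0.5, 0.0]  # illustrative per-program scores
    rate = adoption_rate(scores)
    print(f"{rate:.0%} of largest programs qualify -> rated '{rating(rate)}'")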

Crucially, this approach would incorporate multilateral contributions into the assessment framework. For bilateral donors like Germany, Sweden, or the United Kingdom, the evaluation would consider both direct bilateral programs and contributions to multilateral organizations, reflecting the significant aid channeled through these institutions. Multilateral organizations themselves would undergo separate assessments, creating a coherent evaluation framework capturing the full spectrum of development assistance flows.
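
As a sketch of how this blending might work, the following weights a donor's direct score and the scores of the multilaterals it funds by the share of aid flowing through each channel; all figures are invented.

    # Blend a bilateral donor's direct score with the scores of the
    # multilaterals it funds, weighted by funding shares.
    def blended_score(bilateral_score: float,
                      multilateral_scores: dict[str, float],
                      shares: dict[str, float]) -> float:
        bilateral_share = 1.0 - sum(shares.values())
        score = bilateral_share * bilateral_score
        for org, org_score in multilateral_scores.items():
            score += shares[org] * org_score
        return score

    # 75% of aid spent bilaterally, 20% through IDA, 5% through Gavi:
    print(blended_score(0.30, {"IDA": 0.55, "Gavi": 0.70},
                        {"IDA": 0.20, "Gavi": 0.05}))  # ≈ 0.37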

Approach 2: Quantified assessments

An alternative methodology would leverage quantified assessments, using either keyword matching or LLMs (large language models), to analyze a much broader project range—potentially the top 100 or even 1,000 projects from each donor—and identify those likely to be employing high-impact approaches.

This study on cost-effectiveness in global health demonstrates the feasibility of matching projects against the cost-effective interventions in DCP-3 through keyword matching, and can serve as an inspiration here.
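
A toy version of such keyword matching might look like the sketch below; the keywords are illustrative and are not drawn from DCP-3 itself.

    # Flag projects whose descriptions mention terms drawn from a smart
    # buys list. Keywords are invented for illustration.
    SMART_BUY_KEYWORDS = {
        "structured pedagogy": ["structured pedagogy", "teacher guide",
                                "scripted lesson"],
        "teaching at the right level": ["teaching at the right level", "tarl"],
    }

    def match_smart_buys(description: str) -> list[str]:
        # Naive substring matching; a real pipeline would add phrase
        # variants, stemming, and manual spot checks for false positives.
        text = description.lower()
        return [buy for buy, terms in SMART_BUY_KEYWORDS.items()
                if any(term in text for term in terms)]

    print(match_smart_buys("Provide teacher guides and scripted lessons for Grade 2"))
    # -> ['structured pedagogy']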

Beyond keyword matching, LLM analysis might be an alternative. LLMs could be prompted or fine-tuned with expert-validated examples of highly effective interventions, creating a matching process capable of identifying programs that align with evidence-based practices. The automated screening would examine project descriptions, objectives, and methodologies from the IATI database, comparing them against templates of proven interventions developed through expert consultation.
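
As an illustration of the screening step, the following sketch assembles a classification prompt from intervention templates. The templates are placeholders, and the call to a model provider is deliberately left out, since it would depend on the provider and client library chosen.

    # Assemble a screening prompt from smart buy templates; all strings
    # below are placeholders for expert-developed descriptions.
    SMART_BUY_TEMPLATES = [
        "Structured pedagogy: teacher guides, student materials, and training.",
        "Teaching at the Right Level: grouping students by learning level.",
    ]

    def build_screening_prompt(project_description: str) -> str:
        buys = "\n".join(f"- {t}" for t in SMART_BUY_TEMPLATES)
        return (
            "You are screening aid projects against evidence-based interventions.\n"
            f"Known smart buys:\n{buys}\n\n"
            f"Project description:\n{project_description}\n\n"
            'Reply as JSON: {"match": <smart buy or "none">, "confidence": <0-1>}'
        )

    prompt = build_screening_prompt("Distribute teacher guides and train facilitators.")
    # response = llm_client.complete(prompt)  # hypothetical provider call
    print(prompt)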

This approach offers several advantages, notably the ability to assess a much larger portion of each donor's portfolio and potentially identify effective programs that might be overlooked in a more limited manual review. However, it would require careful validation and a pilot phase to ensure assessment accuracy.

The two approaches could also be effectively combined for optimal results. The expert-only process would act as the first pilot, establishing methodological rigor and generating a validated dataset of assessments. This foundation could then be scaled up with support from keyword matching or LLMs, enabling a much broader analysis once the system is calibrated with expert judgments.
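
The calibration step could start with something as simple as measuring agreement between automated and expert labels on the pilot dataset, as in this sketch with invented labels.

    # Compare automated labels against the expert-validated pilot dataset
    # before scaling up. Labels here are invented for illustration.
    expert_labels = [True, False, True, True, False, False]
    automated_labels = [True, False, False, True, True, False]

    tp = sum(e and a for e, a in zip(expert_labels, automated_labels))
    fp = sum(a and not e for e, a in zip(expert_labels, automated_labels))
    fn = sum(e and not a for e, a in zip(expert_labels, automated_labels))

    precision = tp / (tp + fp)  # share of flagged programs experts agree with
    recall = tp / (tp + fn)     # share of expert-identified programs caught
    print(f"precision={precision:.2f}, recall={recall:.2f}")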

These measures of the prevalence of high-impact approaches could also be complemented with a framework that factors in other important variables (like a country's income level, recognizing that resources deployed in the poorest countries often yield greater impact, as dollars stretch further and complement limited state capacity). This supplementary approach would remain distinct from the core methodology but could provide valuable additional context when comparing intervention effectiveness across diverse economic settings. For example, QuODA (mentioned above) could add measures on impact alongside prioritization, transparency, and ownership to give a fuller picture.

The way forward

A credible benchmark of agency impact could drive meaningful reform by creating clearer incentives to maximize the use of proven approaches. By moving evaluations toward outcome measurement, we can help decision-makers improve their resource allocations and focus limited development resources to deliver the greatest possible benefits for the world's poorest and most vulnerable populations.

We hope to pilot this methodology in one or two thematic sectors to test its feasibility; identify issues; and determine whether it yields sound, actionable conclusions before considering broader application. We're seeking partners interested in collaborating on this initiative. Please email us at [email protected] and [email protected] if you’d be interested in partnering.

Notable examples of high-impact programs

Poverty reduction: BRAC's Ultra Poor Graduation approach offers a scientifically proven path out of extreme poverty by combining cash transfers with productive assets, training, and social integration. Successfully implemented in over 50 countries, it's recognized by the World Bank as one of the most promising sustainable poverty reduction models.

Global health: Health taxation on harmful products like tobacco and alcohol generates revenue while reducing consumption, with WHO classifying this as a "best buy." Vaccination programs consistently rank high in effectiveness evaluations.

Education: "Teaching at the Right Level" addresses the mismatch between student knowledge and standardized curricula, systematically eliminating educational barriers and significantly increasing the effectiveness of education investments.

Agriculture and nutrition: Strategic investments in agricultural research show exceptionally high returns with broad geographic impact, strengthening food security while promoting climate resilience, with USAID investments yielding 8x returns.

We are grateful to Biniam Bedasso, Rachel Glennerster, Lee Crawfurd, Tom Drake, Basil Müller, Pascal Roelcke, and Johanna Wicke for comments on an earlier version of this blog. All views and any errors remain those of the authors.


