With the US Congress considering cuts to foreign assistance and aid budgets in other donor countries coming under increased pressure, evidence about what works in global development is more important than ever. Evidence should inform decisions on where to allocate scarce resources—but to do so, evaluations must be of good quality. The evaluation community has made tremendous progress on quality over the past decade. Several funders have implemented new evaluation policies and most are conducting more evaluations than ever before. But less is known about how well aid agencies are evaluating programs.
To fill this gap, we—together with our colleagues Julia Raifman Goldberg, Felix Lam, and Alex Radunsky—set out to assess the quality of global health evaluations (both performance and impact evaluations). We looked specifically at publicly available evaluations of large-scale health programs from five major funders: USAID, the Global Fund, PEPFAR, DFID, and IDA at the World Bank. We describe our findings in a new CGD Working Paper and accompanying brief; a quick recap of our findings follows below.
What types of evaluations are aid agencies conducting?
We identified a total of 299 evaluations of global health programs published between 2009 and 2014. One feature stood out to us: performance evaluations made up an overwhelming majority (91 percent), with impact evaluations accounting for less than 10 percent. This is comparable to the share found across USAID evaluations in all sectors by an earlier study. And among impact evaluations, those using experimental methods—randomized controlled trials, or RCTs—constituted a small minority (we found only five RCTs). When looking at evaluations commissioned or conducted by major funders, the often-made criticism that RCTs are displacing other forms of evaluation doesn’t hold up.
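The shares above can be checked with simple arithmetic. The sketch below recomputes the approximate counts from the figures stated in the text (299 evaluations, 91 percent performance evaluations, five RCTs); it introduces no data beyond those figures.

```python
# Back-of-the-envelope check of the evaluation shares reported above.
# All inputs are the figures stated in the text.
total = 299
performance_share = 0.91
rcts = 5

performance = round(total * performance_share)  # ~272 performance evaluations
impact = total - performance                    # remaining evaluations are impact evaluations

print(f"Performance evaluations: ~{performance} ({performance / total:.0%})")
print(f"Impact evaluations: ~{impact} ({impact / total:.0%})")
print(f"RCTs: {rcts} ({rcts / impact:.0%} of impact evaluations)")
```

This yields roughly 272 performance evaluations and 27 impact evaluations (about 9 percent of the total), consistent with the "less than 10 percent" figure.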
How well are aid agencies evaluating global health programs?
We randomly sampled 37 evaluations and applied a standardized assessment approach, with two reviewers rating each evaluation. To answer questions about evaluation quality, we used three criteria from the evaluation literature: relevance, validity, and reliability. We considered an evaluation relevant if it addressed questions related to the means or ends of an intervention and used appropriate data to answer those questions. We considered an evaluation valid if its analyses were methodologically sound and its conclusions were derived logically and consistently from the findings. And we considered an evaluation reliable if its method and analysis would be likely to yield similar conclusions if the evaluation were repeated in the same or a similar context.
We constructed four aggregate scores (on a three-point scale) to correspond with these criteria. Overall, we found that most evaluations did not meet social science standards in terms of relevance, validity, and reliability; only a relatively small share of evaluations received a high score.
Looking across different types of evaluations, we found that impact evaluations generally scored better than performance evaluations on measures of validity and reliability.
What can aid agencies do better going forward?
Building on our analysis, we developed 10 recommendations for aid agency staff overseeing and managing evaluations to improve quality.
1. Classify the evaluation purpose by including this information in the title and abstract, as well as in coding/tagging categories on the agency website.
2. Discuss evaluator independence by acknowledging the evaluators’ institutional affiliations and any financial conflicts of interest.
3. Disclose the costs and duration of programs and evaluations.
4. Plan and design the evaluation before program implementation begins; we found that early planning was associated with higher evaluation quality.
5. State the evaluation question(s) clearly to ensure the right kinds of data are collected and an appropriate methodology is used.
6. Explain the theoretical framework underlying the evaluation.
7. Explain sampling and data collection methods so subsequent researchers can apply them in another context and readers can judge the likelihood of bias.
8. Improve data collection methods by using purposeful or random sampling, where possible, to provide more confidence in findings.
9. Triangulate findings using varied sources of qualitative and quantitative data.
10. Be transparent on data and ethics by publishing data in usable formats and taking appropriate measures to protect privacy and ensure confidentiality.
This set of recommendations draws on the high-quality evaluations we found in our sample. These examples showed that it is possible to conduct good-quality evaluations across a range of methodologies and purposes. In many cases, quality improvement is possible within existing budgets by planning early or using better data collection approaches. Taking steps to improve quality can help ensure that evaluations promote learning about what works and hold funders and implementers accountable—with an eye toward increasing value for money and maximizing development impact.