The Parable of the Visiting Impact Evaluation Expert

September 16, 2013

Visitans Perito works at the World Bank as an education specialist, and has just set off on a two-week mission to the country of Peripheria, a poor, landlocked former Soviet republic in Central Asia, about which he knows very little, except that everyone seems to agree it has a totally dysfunctional public school system.

Perito's assignment is to help the Peripherian Ministry of Education design its new Five Year Plan to Achieve High Quality Universal Primary Education, 2014–2018.  The key moment of his trip is a face-to-face meeting with the Minister, who pitches him on a new vision for the Five Year Plan that she would like the World Bank to help fund. 

After several minutes of pleasantries and formalities, the Minister makes her pitch. 

"The government of Peripheria has decided to hire 50,000 new civil service teachers," she says, "to fulfill our president's promise to raise our test scores to the level of Western Europe by 2018. We have listened to your sermons about evidence-based policy and hired an expert local consultant who did a regression analysis of our most recent Annual Primary Examination scores." She flips through a copy of the report on her desk and reads aloud: "'After controlling for socio-economic characteristics, our findings imply that Peripherian public schools with class sizes 10 pupils below the national average of 63 pupils per teacher will score 0.7 standard deviations higher on the APE.' I think you will agree, this sounds like a promising investment for the World Bank, no, Mr. Perito?"

"Well, as you know, Madam," Perito replies, trying not to sound too pedantic, "correlation is not causation. There are many unobserved factors that determine exam performance, and with all due respect, I sincerely doubt your local consultant's regression has identified a causal relationship. To get at causation, we need a randomized controlled trial or a clever natural experiment that accounts for these unobservables. Luckily, economists in America and Israel have—"

"America and Israel, eh?" 

"Yes, they've already done really rigorous, well-identified econometrics using RCTs and regression discontinuity designs — I can explain if you'd like — but they've shown conclusively that reducing class sizes by 10 pupils raises scores by only about 0.27 standard deviations. Given the cost of teachers, I'm not sure the math works out here."

"Hmm," the Minister grunts, displeased.

Perito decides to try a different tack. "Perhaps the Honorable Minister has read the recent DFID white paper on support for low-cost private schools. We at the Bank feel there is a lot of good research behind this proposal." 

"Da," she replies, mumbling something in Peripherian. "They think we should support these disgraceful private schools that are popping up all over the place, and start offering—how do you call them—vouchers?"

"Exactly. My colleagues at the World Bank have actually published a study in the American Economic Review, a very prestigious journal, evaluating a successful voucher program in—"

"Wait a second, Mr. Perito. I think you don't understand the private schools here in Peripheria. They recruit these fundamentalist religious parents who resent our secular state curriculum; all they teach them is religion. They are at the bottom of the rankings on our national exams each and every year."

"I understand Madam, but these anecdotes and correlations can be deceiving without a rigorous evaluation design. As I was saying, my colleagues at the Bank, working with leading academics, have shown that Colombia's secondary school voucher program had a clear causal effect of about 0.2 standard deviations."

"Ah, very interesting," the Minister interjects, nodding approvingly, and Perito thinks to himself that he is beginning to persuade her. "My nephew is applying to study engineering in Columbia. Perhaps you could put in a good word for him?"

The non-parable part of the story

Lant Pritchett and I recently released a new CGD working paper, "Context Matters for Size: Why External Validity Claims and Development Practice Don't Mix." The main point of the paper is that the Minister is wiser than Mr. Perito thinks, and Mr. Perito knows far less about what will raise test scores in Peripheria than he's been led to believe. The Minister is committing the textbook fallacy of inferring causation from correlations in observational data—assuming that because Peripherian public schools with smaller classes score better and private schools score worse, this should guide her investment decisions. Mr. Perito is falling into the opposite fallacy: assuming that well-identified causal effects from experimental and quasi-experimental studies in America, Israel, and Colombia are a good guide to what will work in the (potentially) very different context of Peripheria.

Figure 6 from the paper, reproduced here, shows why. Look at the current literature on these two questions—class size reductions and private schooling—and compare two potential sources of error, measured as mean-squared error (MSE): bias from hastily inferring causation from correlational (i.e., non-experimental) studies, or bias from applying rigorous causal findings to wholly different contexts. For probably 99 percent of policy questions, it's impossible to know which risk is greater. But we picked a couple of questions with multiple well-identified econometric studies from different contexts, allowing us to take a God's-eye view and judge which risk is worse.

The answer is unambiguous: across multiple methodologies we find a consistent pattern, whereby errors from extrapolating across contexts are far greater than errors from using "less rigorous" methodologies within the correct context. (See the paper for more explanation of what's going on in the figure.)
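The trade-off behind that comparison can be sketched with a toy simulation. All of the magnitudes below are invented for illustration (only the 0.7 and 0.27 figures echo the parable's dialogue; the bias and noise terms are assumptions): we compare the MSE of a biased but in-context correlational estimate against an unbiased but out-of-context experimental one.

```python
import random

random.seed(0)

# Hypothetical numbers, loosely inspired by the class-size example:
TRUE_EFFECT = 0.7      # assumed true effect of the policy in Peripheria (SD units)
NAIVE_BIAS = 0.15      # assumed omitted-variable bias of the local OLS study
FOREIGN_EFFECT = 0.27  # well-identified effect, but from a different context
NOISE_SD = 0.05        # sampling noise in any single estimate

def mse(estimates, truth):
    """Mean-squared error of a list of estimates around the truth."""
    return sum((e - truth) ** 2 for e in estimates) / len(estimates)

n = 10_000
# Option A: 'less rigorous' correlational estimate in the right context.
local = [TRUE_EFFECT + NAIVE_BIAS + random.gauss(0, NOISE_SD) for _ in range(n)]
# Option B: rigorous causal estimate transplanted from the wrong context.
foreign = [FOREIGN_EFFECT + random.gauss(0, NOISE_SD) for _ in range(n)]

print(f"MSE, biased local estimate:     {mse(local, TRUE_EFFECT):.3f}")
print(f"MSE, transplanted rigorous one: {mse(foreign, TRUE_EFFECT):.3f}")
```

Under these assumed magnitudes the in-context estimate wins despite its bias, because the gap between contexts (0.7 vs. 0.27) swamps the omitted-variable bias—which is the pattern the paper reports finding in the actual literature.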

In short, there is just no substitute for local knowledge.

What does this mean for people doing and using impact evaluations?

Much of the impact evaluation practice within aid agencies like the World Bank, DFID, etc., exhibits Mr. Perito's tendency to draw broad generalizations from the best studies, prioritizing clean causal inference over relevance. This tendency is increasingly built into the structure of evaluation summaries and even funding opportunities. Here are four examples:   

1. Evidence rankings that focus exclusively on the methodology used for causal inference, ignoring where the evidence comes from. For instance, the US Department of Education's What Works Clearinghouse disqualifies any study that is not randomized from receiving its full seal of approval, "Meets Standards of Evidence."

2. Meta-analyses of impact evaluations of a given policy across diverse contexts that focus on distilling a single average effect. This approach, observed in systematic reviews sponsored by development organizations like DFID and 3ie, implicitly assumes that the huge differences in effects that one observes across contexts are just random noise, and that the Universal Truth is hidden somewhere beneath the data.

3. Funding decisions that put all of the proverbial research eggs in one basket. That makes sense if and only if you think we can generalize broadly from these highly concentrated studies. The whole concept behind initiatives like the World Bank's Strategic Impact Evaluation Fund—which has sponsored many of the most interesting evaluations in the industry—is that resources can be deployed to a handful of cases of interest to produce lessons that are a global public good. We're less convinced of the portability of those lessons.

4. Grand policy conclusions. In our experience, international aid bureaucrats are eager to define "best practices" on an international scale, and are too quick to transplant findings across contexts. But researchers often encourage them. In a famous example, Abhijit Banerjee and Ruimin He proposed a list of proven interventions from randomized and quasi-experimental studies which, they argued, the World Bank should scale up globally—including reduced class sizes, based on the Israeli study cited in the parable. Again, we'd object to a single list of proven interventions across diverse contexts in the messy business of development policymaking.
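Point 2's single average can mislead in a way that is easy to make concrete. The effect sizes below are invented for illustration, not taken from any actual review: a pooled mean can look like a tidy "what works" answer while the context-to-context spread is as large as the mean itself.

```python
# Invented effect sizes (in standard deviations) for the same intervention
# evaluated in six hypothetical countries; purely illustrative numbers.
effects = [0.65, 0.31, -0.10, 0.52, 0.05, 0.27]

mean = sum(effects) / len(effects)
# Sample standard deviation of the effects across contexts.
var = sum((e - mean) ** 2 for e in effects) / (len(effects) - 1)
sd = var ** 0.5

print(f"pooled average effect:               {mean:.2f} SD")
print(f"cross-context spread (SD of effects): {sd:.2f} SD")
```

Here the spread across contexts is about as large as the pooled average, so the "average effect" tells a minister in any one country very little—treating that variation as random noise is exactly the implicit assumption we criticize above.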

So what's the alternative? That's another paper—literally. Lant, Salimah Samji, and Jeffrey Hammer propose an alternative approach to impact evaluation based on much more experimentation in development economics, all within the context of interest and run by the government agency or NGO interested in implementing the project. For now, suffice it to say that we think context matters much more than economists running impact evaluations often suppose. In the future, how we deploy evaluation resources and the kinds of claims we make about our results should reflect that reality.


CGD blog posts reflect the views of the authors, drawing on prior research and experience in their areas of expertise. CGD is a nonpartisan, independent organization and does not take institutional positions.