You’ve designed a program to help more children learn to read, or to reduce the number of women who die in childbirth, or to increase how much wheat farmers grow. You pilot the program. You even invest in a careful evaluation. It works. Fantastic news! You realize that this could benefit so many more people than just the participants in your little pilot, so you convince a government agency to scale it up, or maybe you get funding to scale it up yourself. But at scale, the promised results fail to materialize. So what happened?
Interventions that are effective at scale are the golden nuggets of public policy: valuable and rare, and even apparent winners are often revealed to be fool’s gold. They can be so challenging to find that you could be tempted to throw up your hands and say that “Nothing scales!” While that’s an overstatement (many interventions have had positive impacts at scale), there are also many, many failures. What factors drive the drop (or, in some cases, disappearance) of impacts as programs go from pilot to scale, and how can we avoid them?
New insights on an enduring problem (from 57 perspectives!)
Two recent books seek to add to this conversation: last year’s The Scale-Up Effect in Early Childhood and Public Policy: Why Interventions Lose Impact at Scale and What We Can Do About It (edited by List, Suskind, and Supplee) and this year’s The Voltage Effect: How to Make Good Ideas Great and Great Ideas Scale (by John List). While the first is ostensibly focused on early childhood, the principles—and even some of the chapters—have broad application to public policy programs. The books overlap in their focus (scaling effectively) but differ enormously in their tone and audience. The former is moderately technical, geared towards researchers, program implementers, and maybe a bold policymaker. The latter seeks to reach a general audience, highlighting the basic principles of scaling through stories.
The Scale-Up Effect, an edited volume with 25 contributions, revolves around a framework for understanding why we so often observe a drop in effect size (what implementation researchers call a “voltage drop”) when going from pilot to scale. In their chapter, Al-Ubaydli and others lay out four threats to scaling:
Pilots may appear to have an impact but don’t actually (called a “false positive”). By chance, the pilot participant group just happened to have better outcomes than the comparison group, not because of the program. It happens once in a while.
The program may have changed. Often, when programs go to scale, they’re more sloppily implemented (it’s easier to carefully monitor a program in 100 clinics than in 10,000) or they’re implemented with fewer components (the government can’t afford the full pilot package at scale). If the program changes, then the effect might too. Pilots are often implemented in communities or schools or hospitals where the leadership is excited about participating in pilots (and innovation generally). That enthusiasm may not scale nationwide.
The people who receive the program may be very different in a pilot versus a full-scale implementation. If you pilot a daycare program with the kids who are most in need (for example, the ones who might otherwise be babysat by a television), you might get a very different impact than if you offer it nationwide, which would include a lot of kids who otherwise would be at a daycare center that their parents would pay for.
The program may have different impacts once lots of people are receiving it. For example, in a country where there are very few engineers, an engineering program might have a large impact. But once lots of engineers are trained, training more may have a much smaller effect. One example, not from the book, is education in India: while one estimate of the additional earnings from a year of education is about 13 percent, once lots of people were getting an education, those returns dropped by nearly half.
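The first of these threats is purely statistical, and a quick simulation (my own illustration, not from either book) shows how easily it arises: if we run many small pilots of a program with no true effect, a naive comparison of means will still crown an occasional “winner” by chance alone.

```python
import random
import statistics

# Hypothetical illustration of a "false positive": simulate many small
# pilots of a program with ZERO true effect and count how often a naive
# comparison of group means looks "significant" anyway.

random.seed(42)

def run_pilot(n=30):
    """One pilot: n treated and n comparison children drawn from the SAME
    outcome distribution (the program truly does nothing)."""
    treated = [random.gauss(0, 1) for _ in range(n)]
    control = [random.gauss(0, 1) for _ in range(n)]
    diff = statistics.mean(treated) - statistics.mean(control)
    se = (2 / n) ** 0.5  # standard error of the difference (known sd = 1)
    return diff / se > 1.96  # does it look like a significant positive impact?

false_positives = sum(run_pilot() for _ in range(10_000))
print(f"{false_positives / 10_000:.1%} of null pilots look like winners")
```

A few percent of pilots of a totally ineffective program will look like successes, which is exactly why List’s second, third, and fourth replications (discussed below) matter before scaling.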
One clear demonstration of these effects at work is in the scale-up of a highly effective early childhood development (ECD) program in Jamaica, which involved home visits to encourage cognitive stimulation. The program, implemented with under 70 children, led to large positive impacts—both economic and non-economic—later in life. Araujo and others (Chapter 11; open-access version) explore why the effects dropped by more than two-thirds when implemented among 700 children in Colombia and then even further when implemented among 70,000 children in Peru. There are several possible culprits for the voltage drop. It could be a statistical problem (related to #1 above—maybe more sensitive child development measurement instruments were used in Jamaica), or it could be due to a change in the population of beneficiaries (related to #3 above—the Jamaica trial may have reported just on people who received the home visits, whereas the others included those who were offered it and turned it down). But the most likely culprit seems to be the quality of implementation (#2 above): training and supervision for providers both weakened as the program scaled, while administrative and procurement complexity grew.
The first half of List’s solo-authored book, The Voltage Effect, focuses on these exact traps but illustrates them with a wide range of real-life examples, all told in refreshingly plain language. As an example of a false positive (the first problem listed above), List recounts his experience advising the CEO of the car company Chrysler. They piloted an employee wellness program, which initially showed positive impacts on employee absenteeism and other measures of well-being. But a second, third, and fourth pilot showed no impacts. If they had scaled based solely on the results of the first pilot, many resources would have been wasted.
Much of the strength of List’s book lies in the breathtaking breadth of his experiences. He has been an economic advisor to a U.S. president, the ridesharing companies Uber and Lyft, the airline Virgin Atlantic, Chicago Public Schools, and the government of the Dominican Republic. He opened a preschool. He drove a forklift for a food gift company. (Of course, he did almost all of these things in collaboration, and he’s quick to share credit.) For economist readers, part of the pleasure is just seeing how widely the tools of our trade can be applied. For general readers, he mines these experiences to demonstrate the pitfalls in scaling and how to avoid them.
Where do we go from here?
It’s fine to know what can go wrong with scaling, but what should readers take away from all this? In The Scale-Up Effect, the recommendations for researchers fall into two camps: first, conduct research built for scale; and second, build partnerships with policymakers. Much of the book is focused on specific recommendations related to the former: for example, select your beneficiary and implementer sample with a large-scale population in mind (Davis and others – Chapter 8; Stuart – Chapter 14); gather and report detailed data on exactly what was implemented (Ioannidis et al. – Chapter 7); figure out how much it all costs (although that’s often harder than it looks—Barofsky and others’ commentary); and use measurement tools that can be scaled (McConnell and Goldstein – Chapter 16). But the various contributors also repeatedly emphasize the importance of partnerships between researchers and policymakers, with advice on how to nurture those relationships over time (Carter et al. – Chapter 19; open-access blog post version) and during transitions in policy administrations (Pappas’ commentary).
Yet both books offer much more than counsel to researchers. In The Scale-Up Effect, there’s advice on how to think about descaling bad programs (Chambers and Norton – Chapter 15), advice on how to get parents to take up programs—even if the government scales an early childhood program, it won’t work if people don’t use it (Gennetian – Chapter 4), and so on. The volume ends with separate lists of recommendations for researchers (design studies with an eye toward scaling), policymakers (compare the context of the successful pilot study with your population and implementation ability), program leaders (“facilitate practitioner and community buy-in”), and funders (fund studies that inform scaling and “implementation with an eye toward scaling”) (Kane et al. – Chapter 22; open-access version). (I’ve also prepared a brief summary of each chapter or other contribution in The Scale-Up Effect.) The Voltage Effect covers a range of principles beyond the pitfalls to scaling—incentives, marginal thinking, the value of quitting, and the importance of scalable cultures (aka, what went wrong at Uber)—all illustrated with colorful, real-life tales.
While this is all crucial, it doesn’t answer the separate question of how you convince someone to scale a program (and finance it) in the first place. Here and there, we glimpse the role of politics in all this, such as when Araujo and others highlight that “political pressures led to very rapid expansion targets (often at the cost of quality).” Young and Terra (Chapter 21) provide the only detailed blueprint of how to build the political will to scale a program—stepping aside, for a moment, from the question of effect sizes. In 2003, when Terra was the Minister of Health for the Brazilian state of Rio Grande do Sul, his state launched a home visiting program. Then, in 2016, when Terra was a minister at the federal level, the national adaptation of the program was launched. By 2020, it was reaching nearly a million people. How did they get there? It took more than a decade of building evidence, fostering experts within the government, and putting supportive policies and funding into place. In short, it takes a lot to get to scale.
Both of these books highlighted to me a tension between the value of repeated pilots (often in different settings) to make sure a program is likely to work at scale (Chapters 3, 6, and 7, and elsewhere) and the fact that government policymakers often have to make difficult scale-up (or scale-down) decisions without lots of high-quality replications available (Osborne – Chapter 20). List demonstrates this contrast by sharing his experiences in the private sector (e.g., Chrysler, Lyft, and Uber) and at the ECD center he co-founded, where he was able to run experiments rapidly and then watch action taken based on the results, and then recounting how some of his careful economic analysis in the U.S. government led exactly nowhere due to bureaucracy, entrenched interests, and more. The policy perspectives in The Scale-Up Effect—detailing the lengthy process of building support, the challenges of changing administrations, and the fickleness of financing—were particularly welcome on this front.
I recommend both books. For researchers (like me), The Scale-Up Effect provides lots of practical advice and points to further resources on how to improve evaluations and policy partnerships to maximize impact at scale. Select chapters of The Scale-Up Effect will be of interest to broader groups of readers. For funders and policymakers, The Voltage Effect does more to provide intuition on the potential pitfalls and solutions in scaling. (For researchers, The Voltage Effect still adds value, both in providing simple, clear rhetoric and examples to communicate these concepts and because the range of experiences is fascinating.)
There are so many things that can go wrong at scale that even these two insightful volumes can’t completely dispel the mystery. But they both provide valuable tools, first to evaluate pilot programs in a way that gives the best odds for scale-up, and then to avoid the major pitfalls in the process of getting to scale.
This post also appears on the Development Impact blog. This review is better due to comments from Emma Cameron, Ranil Dissanayake, and Maya Verber.
CGD blog posts reflect the views of the authors, drawing on prior research and experience in their areas of expertise. CGD is a nonpartisan, independent organization and does not take institutional positions.