Will Raising Test Scores in Developing Countries Produce More Health, Wealth, and Happiness Later in Life?

Introducing CGD’s new Return to Learning Initiative, tracking long-term outcomes from education experiments

It’s rare to read an education report that doesn’t mention the learning crisis. As data on low learning levels have emerged in recent years, global education aid has swung its focus sharply toward improving test scores among primary school children. Of course, learning to read is a good thing in its own right. But when competition for scarce education resources is fierce, does this focus on foundational literacy pay off in terms of increased earnings, health, or other dimensions of well-being in the long term? At present, we do not have a fully credible answer to this question.

By the mid-2010s, over a dozen successful randomised controlled trials had been conducted on programs aiming to raise reading and mathematics scores among primary school students in the developing world. That means children in some of those early trials are now in the labour market. If they can be successfully tracked a decade or more later (a big if), they can start to answer a very basic question with obvious policy importance: if you learn more in second grade, do you earn more later?

We don’t really know if teaching kids in, say, rural Liberia to read faster or earlier is going to make them any richer

There’s evidence from rich countries to suggest that cognitive skills are important determinants of future earnings and that higher-skilled workers, with the same level of schooling, do earn more. We also see kids who are supported to develop skills in primary school go on to benefit from these later in life. Well-designed early childhood programs can have long-lasting positive effects on children too, even where short-term impacts appear to fade.

But does the same relationship hold across countries? Some of the correlational evidence from low- and middle-income countries suggests maybe it does, although that is typically for skills measured in adulthood, not while at school. Longitudinal data from northwestern China have been used to show that childhood cognitive skills have strong explanatory power for the wages of adults in their late 20s, even after controlling for years of education.

But we also have observational evidence from poor countries suggesting that the answer might be different. Recent work by Jishnu Das, Abhijeet Singh and Andres Yi Chang across five low- and middle-income countries shows that higher test scores at age 12 are associated with higher college attendance at age 22. But this relationship appears weaker than the predictive power of test scores for college attendance in rich countries. In lower-income countries, pupils’ family wealth has a bigger role in their eventual educational attainment. Other studies in individual countries (Peru, Tajikistan, Uzbekistan, Colombia) show mixed impacts of cognitive skills on adult outcomes.

Long-term tracking should allow us to rule out (or not) other reasons for scepticism about the value of early-grade reading and numeracy programs. One major concern is fade-out: learning gains in second grade might just be gone by fourth grade. Another is the possibility that the correlation between test scores and income is entirely spurious, driven by unobserved family wealth and social networks. A third worry is that labour markets just don’t reward academic performance, and kids aren’t learning skills relevant to the real world.

Positive findings from long-term tracking could potentially put all of those fears to rest, showing that learning gains have real causal impacts on real-world outcomes. But it’s not going to be easy.

Challenge #1: Tracking kids. This year we’re running pilots in Mali, Liberia, Ghana, and Uganda to test our ability to find kids in old experiments that weren’t designed for long-term tracking.

The obvious approach to studying the impact of basic skills on adult outcomes would be to start a new round of randomised trials on promising programs to improve learning outcomes, build in strong tracking protocols, and wait for those kids to grow up. We should certainly do more of this. And for researchers who are interested, J-PAL and others are opening up opportunities for funding and guidance to do so.

But we’re impatient. The policy questions around investments in basic education are urgent, and we don’t want to wait another ten to fifteen years to get started. Going back to past trials, conducted a decade or more ago, gives us a jumpstart.

The downside is that many of these trials were never designed for long-term tracking. At the most basic level, it’s worth distinguishing two types:

  1. Studies that collected pupil names and tracked them over the (often short) duration of the original trial—longitudinal studies.
  2. Studies that relied on repeated cross-sections of pupils, never tracked them over time, and often never recorded their names at all, forcing us to re-sample from the relevant cohorts of children in both treatment and control schools—cross-sectional studies.

During the pilot, we’ll be going back to the original study schools in both types of studies to establish whether we can track the children (now adults) who originally participated, with an initial (very ambitious) target of finding 90 percent of kids.

Challenge #2: Picking studies. We’re interested in measuring the long-term benefits of better reading skills, which leads us to focus on trials that showed gains – but that comes with risks.

(This section is a bit technical, so feel free to skip down to the list of pilot studies for the final answer!)

To be able to measure the long-term impact of better reading skills, it’s important that we follow up at least *some* trials that worked. Our research question here is not whether a particular teaching method run by a particular agency improves literacy outcomes; if it were, null effects on learning would be valuable and interesting. Instead, we are interested in whether better reading and maths skills lead to better life outcomes.

In econometric terms, the original experimental treatments are an instrument that we can use to study the causal impact of learning on adult life outcomes (on the assumption that, say, improved pedagogy in second grade doesn’t improve your life chances through any other channel except teaching you more basic skills). Studies with no initial effect on test scores amount to weak instrumental variables, with an insignificant first stage. If used to study the impacts of learning on adult outcomes, these weak instruments can lead to severe bias.
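To make the weak-instrument worry concrete, here is a minimal simulation sketch. It is not our actual analysis: the data-generating process, the parameter values, and the function names (`wald_estimate`, `median_wald`) are all illustrative assumptions. It uses random assignment as an instrument for test scores via the simple Wald estimator (the effect on earnings divided by the effect on scores) and compares a trial with a strong first stage to one with no first stage at all.

```python
import random
import statistics


def wald_estimate(n, first_stage, true_return, rng):
    """Simulate one trial of n pupils and return the Wald (IV) estimate."""
    scores = {True: [], False: []}
    earnings = {True: [], False: []}
    for i in range(n):
        treated = i % 2 == 0  # balanced assignment to treatment and control
        background = rng.gauss(0, 1)  # unobserved confounder, e.g. family wealth
        # Assumed model: treatment shifts scores by `first_stage` sd.
        score = first_stage * treated + background + rng.gauss(0, 1)
        # Assumed model: earnings depend on scores and on the confounder.
        earning = true_return * score + background + rng.gauss(0, 1)
        scores[treated].append(score)
        earnings[treated].append(earning)
    first = statistics.mean(scores[True]) - statistics.mean(scores[False])
    reduced = statistics.mean(earnings[True]) - statistics.mean(earnings[False])
    return reduced / first  # reduced form divided by first stage


def median_wald(first_stage, true_return=0.1, n=500, sims=300, seed=0):
    """Median Wald estimate across many simulated trials."""
    rng = random.Random(seed)
    return statistics.median(
        wald_estimate(n, first_stage, true_return, rng) for _ in range(sims)
    )


strong = median_wald(first_stage=0.8)  # trial that clearly raised test scores
weak = median_wald(first_stage=0.0)    # trial with no effect on test scores
print(f"true return 0.10 | strong first stage: {strong:.2f} | no first stage: {weak:.2f}")
```

Under these assumptions, the strong-first-stage estimate should land near the true return, while the no-first-stage estimate is a ratio of two noise terms and drifts toward the confounded correlation between scores and earnings.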

But we should acknowledge a countervailing risk: by focusing only on studies that ‘worked’, i.e., trials showing short-term learning gains, we could exaggerate the effect on long-term earnings. It’s well known that the average estimate after dropping statistically insignificant studies is much larger than the average true effect. While creating a significance hurdle for the first stage is less obviously bad than a significance hurdle for the final results, it’s potentially still problematic.
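The selection problem can be sketched in a few lines. This is an illustration under assumed numbers (a true effect of 0.1 sd and a standard error of 0.08 for every trial), not an estimate from real studies, and the function names are hypothetical. It draws many trial estimates around the same true effect and compares the average of all of them with the average of only those that clear a conventional significance threshold.

```python
import random
import statistics


def simulate_estimates(true_effect, std_error, n_trials, rng):
    """Draw estimated effects for trials that all share one true effect."""
    return [rng.gauss(true_effect, std_error) for _ in range(n_trials)]


def mean_of_significant(estimates, std_error, z_crit=1.96):
    """Average only the estimates that pass a two-sided significance test."""
    kept = [b for b in estimates if abs(b / std_error) > z_crit]
    return statistics.mean(kept)


rng = random.Random(0)
true_effect, std_error = 0.1, 0.08  # assumed values, for illustration only
estimates = simulate_estimates(true_effect, std_error, 20000, rng)

print("mean of all trials:        ", round(statistics.mean(estimates), 3))
print("mean of significant trials:", round(mean_of_significant(estimates, std_error), 3))
```

In this setup every trial has the same true effect, so all of the gap between the two averages comes from selecting on significance. That is the worst case described in the text, where variation across trials is pure chance.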

A lot hinges here on how much of the variation across trials is real (i.e. some programs genuinely worked better than others) and how much is chance. If programs really do vary, following up on the good ones makes sense; if the variation is spurious, this selection is less justified.

It also matters whether a ‘lucky’ draw in an RCT that produced a spuriously big treatment effect on learning means the same sample will also generate a big effect on adult earnings. If the noise in the data relates to family wealth, for instance, we might worry that it will also affect long-run outcomes; if it’s due to measurement error in test scores, we might be safer.

So far our simulations show the likely bias from using weak instruments is considerably larger than the bias from dropping insignificant studies, but we’re still working on this and welcome feedback.

Challenge #3: Statistical power. Few studies have enough power to pick up reasonable effects on adult earnings, so we hope to pool multiple trials.

We’ve built a database of more than 30 evaluations of interventions that sought to improve foundational literacy and that could be viable for follow-up. To achieve the goal of this project, we need studies where sufficient time has elapsed for us to measure adult outcomes (i.e. interventions from 2013 or earlier), and we need a large sample size and a large effect size to give us enough power to detect long-term effects. This narrows down the pool considerably. We’ve selected four studies from the pool (Table 1), two longitudinal and two cross-sectional, that meet these criteria, and we plan to start piloting this summer, with big thanks to the original study teams.

Table 1. Pilot studies selected

| Study | Program dates | Effect size on learning | Children’s mean age in 2023 | Were kids originally tracked over time? |
|---|---|---|---|---|
| Adrienne Lucas, Patrick McEwan, Moses Ngware and Moses Oketch, 2014 | | 0.18 sd | | |
| Benjamin Piper and Medina Korda, 2011 | | 0.80 sd | | |
| Annie Duflo, Jessica Kiessel, and Adrienne Lucas, 2022 | | 0.13 sd | | |
| Jennifer Spratt, Simon King, and Jennae Bulat, 2013 | | 0.25 sd | | |

Note: the studies evaluate effects on several literacy skill components. Here we report effects for Oral Literacy (Uganda), Oral Reading Fluency (Liberia & Mali) and Foundational Questions (Ghana). The Ghana study involved four different intervention arms, and a control, with 100 schools in each arm. Here we report the combined effect across interventions.

Note, however, that all of these studies were designed to detect reasonable effect sizes on learning outcomes. Moving further down the (hypothesized) causal chain to earnings and other adult outcomes, effects are likely to get noisier, and the requisite sample sizes will be considerably bigger. In short, our initial power calculations suggest no single study here is powered to pick up effects on labour market outcomes.

The solution, we hope, lies in pooling across studies to combine sample size and statistical power. There’s a tension here with the earlier point about variation across trials, though: if the true economic return to learning gains varies across contexts, pooling is less helpful. So we will need more studies in more contexts to disentangle what’s real variation from what’s statistical noise.
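As a rough illustration of why pooling helps, the sketch below computes a textbook minimum detectable effect (MDE) for a two-arm trial. It rests on simplifying assumptions that are not from our power calculations: individual-level randomisation with no clustering (clustering by school would raise the MDE), a common true effect across studies, and hypothetical sample sizes. The function name `minimum_detectable_effect` is ours, for illustration.

```python
from statistics import NormalDist


def minimum_detectable_effect(n_total, outcome_sd=1.0, alpha=0.05,
                              power=0.8, share_treated=0.5):
    """Smallest true effect detectable at the given size and power.

    Assumes individual-level randomisation and ignores clustering.
    """
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    std_error = outcome_sd / (share_treated * (1 - share_treated) * n_total) ** 0.5
    return multiplier * std_error


single_study = minimum_detectable_effect(2000)      # hypothetical sample size
pooled_studies = minimum_detectable_effect(4 * 2000)  # four such studies pooled

print(f"MDE, one study:    {single_study:.3f} sd")
print(f"MDE, four studies: {pooled_studies:.3f} sd")
```

Because the standard error shrinks with the square root of the sample, pooling four equally sized studies halves the MDE in this sketch. If returns genuinely differ across contexts, the pooled estimate is an average of different effects and the gain is less clear-cut.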

The four pilots this summer are just that: pilots. We hope to add more studies over time. If you know of other good candidates, let us know!


CGD blog posts reflect the views of the authors, drawing on prior research and experience in their areas of expertise. CGD is a nonpartisan, independent organization and does not take institutional positions.
