A few years ago, one of us (Abhijeet) wrote a cautionary note that the “standard deviation” measure in education evaluations in development economics was not typically comparable. That note was based on the informal observation that tests in development evaluations seemed to vary a lot from one study to the next. And, unlike the armies of psychometricians who design international assessments such as PISA, most development economists aren’t trained to create tests—or if they’re using a pre-existing test, to validate if it works in a new context. Yet, if the hundreds of evaluations of education reforms need to add up to a coherent whole, then we need some way to express effect sizes on a common scale (at least approximately).
In a new CGD paper, we do a deep dive to figure out what the state of current measurement is in leading evaluations of education reforms in development economics. We looked at 158 studies, published in leading economics journals, to evaluate systematically what types of tests researchers administered, and what their (reported) psychometric properties are. Disappointingly, very few studies report any details about their tests. Only about 10 percent archive the item-level data that it would take to establish validity post-hoc (many authors did helpfully provide it after email requests). So, for a subset of studies where we obtained item-level data, we estimated test reliability for each test booklet separately (N=223) and classified each test item (N=5,994) to see what, exactly, studies were measuring.
What we find is a mixed bag…
The good news is that tests are (mostly) internally consistent. Most test booklets show reasonable levels of test reliability with a strong degree of internal consistency (a Cronbach’s alpha over 0.7 or 0.8). While these are a bit lower than the typical levels of 0.9 on standardized assessments in the US, they are still reliable for the general purposes of most evaluations. But there are test booklets at all levels from very low reliabilities of 0.2, all the way to over 0.9. Quite a few test takers show floor effects, i.e., where students answer all items incorrectly. Unfortunately, except for a few studies, you would never know how reliably their core outcome variable is measured.
The more mixed news is that the current body of evidence has very little in common across the whole literature in what is measured, from whom, and how. Students are of very different ages, tests vary widely in what they assess (in terms of specific competencies within broad areas such as math or language), they differ in how they are scored, and they differ in how they were administered (e.g., written or with oral stimuli). All of these features can really affect student responses. Before setting out on this project, we had hoped we could use the item-level data to create a more robust link across studies retrospectively. This does not seem to be possible—the underlying measures and samples just differ too much. It could be possible to do more linking by having new children sit multiple tests (similar to what Dev Patel and Justin Sandefur did in this paper), but even that might be a tough ask for this now-sprawling literature. If we really want comparability in this literature, this will need to be built in at the design stage (with all the coordination issues that entails across unconnected research teams…).
The other take-away, confirming our initial intuition, is that measuring effects in standard deviations might be a good starting point for providing a summary measure of effect sizes, but it certainly isn’t enough by itself to make decisions about whether an effect is “big” or “small,” or whether a particular intervention is cost-effective or not, or whether the effect of one intervention is more or less effective than another intervention in a different sample and a different test. Both populations and tests differ too much for these comparisons. And, as Alex Eble and coauthors’ excellent paper from the Gambia shows, standard deviation metrics can be very sensitive in contexts with big floor effects.
Could we be doing better?
Some fixes are easy and could be done in individual studies both at the design stage and the reporting stage. In the absence of an already-accepted common standard, we propose some principles that could represent a starting point for discussion:
- Tests should be designed to measure a broad range of proficiency
- There should be a clear mapping of content at the item level to subdomains
- Items should come from multiple sources to allow for linking comparably with other samples and over time
- The test should lend itself easily to generating a smoothly distributed aggregate score (this is notably not the case for commonly used assessments such as the Early Grade Reading Assessment and the short ASER tests)
- Wherever possible, assessments should be piloted in the context that they will be administered, and with a larger item bank than is intended for final use
What to report in papers
- How tests were designed, administered, scored, and scaled,
- What the psychometric properties of the aggregate test scores are
- Present effects not just on aggregate test scores, but also on specific competences and subdomains
- Express treatment effects in relation to other meaningful magnitudes in the context, such as the gap in socio-economic status or the magnitude of learning under business-as-usual in that setting
- Researchers should also archive item-level data and not just the aggregate test scores
Many of these recommendations aren’t very different from the blog post from 2015 we linked to at the top. That is just because, even though some reporting may have improved, much of how we approach measurement in this literature has remained the same. The need, then as now, is for common measurement tools appropriate for different ages, mapped to an open-access item bank of test questions, from which researchers could borrow for their own specific studies. This is the kind of public good that could only be provided (and maintained) by a large organization such as UNESCO or the World Bank. The technical work is feasible—PISA and TIMSS and NAEP have done it for many different populations for decades—but academic research teams have neither the incentives nor the permanence to create these public goods. For those who set up and fund research infrastructure though, this should be a priority.
Read our new paper here.
CGD blog posts reflect the views of the authors, drawing on prior research and experience in their areas of expertise. CGD is a nonpartisan, independent organization and does not take institutional positions.
Image credit for social media/web: Adobe Stock