We investigate the properties of measures of learning outcomes, since these are the tools commonly used to monitor progress toward identifying the most effective interventions. We review test properties across 158 studies and conduct item-level psychometric analysis of a subset of them, showing that current tests vary widely in scope, content, administration, and analysis. Researchers rarely report the properties of their test scores: only 4 percent of the studies we review provide reliability estimates for their tests, and only 10 percent archive item-level replication data that would allow test quality to be evaluated post hoc. The interpretation of any estimate is necessarily sensitive to how the core variables are measured, even where treatments are randomly assigned. Because treatment effects are often expressed in standard deviation units, and measurement error inflates the standard deviation of observed scores, measurement error can bias estimated treatment effects toward zero. Content analysis of question wordings reveals substantial variation in the skills covered, even when students in similar grades are tested in similar subjects. These findings indicate that comparisons of treatment effects must account for both the degree of measurement error, which is often unavailable, and the content breadth of the tests, in order to explain why effects may differ across substantively different outcome variables.
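The attenuation of standardized effects can be made explicit with a short derivation. This is a sketch under classical test theory assumptions that the abstract does not itself state: the observed score $Y$ is the sum of a true score $T$ and an error $e$ that has mean zero and is uncorrelated with both $T$ and treatment assignment. The symbols $\Delta$ (the raw mean difference between treatment and control), $\sigma_T$, $\sigma_e$, and $\rho$ are introduced here for illustration only.

```latex
\begin{align*}
Y &= T + e, \qquad \operatorname{Var}(Y) = \sigma_T^2 + \sigma_e^2, \\
\rho &= \frac{\sigma_T^2}{\sigma_T^2 + \sigma_e^2} \quad \text{(reliability)}, \\
d_{\text{obs}} &= \frac{\Delta}{\sqrt{\sigma_T^2 + \sigma_e^2}}
              = \frac{\Delta}{\sigma_T}\,\sqrt{\rho}
              = d_{\text{true}}\,\sqrt{\rho}.
\end{align*}
```

Under these assumptions the raw difference $\Delta$ is unaffected by the error, but the standardized effect shrinks by a factor of $\sqrt{\rho}$. For example, with a reliability of 0.64 a true effect of 0.25 standard deviations would appear as 0.20, which is why two studies with the same underlying effect can report different standardized estimates when their tests differ in reliability.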