Econometric studies that seek to draw conclusions about effectiveness from data that span large geographical areas or highly varied populations thus typically have lower levels of internal validity, but higher levels of external validity. So, once again, the fundamental issue is not the purity of the methodology employed (as exciting as such methodological purity is to the technically inclined) but rather the inherent complexity of the world being studied.

I'll explain. The main challenge for non-randomized studies is that to infer the impact of, say, microcredit, they need to assume that the treatment and control groups are statistically the same---or would be but for microcredit being available to or used by those in the treatment group. This assumption allows us to attribute any differences between the two groups to the impact of microcredit. But it generally isn't that credible. As a general matter, do you really think that people who take microloans are the same, statistically, as those who don't?

Creative methods are often used to tweak the treatment and control groups into greater comparability (more generally, to isolate exogenous variation in treatment). But most of the methods are much less convincing than randomization. For example, matching methods, as I explained in 2009, try to line up treated individuals with untreated ones according to observed traits such as income and number of children. But if the father of one family of four that lives on $5/day goes for a loan and the father of another family of four that lives on $5/day does not, wouldn't you wonder whether, despite the outward statistical similarity, the two men have different personalities and abilities; and whether the borrowing one is more likely to be entrepreneurial and to succeed even without the loan? And wouldn't that make it look as if microcredit was helping even when it wasn't?
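To make that worry concrete, here is a minimal simulation sketch. Everything in it is an assumption made up for illustration: the hidden "drive" trait, the outcome formula, the true loan effect of zero, and the function name `simulate_pairs`. None of it comes from an actual microcredit study. It builds pairs of households that look identical on observed traits, then compares two ways of deciding who borrows: self-selection by the more driven household, and a coin flip.

```python
import random
import statistics


def simulate_pairs(n_pairs=1000, true_effect=0.0, seed=0):
    """Return (self_selected_gap, randomized_gap) in mean outcomes.

    Each pair is two households identical on observed traits but differing
    in a hidden trait ("drive") that raises income with or without a loan.
    """
    rng = random.Random(seed)
    self_treated, self_control = [], []
    rand_treated, rand_control = [], []

    def outcome(drive, treated):
        # Hypothetical outcome model: hidden drive + loan effect + noise.
        return drive + (true_effect if treated else 0.0) + rng.gauss(0, 1)

    for _ in range(n_pairs):
        drive_a, drive_b = rng.gauss(0, 1), rng.gauss(0, 1)

        # Self-selection: the more driven household is the one that borrows.
        self_treated.append(outcome(max(drive_a, drive_b), True))
        self_control.append(outcome(min(drive_a, drive_b), False))

        # Randomization: a coin flip decides who is offered the loan.
        if rng.random() < 0.5:
            drive_a, drive_b = drive_b, drive_a
        rand_treated.append(outcome(drive_a, True))
        rand_control.append(outcome(drive_b, False))

    gap_self = statistics.mean(self_treated) - statistics.mean(self_control)
    gap_rand = statistics.mean(rand_treated) - statistics.mean(rand_control)
    return gap_self, gap_rand


if __name__ == "__main__":
    gap_self, gap_rand = simulate_pairs()
    print(f"self-selected gap: {gap_self:.2f}")
    print(f"randomized gap:    {gap_rand:.2f}")
```

Under these assumptions the loan does nothing, yet the self-selected comparison should show borrowers clearly ahead, while the coin-flip comparison should hover near zero. The exact numbers depend on the seed and on the made-up outcome model.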
Control groups in theory correct for the attribution problem by comparing people exposed to the same set of conditions and possible choices. However, control-group design is tricky, and skeptics hover like vultures to pounce on any weakness.
(Elisabeth Rhyne, 2001)

But suppose a researcher formed 1,000 such pairs and for each pair flipped a coin to decide who got offered a loan. That assignment of treatment and control should make the two groups match, statistically. For example, the groups should have about the same number of entrepreneurial fathers. That would be an RCT. No longer would we need fret about treating groups of people who are probably different as if they weren't. Cartwright:
One thing we certainly can do is to try to take into account all possible sources of difference between the test and target populations that we can identify. This is just what we do in matched observational studies. When it comes to internal validity, however, advocates of the exclusive use of RCTs do not take this to be good enough---matching studies are not allowed just because our judgements about possible sources of difference are fallible.

Then comes Cartwright's zinger: Turns out that RCTs don't get you around the problem of treating different people as if they are the same. After a study is run, you need to generalize from it. And what does an immaculately conceived study of the impacts of microcredit in Manila or Hyderabad tell us about the impacts in Accra or Lima? Nothing, unless you are prepared to make assumptions about the ways in which people in different places are the same:
Economists make a huge investment to achieve rigor inside their models, that is to achieve internal validity. But how do they decide what lessons to draw about target situations outside from conclusions rigorously derived inside the model? That is, how do they establish external validity? We find: thought, discussion, debate; relatively secure knowledge; past practice; good bets. But not rules, check lists, detailed practicable procedures; nothing with the rigor demanded inside the models.

So while RCTs may be superior within their confines---"internally valid"---the process of generalizing from them remains fraught with precisely the difficulties RCTs are supposed to solve. We cannot avoid these difficulties because the derivation of general conclusions from specific results is essential if the RCTs are to be part of science. It's also essential if they are to be useful.

In sum, while non-randomized methods have problems of comparability within, randomized methods have them beyond. RCTs avoid messy questions about who to equate to whom during implementation only to slam into those questions upon interpretation. I was struck by this symmetry when I first grasped it. As Phil writes, the "inherent complexity of the world being studied" cannot be dodged.

But on reflection, I don't think researchers and those who depend on them are as elegantly trapped as this picture suggests. For that to be the case there would need to be an unavoidable and binding trade-off between randomization and breadth of sample. In Cartwright's favor, I imagine that it is less practical to run a pan-Indian RCT than do a non-randomized study of data from nationally representative surveys, as Angus Deaton has. To the extent that work like Deaton's is more practical, it offers an approach to knowledge that may not be as internally valid as RCTs but is easier to extrapolate to all Indians. Cartwright gives a compelling U.S.-based example:
Consider the widespread correlation between low economic status and poor health, and look at two opposing accounts of how it arises… Epidemiologist Michael Marmot from University College London argues that the causal story looks like this:
Low status --> ‘stress’ --> too much ‘fight or flight’ response --> poor health.
In contrast, Princeton University economist Angus Deaton suggests this:
Poor health --> loss of work --> low income --> low status.
Deaton confirms his hypothesis in the National Longitudinal Mortality Study (NLMS) data. He reasons: if the income-mortality correlation is due primarily to loss of income from poor health, then it should weaken dramatically in the retired population where health will not affect income. It should also be weaker among women than men, because the former have weaker attachment to the labour force over this period. In both cases these predictions are borne out by the data. Even more, split the data between diseases that something can be done about and those that nothing can be done about. Then income is correlated with mortality from both---just as it would be if causality runs from health to income. Also, education is weaker or uncorrelated for the ones that nothing can be done about. Deaton argues that it is hard to see how this would follow if income and education are both markers for a single concept of socio-economic status that is causal for health.

But logically, there is no inherent trade-off between randomization and breadth of the sample. Economist Roland Fryer just won a MacArthur "genius" award for work that includes "a randomized experiment with well over 20,000 students from more than 200 schools in three cities" in the United States. Why couldn't a national microcreditor work with researchers to conduct experiments on a national scale? In India, the spread of high-tech systems for insurance and identification and voucher distribution may make cheap, large-scale experiments feasible.
Moreover, I think Cartwright does not acknowledge that there is something fundamentally different about experiments, something for which there is no counterpart in large-scale, non-experimental studies. Experiments introduce novel variation. Randomized ones introduce variation that is effectively uncorrelated with everything else in the universe. That is special.

I think that is why, as I reflect on the available studies of the impact of microcredit, Cartwright's dualism doesn't jibe. True, the best non-randomized studies are more geographically diverse and representative than the best randomized ones. The Pitt and Khandker studies use data from 87 villages sprinkled among 29 of Bangladesh's 391 subdistricts (as enumerated in 1991). Yet for reasons elaborated elsewhere, I don't believe these studies convincingly measure the impacts of microcredit. They aren't as persuasive as the Deaton example above. (Perhaps this has something to do with the particular difficulties of evaluating an intervention, which Deaton does not do in the example.) Thus the fact that the Bangladesh studies draw data from across the country, that they are more externally valid, is paltry compensation. It tells me that microcredit has unknown impacts not just in those 87 villages, but in all of Bangladesh.

When it comes to generalizing, if it's garbage in, it's garbage out. If the choice is between a study done in one place that I can believe and one done everywhere that I can't, then the choice is easy. Based on my experience, a more promising path than large-scale, non-randomized impact studies is randomized ones done in diverse locales.
CGD blog posts reflect the views of the authors, drawing on prior research and experience in their areas of expertise. CGD is a nonpartisan, independent organization and does not take institutional positions.