There's been a lot of back and forth between Jonathan Morduch and me on the one hand and Mark Pitt and Shahidur Khandker on the other over their influential non-randomized study that finds that microcredit reduced poverty in Bangladesh. The latest entry in this perennial series came last month as a World Bank working paper by Pitt and Khandker.
As you would expect when the same people write about the same things, the new paper shares much with its predecessors. It is written emphatically, seeming to raise profound concerns; I don't find it that persuasive; and yet it has taught me something. Perhaps the most succinct rebuttal is that the paper does not refute the fact that if you drop the 16 data points with the most extreme (highest) values on household spending, less than 0.5% of the sample, the finding that microcredit increases household spending completely goes away. The new paper spends much more space challenging our hypotheses (what PK call "claims") about why this happens than whether it happens. But for real-world implications, the whether matters more than the why.
Here, I will explain my thinking in more detail. I'll try to keep section openers jargon-free, but no promises otherwise. If this post bewilders you, then you will know how I felt when I decided years ago to understand the debate over PK, which Jonathan initiated.
Section 1 of the new reply is an introduction. Section 2 does not deal with substance.
Section 3 criticizes an alternative to PK's main statistical method, which we use in part of our paper. The alternative was first proposed by PK in 1998 (footnote 16) and first applied to the PK data by Pitt in 1999 in his attempt to rebut Morduch (1998). PK now describe the method as "outlandish" and "extraordinarily artificial." To me, the theoretical critique and the demonstration through simulations both appear flawed. And even if they are correct, they mainly go just to that "why" question, not the "whether."
PK (1998) propose estimating impacts via two-stage IV in which the instruments are interactions between included controls and each of the two dummies for female and male credit availability. Pitt (1999) implements this as 2SLS. We do classical linear LIML instead, solely in order to stay conceptually closer to PK's nonlinear LIML. But this distinction matters little. In fact the associated under- and weak identification tests, whose availability substantially motivates use of the estimators, are identical. Both also have the virtue of being known to be robust to non-normality in the errors. PK's theoretical attack on the method they once used to defend themselves can I think be distilled to this:
Consider the system of structural equations:
y = x1 + x2 + x3 + e
x1 = z1 + u1
x2 = z2 + u2
where e, u1, u2 are potentially correlated error terms; x3 is an exogenous control; z1, z2 are uncorrelated with e; and z1, z2 are strong explanators for x1 and x2, making them strong instruments for x1 and x2 in the y equation. 2SLS is appropriate for estimating the coefficients on x1 and x2; it would put x3, z1, and z2 in the first-stage equations for x1 and x2. However, the exogenous control x3 is a weak instrument because its expected coefficients in the first-stage equations are 0. (Its first-stage coefficients correspond to πfx and πmx in the PK (2012) exposition.) And z1 is a weak instrument because its expected coefficient in one of the equations, the x2 equation, is zero; and vice versa for z2. (These correspond to πfm and πmf.)
The above argument is wrong because a) the exogenous control x3 cannot be a "weak instrument"; and b) z1 and z2 are collectively strong for x1 and x2 even if z1 is weak for x2 and z2 is weak for x1.
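Point (b) can be checked numerically. What follows is a minimal sketch written for the toy system above, not code from either paper. It assumes every structural coefficient equals 1, drops x3 for brevity, and uses an arbitrarily chosen error correlation; the function name and parameters are hypothetical.

```python
import random


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def iv_two_endogenous(n=5000, seed=0):
    """Exactly identified IV for y = x1 + x2 + e, instrumenting with z1 and z2."""
    rng = random.Random(seed)
    z1 = [rng.gauss(0, 1) for _ in range(n)]
    z2 = [rng.gauss(0, 1) for _ in range(n)]
    u1 = [rng.gauss(0, 1) for _ in range(n)]
    u2 = [rng.gauss(0, 1) for _ in range(n)]
    # e is correlated with u1 and u2, so x1 and x2 are endogenous (assumed values).
    e = [0.5 * a + 0.5 * b + rng.gauss(0, 1) for a, b in zip(u1, u2)]
    # z1 has a zero coefficient in the x2 equation, and z2 in the x1 equation.
    x1 = [z + u for z, u in zip(z1, u1)]
    x2 = [z + u for z, u in zip(z2, u2)]
    y = [a + b + c for a, b, c in zip(x1, x2, e)]
    # Solve the 2x2 system (Z'X) beta = Z'y by Cramer's rule.
    a11, a12 = dot(z1, x1), dot(z1, x2)
    a21, a22 = dot(z2, x1), dot(z2, x2)
    b1, b2 = dot(z1, y), dot(z2, y)
    det = a11 * a22 - a12 * a21
    return (b1 * a22 - a12 * b2) / det, (a11 * b2 - a21 * b1) / det
```

Calling `iv_two_endogenous()` returns the two estimated coefficients, which should land close to the assumed true values of 1 even though each instrument is irrelevant for one of the two endogenous variables.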
PK attempt to illustrate with simulations their contention that the linear estimators are fatally flawed by weak instruments. However, these and subsequent simulations deviate from the PK estimation framework in two ways that, in themselves, greatly and unrealistically weaken the instruments. First, treatment quantity (how much is borrowed) is simulated with a zero-centered distribution rather than being always positive as in the real data. As a result, treatment averages zero for treated and untreated alike, and the dummies for availability of credit by gender, the key instruments in PK, are perfectly weak. (In the data and code file, in Table1&9groups3a_liml.do, lines 22--33 define the zero-centered female and male treatment quantities.) And second, also unlike in PK, these dummies enter as controls rather than instruments! In a classical treatment impact assessment, this is equivalent to looking at the impact of treatment while controlling for rather than instrumenting with intent-to-treat. This too should weaken the remaining instruments. (In the same .do file, note the appearance of "treatm treatf" in the second stages of the specifications in lines 71 and 83.)
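The first deviation is easy to see in a toy example. The sketch below is not the PK simulation code; it is a hypothetical setup in which a single availability dummy is the only instrument, "always positive" borrowing is drawn from a lognormal distribution, and "zero-centered" borrowing is drawn from a standard normal. It compares the first-stage F statistic in the two cases.

```python
import math
import random


def first_stage_f(dummy, amount):
    """F statistic from regressing amount on a constant and one dummy."""
    n = len(dummy)
    md = sum(dummy) / n
    ma = sum(amount) / n
    sxy = sum((d - md) * (a - ma) for d, a in zip(dummy, amount))
    sxx = sum((d - md) ** 2 for d in dummy)
    syy = sum((a - ma) ** 2 for a in amount)
    r2 = sxy ** 2 / (sxx * syy)
    return r2 * (n - 2) / (1 - r2)


def simulate(n=2000, seed=0):
    """Return (F with positive borrowing, F with zero-centered borrowing)."""
    rng = random.Random(seed)
    available = [1 if rng.random() < 0.5 else 0 for _ in range(n)]
    # Always-positive borrowing where credit is available, zero elsewhere.
    positive = [d * math.exp(rng.gauss(0, 1)) for d in available]
    # Zero-centered "borrowing": mean zero whether or not credit is available.
    centered = [d * rng.gauss(0, 1) for d in available]
    return first_stage_f(available, positive), first_stage_f(available, centered)
```

In this setup the dummy should be a very strong predictor of the positive amounts and carry essentially no information about the zero-centered ones, since the latter have the same mean with or without credit availability.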
I think what this first simulation set actually demonstrates is a problem PK don't emphasize, and which I of all people should have taken more on board. The linear approach generates a lot of instruments, which causes overfitting bias, toward OLS. I think when we revise, we should add exactly-identified regressions, instrumenting only with the credit availability dummies, not their interactions with the controls. Tests on a Pitt (1999) simulated data set demonstrate the minimal bias (but inefficiency) of this method (see Appendix of our first paper). I think it was a mistake not to apply it to the real data earlier.
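The pull toward OLS from instrument proliferation can be sketched in a few lines. This is a hypothetical simulation, not the Pitt (1999) data set or the code behind the Appendix: one endogenous regressor, a true impact of zero, one relevant instrument, and (optionally) many irrelevant ones. All parameter values are assumptions chosen for illustration.

```python
import random
import statistics


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def project(x, columns):
    """Least-squares fitted values of x on the columns (modified Gram-Schmidt)."""
    basis = []
    for col in columns:
        v = list(col)
        for q in basis:
            c = dot(q, v)
            v = [vi - c * qi for vi, qi in zip(v, q)]
        norm = dot(v, v) ** 0.5
        basis.append([vi / norm for vi in v])
    fitted = [0.0] * len(x)
    for q in basis:
        c = dot(q, x)
        fitted = [f + c * qi for f, qi in zip(fitted, q)]
    return fitted


def simulate(n_instruments, n=200, reps=200, seed=1):
    """Median OLS and 2SLS estimates when the true impact is zero."""
    rng = random.Random(seed)
    ols, tsls = [], []
    for _ in range(reps):
        z = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n_instruments)]
        u = [rng.gauss(0, 1) for _ in range(n)]
        # The outcome error is correlated with u, which makes x endogenous.
        e = [0.8 * ui + 0.6 * rng.gauss(0, 1) for ui in u]
        # Only the first instrument actually moves x; the rest are noise.
        x = [0.3 * zi + ui for zi, ui in zip(z[0], u)]
        y = e  # true coefficient on x is zero
        xhat = project(x, z)
        ols.append(dot(x, y) / dot(x, x))
        tsls.append(dot(xhat, y) / dot(xhat, x))
    return statistics.median(ols), statistics.median(tsls)
```

Comparing `simulate(1)` with `simulate(20)` contrasts the exactly identified case with a heavily overidentified one: the median 2SLS estimate should sit near zero in the former and drift toward the OLS estimate in the latter.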
Exactly identified LIML or 2SLS regressions (they coincide), which are relatively free of overfitting bias and robust to deviations from normality, show no impact of microcredit on household consumption. See 3rd and 5th columns, which are new:
Section 4 helpfully spots a bug in our code. It then devotes almost 3 pages to its potential implications---rather than fixing it to see if it makes a difference, and rather than acknowledging that I informed Mark Pitt this summer that it doesn't.
I forgot to factor in the sampling weights in the lines that compute the skew and kurtosis of the second-stage errors and test whether they deviate from normality. The fix actually strengthens our findings of non-normality: skew in errors in the replication regression rises from 0.64 to 0.71 and kurtosis from 4.78 to 5.12. (Add "[aw=weightpk]" clauses to the "sum ey, detail" and "sktest ey" lines in this.) This is to be expected since PK undersampled ineligible (less poor) households and weight them more heavily to compensate. This accentuates the long right tail in the household consumption data.
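For readers who do not use Stata, the idea of weighted skew and kurtosis can be sketched as follows. This is a generic moment-based calculation under the assumption that weights act like frequency-style importance weights normalized by their sum; it is not guaranteed to match Stata's output digit for digit, and the function name is hypothetical.

```python
def weighted_skew_kurtosis(values, weights):
    """Skewness and kurtosis from weighted central moments."""
    total = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / total

    def central_moment(k):
        return sum(w * (v - mean) ** k for v, w in zip(values, weights)) / total

    m2 = central_moment(2)
    skew = central_moment(3) / m2 ** 1.5
    kurt = central_moment(4) / m2 ** 2  # equals 3 for a normal distribution
    return skew, kurt
```

Giving an observation a weight of 3 has the same effect here as listing it three times, which is why heavier weights on households in the right tail of consumption push both statistics up.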
All code has bugs. And most bugs once found can be made to look really dumb, especially by a good lawyer. Suppose that in one place in the complicated computer program for your study of childhood obesity, you typed an H instead of a W: you used height instead of weight. Equating height and weight goes against millennia of medical science, not to mention common sense. One can demonstrate with theory and simulations how it can produce all sorts of wrong results. And no justification is provided for this strange theoretical construct!
Seemingly, it would be easier to just fix the problem and see if it matters. Seemingly this is also the thing to do if one's priority is getting the science right.
Section 5 confronts our finding that the PK estimation method is bimodal, tending to generate two contradictory results---microcredit increases or reduces household spending. The section argues that such bimodality is neither unusual nor a problem. The argument here seems flawed in three ways, one minor and two major.
The minor flaw is that the computer simulation code that shows the normalcy of bimodality doesn't actually demonstrate the presence of two local maxima the way ours does (see Figure1conloop3negt.do in the code and data file). The code loops over various possible impact coefficients for female and male credit, equating the two, each time maximizing the likelihood while constraining the estimated impacts to these values. It is a graph of constrained likelihoods over a subset of possible values for the constrained parameters. But I fixed this and found that the two peaks indeed correspond to true local maxima when the likelihood search is unconstrained. So the double-peaked graph is conceptually flawed but meaningful.
The first major problem was already mentioned. The simulations unrealistically deviate from the PK set-up in ways that weaken the instruments. One deviation is in the hypothesized data-generating process: amount borrowed averages zero. The other is in the estimator: credit availability is a control rather than instrument. When these two problems are fixed, the bimodality goes away. Compare this to PK's double-humped Figure 1:
So instead of challenging Jonathan and me by showing that bimodality is the norm, PK's simulations corroborate us by associating bimodality with econometric degeneracy.
The other major problem is an elision of the distinction between bimodality in the likelihood and bimodality in the estimator. It is absolutely the case, as PK say, that ML does not require unimodality of the likelihood for consistency. If a particular mode is highest with probability 1 as sample size goes to infinity and the ML search always detects this mode, the estimator will be consistent. However, multimodality in the estimator is inconsistency prima facie. It's a matter of definition, not theory. (Some narrow counterexamples for completeness: the estimator could be asymptotically bimodal such that the mass of all but one mode goes to zero in probability, or such that the modes become infinitely close. But the data do not suggest such scenarios.)
And we do present evidence of bimodality in the estimator, via bootstrapping. Using the best method we've found for detecting multiple modes, we found 65% of the mass of the ML estimate of the impact of Grameen lending to women to be below zero. But I recently discovered a subtle bug in that code: the right number, I now believe, is 36%:
So a one-tailed test of whether the impact is positive yields significance only at p=0.36. PK address our bootstrapping only in footnote 26, where they say it "lacks any econometric justification." For justification, they can refer to authoritative texts.
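The logic of reading a bootstrap share as a one-tailed p value can be sketched generically. The code below is not the bootstrap from our paper: it resamples individual observations with replacement and takes an arbitrary estimator as an argument, with the sample mean standing in for the far more involved PK ML estimator. Names and defaults are hypothetical.

```python
import random
import statistics


def bootstrap_share_below_zero(data, estimator, reps=1000, seed=0):
    """Share of bootstrap replicates in which the estimate is negative."""
    rng = random.Random(seed)
    n = len(data)
    below = 0
    for _ in range(reps):
        # Draw n observations with replacement and re-estimate.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        if estimator(resample) < 0:
            below += 1
    return below / reps
```

A call such as `bootstrap_share_below_zero(sample, statistics.mean)` returns the fraction of replicates below zero. With data that have several observations per household, one would more naturally resample whole households rather than single observations; that refinement is left out of this sketch.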
It's possible that the estimator is asymptotically stable, contrary to our finite-sample evidence. Perhaps 5,218 observations on 1,798 households is not large enough for asymptotic behavior to kick in. That brings us to theory. Econometric theory tells us that the PK estimator is consistent when the assumptions implied in its likelihood are correct, notably normality of the errors. It may well be robust to certain violations of these assumptions, such as the non-normality we detected, but no one has proven as much. Thus PK are correct that "RM’s concern with bias [actually, inconsistency] arising from non-normality draws no support from econometric theory." But the opposite holds: neither does theory reassure. And I thought that in econometrics, estimators were presumed inconsistent until proven consistent. It takes chutzpah to defend an estimator by saying no one has proved it doesn't work.
PK offer some hand-waving about why their estimator is probably robust to non-normality. The arguments are reasonable, but illustrate the danger of hand-waving, for they are wrong. These simulations demonstrate. We are correct that the classical linear estimators are strictly more robust to non-normality than PK's, and thus provide a useful check.
Section 6 returns to the linear estimators that PK first proposed and used (and which again relate primarily to the "why," not the "whether").
There is some semantic confusion here---linear LIML is not a way of doing 2SLS, and parameter and moment covariance matrices are not the same. But the main upshot is that our first linear regression, in which there are six instrumented variables---credit by gender and lender---is underidentified. It is "the equivalent of sirens blaring and red lights flashing to proclaim that something is terribly wrong with the estimation." Actually, our paper notes the underidentification (see the under-ID test in the table above) and does not rely on that regression for inference. We fix it by pooling credit by lender, which eliminates the underidentification, as shown above (the p values on the test plunge to 0.000).
Moreover, as that table newly shows, and contrary to PK, instrument weakness is not an irreducible source of trouble. In the exactly identified estimates, the instruments appear strong; the impact of microcredit on poverty still does not. It appears now that the instruments are not weak in the overidentified regressions, at least those that pool across lenders. Rather, instrument proliferation is distorting the test of instrument weakness. Adding instruments should increase instrument strength even if the test of strength says the opposite. This finding does contradict our earlier thinking a bit, and I'll return to it.
The section also provides an unusual interpretation of the linear estimates, combining the point estimates from LIML with the "perfectly valid" standard errors from 2SLS. In fact, the LIML and 2SLS regressions return the exact same weak instrument diagnostics, so it's not clear why one is more valid than another. At any rate, it is an unorthodox move and a thin reed on which to rest a defense of PK.
Section 7 is interesting. It strips away components of the PK estimator to isolate the source of identification. What it still does not do, however, is use the formal language of probability to state and defend the conditions needed for identification of causal effects. PK, for example, have never motivated the assumption that variation in the availability of credit by gender is exogenous. They also have not explained why credit availability is a good instrument in a nonlinear IV set-up despite our demonstration (Table 4) that credit availability is correlated with the second-stage error.
Contrary to appearance, the new PK paper confirms rather than refutes the conjecture that bimodality in the PK estimator is a sign of weak instrumentation. The paper does not overturn the finding that dropping a handful of systematically picked outliers collapses the two modes into one near zero. It does not change the fact that linear estimators that are robust to demonstrated deviations from the likelihood model produce estimates close to zero. It does not change the fact that the PK estimator is demonstrably inconsistent in the face of such deviations. It does not address the bootstrap evidence that the estimator is inconsistent on the real data.
But we have learned from this round. Most important is the discovery of a paradox: PK now provide a laboratory demonstration of how weak instruments make their estimator bimodal, confirming one of our hypotheses; yet the friction with them led me to run exactly identified linear regressions that revealed the instruments to be strong in that context, cutting against our hypothesis.
This forces us to revise whatever tentative insight into the nonlinear PK regressions we derive from the linear analogs. I would not now hypothesize that the PK instruments are weak in the usual sense, across the full sample. Nevertheless, as Jonathan and I noted in 2011, the PK result disappears when dropping villages where both genders can borrow and, symmetrically, persists strongly when restricting to just those villages. So the PK result seems to emanate from this subsample, in which the female and male credit availability dummies are identical, making them weak for explaining distinctive variation in the endogenous variables, credit uptake by gender. It seems as if instruments being weak only within a subsample, while irrelevant for linear estimation, can distort a nonlinear one---at least when there are outliers. This is why I continue to conjecture that the outliers and instrument weakness are interacting: fix either and the bimodality goes away.
Perhaps someone else can formulate a sharper explanation for the instability of the PK estimator; our conclusions about the credibility of the PK findings remain regardless.