[Note: A full response to Mark Pitt is now available.]
And so it continues: the scientific process, live on your screen.
Mark Pitt has replied to my preliminary reply to his response to my replication with Jonathan Morduch of Pitt's paper with Shahid Khandker, which was provoked by Pitt's reply in 1999 to Jonathan's replication of Pitt's paper with Shahid Khandker.
Naturally, I must reply.
Before you roll your eyes and click Close, let me pitch the next few paragraphs to you. (I'm at the Global Philanthropy Forum, where pitching is the dominant form of social interaction.) Thanks to Mark Pitt's correction of our mistake, this is the first time someone other than him has been able to run and scrutinize the headline regression in the much-discussed paper. That's bound to lead to some interesting findings. I'm sure the discussion won't end with this post, but I'm also sure it hasn't degenerated into obscure hair-splitting yet. Here, I will give a nontechnical summary, then descend into technicalities for the econometricians who are reading. Actually this could be the last chapter in the saga for a while because Mark just resumed his sailing circumnavigation of the globe.
In Impact Evaluation 101 you learn that you can't just compare people who have taken a "treatment" like microcredit to those who haven't. Sure, if the borrowers are doing better, maybe microcredit is the reason. But maybe the "treatment group" members were better off to begin with and that's why they were readier to borrow. That reverse causation could make it look like microcredit helped even if it did not.
Easily said and easily understood: but it means good impact research is not easily done. And that is why most of the early studies of the impact of microcredit are not to be relied upon. As Pitt and Khandker explain:
To the extent that program participation is self-selective, it is not clear whether measured program effects reflect, in part, unobserved attributes of households (such as ability, health, and preferences) that affect both the probability they will participate in the programs (and the extent of that participation) and the household outcomes (schooling of children, labor supply, and asset accumulation) of interest. It is important not only to measure the impact of these credit programs on household welfare, but to determine whether targeting of credit toward women really matters.
One reason Pitt and Khandker's study was important when it appeared in the late 1990s is that it made a credible claim to being more credible. It wasn't randomized like the studies you hear about today, but it was supposed to be the next best thing, a "quasi-experiment," which exploited "naturally" occurring, partly arbitrary variation in the availability of microcredit. Here is the intuition, simplifying: officially, households owning more than half an acre of land weren't poor enough to be eligible for microcredit. Yet families owning 0.49 acres and families owning 0.51 acres should have been basically the same, statistically speaking. Since the 0.49s could get credit and the 0.51s couldn't, they made a great treatment-control comparison, allowing "clean identification" of the impact of microcredit. If the 0.49s were earning and spending more, for instance, it was hard to see what could explain that other than microcredit. Pitt and Khandker:
In this paper, we suggest and implement a method that treats survey data on participation in group-based credit programs as though this participation were generated by an experiment, with access to group-based credit "randomly" allocated to one sex or another, and that controls for self-selection into the program by these "randomly" chosen household members.
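The threshold intuition can be sketched in a few lines of simulation. This is purely illustrative: the landholdings, effect size, and noise below are invented, and real eligibility and take-up were far messier than this.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
land = rng.uniform(0.0, 1.0, size=n)        # acres owned (hypothetical)
eligible = land < 0.5                        # the half-acre rule
# Suppose baseline spending rises smoothly with landholding, and credit
# adds a true effect of +10. All numbers are made up for illustration.
credit = eligible.astype(float)              # for simplicity, every eligible household borrows
spending = 100 + 20 * land + 10 * credit + rng.normal(scale=5, size=n)

# A naive borrower/non-borrower comparison is confounded by landholding,
# but households in a narrow band around 0.5 acres are nearly identical,
# so their difference in means isolates (approximately) the credit effect.
band = np.abs(land - 0.5) < 0.02
effect = spending[band & eligible].mean() - spending[band & ~eligible].mean()
# effect is close to the true +10, up to a small smooth-trend bias and noise
```

The point of the toy example is only that the comparison is clean *because* eligibility flips at an arbitrary cutoff unrelated to the households' other traits, which is exactly what the rest of this post questions about the actual data.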
Understanding that, here are my main points (preliminarily). Technical details come later:
- Unlike Pitt's first response, which pointed out an important mistake, his new one doesn't leave much of a dent in my view. It focuses on the statistical tests I updated the other week that support the idea that Pitt and Khandker's statistical strategy was not working. In other words, unlike the first response, it confronts our assertion that causality remains unproven---and leaves the assertion standing.
- There actually is a weakness in the causality tests I reported, which Pitt did not mention: weak instruments. (If you don't know what that means, don't ask; hat tip to Mark Schaffer.)
- I now see those statistical tests as secondary anyway. They are based on a separate though analogous statistical method. Now that Pitt has shown how to match Pitt and Khandker's original method, it is possible to focus on it directly, and this gives a lot more insight.
- Big Point #1: Pitt and Khandker's analysis is not actually a quasi-experiment. It is not mathematically anchored in the half-acre rule. To that extent, the paper therefore does not seem to deserve higher status than the studies that came before. This point is straight from Jonathan.
- If you anchor Pitt and Khandker's regression properly in the half-acre rule---if you do the quasi-experiment---the bottom-line impact finding goes away. In fact it may even flip back to negative. Jonathan said this too. But he did it with statistical methods that were simpler and quite different from Pitt and Khandker's, leaving it unclear what explained the differences in results. Now because of a cool program I wrote, we are able to make the key change to Pitt and Khandker's own regressions.
- Big Point #2: It appears that even if there is a quasi-experiment in Pitt and Khandker, that's not the source of the results. That too weakens the study's claims to superiority.
On Pitt's most recent criticisms (not so important)
The linchpin of my last post was a Hansen overidentification test on a 2SLS regression identified by the same assumptions put forward by Pitt and Khandker (PK). (For an exposition of PK's methods and the 2SLS analog, see the first subsection of Roodman & Morduch.) Secondarily, I reported Sargan tests on regressions restricted to data from each of the three survey rounds. The thinking was that since Tobit assumes homoskedasticity, and since PK, in clustering errors at the household level, assume error correlations only in the time dimension, errors within each cross-section could be treated as i.i.d. The Sargan test is valid for i.i.d. errors and, because it exploits this stronger assumption, more powerful. However, as I realized last week, and as Pitt points out, PK's use of sampling weights effectively introduces heteroskedasticity, invalidating the Sargan tests. So one must rely on the Hansen tests throughout. They are in the table row at the very bottom of that post. They corroborate the Sargan tests: the instruments fail the overidentification test of instrument validity.
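For readers who want the mechanics of a Hansen overidentification test without the surrounding apparatus, here is a minimal numpy sketch on simulated data. Nothing here is PK's data or specification; the instrument count, coefficients, and error structure are invented to make the test's logic visible.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Overidentified setup: 3 excluded instruments, 1 endogenous regressor
z = rng.normal(size=(n, 3))
u = rng.normal(size=n)                               # structural error
x = z @ np.array([1.0, 0.5, 0.25]) + 0.8 * u + rng.normal(size=n)
y = 2.0 * x + u                                      # true coefficient = 2

Z = np.column_stack([np.ones(n), z])                 # instruments (with constant)
X = np.column_stack([np.ones(n), x])                 # regressors

# 2SLS: project the regressors onto the instrument space
Pz = Z @ np.linalg.solve(Z.T @ Z, Z.T)
b_2sls = np.linalg.solve(X.T @ Pz @ X, X.T @ Pz @ y)

# Robust weight matrix from the 2SLS residuals, then two-step GMM and Hansen J
e = y - X @ b_2sls
S = (Z * e[:, None]).T @ (Z * e[:, None]) / n        # heteroskedasticity-robust
W = np.linalg.inv(S)
b_gmm = np.linalg.solve(X.T @ Z @ W @ Z.T @ X, X.T @ Z @ W @ Z.T @ y)
gbar = Z.T @ (y - X @ b_gmm) / n                     # average moment conditions
J = n * gbar @ W @ gbar                              # ~ chi2(4 - 2) if instruments are valid
```

Under valid instruments, J is small relative to its chi-squared reference distribution; a large J, as in the tests described above, signals that the instruments are correlated with the second-stage errors.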
I think Pitt's other criticisms have even less bite. Mostly, they seem to attack the first stage of the 2SLS regressions for lack of realism: it doesn't model the deterministic zero-ness of microcredit for households without access to it; it doesn't constrain the instruments for female borrowing to have zero coefficients in the equation for male borrowing, and vice versa. But these equations are reduced forms and don't need complete realism for consistency. In fact their lack of realism is a strength: unlike the first-stage Tobits in PK, they are robust to heteroskedasticity. The fact remains that the 2SLS regressions are consistent GMM-type estimates identified by PK's own stated moment conditions.
Pitt's point about GMM being preferable seems to make little sense. Probably he means doing two-step GMM with 2SLS as the first step. In this case, the Hansen J would be exactly the same as reported.
Last, Pitt introduces certain interaction terms into the second stage of the PK regression and shows that they have little explanatory significance. But those interaction terms are not the actual instruments, which are interactions between the X variables and the dummies for availability of credit to males and females. The latter are used in the 2SLS regressions, and a large subset of them, based on village dummies, does have explanatory power when added to the second stage, as we showed in the paper and in the last blog post. The "instruments," in other words, do not look like they can be validly excluded.
All that said, there may be a problem with our 2SLS regressions. The instruments seem weak. If one aggregates the six credit variables into two, one for each gender, ivreg2 produces a Cragg-Donald F of about 4 and Stock-Yogo benchmarks that indicate weak instruments. That undermines the distributional assumptions behind the 2SLS estimator and perhaps the Hansen test too. Since PK's LIML regressions may be more robust than 2SLS to the equivalent of having many, weak instruments, it's not clear to me exactly how the concern about instrument weakness translates to the PK regressions.
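To give a feel for what a Cragg-Donald F of about 4 means, here is a simulated first stage with deliberately weak instruments. (With a single endogenous regressor, the Cragg-Donald statistic reduces to the first-stage F on the excluded instruments; the coefficients below are invented to land in the same range as the statistic reported above.)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
z = rng.normal(size=(n, 3))                  # 3 excluded instruments
# Deliberately weak first stage: tiny coefficients relative to the noise
x = z @ np.array([0.05, 0.05, 0.05]) + rng.normal(size=n)

Z = np.column_stack([np.ones(n), z])
pi, *_ = np.linalg.lstsq(Z, x, rcond=None)   # first-stage regression
rss = ((x - Z @ pi) ** 2).sum()              # unrestricted (with instruments)
tss = ((x - x.mean()) ** 2).sum()            # restricted (constant only)
k = 3                                        # number of excluded instruments
F = ((tss - rss) / k) / (rss / (n - Z.shape[1]))
# F lands in the low single digits, well under the usual Stock-Yogo
# thresholds, so 2SLS point estimates and tests become unreliable
```

The instruments "work" in the sense of being correlated with the endogenous variable, but so faintly that the 2SLS sampling distribution departs badly from its asymptotic approximation.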
On the (lack of) quasi-experiment
Pitt and Khandker are clear about the basis for their quasi-experiment:
The parameter of interest..., the effect of participation in a credit program on the outcome..., can be identified if the sample also includes households in villages with treatment choice (program villages) that are excluded from making a treatment choice by random assignment or some exogenous rule. That exogenous rule in our data is the restriction that households owning more than one half acre of land are precluded from joining any of the three credit programs.
But as Jonathan pointed out in 1999, many households above half an acre borrowed too, and are classified as de facto eligible (or "target") in the analysis. Thus in the PK regressions the treatment-control divide, what can be thought of as defining "intent to treat," is often based on actual treatment, not the highlighted half-acre rule. The only way it can be a quasi-experiment, then, is if the Grameen Bank, BRAC, and the BRDB enforced some other treatment assignment rule which was substantially arbitrary at the margin. If they did, we don't know what the rule was. Pitt described it as "unknown" in his 1999 reply to Morduch. At any rate, PK claim no such assignment rule.
It is understood that discontinuities in program implementation are often fuzzy for practical or humane reasons. But for PK to claim quasi-experimental status, their modeling of intent to treat needs to be based on a plausibly exogenous rule with a plausible relationship to the actual implementation rule. The only way we see to do that is to redefine target status in the regressions based strictly on the half-acre rule. In practice, that means adjusting the target status dummy I infamously left out of the second stage; and adjusting the samples for the first-stage credit equations, which consist of those households quasi-experimentally offered credit. When we do that---when we perform what seems the proper quasi-experiment---the PK result goes away. If anything, the signs flip to negative. (See table at end.)
In his 1999 reply to Morduch, Pitt takes on this issue. He experimentally redefines target status as "de facto target" OR "owning land below a certain threshold." (His Table 4.) This preserves the PK result. But it is still not a strict quasi-experimental rule because of the "de facto" component.
On the real source of identification in PK
A subtle point about the PK regressions that perhaps no one involved has fully appreciated till now is that they have two potential sources of identification: exogeneity assumptions, which PK emphasize because they embody the asserted quasi-experiment, and the nonlinearity in the first-stage Tobit modeling of credit. In fact, from an econometric point of view (leaving aside interpretation) either can suffice alone.
As for the nonlinearities, Wilde (2000) shows for example that a multi-stage probit model with no exclusion restrictions is identified. In other words, by analogy, we can introduce all the PK instruments linearly into the second stage, and the model, if correct, can still identify the impact of credit. We do that in the table below and, consistent with the empirics in Pitt's latest response, PK's results stand up fine. For intuition, imagine we knew that credit had two impacts on household consumption, one linear and one quadratic. A simple regression on credit and its square would identify both. Similarly, if the first-stage Tobit models are correct, they will still capture the non-linear causal channel from the instruments to the outcome via borrowing, even when the instruments are present linearly in the second stage.
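The quadratic intuition can be made concrete with a toy regression. All numbers below are invented; the point is only that two causal channels of the same variable can be separately identified when their functional forms differ.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
credit = rng.exponential(size=n)             # hypothetical borrowing amounts
# Suppose credit truly affects consumption through a linear and a
# quadratic term (coefficients 0.5 and -0.1, chosen for illustration)
consumption = (1.0 + 0.5 * credit - 0.1 * credit**2
               + rng.normal(scale=0.5, size=n))

# One regression on credit and its square recovers both effects at once:
# no exclusion restriction is needed, because the nonlinearity itself
# distinguishes the two channels
X = np.column_stack([np.ones(n), credit, credit**2])
coef, *_ = np.linalg.lstsq(X, consumption, rcond=None)
```

By analogy, if the first-stage Tobits are the true model, the nonlinear channel from instruments through borrowing to the outcome remains distinguishable even when the instruments also enter the second stage linearly.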
Only one of the two bases of identification---the exogeneity assumptions---is rooted in the quasi-experiment. And it looks like that one is not the source of PK's results. If we retain the structure that embodies the exogeneity assumptions but drop the nonlinear modeling, the results go away, as noted in the last post for 2SLS and shown below for classical linear LIML.
But if we keep the Tobit modeling and drop the exogeneity assumptions, the results get stronger. The way we execute this may seem strange: the key exogeneity assumptions are that the dummies defining the male and female intent-to-treat groups are unrelated to the second-stage errors. We do away with those assumptions by setting the dummies to 1 for all households. Concretely, we expand the samples of the first-stage equations to the full sample, so that all equations have the same sample and the same control set. This makes no sense from a standard treatment-evaluation perspective, and doing this in 2SLS would leave the model unidentified. But it works in this nonlinear setting, in the sense that the estimator still converges.
All this suggests that the nonlinearity rather than the quasi-experiment is the source of the identification of PK's results. The credit variables are picking up a strong nonlinear relationship between the controls and household consumption. Whether credit is truly one link in the chain of the causal mechanism, or merely proxying for some other nonlinear relationship, is less clear. Broadly, this issue also goes back to Jonathan's paper. To investigate it, we try introducing the squares of all the non-dummy controls in the second stage. This also turns out to rob the credit variables of their explanatory power and undercuts the idea that the credit Tobits are part of the true model.
The table below shows what happens when we modify Pitt's recent replication, using my program and his data set, in various ways described above. The Stata do file runs on the data Pitt posted.
The first of the columns is Pitt's replication. The rest separately introduce various changes (except that the last makes two changes at once).
The second column bootstraps the standard errors, acting on advice for similar models from a new paper by Chiburis, Das, and Lokshin. (Bootstrapping is done sampling whole villages at a time because of within-village observation reweightings.) This widens the confidence intervals considerably but to my eye leaves the positive female credit coefficients still looking like more than a fluke.
The third instead models credit in the first stage as linear; the results go away, consistent with the idea that the nonlinearity is key to identification.
The fourth puts all households in the treatment group, i.e., expands the samples of the first-stage equations to the full sample. The results become extremely strong. And despite their dubious meaning, they show a strong continuity with the original results.
The fifth undoes that change but introduces the entire instrument set linearly into the second stage. Though not shown in the table, the instruments enter with a strong F statistic, just as in 2SLS in my last post (p = 4.3×10^-28). Yet the results on credit again remain strong.
The sixth instead adds the squares of the controls. This undoes the result.
The seventh modifies the original by setting up a proper quasi-experiment. "Intent to treat" is now based on the arguably exogenous (or at least external) half-acre rule. This destroys the result too.
The last column bootstraps the previous one.