You may know that when I began my inquiry into microfinance I found myself drawn into scrutinizing what were then the leading studies of the impact of microcredit. The strategy was to reproduce ("replicate") those studies on my own computer as well as I could. In 2009, I wrote a paper with Jonathan Morduch that built on his earlier work in this vein, and argued that the leading studies had not, after all, succeeded in measuring impacts. This reconciled the contradiction between those older studies, with their encouragingly positive results, and the randomized ones just emerging with more muted findings. The most important of the papers we examined is by Mark Pitt of Brown University and Shahidur Khandker of the World Bank (Pitt & Khandker, or PK).
Last March, Mark Pitt responded to our work, in a paper of his own that sought "to correct the substantial damage that their claims have caused to the reputation of microfinance as a means of alleviating poverty." He pointed out two ways in which we had diverged in "replicating" PK. These explained certain differences between our results and PK's. Soon after, I posted a preliminary reply. Now Jonathan and I have responded in full.
I think this latest volley is decisive. To give that claim credibility, I'll point out, as I did in the spring, that this is the first time that someone other than Mark Pitt has been able to closely reproduce the key statistical runs in PK. "That’s bound to lead to some interesting findings," I wrote. And it did. What we learned reinforces our original conclusion.
We have posted two papers. The first is a more or less point-by-point reply to Pitt's critiques, which we felt compelled to make because of the barbs about "fatal econometric errors," failing to use the data set he provided, etc. The second is a revamped analysis of PK, which incorporates Pitt's corrections and digs deeper. All data and computer code, as well as all files for the 2009 paper, are here.
Here, for experts, I will quote from the conclusion of the new revamped paper. Then I'll explain one of our discoveries in something closer to regular English.
From the conclusion:
Pitt and Khandker (1998) is in many ways a brilliant study of an important question. But its econometric sophistication backfires. The study is hard to understand and its complexities hide several major problems. These include:
- an imputation for the log of the treatment variable when it is zero that is undocumented, influential, and arbitrary at the margin;
- the absence of a discontinuity that is asserted as central to identification;
- a reclassification of formally ineligible but borrowing households as eligible, which presumably introduces endogeneity into the asserted quasi-experiment;
- evidence of a linear relationship between the instruments and the error;
- evidence of instrument weakness, especially when microcredit borrowings are disaggregated by sex;
- disappearance of the results when villages where both genders could borrow are excluded;
- bimodality driven by 16 outliers from 14 households.
I want to show you what that last bullet means. Most econometric analysis starts by stipulating a structure and estimating parameters. Sorry, I know that's not regular English. Here's an example. This is a graph of the height of a bunch of students (in centimeters) versus their age:
The textbook approach to quantifying the relationship between two such variables is to assume that it basically follows a straight line. Of course we don't expect all the dots to fit perfectly onto a line. Some will be above the average line and some below. Their deviations from the line might follow a bell curve, meaning that most deviations would be small, a few big, and that big positive deviations (kids quite tall for their age) would be as common as big negative deviations (short kids). Having assumed that height and age relate through this line structure (which seems reasonable from the graph), we would then have a computer figure out which line best fits the data. This line will be characterized by a couple of numbers, such as its angle of ascent and where it crosses the vertical axis. Those numbers are called parameters. Typically the parameter of greatest interest is the angle or slope of the line: that tells you how fast kids grow, on average. (Or how much microcredit reduces poverty.)
This being a mathematical exercise, "fit" must be defined precisely before the computer can search for the best fit. Usually fit is defined by measuring how far each dot is above or below a candidate for the best-fit line, squaring those distances, then averaging them. That average squared distance is the thing to minimize in searching for the best line. One reason for this definition of fit is that it lets you use calculus to derive an elegant formula for the unique best fit line, which computers can apply in milliseconds, giving you something like this:
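For readers who like to see the arithmetic, here is a minimal Python sketch of that calculus-derived formula. The ages and heights are made up for illustration and are not the data behind the graph; the function names `fit_line` and `mean_squared_distance` are likewise just labels chosen for this example.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Minimizing the average squared vertical distance with calculus gives
    # slope = (co-movement of x and y) / (spread of x).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The best-fit line always passes through the point of averages.
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_distance(xs, ys, slope, intercept):
    """The 'fit' being minimized: average squared distance from dot to line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical ages (years) and heights (cm), for illustration only.
ages = [6, 7, 8, 9, 10, 11, 12]
heights = [115, 122, 127, 133, 138, 144, 149]

slope, intercept = fit_line(ages, heights)
best = mean_squared_distance(ages, heights, slope, intercept)
print(f"slope = {slope:.2f} cm per year, intercept = {intercept:.1f} cm")
```

Nudging the slope or intercept away from the values returned by `fit_line` can only increase `mean_squared_distance`, which is what "best fit" means here.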
In Pitt & Khandker, the assumed relationships between such variables as household spending, land holdings (in acres), and microcredit borrowings are much more complicated. One reason is that a simple line fit for microcredit could predict that some households would borrow negative amounts, since unless the line is horizontal, one of its ends will dip negative. Negative borrowing is possible in the mathematical world but not in the real world. So PK elaborate their model to make it more realistic. One consequence is that when they go to fit their model to real-world data, there is no longer a quick, elegant formula for the best fit---in particular, for the estimated impact of microcredit on poverty.
This is why Mark Pitt had to program his computer to search for the best fit by trial and error. The process is standard in econometrics and is called Maximum Likelihood. It is analogous to a blind ant searching for the highest point in the Himalayas. The ant starts somewhere. It explores the immediate neighborhood. It determines which nearby point is highest and goes there. And it repeats, maybe millions of times, until it gets stuck at a place where all neighboring points are downhill. Then the ant assumes it is at the highest point.
This may sound like a dumb strategy. The ant could easily get stuck at the top of some mountain other than Everest, and wrongly conclude that it had found the tallest peak. The strategy is not dumb (do you have a better idea?) but it is potentially problematic. Mathematically, the computer really is a blind ant, unable to scan the horizon for the highest peak. Users of Maximum Likelihood must hope that there is just one mountain on the landscape, so that it is impossible to get stuck on the wrong one. (In some cases, they know.) There is reason for their hope: a foundational theoretical finding is that if the model being fit is correct, the probability of there being more than one mountain peak goes to 0 as the number of data points increases.
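The blind ant can be sketched in a few lines of Python. This is a one-dimensional toy on a made-up terrain with two peaks, not the PK likelihood, and real Maximum Likelihood software uses more refined search rules than this simple step-to-the-higher-neighbor loop. The names `landscape` and `blind_ant` are invented for the example.

```python
import math


def landscape(x):
    """A made-up terrain: a lower peak near x = -2 and a higher one near x = 3."""
    return 1.0 * math.exp(-(x + 2) ** 2) + 1.2 * math.exp(-(x - 3) ** 2)


def blind_ant(start, step=0.01, max_steps=100_000):
    """Walk uphill one small step at a time; stop when both neighbors are no higher."""
    x = start
    for _ in range(max_steps):
        here = landscape(x)
        left, right = landscape(x - step), landscape(x + step)
        if left <= here and right <= here:
            break  # every neighboring point is downhill (or level): the ant stops
        x = x - step if left > right else x + step
    return x


peak_from_left = blind_ant(-4.0)   # starts on the slopes of the lower mountain
peak_from_right = blind_ant(1.5)   # starts on the slopes of the higher mountain

print(f"ant starting at -4.0 stops near x = {peak_from_left:.2f}")
print(f"ant starting at  1.5 stops near x = {peak_from_right:.2f}")
```

The two ants follow the same rule and end up on different mountains. Only the one that happened to start near the higher peak finds it, and neither can tell from where it stands whether a taller peak exists elsewhere.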
I discovered that even though the PK data set is big (1,798 households surveyed three times), when the PK model is fit to this data, there are (at least) two mountains. In other words, within the PK analytical framework, there are two competing ways to explain the data; and they are nearly tied in quality of fit. Here is a cross-section of the terrain that is searched (which actually has about 250 dimensions):
The subtle peak on the right corresponds to PK's published finding that microcredit raised household spending (thus reduced poverty) in Bangladesh; that's where their computer ant got stuck. The 0.043 shown there translates into the figure that Muhammad Yunus used to cite, that "In a typical year 5 percent of Grameen borrowers…rise above the poverty level." (Look for the .0432 near the center of Table 2 in the PK paper.) But there is another peak on the left, a hair lower. Under the PK interpretation, this peak suggests that the impact of microcredit for women is, after all, negative (-0.018). This is essentially the peak where Jonathan and I got stuck in the 2009 version of our paper. On balance, I think this contour is best viewed not as twin peaks, but as one really wide peak that spans possibilities of both positive and negative impact. It indicates huge imprecision in the estimated impact of microcredit on poverty, which was till now undiscovered.
In sum, what this graph is saying is that if you buy the PK statistical approach (which Jonathan and I do not, for reasons explained elsewhere in our paper), then the impact of microcredit is either positive or negative.
Perhaps because I haven't looked that hard, I'm not aware of another example of bimodality being discovered after the fact in a published result. [Update, August 20, 2012: Now I am. Hat tip to Ben Bolker.]
For some rough intuition, imagine that in the height-age graph above, the pattern of dots looked not like a "/" but an "X". Then the computer would be torn between choosing a "/" or a "\" as the best fit, two choices that would imply opposite conclusions for whether children grow or shrink as they age. In fact, the standard line-fitting algorithm would compromise with a "--" but you get the idea.
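The "X" intuition can be checked with a tiny Python sketch. The ten points below are invented to form a perfect "X"; nothing here comes from the PK data, and the helper names are made up for the example.

```python
xs = [-2, -1, 0, 1, 2]
rising = [(x, x) for x in xs]     # the "/" arm of the X
falling = [(x, -x) for x in xs]   # the "\" arm of the X
points = rising + falling


def least_squares_slope(pts):
    """Standard best-fit slope: co-movement of x and y over spread of x."""
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    return sxy / sxx


def mean_squared_distance(pts, slope):
    """Average squared distance from the dots to a line through the origin."""
    return sum((y - slope * x) ** 2 for x, y in pts) / len(pts)


compromise = least_squares_slope(points)
fit_up = mean_squared_distance(points, 1.0)     # the "/" candidate
fit_down = mean_squared_distance(points, -1.0)  # the "\" candidate
fit_flat = mean_squared_distance(points, compromise)

print(compromise, fit_up, fit_down, fit_flat)
```

The "/" and "\" candidates tie exactly in quality of fit, and the least-squares formula settles on the flat line between them. In this toy case the tie is visible at a glance; in a model with hundreds of dimensions it is not.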
Why does this "bimodality" occur in the more complex context of PK? The theoretical prediction that bimodality should disappear in big data sets requires that the model being fit is correct. In the case at hand, we discovered that it is not. In particular, the distances of the households from the model's predictions, analogous to the distances from the best fit line above, do not follow the assumed bell curve pattern. PK do not report checking for this. There are more households with really high spending---much higher than predicted by the PK model---than with really low spending. The big spenders report buying land and property, paying dowry, and employing servants. Here's the graph showing the spread in spending. The long tail on the right, consisting of those big spenders, breaks the symmetry of what would otherwise be a pretty accurate bell curve:
The right tail might seem thin and innocuous but it throws the statistical analysis. To demonstrate this, we try dropping the highest-spending household observations one at a time from the analysis. As we drop more, recrunching the numbers each time, the two peaks gravitate together---and toward zero impact---and collapse into one after the 16 highest-spending observations are dropped. The graph below shows this process. Scan it from right to left. The rightmost pair of dots corresponds to the twin peaks graphed above. The next rightmost pair shows how those peaks shift slightly after the highest-spending observation is dropped. And so on.
So if we drop the 16 highest-spending observations out of 5,218, the PK finding goes away. This is called sensitivity to outliers and is a serious problem. The conclusion that microcredit reduces poverty should not depend on the inclusion in the study of a handful of anomalous families.
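The drop-and-refit exercise can be illustrated with a simple straight-line fit in Python. This is only a sketch: it uses ordinary least squares rather than the PK Maximum Likelihood model, and the 23 "households" below are invented so that three big spenders drive the result.

```python
def slope(points):
    """Least-squares slope of spending on borrowing for (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


# Hypothetical (borrowing, spending) pairs: 20 ordinary households whose
# spending hovers around 10 whatever they borrow, plus 3 big spenders
# who also happen to borrow a lot.
ordinary = [(x, 11 if x % 2 else 9) for x in range(20)]
big_spenders = [(17, 38), (18, 40), (19, 45)]
households = ordinary + big_spenders

# Sort from highest to lowest spending, then refit after dropping the
# top 0, 1, 2, and 3 observations.
by_spending = sorted(households, key=lambda p: p[1], reverse=True)
slopes = [slope(by_spending[k:]) for k in range(4)]

for k, s in enumerate(slopes):
    print(f"dropped {k} highest spenders: slope = {s:.3f}")
```

In this made-up example the estimated slope falls from about 0.82 with all 23 households to about 0.02 once the three big spenders are dropped. A finding that survives or dies with three observations out of 23, or 16 out of 5,218, is a finding about those observations.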
Our paper probes the causes of this sensitivity. It apparently has to do with PK's attempt to separately measure the impact of lending to women and lending to men. The reasons are too technical to go into here (having to do with "weak instruments").
A big idea here is that while it is usually necessary to assume a structure such as a straight-line relationship in order to sic your computer on the hunt for the best fit, one should take time to assess whether the assumed structure is accurate. This is called "specification testing." We humans must be wiser than the blind ants we deploy.
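One very simple check in this spirit is to measure whether the deviations from the fitted model are lopsided. The Python sketch below computes the sample skewness of two made-up lists of deviations; it is an illustration of the idea only, not the test used in our paper, and the numbers are invented.

```python
def skewness(values):
    """Sample skewness: about 0 for a symmetric spread, positive for a long right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # average squared deviation
    m3 = sum((v - mean) ** 3 for v in values) / n   # average cubed deviation
    return m3 / m2 ** 1.5


# Hypothetical deviations from a fitted line, for illustration only.
symmetric = [-2, -1, -1, 0, 0, 0, 1, 1, 2]   # bell-curve-like: balanced tails
right_tailed = symmetric + [6, 8]            # the same, plus two big spenders

print(f"symmetric deviations:    skewness = {skewness(symmetric):.2f}")
print(f"right-tailed deviations: skewness = {skewness(right_tailed):.2f}")
```

A skewness far from zero is a warning that the bell curve assumption may not hold, and that the best fit may lean heavily on the observations in the long tail.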
For me, reporting these findings brings a sense of completion. My goal when I began this inquiry was to achieve a full understanding of the evidence base on the impact of microcredit, so that I could interpret it with confidence for others. Thanks to Mark Pitt's corrections last spring, Jonathan and I have, I believe, gotten to the core of this complicated and influential paper.
The experience corroborates my earlier work in showing that non-experimental studies are less reliable than they appear. It shows how complicated econometrics can hide rather than solve the fundamental obstacles to studying cause and effect. And it demonstrates the value of sharing data and code: had PK's data and computer code been publicly available, their work could have been independently examined long ago; and our own sharing allowed Pitt to correct our mistakes. None of us had a monopoly on the truth. Openness allowed us to move forward, however awkwardly.