We just ran 23 million queries of the World Bank's website. Technically, a piece of computer code did the work, occupying a PC in an empty cubicle in our office for about 9 weeks, gradually sweeping up nearly every bit of information available in the World Bank’s global database on poverty and inequality, known as PovcalNet.
Why did we go through all this trouble? The parochial answer is that we wanted to use the data for our own research and got frustrated with the World Bank website, which is designed to dole out the data in bite-size chunks rather than the large swaths researchers might want. After a somewhat, erm, delicate negotiation with colleagues at the World Bank, we’ve just posted the resulting paper, data set, and code online, so data-oriented readers can now download the full income and consumption distributions from 952 surveys across 127 countries over 35 years in a convenient set of CSV files, rather than running repetitive queries of the PovcalNet web interface.
The more grandiose motivation for our 23 million web queries is that a serious public debate about global poverty and inequality goals is potentially unfolding, and serious public debate requires transparent public access to the underlying data in question. In his 2013 State of the Union speech, President Obama pledged US aid to reach a new target of zero extreme poverty within two decades, and the new World Bank president Jim Kim has made that zero poverty target the new overarching goal of World Bank policy. If US government spending and World Bank loans will hinge on these numbers, then independent researchers ought to be able to replicate the calculations, debate the many difficult and sometimes questionable judgment calls that World Bank staff make along the way, and possibly propose alternative methods.
Here are three steps the World Bank should take to make global poverty data open to the public:
1. Embrace open data standards. The PovcalNet website is great for many users. But for researchers who would like to seriously kick the tires behind the Bank’s calculations, it locks the data in an unnecessary straitjacket. Give us freely accessible, machine-readable files.
2. Post the code. There’s already some dense documentation on the PovcalNet website, but it has gaps. For instance, the description of how the World Bank aggregates national poverty rates up to regional and global estimates seems sensible but is quite vague. We found it impossible to replicate this aggregation, even after asking for help.
3. Release enough micro data to recreate the estimates. For many countries, there’s nothing preventing the World Bank from posting the entire unit-record micro data set, properly anonymized. For countries that object, the Bank could still release grouped data sufficient to replicate their calculations more or less from scratch.
In a matter of days, the World Bank will release the new purchasing power parity data from the International Comparison Project, which are the price deflators underlying all cross-country comparisons of poverty and real GDP. Rumors are swirling that the new numbers will lead to some significant revisions of earlier poverty and GDP estimates. What better way to keep the World Bank above the fray than by taking an aggressive stance on full data transparency?
Until then, we hope our clunky solution here will make it slightly easier for independent researchers to delve into the public debate.
 Interestingly, we're not the first people to try this. After the fact, we learned that Sanjay Reddy of the New School for Social Research made a similar effort several years ago, but the World Bank server hosting the poverty data crashed before the process was completed.
 To be clear, no amount of web scraping on our end can do these things. We accessed only publicly available information on the Bank website; we're not hackers who gained access to anything confidential. So we can help a bit with #1, but not #2 or #3. Also, most of the numbers PovcalNet publishes are only modeled results, not “raw” data. That makes our web scraping look quite silly in some cases, sort of like doing handwriting analysis of a typed page. We wish the World Bank didn’t use these modeled approximations. Fortunately, as we document in our paper, for over 30% of the country-years in the database, they don’t. So while our web scraping still doesn’t access the original survey data, it does turn up new information each time, even after millions of queries.