CGD's New Data & Code Transparency Policy

August 01, 2011

CGD has just adopted a policy that I believe will improve the quality and usefulness of our work. We have decided to become more transparent. Henceforth, the presumption will be that when authors post publications that involve quantitative analysis, they will also post the data and computer code needed to fully reproduce their results. That way, any visitor to the web site will in principle be able to check our work. (Not that we never shared data before.)

To quote from the policy (on which, comments welcome):

CGD analyses should be acts of social science. By some definitions, a sine qua non of science is replicability. The responsibility for replicability is especially great for research that aims to influence policy and ultimately affect the lives of the poor. Bruce McCullough and Ross McKitrick put it well in their report, Check the Numbers: The Case for Due Diligence in Policy Formation:
When a piece of academic research takes on a public role, such as becoming the basis for public policy decisions, practices that obstruct independent replication, such as refusal to disclose data, or the concealment of details about computational methods, prevent the proper functioning of the scientific process and can lead to poor public decision making.
In fact, transparency has many benefits:
  • It makes analysis more credible.
  • It makes CGD more credible when it calls on other organizations, such as aid agencies, to be transparent.
  • Data and code are additional content, appreciated by certain audiences.
  • It increases citation of CGD publications, by people using associated data sets.
  • It curates, saving work that otherwise tends to get lost as the staff turns over.
  • Preparing code and data for public sharing improves the quality of research: researchers find bugs.
  • In the short term, CGD’s leadership in transparency will differentiate it from its peers. In the long term (one hopes), CGD’s leadership will raise standards elsewhere.
For me, the most interesting point that emerged from CGD's internal discussions about this policy came from my colleague Justin Sandefur. Sometimes data sets are obtained after lengthy and delicate negotiations with officials in governments or private companies, on the condition of confidentiality. To post the raw data publicly would burn many bridges. Perhaps more importantly, promptly sharing data sets assembled at great cost would give other researchers, rivals in pursuit of publication, a free ride. They would jump on the opportunity to generate papers from others' data. And when the benefits of data collection for the collector go down, less data will be collected. In these cases, perhaps the processed data behind a given paper, the data actually subjected to statistical analysis, can still be shared promptly, with the raw data held back for a year or so. At any rate, the point stands that there can be real trade-offs in choosing transparency, and sometimes the right choice is to be less than fully transparent. For this reason, CGD's new policy is flexible. We will be testing the frontiers of transparency in the months to come, and invite you to watch us closely.

Personally, the policy resonates with my experiences attempting to reproduce influential studies of the impact of such things as foreign aid, financial sector expansion, and microcredit. To me, it always felt important to replicate these studies in order to examine their methods closely and reach my own conclusions about interpretation. Some of the authors of the studies I examined shared their data and code fully enough that replication was easy. In other cases, reconstruction was harder.

In the case of the microcredit work, my coauthor Jonathan Morduch and I posted all the data and programs behind our attempted replications of the original studies. This eventually allowed Mark Pitt, one author of the original studies, to spot some discrepancies in our replications. This highly public revelation was not particularly pleasant for us, but it was healthy, and it served the cause of understanding the evidence on the impacts of microcredit. (For us anyway, it did not affect our conclusions and in fact strengthened them by improving our match to the original studies.)

Fundamentally, then, the new data and code transparency policy is about putting the pursuit of truth first. We believe that this step is both right in itself and strategically smart. In statistical analysis, as in software, bugs are the norm. So placing more of CGD's work in the public domain will inevitably expose mistakes. That can be a daunting prospect for an organization that prizes its reputation for high-quality analysis. But transparency serves the public good. And serving the public good is what CGD, as a charity, should do. Moreover, the success of open source projects such as Wikipedia and Android reassures us that doing the right thing is wise. The flip side of catching more mistakes is better work. And that should lead to greater impact.

More about CGD's Research Data Disclosure policy can be found here.


CGD blog posts reflect the views of the authors, drawing on prior research and experience in their areas of expertise. CGD is a nonpartisan, independent organization and does not take institutional positions.