
Generative AI Evaluation Playbook: Policy Brief

From math tutors to farmer advisory tools, generative AI (GenAI) is rapidly expanding in low- and middle-income countries. While some evidence shows development gains, other findings point to harm. Evaluations can assess and help address these risks, but there is little agreement on what they should include. Tech teams prioritize product performance, often overlooking impact, while impact evaluators focus on outcomes but may neglect the underlying technology.

The Center for Global Development convened 30 experts across computer science, economics, gender studies, and development to bridge this gap. Experts aligned on a standard set of evaluation practices captured in the Generative AI Evaluation Playbook, released in April 2026. The practices are organized into four levels that each address a key question:

  • Level 1 – Model evaluation: Does the AI system perform as intended?
  • Level 2 – Product evaluation: Does the overall product engage and retain users?
  • Level 3 – User evaluation: Does the product change users’ thoughts, feelings, and behavior in ways that support the development outcome?
  • Level 4 – Impact evaluation: Does the product improve development outcomes?

An accompanying policy brief summarizes the Playbook’s core concepts: what to evaluate at each level, why it matters, who should do it, and how. It also outlines Minimum Viable Evaluations (MVEs), the most basic set of practices organizations should pursue at each level.

Graphic showing arrows between the four levels of evaluation

Source: Generative AI Evaluation Playbook

How to use the Playbook

Builders of GenAI tools: The primary audience includes development practitioners, engineers, product managers, data scientists, behavioral researchers, and impact evaluators. They can use the Playbook to plan evaluations and return to it to check their progress.

Funders and policymakers: While they may not use it daily, they can reference the Playbook to define what a high-quality evaluation looks like and to encourage grantees to follow its practices.

The Playbook is a living document, updated as new practices emerge. Builders of GenAI tools are encouraged to contribute amendments and case studies. Beyond detailed implementation guidance for each level, the Playbook discusses cross-cutting themes such as risk mitigation.


CITATION

Sheng Chia, Han, and Tim Ohlenburg. 2026. Generative AI Evaluation Playbook: Policy Brief. Center for Global Development.

DISCLAIMER & PERMISSIONS

CGD's publications reflect the views of the authors, drawing on prior research and experience in their areas of expertise. CGD is a nonpartisan, independent organization and does not take institutional positions. You may use and disseminate CGD's publications under these conditions.


Thumbnail image by: C. De Bode/CGIAR