The Randomistas: RCTs, External Validity, and the Ethics of Field Experiments
Randomized controlled trials transformed development economics and won Nobel Prizes. Here's how they work, what they can tell us, what they can't, and the uncomfortable ethical questions they raise.
The Revolution in How Economists Know Things
For most of its history, economics was a discipline that argued. Economists built models, derived implications, and debated whether the models were right. Empirical work existed, of course — economists estimated equations, ran regressions, and tested hypotheses — but the relationship between theory and data was often strained. The models were too simple to match the data, the data were too messy to test the models, and the profession spent as much time arguing about econometric methods as about economic substance.
Then, beginning in the 1990s and accelerating through the 2000s, something changed. A new generation of economists — trained in modern statistical methods, skeptical of grand theoretical claims, and obsessed with the question “how do you know?” — began to transform the discipline’s standards of evidence. They demanded causal identification: clear, convincing evidence that X actually causes Y, not just that X and Y happen to move together. And they found, in the randomized controlled trial and in a toolkit of quasi-experimental methods, a way to get it.
This transformation is sometimes called the credibility revolution, and it has reshaped economics in ways that are still unfolding. Two Nobel Prizes — to Abhijit Banerjee, Esther Duflo, and Michael Kremer in 2019 for their experimental approach to alleviating global poverty, and to Joshua Angrist and Guido Imbens in 2021 (shared with David Card) for their methodological contributions to the analysis of causal relationships — have cemented its place in the discipline’s self-understanding. But the revolution has also provoked a fierce backlash, raising questions about external validity, publication bias, ethical boundaries, and the relationship between evidence and understanding.
Before the Revolution: The Problem of Identification
To understand why the credibility revolution matters, you need to understand the problem it was trying to solve.
Economics is, fundamentally, about cause and effect. Does education increase earnings? Do higher minimum wages reduce employment? Does foreign aid promote growth? Does microfinance alleviate poverty? These are causal questions: they ask whether changing one thing (education, wages, aid, credit) produces a change in another (earnings, employment, growth, poverty).
The problem is that observing a correlation between X and Y does not establish that X causes Y. People with more education earn more, but is that because education makes them more productive, or because more talented and motivated people get more education (and would earn more anyway)? Countries that receive more foreign aid have mixed growth records, but is that because aid is ineffective, or because aid goes disproportionately to countries that are struggling (and would grow slowly anyway)? The endogeneity problem — the fact that the “cause” and the “effect” are jointly determined by unobserved factors — is the central challenge of empirical economics.
Before the credibility revolution, economists addressed this problem primarily through structural modeling: building a theoretical model of the economic process, estimating its parameters using econometric methods, and using the estimated model to simulate counterfactuals. This approach was powerful in principle but fragile in practice. The results depended heavily on the model’s assumptions, and different assumptions could produce different — sometimes contradictory — conclusions. Critics argued that structural models were “incredible” (literally: not credible) because their identifying assumptions could not be tested.
The Randomized Controlled Trial
The gold standard for causal identification is the randomized controlled trial (RCT), borrowed from medicine and adapted for economics. The logic is simple and powerful.
Suppose you want to know whether providing free textbooks improves students’ test scores in Kenyan schools. You could compare schools that have textbooks with schools that do not, but any difference in test scores might be due to other factors — schools with textbooks might be in wealthier areas, with better-educated parents and better-funded teachers. To isolate the effect of textbooks, you randomly assign some schools to receive textbooks (the “treatment group”) and others to continue without them (the “control group”). Because the assignment is random, the two groups are, on average, identical in all observed and unobserved characteristics. Any difference in test scores between the two groups can be attributed to the textbooks, because there is no other systematic difference between the groups.
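To see the logic in miniature, here is a sketch in Python. Every number in it is invented for illustration (200 schools, a two-point true effect, arbitrary noise); the point is only that, under random assignment, a simple difference in group means recovers the true effect without any model of what else drives test scores.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 200 schools, half randomly assigned to receive
# textbooks. The effect size and noise levels are invented for illustration.
n_schools = 200
true_effect = 2.0

treated = rng.permutation(n_schools) < n_schools // 2  # random assignment
baseline = rng.normal(50, 10, n_schools)               # underlying school quality
scores = baseline + true_effect * treated + rng.normal(0, 5, n_schools)

# Because assignment is random, treated and control schools have the same
# baseline quality on average, so the difference in mean scores is an
# unbiased estimate of the effect of textbooks.
effect = scores[treated].mean() - scores[~treated].mean()
se = np.sqrt(scores[treated].var(ddof=1) / treated.sum()
             + scores[~treated].var(ddof=1) / (~treated).sum())
print(f"estimated effect: {effect:.2f} (standard error {se:.2f})")
```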
This is exactly what Michael Kremer and his collaborators did in a series of influential studies in the late 1990s and 2000s. Their findings were surprising: free textbooks did not significantly improve test scores for the average student; only the strongest pupils, who could already read English, benefited. The textbooks were written in English, a language many students could not read well enough to learn from. This finding — counterintuitive, policy-relevant, and credible because of the randomized design — exemplified the power of the experimental approach.
Banerjee and Duflo, working through the Abdul Latif Jameel Poverty Action Lab (J-PAL) at MIT, expanded the RCT approach into a full-scale research program. They and their collaborators conducted hundreds of experiments on topics ranging from bed nets for malaria prevention to incentives for teacher attendance, from microfinance to immunization campaigns, from political participation to deworming. The results challenged conventional wisdom, revealed the importance of behavioral factors (small nudges could have large effects), and provided a level of causal evidence that was unprecedented in development economics.
The Quasi-Experimental Toolkit
Not everything can be randomized. You cannot randomly assign countries to adopt different trade policies, or randomly assign people to different levels of education, or randomly assign cities to different minimum wage laws. For these questions, economists have developed a set of quasi-experimental methods that exploit naturally occurring variation: accidents of policy, timing, or geography that mimic the random assignment of a true experiment.
Instrumental variables (IV): Find a variable (the “instrument”) that affects the treatment but has no direct effect on the outcome. Joshua Angrist used the Vietnam draft lottery as an instrument for military service: the lottery randomly assigned draft eligibility, which affected whether men served in the military, but the lottery number itself had no direct effect on later earnings. By comparing earnings across lottery numbers, Angrist could estimate the causal effect of military service on earnings.
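A toy simulation, again with invented magnitudes, shows why the lottery helps. The naive comparison of veterans and non-veterans is contaminated by the unobserved traits that drive enlistment; the Wald estimator, which scales the lottery’s effect on earnings by its effect on service, is not.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Hypothetical draft-lottery world; all magnitudes are invented.
ability = rng.normal(0, 1, n)          # unobserved confounder
eligible = rng.integers(0, 2, n)       # Z: randomized draft eligibility
# Service depends on eligibility and (negatively) on ability, so it is
# endogenous: a naive comparison conflates service with low ability.
served = (0.1 + 0.5 * eligible - 0.2 * (ability > 0)) > rng.random(n)
true_effect = -2.0                     # true effect of service on earnings
earnings = 30 + 5 * ability + true_effect * served + rng.normal(0, 3, n)

naive = earnings[served].mean() - earnings[~served].mean()

# Wald/IV estimate: the lottery's effect on earnings, scaled by its
# effect on the probability of serving.
z1, z0 = eligible == 1, eligible == 0
wald = ((earnings[z1].mean() - earnings[z0].mean())
        / (served[z1].mean() - served[z0].mean()))
print(f"naive: {naive:.2f}  IV: {wald:.2f}  truth: {true_effect}")
```

In this setup the naive estimate is badly biased (roughly -3.5) while the IV estimate lands near the truth (-2), because the lottery affects earnings only through service.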
Regression discontinuity (RD): Exploit a cutoff rule that creates a sharp boundary between treated and untreated groups. If students with a test score above 80 receive a scholarship and students below 80 do not, then students scoring 79 and 81 are nearly identical in all respects — except that one group got the scholarship. By comparing outcomes just above and just below the cutoff, you can estimate the causal effect of the scholarship.
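In the scholarship example, the quantity being estimated is the jump in expected outcomes at the cutoff; one standard way to write it is below, with the estimate in practice coming from separate regressions fit on each side of the cutoff within a narrow bandwidth:

$$
\tau_{RD} \;=\; \lim_{x \downarrow 80} \mathbb{E}[\,Y \mid \text{score} = x\,] \;-\; \lim_{x \uparrow 80} \mathbb{E}[\,Y \mid \text{score} = x\,]
$$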
Difference-in-differences (DiD): Compare the change over time in an outcome between a group affected by a policy change and a group not affected. David Card and Alan Krueger’s famous minimum wage study compared employment in fast-food restaurants in New Jersey (where the minimum wage increased) with employment in nearby Pennsylvania (where it did not), before and after the increase. The difference in the change in employment between the two states provided an estimate of the minimum wage’s effect.
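In its simplest form the estimator is literally a difference of differences in mean employment, and its validity rests on the parallel-trends assumption: that New Jersey’s employment would have followed Pennsylvania’s trend had the minimum wage not changed.

$$
\hat{\tau}_{DiD} \;=\; \left(\bar{Y}_{\text{NJ, after}} - \bar{Y}_{\text{NJ, before}}\right) \;-\; \left(\bar{Y}_{\text{PA, after}} - \bar{Y}_{\text{PA, before}}\right)
$$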
These methods, along with others (synthetic control, event studies, bunching), form the core toolkit of the credibility revolution. They share a common logic: identify a source of variation that is plausibly exogenous (not caused by the outcome you are studying) and use it to isolate the causal effect of interest. Angrist and Imbens developed the theoretical framework — the local average treatment effect (LATE) — that specifies exactly which causal effect an instrumental-variables design identifies, for whom, and under what assumptions.
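For a binary instrument Z, a binary treatment D, and an outcome Y, the Imbens-Angrist result says that the IV ratio identifies the average effect for compliers, the units whose treatment status is moved by the instrument, provided the instrument is as good as randomly assigned, affects the outcome only through the treatment, and pushes no one out of treatment (monotonicity):

$$
\text{LATE} \;=\; \mathbb{E}[\,Y(1) - Y(0) \mid \text{complier}\,] \;=\; \frac{\mathbb{E}[Y \mid Z = 1] - \mathbb{E}[Y \mid Z = 0]}{\mathbb{E}[D \mid Z = 1] - \mathbb{E}[D \mid Z = 0]}
$$

In the draft-lottery example, this is the effect of military service on the earnings of men who served because their number came up, not the effect for volunteers or for those who would have avoided service regardless.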
The Promise: Clean Causal Identification
The appeal of the experimental and quasi-experimental approach is that it provides clean causal identification — answers to the question “does X cause Y?” that do not depend on untestable theoretical assumptions. A well-designed RCT identifies the causal effect of the treatment under minimal assumptions (essentially, that the randomization worked). A well-designed quasi-experiment identifies a causal effect under assumptions that are specific, testable (at least in part), and transparent.
This is a genuine achievement. Before the credibility revolution, much of empirical economics was a muddle of correlations, questionable instruments, and structural assumptions that were impossible to verify. The new methods raised the bar for what counted as credible evidence and produced a body of findings that policymakers could use with greater confidence.
The impact on development policy has been particularly significant. J-PAL and its sibling organization, Innovations for Poverty Action (IPA), have influenced policies affecting hundreds of millions of people, from deworming programs in schools to the distribution of insecticide-treated bed nets to the design of government transfer programs. The evidence from RCTs has, in many cases, contradicted both expert opinion and common sense, revealing that interventions assumed to be effective were not, and that simple, cheap interventions could have surprisingly large effects.
The Critique: External Validity
The most serious critique of the experimental approach concerns external validity — the extent to which a result from one context can be generalized to other contexts.
An RCT tells you the effect of a treatment in a specific place, at a specific time, with a specific population, implemented by a specific organization. It tells you that providing free bed nets increased bed net usage in this set of villages in Kenya. But does the result apply to villages in India? To urban areas in Nigeria? To a government program rather than an NGO program? To a scaled-up program rather than a small pilot?
The honest answer is: not necessarily. An RCT identifies the causal effect of a treatment in the experimental context, but it provides no guarantee that the effect will be the same — or even similar — in a different context. The effect of a deworming program depends on the prevalence of worms, the quality of local health infrastructure, the behavior of teachers and parents, and countless other context-specific factors. The effect of a cash transfer depends on local prices, market structures, social norms, and political institutions. These factors vary enormously across contexts, and an RCT, by design, does not identify which factors drive the results and which are irrelevant.
This is the problem of external validity, and it is not a minor quibble. If the results of an RCT cannot be generalized, then the policy implications of the RCT are limited to the specific context in which it was conducted. This is useful — local evidence is better than no evidence — but it falls far short of the universal, context-free knowledge that the experimental approach sometimes seems to promise.
Site selection bias exacerbates the problem. RCTs are not conducted in randomly selected locations. They are conducted where there are research partnerships, willing governments, cooperative NGOs, and motivated researchers. These are not representative of the places where the results will eventually be applied. A program that works in a well-managed pilot site may fail when implemented by a dysfunctional bureaucracy at national scale.
Publication bias creates a further distortion. Studies that find significant effects are more likely to be published than studies that find null results. This means that the published literature overestimates the average effect of interventions, because the studies that found no effect are sitting in file drawers, unpublished and invisible.
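A file-drawer simulation, with every number invented for illustration, makes the distortion concrete: when only estimates that are statistically significant at the 5% level get published, the published average can be several times the true effect.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical literature: 1,000 studies of a small true effect, each
# producing a noisy estimate. All numbers are invented for illustration.
true_effect, se, n_studies = 0.1, 0.15, 1000
estimates = rng.normal(true_effect, se, n_studies)

# Suppose only results significant at the 5% level (|t| > 1.96) are published.
published = estimates[np.abs(estimates / se) > 1.96]

print(f"true effect:            {true_effect:.2f}")
print(f"mean of all estimates:  {estimates.mean():.2f}")
print(f"mean of published only: {published.mean():.2f}")
print(f"share published:        {len(published) / n_studies:.0%}")
```

Real publication filters are softer than a hard significance threshold, but the direction of the bias is the same: the file drawer truncates the distribution of results, and the survivors overstate the effect.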
The Structural Critique: Understanding vs. Measurement
A deeper critique, advanced most forcefully by Angus Deaton (himself a Nobel laureate), is that the experimental approach prioritizes measurement over understanding.
An RCT tells you whether a treatment works, but it does not tell you why it works. It does not identify the mechanism — the causal pathway through which the treatment produces its effect. And without understanding the mechanism, you cannot predict whether the treatment will work in a different context, or how to modify it for different circumstances, or what the long-run effects will be.
Deaton argues that “randomization does not help with this problem — randomization is not an epistemological panacea.” The credibility of an RCT comes from the internal validity provided by randomization, but internal validity is not enough. To make policy, you need to understand the economic process — the structure of the problem — not just the average treatment effect.
Martin Ravallion, a development economist at Georgetown, has raised similar concerns about what he calls “randomization mania” — the tendency to treat RCTs as the only credible form of evidence and to dismiss other forms of research (case studies, structural models, historical analysis, qualitative research) as inferior. Ravallion argues that this hierarchy of evidence is misguided: different questions require different methods, and the fetishization of RCTs has distorted the research agenda by channeling resources toward questions that can be studied experimentally and away from questions that cannot.
The Ethics of Field Experiments
RCTs in economics raise uncomfortable ethical questions that are not always confronted directly.
The most basic question is: is it ethical to randomize? In a medical trial, randomizing patients to a placebo group is justified by genuine uncertainty about whether the treatment works (the principle of “equipoise”). But in some economic experiments, the treatment is something — clean water, deworming pills, a cash transfer — that is believed to be beneficial. Randomizing means deliberately withholding a beneficial treatment from the control group. Is this ethical?
The standard defense is that, without the experiment, the treatment would not be available to anyone, because the evidence needed to justify large-scale implementation does not yet exist. The control group is not worse off than it would be without the experiment; it simply does not receive the additional benefit of the treatment. This defense is generally accepted, but it becomes strained when the treatment is a basic necessity (clean water, essential medicine) and the experimental period is long.
There are also concerns about informed consent. In medical trials, patients must be informed about the trial and consent to participate. In economic experiments, the “patients” are often communities, and the “treatment” is a policy intervention. Can a village chief consent on behalf of all villagers? Are the people affected by the experiment truly aware that they are part of a research study? The ethical standards for field experiments in economics have improved significantly in recent years, with the establishment of institutional review boards and ethical guidelines. But the standards are still less developed than in medicine, and the power dynamics between wealthy researchers and poor research subjects create risks of exploitation that must be taken seriously.
A more subtle concern involves what gets studied. The experimental approach works best for interventions that are discrete, well-defined, and implementable at the local level: bed nets, textbooks, cash transfers, microfinance, nudges. It works poorly for questions about systemic change: industrial policy, trade liberalization, institutional reform, macroeconomic management. The risk is that the prestige of the experimental method channels research attention toward small, incremental interventions and away from the structural changes that might have the largest effects on poverty and development.
Lant Pritchett has articulated this concern sharply, arguing that the experimental approach has produced “kinky development”: a field obsessed with small-scale interventions that can be rigorously evaluated, while ignoring the large-scale processes (economic growth, institutional development, state capacity) that have historically been the main drivers of poverty reduction. “The best RCTs in the world,” Pritchett argues, “cannot tell you how to make a country rich.”
The Middle Ground
The most productive response to the debate over RCTs is neither uncritical enthusiasm nor wholesale rejection, but a recognition that RCTs are one tool among many — powerful for certain questions, inadequate for others, and always in need of interpretation and contextualization.
RCTs are excellent for answering focused causal questions: does this specific intervention, in this specific context, produce this specific outcome? They are less useful for answering the broader questions that motivate development economics: what drives economic growth, how institutions form and change, and why some countries are rich and others poor.
For these broader questions, other methods — structural models, historical analysis, cross-country comparisons, qualitative research, natural experiments — remain essential. The credibility revolution has raised the bar for causal evidence, and this is a genuine contribution. But it has not made other forms of knowledge obsolete. An understanding of economic development requires both rigorous measurement and theoretical understanding, both local evidence and structural analysis, both the precision of the experiment and the breadth of the historical narrative.
The most influential economists of the current generation understand this. Duflo has written about the limitations of the experimental approach and the need for theory to guide experimental design. Angrist has emphasized the importance of institutional knowledge and context. Even the most committed experimentalists recognize that an RCT is the beginning of the conversation, not the end.
The lesson for the intelligent reader is to be appropriately skeptical — not of RCTs as a method, but of the claim that any single method can provide all the answers. When someone tells you that an intervention “works” because an RCT says so, ask: works where? For whom? At what scale? Through what mechanism? And what about the interventions that cannot be randomized — the trade policies, the institutional reforms, the political changes — that may matter far more for the lives of the poor than any bed net or textbook?
The credibility revolution has made economics more honest about what it knows. The next step is to be equally honest about what it does not know — and cannot know — through experiments alone.