Commentary

Does Economics Have a Replication Crisis?

Psychology's reckoning with irreproducible results sent shockwaves through science. Economics faces similar problems — p-hacking, publication bias, and results that vanish when tested by independent teams — but the profession has been slower to confront them.

Reckonomics Editorial

The Psychology Precedent

In 2015, the Open Science Collaboration published a landmark study attempting to replicate 100 results from top psychology journals. Only 36% of the replications produced statistically significant results in the same direction as the originals. The effect sizes of the replications were, on average, half those of the original studies. Psychology’s replication crisis had arrived, and it sent a clear message to every empirical discipline: if you have not checked whether your results hold up, you probably should.

Economics, a discipline that increasingly relies on empirical methods and makes claims that directly influence public policy, has been slower to undertake this reckoning. But the evidence that has accumulated suggests the problems are real, if somewhat different in character.

The Evidence

Several major replication efforts have tested economics research. In 2016, Colin Camerer and colleagues attempted to replicate 18 laboratory experiments published in the American Economic Review and the Quarterly Journal of Economics between 2011 and 2014. Eleven of the 18 — 61% — replicated. This was better than psychology’s 36%, but it meant that nearly four in ten published experimental results did not hold up.

A larger effort by the Institute for Replication, launched in 2021, has systematically replicated studies across economics subfields. Their findings vary by area but consistently show that a substantial minority of published results — typically 20-40% — fail to replicate or show substantially smaller effects than originally reported.

The problem is not limited to laboratory experiments. In applied work that relies on observational data and quasi-experimental methods — the workhorse of the discipline — the challenges are arguably more severe, though harder to test directly. Researchers make dozens of decisions in every study: which variables to include, how to define the sample, which specification to use, how to handle outliers, which time period to study. Each decision is a degree of freedom, and the cumulative effect of these choices can determine whether a result is statistically significant.

P-Hacking and Specification Searching

“P-hacking” — running multiple specifications until one produces a statistically significant result (p < 0.05) and then reporting only that one — is the most commonly discussed form of questionable research practice. But in economics, the more insidious problem is what might be called “specification searching with plausible deniability.” Every choice a researcher makes can be justified on methodological grounds. Including a control variable is “standard practice.” Dropping an outlier is “data cleaning.” Using a different time period is “robustness analysis.” Each individual choice is defensible; the problem is that the cumulative effect of many small choices, all pushing in the direction of significance, can produce results that appear robust but are artifacts of the analytical process.
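To see how many small, individually defensible choices add up, here is a minimal sketch on simulated data (using numpy and statsmodels; the particular trimming rules and control subsets are illustrative assumptions, not taken from any paper). It fits several "defensible" specifications to data in which the true effect is zero and counts how often at least one of them clears p < 0.05.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, n_sims, n_specs = 200, 1000, 8
datasets_with_a_hit = 0

for _ in range(n_sims):
    x = rng.normal(size=n)                 # "treatment" with no true effect on y
    controls = rng.normal(size=(n, 5))
    y = controls @ (0.3 * rng.normal(size=5)) + rng.normal(size=n)

    found_significant = False
    for spec in range(n_specs):
        # Each "specification" uses a different outlier-trimming rule and a
        # different subset of controls -- all individually defensible.
        keep = np.abs(y - y.mean()) < (2.0 + 0.25 * spec) * y.std()
        X = sm.add_constant(
            np.column_stack([x[keep], controls[keep, : (spec % 5) + 1]]))
        p_value = sm.OLS(y[keep], X).fit().pvalues[1]
        found_significant = found_significant or (p_value < 0.05)
    datasets_with_a_hit += found_significant

# With a true effect of zero, the nominal false-positive rate per test is 5%,
# but the chance that at least one specification "works" is noticeably higher.
print(f"share of null datasets with a significant spec: "
      f"{datasets_with_a_hit / n_sims:.2f}")
```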

Brodeur, Cook, and Heyes (2020) analyzed the distribution of test statistics in economics papers and found a suspicious bunching of results just above conventional significance thresholds — exactly what you would expect if researchers were selecting specifications that produced significant results. The bunching was most pronounced in papers using observational data and least pronounced in randomized controlled trials, where researcher degrees of freedom are more constrained.
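A crude version of this kind of diagnostic can be sketched as a caliper comparison: count reported z-statistics just below and just above the 1.96 threshold and check for an excess on the significant side. The snippet below is a minimal illustration with made-up z-statistics, not the authors' actual procedure or data.

```python
import numpy as np

def caliper_counts(z_stats, critical=1.96, width=0.10):
    """Count |z| statistics just below vs. just above the critical value."""
    z = np.abs(np.asarray(z_stats, dtype=float))
    just_below = int(np.sum((z >= critical - width) & (z < critical)))
    just_above = int(np.sum((z >= critical) & (z < critical + width)))
    # Under a smooth distribution of estimates, the two counts should be
    # similar; a large excess just above the threshold suggests selection.
    return just_below, just_above

below, above = caliper_counts([1.70, 1.88, 1.97, 2.01, 2.03, 2.05, 2.40, 3.10])
print(f"just below 1.96: {below}, just above 1.96: {above}")
```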

Publication Bias

The problem extends beyond individual researchers to the institutions that evaluate and publish their work. Journals prefer novel, surprising, statistically significant results. Null results — finding that an intervention had no effect or that two variables are unrelated — are difficult to publish. This creates a systematic distortion in the published literature: the studies that reach print are disproportionately likely to show effects, and the effect sizes are disproportionately likely to be inflated.

A 2019 study by Andrews and Kasy estimated that results significant at the 5% level were 15 to 30 times more likely to be published than null results in top economics journals. This means that the published literature is not a representative sample of all research conducted — it is a filtered sample, biased toward positive findings.
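A small simulation makes the selection mechanism concrete. In the sketch below (illustrative numbers, not the Andrews and Kasy estimates), significant estimates are always published while null estimates are published only one time in thirty; the mean of the published estimates then overstates the true effect.

```python
import numpy as np

rng = np.random.default_rng(1)
true_effect, se, n_studies = 0.10, 0.10, 20000

estimates = rng.normal(true_effect, se, size=n_studies)   # one estimate per study
significant = np.abs(estimates / se) > 1.96

# Publish every significant result; publish roughly one null result in thirty.
publish_prob = np.where(significant, 1.0, 1.0 / 30.0)
published = rng.random(n_studies) < publish_prob

print(f"true effect:               {true_effect:.3f}")
print(f"mean of all estimates:     {estimates.mean():.3f}")
print(f"mean of published results: {estimates[published].mean():.3f}")  # inflated
```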

The policy consequences are real. If the published studies on a question find that minimum wage increases reduce employment, while the null results finding no effect sit in file drawers, policymakers are working with a distorted evidence base.

What Economics Has Done About It

To its credit, the economics profession has taken some steps. Pre-registration — publicly committing to an analysis plan before examining the data — has gained traction, particularly for randomized controlled trials. The American Economic Association created a registry for randomized trials in 2013. Several journals have introduced “registered reports,” where papers are accepted based on their design before results are known.

Data and code sharing requirements have become more common. The American Economic Review now requires authors to post replication files. This does not prevent questionable practices, but it enables independent verification.

Some economists have called for moving away from the p < 0.05 threshold entirely, either by adopting a stricter threshold (p < 0.005) or by shifting toward Bayesian methods that provide a more nuanced measure of evidence strength.

What Remains Undone

Despite these reforms, the incentive structure of academic economics still rewards novelty over reliability. Tenure, promotion, and prestige depend on publishing in top journals, which continue to favor surprising results. Replication studies — the bread and butter of credibility — remain difficult to publish and unrewarding for the researchers who conduct them. A young economist who spends two years replicating someone else’s work, even if the result is important, will have a thinner CV than one who produces novel findings.

The problem is compounded by the complexity of modern empirical methods. Difference-in-differences, regression discontinuity, instrumental variables, and synthetic control methods all involve subjective choices that can influence results. Unlike laboratory experiments, where replication means running the same procedure on a new sample, replicating an observational study often means making different defensible choices and seeing whether the result survives. The answer is frequently ambiguous.

Reproducibility is not the same as replication (and both matter)

A useful vocabulary distinction—borrowed from other sciences and now common in meta-research—separates reproducibility (can another team run your code on your data and get your numbers?) from replication (does the claim about the world stand up in a new sample, with independent analysis?). Economics has improved the first: replication packages in the American Economic Association's data and code repository and the work of journal data editors have made computational errors and code typos easier to catch. The second is harder: a policy-relevant effect of a minimum-wage change or a land-titling reform may vary by place and time, and the original significance star may reflect a local draw more than a universal law.

Observational work and the rise of the pre-analysis plan

Pre-registration spread quickly in randomized trials, where there is a clean before-and-after split between design and analysis. The frontier now is the pre-analysis plan (PAP) in quasi-experimental work: the authors commit in advance to a main specification, a set of pre-specified robustness checks, and rules for outlier treatment and subsamples. The goal is not to forbid all exploration; it is to separate confirmatory from exploratory claims in the reader’s mind and in meta-analyses that synthesize evidence.

Even with a PAP, real-world institutional knowledge can force sensible deviations (a sudden law change, a revised national-accounts benchmark). The humane standard is documentation: when analyst degrees of freedom collide with reality, say so transparently enough for a replicator to follow the fork in the road.

Heterogeneous effects, interactions, and the garden of forking paths

Many published effects are average treatment effects that hide large dispersion in the underlying parameters. Specification search interacts badly with heterogeneity: subgroup splits multiply the opportunity to find a flattering slice. Honest workflows pre-specify which interactions are theoretically motivated, use multiple-testing corrections when appropriate, and report a full distribution of estimates where possible—not just the one line in an abstract that “survives” a p-threshold.
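As a concrete illustration, the sketch below applies a Holm correction to a set of hypothetical subgroup p-values using statsmodels' multipletests helper; the subgroup names and p-values are invented for the example.

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values from one pre-specified test plus five subgroup cuts.
subgroup_pvalues = {
    "full sample": 0.030,
    "women":       0.012,
    "men":         0.200,
    "under 30":    0.048,
    "over 30":     0.350,
    "urban":       0.044,
}

reject, p_adjusted, _, _ = multipletests(
    list(subgroup_pvalues.values()), alpha=0.05, method="holm")

for (name, p_raw), ok, p_adj in zip(subgroup_pvalues.items(), reject, p_adjusted):
    print(f"{name:12s} raw p = {p_raw:.3f}   adjusted p = {p_adj:.3f}   "
          f"significant after correction: {ok}")
```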

International replication efforts and what they teach us

Replication is not just a U.S. journal problem. Development and labor economics have seen coordinated reproducibility audits and replication challenges with international teams working on the same canonical papers. Findings vary by field: field experiments in labor often hold up well; some structural exercises are hard to replicate without proprietary data; and some reduced-form papers are fragile to small window changes in panel data. The lesson is less a gloating “gotcha” and more a humble map of where our evidence base is thick versus thin.

Open code, software environments, and the durability of packages

A replication package that fails because a Stata package vanished or a Python version pin was missing is a reproducibility failure in a mundane but important sense. Containers, lockfiles, and archived software environments are unglamorous infrastructure that makes science durable enough to dispute a decade later, when the author may no longer answer email.
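Even without full containers, a lightweight step in that direction is to record the exact package versions behind an analysis. The snippet below is one minimal way to do that in Python; the package list and output filename are arbitrary choices for the example.

```python
# Record the exact versions of the packages an analysis relied on, so a
# replicator can rebuild a comparable environment later.
import importlib.metadata as md

packages = ["numpy", "pandas", "statsmodels", "scikit-learn"]

with open("environment_versions.txt", "w") as f:
    for pkg in packages:
        try:
            f.write(f"{pkg}=={md.version(pkg)}\n")
        except md.PackageNotFoundError:
            f.write(f"# {pkg}: not installed in this environment\n")
```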

Machine learning, high-dimensional controls, and new researcher degrees of freedom

As machine-learning methods enter causal pipelines—double LASSO for confounder selection, causal forests for heterogeneous effects—new choices appear: penalty tuning, how folds are split in cross-fitting, and more. The promise is reduced bias; the risk is a different kind of forking path if the procedure is massaged until a clean picture emerges.
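The sketch below shows the flavor of cross-fitted "partialling out" with lasso-learned nuisance functions on simulated data. It is a simplified illustration in the spirit of double/debiased machine learning, not a full implementation (no standard errors, no repeated splitting), and the explicit seed is there precisely because the fold split is itself a reportable choice.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.model_selection import KFold

def cross_fit_effect(y, d, X, n_folds=5, seed=0):
    """Cross-fitted partialling-out estimate of the effect of d on y."""
    y_resid = np.zeros_like(y, dtype=float)
    d_resid = np.zeros_like(d, dtype=float)
    for train, test in KFold(n_folds, shuffle=True, random_state=seed).split(X):
        # Fit the nuisance predictions on the training folds only ...
        y_hat = LassoCV(cv=3).fit(X[train], y[train]).predict(X[test])
        d_hat = LassoCV(cv=3).fit(X[train], d[train]).predict(X[test])
        # ... and residualize on the held-out fold (cross-fitting).
        y_resid[test] = y[test] - y_hat
        d_resid[test] = d[test] - d_hat
    # Final stage: regress residualized outcome on residualized treatment.
    return LinearRegression().fit(d_resid.reshape(-1, 1), y_resid).coef_[0]

rng = np.random.default_rng(0)
n, p = 500, 50
X = rng.normal(size=(n, p))
d = X[:, 0] + rng.normal(size=n)             # treatment depends on a confounder
y = 0.5 * d + X[:, 0] + rng.normal(size=n)   # true effect of d on y is 0.5

print(f"cross-fitted effect estimate: {cross_fit_effect(y, d, X):.2f}")
```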

Meta-science, incentives, and what hiring committees can do

Top journals and granting agencies have begun to value replication as a contribution type—still unevenly. A healthier incentive set would treat a credible replication or a valuable reproduction-package review as first-class output for junior scholars, not a hobby for the public-spirited few.

How a careful reader (or a policymaker) should weigh one paper

A single estimating equation is rarely a license to move policy by itself. Readers should ask: Is the identification as clear as the abstract implies? Is the mechanism plausible? Do multiple independent designs in other places point the same way? And what would falsify the result? “Replication crisis” talk can sound nihilistic; the constructive reading is Socratic humility about certainty combined with institutional reforms that make honest uncertainty cheaper to express in print.

Systematic reviews, meta-analyses, and the ethics of “many studies”

Policymakers and journalists are often told that “a systematic review of dozens of studies” supports a position. Systematic review is a powerful tool; it is not a spell that removes bias. The same publication filters that skew individual papers also skew what enters the meta-analytic window: a literature dominated by significant estimates will tend to overstate average effects unless selection models and P-curve-style diagnostics are used carefully. Equally, true heterogeneity can make the average of published effects economically meaningless for a specific population or time period, even if every study was executed in good faith.

Prospective meta-analysis and data carpentry in multi-site RCTs (pre-specifying which sites join, and how site-level results will be combined) are therefore part of the open-science toolkit in economics, not a luxury. They align incentives so that a single “surprising” result does not crowd out a pattern of small but credible effects.
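Pre-specifying the combination rule can be as simple as committing to inverse-variance weights before any site reports results. The sketch below pools hypothetical site-level estimates with fixed-effect weights; the numbers are invented for illustration.

```python
import numpy as np

# Hypothetical site-level estimates and standard errors from a multi-site trial.
site_estimates = np.array([0.12, 0.05, 0.20, -0.02])
site_std_errors = np.array([0.06, 0.04, 0.10, 0.05])

# Fixed-effect (inverse-variance) pooling, committed to before results arrive.
weights = 1.0 / site_std_errors**2
pooled = np.sum(weights * site_estimates) / np.sum(weights)
pooled_se = np.sqrt(1.0 / np.sum(weights))

print(f"pooled effect: {pooled:.3f} (standard error {pooled_se:.3f})")
```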

The classroom: teaching uncertainty without teaching cynicism

PhD and undergraduate training both face a pedagogical tension: students need the ladder of simple models to learn structure, but they also need a literate skepticism about “published results” in the wild. A practical compromise is to teach the replication agenda as part of methods from year one: show how a headline coefficient can move when you winsorize differently; show how a placebo test should look when identification is real; and assign reproduction of a famous paper with public data as a semester milestone—not to embarrass the original authors, but to convey the craft of honest empirics.
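A classroom exercise along those lines can be a few lines of code: re-run the same regression under different winsorization cutoffs on simulated, heavy-tailed data and watch the coefficient move. The sketch below assumes numpy and statsmodels and is purely illustrative.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 300
x = rng.normal(size=n)
y = 0.2 * x + rng.standard_t(df=2, size=n)   # heavy-tailed noise creates outliers

for pct in (100, 99, 95, 90):
    lo, hi = np.percentile(y, [100 - pct, pct])
    y_w = np.clip(y, lo, hi)                 # winsorize both tails of the outcome
    beta = sm.OLS(y_w, sm.add_constant(x)).fit().params[1]
    print(f"winsorized at the {pct}th percentile: beta = {beta:.3f}")
```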

The contrast with other disciplines (and the humility in borrowing)

Biomedical research had its own replication reckoning, with preregistration and multi-site trials at scale. Economics cannot import the entire clinical apparatus—humans cannot be randomized the same way across all macro questions—but it can borrow norms about primary versus secondary endpoints, about stopping rules for exploratory mining of a panel, and about separating confirmatory grant-funded work from speculation in research proposals without pretending the line is always sharp.

A closing thought

Economics may not have a replication crisis as dramatic as psychology’s, but it has a credibility problem that is slowly being acknowledged. The stakes are high: economic research directly informs decisions about taxes, trade, regulation, and public spending that affect millions of lives. Getting it right — and knowing when we are getting it wrong — is not an academic luxury. It is a public responsibility.

Replication versus robustness: two different credibility exercises

Replication usually means an independent team reruns the same analysis on the same (or corrected) data and code, checking whether the published estimates and significance statements survive. Robustness means varying defensible choices—sample windows, control sets, clustering, functional forms—to see whether a claim is a narrow knife-edge or a sturdy pattern. Both matter, but they answer different anxieties. Replication catches outright errors, lost files, and p-hacked paths that never should have survived review; robustness speaks to whether a result is interesting only under a very particular garden of forking paths. A field that excels at robustness checks but weakly incentivizes replication can still accumulate a distorted public portfolio of “known facts.”
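A specification-curve-style exercise makes the distinction operational: enumerate a grid of defensible choices and report the whole distribution of estimates rather than one preferred line. The sketch below does this on simulated data; the particular grid of sample windows, control sets, and trimming rules is an illustrative assumption.

```python
import itertools
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 400
year = rng.integers(2000, 2020, size=n)
controls = rng.normal(size=(n, 3))
x = rng.normal(size=n) + 0.3 * controls[:, 0]
y = 0.15 * x + controls @ np.array([0.4, 0.2, 0.0]) + rng.normal(size=n)

estimates = []
for start_year, n_controls, trim in itertools.product(
        (2000, 2005, 2010), (0, 1, 2, 3), (None, 2.5)):
    keep = year >= start_year                      # sample-window choice
    if trim is not None:                           # optional outlier rule
        keep &= np.abs(y - y.mean()) < trim * y.std()
    X = sm.add_constant(
        np.column_stack([x[keep], controls[keep, :n_controls]]))
    estimates.append(sm.OLS(y[keep], X).fit().params[1])

estimates = np.array(estimates)
print(f"{estimates.size} specifications -> min {estimates.min():.3f}, "
      f"median {np.median(estimates):.3f}, max {estimates.max():.3f}")
```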

The many-analysts problem and structured transparency

Recent many-analysts and multi-team exercises—in which numerous independent groups estimate the same hypothesis with shared data—have illustrated how analyst degrees of freedom can scatter results even when nobody is acting in bad faith. That is not an argument against quantitative work; it is an argument for structured transparency: versioned data, pinned random seeds, explicit primary specifications, and logs of deviations. When teams pre-commit to analysis plans, the scientific gain is not mystique; it is a clearer separation between confirmatory tests and exploratory fishing that later gets oversold.

Teaching and hiring: why incentives still lag norms

Graduate training often rewards clever identification over painstaking verification. Junior scholars reasonably fear that two years spent replicating a high-profile paper will not count as “original research,” even if it prevents a harmful mistake. Until hiring committees and editors treat replication and data/code archaeology as first-class intellectual work—complete with clear standards for what counts as a successful replication attempt—the supply of verification will remain thinner than the demand from policymakers and journalists who want a simple binary: true or false.

What a careful reader can do before amplifying a headline effect

When you encounter a striking empirical claim—especially one that lands on social media—ask: Is the data public? Is there a pre-analysis plan or at least a dated replication archive? Did the authors report multiple outcomes and pre-specify which one is primary? Do results survive simple placebo windows and alternative samples that a skeptical colleague would propose in a seminar? None of these checks guarantees truth, but they separate papers that invite scrutiny from papers that merely invite retweets.

Meta-science and the long arc of cumulative economics

Replication efforts should be read in context: failure to replicate is not always proof of misconduct; sometimes it reveals heterogeneity the original paper understated, coding ambiguities, or changing institutional environments that make a reduced form unstable across decades. The deeper meta-scientific point is that economics, like other empirical sciences, progresses through iterative error correction—if the institutions of publishing and promotion make that iteration rare, the appearance of consensus can outrun the warrant for consensus. Open-science tooling (persistent repositories, containerized environments, diff-friendly workflows) lowers the technical cost of verification; the remaining bottleneck is cultural.

Looking ahead: journals, registries, and the politics of evidence

Registered reports, results-blind review stages, and journals that publish high-quality non-results are not cosmetic reforms—they re-weight what gets sampled from the universe of studies researchers actually run. They also interact awkwardly with media demand for “surprising” findings. A sustainable compromise is to treat empirical economics less like a trophy case of headline coefficients and more like an inventory of measured patterns with known fragility—closer to how engineering disciplines treat material properties under stated conditions. That shift would not end debate; it would make debate more honest about what we know, how we know it, and what would falsify it.