⚡ TL;DR
The single largest reanalysis study of the post-2020 DiD methodology revolution. Replicates 49 published TWFE panel-data papers in political science using modern heterogeneity-robust estimators (Borusyak-Jaravel-Spiess, Callaway-Sant'Anna, Sun-Abraham). Documents how often headline results are sensitive to estimator choice, providing the first systematic empirical quantification of the methodology revolution's practical importance.
🧩 Setup & motivation
Since 2020, the DiD literature has documented that standard TWFE estimates can be biased when treatment effects are heterogeneous across cohorts or over time. The theoretical case is well-established. But the empirical question (how often does this matter for published research?) was open until this paper.
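To make the bias concrete, here is a minimal sketch in plain Python of a noise-free toy panel (not taken from the paper). Two hypothetical cohorts adopt treatment in periods 2 and 6, there are no never-treated units, and the effect grows by one unit per period of exposure. The fixed-effect values and the helper names `two_way_demean` and `twfe_coefficient` are illustrative assumptions.

```python
# Toy illustration (assumed data, not from the paper): TWFE under staggered
# adoption with treatment effects that grow over time.

def two_way_demean(panel):
    """Subtract unit means and period means, add back the grand mean (balanced panel)."""
    n_units, n_periods = len(panel), len(panel[0])
    unit_means = [sum(row) / n_periods for row in panel]
    period_means = [sum(panel[i][t] for i in range(n_units)) / n_units
                    for t in range(n_periods)]
    grand_mean = sum(unit_means) / n_units
    return [[panel[i][t] - unit_means[i] - period_means[t] + grand_mean
             for t in range(n_periods)] for i in range(n_units)]

def twfe_coefficient(y, d):
    """Slope on d in a regression of y on d with unit and period fixed effects."""
    y_dm, d_dm = two_way_demean(y), two_way_demean(d)
    numerator = sum(a * b for y_row, d_row in zip(y_dm, d_dm)
                    for a, b in zip(y_row, d_row))
    denominator = sum(b * b for d_row in d_dm for b in d_row)
    return numerator / denominator

N_PERIODS = 10
adoption = [2, 6]          # hypothetical early and late cohorts, one unit each
unit_fe = [1.0, 3.0]       # arbitrary unit effects, absorbed by the regression
d = [[1.0 if t >= g else 0.0 for t in range(N_PERIODS)] for g in adoption]
# Effect equals periods of exposure: 1 in the adoption period, 2 in the next, ...
tau = [[float(t - g + 1) if t >= g else 0.0 for t in range(N_PERIODS)]
       for g in adoption]
y = [[unit_fe[i] + 0.5 * t + tau[i][t] for t in range(N_PERIODS)]
     for i in range(len(adoption))]

# True ATT: average effect over all treated unit-period observations.
treated_effects = [tau[i][t] for i in range(len(adoption))
                   for t in range(N_PERIODS) if d[i][t] == 1.0]
true_att = sum(treated_effects) / len(treated_effects)
twfe = twfe_coefficient(y, d)
print(f"true ATT = {true_att:.3f}, TWFE = {twfe:.3f}")
```

In this toy panel every treated observation has a positive effect (true ATT of about 3.83), yet the TWFE coefficient is about -0.17, because the early cohort's already-treated periods act as controls for the late cohort.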
The authors collected all TWFE-based panel-data papers in the top 4 political-science journals (APSR, AJPS, JOP, BJPS) over a 5-year window, identified the 49 that met inclusion criteria, replicated the original analysis, and re-ran each with three modern heterogeneity-robust estimators. They compare headline coefficient magnitudes, significance levels, and overall conclusions.
Main results
The headline finding
Of 49 papers replicated:
- ~30% of papers have headline coefficients that change substantially (>50% in magnitude) under a heterogeneity-robust estimator.
- ~15% of papers have headline coefficients that change sign or lose statistical significance.
- The remaining ~55% of papers are robust to estimator choice.
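The three categories above can be read as a simple decision rule. The sketch below is one possible reading, not the authors' coding scheme: the `Estimate` class, the 1.96 critical value, and the choice to check sign and significance before magnitude are all assumptions. Only the 50% magnitude threshold comes from the summary above.

```python
from dataclasses import dataclass

@dataclass
class Estimate:
    """A headline coefficient and its standard error (hypothetical container)."""
    coef: float
    se: float

    def significant(self, z_crit: float = 1.96) -> bool:
        # Two-sided test at roughly the 5% level under a normal approximation.
        return abs(self.coef) > z_crit * self.se

def classify(twfe: Estimate, robust: Estimate, magnitude_threshold: float = 0.5) -> str:
    """Label how a TWFE headline result fares under a robust estimator."""
    if twfe.coef * robust.coef < 0:
        return "sign_change"
    if twfe.significant() and not robust.significant():
        return "loses_significance"
    # Relative change in magnitude; treat a zero TWFE coefficient as unbounded change.
    if twfe.coef == 0:
        rel_change = float("inf") if robust.coef != 0 else 0.0
    else:
        rel_change = abs(robust.coef - twfe.coef) / abs(twfe.coef)
    if rel_change > magnitude_threshold:
        return "magnitude_change"
    return "robust"

original = Estimate(coef=0.10, se=0.02)   # made-up numbers for illustration
print(classify(original, Estimate(0.09, 0.03)))    # close to the original
print(classify(original, Estimate(0.04, 0.03)))    # same sign, wider interval
print(classify(original, Estimate(-0.03, 0.04)))   # opposite sign
```

Checking sign and significance first means a paper is counted once, under its most consequential change. A different ordering would shift papers between the first two categories.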
What predicts sensitivity?
Three factors predict whether a paper's results are sensitive:
- Staggered adoption: papers with heavily staggered treatment timing are most sensitive, because contamination from already-treated comparison units is largest there.
- Treatment-effect heterogeneity: when effects evolve over time, the TWFE estimate is attenuated more.
- Sample composition: papers with few never-treated units have fewer "clean" comparisons.
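Two of these factors, staggered timing and sample composition, can be checked before any estimation. A rough sketch follows, assuming a hypothetical input format in which each unit maps to its first treated period, or `None` if never treated. The function name and the chosen summary statistics are illustrative and are not taken from the paper.

```python
from collections import Counter

def timing_diagnostics(adoption_by_unit):
    """Summarize treatment timing for a panel.

    `adoption_by_unit` maps unit id -> first treated period, or None for a
    never-treated unit (hypothetical input format).
    """
    n_units = len(adoption_by_unit)
    cohorts = Counter(g for g in adoption_by_unit.values() if g is not None)
    n_never = n_units - sum(cohorts.values())
    return {
        # More distinct adoption dates means more staggering.
        "n_cohorts": len(cohorts),
        # Few never-treated units means few "clean" comparisons.
        "share_never_treated": n_never / n_units,
        "largest_cohort_share": max(cohorts.values()) / n_units if cohorts else 0.0,
    }

example = {"A": 2005, "B": 2005, "C": 2008, "D": 2011, "E": None}  # made-up panel
print(timing_diagnostics(example))
```

Treatment-effect heterogeneity, the second factor, cannot be read off the timing data and has to be assessed from event-study estimates.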
Recommendations
The paper proposes a checklist of robustness checks: (i) Goodman-Bacon decomposition; (ii) at least one heterogeneity-robust estimator; (iii) Rambachan-Roth sensitivity bounds; (iv) cohort-specific event-study plots. The checklist is now standard practice in political science DiD papers.
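Item (ii) can be illustrated with a stripped-down version of the Callaway-Sant'Anna group-time ATT. For cohort g and period t >= g, it compares the cohort's outcome change since period g - 1 with the same change among never-treated units. The sketch assumes no covariates, one unit per cohort, a never-treated comparison group, and made-up noise-free data. Applied work would rely on an established package rather than this hand-rolled version.

```python
# Hand-rolled group-time ATT in the spirit of Callaway-Sant'Anna (assumptions:
# no covariates, never-treated comparison group, base period g - 1, toy data).

N_PERIODS = 10
adoption = {"early": 2, "late": 6, "never": None}   # hypothetical cohorts
unit_fe = {"early": 1.0, "late": 3.0, "never": 2.0}

def effect(g, t):
    """Assumed true effect: grows by 1 per period of exposure."""
    return float(t - g + 1) if g is not None and t >= g else 0.0

# Outcomes follow parallel trends (common slope 0.5) plus the treatment effect.
y = {u: [unit_fe[u] + 0.5 * t + effect(g, t) for t in range(N_PERIODS)]
     for u, g in adoption.items()}

def att_gt(y, cohort, g, t, control="never"):
    """ATT(g, t): cohort's change since g - 1 minus the control's change since g - 1."""
    base = g - 1
    return (y[cohort][t] - y[cohort][base]) - (y[control][t] - y[control][base])

# One estimate per treated (cohort, period) cell.
cells = {(u, t): att_gt(y, u, g, t)
         for u, g in adoption.items() if g is not None
         for t in range(g, N_PERIODS)}
overall_att = sum(cells.values()) / len(cells)   # simple average over treated cells
print(f"ATT(early, t=2) = {cells[('early', 2)]:.1f}, overall ATT = {overall_att:.3f}")
```

Because the toy data satisfy parallel trends exactly, each ATT(g, t) equals the assumed effect t - g + 1, and the overall average is about 3.83. Already-treated units never serve as controls here, which is the property that distinguishes this estimator from TWFE.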
🛠️ Implications for practice
- If you have a published TWFE result, run BJS or CS as a robustness check: by the figures above, there is roughly a 30% chance that the magnitude changes substantially.
- The 5-year accumulation of TWFE papers contains many that don't replicate under modern estimators, but most do.
- Submission norms in political science (and other fields adopting these checks) now expect at least one modern robust estimator alongside TWFE.
🧭 Where this sits in the broader DiD literature
The empirical counterpart to Goodman-Bacon (2021, Journal of Econometrics) and the post-2020 methodology revolution. Cited in BCCGS 2026 JEL as the systematic evidence that the revolution matters in practice. Methodologically similar to Brodeur et al.'s replication audits of p-values.
📥 Read the paper
- Local PDF (~1.5 MB): instant, no external request
- arXiv 2309.15983
- APSR Cambridge