Causal Panel Analysis under Parallel Trends: Lessons from a Large Reanalysis Study

Albert Chiu (Stanford) · Xingchen Lan (NYU) · Ziyi Liu (Berkeley) · Yiqing Xu (Stanford)

⚡ TL;DR

A large reanalysis study of the post-2020 difference-in-differences (DiD) methodology revolution. The authors replicate 49 published political-science papers that use two-way fixed effects (TWFE) on panel data, then re-estimate each with modern heterogeneity-robust estimators (Borusyak-Jaravel-Spiess, Callaway-Sant'Anna, Sun-Abraham). The study documents how often headline results are sensitive to estimator choice, giving a systematic empirical measure of how much the methodology revolution matters in practice.

🧩 Setup & motivation

Since 2020, the DiD literature has documented that standard TWFE estimates can be biased when treatment effects are heterogeneous across cohorts or over time. The theoretical case is well established. The empirical question of how often this matters for published research remained open until this paper.

The authors collected all TWFE-based panel-data papers in four top political-science journals (APSR, AJPS, JOP, BJPS) over a 5-year window, identified the 49 that met inclusion criteria, replicated each original analysis, and re-ran it with three modern heterogeneity-robust estimators. They then compared headline coefficient magnitudes, significance levels, and overall conclusions between the original and the robust estimates.

📐 Main results

The headline finding

Of 49 papers replicated:

~30% of papers have headline coefficients that change substantially (>50% in magnitude) under a heterogeneity-robust estimator.
~15% of papers have headline coefficients that change sign or lose statistical significance.
The remaining ~55% of papers are robust to estimator choice.
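
The three categories above can be read as a simple decision rule. The sketch below is one way such a rule might be coded, not the authors' procedure: the function name `classify_sensitivity`, the 1.96 critical value, the choice to check sign and significance before magnitude, and the example numbers are all assumptions made for illustration.

```python
def classify_sensitivity(original, robust, robust_se,
                         magnitude_threshold=0.5, z_crit=1.96):
    """Label a headline result by how it changes under a robust estimator.

    original  : published TWFE point estimate (assumed nonzero and significant)
    robust    : point estimate from a heterogeneity-robust estimator
    robust_se : standard error of the robust estimate
    """
    if original == 0:
        raise ValueError("original estimate must be nonzero")
    sign_flip = original * robust < 0
    insignificant = abs(robust) < z_crit * robust_se
    # Assumption: a sign flip or lost significance takes precedence over
    # a large magnitude change, so the three labels are mutually exclusive.
    if sign_flip or insignificant:
        return "sign change or loss of significance"
    relative_change = abs(robust - original) / abs(original)
    if relative_change > magnitude_threshold:
        return "substantial magnitude change"
    return "robust"


# Hypothetical (original, robust, robust_se) triples, not taken from the paper.
examples = [(0.40, 0.38, 0.10), (0.40, 0.15, 0.05), (0.40, -0.05, 0.10)]
labels = [classify_sensitivity(*e) for e in examples]
```

Treating the categories as mutually exclusive matches the way the three shares above sum to roughly 100%.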

What predicts sensitivity?

Three factors predict whether a paper's results are sensitive:

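Staggered adoption: papers with heavily staggered treatment timing are the most sensitive, because TWFE then uses already-treated units as controls and the resulting contamination is largest.
Treatment-effect heterogeneity: when effects evolve over time, the TWFE estimate is attenuated more.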
Sample composition: papers with few never-treated units have fewer "clean" comparisons.
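
The interaction of these factors can be illustrated with a tiny noise-free panel. The sketch below is a toy example under stated assumptions, not a replication of any paper in the study: the three units, their adoption dates, the fixed effects, and the effect path that grows by one each period are all hypothetical. The "clean" estimator is a simple never-treated comparison in the spirit of Callaway-Sant'Anna, not their implementation. The TWFE coefficient is computed by two-way demeaning, which is valid here because the panel is balanced.

```python
# Illustrative, noise-free staggered-adoption panel (all numbers hypothetical).
PERIODS = range(6)
ADOPTION = {"early": 2, "late": 4, "never": None}  # first treated period
UNIT_FE = {"early": 5.0, "late": 3.0, "never": 1.0}


def treated(unit, t):
    g = ADOPTION[unit]
    return g is not None and t >= g


def effect(unit, t):
    # Dynamic effect: 1 in the adoption period, growing by 1 each period after.
    return float(t - ADOPTION[unit] + 1) if treated(unit, t) else 0.0


def outcome(unit, t):
    # Unit effect + common time trend + treatment effect (parallel trends hold).
    return UNIT_FE[unit] + 0.5 * t + effect(unit, t)


def two_way_demean(values):
    """Within transformation for a balanced panel: x_it - x_i. - x_.t + x_.."""
    units = sorted({u for u, _ in values})
    times = sorted({t for _, t in values})
    u_mean = {u: sum(values[u, t] for t in times) / len(times) for u in units}
    t_mean = {t: sum(values[u, t] for u in units) / len(units) for t in times}
    grand = sum(values.values()) / len(values)
    return {(u, t): values[u, t] - u_mean[u] - t_mean[t] + grand
            for u, t in values}


def twfe_coefficient():
    """Coefficient on D in a regression of Y on D with unit and period effects."""
    d = {(u, t): float(treated(u, t)) for u in ADOPTION for t in PERIODS}
    y = {(u, t): outcome(u, t) for u in ADOPTION for t in PERIODS}
    d_res, y_res = two_way_demean(d), two_way_demean(y)
    numerator = sum(d_res[k] * y_res[k] for k in d)
    denominator = sum(v * v for v in d_res.values())
    return numerator / denominator


def clean_comparison_att():
    """Average cohort-period effect, using only the never-treated unit as control."""
    estimates = []
    for unit, g in ADOPTION.items():
        if g is None:
            continue
        base = g - 1  # last untreated period for this cohort
        for t in PERIODS:
            if t >= g:
                change_treated = outcome(unit, t) - outcome(unit, base)
                change_control = outcome("never", t) - outcome("never", base)
                estimates.append(change_treated - change_control)
    return sum(estimates) / len(estimates)


def true_att():
    effects = [effect(u, t) for u in ADOPTION for t in PERIODS if treated(u, t)]
    return sum(effects) / len(effects)


print(f"TWFE: {twfe_coefficient():.3f}")
print(f"Clean comparison: {clean_comparison_att():.3f}")
print(f"True ATT: {true_att():.3f}")
```

In this toy panel the true average effect on the treated is 13/6 (about 2.17) and the clean comparison recovers it, while the TWFE coefficient is 1.5. The attenuation arises because the early cohort's last two treated periods, where effects are largest, receive zero weight in the TWFE estimate. Without the never-treated unit, the clean comparison would have no control for periods in which every unit is treated.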

Recommendations

The paper proposes a checklist of robustness checks: (i) a Goodman-Bacon decomposition; (ii) at least one heterogeneity-robust estimator; (iii) Rambachan-Roth sensitivity bounds; (iv) cohort-specific event-study plots. These checks have since become common practice in political-science DiD papers.

🛠️ Implications for practice

If you have a published TWFE result, run BJS (Borusyak-Jaravel-Spiess) or CS (Callaway-Sant'Anna) as a robustness check: in this sample the magnitude changed substantially in roughly 30% of papers.
Of the TWFE papers published over the 5-year window, a sizable minority do not hold up under modern estimators, but most do.
Submission norms in political science (and other fields adopting these checks) now expect at least one modern robust estimator alongside TWFE.

🧭 Where this sits in the broader DiD literature

This paper is the empirical counterpart to Goodman-Bacon (2021, Journal of Econometrics) and the broader post-2020 methodology revolution. BCCGS (2026, JEL) cite it as systematic evidence that the revolution matters in practice. Its approach is similar to Brodeur et al.'s replication audits of p-values.

📥 Read the paper

arXiv 2309.15983
APSR Cambridge
