Refutation: quick checks and sensitivity#
Use these tools after you estimate a causal effect (e.g., with DML/IRM) to pressure‑test your result. They cover placebo tests, overlap/positivity, score orthogonality, balance under IPW, and bias‑aware sensitivity.
Prerequisites
A CausalData object (see the Inference and CausalData guides)
An inference result dict like res = dml_ate(causal_data, store_diagnostic_data=True, …)
Tip: For ATE the DR/AIPW score is ψ(θ) = (g1 − g0) + D·(Y − g1)/m − (1−D)·(Y − g0)/(1−m) − θ. At the true θ, E[ψ(θ)] = 0, so θ̂ is the sample mean of the uncentered part: θ̂ = E_n[(g1 − g0) + D·(Y − g1)/m − (1−D)·(Y − g0)/(1−m)]. The diagnostics below check this moment condition and its ingredients.
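To make the tip concrete, here is a minimal NumPy sketch of the uncentered score and the resulting estimate, using simulated stand‑ins for the out‑of‑fold nuisances g1, g0, m (illustrative only, not library API):
Python
import numpy as np

# Simulated stand-ins: g1, g0 are out-of-fold outcome regressions,
# m is the out-of-fold propensity P(D=1|X)
rng = np.random.default_rng(0)
n = 1_000
m = rng.uniform(0.2, 0.8, n)
D = rng.binomial(1, m)
g1 = rng.normal(1.0, 1.0, n)
g0 = rng.normal(0.0, 1.0, n)
Y = np.where(D == 1, g1, g0) + rng.normal(0.0, 1.0, n)

# Uncentered DR/AIPW score; its sample mean is the ATE estimate
psi_uncentered = (g1 - g0) + D * (Y - g1) / m - (1 - D) * (Y - g0) / (1 - m)
theta_hat = psi_uncentered.mean()
psi = psi_uncentered - theta_hat        # centered score: E_n[psi] = 0 by construction
se = psi.std(ddof=1) / np.sqrt(n)       # influence-function standard error
print(theta_hat, se)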
1) Placebo and subset checks#
Goal: detect spurious signal and fragility to sample size.
Expectation: with placebo outcome/treatment the effect should be near 0 with a large p‑value; under sub‑sampling the estimate should be stable within noise.
Python
from causalis.refutation import (
    refute_placebo_outcome, refute_placebo_treatment, refute_subset
)
from causalis.inference.ate import dml_ate
# Placeholder: build CausalData as in the guides
causal_data = ... # your CausalData
# Placebo: replace Y with noise
placebo_y = refute_placebo_outcome(dml_ate, causal_data, random_state=42)
# Placebo: shuffle D
placebo_d = refute_placebo_treatment(dml_ate, causal_data, random_state=42)
# Subset: re‑estimate on a fraction of data
subset = refute_subset(dml_ate, causal_data, fraction=0.5, random_state=42)
print(placebo_y) # {'theta': ~0, 'p_value': large}
2) Overlap (positivity) diagnostics#
Goal: verify usable overlap and stable IPW weights.
Key metrics (summary keys):
edge_mass: share of m̂ near 0/1 (danger ⇒ exploding weights). Rule of thumb: for ε=0.01, yellow ≈ 2%, red ≈ 5% on either side; for ε=0.02, yellow ≈ 5%, red ≈ 10%.
ks: two‑sample KS on m̂ for treated vs control (0 best, 1 worst). Red if > 0.35.
auc: how well m̂ separates treated from control (AUC of m̂ as a classifier for D). Good overlap ⇒ ≈ 0.5.
ate_ess: effective sample size ratio in treated weights (higher is better; near 1 good).
ate_tails: tail indices like q99/median of treated weights; >10 caution, >100 red.
att_weights.ATT_identity_relerr: relative error of ATT odds identity (≤5% green; 5–10% yellow; >10% red).
clipping: how many propensities were clipped (large values signal poor overlap).
calibration: ECE/slope/intercept for m̂ probability calibration (ECE small is good).
Python
from causalis.refutation.overlap import run_overlap_diagnostics
res = ... # your inference result dict
rep = run_overlap_diagnostics(res=res) # or run_overlap_diagnostics(m_hat=m, D=D)
print(rep['summary'])
# Also available: rep['edge_mass'], rep['ks'], rep['auc'], rep['ate_ess'],
# rep['ate_tails'], rep['att_weights'], rep['clipping'], rep['calibration']
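For intuition, a minimal sketch of a few of these metrics computed by hand from m̂ and D; the library's exact definitions, clipping rules, and thresholds may differ:
Python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.metrics import roc_auc_score

def overlap_sketch(m_hat, D, eps=0.01):
    # edge_mass: share of propensities within eps of 0 or 1
    edge_mass = np.mean((m_hat < eps) | (m_hat > 1 - eps))
    # ks: two-sample KS statistic on m̂ for treated vs control
    ks = ks_2samp(m_hat[D == 1], m_hat[D == 0]).statistic
    # auc: how well m̂ separates treated from control (≈0.5 means good overlap)
    auc = roc_auc_score(D, m_hat)
    # ESS ratio of treated IPW weights: (Σw)² / (n·Σw²), near 1 is good
    w = 1.0 / m_hat[D == 1]
    ess_ratio = w.sum() ** 2 / (len(w) * (w ** 2).sum())
    # tail index: q99 / median of treated weights (>10 caution, >100 red)
    tails = np.quantile(w, 0.99) / np.median(w)
    return dict(edge_mass=edge_mass, ks=ks, auc=auc, ate_ess=ess_ratio, ate_tails=tails)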
3) Score/orthogonality diagnostics (Neyman)#
Goal: confirm E[ψ]=0 holds out‑of‑sample and that ψ is insensitive to small nuisance perturbations.
Readouts:
oos_moment_test: t‑statistics oos_tstat_fold and oos_tstat_strict, ≈ N(0,1) when the moment condition holds; values near 0 are good.
orthogonality_derivatives: max |t| for directions g1, g0, m (ATE) or g0, m (ATT). Rule of thumb: max |t| ≲ 2 is okay.
influence_diagnostics: tail metrics of ψ (psi_p99_over_med, psi_kurtosis). Large values flag instability and heavy tails.
Python
from causalis.refutation.score.score_validation import run_score_diagnostics
res = ... # your inference result dict
rep_score = run_score_diagnostics(res=res)
print(rep_score['summary'])
# Details: rep_score['oos_moment_test'], rep_score['orthogonality_derivatives'], rep_score['influence_diagnostics']
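At its core the OOS moment test is just a t‑statistic on the centered score; a minimal sketch, assuming psi is the centered out‑of‑sample score array (the fold/strict variants differ in how they aggregate across folds and are library‑specific):
Python
import numpy as np

def oos_moment_tstat(psi):
    # H0: E[psi] = 0; under H0 and mild conditions the t-statistic is ≈ N(0,1)
    return psi.mean() / (psi.std(ddof=1) / np.sqrt(len(psi)))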
4) Unconfoundedness/balance checks#
Goal: see if IPW (for your estimand) balances covariates.
Metrics:
balance_max_smd: max standardized mean difference across covariates (smaller is better; ≤0.1 common rule).
balance_frac_violations: share of covariates with SMD ≥ threshold (default 0.10).
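Concretely, a minimal sketch of a weighted standardized mean difference for one covariate (hypothetical helper, not library API; pooled‑SD conventions vary):
Python
import numpy as np

def weighted_smd(x, D, w):
    # (weighted treated mean − weighted control mean) / pooled SD
    mu1 = np.average(x[D == 1], weights=w[D == 1])
    mu0 = np.average(x[D == 0], weights=w[D == 0])
    pooled_sd = np.sqrt((x[D == 1].var(ddof=1) + x[D == 0].var(ddof=1)) / 2)
    return (mu1 - mu0) / pooled_sd

# balance_max_smd ≈ max |SMD| over covariates;
# balance_frac_violations ≈ share of covariates with |SMD| ≥ 0.10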
Python
from causalis.refutation.unconfoundedness.uncofoundedness_validation import (
    run_unconfoundedness_diagnostics
)
res = ... # your inference result dict
rep_uc = run_unconfoundedness_diagnostics(res=res)
print(rep_uc['summary'])
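5) Bias‑aware sensitivity#
Goal: quantify how strong unobserved confounding would have to be to overturn your conclusion.
The call below mirrors the one in the workflow; as used there, cf_y and cf_d set the assumed confounding strength in the outcome and treatment equations, rho their correlation, and level the confidence level of the bias‑adjusted interval (check the sensitivity_analysis docstring for the exact semantics in your version).
Python
from causalis.refutation.unconfoundedness.sensitivity import sensitivity_analysis
res = ... # your inference result dict
# Modest assumed confounding (cf_y = cf_d = 0.02, rho = 0.5) at the 95% level
print(sensitivity_analysis(res, cf_y=0.02, cf_d=0.02, rho=0.5, level=0.95))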
6) SUTVA (design prompts)#
SUTVA (no interference between units, one well‑defined version of treatment) isn't testable from data. Use the built‑in checklist to document design decisions.
Python
from causalis.refutation.sutva.sutva_validation import print_sutva_questions
print_sutva_questions()
Minimal workflow (ATE)#
Python
from causalis.inference.ate import dml_ate
from causalis.refutation import (
    refute_placebo_outcome, refute_placebo_treatment, refute_subset
)
from causalis.refutation.overlap import run_overlap_diagnostics
from causalis.refutation.score.score_validation import run_score_diagnostics
from causalis.refutation.unconfoundedness.uncofoundedness_validation import (
    run_unconfoundedness_diagnostics
)
from causalis.refutation.unconfoundedness.sensitivity import sensitivity_analysis
# Build your CausalData as in the guides
causal_data = ... # your CausalData
# Estimate effect (stores nuisances needed for diagnostics)
res = dml_ate(causal_data, n_folds=4, store_diagnostic_data=True, random_state=123)
# 1) Placebo + subset
_ = refute_placebo_outcome(dml_ate, causal_data, random_state=42)
_ = refute_placebo_treatment(dml_ate, causal_data, random_state=42)
_ = refute_subset(dml_ate, causal_data, fraction=0.5, random_state=42)
# 2) Overlap
ov = run_overlap_diagnostics(res=res); print(ov['summary'])
# 3) Score/orthogonality
sc = run_score_diagnostics(res=res); print(sc['summary'])
# 4) Balance
uc = run_unconfoundedness_diagnostics(res=res); print(uc['summary'])
# 5) Sensitivity (simple example)
print(sensitivity_analysis(res, cf_y=0.02, cf_d=0.02, rho=0.5, level=0.95))
Notes
Always stratify folds by D and enable clipping of extreme propensities; large clipping counts or edge masses indicate weak overlap (see the sketch after these notes).
For ATT, focus on g0 and m diagnostics and the ATT weight identity in the overlap report.
Heavy ψ tails often trace back to poor overlap; consider trimming, better learners, or different features.
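A minimal sketch of the first note, stratified cross‑fitting folds plus propensity clipping (eps is illustrative; align it with the edge‑mass thresholds in your overlap report):
Python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Stratify folds by D so every fold contains both treatment arms
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=123)
# folds = list(skf.split(X, D))  # X, D from your CausalData

# Clip extreme propensities before forming IPW weights and report the count
def clip_propensities(m_hat, eps=0.01):
    n_clipped = int(np.sum((m_hat < eps) | (m_hat > 1 - eps)))
    return np.clip(m_hat, eps, 1 - eps), n_clipped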