Refutation of DML IRM inference#
This notebook explains the refutation tests implemented in Causalis. The data-generating process (DGP) is taken from the benchmarking notebook.
import numpy as np
from typing import List, Dict, Any, Tuple
from causalis.data import CausalDatasetGenerator
# 1) Confounders
confounder_specs: List[Dict[str, Any]] = [
{"name": "tenure_months", "dist": "normal", "mu": 24, "sd": 12},
{"name": "avg_sessions_week", "dist": "normal", "mu": 5, "sd": 2},
{"name": "spend_last_month", "dist": "uniform", "a": 0, "b": 200},
{"name": "premium_user", "dist": "bernoulli", "p": 0.25},
{"name": "urban_resident", "dist": "bernoulli", "p": 0.60},
]
# 2) Feature index map
def feature_indices_from_specs(specs: List[Dict[str, Any]]) -> Dict[str, Tuple[int, ...]]:
idx, out = 0, {}
for spec in specs:
name = spec.get("name", "")
dist = str(spec.get("dist","normal")).lower()
if dist in ("normal","uniform","bernoulli"):
out[name] = (idx,); idx += 1
else:
raise ValueError(f"Unsupported dist: {dist}")
return out
feat = feature_indices_from_specs(confounder_specs)
def col(X, key): return X[:, feat[key][0]]
def _log1p_pos(x): return np.log1p(np.clip(x, 0.0, None))
def _sqrt_pos(x): return np.sqrt(np.clip(x, 0.0, None))
def _ind(cond): return cond.astype(float)
def _sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
# 3) g_d(x) - nonlinear connection between X and D
def g_d(X: np.ndarray) -> np.ndarray:
tenure = col(X, "tenure_months")
sess = col(X, "avg_sessions_week")
spend = col(X, "spend_last_month")
prem = col(X, "premium_user")
urban = col(X, "urban_resident")
tau_align = tau_func(X) # explicit alignment
return (
1.10 * np.tanh(0.06*(spend - 100.0))
+ 1.00 * _sigmoid(0.60*(sess - 5.0))
+ 0.50 * _log1p_pos(tenure)
+ 0.50 * prem
+ 0.25 * urban
+ 0.90 * prem * _ind(spend > 120.0)
+ 0.30 * urban * _ind(tenure < 12.0)
+ 0.80 * tau_align # direct alignment term (λ)
)
# 4) g_y(x) - nonlinear connection between X and Y
def g_y(X: np.ndarray) -> np.ndarray:
tenure = col(X, "tenure_months")
sess = col(X, "avg_sessions_week")
spend = col(X, "spend_last_month")
prem = col(X, "premium_user")
urban = col(X, "urban_resident")
return (
0.70 * np.tanh(0.03*(spend - 80.0))
+ 0.50 * _sqrt_pos(sess)
+ 0.40 * _log1p_pos(tenure)
+ 0.30 * prem
+ 0.10 * urban
- 0.10 * _ind(spend < 20.0)
)
# 5) tau(x) — nonlinear effect function (CATE)
def tau_func(X: np.ndarray) -> np.ndarray:
tenure = col(X, "tenure_months")
sess = col(X, "avg_sessions_week")
spend = col(X, "spend_last_month")
prem = col(X, "premium_user")
urban = col(X, "urban_resident")
return (
0.40
+ 0.60 * (1.0 / (1.0 + np.exp(-0.40*(sess - 5.0)))) # sigmoid
+ 2 * prem * _ind(spend > 120.0)
+ 0.10 * urban * _ind(tenure < 12.0)
)
# 6) Generator — continuous outcome
gen = CausalDatasetGenerator(
theta=0.0, # ignored; we pass tau
tau=tau_func, # nonlinear effect
beta_y=None, beta_d=None, # use nonlinear g_* only
g_y=g_y, g_d=g_d, # nonlinear functions
alpha_y=0.0, # baseline mean level, intercept
alpha_d=0.0, # will be calibrated to target_d_rate
sigma_y=1.0, # noise std for Y
outcome_type="continuous", # outcome distribution
confounder_specs=confounder_specs,
target_d_rate=0.20, # 20% will be treated
u_strength_d=0.0, # strength of latent confounder influence on treatment
u_strength_y=0.0, # strength of latent confounder influence on outcome
propensity_sharpness=1, # increase to make overlap harder
seed=123 # random seed for reproducibility
)
# 7) Generate
n = 10_000 # Number of observations
df = gen.generate(n)
print("Treatment share ≈", df["d"].mean())
true_ate = float(df["cate"].mean())
print(f"Ground-truth ATE from the DGP: {true_ate:.3f}")
# Ground-truth ATT (on the natural scale): E[tau(X) | T=1] = mean CATE among the treated
true_att = float(df.loc[df["d"] == 1, "cate"].mean())
print(f"Ground-truth ATT from the DGP: {true_att:.3f}")
Treatment share ≈ 0.2036
Ground-truth ATE from the DGP: 0.913
Ground-truth ATT from the DGP: 1.567
from causalis.data import CausalData
causal_data = CausalData(
df=df,
treatment="d",
outcome="y",
confounders=["tenure_months",
"avg_sessions_week",
"spend_last_month",
"premium_user",
"urban_resident"]
)
causal_data.df.head()
| | y | d | tenure_months | avg_sessions_week | spend_last_month | premium_user | urban_resident |
|---|---|---|---|---|---|---|---|
| 0 | 0.689404 | 0.0 | 12.130544 | 4.056687 | 181.570607 | 0.0 | 0.0 |
| 1 | 3.045282 | 0.0 | 19.586560 | 1.671561 | 182.793598 | 0.0 | 0.0 |
| 2 | 7.173595 | 1.0 | 39.455103 | 5.452889 | 125.185708 | 1.0 | 1.0 |
| 3 | 1.926216 | 0.0 | 26.327693 | 5.051629 | 4.932905 | 0.0 | 1.0 |
| 4 | 1.225088 | 0.0 | 35.042771 | 4.933996 | 23.577407 | 0.0 | 0.0 |
Inference#
from causalis.inference.ate import dml_ate
# Estimate Average Treatment Effect (ATE)
ate_result = dml_ate(causal_data, n_folds=4, normalize_ipw=False, store_diagnostic_data=True, random_state=123)
print(ate_result.get('coefficient'))
print(ate_result.get('p_value'))
print(ate_result.get('confidence_interval'))
print(f"Ground-truth ATE from the DGP: {true_ate:.3f}")
0.9917276396749556
0.0
(0.869543879249174, 1.1139114001007373)
Ground-truth ATE from the DGP: 0.913
As we can see, the estimate is accurate and the confidence interval covers the ground-truth ATE. In real applications we cannot compare the estimate with the truth, so we need to check its robustness: run tests on the identifying assumptions and answer questions about the research design.
Overlap#
What “overlap/positivity” means#
Binary treatment \(D \in \{0,1\}\): for all confounder values \(x\) in your target population,
\[ 0 < e(x) < 1, \]
often strengthened to strong positivity: there exists an \(\varepsilon > 0\) such that
\[ \varepsilon \le e(x) \le 1 - \varepsilon \quad \text{for all } x. \]
Why it matters#
Identification: Overlap + unconfoundedness are the two pillars that identify causal effects from observational data. Without overlap, the effect is not identified — you must extrapolate or model-specify what never occurs.
Estimation stability: IPW/DR estimators use weights
\[ w_1 = \frac{D}{e(X)}, \qquad w_0 = \frac{1 - D}{1 - e(X)}. \]
If \(e(X)\) is near 0 or 1, these weights explode, causing huge variance and fragile estimates.
Target population: With trimming or restriction, you may change who the effect describes — e.g., ATE on the region of common support, not on the full population.
Below is a summary of the overlap diagnostics.
from causalis.refutation import *
rep = run_overlap_diagnostics(res=ate_result)
rep["summary"]
| | metric | value | flag |
|---|---|---|---|
| 0 | edge_0.01_below | 0.000000 | GREEN |
| 1 | edge_0.01_above | 0.000000 | GREEN |
| 2 | edge_0.02_below | 0.077300 | YELLOW |
| 3 | edge_0.02_above | 0.000400 | YELLOW |
| 4 | KS | 0.511643 | RED |
| 5 | AUC | 0.835125 | YELLOW |
| 6 | ESS_treated_ratio | 0.247034 | YELLOW |
| 7 | ESS_control_ratio | 0.327069 | GREEN |
| 8 | tails_w1_q99/med | 38.676284 | YELLOW |
| 9 | tails_w0_q99/med | 20.575638 | YELLOW |
| 10 | ATT_identity_relerr | 0.177229 | RED |
| 11 | clip_m_total | 0.023600 | YELLOW |
| 12 | calib_ECE | 0.018453 | GREEN |
| 13 | calib_slope | 0.889332 | GREEN |
| 14 | calib_intercept | -0.106806 | GREEN |
edge_mass#
edge_0.01_below, edge_0.01_above, edge_0.02_below, edge_0.02_above are the shares of units whose estimated propensity falls below \(\varepsilon\) or above \(1-\varepsilon\), for \(\varepsilon = 0.01\) and \(\varepsilon = 0.02\) (a small NumPy sketch follows the flag thresholds below).
Keep in mind that DML IRM clips propensities outside the interval \([0.02, 0.98]\).
Large shares are dangerous for estimation because the corresponding IPW weights explode.
Flags in Causalis:
For \(ε=0.01\): YELLOW if either side exceeds 0.02 (2%), RED if it exceeds 0.05 (5%).
For \(ε=0.02\): YELLOW if either side exceeds 0.05 (5%), RED if it exceeds 0.10 (10%).
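As an illustration only (not the library's internal code), these shares can be computed directly from an array of cross-fitted propensity scores; `m` below is a synthetic stand-in for those scores.
import numpy as np

def edge_mass_shares(m: np.ndarray, eps: float) -> dict:
    """Share of units with propensity below eps or above 1 - eps (illustrative)."""
    m = np.asarray(m, dtype=float)
    return {
        f"share_below_{eps}": float(np.mean(m < eps)),
        f"share_above_{eps}": float(np.mean(m > 1.0 - eps)),
    }

# Synthetic stand-in for cross-fitted propensity scores
rng = np.random.default_rng(0)
m = rng.beta(2.0, 5.0, size=10_000)
print(edge_mass_shares(m, 0.01), edge_mass_shares(m, 0.02))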
rep['edge_mass']
{'share_below_001': 0.0,
'share_above_001': 0.0,
'share_below_002': 0.0773,
'share_above_002': 0.0004,
'min_m': 0.01,
'max_m': 0.99}
# You can also check shares per arm
rep['edge_mass_by_arm']
{'share_below_001_D1': 0.0,
'share_above_001_D0': 0.0,
'share_below_002_D1': 0.010805500982318271,
'share_above_002_D0': 0.00025113008538422905}
ks - Kolmogorov–Smirnov statistic#
Here KS is the two-sample Kolmogorov–Smirnov statistic comparing the distributions of the propensities for treated vs control:
Interpretation:
KS = 0: identical propensity distributions (perfect overlap).
KS = 1: complete separation (no overlap).
The value KS = 0.5116 means there exists a threshold \(t\) such that the share of treated with \(m\le t\) differs from the share of controls with \(m\le t\) by ~51 percentage points. That is why it is flagged RED (the thresholds mark RED when KS > 0.35): treatment assignment is highly predictable from covariates ⇒ poor overlap / strong confounding risk.
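For intuition, here is a hedged sketch of this check using scipy.stats.ks_2samp, assuming `m` holds propensity scores and `d` the treatment indicator (synthetic data, not the Causalis implementation).
import numpy as np
from scipy.stats import ks_2samp

def ks_overlap(m: np.ndarray, d: np.ndarray) -> float:
    """Two-sample KS statistic between propensities of treated and controls."""
    m, d = np.asarray(m, float), np.asarray(d, int)
    return float(ks_2samp(m[d == 1], m[d == 0]).statistic)

# Synthetic illustration: the more separated the two propensity distributions, the larger KS
rng = np.random.default_rng(0)
d = rng.integers(0, 2, size=5_000)
m = np.where(d == 1, rng.beta(5, 2, 5_000), rng.beta(2, 5, 5_000))
print(round(ks_overlap(m, d), 3))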
rep['ks']
0.5116427657267132
AUC#
Probability definition (most intuitive)#
\[ \text{AUC} = \Pr(s^+ > s^-) + \tfrac12 \Pr(s^+ = s^-), \]
where \(s^+\) is a score from a random positive and \(s^-\) from a random negative. So AUC is the fraction of all \(n_1 n_0\) positive–negative pairs that are correctly ordered by the score (ties get half-credit).
Rank / Mann–Whitney formulation#
Rank all scores together (ascending). If there are ties, assign average ranks within each tied block.
Let \((R_1)\) be the sum of ranks for the positives.
Compute the Mann–Whitney (U) statistic for positives:
\[ U_1 = R_1 - \frac{n_1(n_1+1)}{2}. \]
Convert to AUC by normalizing:
\[ \boxed{\text{AUC} = \frac{U_1}{n_1 n_0} = \frac{R_1 - \frac{n_1(n_1+1)}{2}}{n_1 n_0}} \]
This is exactly what your function returns (with stable sorting and tie-averaged ranks).
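A minimal NumPy/SciPy sketch of the rank formula above (illustrative, not necessarily the exact code Causalis uses), with a sanity check against sklearn:
import numpy as np
from scipy.stats import rankdata
from sklearn.metrics import roc_auc_score

def auc_mann_whitney(scores: np.ndarray, labels: np.ndarray) -> float:
    """AUC via tie-averaged ranks: (R1 - n1(n1+1)/2) / (n1 * n0)."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    n1, n0 = int(labels.sum()), int((1 - labels).sum())
    ranks = rankdata(scores)          # average ranks for ties
    r1 = ranks[labels == 1].sum()     # rank sum of the positives
    return float((r1 - n1 * (n1 + 1) / 2.0) / (n1 * n0))

# Sanity check against sklearn's ROC AUC on synthetic scores
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 1_000)
s = 0.5 * y + rng.normal(size=1_000)
print(round(auc_mann_whitney(s, y), 4), round(roc_auc_score(y, s), 4))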
ROC-integral view (equivalent)#
If \(\text{TPR}(t)\) and \(\text{FPR}(t)\) trace the ROC curve as the threshold \(t\) moves,
\[ \text{AUC} = \int_0^1 \text{TPR}\; d\,\text{FPR}, \]
i.e., the geometric area under the ROC.
Properties you should remember#
Range: \((0 \le \text{AUC} \le 1)\); 0.5 = random ranking; 1 = perfect separation.
Symmetry: \((\text{AUC}(s,y) = 1 - \text{AUC}(s,1-y))\).
Monotone invariance: Any strictly increasing transform \((f)\) leaves AUC unchanged (only ranks matter).
Ties: Averaged ranks ⇒ adds the \((\tfrac12\Pr(s^+=s^-))\) term automatically.
In the propensity/overlap context#
A higher AUC means treatment (D) is more predictable from covariates (bad for overlap/positivity).
For good overlap you actually want AUC close to 0.5.
rep['auc']
0.8351248965136829
ESS_treated_ratio#
Weights used#
For ATE-style IPW, the treated-arm weights are
\[ w_{1,i} = \frac{D_i}{m_i}, \]
so on the treated subset \(\{i : D_i = 1\}\) the weights are simply \(1/m_i\).
Effective sample size (ESS)#
Given the treated-arm weights \(w_1,\ldots,w_{n_1}\) (only for \(D=1\)),
\[ \mathrm{ESS} = \frac{\bigl(\sum_{i=1}^{n_1} w_i\bigr)^2}{\sum_{i=1}^{n_1} w_i^2}. \]
This is exactly what _ess(w) computes.
If all treated weights are equal, ESS \((= n_1)\) (full efficiency).
If a few weights dominate, ESS drops (information concentrated in few units).
The reported metric#
\[ \texttt{ESS\_treated\_ratio} = \frac{\mathrm{ESS}}{n_1}. \]
This lies in \((0,1]\). Near 1 ⇒ well-behaved weights; near 0 ⇒ severe instability.
Why it reflects overlap#
When propensities \((m_i)\) approach 0 for treated units, weights \((1/m_i)\) explode → large CV → low ESS_treated_ratio. Hence this metric is a direct, quantitative read on how much usable information remains in the treated group after IPW.
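A minimal sketch of these quantities, assuming `m` and `d` are arrays of propensities and treatment indicators; `_ess` mirrors the formula above, the rest is illustrative:
import numpy as np

def _ess(w: np.ndarray) -> float:
    """Effective sample size: (sum w)^2 / sum w^2."""
    w = np.asarray(w, float)
    return float(w.sum() ** 2 / np.sum(w ** 2))

def ess_treated_ratio(m: np.ndarray, d: np.ndarray) -> float:
    """ESS of the treated ATE weights 1/m, relative to the number of treated units."""
    m, d = np.asarray(m, float), np.asarray(d, int)
    w1 = 1.0 / m[d == 1]              # ATE weights on the treated subset
    return _ess(w1) / float((d == 1).sum())

rng = np.random.default_rng(0)
d = rng.integers(0, 2, 5_000)
m = np.clip(rng.beta(2, 5, 5_000), 0.01, 0.99)
print(round(ess_treated_ratio(m, d), 3))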
print(rep['ate_ess'])
{'ess_w1': 502.960611352384, 'ess_w0': 2604.778026253534, 'ess_ratio_w1': 0.2470336990925265, 'ess_ratio_w0': 0.32706906407000674}
tails_w1_q99/med#
Interpretation#
It’s a tail-heaviness index for treated weights: how large the 99th-percentile weight is relative to a typical (median) weight.
Scale-invariant: if you re-scale weights (e.g., Hájek normalization), both numerator and denominator scale equally, so the ratio is unchanged.
Bigger \(\Rightarrow\) heavier right tail \(\Rightarrow\) more variance inflation for IPW (since variance depends on large \(w_i^2\)). It typically coincides with a low ESS_treated_ratio.
Edge cases & thresholds#
If \((\text{median}(W_1)=0)\) or undefined, the ratio is not meaningful (your code returns “NA” in that case; with positive treated weights this is rare).
Defaults: YELLOW if any of \(({q95/med,q99/med,q999/med,\max/med})\) exceeds 10; RED if any exceed 100.
tails_w1_q99/med is one of these checks, focusing specifically on the 99th percentile.
Quick example#
If \(\mathrm{median}(W_1) = 1.2\) and \(Q_{0.99}(W_1) = 46.8\), then
\[ \frac{Q_{0.99}(W_1)}{\mathrm{median}(W_1)} = \frac{46.8}{1.2} = 39, \]
indicating heavy tails and a likely unstable ATE IPW.
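A hedged sketch of the tail-ratio checks, assuming `w1` is the vector of treated IPW weights (illustrative, not the package internals):
import numpy as np

def tail_ratios(w1: np.ndarray) -> dict:
    """Quantile-to-median ratios of the treated weights (tail-heaviness indices)."""
    w1 = np.asarray(w1, float)
    med = np.median(w1)
    if med <= 0:
        return {"q99/med": float("nan")}
    return {
        "q95/med": float(np.quantile(w1, 0.95) / med),
        "q99/med": float(np.quantile(w1, 0.99) / med),
        "max/med": float(w1.max() / med),
    }

rng = np.random.default_rng(0)
w1 = 1.0 / np.clip(rng.beta(2, 5, 2_000), 0.01, 0.99)  # ATE-style treated weights
print(tail_ratios(w1))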
print(rep['ate_tails'])
{'w1': {'q50': 2.585563809098619, 'q95': 26.04879283279811, 'q99': 100.0, 'max': 100.0, 'median': 2.585563809098619}, 'w0': {'q50': 1.073397908573178, 'q95': 1.6626662464619888, 'q99': 22.085846285748335, 'max': 99.99999999999991, 'median': 1.073397908573178}}
ATT_identity_relerr#
With estimated propensities \((m_i=\hat m(X_i))\) and \((D_i\in{0,1})\):
Left-hand side (controls odds sum):
\[ \text{LHS} = \sum_{i=1}^n (1-D_i)\,\frac{m_i}{1-m_i}. \]
Right-hand side (treated count):
\[ \text{RHS} = \sum_{i=1}^n D_i = n_1. \]
If \((\hat m\approx m)\) and overlap is ok, LHS \((\approx)\) RHS.
You report the relative error:
\[ \texttt{relerr} = \frac{|\text{LHS} - \text{RHS}|}{\text{RHS}} \]
(when \(n_1>0\); otherwise it is set to \(\infty\)).
How to read the number#
Small \(\texttt{relerr}\) (e.g., \(\le 5\%\)) ⇒ propensities are reasonably calibrated (especially on the control side) and ATT weights won’t be wildly off in total mass.
Large \(\texttt{relerr}\) ⇒ possible miscalibration of \(\hat m\) (e.g., over/underestimation for controls), poor overlap (many controls with \(m_i\to 1\) inflating \(m_i/(1-m_i)\)), or clipping/trimming effects.
Your default flags (same as in the code):
GREEN if \((\texttt{relerr} \le 0.05)\)
YELLOW if \((0.05 < \texttt{relerr} \le 0.10)\)
RED if \((> 0.10)\)
Quick intuition#
The term \((m/(1-m))\) is the odds of treatment. Summing that over controls should reconstruct the treated count. If it doesn’t, either the odds are off (propensity miscalibration) or the data lack support where you need it—both are red flags for ATT-IPW stability.
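A sketch of the identity check, assuming arrays `m` (propensities) and `d` (treatment); when `d` is drawn from well-calibrated propensities the relative error should be small.
import numpy as np

def att_identity_relerr(m: np.ndarray, d: np.ndarray) -> float:
    """Relative error between the control odds sum and the treated count."""
    m, d = np.asarray(m, float), np.asarray(d, int)
    lhs = np.sum((1 - d) * m / (1 - m))   # sum of odds over controls
    rhs = d.sum()                          # number of treated
    return float("inf") if rhs == 0 else float(abs(lhs - rhs) / rhs)

rng = np.random.default_rng(0)
m = np.clip(rng.beta(2, 5, 10_000), 0.01, 0.99)
d = rng.binomial(1, m)                      # treatment drawn from the same propensities
print(round(att_identity_relerr(m, d), 3))  # small when m is well calibrated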
print(rep['att_weights'])
{'lhs_sum': 2396.83831663893, 'rhs_sum': 2036.0, 'rel_err': 0.17722903567727402}
clip_m_total#
See edge_mass above: clip_m_total is the total share of propensity scores affected by clipping (lower plus upper bound).
print(rep['clipping'])
{'m_clip_lower': 0.0235, 'm_clip_upper': 0.0001, 'g_clip_share': nan}
calib_ECE, calib_slope, calib_intercept#
calib_ECE = 0.018 (GREEN)#
Math: with 10 equal-width bins,
\[ \mathrm{ECE} = \sum_{k=1}^{10} \frac{n_k}{n}\,\bigl|\bar y_k - \bar p_k\bigr| \]
(weighted average gap between observed rate \(\bar y_k\) and mean prediction \(\bar p_k\) per bin). Result: ~1.8% average miscalibration → overall probabilities track outcomes well. Note the biggest bin error is in 0.5–0.6 (abs_error ≈ 0.162) but it’s tiny (95/10,000), so ECE stays low.
calib_slope (β) = 0.889 (GREEN)#
Math (logistic recalibration):
\[ \operatorname{logit} \Pr(D=1 \mid \hat m) = \alpha + \beta\,\operatorname{logit}(\hat m). \]
Interpretation: \(\beta<1\) ⇒ predictions are a bit over-confident (too extreme); the optimal calibration slightly flattens them toward 0.5.
calib_intercept (α) = −0.107 (GREEN)#
Math: same model as above; \(\alpha\) is a vertical shift on the log-odds scale. Interpretation: negative \(\alpha\) nudges probabilities downward overall (your model is, on average, a bit high), consistent with bins like 0.5–0.6 where \(\bar p_k > \bar y_k\).
All three fall well within your GREEN thresholds, so calibration looks solid despite minor mid-range overprediction.
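A minimal sketch of all three quantities, assuming `m` holds predicted propensities and `d` the observed treatment; the recalibration is approximated with a nearly unpenalized logistic fit on logit(`m`) (illustrative, not the package internals).
import numpy as np
from sklearn.linear_model import LogisticRegression

def ece_equal_width(p: np.ndarray, y: np.ndarray, n_bins: int = 10) -> float:
    """Expected calibration error with equal-width bins."""
    p, y = np.asarray(p, float), np.asarray(y, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(p, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for k in range(n_bins):
        mask = idx == k
        if mask.any():
            ece += mask.mean() * abs(y[mask].mean() - p[mask].mean())
    return float(ece)

def recalibration_alpha_beta(p: np.ndarray, y: np.ndarray) -> tuple:
    """Fit logit P(D=1) = alpha + beta * logit(p); beta < 1 means over-confident predictions."""
    p = np.clip(np.asarray(p, float), 1e-6, 1 - 1e-6)
    z = np.log(p / (1 - p)).reshape(-1, 1)
    lr = LogisticRegression(C=1e6).fit(z, np.asarray(y, int))  # large C ~ unpenalized
    return float(lr.intercept_[0]), float(lr.coef_[0][0])

rng = np.random.default_rng(0)
m = np.clip(rng.beta(2, 5, 10_000), 0.01, 0.99)
d = rng.binomial(1, m)
print(round(ece_equal_width(m, d), 4), recalibration_alpha_beta(m, d))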
print(rep['calibration'])
{'n': 10000, 'n_bins': 10, 'auc': 0.8351248965136829, 'brier': 0.10778483728183921, 'ece': 0.018452696253043466, 'reliability_table': bin lower upper count mean_p frac_pos abs_error
0 0 0.0 0.1 5089 0.044279 0.054431 0.010152
1 1 0.1 0.2 1724 0.148308 0.172274 0.023965
2 2 0.2 0.3 1171 0.245443 0.231426 0.014017
3 3 0.3 0.4 626 0.341419 0.316294 0.025125
4 4 0.4 0.5 251 0.440675 0.382470 0.058205
5 5 0.5 0.6 95 0.541182 0.378947 0.162235
6 6 0.6 0.7 107 0.649163 0.588785 0.060378
7 7 0.7 0.8 214 0.754985 0.785047 0.030061
8 8 0.8 0.9 427 0.851461 0.859485 0.008024
9 9 0.9 1.0 296 0.932652 0.888514 0.044139, 'recalibration': {'intercept': -0.10680601474031537, 'slope': 0.8893319945661962}, 'flags': {'ece': 'GREEN', 'slope': 'GREEN', 'intercept': 'GREEN'}, 'thresholds': {'ece_warn': 0.1, 'ece_strong': 0.2, 'slope_warn_lo': 0.8, 'slope_warn_hi': 1.2, 'slope_strong_lo': 0.6, 'slope_strong_hi': 1.4, 'intercept_warn': 0.2, 'intercept_strong': 0.4}}
calib = rep['calibration']
calib['reliability_table']
| | bin | lower | upper | count | mean_p | frac_pos | abs_error |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 0.1 | 5089 | 0.044279 | 0.054431 | 0.010152 |
| 1 | 1 | 0.1 | 0.2 | 1724 | 0.148308 | 0.172274 | 0.023965 |
| 2 | 2 | 0.2 | 0.3 | 1171 | 0.245443 | 0.231426 | 0.014017 |
| 3 | 3 | 0.3 | 0.4 | 626 | 0.341419 | 0.316294 | 0.025125 |
| 4 | 4 | 0.4 | 0.5 | 251 | 0.440675 | 0.382470 | 0.058205 |
| 5 | 5 | 0.5 | 0.6 | 95 | 0.541182 | 0.378947 | 0.162235 |
| 6 | 6 | 0.6 | 0.7 | 107 | 0.649163 | 0.588785 | 0.060378 |
| 7 | 7 | 0.7 | 0.8 | 214 | 0.754985 | 0.785047 | 0.030061 |
| 8 | 8 | 0.8 | 0.9 | 427 | 0.851461 | 0.859485 | 0.008024 |
| 9 | 9 | 0.9 | 1.0 | 296 | 0.932652 | 0.888514 | 0.044139 |
Score#
We need these score refutation tests to:
Catch overfitting/leakage: The out-of-sample moment check verifies that the AIPW score averages to ~0 on held-out folds using fold-specific θ and nuisances. If this fails, your effect can be an artifact of leakage or overfit learners rather than a real signal.
Verify Neyman orthogonality in practice: The Gateaux-derivative tests (orthogonality_derivatives) check that small, targeted perturbations to the nuisances (g₀, g₁, m) don’t move the score mean. Large |t| values flag miscalibration (e.g., biased propensity or outcome models) that breaks the orthogonality protection DML relies on.
Assess finite-sample stability: The influence diagnostics reveal heavy tails (p99/median, kurtosis) and top-influential points. Spiky ψ implies high variance and sensitivity—often due to near-0/1 propensities, poor overlap, or outliers.
ATTE-specific risks: For ATT/ATTE, only g₀ and m matter in the score. The added overlap metrics and trim curves show how reliant your estimate is on scarce, high-m controls—common failure mode for ATT.
from causalis.refutation.score.score_validation import run_score_diagnostics
rep_score = run_score_diagnostics(res=ate_result)
rep_score["summary"]
| | metric | value | flag |
|---|---|---|---|
| 0 | se_plugin | 6.233980e-02 | NA |
| 1 | psi_p99_over_med | 2.374779e+01 | RED |
| 2 | psi_kurtosis | 3.032000e+02 | RED |
| 3 | max_|t|_g1 | 4.350018e+00 | RED |
| 4 | max_|t|_g0 | 2.076780e+00 | YELLOW |
| 5 | max_|t|_m | 1.030583e+00 | GREEN |
| 6 | oos_tstat_fold | -2.552943e-15 | GREEN |
| 7 | oos_tstat_strict | -2.461798e-15 | GREEN |
psi_p99_over_med#
Let \(\psi_i\) be the per-unit influence value (EIF score) for your estimator. We look at magnitudes \(a_i \equiv |\psi_i|\).
Define the 99th percentile and the median of these magnitudes:
\[ q_{0.99} \equiv \operatorname{Quantile}_{0.99}(a_1,\dots,a_n),\qquad m \equiv \operatorname{median}(a_1,\dots,a_n). \]
The metric is the scale-free tail ratio:
\[ \boxed{ \texttt{psi\_p99\_over\_med} = \frac{q_{0.99}}{m} } \]
Why this works (brief):
Uses \(|\psi_i|\) to ignore sign (only tail size matters).
Dividing by the median makes it scale-invariant and robust to a few large values.
Large values \(\big(\gg 1\big)\) mean a small fraction of observations dominate uncertainty (heavy tails → unstable SE).
Quick read:
\(\approx 1\)–\(5\): tails tame/stable
\(\gtrsim 10\): caution (heavy tails)
\(\gtrsim 20\): likely unstable; check overlap, trim/clamp propensities, or robustify learners.
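A sketch, assuming `psi` is the vector of per-unit influence values:
import numpy as np

def psi_p99_over_med(psi: np.ndarray) -> float:
    """q99 / median of |psi|: a scale-free tail-heaviness index of the influence values."""
    a = np.abs(np.asarray(psi, float))
    med = np.median(a)
    return float(np.quantile(a, 0.99) / med) if med > 0 else float("nan")

rng = np.random.default_rng(0)
print(round(psi_p99_over_med(rng.normal(size=10_000)), 2))         # light tails
print(round(psi_p99_over_med(rng.standard_t(2, size=10_000)), 2))  # heavy tails, larger ratio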
rep_score['influence_diagnostics']
{'se_plugin': 0.06233979878689177,
'kurtosis': 303.1999961346597,
'p99_over_med': 23.747788009323845,
'top_influential': i psi m res_t res_c
0 6224 205.671733 0.010000 2.062391 0.0
1 1915 -180.280657 0.012198 -2.205898 -0.0
2 215 -163.974979 0.013644 -2.221508 -0.0
3 9389 131.757805 0.010585 1.393678 0.0
4 868 -101.111026 0.014727 -1.489752 -0.0
5 2741 -96.602896 0.024285 -2.345406 -0.0
6 1993 83.941140 0.028412 2.404237 0.0
7 9894 -82.585292 0.011016 -0.907962 -0.0
8 2350 70.269293 0.029162 2.063988 0.0
9 1499 70.103947 0.022092 1.576351 0.0}
psi_kurtosis#
Let \(\psi_i\) be the per-unit influence values and define centered residuals
\[ \tilde\psi_i \equiv \psi_i - \bar\psi,\qquad \bar\psi \equiv \frac{1}{n}\sum_{i=1}^n \psi_i. \]
Sample variance (with Bessel correction):
\[ s^2 \equiv \frac{1}{n-1}\sum_{i=1}^n \tilde\psi_i^2. \]
Sample 4th central moment:
\[ \hat\mu_4 \equiv \frac{1}{n}\sum_{i=1}^n \tilde\psi_i^4. \]
The reported metric (raw kurtosis, not excess):
\[ \boxed{ \texttt{psi\_kurtosis} = \frac{\hat{\mu}_4}{s^4} } \]
Interpretation (quick):
Normal reference \(\approx 3\) (excess kurtosis \(=0\)).
Much larger \(\Rightarrow\) heavier tails / more extreme \(\psi_i\) outliers.
Rules of thumb used in the diagnostics: \(\ge 10\) = caution, \(\ge 30\) = severe.
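A sketch of the raw-kurtosis formula above, assuming `psi` holds the influence values:
import numpy as np

def psi_kurtosis(psi: np.ndarray) -> float:
    """Raw kurtosis mu4 / s^4 with Bessel-corrected variance (normal reference ~ 3)."""
    psi = np.asarray(psi, float)
    c = psi - psi.mean()
    s2 = c @ c / (len(psi) - 1)     # sample variance
    mu4 = np.mean(c ** 4)           # fourth central moment
    return float(mu4 / s2 ** 2)

rng = np.random.default_rng(0)
print(round(psi_kurtosis(rng.normal(size=100_000)), 2))          # close to 3
print(round(psi_kurtosis(rng.standard_t(5, size=100_000)), 2))   # noticeably larger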
rep_score['influence_diagnostics']
{'se_plugin': 0.06233979878689177,
'kurtosis': 303.1999961346597,
'p99_over_med': 23.747788009323845,
'top_influential': i psi m res_t res_c
0 6224 205.671733 0.010000 2.062391 0.0
1 1915 -180.280657 0.012198 -2.205898 -0.0
2 215 -163.974979 0.013644 -2.221508 -0.0
3 9389 131.757805 0.010585 1.393678 0.0
4 868 -101.111026 0.014727 -1.489752 -0.0
5 2741 -96.602896 0.024285 -2.345406 -0.0
6 1993 83.941140 0.028412 2.404237 0.0
7 9894 -82.585292 0.011016 -0.907962 -0.0
8 2350 70.269293 0.029162 2.063988 0.0
9 1499 70.103947 0.022092 1.576351 0.0}
max_|t|_g1, max_|t|_g0, max_|t|_m#
We work with a basis of test functions \(h_b(X)\), \(b = 1,\dots,B\), evaluated on the confounders.
Let \((m_i^\tau \equiv \mathrm{clip}(m_i,\tau,1-\tau))\) be the clipped propensity (guards against division by zero).
ATE case#
For each basis function \((b)\), form a sample mean (Gateaux derivative estimator) and its standard error, then compute a t-statistic; finally take the maximum absolute value across bases.
\((g_1)\) direction#
\[ \widehat d_{g_1,b} = \frac{1}{n}\sum_i h_b(X_i)\Big(1 - \frac{D_i}{m_i^\tau}\Big), \qquad t_{g_1,b} = \frac{\widehat d_{g_1,b}}{\mathrm{se}(\widehat d_{g_1,b})}, \qquad \max_{|t|_{g_1}} = \max_b |t_{g_1,b}|. \]
\((g_0)\) direction#
\[ \widehat d_{g_0,b} = \frac{1}{n}\sum_i h_b(X_i)\Big(\frac{1-D_i}{1-m_i^\tau} - 1\Big), \qquad \max_{|t|_{g_0}} = \max_b |t_{g_0,b}|. \]
\((m)\) direction#
\[ \widehat d_{m,b} = -\frac{1}{n}\sum_i h_b(X_i)\left(\frac{D_i\,(Y_i - g_{1,i})}{(m_i^\tau)^2} + \frac{(1-D_i)\,(Y_i - g_{0,i})}{(1-m_i^\tau)^2}\right), \qquad \max_{|t|_{m}} = \max_b |t_{m,b}|. \]
Interpretation: under Neyman orthogonality, each derivative mean \((\widehat d_{\bullet,b})\) should be approximately zero, so all \((|t_{\bullet,b}|)\) should be small. Large \((\max_{|t|})\) values flag miscalibration of the corresponding nuisance.
ATTE / ATT case#
Let \((p_1 = \mathbb{E}[D])\) and define the odds \((o_i = m_i^\tau / (1 - m_i^\tau))\).
The \((g_1)\) derivative is identically zero:
\[ \Rightarrow\quad \max_{|t|_{g_1}} = 0. \]
\((g_0)\) direction
\[ \widehat d_{g_0,b} = \frac{1}{n}\sum_i h_b(X_i)\frac{(1-D_i)o_i - D_i}{p_1}, \qquad t_{g_0,b} = \frac{\widehat d_{g_0,b}}{\mathrm{se}(\widehat d_{g_0,b})}, \qquad \max_{|t|_{g_0}} = \max_b |t_{g_0,b}|. \]
\((m)\) direction
\[ \widehat d_{m,b} = -\frac{1}{n}\sum_i h_b(X_i) \frac{(1-D_i)(Y_i - g_{0,i})} {p_1(1 - m_i^\tau)^2}, \qquad \max_{|t|_{m}} = \max_b |t_{m,b}|. \]
Rule of thumb: \((\max_{|t|} \lesssim 2)\) is “okay”; larger values indicate orthogonality breakdown — fix by recalibrating that nuisance, changing learners, features, or trimming.
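To make the mechanics concrete, here is an illustrative sketch of the \(g_1\)-direction check for the ATE case, using the formula written above (an assumption about the exact form, not the package's internal code); `h`, `d`, and `m` are a test function evaluated on the confounders, the treatment indicator, and clipped propensities.
import numpy as np

def ortho_tstat_g1(h: np.ndarray, d: np.ndarray, m: np.ndarray) -> float:
    """t-statistic of the g1-direction derivative: mean of h * (1 - D/m) over its standard error."""
    vals = np.asarray(h, float) * (1.0 - np.asarray(d, float) / np.asarray(m, float))
    return float(vals.mean() / (vals.std(ddof=1) / np.sqrt(len(vals))))

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
m = np.clip(1.0 / (1.0 + np.exp(-x)), 0.02, 0.98)   # clipped propensities
d = rng.binomial(1, m)                               # treatment consistent with m
print(round(ortho_tstat_g1(np.ones_like(x), d, m), 2))  # should typically satisfy |t| < 2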
rep_score['orthogonality_derivatives']
| | basis | d_g1 | se_g1 | t_g1 | d_g0 | se_g0 | t_g0 | d_m | se_m | t_m |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | -0.233097 | 0.053585 | -4.350018 | 0.036084 | 0.017458 | 2.066835 | 0.314777 | 3.739844 | 0.084168 |
| 1 | 1 | -0.012467 | 0.058847 | -0.211863 | 0.029152 | 0.025305 | 1.152026 | 0.598770 | 3.992048 | 0.149991 |
| 2 | 2 | 0.021350 | 0.060963 | 0.350206 | 0.038320 | 0.022736 | 1.685394 | -5.950311 | 5.773734 | -1.030583 |
| 3 | 3 | 0.125716 | 0.061772 | 2.035176 | 0.047856 | 0.023043 | 2.076780 | 2.428545 | 5.321692 | 0.456348 |
| 4 | 4 | 0.007767 | 0.047830 | 0.162379 | 0.052762 | 0.029293 | 1.801146 | -1.800507 | 2.686426 | -0.670224 |
| 5 | 5 | 0.007035 | 0.054763 | 0.128462 | 0.002395 | 0.015985 | 0.149811 | 2.012890 | 3.491102 | 0.576577 |
oos_tstat_fold, oos_tstat_strict#
Here’s the math behind the two OOS (out-of-sample) moment t-stats used in the diagnostics. Assume K-fold cross-fitting with held-out index sets \((I_k)\) (size \(n_k\)) and complements \((R_k)\).
Step 1 — Leave-fold-out \((\hat\theta_{-k})\)#
For the moment condition \(\mathbb{E}[\psi_a(W)\,\theta + \psi_b(W)] = 0\), the leave-fold-out estimate used on fold \(k\) is
\[ \hat\theta_{-k} = -\,\frac{\sum_{i \in R_k} \psi_b(W_i)}{\sum_{i \in R_k} \psi_a(W_i)}. \]
Step 2 — Held-out scores on fold \((k)\)#
Define the fold-specific held-out score for \(i\in I_k\):
\[ \psi_i^{(k)} = \psi_a(W_i)\,\hat\theta_{-k} + \psi_b(W_i). \]
Compute per-fold mean and variance:
\[ \bar\psi_k = \frac{1}{n_k}\sum_{i\in I_k}\psi_i^{(k)}, \qquad s_k^2 = \frac{1}{n_k-1}\sum_{i\in I_k}\bigl(\psi_i^{(k)}-\bar\psi_k\bigr)^2. \]
OOS t-stat diagnostics#
\((\texttt{oos\_tstat\_fold})\)#
A fold-aggregated, variance-weighted t-statistic:
Intuition: averages fold means and scales by a fold-pooled standard error.
\((\texttt{oos\_tstat\_strict})\)#
A “strict” t-stat using every held-out observation directly:
\[ t_{\text{strict}} = \frac{\bar\psi}{\widehat{\mathrm{se}}(\bar\psi)}, \qquad \bar\psi = \frac{1}{n}\sum_{k}\sum_{i\in I_k}\psi_i^{(k)}, \qquad \widehat{\mathrm{se}}(\bar\psi) = \frac{\mathrm{sd}\bigl(\{\psi_i^{(k)}\}\bigr)}{\sqrt{n}}. \]
Intuition: computes a single overall mean and standard error across all held-out scores (often slightly more conservative).
Interpretation#
Under a valid design and correct cross-fitting (so that \(\mathbb{E}[\psi]=0\) out of sample), both statistics are approximately standard normal: \(t \stackrel{a}{\sim} \mathcal{N}(0,1)\).
Values near \(0\) indicate that the moment condition holds out of sample. Large \(|t|\) suggests overfitting, leakage, or nuisance miscalibration.
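A sketch of the strict variant, assuming `psi` collects the held-out scores \(\psi_i^{(k)}\) from all folds (illustrative):
import numpy as np

def oos_tstat_strict(psi: np.ndarray) -> float:
    """Overall mean of the held-out scores divided by its standard error."""
    psi = np.asarray(psi, float)
    return float(psi.mean() / (psi.std(ddof=1) / np.sqrt(len(psi))))

rng = np.random.default_rng(0)
psi = rng.normal(scale=6.0, size=10_000)   # stand-in for held-out AIPW scores
print(round(oos_tstat_strict(psi), 2))     # near 0 when the moment condition holds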
rep_score['oos_moment_test']
{'fold_results': fold n psi_mean psi_var
0 0 2500 -0.002503 37.561660
1 1 2500 0.100558 54.360122
2 2 2500 0.068724 31.028728
3 3 2500 -0.166779 32.522161,
'tstat_fold_agg': -2.5529434141490394e-15,
'pvalue_fold_agg': 0.999999999999998,
'tstat_strict': -2.461798420221801e-15,
'pvalue_strict': 0.999999999999998,
'interpretation': 'Near 0 indicates moment condition holds.'}
SUTVA#
print_sutva_questions()
1.) Are your clients independent (i)?
2.) Do you measure confounders, treatment, and outcome in the same intervals?
3.) Do you measure confounders before treatment and outcome after?
4.) Do you have a consistent label of treatment, such as if a person does not receive a treatment, he has a label 0?
These assumptions are statistically untestable; they must be justified through the research design.
Unconfoundedness#
from causalis.refutation.unconfoundedness.uncofoundedness_validation import run_unconfoundedness_diagnostics
rep_uc = run_unconfoundedness_diagnostics(res=ate_result)
rep_uc['summary']
| | metric | value | flag |
|---|---|---|---|
| 0 | balance_max_smd | 0.144968 | YELLOW |
| 1 | balance_frac_violations | 0.200000 | YELLOW |
balance\_max\_smd#
For each covariate \(X_j\), the (weighted) standardized mean difference is
\[ \mathrm{SMD}_j = \frac{\bigl|\mu_{1j} - \mu_{0j}\bigr|}{\sqrt{\tfrac{1}{2}\bigl(\sigma^2_{1j} + \sigma^2_{0j}\bigr)}}. \]
Group means and variances are computed under the IPW weights implied by your estimand:
ATE: \(w_{1i} = \tfrac{D_i}{\hat m_i}\), \(w_{0i} = \tfrac{1-D_i}{1-\hat m_i}\)
ATTE: \(w_{1i} = D_i\), \(w_{0i} = (1-D_i)\tfrac{\hat m_i}{1-\hat m_i}\)
(If normalize=True, each weight vector is divided by its mean.)
Weighted means and variances:
\[ \mu_{gj} = \frac{\sum_i w_{gi}\,X_{ij}}{\sum_i w_{gi}}, \qquad \sigma^2_{gj} = \frac{\sum_i w_{gi}\,(X_{ij}-\mu_{gj})^2}{\sum_i w_{gi}}, \qquad g \in \{0,1\}. \]
Special cases in the code:
If both variances are \(\approx 0\) and \(|\mu_{1j}-\mu_{0j}| \approx 0\) ⇒ \(\mathrm{SMD}_j = 0\)
If both variances are \(\approx 0\) but means differ ⇒ \(\mathrm{SMD}_j = \infty\)
If denominator is \(\approx 0\) otherwise ⇒ \(\mathrm{SMD}_j = \text{NaN}\)
Then
\[ \texttt{balance\_max\_smd} = \max_j \mathrm{SMD}_j, \]
implemented as a nanmax over the vector of \(\mathrm{SMD}_j\).
NaNs are ignored; if any feature produced \(\infty\), the max is \(\infty\).
balance\_frac\_violations#
Let the SMD threshold be \(\tau\) (default \(0.10\)). Define the set of finite SMDs \(\mathcal F = \{\, j : \mathrm{SMD}_j \ \text{finite} \,\}\). Then the fraction of violations is
\[ \texttt{balance\_frac\_violations} = \frac{\#\{\, j \in \mathcal F : \mathrm{SMD}_j > \tau \,\}}{\#\mathcal F}. \]
So it’s the share of covariates whose weighted SMD exceeds the threshold, computed only over finite SMDs (NaN / Inf are excluded from the denominator).
Quick interpretation#
Smaller is better. A common rule of thumb is \(\mathrm{SMD} \le 0.10\).
balance_max_smd tells you the worst residual imbalance across covariates; balance_frac_violations tells you how many covariates (as a fraction) still exceed the chosen threshold.
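A sketch of the weighted SMD for a single covariate under ATE weights; `x`, `d`, and `m` are a covariate, the treatment, and propensities (illustrative, not the package code).
import numpy as np

def weighted_smd_ate(x: np.ndarray, d: np.ndarray, m: np.ndarray) -> float:
    """Weighted standardized mean difference of one covariate under ATE IPW weights."""
    x, d, m = np.asarray(x, float), np.asarray(d, float), np.asarray(m, float)
    w1, w0 = d / m, (1 - d) / (1 - m)
    mu1, mu0 = np.sum(w1 * x) / w1.sum(), np.sum(w0 * x) / w0.sum()
    v1 = np.sum(w1 * (x - mu1) ** 2) / w1.sum()
    v0 = np.sum(w0 * (x - mu0) ** 2) / w0.sum()
    denom = np.sqrt((v1 + v0) / 2.0)
    return float(abs(mu1 - mu0) / denom) if denom > 0 else float("nan")

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)                          # a confounder driving treatment
m = np.clip(1.0 / (1.0 + np.exp(-x)), 0.02, 0.98)
d = rng.binomial(1, m)
print(round(weighted_smd_ate(x, d, m), 3))           # small: weighting removes the imbalance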
Sensitivity analysis#
1) sensitivity_analysis: bias-aware CI#
Goal. Start from your estimator \((\hat\theta)\) with sampling standard error \((se)\). Allow a controlled amount of worst-case hidden confounding through three knobs \((cf_y, cf_d, \rho)\). Inflate the uncertainty by an additive “max bias”.
Step A — Sampling part#
Point estimate \(\hat\theta\), standard error \(se\), and \(z_\alpha\) for level \((1-\alpha)\).
Usual sampling CI:
\[ [\,\hat\theta - z_\alpha\,se,\ \hat\theta + z_\alpha\,se\,]. \]
Step B — Confounding geometry#
The code pulls sensitivity elements from the fitted IRM:
\(\sigma^2\): the asymptotic variance of the estimator’s EIF (so that \(se = \sqrt{\sigma^2}\) in the module’s normalization).
\(m_\alpha(i) \ge 0\): per-unit weight for the outcome channel (how outcome-model misspecification moves the EIF).
\(r(i)\) (“riesz_rep”): per-unit weight for the treatment channel (how propensity-model misspecification moves the EIF).
We turn the user’s sensitivity knobs into a quadratic budget for adversarial confounding:
\[
\begin{aligned}
a_i &:= \sqrt{2\,m_\alpha(i)}, \\
b_i &:= \begin{cases} |r(i)|, & \text{default (worst-case sign)} \\ r(i), & \text{if } \texttt{use\_signed\_rr=True} \end{cases} \\
\text{base}_i &:= a_i^2\,cf_y + b_i^2\,cf_d + 2\,\rho\,\sqrt{cf_y\,cf_d}\;a_i b_i \;\ge\; 0, \\
\nu^2 &:= \mathbb{E}_n[\text{base}_i].
\end{aligned}
\]
\(cf_y \ge 0\): strength of unobserved outcome disturbance
\(cf_d \ge 0\): strength of unobserved treatment disturbance
\(\rho \in [-1,1]\): their correlation
This \(\nu^2\) is a dimensionless bias multiplier — how sensitive the EIF is to those perturbations.
Step C — Max bias and intervals#
Two equivalent forms appear in the code:
\[ \text{max\_bias} = \sqrt{\sigma^2\,\nu^2} \quad\Longleftrightarrow\quad \text{max\_bias} = \sqrt{\nu^2}\;se \]
(in the module's normalization, where \(se = \sqrt{\sigma^2}\)).
Then the module reports:
Confounding bounds for \(\theta\):
\[ [\,\hat\theta - \text{max\_bias},\; \hat\theta + \text{max\_bias}\,]. \]
Bias-aware CI (sampling + confounding, worst-case additive):
\[ \Big[\,\hat\theta - (\text{max\_bias} + z_\alpha\,se),\; \hat\theta + (\text{max\_bias} + z_\alpha\,se)\,\Big]. \]
(So you’re adding sampling error and the adversarial bias linearly for a conservative envelope.)
Notes & edge handling
Numeric PSD clamping ensures \(\text{base}_i \ge 0\); \(\rho\) is clipped to \([-1,1]\).
If \(cf_y = cf_d = 0 \Rightarrow \nu^2 = 0 \Rightarrow\) bias-aware CI collapses to the sampling CI.
Internally, a delta-method IF for \(\text{max\_bias}\) is
\[ \psi_{\text{max}}(i) = \frac{\sigma^2\,\psi_{\nu^2}(i) + \nu^2\,\psi_{\sigma^2}(i)} {2\,\text{max\_bias}}, \]
matching \(\text{max\_bias} = \sqrt{\sigma^2\nu^2}\) (used for coherent summaries).
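As a worked check of this arithmetic, using \(\text{max\_bias}=\sqrt{\nu^2}\,se\) (see the summary at the end of this section) and the numbers reported by the sensitivity_analysis call further down:
import numpy as np
from scipy.stats import norm

theta, se, nu2 = 0.9917276, 0.0623398, 1.0352732   # values from the output below
z = norm.ppf(0.975)
max_bias = np.sqrt(nu2) * se                        # ~0.0634, matches max_bias in the output
sampling_ci = (theta - z * se, theta + z * se)
bias_aware_ci = (theta - (max_bias + z * se), theta + (max_bias + z * se))
print(round(max_bias, 4))
print(tuple(round(v, 4) for v in sampling_ci))
print(tuple(round(v, 4) for v in bias_aware_ci))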
2) sensitivity_benchmark: calibrating \((cf_y, cf_d, \rho)\) from omitted covariates#
Goal.
Pick a set \(Z\) of candidate “omitted” covariates (the benchmarking_set).
Refit a short IRM that excludes \(Z\) and compare it to the long (original) model.
Use how well \(Z\) explains residual variation to derive plausible \((cf_y, cf_d, \rho)\).
Step A — Long vs short estimates#
Long: \(\hat\theta_{\text{long}}\) (original model).
Short: \(\hat\theta_{\text{short}}\) (drop \(Z\), same learners/hyperparams).
Report \(\Delta = \hat\theta_{\text{long}} - \hat\theta_{\text{short}}\).
Step B — Residuals from the long model#
Let \(g_1, g_0, \hat m\) be the outcome and propensity learners from the long model, and define the residuals
\[ r_{y,i} = Y_i - \bigl(D_i\,g_1(X_i) + (1-D_i)\,g_0(X_i)\bigr), \qquad r_{d,i} = D_i - \hat m(X_i). \]
These are the EIF’s outcome and treatment residual components.
Step C — How much of each residual does \(Z\) explain?#
Regress \(r_y\) on \(Z\) and \(r_d\) on \(Z\) (unweighted OLS; ATT case uses ATT weights):
Obtain \(R^2_y\) and \(R^2_d\).
Convert to signal-to-noise ratios (the “strength” of confounding channels):
\[ cf_y = \frac{R^2_y}{1 - R^2_y}, \qquad cf_d = \frac{R^2_d}{1 - R^2_d}. \]
(These are the same \(R^2 / (1 - R^2)\) maps used in modern partial-\(R^2\) robustness frameworks.)
Compute the correlation between the fitted pieces from those two regressions:
\[ \rho = \operatorname{corr}\bigl(\widehat{r}_y(Z),\; \widehat{r}_d(Z)\bigr), \]
weighted for ATT when applicable, then clipped to \([-1,1]\).
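A sketch of the \(R^2 \to cf\) mapping and the correlation of fitted pieces, assuming `r_y`, `r_d` are long-model residuals and `Z` a matrix of candidate omitted covariates (illustrative only):
import numpy as np

def r2_and_fitted(resid: np.ndarray, Z: np.ndarray) -> tuple:
    """OLS of a residual on Z (with intercept); returns R^2 and fitted values."""
    X = np.column_stack([np.ones(len(Z)), Z])
    beta, *_ = np.linalg.lstsq(X, resid, rcond=None)
    fitted = X @ beta
    r2 = 1.0 - np.sum((resid - fitted) ** 2) / np.sum((resid - resid.mean()) ** 2)
    return float(r2), fitted

def benchmark_knobs(r_y: np.ndarray, r_d: np.ndarray, Z: np.ndarray) -> tuple:
    """Map residual R^2's to (cf_y, cf_d) and correlate the fitted pieces for rho."""
    r2_y, fit_y = r2_and_fitted(r_y, Z)
    r2_d, fit_d = r2_and_fitted(r_d, Z)
    cf_y, cf_d = r2_y / (1 - r2_y), r2_d / (1 - r2_d)
    rho = float(np.clip(np.corrcoef(fit_y, fit_d)[0, 1], -1.0, 1.0))
    return cf_y, cf_d, rho

rng = np.random.default_rng(0)
Z = rng.normal(size=(5_000, 1))
r_y = 0.3 * Z[:, 0] + rng.normal(size=5_000)
r_d = 0.2 * Z[:, 0] + rng.normal(size=5_000)
print(benchmark_knobs(r_y, r_d, Z))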
Outputs#
A one-row DataFrame (indexed by the treatment name) with columns cf_y, cf_d, rho, theta_long, theta_short, and delta.
You can pass \((cf_y, cf_d, \rho)\) straight into sensitivity_analysis to get the associated bias-aware interval.
Intuitively, this calibrates how strong hidden confounding would need to be by using a concrete, observed proxy \(Z\).
How to read them together#
Use sensitivity_benchmark with a plausible omitted set \(Z\) to derive \((cf_y, cf_d, \rho)\) and observe the actual estimate shift \(\Delta\).
Plug those \((cf_y, cf_d, \rho)\) into sensitivity_analysis to get:
\[ \text{max\_bias} = \sqrt{\nu^2}\,se, \qquad \text{Bias-aware CI} = \hat\theta \pm (\text{max\_bias} + z_\alpha\,se). \]
Small \(cf\) values (or \(\rho \approx 0\)) ⇒ tiny \(\nu^2\) ⇒ bias-aware CI near the sampling CI. Large \(cf\) values and \(|\rho|\approx 1\) widen it, reflecting stronger plausible hidden confounding.
from causalis.refutation.unconfoundedness.sensitivity import (
sensitivity_analysis, sensitivity_benchmark
)
sensitivity_analysis(ate_result, cf_y=0.01, cf_d=0.01, rho=1.0, level=0.95)
{'theta': 0.9917276396749556,
'se': 0.06233979878689177,
'level': 0.95,
'z': 1.959963984540054,
'sampling_ci': (0.869543879249174, 1.1139114001007373),
'theta_bounds_confounding': (0.9282979061474285, 1.0551573732024828),
'bias_aware_ci': (0.8061141457216469, 1.1773411336282644),
'max_bias': 0.06342973352752714,
'sigma2': 1.0690078122954034,
'nu2': 1.0352732234146504,
'params': {'cf_y': 0.01, 'cf_d': 0.01, 'rho': 1.0, 'use_signed_rr': False}}
sensitivity_benchmark(ate_result, benchmarking_set =['tenure_months'])
| | cf_y | cf_d | rho | theta_long | theta_short | delta |
|---|---|---|---|---|---|---|
| d | 0.000001 | 1.951733e-08 | -1.0 | 0.991728 | 1.064098 | -0.07237 |