Linear and Nonlinear Data Generating Process Benchmarking#
Linear Data Generating Process (DGP)#
Let \(i=1,\dots,n\). Draw confounders:

$$
\begin{aligned}
\text{tenure\_months}_i &\sim \mathcal{N}(24,\ 12^2), &
\text{avg\_sessions\_week}_i &\sim \mathcal{N}(5,\ 2^2), \\
\text{spend\_last\_month}_i &\sim \mathrm{Uniform}(0,\ 200), &
\text{premium\_user}_i &\sim \mathrm{Bernoulli}(0.25), \\
\text{urban\_resident}_i &\sim \mathrm{Bernoulli}(0.60). & &
\end{aligned}
$$

Stack them as \(X_i = \big(\text{tenure\_months}_i,\ \text{avg\_sessions\_week}_i,\ \text{spend\_last\_month}_i,\ \text{premium\_user}_i,\ \text{urban\_resident}_i\big)^\top \in \mathbb{R}^5\).
Treatment model#

Define

$$
m(x) \equiv \Pr(D=1\mid X=x) = \sigma\big(\alpha_d + \beta_d^\top x\big), \qquad \sigma(z)=\frac{1}{1+e^{-z}},
$$

with \(\alpha_d\) calibrated (by bisection) so that \(\mathbb{E}[D]\approx 0.20\), and \(\beta_d\) as specified in the code below. Then

$$
D_i \mid X_i \sim \mathrm{Bernoulli}\big(m(X_i)\big).
$$
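The calibration of \(\alpha_d\) is a one-dimensional root-finding problem: shift the logit scores until the average propensity hits the target rate. A minimal sketch of the idea (a hypothetical `calibrate_alpha_d` helper for illustration, not the generator's actual internals):

import numpy as np

def calibrate_alpha_d(scores, target_rate=0.20, lo=-20.0, hi=20.0, tol=1e-6):
    """Bisection: find alpha so that mean sigmoid(alpha + scores) ~ target_rate."""
    alpha = 0.5 * (lo + hi)
    for _ in range(200):
        alpha = 0.5 * (lo + hi)
        rate = float(np.mean(1.0 / (1.0 + np.exp(-(alpha + scores)))))
        if abs(rate - target_rate) < tol:
            break
        if rate < target_rate:      # mean propensity is increasing in alpha
            lo = alpha
        else:
            hi = alpha
    return alpha

# For the linear DGP the scores would be X @ beta_d.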
Outcome model#

$$
Y_i = \alpha_y + \Theta\, D_i + \beta_y^\top X_i + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0,\ \sigma_y^2),
$$

with \(\alpha_y=0,\ \sigma_y=1,\ \Theta=0.80\), and \(\beta_y\) as specified in the code below.
Oracle nuisances (IRM) and CATE#

The oracle propensity score and outcome regression are

$$
m_0(x) = \sigma\big(\alpha_d + \beta_d^\top x\big), \qquad
g_0(d, x) = \alpha_y + \Theta\, d + \beta_y^\top x,
$$

and the conditional average treatment effect is constant: \(\tau(x) = \Theta\).

Targets#

$$
\mathrm{ATE} = \mathbb{E}[\tau(X)], \qquad \mathrm{ATT} = \mathbb{E}[\tau(X) \mid D=1].
$$

So under this constant-effect DGP:

$$
\mathrm{ATE} = \mathrm{ATT} = \Theta = 0.80.
$$
Let’s generate the data#
import numpy as np
from typing import List, Dict, Any
from causalis.data import CausalDatasetGenerator
confounder_specs: List[Dict[str, Any]] = [
{"name": "tenure_months", "dist": "normal", "mu": 24, "sd": 12},
{"name": "avg_sessions_week", "dist": "normal", "mu": 5, "sd": 2},
{"name": "spend_last_month", "dist": "uniform", "a": 0, "b": 200},
{"name": "premium_user", "dist": "bernoulli","p": 0.25},
{"name": "urban_resident", "dist": "bernoulli","p": 0.60},
]
# Moderate, sensible effects by column name (linear, well-specified)
# Outcome: higher sessions, tenure, spend, premium, urban -> higher Y
beta_y_map = {
"tenure_months": 0.05, # ~0.6 SD shift at +1 SD (12 months)
"avg_sessions_week": 0.60, # strong engagement signal
"spend_last_month": 0.005, # scale 0..200 => up to ~1 shift
"premium_user": 0.80,
"urban_resident": 0.20,
}
# Treatment score: moderate dependence on engagement, spend, premium, urban
beta_d_map = {
"tenure_months": 0.08,
"avg_sessions_week": 0.12,
"spend_last_month": 0.004,
"premium_user": 0.25,
"urban_resident": 0.10,
}
def expand_beta_from_specs(specs: List[Dict[str, Any]], beta_map: Dict[str, float]) -> np.ndarray:
"""Create β aligned to the generator's X column order from confounder_specs."""
betas = []
for spec in specs:
name = spec.get("name", "")
dist = str(spec.get("dist", "normal")).lower()
if dist in ("normal", "uniform", "bernoulli"):
betas.append(beta_map.get(name, 0.0))
else:
raise ValueError(f"Unsupported dist in this simple setup: {dist}")
return np.asarray(betas, dtype=float)
beta_y = expand_beta_from_specs(confounder_specs, beta_y_map)
beta_d = expand_beta_from_specs(confounder_specs, beta_d_map)
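# Illustrative sanity check: the betas follow the spec order
# (tenure_months, avg_sessions_week, spend_last_month, premium_user, urban_resident),
# so beta_y should be [0.05, 0.60, 0.005, 0.80, 0.20] and beta_d [0.08, 0.12, 0.004, 0.25, 0.10].
print("beta_y:", beta_y)
print("beta_d:", beta_d)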
gen = CausalDatasetGenerator(
theta=0.80, # constant treatment effect
tau=None, # use theta
beta_y=beta_y, # linear connection between X and Y
beta_d=beta_d, # linear connection between X and D
g_y=None, g_d=None, # no nonlinearities
alpha_y=0.0, # no intercept
alpha_d=0.0, # no intercept
sigma_y=1.0, # noise of y
outcome_type="continuous", # Gaussian Y
confounder_specs=confounder_specs,
u_strength_d=0.0, # strength of latent confounder influence on treatment
u_strength_y=0.0, # strength of latent confounder influence on outcome
propensity_sharpness=1.0, # increase to make overlap harder
target_d_rate=0.20, # rate of treatment assignment
seed=123 # random seed for reproducibility
)
n = 10_000 # Number of observations
df = gen.generate(n)
print("Treatment share ≈", df["d"].mean())
true_ate = float(df["cate"].mean())
print(f"Ground-truth ATE from the DGP: {true_ate:.3f}")
# Ground-truth ATT (on the natural scale): E[tau(X) | D=1] = mean CATE among the treated
true_att = float(df.loc[df["d"] == 1, "cate"].mean())
print(f"Ground-truth ATT from the DGP: {true_att:.3f}")
Treatment share ≈ 0.2052
Ground-truth ATE from the DGP: 0.800
Ground-truth ATT from the DGP: 0.800
Wrap it in a CausalData object#
from causalis.data import CausalData
causal_data = CausalData(
df=df,
treatment="d",
outcome="y",
confounders=["tenure_months",
"avg_sessions_week",
"spend_last_month",
"premium_user",
"premium_user",
"urban_resident"]
)
causal_data.df.head()
|   | y | d | tenure_months | avg_sessions_week | spend_last_month | premium_user | urban_resident |
|---|---|---|---|---|---|---|---|
| 0 | 1.903910 | 0.0 | 12.130544 | 4.056687 | 181.570607 | 0.0 | 0.0 |
| 1 | 3.388144 | 0.0 | 19.586560 | 1.671561 | 182.793598 | 0.0 | 0.0 |
| 2 | 8.456512 | 1.0 | 39.455103 | 5.452889 | 125.185708 | 1.0 | 1.0 |
| 3 | 5.535970 | 1.0 | 26.327693 | 5.051629 | 4.932905 | 0.0 | 1.0 |
| 4 | 4.965140 | 1.0 | 35.042771 | 4.933996 | 23.577407 | 0.0 | 0.0 |
Estimate ATE and ATTE#
from causalis.inference.ate import dml_ate
# Estimate Average Treatment Effect (ATE)
ate_result = dml_ate(causal_data, n_folds=4, normalize_ipw=False, store_diagnostic_data=False, random_state=123)
print(f"Real ATE = 0.8 VS Estimated = {ate_result.get('coefficient')} in {ate_result.get('confidence_interval')}")
Real ATE = 0.8 VS Estimated = 0.7506313322797796 in (0.6668061755942251, 0.8344564889653342)
from causalis.inference.atte import dml_atte
# Estimate Average Treatment Effect on the Treated (ATTE)
atte_result = dml_atte(causal_data, n_folds=4, normalize_ipw=False, store_diagnostic_data=False, random_state=123)
print(f"Real ATTE = 0.8 VS Estimated = {atte_result.get('coefficient')} in {atte_result.get('confidence_interval')}")
Real ATTE = 0.8 VS Estimated = 0.8238669785039335 in (0.7558326760362799, 0.8919012809715872)
The estimated ATE and ATTE are close to the ground-truth values, but the confidence intervals are fairly wide. Adding more data narrows them; you can try this by increasing `n` in the DGP above.
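As a quick (and slower) experiment, you could regenerate a larger sample with the same generator and re-estimate; a sketch using only the functions already imported above:

# Sketch: a larger sample should give narrower confidence intervals
df_big = gen.generate(100_000)
causal_data_big = CausalData(
    df=df_big,
    treatment="d",
    outcome="y",
    confounders=["tenure_months", "avg_sessions_week", "spend_last_month",
                 "premium_user", "urban_resident"],
)
ate_big = dml_ate(causal_data_big, n_folds=4, normalize_ipw=False,
                  store_diagnostic_data=False, random_state=123)
print("ATE:", ate_big.get("coefficient"), "CI:", ate_big.get("confidence_interval"))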
Nonlinear Data Generating Process (DGP)#
Let \(i=1,\dots,n\). Draw the same confounders as in the linear DGP above and stack them as \(X_i \in \mathbb{R}^5\) in the same column order.
Treatment model#
Define

$$
m(x) \equiv \Pr(D=1\mid X=x) = \sigma\big(\alpha_d + g_d(x)\big), \qquad \sigma(z)=\frac{1}{1+e^{-z}},
$$

with \(\alpha_d\) calibrated (by bisection) so that \(\mathbb{E}[D]\approx 0.20\).

The nonlinear score includes an alignment term with the treatment effect \(\tau(x)\) (writing \(\text{tenure}\), \(\text{sess}\), \(\text{spend}\), \(\text{prem}\), \(\text{urban}\) for the five confounders):

$$
\begin{aligned}
g_d(x) ={}& 1.10\,\tanh\!\big(0.06\,(\text{spend}-100)\big)
+ 1.00\,\sigma\!\big(0.60\,(\text{sess}-5)\big)
+ 0.50\,\log\!\big(1+\text{tenure}_+\big) \\
&+ 0.50\,\text{prem}
+ 0.25\,\text{urban}
+ 0.90\,\text{prem}\,\mathbb{1}\{\text{spend}>120\}
+ 0.30\,\text{urban}\,\mathbb{1}\{\text{tenure}<12\}
+ 0.80\,\tau(x),
\end{aligned}
$$

where \(z_+ = \max(z,0)\) and \(\mathbb{1}\{\cdot\}\) is the indicator. Then

$$
D_i \mid X_i \sim \mathrm{Bernoulli}\big(m(X_i)\big).
$$
Outcome model#

$$
Y_i = \alpha_y + g_y(X_i) + \tau(X_i)\, D_i + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0,\ \sigma_y^2),
$$

with \(\alpha_y=0,\ \sigma_y=1\).

The baseline outcome component is nonlinear:

$$
g_y(x) = 0.70\,\tanh\!\big(0.03\,(\text{spend}-80)\big)
+ 0.50\,\sqrt{\text{sess}_+}
+ 0.40\,\log\!\big(1+\text{tenure}_+\big)
+ 0.30\,\text{prem}
+ 0.10\,\text{urban}
- 0.10\,\mathbb{1}\{\text{spend}<20\}.
$$

The heterogeneous treatment effect (CATE) is

$$
\tau(x) = 0.40
+ 0.60\,\sigma\!\big(0.40\,(\text{sess}-5)\big)
+ 2.00\,\text{prem}\,\mathbb{1}\{\text{spend}>120\}
+ 0.10\,\text{urban}\,\mathbb{1}\{\text{tenure}<12\}.
$$
Oracle nuisances (IRM) and CATE#

$$
m_0(x) = \sigma\big(\alpha_d + g_d(x)\big), \qquad
g_0(d, x) = \alpha_y + g_y(x) + \tau(x)\, d,
$$

and the CATE is the \(\tau(x)\) defined above.

Targets#

$$
\mathrm{ATE} = \mathbb{E}[\tau(X)], \qquad \mathrm{ATT} = \mathbb{E}[\tau(X) \mid D=1].
$$

Because \(g_d\) contains the alignment term \(0.80\,\tau(x)\), units with larger treatment effects are more likely to be treated, so the ATT exceeds the ATE in this DGP.
import numpy as np
from typing import List, Dict, Any, Tuple
from causalis.data import CausalDatasetGenerator
# 1) Confounders
confounder_specs: List[Dict[str, Any]] = [
{"name": "tenure_months", "dist": "normal", "mu": 24, "sd": 12},
{"name": "avg_sessions_week", "dist": "normal", "mu": 5, "sd": 2},
{"name": "spend_last_month", "dist": "uniform", "a": 0, "b": 200},
{"name": "premium_user", "dist": "bernoulli", "p": 0.25},
{"name": "urban_resident", "dist": "bernoulli", "p": 0.60},
]
# 2) Feature index map
def feature_indices_from_specs(specs: List[Dict[str, Any]]) -> Dict[str, Tuple[int, ...]]:
idx, out = 0, {}
for spec in specs:
name = spec.get("name", "")
dist = str(spec.get("dist","normal")).lower()
if dist in ("normal","uniform","bernoulli"):
out[name] = (idx,); idx += 1
else:
raise ValueError(f"Unsupported dist: {dist}")
return out
feat = feature_indices_from_specs(confounder_specs)
def col(X, key): return X[:, feat[key][0]]
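# For these confounder_specs the resulting map is
# {"tenure_months": (0,), "avg_sessions_week": (1,), "spend_last_month": (2,),
#  "premium_user": (3,), "urban_resident": (4,)},
# so col(X, "spend_last_month") reads the third column of the generator's X matrix.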
def _log1p_pos(x): return np.log1p(np.clip(x, 0.0, None))
def _sqrt_pos(x): return np.sqrt(np.clip(x, 0.0, None))
def _ind(cond): return cond.astype(float)
def _sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
# 3) g_d(x) - nonlinear connection between X and D
def g_d(X: np.ndarray) -> np.ndarray:
tenure = col(X, "tenure_months")
sess = col(X, "avg_sessions_week")
spend = col(X, "spend_last_month")
prem = col(X, "premium_user")
urban = col(X, "urban_resident")
tau_align = tau_func(X) # explicit alignment
return (
1.10 * np.tanh(0.06*(spend - 100.0))
+ 1.00 * _sigmoid(0.60*(sess - 5.0))
+ 0.50 * _log1p_pos(tenure)
+ 0.50 * prem
+ 0.25 * urban
+ 0.90 * prem * _ind(spend > 120.0)
+ 0.30 * urban * _ind(tenure < 12.0)
+ 0.80 * tau_align # direct alignment term (λ)
)
# 4) g_y(x) - nonlinear connection between X and Y
def g_y(X: np.ndarray) -> np.ndarray:
tenure = col(X, "tenure_months")
sess = col(X, "avg_sessions_week")
spend = col(X, "spend_last_month")
prem = col(X, "premium_user")
urban = col(X, "urban_resident")
return (
0.70 * np.tanh(0.03*(spend - 80.0))
+ 0.50 * _sqrt_pos(sess)
+ 0.40 * _log1p_pos(tenure)
+ 0.30 * prem
+ 0.10 * urban
- 0.10 * _ind(spend < 20.0)
)
# 5) tau(x) — nonlinear effect function (CATE)
def tau_func(X: np.ndarray) -> np.ndarray:
tenure = col(X, "tenure_months")
sess = col(X, "avg_sessions_week")
spend = col(X, "spend_last_month")
prem = col(X, "premium_user")
urban = col(X, "urban_resident")
return (
0.40
+ 0.60 * (1.0 / (1.0 + np.exp(-0.40*(sess - 5.0)))) # sigmoid
+ 2 * prem * _ind(spend > 120.0)
+ 0.10 * urban * _ind(tenure < 12.0)
)
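# Illustrative check of the heterogeneity (hypothetical example rows; column order
# follows confounder_specs: tenure, sessions, spend, premium, urban):
X_demo = np.array([
    [24.0, 5.0, 150.0, 1.0, 0.0],  # premium user with high spend -> +2.00 bump
    [24.0, 5.0,  50.0, 0.0, 0.0],  # baseline-ish user
])
print(tau_func(X_demo))  # expected approximately [2.70, 0.70]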
# 6) Generator — continuous outcome
gen = CausalDatasetGenerator(
theta=0.0, # ignored; we pass tau
tau=tau_func, # nonlinear effect
beta_y=None, beta_d=None, # use nonlinear g_* only
g_y=g_y, g_d=g_d, # nonlinear functions
alpha_y=0.0, # baseline mean level, intercept
alpha_d=0.0, # will be calibrated to target_d_rate
sigma_y=1.0, # noise std for Y
outcome_type="continuous", # outcome distribution
confounder_specs=confounder_specs,
target_d_rate=0.20, # 20% will be treated
u_strength_d=0.0, # strength of latent confounder influence on treatment
u_strength_y=0.0, # strength of latent confounder influence on outcome
propensity_sharpness=1, # increase to make overlap harder
seed=123 # random seed for reproducibility
)
# 7) Generate
n = 10_000 # Number of observations
df = gen.generate(n)
print("Treatment share ≈", df["d"].mean())
true_ate = float(df["cate"].mean())
print(f"Ground-truth ATE from the DGP: {true_ate:.3f}")
# Ground-truth ATT (on the natural scale): E[tau(X) | D=1] = mean CATE among the treated
true_att = float(df.loc[df["d"] == 1, "cate"].mean())
print(f"Ground-truth ATT from the DGP: {true_att:.3f}")
Treatment share ≈ 0.2036
Ground-truth ATE from the DGP: 0.913
Ground-truth ATT from the DGP: 1.567
from causalis.data import CausalData
causal_data = CausalData(
df=df,
treatment="d",
outcome="y",
confounders=["tenure_months",
"avg_sessions_week",
"spend_last_month",
"premium_user",
"urban_resident"]
)
causal_data.df.head()
|   | y | d | tenure_months | avg_sessions_week | spend_last_month | premium_user | urban_resident |
|---|---|---|---|---|---|---|---|
| 0 | 0.689404 | 0.0 | 12.130544 | 4.056687 | 181.570607 | 0.0 | 0.0 |
| 1 | 3.045282 | 0.0 | 19.586560 | 1.671561 | 182.793598 | 0.0 | 0.0 |
| 2 | 7.173595 | 1.0 | 39.455103 | 5.452889 | 125.185708 | 1.0 | 1.0 |
| 3 | 1.926216 | 0.0 | 26.327693 | 5.051629 | 4.932905 | 0.0 | 1.0 |
| 4 | 1.225088 | 0.0 | 35.042771 | 4.933996 | 23.577407 | 0.0 | 0.0 |
from causalis.inference.ate import dml_ate
# Estimate Average Treatment Effect (ATE)
ate_result = dml_ate(causal_data, n_folds=4, normalize_ipw=False, store_diagnostic_data=False, random_state=123)
print(f"Real ATE = {true_ate:.3f} VS Estimated = {ate_result.get('coefficient')} in {ate_result.get('confidence_interval')}")
Real ATE = 0.913 VS Estimated = 0.9917276396749556 in (0.869543879249174, 1.1139114001007373)
from causalis.inference.atte import dml_atte
# Estimate Average Treatment Effect on the Treated (ATTE)
atte_result = dml_atte(causal_data, n_folds=4, normalize_ipw=False, store_diagnostic_data=False, random_state=123)
print(f"Real ATTE = {true_att:.3f} VS Estimated = {atte_result.get('coefficient')} in {atte_result.get('confidence_interval')}")
Real ATTE = 1.567 VS Estimated = 1.6433239376940343 in (1.4972790851652928, 1.7893687902227757)
The estimated ATE and ATTE are again close to the ground-truth values, but the confidence intervals are fairly wide. Adding more data narrows them; you can try this by increasing `n` in the DGP above.
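A quick sanity check, using only the result keys shown above, is to verify that the ground-truth values fall inside the estimated confidence intervals:

# Sketch: do the estimated intervals cover the ground truth?
ate_lo, ate_hi = ate_result.get("confidence_interval")
att_lo, att_hi = atte_result.get("confidence_interval")
print("True ATE inside CI: ", ate_lo <= true_ate <= ate_hi)
print("True ATTE inside CI:", att_lo <= true_att <= att_hi)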