Linear and Nonlinear Data Generating Process benchmarking#

Linear Data Generating Process (DGP)#

Let \(i=1,\dots,n\). Draw confounders:

\[ \text{tenure}_i \sim \mathcal N(24,\ 12^2) \]
\[ \text{sessions}_i \sim \mathcal N(5,\ 2^2) \]
\[ \text{spend}_i \sim \mathrm{Unif}(0,200) \]
\[ \text{prem}_i \sim \mathrm{Bernoulli}(0.25) \]
\[ \text{urban}_i \sim \mathrm{Bernoulli}(0.60) \]

Stack them as

\[ X_i := \big[\ \text{tenure}_i,\ \text{sessions}_i,\ \text{spend}_i,\ \text{prem}_i,\ \text{urban}_i\ \big]^\top \in \mathbb{R}^5 \]

Treatment model#

\[ m(x) \equiv \Pr(D=1\mid X=x) = \sigma\big(\alpha_d + x^\top \beta_t\big), \qquad \sigma(z)=\frac{1}{1+e^{-z}}, \]

with \(\alpha_d\) calibrated (by bisection) so that \(\mathbb{E}[D]\approx 0.20\), and

\[ \beta_t^\top = \big[0.08,\ 0.12,\ 0.004,\ 0.25,\ 0.10 \big] \]

Then

\[ D_i \mid X_i \sim \mathrm{Bernoulli}\big(m(X_i)\big) \]
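
The generator performs this calibration internally (via its target_d_rate argument), but a minimal sketch of such a bisection is shown below, assuming nothing beyond NumPy; the helper name calibrate_alpha_d and its defaults are illustrative, not part of the library API.

import numpy as np

def calibrate_alpha_d(X, beta_t, target=0.20, lo=-10.0, hi=10.0, tol=1e-6):
    """Illustrative bisection: find alpha_d so that mean sigmoid(alpha_d + X @ beta_t) ~= target."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        rate = (1.0 / (1.0 + np.exp(-(mid + X @ beta_t)))).mean()
        if abs(rate - target) < tol:
            break
        # E[D] is increasing in alpha_d, so shrink the bracket toward the target rate
        if rate > target:
            hi = mid
        else:
            lo = mid
    return mid

# Illustrative usage with confounders drawn as above
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(24, 12, 10_000),      # tenure
    rng.normal(5, 2, 10_000),        # sessions
    rng.uniform(0, 200, 10_000),     # spend
    rng.binomial(1, 0.25, 10_000),   # prem
    rng.binomial(1, 0.60, 10_000),   # urban
])
beta_t = np.array([0.08, 0.12, 0.004, 0.25, 0.10])
alpha_d = calibrate_alpha_d(X, beta_t)   # roughly hits E[D] = 0.20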


Outcome model#

\[ Y_i = \alpha_y + X_i^\top \beta_y + \Theta D_i + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal N(0,\sigma_y^2), \]

with \(\alpha_y=0,\ \sigma_y=1,\ \Theta=0.80\), and

\[ \beta_y^\top = \big[0.05,\ 0.60,\ 0.005,\ 0.80,\ 0.20 \big] \]

Oracle nuisances (IRM) and CATE#

\[ m(x) = \sigma\!\big(\alpha_d + x^\top \beta_t\big) \]
\[ g_0(x) = \mathbb{E}[Y \mid X=x, D=0] = \alpha_y + x^\top \beta_y \]
\[ g_1(x) = \mathbb{E}[Y \mid X=x, D=1] = \alpha_y + x^\top \beta_y + \Theta \]
\[ \mathrm{CATE}(x) = g_1(x) - g_0(x) = \Theta \quad \text{(constant)} \]

Targets#

\[ \Theta_{\text{ATE}} = \mathbb{E}\big[Y(1)-Y(0)\big] = \Theta, \qquad \Theta_{\text{ATTE}} = \mathbb{E}\big[Y(1)-Y(0)\mid D=1\big] = \Theta. \]

So under this constant-effect DGP:

\[ \Theta_{\text{ATE}} = \Theta_{\text{ATTE}} = 0.80. \]

Let’s generate the data#

import numpy as np
from typing import List, Dict, Any

from causalis.data import CausalDatasetGenerator

confounder_specs: List[Dict[str, Any]] = [
    {"name": "tenure_months",     "dist": "normal",   "mu": 24, "sd": 12},
    {"name": "avg_sessions_week", "dist": "normal",   "mu": 5,  "sd": 2},
    {"name": "spend_last_month",  "dist": "uniform",  "a": 0,   "b": 200},
    {"name": "premium_user",      "dist": "bernoulli","p": 0.25},
    {"name": "urban_resident",    "dist": "bernoulli","p": 0.60},
]

# Moderate, sensible effects by column name (linear, well-specified)
# Outcome: higher sessions, tenure, spend, premium, urban -> higher Y
beta_y_map = {
    "tenure_months":     0.05,   # ~0.6 SD shift at +1 SD (12 months)
    "avg_sessions_week": 0.60,   # strong engagement signal
    "spend_last_month":  0.005,  # scale 0..200 => up to ~1 shift
    "premium_user":      0.80,
    "urban_resident":    0.20,
}

# Treatment score: moderate dependence on engagement, spend, premium, urban
beta_d_map = {
    "tenure_months":     0.08,
    "avg_sessions_week": 0.12,
    "spend_last_month":  0.004,
    "premium_user":      0.25,
    "urban_resident":    0.10,
}

def expand_beta_from_specs(specs: List[Dict[str, Any]], beta_map: Dict[str, float]) -> np.ndarray:
    """Create β aligned to the generator's X column order from confounder_specs."""
    betas = []
    for spec in specs:
        name = spec.get("name", "")
        dist = str(spec.get("dist", "normal")).lower()
        if dist in ("normal", "uniform", "bernoulli"):
            betas.append(beta_map.get(name, 0.0))
        else:
            raise ValueError(f"Unsupported dist in this simple setup: {dist}")
    return np.asarray(betas, dtype=float)

beta_y = expand_beta_from_specs(confounder_specs, beta_y_map)
beta_d = expand_beta_from_specs(confounder_specs, beta_d_map)
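
# (Optional, illustrative check) The β vectors follow the confounder_specs column order:
# beta_y -> [0.05, 0.60, 0.005, 0.80, 0.20], beta_d -> [0.08, 0.12, 0.004, 0.25, 0.10]
assert np.allclose(beta_y, [0.05, 0.60, 0.005, 0.80, 0.20])
assert np.allclose(beta_d, [0.08, 0.12, 0.004, 0.25, 0.10])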

gen = CausalDatasetGenerator(
    theta=0.80,                 # constant treatment effect
    tau=None,                   # use theta
    beta_y=beta_y,              # linear connection between X and Y
    beta_d=beta_d,              # linear connection between X and D
    g_y=None, g_d=None,         # no nonlinearities
    alpha_y=0.0,                # no intercept
    alpha_d=0.0,                # recalibrated internally to hit target_d_rate
    sigma_y=1.0,                # noise of y
    outcome_type="continuous",  # Gaussian Y
    confounder_specs=confounder_specs,
    u_strength_d=0.0,           # strength of latent confounder influence on treatment
    u_strength_y=0.0,           # strength of latent confounder influence on outcome
    propensity_sharpness=1.0,   # increase to make overlap harder
    target_d_rate=0.20,         # rate of treatment assignment
    seed=123                    # random seed for reproducibility
)

n = 10_000                      # Number of observations
df = gen.generate(n)

print("Treatment share ≈", df["d"].mean())
true_ate = float(df["cate"].mean())
print(f"Ground-truth ATE from the DGP: {true_ate:.3f}")
# Ground-truth ATT (on the natural scale): E[tau(X) | D=1] = mean CATE among the treated
true_att = float(df.loc[df["d"] == 1, "cate"].mean())
print(f"Ground-truth ATT from the DGP: {true_att:.3f}")
Treatment share ≈ 0.2052
Ground-truth ATE from the DGP: 0.800
Ground-truth ATT from the DGP: 0.800
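
Before running DML, it is instructive to look at the naive difference in means between treated and control units. Because treatment probability rises with the same features that raise Y, this unadjusted contrast should be biased upward relative to the true effect of 0.80. A quick check, using only the generated df:

naive = df.loc[df["d"] == 1, "y"].mean() - df.loc[df["d"] == 0, "y"].mean()
print(f"Naive difference in means: {naive:.3f} (true ATE = {true_ate:.3f})")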

Wrap it in a CausalData object#

from causalis.data import CausalData

causal_data = CausalData(
    df=df,
    treatment="d",
    outcome="y",
    confounders=["tenure_months",
                 "avg_sessions_week",
                 "spend_last_month",
                 "premium_user",
                 "premium_user",
                 "urban_resident"]
)

causal_data.df.head()
y d tenure_months avg_sessions_week spend_last_month premium_user urban_resident
0 1.903910 0.0 12.130544 4.056687 181.570607 0.0 0.0
1 3.388144 0.0 19.586560 1.671561 182.793598 0.0 0.0
2 8.456512 1.0 39.455103 5.452889 125.185708 1.0 1.0
3 5.535970 1.0 26.327693 5.051629 4.932905 0.0 1.0
4 4.965140 1.0 35.042771 4.933996 23.577407 0.0 0.0

Estimate ATE and ATTE#

from causalis.inference.ate import dml_ate

# Estimate Average Treatment Effect (ATE)
ate_result = dml_ate(causal_data, n_folds=4, normalize_ipw=False, store_diagnostic_data=False, random_state=123)
print(f"Real ATE = 0.8 VS Estimated = {ate_result.get('coefficient')} in {ate_result.get('confidence_interval')}")
Real ATE = 0.8 VS Estimated = 0.7506313322797796 in (0.6668061755942251, 0.8344564889653342)
from causalis.inference.atte import dml_atte

# Estimate Average Treatment Effect on the Treated (ATTE)
atte_result = dml_atte(causal_data, n_folds=4, normalize_ipw=False, store_diagnostic_data=False, random_state=123)
print(f"Real ATTE = 0.8 VS Estimated = {atte_result.get('coefficient')} in {atte_result.get('confidence_interval')}")
Real ATTE = 0.8 VS Estimated = 0.8238669785039335 in (0.7558326760362799, 0.8919012809715872)

The estimated ATE and ATTE are close to the ground-truth values, but the confidence intervals are fairly wide. Adding more data narrows them; you can try this by increasing n in the DGP, as sketched below.
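
A minimal sketch of that experiment, re-using gen, CausalData, and dml_ate from above (the sample sizes and the focus on CI width are illustrative):

for n_try in (10_000, 40_000):
    df_n = gen.generate(n_try)
    cd_n = CausalData(
        df=df_n, treatment="d", outcome="y",
        confounders=["tenure_months", "avg_sessions_week",
                     "spend_last_month", "premium_user", "urban_resident"],
    )
    res = dml_ate(cd_n, n_folds=4, normalize_ipw=False,
                  store_diagnostic_data=False, random_state=123)
    lo, hi = res.get("confidence_interval")
    print(f"n={n_try}: CI width = {hi - lo:.3f}")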

Nonlinear Data Generating Process (DGP)#

Let \(i=1,\dots,n\). Draw confounders:

\[ \text{tenure}_i \sim \mathcal N(24,\ 12^2) \]
\[ \text{sessions}_i \sim \mathcal N(5,\ 2^2) \]
\[ \text{spend}_i \sim \mathrm{Unif}(0,200) \]
\[ \text{prem}_i \sim \mathrm{Bernoulli}(0.25) \]
\[ \text{urban}_i \sim \mathrm{Bernoulli}(0.60) \]

Stack them as

\[ X_i := \big[\ \text{tenure}_i,\ \text{sessions}_i,\ \text{spend}_i,\ \text{prem}_i,\ \text{urban}_i\ \big]^\top \in \mathbb{R}^5 \]

Treatment model#

Define

\[ m(x) \equiv \Pr(D=1\mid X=x) = \sigma\big(\alpha_d + g_d(x)\big), \qquad \sigma(z)=\frac{1}{1+e^{-z}}, \]

with \(\alpha_d\) calibrated (by bisection) so that \(\mathbb{E}[D]\approx 0.20\).

The nonlinear treatment score includes a term aligned with the treatment effect \(\tau(x)\):

\[ \boxed{ \begin{aligned} g_d(x) &= 1.10\tanh\big(0.06(\text{spend}-100)\big) + 1.00\,\sigma\big(0.60(\text{sessions}-5)\big) + 0.50\log\big(1+\text{tenure}\big) \\ &\quad + 0.50\,\text{prem} + 0.25\,\text{urban} + 0.90\,\text{prem}\cdot\mathbb{1}\{\text{spend}>120\} + 0.30\,\text{urban}\cdot\mathbb{1}\{\text{tenure}<12\} + 0.80\,\tau(x) \end{aligned} } \]

where \(\mathbb{1}\{\cdot\}\) denotes the indicator function.

Then

\[ D_i \mid X_i \sim \mathrm{Bernoulli}\big(m(X_i)\big) \]

Outcome model#

\[ Y_i = \alpha_y + g_y(X_i) + D_i\tau(X_i) + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal N(0,\sigma_y^2), \]

with \(\alpha_y=0,\ \sigma_y=1\).

The baseline outcome component is nonlinear:

\[ \boxed{ \begin{aligned} g_y(x) &= 0.70\tanh\big(0.03(\text{spend}-80)\big) + 0.50\sqrt{\text{sessions}} + 0.40\log\big(1+\text{tenure}\big) \\ &\quad + 0.30\,\text{prem} + 0.10\,\text{urban} - 0.10\,\mathbb{1}\{\text{spend}<20\} \end{aligned} } \]

The heterogeneous treatment effect (CATE) is

\[ \boxed{ \begin{aligned} \tau(x) &= 0.40 + 0.60\,\sigma\big(0.40(\text{sessions}-5)\big) + 2.00\,\text{prem}\cdot\mathbb{1}\{\text{spend}>120\} + 0.10\,\text{urban}\cdot\mathbb{1}\{\text{tenure}<12\} \end{aligned} } \]

Oracle nuisances (IRM) and CATE#

\[ m(x) = \sigma\big(\alpha_d + g_d(x)\big) \]
\[ g_0(x) = \mathbb{E}[Y \mid X=x, D=0] = \alpha_y + g_y(x) \]
\[ g_1(x) = \mathbb{E}[Y \mid X=x, D=1] = \alpha_y + g_y(x) + \tau(x) \]
\[ \mathrm{CATE}(x) = g_1(x) - g_0(x) = \tau(x) \]

Targets#

\[ \Theta_{\text{ATE}} = \mathbb{E}\big[\tau(X)\big], \qquad \Theta_{\text{ATTE}} = \mathbb{E}\big[\tau(X)\mid D=1\big]. \]
import numpy as np
from typing import List, Dict, Any, Tuple
from causalis.data import CausalDatasetGenerator

# 1) Confounders
confounder_specs: List[Dict[str, Any]] = [
    {"name": "tenure_months",     "dist": "normal",    "mu": 24, "sd": 12},
    {"name": "avg_sessions_week", "dist": "normal",    "mu": 5,  "sd": 2},
    {"name": "spend_last_month",  "dist": "uniform",   "a": 0,   "b": 200},
    {"name": "premium_user",      "dist": "bernoulli", "p": 0.25},
    {"name": "urban_resident",    "dist": "bernoulli", "p": 0.60},
]

# 2) Feature index map
def feature_indices_from_specs(specs: List[Dict[str, Any]]) -> Dict[str, Tuple[int, ...]]:
    idx, out = 0, {}
    for spec in specs:
        name = spec.get("name", "")
        dist = str(spec.get("dist","normal")).lower()
        if dist in ("normal","uniform","bernoulli"):
            out[name] = (idx,); idx += 1
        else:
            raise ValueError(f"Unsupported dist: {dist}")
    return out

feat = feature_indices_from_specs(confounder_specs)
def col(X, key): return X[:, feat[key][0]]
def _log1p_pos(x): return np.log1p(np.clip(x, 0.0, None))
def _sqrt_pos(x):  return np.sqrt(np.clip(x, 0.0, None))
def _ind(cond):    return cond.astype(float)
def _sigmoid(z):   return 1.0 / (1.0 + np.exp(-z))

# 3) g_d(x) -   nonlinear connection between X and D
def g_d(X: np.ndarray) -> np.ndarray:
    tenure = col(X, "tenure_months")
    sess   = col(X, "avg_sessions_week")
    spend  = col(X, "spend_last_month")
    prem   = col(X, "premium_user")
    urban  = col(X, "urban_resident")
    tau_align = tau_func(X)                                       # explicit alignment
    return (
        1.10 * np.tanh(0.06*(spend - 100.0))
      + 1.00 * _sigmoid(0.60*(sess - 5.0))
      + 0.50 * _log1p_pos(tenure)
      + 0.50 * prem
      + 0.25 * urban
      + 0.90 * prem * _ind(spend > 120.0)
      + 0.30 * urban * _ind(tenure < 12.0)
      + 0.80 * tau_align                                         # direct alignment term (strength 0.80)
    )


# 4) g_y(x)  -   nonlinear connection between X and Y
def g_y(X: np.ndarray) -> np.ndarray:
    tenure = col(X, "tenure_months")
    sess   = col(X, "avg_sessions_week")
    spend  = col(X, "spend_last_month")
    prem   = col(X, "premium_user")
    urban  = col(X, "urban_resident")
    return (
        0.70 * np.tanh(0.03*(spend - 80.0))
      + 0.50 * _sqrt_pos(sess)
      + 0.40 * _log1p_pos(tenure)
      + 0.30 * prem
      + 0.10 * urban
      - 0.10 * _ind(spend < 20.0)
    )

# 5) tau(x)  — nonlinear effect function (CATE)
def tau_func(X: np.ndarray) -> np.ndarray:
    tenure = col(X, "tenure_months")
    sess   = col(X, "avg_sessions_week")
    spend  = col(X, "spend_last_month")
    prem   = col(X, "premium_user")
    urban  = col(X, "urban_resident")
    return (
        0.40
      + 0.60 * _sigmoid(0.40*(sess - 5.0))
      + 2.00 * prem * _ind(spend > 120.0)
      + 0.10 * urban * _ind(tenure < 12.0)
    )
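
# (Optional sanity check, not part of the original walkthrough.)
# tau(x) should be largest for premium users with spend > 120 and smallest for
# light-session, non-premium users. Two illustrative rows, in the spec column order
# [tenure, sessions, spend, premium, urban]:
_x_check = np.array([
    [36.0, 8.0, 150.0, 1.0, 1.0],   # engaged premium big-spender
    [ 6.0, 2.0,  30.0, 0.0, 0.0],   # light, non-premium user
])
# Expected: roughly 0.4 + 0.6*sigmoid(1.2) + 2.0 for the first row,
# and 0.4 + 0.6*sigmoid(-1.2) for the second.
print("tau on illustrative rows:", tau_func(_x_check))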

# 6) Generator — continuous outcome
gen = CausalDatasetGenerator(
    theta=0.0,                 # ignored; we pass tau
    tau=tau_func,              # nonlinear effect
    beta_y=None, beta_d=None,  # use nonlinear g_* only
    g_y=g_y, g_d=g_d,          # nonlinear functions
    alpha_y=0.0,               # baseline mean level, intercept
    alpha_d=0.0,               # will be calibrated to target_d_rate
    sigma_y=1.0,               # noise std for Y
    outcome_type="continuous", # outcome distribution
    confounder_specs=confounder_specs,
    target_d_rate=0.20,        # 20% will be treated
    u_strength_d=0.0,          # strength of latent confounder influence on treatment
    u_strength_y=0.0,          # strength of latent confounder influence on outcome
    propensity_sharpness=1,    # increase to make overlap harder
    seed=123                   # random seed for reproducibility
)

# 7) Generate
n = 10_000                     # Number of observations
df = gen.generate(n)


print("Treatment share ≈", df["d"].mean())
true_ate = float(df["cate"].mean())
print(f"Ground-truth ATE from the DGP: {true_ate:.3f}")
# Ground-truth ATT (on the natural scale): E[tau(X) | D=1] = mean CATE among the treated
true_att = float(df.loc[df["d"] == 1, "cate"].mean())
print(f"Ground-truth ATT from the DGP: {true_att:.3f}")
Treatment share ≈ 0.2036
Ground-truth ATE from the DGP: 0.913
Ground-truth ATT from the DGP: 1.567
from causalis.data import CausalData

causal_data = CausalData(
    df=df,
    treatment="d",
    outcome="y",
    confounders=["tenure_months",
                 "avg_sessions_week",
                 "spend_last_month",
                 "premium_user",
                 "urban_resident"]
)

causal_data.df.head()
y d tenure_months avg_sessions_week spend_last_month premium_user urban_resident
0 0.689404 0.0 12.130544 4.056687 181.570607 0.0 0.0
1 3.045282 0.0 19.586560 1.671561 182.793598 0.0 0.0
2 7.173595 1.0 39.455103 5.452889 125.185708 1.0 1.0
3 1.926216 0.0 26.327693 5.051629 4.932905 0.0 1.0
4 1.225088 0.0 35.042771 4.933996 23.577407 0.0 0.0
from causalis.inference.ate import dml_ate

# Estimate Average Treatment Effect (ATE)
ate_result = dml_ate(causal_data, n_folds=4, normalize_ipw=False, store_diagnostic_data=False, random_state=123)
print(f"Real ATE = {true_ate:.3f} VS Estimated = {ate_result.get('coefficient')} in {ate_result.get('confidence_interval')}")
Real ATE = 0.913 VS Estimated = 0.9917276396749556 in (0.869543879249174, 1.1139114001007373)
from causalis.inference.atte import dml_atte

# Estimate Average Treatment Effect on the Treated (ATTE)
atte_result = dml_atte(causal_data, n_folds=4, normalize_ipw=False, store_diagnostic_data=False, random_state=123)
print(f"Real ATTE = {true_att:.3f} VS Estimated = {atte_result.get('coefficient')} in {atte_result.get('confidence_interval')}")
Real ATTE = 1.567 VS Estimated = 1.6433239376940343 in (1.4972790851652928, 1.7893687902227757)

The estimated ATE and ATTE are again close to the ground-truth values, though the confidence intervals remain fairly wide; as before, increasing n in the DGP narrows them. The sketch below looks at where the effect heterogeneity (and the gap between ATE and ATTE) comes from.
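
In contrast to the linear benchmark, the ATTE (≈ 1.57) sits well above the ATE (≈ 0.91): the 0.80·τ(x) alignment term in g_d(x) pushes high-effect users into treatment. A small pandas sketch, using only columns already present in df, shows which segments drive the heterogeneity:

seg = df.assign(big_spender=(df["spend_last_month"] > 120).astype(int))
print(seg.groupby(["premium_user", "big_spender"])["cate"].agg(["mean", "size"]))
print(seg.groupby("d")["cate"].mean())   # treated units concentrate in high-tau segments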