Data Module#
The causalkit.data module provides functions for generating synthetic data for causal inference tasks.
Overview#
This module includes functions for generating:
- A/B test data with customizable parameters
- Randomized Controlled Trial (RCT) data
- Observational data for more complex causal inference scenarios
API Reference#
CausalData – Container for causal inference datasets.
generate_rct_data – Create synthetic RCT data using CausalDatasetGenerator as the core engine.
CausalData#
- class causalkit.data.causaldata.CausalData(df, treatment, outcome, confounders=None)[source]#
Bases:
object
Container for causal inference datasets.
Wraps a pandas DataFrame and stores the names of treatment, outcome, and optional confounder columns. The stored DataFrame is restricted to only those columns.
- Parameters:
df (pd.DataFrame) – The DataFrame containing the data. Cannot contain NaN values. Only columns specified in outcome, treatment, and confounders will be stored.
treatment (str) – Column name representing the treatment variable.
outcome (str) – Column name representing the outcome (target) variable.
confounders (Union[str, List[str]], optional) – Column name(s) representing the confounders/covariates.
- df#
A copy of the original data restricted to [outcome, treatment] + confounders.
- Type:
pd.DataFrame
Examples
>>> from causalkit.data import generate_rct_data
>>> from causalkit.data import CausalData
>>>
>>> # Generate data
>>> df = generate_rct_data()
>>>
>>> # Create CausalData object
>>> causal_data = CausalData(
...     df=df,
...     treatment='treatment',
...     outcome='outcome',
...     confounders=['age', 'invited_friend']
... )
>>>
>>> # Access data
>>> causal_data.df.head()
>>>
>>> # Access columns by role
>>> causal_data.target
>>> causal_data.confounders
>>> causal_data.treatment
- property target: Series#
Get the outcome (target) variable.
- Returns:
The outcome column as a pandas Series.
- Return type:
pd.Series
- property treatment: Series#
Get the treatment variable.
- Returns:
The treatment column as a pandas Series.
- Return type:
pd.Series
- get_df(columns=None, include_treatment=True, include_target=True, include_confounders=True)[source]#
Get a DataFrame from the CausalData object with specified columns.
- Parameters:
columns (List[str], optional) – Specific column names to include in the returned DataFrame, in addition to any columns selected by the include flags. If None, the returned columns are determined solely by the include flags; if None and all include flags are False, the entire DataFrame is returned.
include_treatment (bool, default True) – Whether to include treatment column(s) in the returned DataFrame.
include_target (bool, default True) – Whether to include target column(s) in the returned DataFrame.
include_confounders (bool, default True) – Whether to include confounder column(s) in the returned DataFrame.
- Returns:
DataFrame containing the specified columns.
- Return type:
pd.DataFrame
Examples
>>> from causalkit.data import generate_rct_data
>>> from causalkit.data import CausalData
>>>
>>> # Generate data
>>> df = generate_rct_data()
>>>
>>> # Create CausalData object
>>> causal_data = CausalData(
...     df=df,
...     treatment='treatment',
...     outcome='outcome',
...     confounders=['age', 'invited_friend']
... )
>>>
>>> # Get specific columns
>>> causal_data.get_df(columns=['age'])
>>>
>>> # Get all columns
>>> causal_data.get_df()
Data generation utilities for causal inference tasks.
- causalkit.data.generators.generate_rct_data(n_users=20000, split=0.5, random_state=42, target_type='binary', target_params=None)[source]#
Create synthetic RCT data using CausalDatasetGenerator as the core engine.
Treatment is randomized to approximately match split (independent of covariates).
Outcome distribution is controlled by target_type and target_params.
Returns a legacy-compatible schema with ancillary covariates derived from the outcome (age, cnt_trans, platform_Android, platform_iOS, invited_friend), plus a UUID user_id.
- Parameters:
n_users (int) – Total number of users in the dataset.
split (float) – Proportion of users in the treatment group (e.g., 0.5 => 50/50).
random_state (int, optional) – Seed for reproducibility.
target_type ({"binary","normal","nonnormal"}) – Outcome family. “nonnormal” is approximated via a Poisson mean process.
target_params (dict, optional) –
- If None, defaults are used:
binary : {"p": {"A": 0.10, "B": 0.12}}
normal : {"mean": {"A": 0.00, "B": 0.20}, "std": 1.0}
nonnormal: {"shape": 2.0, "scale": {"A": 1.0, "B": 1.1}}
- Returns:
- Columns: user_id, treatment, outcome, age, cnt_trans,
platform_Android, platform_iOS, invited_friend.
- Return type:
pd.DataFrame
- Raises:
ValueError – If target_type is not one of {“binary”, “normal”, “nonnormal”}.
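A library-free sketch of what the binary defaults above imply (the function and names below are illustrative NumPy code, not the library's implementation): treatment is randomized at rate split, and the outcome is Bernoulli with p = 0.10 in the control group ("A") and p = 0.12 in the treatment group ("B").

```python
import numpy as np

def simulate_binary_rct(n_users=20000, split=0.5, p_a=0.10, p_b=0.12, seed=42):
    """Illustrative stand-in for the documented binary defaults."""
    rng = np.random.default_rng(seed)
    treatment = (rng.random(n_users) < split).astype(int)  # ~split share treated
    p = np.where(treatment == 1, p_b, p_a)                 # group-specific rate
    outcome = (rng.random(n_users) < p).astype(int)        # Bernoulli outcome
    return treatment, outcome

treatment, outcome = simulate_binary_rct()
rate_a = outcome[treatment == 0].mean()  # should be near 0.10
rate_b = outcome[treatment == 1].mean()  # should be near 0.12
```

With the defaults, the empirical conversion rates recover the configured 0.10 vs 0.12 up to sampling noise, which is the basic property A/B-test analyses on this data rely on.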
- class causalkit.data.generators.CausalDatasetGenerator(theta=1.0, tau=None, beta_y=None, beta_t=None, g_y=None, g_t=None, alpha_y=0.0, alpha_t=0.0, sigma_y=1.0, outcome_type='continuous', confounder_specs=None, k=5, x_sampler=None, target_t_rate=None, u_strength_t=0.0, u_strength_y=0.0, seed=None)[source]#
Bases:
object
Generate synthetic causal inference datasets with controllable confounding, treatment prevalence, noise, and (optionally) heterogeneous treatment effects.
Data model (high level)
Confounders X ∈ R^k are drawn from user-specified distributions.
- Binary treatment T is assigned by a logistic model:
T ~ Bernoulli( sigmoid(alpha_t + f_t(X) + u_strength_t * U) ),
where f_t(X) = X @ beta_t + g_t(X), and U ~ N(0,1) is an optional unobserved confounder.
- Outcome Y depends on treatment and confounders with link determined by outcome_type:
- outcome_type = “continuous”:
Y = alpha_y + f_y(X) + u_strength_y * U + T * tau(X) + ε, ε ~ N(0, sigma_y^2)
- outcome_type = “binary”:
logit P(Y=1|T,X) = alpha_y + f_y(X) + u_strength_y * U + T * tau(X)
- outcome_type = “poisson”:
log E[Y|T,X] = alpha_y + f_y(X) + u_strength_y * U + T * tau(X)
where f_y(X) = X @ beta_y + g_y(X), and tau(X) is either constant theta or a user function.
- Returned columns
y: outcome
t: binary treatment (0/1)
x1..xk (or user-provided names)
propensity: P(T=1 | X) used to draw T (ground truth)
mu0: E[Y | T=0, X] on the natural outcome scale
mu1: E[Y | T=1, X] on the natural outcome scale
cate: mu1 - mu0 (conditional average treatment effect on the natural outcome scale)
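Because mu0, mu1, and cate are reported on the natural outcome scale, a constant tau on the link scale does not translate into a constant natural-scale effect for non-continuous outcomes. A quick NumPy illustration for the binary case (illustrative code, not the class's internals):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# With a fixed log-odds effect tau, the natural-scale effect mu1 - mu0
# depends on the baseline logit alpha_y + f_y(X).
tau = 1.0
effects = {}
for base in (-2.0, 0.0, 2.0):
    mu0 = sigmoid(base)         # P(Y=1 | T=0, X)
    mu1 = sigmoid(base + tau)   # P(Y=1 | T=1, X)
    effects[base] = mu1 - mu0   # cate on the probability scale
```

The same log-odds shift yields the largest probability-scale effect near a 50% baseline and smaller effects in the tails.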
- Notes on effect scale:
For “continuous”, theta (or tau(X)) is an additive mean difference.
For “binary”, tau acts on the log-odds scale (log-odds ratio).
For “poisson”, tau acts on the log-mean scale (log incidence-rate ratio).
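The data model above can be sketched end to end in NumPy for the continuous case (a minimal illustration with a constant effect theta, standard-normal confounders, and no unobserved confounder; the function name is hypothetical, not the class's internals):

```python
import numpy as np

def simulate_continuous(n, beta_y, beta_t, theta=1.0, alpha_y=0.0,
                        alpha_t=0.0, sigma_y=1.0, seed=0):
    """Sketch of the documented model: logistic treatment, additive outcome."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, len(beta_y)))           # confounders X ~ N(0, I)
    propensity = 1.0 / (1.0 + np.exp(-(alpha_t + X @ beta_t)))
    T = (rng.random(n) < propensity).astype(float)      # T ~ Bernoulli(propensity)
    mu0 = alpha_y + X @ beta_y                          # E[Y | T=0, X]
    mu1 = mu0 + theta                                   # E[Y | T=1, X]
    Y = np.where(T == 1.0, mu1, mu0) + rng.normal(0.0, sigma_y, n)
    return Y, T, propensity, mu1 - mu0                  # last item: cate
```

Here cate is constant and equal to theta; replacing the constant shift with a function of X would make the effect heterogeneous, as the tau parameter allows.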
- Parameters:
theta (float, default=1.0) – Constant treatment effect used if tau is None.
tau (callable or None, default=None) – Function tau(X) -> array-like shape (n,) for heterogeneous effects. Ignored if None.
beta_y (array-like of shape (k,), optional) – Linear coefficients of confounders in the outcome baseline f_y(X).
beta_t (array-like of shape (k,), optional) – Linear coefficients of confounders in the treatment score f_t(X) (log-odds scale).
g_y (callable, optional) – Nonlinear/additive function g_y(X) -> (n,) added to the outcome baseline.
g_t (callable, optional) – Nonlinear/additive function g_t(X) -> (n,) added to the treatment score.
alpha_y (float, default=0.0) – Outcome intercept (natural scale for continuous; log-odds for binary; log-mean for Poisson).
alpha_t (float, default=0.0) – Treatment intercept (log-odds). If target_t_rate is set, alpha_t is auto-calibrated.
sigma_y (float, default=1.0) – Std. dev. of the Gaussian noise for continuous outcomes.
outcome_type ({"continuous","binary","poisson"}, default="continuous") – Outcome family and link as defined above.
confounder_specs (list of dict, optional) –
- Schema for generating confounders. Each spec is one of:
{"name": str, "dist": "normal", "mu": float, "sd": float}
{"name": str, "dist": "uniform", "a": float, "b": float}
{"name": str, "dist": "bernoulli", "p": float}
{"name": str, "dist": "categorical", "categories": list, "probs": list}
For “categorical”, one-hot encoding is produced for all levels except the first.
k (int, default=5) – Number of confounders when confounder_specs is None. Defaults to independent N(0,1).
x_sampler (callable, optional) – Custom sampler (n, k, seed) -> X ndarray of shape (n,k). Overrides confounder_specs and k.
target_t_rate (float in (0,1), optional) – If set, calibrates alpha_t via bisection so that mean propensity ≈ target_t_rate.
u_strength_t (float, default=0.0) – Strength of the unobserved confounder U in treatment assignment.
u_strength_y (float, default=0.0) – Strength of the unobserved confounder U in the outcome.
seed (int, optional) – Random seed for reproducibility.
- rng#
Internal RNG seeded from seed.
- Type:
numpy.random.Generator
Examples
>>> gen = CausalDatasetGenerator(
...     theta=2.0,
...     beta_y=np.array([1.0, -0.5, 0.2]),
...     beta_t=np.array([0.8, 1.2, -0.3]),
...     target_t_rate=0.35,
...     outcome_type="continuous",
...     sigma_y=1.0,
...     seed=42,
...     confounder_specs=[
...         {"name": "age", "dist": "normal", "mu": 50, "sd": 10},
...         {"name": "smoker", "dist": "bernoulli", "p": 0.3},
...         {"name": "bmi", "dist": "normal", "mu": 27, "sd": 4},
...     ])
>>> df = gen.generate(10_000)
>>> df.columns
Index([... 'y','t','age','smoker','bmi','propensity','mu0','mu1','cate'], dtype='object')
- generate(n)[source]#
Draw a synthetic dataset of size n.
- Parameters:
n (int) – Number of observations to simulate.
- Returns:
- Columns:
y : float
t : {0.0, 1.0}
<confounder columns> : floats (and one-hot columns for categorical)
propensity : float in (0,1), true P(T=1 | X)
mu0 : expected outcome under control on the natural scale
mu1 : expected outcome under treatment on the natural scale
cate : mu1 - mu0 (conditional treatment effect on the natural scale)
- Return type:
pd.DataFrame
Notes
If target_t_rate is set, alpha_t is internally recalibrated (bisection) on the current draw of X and U, so repeated calls can yield slightly different alpha_t values even with the same seed unless X and U are fixed.
For binary and Poisson outcomes, cate is reported on the natural scale (probability or mean), even though the structural model is specified on the log-odds / log-mean scale.
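The bisection recalibration described above can be sketched as a stand-alone routine (an illustration under the documented logistic model, not the class's actual code): given the treatment score f_t(X) + u_strength_t * U for the current draw, the mean propensity is monotone increasing in alpha_t, so bisection finds the intercept that hits the target rate.

```python
import numpy as np

def calibrate_alpha_t(score, target_rate, lo=-20.0, hi=20.0, tol=1e-6):
    """Bisect alpha_t so that mean sigmoid(alpha_t + score) ~= target_rate."""
    def mean_propensity(alpha):
        return float(np.mean(1.0 / (1.0 + np.exp(-(alpha + score)))))
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean_propensity(mid) < target_rate:
            lo = mid          # propensity too low: raise the intercept
        else:
            hi = mid          # propensity too high: lower the intercept
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)
```

Because the calibration runs on the realized draw of X and U, the fitted intercept varies slightly from draw to draw, which is exactly the behavior the note above describes.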
- to_causal_data(n, confounders=None)[source]#
Generate a dataset and convert it to a CausalData object.
- Parameters:
n (int) – Number of observations to generate.
confounders (List[str], optional) – Column names to register as confounders in the resulting CausalData.
- Returns:
A CausalData object containing the generated data.
- Return type:
CausalData
- __init__(theta=1.0, tau=None, beta_y=None, beta_t=None, g_y=None, g_t=None, alpha_y=0.0, alpha_t=0.0, sigma_y=1.0, outcome_type='continuous', confounder_specs=None, k=5, x_sampler=None, target_t_rate=None, u_strength_t=0.0, u_strength_y=0.0, seed=None)#