EDA Module#

The causalkit.eda module provides exploratory diagnostics for causal designs with binary treatment. It helps assess treatment predictability, overlap/positivity, covariate balance, and outcome modeling quality before running inference.

Overview#

Key components:

  • CausalEDA: High-level interface for EDA on CausalData or a lightweight container

  • CausalDataLite: Minimal data container compatible with CausalEDA

Main capabilities:

  • Outcome group statistics by treatment

  • Cross-validated propensity scores with ROC AUC and positivity checks

  • Covariate balance diagnostics (means, absolute diffs, and standardized mean differences)

  • Outcome model fit diagnostics (RMSE, MAE) and SHAP-based feature attributions for CatBoost models

  • Visualization helpers (propensity score overlap, distributions and boxplots)

API Reference#

CausalEDA

Exploratory diagnostics for causal designs with binary treatment.

CausalDataLite

A minimal container for dataset roles used by CausalEDA.

CausalEDA#

EDA utilities for causal analysis (propensity, overlap, balance, weights).

This module provides a lightweight CausalEDA class to quickly assess whether a binary treatment problem is suitable for causal effect estimation. The outputs focus on interpretability: treatment predictability, overlap/positivity, covariate balance before/after weighting, and basic data health.

What the main outputs mean

  • outcome_stats(): DataFrame with comprehensive statistics (count, mean, std, percentiles, min/max) for outcome grouped by treatment.

  • fit_propensity(): PropensityModel instance wrapping the cross-validated propensity scores P(T=1|X) together with diagnostics (roc_auc, shap, ps_graph(), positivity_check()).

  • confounders_roc_auc(): Float ROC AUC of treatment vs. propensity score. Higher AUC implies treatment is predictable from confounders (more confounding risk).

  • positivity_check(): Dict with bounds, share_below, share_above, and flag. It reports what share of units have PS outside [low, high]; a large share signals poor overlap (violated positivity).

  • plot_ps_overlap(): Overlaid histograms of PS for treated vs control.

  • confounders_means(): DataFrame with comprehensive balance assessment including means by treatment group, absolute differences, and standardized mean differences (SMD).

Note: The class accepts either the project’s CausalData object (duck-typed) or a CausalDataLite with explicit fields.

class causalkit.eda.eda.PropensityModel(propensity_scores, treatment_values, fitted_model, feature_names, X_for_shap=None, cat_features_for_shap=None)[source]#

Bases: object

A model for propensity scores and related diagnostics.

This class encapsulates propensity scores and provides methods for:

  • Computing ROC AUC

  • Extracting SHAP values

  • Plotting propensity score overlap

  • Checking positivity/overlap

The class is returned by CausalEDA.fit_propensity() and provides a cleaner interface for propensity score analysis.

Parameters:
  • propensity_scores (np.ndarray)

  • treatment_values (np.ndarray)

  • fitted_model (Any)

  • feature_names (List[str])

  • X_for_shap (Optional[np.ndarray])

  • cat_features_for_shap (Optional[List[int]])

__init__(propensity_scores, treatment_values, fitted_model, feature_names, X_for_shap=None, cat_features_for_shap=None)[source]#

Initialize PropensityModel with fitted model artifacts.

Parameters:
  • propensity_scores (np.ndarray) – Array of propensity scores P(T=1|X)

  • treatment_values (np.ndarray) – Array of actual treatment assignments (0/1)

  • fitted_model (Any) – The fitted propensity score model

  • feature_names (List[str]) – Names of features used in the model

  • X_for_shap (Optional[np.ndarray]) – Preprocessed feature matrix for SHAP computation

  • cat_features_for_shap (Optional[List[int]]) – Categorical feature indices for SHAP computation

property roc_auc: float#

Compute ROC AUC of treatment assignment vs. propensity scores.

Higher AUC means treatment is more predictable from confounders, indicating stronger systematic differences between groups (potential confounding). Values near 0.5 suggest random-like assignment.

Returns:

ROC AUC score between 0 and 1

Return type:

float
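To make the interpretation concrete, here is a minimal, dependency-free sketch of what a ROC AUC measures: the probability that a randomly chosen treated unit has a higher propensity score than a randomly chosen control unit, with ties counted as one half. The function name `pairwise_auc` and the toy data are illustrative assumptions and are not part of causalkit; the library's own computation is not shown on this page.

```python
def pairwise_auc(treatment, scores):
    """ROC AUC as P(score of a treated unit > score of a control unit), ties = 0.5."""
    treated = [s for t, s in zip(treatment, scores) if t == 1]
    control = [s for t, s in zip(treatment, scores) if t == 0]
    if not treated or not control:
        raise ValueError("both treatment groups must be present")
    wins = 0.0
    for s1 in treated:
        for s0 in control:
            if s1 > s0:
                wins += 1.0
            elif s1 == s0:
                wins += 0.5
    return wins / (len(treated) * len(control))


# Toy data (hypothetical): two controls, two treated units.
t = [0, 0, 1, 1]
ps = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(t, ps)  # 3 of the 4 treated/control pairs are ordered correctly
```

The quadratic loop is only meant to expose the definition; for real data a rank-based implementation is far faster.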

property shap: DataFrame#

Return SHAP values from the fitted propensity score model.

SHAP values show the directional contribution of each feature to treatment assignment prediction, where positive values increase treatment probability and negative values decrease it.

Returns:

For CatBoost models: DataFrame with columns ‘feature’ and ‘shap_mean’, where ‘shap_mean’ represents the mean SHAP value across all samples.

For sklearn models: DataFrame with columns ‘feature’ and ‘importance’ (absolute coefficient values, for backward compatibility).

Return type:

pd.DataFrame

Raises:

RuntimeError – If the fitted model does not support SHAP values extraction.

ps_graph()[source]#

Plot overlaid histograms of propensity scores for treated vs control.

Useful to visually assess group overlap. Does not return data; it draws on the current matplotlib figure.

positivity_check(bounds=(0.05, 0.95))[source]#

Check overlap/positivity based on propensity score thresholds.

Parameters:

bounds (Tuple[float, float], default (0.05, 0.95)) – Lower and upper thresholds for positivity check

Returns:

Dictionary with:

  • bounds: (low, high) thresholds used

  • share_below: fraction with PS < low

  • share_above: fraction with PS > high

  • flag: heuristic boolean, True if the tails collectively exceed ~2%

Return type:

Dict[str, Any]
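The returned dictionary can be reproduced with a few lines of plain Python. The sketch below is an assumption-laden illustration: the function name `positivity_report` is hypothetical, and the exact flag cutoff is taken to be 0.02 because the text above only says "~2%".

```python
def positivity_report(ps, bounds=(0.05, 0.95), flag_threshold=0.02):
    """Share of propensity scores outside [low, high], plus a heuristic flag."""
    low, high = bounds
    n = len(ps)
    share_below = sum(p < low for p in ps) / n
    share_above = sum(p > high for p in ps) / n
    return {
        "bounds": (low, high),
        "share_below": share_below,
        "share_above": share_above,
        # Assumed rule: flag when the two tails together exceed flag_threshold.
        "flag": (share_below + share_above) > flag_threshold,
    }


# Toy scores (hypothetical): one unit in each tail out of five.
scores = [0.01, 0.20, 0.50, 0.70, 0.99]
report = positivity_report(scores)
```

Tightening `bounds` (for example to (0.1, 0.9)) makes the check stricter and usually raises both shares.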

class causalkit.eda.eda.OutcomeModel(predicted_outcomes, actual_outcomes, fitted_model, feature_names, X_for_shap=None, cat_features_for_shap=None)[source]#

Bases: object

A model for outcome prediction and related diagnostics.

This class encapsulates outcome predictions and provides methods for:

  • Computing RMSE and MAE regression metrics

  • Extracting SHAP values for outcome prediction

The class is returned by CausalEDA.outcome_fit() and provides a cleaner interface for outcome model analysis.

Parameters:
  • predicted_outcomes (np.ndarray)

  • actual_outcomes (np.ndarray)

  • fitted_model (Any)

  • feature_names (List[str])

  • X_for_shap (Optional[np.ndarray])

  • cat_features_for_shap (Optional[List[int]])

__init__(predicted_outcomes, actual_outcomes, fitted_model, feature_names, X_for_shap=None, cat_features_for_shap=None)[source]#

Initialize OutcomeModel with fitted model artifacts.

Parameters:
  • predicted_outcomes (np.ndarray) – Array of predicted outcome values

  • actual_outcomes (np.ndarray) – Array of actual outcome values

  • fitted_model (Any) – The fitted outcome prediction model

  • feature_names (List[str]) – Names of features used in the model (confounders only)

  • X_for_shap (Optional[np.ndarray]) – Preprocessed feature matrix for SHAP computation

  • cat_features_for_shap (Optional[List[int]]) – Categorical feature indices for SHAP computation

property scores: Dict[str, float]#

Compute regression metrics (RMSE and MAE) for outcome predictions.

Returns:

Dictionary containing:

  • ‘rmse’: Root Mean Squared Error

  • ‘mae’: Mean Absolute Error

Return type:

Dict[str, float]
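For reference, both metrics follow directly from the prediction errors. This is a minimal sketch using the standard definitions; the helper name `regression_scores` and the sample values are hypothetical, and only the dictionary keys are taken from the description above.

```python
import math


def regression_scores(actual, predicted):
    """RMSE and MAE from paired actual/predicted outcome values."""
    errors = [p - a for a, p in zip(actual, predicted)]
    n = len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / n)  # penalizes large errors more
    mae = sum(abs(e) for e in errors) / n             # average absolute miss
    return {"rmse": rmse, "mae": mae}


# Toy values (hypothetical): errors are -1, 0 and +2.
scores = regression_scores([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
```

RMSE is never smaller than MAE; a large gap between them suggests a few large errors dominate the fit.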

property shap: DataFrame#

Return SHAP values from the fitted outcome prediction model.

SHAP values show the directional contribution of each feature to outcome prediction, where positive values increase the predicted outcome and negative values decrease it.

Returns:

For CatBoost models: DataFrame with columns ‘feature’ and ‘shap_mean’, where ‘shap_mean’ represents the mean SHAP value across all samples.

For sklearn models: DataFrame with columns ‘feature’ and ‘importance’ (absolute coefficient values, for backward compatibility).

Return type:

pd.DataFrame

Raises:

RuntimeError – If the fitted model does not support SHAP values extraction.

class causalkit.eda.eda.CausalDataLite(df, treatment, target, confounders)[source]#

Bases: object

A minimal container for dataset roles used by CausalEDA.

Attributes:

  • df: The full pandas DataFrame containing treatment, outcome and covariates.

  • treatment: Column name of the binary treatment indicator (0/1).

  • target: Column name of the outcome variable.

  • confounders: List of covariate column names used to model treatment.

Parameters:
  • df (DataFrame)

  • treatment (str)

  • target (str)

  • confounders (List[str])

df: DataFrame#

treatment: str#

target: str#

confounders: List[str]#

__init__(df, treatment, target, confounders)#

Parameters:
  • df (DataFrame)

  • treatment (str)

  • target (str)

  • confounders (List[str])

Return type:

None
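Because CausalEDA accepts any object exposing these four fields, a container can be sanity-checked before use. The sketch below is an illustration only: `RolesLite` and `role_problems` are hypothetical names, not causalkit APIs, and a plain dict of columns stands in for the pandas DataFrame so the example has no dependencies.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class RolesLite:
    """Hypothetical stand-in mirroring the four documented fields."""
    df: Any  # a pandas DataFrame in practice; a dict of columns here
    treatment: str
    target: str
    confounders: List[str]


def role_problems(data: Any) -> List[str]:
    """Return human-readable problems; an empty list means the roles look usable."""
    problems = []
    columns = set(data.df.keys())  # .keys() works for dicts and DataFrames alike
    for name in [data.treatment, data.target, *data.confounders]:
        if name not in columns:
            problems.append(f"missing column: {name}")
    if data.treatment in columns and not set(data.df[data.treatment]) <= {0, 1}:
        problems.append("treatment is not binary 0/1")
    return problems


# Toy table (hypothetical column names).
table = {"t": [0, 1, 1, 0], "y": [1.2, 3.4, 2.8, 0.9], "age": [31, 45, 38, 27]}
ok = RolesLite(df=table, treatment="t", target="y", confounders=["age"])
bad = RolesLite(df=table, treatment="t", target="y", confounders=["income"])
```

Running a check like this first turns a late failure inside model fitting into an early, readable message.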

class causalkit.eda.eda.CausalEDA(data, ps_model=None, n_splits=5, random_state=42)[source]#

Bases: object

Exploratory diagnostics for causal designs with binary treatment.

The class exposes methods to:

  • Summarize outcome by treatment and naive mean difference.

  • Estimate cross-validated propensity scores and assess treatment predictability (AUC) and positivity/overlap.

  • Inspect covariate balance via standardized mean differences (SMD) before/after IPTW weighting; visualize with a love plot.

  • Inspect weight distributions and effective sample size (ESS).
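
The weighting quantities named in this list have standard definitions that are easy to compute by hand. The sketch below assumes ATE-style IPTW weights (1/ps for treated units, 1/(1-ps) for controls) and the Kish effective sample size; the helper names are hypothetical, and this page does not show how causalkit itself computes them.

```python
def iptw_weights(treatment, ps):
    """Inverse probability of treatment weights: 1/ps if treated, 1/(1-ps) if control."""
    return [1.0 / p if t == 1 else 1.0 / (1.0 - p) for t, p in zip(treatment, ps)]


def effective_sample_size(weights):
    """Kish ESS: (sum of weights)^2 / (sum of squared weights)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


# Toy data (hypothetical): two treated and two control units.
t = [1, 1, 0, 0]
ps = [0.5, 0.25, 0.5, 0.2]
w = iptw_weights(t, ps)
ess = effective_sample_size(w)  # below 4 because the weights are unequal
```

Equal weights give an ESS equal to the number of units; a few extreme weights pull it well below that, which is the usual symptom of poor overlap.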

Parameters:
  • data (Any)

  • ps_model (Optional[Any])

  • n_splits (int)

  • random_state (int)

__init__(data, ps_model=None, n_splits=5, random_state=42)[source]#
Parameters:
  • data (Any)

  • ps_model (Any | None)

  • n_splits (int)

  • random_state (int)

data_shape()[source]#

Return the shape information of the causal dataset.

Returns a dict with:

  • n_rows: number of rows (observations) in the dataset

  • n_columns: number of columns (features) in the dataset

This provides a quick overview of the dataset dimensions for exploratory analysis and reporting purposes.

Returns:

Dictionary containing ‘n_rows’ and ‘n_columns’ keys with corresponding integer values representing the dataset dimensions.

Return type:

Dict[str, int]

Examples

>>> eda = CausalEDA(causal_data)
>>> shape_info = eda.data_shape()
>>> print(f"Dataset has {shape_info['n_rows']} rows and {shape_info['n_columns']} columns")

outcome_stats()[source]#

Comprehensive outcome statistics grouped by treatment.

Returns a DataFrame of outcome statistics for each treatment group: count, mean, std, min, selected percentiles (10th, 25th, 50th, 75th, 90th), and max. The table can be used directly in reports.

Returns:

DataFrame with treatment groups as index and the following columns:

  • count: number of observations in each group

  • mean: average outcome value

  • std: standard deviation of outcome

  • min: minimum outcome value

  • p10: 10th percentile

  • p25: 25th percentile (Q1)

  • median: 50th percentile (median)

  • p75: 75th percentile (Q3)

  • p90: 90th percentile

  • max: maximum outcome value

Return type:

pd.DataFrame
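The columns above can be reproduced without pandas to see exactly what each one means. This is a rough sketch under stated assumptions: the helper names are hypothetical, the standard deviation is the sample version (n - 1 denominator), and percentiles use linear interpolation, which may differ from the library's exact choices.

```python
import statistics


def percentile(sorted_vals, q):
    """Linear-interpolation percentile (q in [0, 100]) on pre-sorted values."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def outcome_stats_sketch(treatment, outcome):
    """Per-group outcome summary keyed by treatment value."""
    stats = {}
    for group in sorted(set(treatment)):
        vals = sorted(y for t, y in zip(treatment, outcome) if t == group)
        stats[group] = {
            "count": len(vals),
            "mean": statistics.fmean(vals),
            "std": statistics.stdev(vals),  # sample standard deviation
            "min": vals[0],
            "p10": percentile(vals, 10),
            "p25": percentile(vals, 25),
            "median": percentile(vals, 50),
            "p75": percentile(vals, 75),
            "p90": percentile(vals, 90),
            "max": vals[-1],
        }
    return stats


# Toy data (hypothetical): three control and three treated outcomes.
t = [0, 0, 0, 1, 1, 1]
y = [1.0, 2.0, 3.0, 2.0, 4.0, 6.0]
table = outcome_stats_sketch(t, y)
naive_diff = table[1]["mean"] - table[0]["mean"]  # unadjusted difference in means
```

The naive difference in means is only a descriptive starting point; it is not a causal estimate unless treatment is as good as randomly assigned.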

Examples

>>> eda = CausalEDA(causal_data)
>>> stats = eda.outcome_stats()
>>> print(stats)
        count      mean       std       min       p10       p25    median       p75       p90       max
treatment
0        3000  5.123456  2.345678  0.123456  2.345678  3.456789  5.123456  6.789012  7.890123  9.876543
1        2000  6.789012  2.456789  0.234567  3.456789  4.567890  6.789012  8.901234  9.012345  10.987654

fit_propensity()[source]#

Estimate cross-validated propensity scores P(T=1|X).

Uses a preprocessing+CatBoost classifier pipeline with stratified K-fold cross_val_predict to generate out-of-fold probabilities. CatBoost uses all available threads and handles categorical features natively. Returns a PropensityModel instance containing propensity scores and diagnostic methods.

Returns:

A PropensityModel instance with:

  • roc_auc: ROC AUC score property

  • shap: SHAP values DataFrame property

  • ps_graph(): method to plot propensity score overlap

  • positivity_check(): method to check positivity/overlap

Return type:

PropensityModel
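The key idea behind out-of-fold scores is that every unit is scored by a model that never saw it during fitting. The sketch below illustrates only that mechanism and swaps the CatBoost pipeline for a deliberately trivial stand-in model (a smoothed treated share per stratum); all function names and the toy data are hypothetical, and nothing here reflects causalkit's internals.

```python
import random


def stratified_folds(labels, n_splits=5, seed=42):
    """Assign each index to a fold so every fold keeps roughly the same class mix."""
    rng = random.Random(seed)
    fold_of = [0] * len(labels)
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            fold_of[i] = pos % n_splits
    return fold_of


def out_of_fold_propensity(strata, treatment, n_splits=5, seed=42):
    """Out-of-fold P(T=1 | stratum) with a stand-in frequency model."""
    fold_of = stratified_folds(treatment, n_splits, seed)
    ps = [0.0] * len(treatment)
    for k in range(n_splits):
        # "Fit": count treated/total per stratum on the training folds only.
        counts = {}
        for i in range(len(treatment)):
            if fold_of[i] != k:
                n_treated, n_total = counts.get(strata[i], (0, 0))
                counts[strata[i]] = (n_treated + treatment[i], n_total + 1)
        # "Predict": score held-out rows with the model that never saw them.
        for i in range(len(treatment)):
            if fold_of[i] == k:
                n_treated, n_total = counts.get(strata[i], (0, 0))
                ps[i] = (n_treated + 1) / (n_total + 2)  # Laplace smoothing
    return ps


# Toy data (hypothetical): one categorical confounder and a binary treatment.
strata = ["young", "old"] * 10
treatment = [1, 0, 1, 0, 1, 1, 0, 0, 1, 0] * 2
ps = out_of_fold_propensity(strata, treatment, n_splits=5)
```

Scoring in-sample instead would let a flexible model memorize treatment assignment, inflating the AUC and understating overlap.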

outcome_fit(outcome_model=None)[source]#

Fit a regression model to predict outcome from confounders only.

Uses a preprocessing+CatBoost regressor pipeline with K-fold cross_val_predict to generate out-of-fold predictions. CatBoost uses all available threads and handles categorical features natively. Returns an OutcomeModel instance containing predicted outcomes and diagnostic methods.

The outcome model predicts the baseline outcome from confounders only; the treatment column is excluded, so the fit metrics reflect how much of the outcome the confounders alone explain.

Parameters:

outcome_model (Optional[Any]) – Custom regression model to use. If None, uses CatBoostRegressor.

Returns:

An OutcomeModel instance with:

  • scores: RMSE and MAE regression metrics

  • shap: SHAP values DataFrame property for outcome prediction

Return type:

OutcomeModel

confounders_roc_auc(ps=None)[source]#

Compute ROC AUC of treatment assignment vs. estimated propensity score.

Interpretation: Higher AUC means treatment is more predictable from confounders, indicating stronger systematic differences between groups (potential confounding). Values near 0.5 suggest random-like assignment.

Return type:

float

Parameters:

ps (ndarray | None)

positivity_check(ps=None, bounds=(0.05, 0.95))[source]#

Check overlap/positivity based on propensity score thresholds.

Returns a dict with:

  • bounds: (low, high) thresholds used

  • share_below: fraction with PS < low

  • share_above: fraction with PS > high

  • flag: heuristic boolean, True if the tails collectively exceed ~2%

Return type:

Dict[str, Any]

Parameters:
  • ps (ndarray | None)

  • bounds (Tuple[float, float])

plot_ps_overlap(ps=None)[source]#

Plot overlaid histograms of propensity scores for treated vs control.

Useful to visually assess group overlap. Does not return data; it draws on the current matplotlib figure.

Parameters:

ps (ndarray | None)

confounders_means()[source]#

Comprehensive confounders balance assessment with means by treatment group.

Returns a DataFrame with detailed balance information including:

  • Mean values of each confounder for control group (treatment=0)

  • Mean values of each confounder for treated group (treatment=1)

  • Absolute difference between treatment groups

  • Standardized Mean Difference (SMD) for formal balance assessment

This method provides a comprehensive view of confounder balance by showing the actual mean values alongside the standardized differences, making it easier to understand both the magnitude and direction of imbalances.

Returns:

DataFrame with confounders as index and the following columns:

  • mean_t_0: mean value for control group (treatment=0)

  • mean_t_1: mean value for treated group (treatment=1)

  • abs_diff: absolute difference abs(mean_t_1 - mean_t_0)

  • smd: standardized mean difference (Cohen’s d)

Return type:

pd.DataFrame

Notes

SMD values > 0.1 in absolute value typically indicate meaningful imbalance. Categorical variables are automatically converted to dummy variables.
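As a worked illustration, one row of the balance table can be computed by hand. The sketch assumes the SMD denominator is the square root of the average of the two group sample variances, which is one common convention for Cohen's d; the library's exact formula is not stated on this page, and the helper name `balance_row` and the toy ages are hypothetical.

```python
import math
import statistics


def balance_row(treated, control):
    """Means, absolute difference and SMD for a single numeric confounder."""
    m1, m0 = statistics.fmean(treated), statistics.fmean(control)
    v1, v0 = statistics.variance(treated), statistics.variance(control)
    pooled_sd = math.sqrt((v1 + v0) / 2.0)  # assumed pooling convention
    return {
        "mean_t_0": m0,
        "mean_t_1": m1,
        "abs_diff": abs(m1 - m0),
        "smd": (m1 - m0) / pooled_sd,
    }


# Toy ages (hypothetical): treated units are two years older on average.
age_treated = [30.0, 32.0, 34.0]
age_control = [28.0, 30.0, 32.0]
row = balance_row(age_treated, age_control)
```

Because the SMD is scale-free, it allows comparing imbalance across confounders measured in different units, which raw mean differences do not.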

Examples

>>> eda = CausalEDA(causal_data)
>>> balance = eda.confounders_means()
>>> print(balance.head())
             mean_t_0  mean_t_1  abs_diff       smd
confounders
age              29.5      31.2      1.7     0.085
income        45000.0   47500.0   2500.0     0.125
education         0.25      0.35      0.1     0.215

outcome_plots(treatment=None, target=None, bins=30, density=True, alpha=0.5, figsize=(7, 4), sharex=True)[source]#

Plot the distribution of the outcome for every treatment on one plot, and also produce a boxplot by treatment to visualize outliers.

Parameters:
  • treatment (Optional[str]) – Treatment column name. Defaults to the treatment stored in the CausalEDA data.

  • target (Optional[str]) – Target/outcome column name. Defaults to the outcome stored in the CausalEDA data.

  • bins (int) – Number of bins for histograms when the outcome is numeric.

  • density (bool) – Whether to normalize histograms to form a density.

  • alpha (float) – Transparency for overlaid histograms.

  • figsize (tuple) – Figure size for the plots.

  • sharex (bool) – If True and the outcome is numeric, use the same x-limits across treatments.

Returns:

(fig_distribution, fig_boxplot)

Return type:

Tuple[matplotlib.figure.Figure, matplotlib.figure.Figure]

treatment_features()[source]#

Return SHAP values from the fitted propensity score model.

This method extracts SHAP values from the propensity score model that was trained during fit_propensity(). SHAP values show the directional contribution of each feature to treatment assignment prediction, where positive values increase treatment probability and negative values decrease it.

Returns:

For CatBoost models: DataFrame with columns ‘feature’ and ‘shap_mean’, where ‘shap_mean’ represents the mean SHAP value across all samples. Positive values indicate features that increase treatment probability, negative values indicate features that decrease treatment probability.

For sklearn models: DataFrame with columns ‘feature’ and ‘importance’ (absolute coefficient values, for backward compatibility).

Return type:

pd.DataFrame

Raises:

RuntimeError – If fit_propensity() has not been called yet, or if the fitted model does not support SHAP values extraction.

Examples

>>> eda = CausalEDA(data)
>>> ps = eda.fit_propensity()  # Must be called first
>>> shap_df = eda.treatment_features()
>>> print(shap_df.head())
   feature  shap_mean
0  age         0.45  # Positive: increases treatment prob
1  income     -0.32  # Negative: decreases treatment prob
2  education   0.12  # Positive: increases treatment prob

Selected methods:

CausalEDA.data_shape

Return the shape information of the causal dataset.

CausalEDA.outcome_stats

Comprehensive outcome statistics grouped by treatment.

CausalEDA.fit_propensity

Estimate cross-validated propensity scores P(T=1|X).

CausalEDA.confounders_roc_auc

Compute ROC AUC of treatment assignment vs. estimated propensity score.

CausalEDA.positivity_check

Check overlap/positivity based on propensity score thresholds.

CausalEDA.plot_ps_overlap

Plot overlaid histograms of propensity scores for treated vs control.

CausalEDA.confounders_means

Comprehensive confounders balance assessment with means by treatment group.

CausalEDA.outcome_fit

Fit a regression model to predict outcome from confounders only.

CausalEDA.outcome_plots

Plot the distribution of the outcome for every treatment on one plot, and also produce a boxplot by treatment to visualize outliers.

CausalEDA.treatment_features

Return SHAP values from the fitted propensity score model.
