Design Module

The causalkit.design module provides utilities for designing experiments and splitting traffic.

Overview

This module includes functions for:

  • Splitting traffic for experiments with customizable ratios

  • Supporting stratified splitting to maintain distribution of key variables

API Reference

Utility functions for splitting traffic data from DataFrames.

causalkit.design.traffic_splitter.split_traffic(df, split_ratio=0.5, stratify_column=None, random_state=None)

Split a DataFrame into multiple parts based on the specified ratio.

Parameters:
  • df (pd.DataFrame) – The input DataFrame containing traffic data.

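  • split_ratio (float or list of floats, default 0.5) – If float, represents the proportion of the DataFrame to include in the first split. If list, each value is the proportion for one split, and the remainder (1 minus their sum) goes to a final split, so the values should sum to less than 1. For example, [0.7, 0.2] yields three splits of 70%, 20%, and 10%.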

  • stratify_column (str, optional) – Column name to use for stratified splitting. If provided, each split preserves the proportion of each value found in this column.

  • random_state (int, optional) – Random seed for reproducibility.

Returns:

A tuple containing the split DataFrames. If split_ratio is a float, returns a tuple of two DataFrames. If split_ratio is a list, returns a tuple with length equal to len(split_ratio) + 1.

Return type:

tuple of pd.DataFrame
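
The mapping from split_ratio to split sizes can be sketched as below. This is an illustration only, not the causalkit implementation: the helper name split_sizes and its rounding rule are assumptions made for this sketch.

```python
from typing import List, Sequence, Union


def split_sizes(n_rows: int, split_ratio: Union[float, Sequence[float]]) -> List[int]:
    """Return the number of rows per split (hypothetical helper, not causalkit code)."""
    # A single number is treated as the share of the first split.
    ratios = [split_ratio] if isinstance(split_ratio, (int, float)) else list(split_ratio)
    if any(r <= 0 for r in ratios) or sum(ratios) >= 1:
        raise ValueError("ratios must be positive and sum to less than 1")
    # Cumulative boundaries rounded to whole rows (the rounding rule is an assumption).
    bounds, running = [0], 0.0
    for r in ratios:
        running += r
        bounds.append(round(n_rows * running))
    bounds.append(n_rows)  # the final split takes whatever is left
    return [hi - lo for lo, hi in zip(bounds, bounds[1:])]
```

For 100 rows, split_sizes(100, 0.8) gives [80, 20] and split_sizes(100, [0.7, 0.2]) gives [70, 20, 10], the same lengths as in the Examples for split_traffic.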

Examples

>>> import pandas as pd
>>> from causalkit.design.traffic_splitter import split_traffic
>>> df = pd.DataFrame({'user_id': range(100), 'group': ['A', 'B'] * 50})
>>> train_df, test_df = split_traffic(df, split_ratio=0.8, random_state=42)
>>> len(train_df), len(test_df)
(80, 20)
>>> train_df, val_df, test_df = split_traffic(df, split_ratio=[0.7, 0.2], random_state=42)
>>> len(train_df), len(val_df), len(test_df)
(70, 20, 10)
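
The idea behind stratified splitting (stratify_column) can be sketched without pandas: group row positions by label, shuffle within each group, and cut every group at the same ratio. The function name stratified_split_indices and its two-way split are assumptions for illustration; causalkit may implement stratification differently.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Optional, Sequence, Tuple


def stratified_split_indices(
    labels: Sequence[Hashable],
    first_ratio: float = 0.5,
    random_state: Optional[int] = None,
) -> Tuple[List[int], List[int]]:
    """Split row positions in two while keeping each label's share roughly equal."""
    rng = random.Random(random_state)  # local RNG so a fixed seed repeats the split
    by_label = defaultdict(list)
    for position, label in enumerate(labels):
        by_label[label].append(position)

    first, second = [], []
    for positions in by_label.values():
        rng.shuffle(positions)  # randomize the order within the stratum
        cut = round(len(positions) * first_ratio)
        first.extend(positions[:cut])
        second.extend(positions[cut:])
    return sorted(first), sorted(second)


labels = ["A", "B"] * 50
first, second = stratified_split_indices(labels, first_ratio=0.8, random_state=42)
```

With the 50/50 labels above, the first split holds 40 rows of each label and the second holds 10 of each. With a DataFrame, the two position lists could be passed to df.iloc to materialize the splits.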

Utility functions for calculating Minimum Detectable Effect (MDE) for experimental design.

causalkit.design.mde.calculate_mde(sample_size, baseline_rate=None, variance=None, alpha=0.05, power=0.8, data_type='conversion', ratio=0.5)

Calculate the Minimum Detectable Effect (MDE) for conversion or continuous data.

Parameters:
  • sample_size (int or tuple of int) – Total sample size or a tuple of (control_size, treatment_size). If a single integer is provided, the sample will be split according to the ratio parameter.

  • baseline_rate (float, optional) – Baseline conversion rate (for conversion data) or baseline mean (for continuous data). Required for conversion data.

  • variance (float or tuple of float, optional) – Variance of the data. For conversion data, this is calculated from the baseline rate if not provided. For continuous data, this parameter is required. Can be a single float (assumed same for both groups) or a tuple of (control_variance, treatment_variance).

  • alpha (float, default 0.05) – Significance level (Type I error rate).

  • power (float, default 0.8) – Statistical power (1 - Type II error rate).

  • data_type (str, default 'conversion') – Type of data. Either 'conversion' for binary/conversion data or 'continuous' for continuous data.

  • ratio (float, default 0.5) – Fraction of the total sample allocated to the control group. Used only when sample_size is a single integer.
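
How sample_size and ratio combine into per-group sizes can be sketched as follows. The helper resolve_group_sizes is hypothetical, and truncating the control size to an integer is an assumption of this sketch; the library's rounding may differ.

```python
from typing import Tuple, Union


def resolve_group_sizes(
    sample_size: Union[int, Tuple[int, int]], ratio: float = 0.5
) -> Tuple[int, int]:
    """Return (control_size, treatment_size); hypothetical helper, not causalkit code."""
    if isinstance(sample_size, tuple):
        control, treatment = sample_size  # explicit sizes: ratio is not used
        return control, treatment
    control = int(sample_size * ratio)  # truncation is an assumption of this sketch
    return control, sample_size - control
```

Under these assumptions, resolve_group_sizes(1000) returns (500, 500), and a tuple such as (500, 500) is passed through unchanged.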

Returns:

A dictionary containing:

  • 'mde': The minimum detectable effect (absolute).

  • 'mde_relative': The minimum detectable effect relative to the baseline, expressed as a fraction (0.5 means 50% of the baseline).

  • 'parameters': The parameters used for the calculation.

Return type:

Dict[str, Any]

Examples

>>> from causalkit.design.mde import calculate_mde
>>> # Calculate MDE for conversion data with 1000 total sample size and 10% baseline conversion rate
>>> calculate_mde(1000, baseline_rate=0.1, data_type='conversion')
{'mde': 0.0527..., 'mde_relative': 0.5272..., 'parameters': {...}}
>>> # Calculate MDE for continuous data with 500 samples in each group and variance of 4
>>> calculate_mde((500, 500), variance=4, data_type='continuous')
{'mde': 0.3482..., 'mde_relative': None, 'parameters': {...}}

Notes

For conversion data, the MDE is calculated using the formula:

MDE = (z_α/2 + z_β) * sqrt(p1*(1-p1)/n1 + p2*(1-p2)/n2)

For continuous data, the MDE is calculated using the formula:

MDE = (z_α/2 + z_β) * sqrt(σ1²/n1 + σ2²/n2)

where:

  • z_α/2 is the critical value for significance level α

  • z_β is the critical value for power

  • p1 and p2 are the conversion rates in the control and treatment groups

  • σ1² and σ2² are the variances in the control and treatment groups

  • n1 and n2 are the sample sizes in the control and treatment groups
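
The two formulas can be evaluated directly with the Python standard library. The sketch below is not the causalkit implementation: the function name mde_from_variances is hypothetical, and for conversion data it assumes p2 = p1 (the baseline rate in both groups), which the formula above does not specify. Its results are therefore not guaranteed to match calculate_mde digit for digit.

```python
from math import sqrt
from statistics import NormalDist


def mde_from_variances(var1, n1, var2, n2, alpha=0.05, power=0.8):
    """Evaluate (z_alpha/2 + z_beta) * sqrt(var1/n1 + var2/n2)."""
    std_normal = NormalDist()
    z_alpha_half = std_normal.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = std_normal.inv_cdf(power)                # quantile matching the power
    return (z_alpha_half + z_beta) * sqrt(var1 / n1 + var2 / n2)


# Continuous data: variance 4 in both groups, 500 observations each.
mde_continuous = mde_from_variances(4.0, 500, 4.0, 500)

# Conversion data: Bernoulli variance p * (1 - p). Using the baseline rate for
# both groups (p2 = p1) is an assumption of this sketch.
p = 0.1
mde_conversion = mde_from_variances(p * (1 - p), 500, p * (1 - p), 500)
mde_relative = mde_conversion / p  # effect as a fraction of the baseline

print(round(mde_continuous, 4), round(mde_conversion, 4), round(mde_relative, 4))
```

Because the sample sizes appear under the square root, quadrupling both n1 and n2 halves the MDE, which is a quick sanity check for any implementation of these formulas.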