Data schema & when to measure each variable
Table layout (wide format)
| column | meaning | type | when to measure |
|---|---|---|---|
| | unit identifier | int/str | baseline |
| D | treatment indicator (0/1) | int | t₁ (decision/exposure time) |
| Y | outcome (numeric or binary) | float/int | t₂ > t₁ (follow-up) |
| X | confounders (all causes of both D and Y) | float/int | t₀ < t₁ (strictly pre-treatment) |
Timing rule of thumb (avoid post-treatment bias):
t₀ (baseline): measure X (pre-treatment confounders only)
↓
t₁ (assign/observe D)
↓
t₂ (observe Y) — make sure no X measured here leaks into the model
Do not include mediators (variables affected by D) in X.
DML/IRM assumes X is pre-treatment.
If panel data exist, freeze a snapshot of X right before t₁.
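One way to freeze such a snapshot from panel data is sketched below with pandas. The column names `unit_id`, `obs_time`, `t1`, and `x_age` are hypothetical and only serve the illustration; the idea is to keep, for each unit, the last covariate row observed strictly before its treatment time.

```python
import pandas as pd


def snapshot_before_treatment(panel: pd.DataFrame, treat_time: pd.DataFrame) -> pd.DataFrame:
    """Keep, per unit, the latest covariate row observed strictly before t1.

    Assumes hypothetical columns: panel[unit_id, obs_time, ...covariates],
    treat_time[unit_id, t1].
    """
    merged = panel.merge(treat_time, on="unit_id", how="inner")
    # Drop anything measured at or after the treatment time (post-treatment leakage).
    pre = merged[merged["obs_time"] < merged["t1"]]
    # Index label of the most recent pre-treatment observation for each unit.
    idx = pre.groupby("unit_id")["obs_time"].idxmax()
    return pre.loc[idx].drop(columns=["t1"]).reset_index(drop=True)


# Toy panel: unit 1 is treated at t1=3, unit 2 at t1=4.
panel = pd.DataFrame(
    {
        "unit_id": [1, 1, 1, 2, 2],
        "obs_time": [1, 2, 5, 1, 4],
        "x_age": [30, 31, 34, 50, 53],
    }
)
treat_time = pd.DataFrame({"unit_id": [1, 2], "t1": [3, 4]})

snap = snapshot_before_treatment(panel, treat_time)
print(snap)
```

Units with no observation before t₁ are dropped by this sketch, so it is worth comparing the number of rows in the snapshot with the number of treated and control units you expect.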
Question & estimand
State the causal question (e.g., “Effect of D on Y for the target population at t₁–t₂”).
Choose the estimand: ATE (average treatment effect) or ATTE (average treatment effect on the treated).
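In potential-outcome notation, with \(Y(1)\) and \(Y(0)\) the outcomes under treatment and control, the two estimands have the following standard definitions:

\[
\text{ATE} = E[\,Y(1) - Y(0)\,], \qquad \text{ATTE} = E[\,Y(1) - Y(0) \mid D = 1\,].
\]

The ATE averages over the whole target population, while the ATTE averages only over units that actually received the treatment.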
Identification assumptions
Unconfoundedness: \((Y(1), Y(0)) \perp D \mid X\), with your chosen pre-treatment \(X\).
Overlap (positivity): \(0 < e(X) = P(D=1 \mid X) < 1\) almost surely.
Consistency/SUTVA: well-defined treatment, no interference.
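Overlap is the one assumption in this list that can be probed directly from data. A minimal sketch, assuming you already have estimated propensity scores from any classifier: summarise their range and the share falling outside a trimming band. The threshold `eps=0.01` is an illustrative choice made here, not a value from this checklist.

```python
import numpy as np


def overlap_summary(e_hat, eps=0.01):
    """Summarise estimated propensities e_hat = P(D=1 | X).

    eps is an assumed trimming threshold; scores outside [eps, 1 - eps]
    signal units with little or no counterpart in the other group.
    """
    e_hat = np.asarray(e_hat, dtype=float)
    outside = (e_hat < eps) | (e_hat > 1 - eps)
    return {
        "min": float(e_hat.min()),
        "max": float(e_hat.max()),
        "share_outside": float(outside.mean()),
    }


# Toy propensities: the first and last units are near-deterministic.
e_hat = np.array([0.005, 0.20, 0.50, 0.80, 0.999])
summary = overlap_summary(e_hat)
print(summary)
```

A large share outside the band suggests restricting the target population or revisiting the covariate set before trusting the estimate.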
Score check: psi_mean, derivatives, psi_kurtosis
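As a rough illustration of what such a score check might look like, the sketch below computes the AIPW (doubly robust) score commonly used for the ATE in an IRM and reports its mean and kurtosis. Reading `psi_mean` and `psi_kurtosis` as these two quantities is an assumption made here, the derivative checks are not covered, and the function name `aipw_score_check` is hypothetical. The true nuisance functions are plugged in for simplicity; in practice they would be cross-fitted ML predictions.

```python
import numpy as np


def aipw_score_check(y, d, g0, g1, m):
    """ATE from the AIPW score plus simple score diagnostics.

    g0, g1: predictions of E[Y | D=0, X] and E[Y | D=1, X];
    m: predictions of P(D=1 | X).
    """
    y, d, g0, g1, m = (np.asarray(a, dtype=float) for a in (y, d, g0, g1, m))
    # Part of the score that does not depend on theta.
    psi_b = g1 - g0 + d * (y - g1) / m - (1 - d) * (y - g0) / (1 - m)
    theta = psi_b.mean()
    psi = psi_b - theta  # score evaluated at the estimate
    centered = psi - psi.mean()
    kurtosis = (centered**4).mean() / (centered**2).mean() ** 2
    se = psi.std(ddof=1) / np.sqrt(len(psi))
    return {
        "theta": float(theta),
        "se": float(se),
        "psi_mean": float(psi.mean()),
        "psi_kurtosis": float(kurtosis),
    }


# Synthetic data with a true effect of 2 and known nuisance functions.
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
m_true = 1.0 / (1.0 + np.exp(-0.5 * x))
d = rng.binomial(1, m_true)
y = 2.0 * d + x + rng.normal(size=n)

check = aipw_score_check(y, d, g0=x, g1=x + 2.0, m=m_true)
print(check)
```

By construction the score mean is numerically zero at the estimate; a very large kurtosis would point to a few units with extreme weights, which usually traces back to poor overlap.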
Report:
p-value and CI
Assumption tests
Why these variables were chosen
Number of units in the study
Conclusion for the decision-maker and which decisions would follow
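For the p-value and CI item, a minimal sketch of the usual normal-approximation (Wald) summary is shown below. It assumes you already have a point estimate and its standard error; the function name `wald_summary` and the example numbers are hypothetical.

```python
from statistics import NormalDist


def wald_summary(theta, se, alpha=0.05):
    """Two-sided p-value and (1 - alpha) CI under a normal approximation."""
    z = theta / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    q = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return {"ci": (theta - q * se, theta + q * se), "p_value": p_value}


# Illustrative numbers only.
report = wald_summary(theta=2.0, se=0.5)
print(report)
```

Reporting the interval alongside the p-value lets the decision-maker see the range of plausible effect sizes, not just whether the effect differs from zero.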