Rubin’s Rules

How to combine results from multiple imputed datasets.

The Problem

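After imputation, you have \(m\) estimates \(\hat{\theta}_1, \ldots, \hat{\theta}_m\), one from each imputed dataset, each with its own standard error \(SE_i\).

Averaging the point estimates is fine, but averaging their variances alone ignores the extra uncertainty introduced by imputation and makes the pooled standard error too small. Rubin’s rules account for both sources of variance.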

Pooled Estimate

\[\bar{\theta} = \frac{1}{m}\sum_{i=1}^{m} \hat{\theta}_i\]

Within-Imputation Variance (average sampling variance)

\[\bar{U} = \frac{1}{m}\sum_{i=1}^{m} SE_i^2\]

Between-Imputation Variance (variance due to missing data)

\[B = \frac{1}{m-1}\sum_{i=1}^{m} (\hat{\theta}_i - \bar{\theta})^2\]

Total Variance

\[T = \bar{U} + B + \frac{B}{m}\]

Standard Error

\[SE = \sqrt{T}\]
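The formulas above can be sketched in a few lines of plain Python. This is a minimal illustration, not the API of any particular package: the function name `pool_rubin` and the input values are hypothetical, and it assumes you already have one estimate and one standard error per imputed dataset.

```python
import math
from statistics import mean, variance

def pool_rubin(estimates, std_errors):
    """Pool m point estimates and their standard errors with Rubin's rules."""
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need at least two imputations and one SE per estimate")
    theta_bar = mean(estimates)                    # pooled estimate
    u_bar = mean([se ** 2 for se in std_errors])   # within-imputation variance
    b = variance(estimates)                        # between-imputation variance (m - 1 denominator)
    t = u_bar + b + b / m                          # total variance
    return {"estimate": theta_bar, "U_bar": u_bar, "B": b, "T": t, "SE": math.sqrt(t)}

# Hypothetical values, for illustration only
pooled = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

With these made-up inputs, the within-imputation variance is 0.25 and the between-imputation variance is 1.0, so most of the total variance comes from disagreement between imputations.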

Confidence Interval

\[\bar{\theta} \pm t_{df} \times SE\]

where \(t_{df}\) is the critical value of the t-distribution with adjusted degrees of freedom.
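The text does not say which adjustment is meant. One common choice is Rubin's classical large-sample formula, shown here as an assumption; many packages instead apply a small-sample correction (Barnard and Rubin), which gives smaller values when the complete-data sample is small.

\[df = (m - 1)\left(1 + \frac{\bar{U}}{(1 + 1/m)B}\right)^2\]

As \(B\) approaches zero, \(df\) grows without bound and the interval approaches the usual normal-based interval.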

Fraction of Missing Information

\[FMI = \frac{(1 + 1/m)B}{T}\]
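As a rough sketch, the ratio above can be computed directly from the within- and between-imputation variances. The function name and the input values below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def fraction_missing_information(u_bar, b, m):
    """Return (1 + 1/m) * B / T, where T = U_bar + B + B/m."""
    t = u_bar + b + b / m
    if t <= 0:
        raise ValueError("total variance must be positive")
    return (1 + 1 / m) * b / t

# Hypothetical inputs, for illustration only
fmi = fraction_missing_information(u_bar=0.25, b=1.0, m=3)
```

Here the result is 16/19, about 0.84, meaning most of the total variance is attributable to the missing data. When the imputations agree exactly (\(B = 0\)), the ratio is 0.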

Interpretation:

Old rule: m = 5

Modern recommendation: m = 20+

High missingness: m = 50-100

Rule of thumb: m ≈ percentage of incomplete cases
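One way to apply this rule of thumb is to count the rows with at least one missing value and use that percentage as m. The sketch below is illustrative only: the function name is hypothetical, missing values are assumed to be represented as `None`, and the floor of 5 (borrowed from the old rule) is an assumption rather than part of the rule itself.

```python
import math

def suggest_num_imputations(rows, minimum=5):
    """Suggest m as the percentage of rows containing at least one missing value."""
    if not rows:
        raise ValueError("rows must be non-empty")
    # A row counts as incomplete if any of its values is missing (None)
    incomplete = sum(1 for row in rows if any(v is None for v in row))
    pct_incomplete = 100 * incomplete / len(rows)
    # Round up, and never go below the assumed floor
    return max(minimum, math.ceil(pct_incomplete))

# Hypothetical data: two of four rows have a missing value
rows = [(1.0, 2.0), (None, 3.0), (4.0, None), (5.0, 6.0)]
m = suggest_num_imputations(rows)
```

In this toy dataset half of the rows are incomplete, so the suggestion is m = 50.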

mice.fit('outcome ~ predictor')
pooled = mice.pool(summ=True)

Output includes:

See Pooling Analysis for practical usage.