Rubin’s Rules

How to combine results from multiple imputed datasets.

The Problem

After imputation, you have m estimates: \(\hat{\theta}_1, ..., \hat{\theta}_m\)

Simple averaging ignores imputation uncertainty. Rubin’s rules provide the correct solution.

Basic Formulas

Pooled Estimate

\[\bar{\theta} = \frac{1}{m}\sum_{i=1}^{m} \hat{\theta}_i\]

Within-Imputation Variance (average sampling variance)

\[\bar{U} = \frac{1}{m}\sum_{i=1}^{m} SE_i^2\]

Between-Imputation Variance (variance due to missing data)

\[B = \frac{1}{m-1}\sum_{i=1}^{m} (\hat{\theta}_i - \bar{\theta})^2\]

Total Variance

\[T = \bar{U} + B + \frac{B}{m}\]

Standard Error

\[SE = \sqrt{T}\]

Confidence Interval

\[\bar{\theta} \pm t_{df} \times SE\]

where \(t_{df}\) is from t-distribution with adjusted degrees of freedom.

Fraction of Missing Information (FMI)

\[FMI = \frac{(1 + 1/m)B}{T}\]
Interpretation:
  • FMI = 0: No impact from missing data

  • FMI = 0.3: 30% of uncertainty due to missingness

  • FMI > 0.3: Consider more imputations

How Many Imputations?

Old rule: m = 5 Modern recommendation: m = 20+ High missingness: m = 50-100

Rule of thumb: m ≈ percentage of incomplete cases

Usage in mice-py

mice.fit('outcome ~ predictor')
pooled = mice.pool(summ=True)
Output includes:
  • Estimate: Pooled coefficient

  • Std.Error: Pooled SE

  • P>|t|: p-value

  • FMI: Fraction of missing information

See Pooling Analysis for practical usage.