Pooling Analysis

After creating multiple imputed datasets, you need to analyze them and combine the results. This guide explains how to pool results using Rubin’s rules.

Why Pool Results?

You have multiple complete datasets (e.g., 5 or 10), each with different imputed values. To get final estimates, you must:

Fit your model on each imputed dataset
Pool the results to get single estimates
Account for uncertainty from both within and between imputations

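Averaging the imputed values into one dataset, or analyzing only one imputed dataset, treats imputed values as if they had been observed. Both shortcuts underestimate uncertainty and produce invalid inferences.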

Rubin’s Rules

Rubin’s rules provide a principled way to combine estimates from multiple imputed datasets, properly accounting for the uncertainty introduced by imputation.

Basic Concept

For each parameter (e.g., regression coefficient):

Fit the model on each imputed dataset → get m estimates and standard errors
Calculate within-imputation variance (average of squared SEs)
Calculate between-imputation variance (variance of estimates across imputations)
Total variance = within + between + between / m (the last term corrects for using a finite number of imputations)
Use these to construct confidence intervals and perform inference
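
The steps above can be sketched for a single parameter in plain Python. This is a minimal illustration of Rubin's rules, not mice-py's implementation; the function name pool_scalar and the input numbers are made up for the example.

```python
from math import sqrt
from statistics import mean, variance


def pool_scalar(estimates, std_errors):
    """Pool one parameter across m imputations using Rubin's rules."""
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need matching estimates and SEs from at least 2 imputations")

    q_bar = mean(estimates)                        # pooled point estimate
    within = mean([se ** 2 for se in std_errors])  # W: average squared SE
    between = variance(estimates)                  # B: sample variance (divides by m - 1)
    total = within + between + between / m         # T = W + B + B/m

    return {
        "estimate": q_bar,
        "within": within,
        "between": between,
        "total": total,
        "std_error": sqrt(total),
    }


# Illustrative inputs only: one coefficient estimated on 5 imputed datasets
result = pool_scalar(
    estimates=[0.80, 0.85, 0.78, 0.90, 0.82],
    std_errors=[0.14, 0.15, 0.13, 0.14, 0.14],
)
for name, value in result.items():
    print(f"{name:>10}: {value:.5f}")
```

Because the between-imputation terms are never negative, the pooled standard error cannot be smaller than the square root of the within-imputation variance.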

Using mice-py for Pooling

Simple Workflow

from imputation import MICE

# 1. Perform imputation
mice = MICE(df)
mice.impute(n_imputations=5, maxit=10)

# 2. Fit a model using formula syntax
mice.fit('outcome ~ predictor1 + predictor2 + predictor3')

# 3. Pool results
pooled = mice.pool(summ=True)
print(pooled)

The fit() method uses formula syntax (like R or statsmodels) and fits the model on all imputed datasets.

The pool() method combines results using Rubin’s rules.

Understanding the Output

The pooled results include:

Estimate: The pooled coefficient (average across imputed datasets)
Std.Error: The pooled standard error (accounting for both within and between variance)
t-statistic: The test statistic for the coefficient
df: Degrees of freedom (adjusted for imputation)
p-value (column P>|t|): Statistical significance
95% CI Lower/Upper (columns [0.025 and 0.975]): Confidence interval bounds
FMI: Fraction of Missing Information (see below)

Example Output

                 Estimate  Std.Error  t-statistic     df   P>|t|  [0.025  0.975]    FMI
Intercept         45.234      5.321        8.502  42.15  <0.001  34.449  56.019  0.156
predictor1         0.823      0.142        5.796  38.27  <0.001   0.535   1.111  0.198
predictor2        -1.234      0.387       -3.189  51.83   0.002  -2.012  -0.456  0.089
predictor3         2.156      0.921        2.341  45.67   0.024   0.301   4.011  0.132
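
In a Wald-type table like this one, the t-statistic is the pooled estimate divided by the pooled standard error. The short check below recomputes it from the rounded example rows; small differences in the last digit come from that rounding. The helper name t_statistic is hypothetical and not part of mice-py.

```python
# Rows copied from the example output: (Estimate, Std.Error, reported t-statistic)
rows = {
    "Intercept": (45.234, 5.321, 8.502),
    "predictor1": (0.823, 0.142, 5.796),
    "predictor2": (-1.234, 0.387, -3.189),
    "predictor3": (2.156, 0.921, 2.341),
}


def t_statistic(estimate, std_error):
    """Wald-type statistic: pooled estimate divided by pooled standard error."""
    return estimate / std_error


recomputed = {name: t_statistic(est, se) for name, (est, se, _) in rows.items()}

for name, (_, _, reported) in rows.items():
    print(f"{name:<11} reported={reported:>7.3f}  recomputed={recomputed[name]:>7.3f}")
```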

Fraction of Missing Information (FMI)

FMI indicates how much the uncertainty in your estimate is due to missing data:

FMI = 0: No missing information, equivalent to complete data analysis
FMI = 0.1: 10% of uncertainty is due to missingness
FMI = 0.5: Half the uncertainty is from missingness
FMI = 1: All of the uncertainty comes from missingness (rare)

If FMI > 0.3, consider using more imputations.
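
To make the definition concrete, here is a sketch of how FMI can be computed from the within-imputation variance W, the between-imputation variance B, and the number of imputations m. It assumes Rubin's (1987) large-sample degrees of freedom; mice-py may apply a small-sample adjustment, so the df and FMI it reports can differ slightly. The function name missing_information and the input values are illustrative.

```python
def missing_information(within, between, m):
    """Return (lambda, df, fmi) for one parameter.

    lambda: proportion of total variance attributable to missing data
    df:     Rubin's large-sample degrees of freedom
    fmi:    fraction of missing information (lambda corrected for finite df)
    """
    if within <= 0 or between < 0 or m < 2:
        raise ValueError("need within > 0, between >= 0 and m >= 2")

    extra = between + between / m      # variance added by missingness
    total = within + extra             # T = W + B + B/m
    if extra == 0:
        return 0.0, float("inf"), 0.0  # no between-imputation variation

    lam = extra / total
    r = extra / within                 # relative increase in variance
    df = (m - 1) / lam ** 2
    fmi = (r + 2 / (df + 3)) / (1 + r)
    return lam, df, fmi


# Illustrative values only
lam, df, fmi = missing_information(within=0.01964, between=0.0022, m=5)
print(f"lambda={lam:.3f}  df={df:.1f}  FMI={fmi:.3f}")
```

FMI is always at least as large as lambda, and the two converge as the degrees of freedom grow.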

Formula Syntax

The fit() method uses Patsy formula syntax:

Basic Formulas

# Simple linear regression
mice.fit('y ~ x')

# Multiple predictors
mice.fit('y ~ x1 + x2 + x3')

# With interaction
mice.fit('y ~ x1 + x2 + x1:x2')

# Or equivalently
mice.fit('y ~ x1 * x2')  # Includes x1, x2, and x1:x2

# Polynomial terms
mice.fit('y ~ x + I(x**2)')

# No intercept
mice.fit('y ~ x - 1')

Categorical Variables

# Categorical predictor (automatically creates dummies)
mice.fit('income ~ age + C(education)')

# Change reference category
mice.fit('income ~ age + C(education, Treatment("High School"))')

Transformations

# Pre-transformed outcome (assumes a log_y column already exists in the data)
mice.fit('log_y ~ x1 + x2')

# Use numpy functions
mice.fit('y ~ np.log(x1) + np.sqrt(x2)')

Advanced Pooling

Pool Without Summary

Get detailed results for each imputation:

# Get individual results and pooled results
pooled_detailed = mice.pool(summ=False)

# Access individual imputation results
individual_results = pooled_detailed['individual']

# Access pooled results
pooled_results = pooled_detailed['pooled']

Custom Analysis

For models not supported by fit(), manually fit and pool:

import statsmodels.api as sm
from imputation.pooling import pool_estimates

# Fit on each imputed dataset, keeping estimates and standard errors
coefficients = []
std_errors = []

for dataset in mice.imputed_datasets:
    X = sm.add_constant(dataset[['predictor1', 'predictor2']])
    y = dataset['outcome']

    # statsmodels is used here because it reports standard errors;
    # scikit-learn's LogisticRegression does not provide them
    result = sm.Logit(y, X).fit(disp=0)

    coefficients.append(result.params.to_numpy())
    std_errors.append(result.bse.to_numpy())

# Pool manually using Rubin's rules
pooled = pool_estimates(coefficients, std_errors)

Interpreting Pooled Results

Statistical Significance

Use the pooled p-values and confidence intervals for inference:

pooled = mice.pool(summ=True)

# Check significance
significant = pooled[pooled['P>|t|'] < 0.05]
print("Significant predictors:")
print(significant)

Pooled standard errors are typically larger than those from any single imputed dataset because they also include between-imputation variance. A predictor that looks significant in one imputed dataset may therefore not be significant after proper pooling.

Effect Sizes

The pooled estimates are your best point estimates:

# Extract coefficient for predictor1
coef = pooled.loc['predictor1', 'Estimate']
ci_lower = pooled.loc['predictor1', '[0.025']
ci_upper = pooled.loc['predictor1', '0.975]']

print(f"predictor1: {coef:.3f} (95% CI: [{ci_lower:.3f}, {ci_upper:.3f}])")

Model Comparison

When comparing models, use pooled results:

# Fit two models
mice.fit('y ~ x1')
results_simple = mice.pool(summ=True)

mice.fit('y ~ x1 + x2 + x3')
results_complex = mice.pool(summ=True)

# Compare based on pooled coefficients and FMI

How Many Imputations?

General Guidelines

Minimum (5 imputations): Acceptable for low missingness (<10%)
Recommended (10-20 imputations): Good balance between computation and precision
High missingness (20-100 imputations): When missingness >30% or FMI >0.3

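Rule of thumb: Number of imputations ≈ percentage of incomplete cases (rows with at least one missing value)

Von Hippel (2020) suggests a quadratic rule instead: m ≈ 1 + 0.5 × (FMI / CV)², where CV is the acceptable coefficient of variation of the pooled standard error (for example, 0.05)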
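
As a rough planning aid, the quadratic rule from von Hippel (2020), m ≈ 1 + 0.5 × (FMI / CV)², can be evaluated in a few lines. This is a sketch under assumptions: the function name imputations_needed is hypothetical, the default CV of 0.05 is a common choice rather than a requirement, and the FMI you plug in would normally come from a small pilot run.

```python
from math import ceil


def imputations_needed(fmi, cv=0.05):
    """Quadratic rule: m = 1 + 0.5 * (fmi / cv) ** 2, rounded up.

    fmi: fraction of missing information (0 to 1), e.g. from a pilot run
    cv:  acceptable coefficient of variation of the pooled standard error
    """
    if not 0 <= fmi <= 1:
        raise ValueError("fmi must be between 0 and 1")
    if cv <= 0:
        raise ValueError("cv must be positive")
    raw = 1 + 0.5 * (fmi / cv) ** 2
    # Round first so floating-point noise does not bump the result up by one
    return ceil(round(raw, 9))


for fmi in (0.1, 0.2, 0.3, 0.4, 0.5):
    print(f"FMI={fmi:.1f} -> m={imputations_needed(fmi)}")
```

The required number grows with the square of FMI, so doubling the fraction of missing information roughly quadruples the number of imputations.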

Checking If You Have Enough

If FMI is high (>0.3) and results are unstable across repeated analyses, you may need more imputations:

# Check FMI
pooled = mice.pool(summ=True)
max_fmi = pooled['FMI'].max()

if max_fmi > 0.3:
    print(f"High FMI ({max_fmi:.2f}). Consider more imputations.")

Common Pitfalls

Don’t Use Single Imputation

❌ Wrong:

import statsmodels.formula.api as smf

# Using only the first imputed dataset
dataset = mice.imputed_datasets[0]
model = smf.ols('y ~ x1 + x2', data=dataset).fit()
print(model.summary())

✓ Correct:

# Fit on all and pool
mice.fit('y ~ x1 + x2')
pooled = mice.pool(summ=True)
print(pooled)

Don’t Average Imputed Values

❌ Wrong:

import pandas as pd
import statsmodels.formula.api as smf

# Averaging imputed datasets
averaged = pd.concat(mice.imputed_datasets).groupby(level=0).mean()
model = smf.ols('y ~ x1 + x2', data=averaged).fit()

This is actually single imputation and underestimates uncertainty!

✓ Correct: Use proper pooling with Rubin’s rules

Don’t Ignore Imputation Uncertainty

Standard errors from a single imputed dataset are too small. Always pool!

Checking Results

After pooling, check:

pooled = mice.pool(summ=True)

# Summary statistics
print(f"Mean FMI: {pooled['FMI'].mean():.3f}")
print(f"Max FMI: {pooled['FMI'].max():.3f}")
print(f"Mean df: {pooled['df'].mean():.1f}")

Tips for Better Pooling

More imputations: When in doubt, use more (20-50)
Check FMI: High values suggest need for more imputations
Complete convergence: Ensure MICE converged before pooling
Include all relevant variables: In both imputation and analysis models
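Be cautious with transformations: Pool on the scale the model estimates (for example, log odds rather than odds ratios), then back-transform the pooled estimate and interval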
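
One common reading of the transformation tip is that ratio-type quantities such as odds ratios should be pooled on the log scale, where estimates are closer to normally distributed, and only then converted back. The sketch below contrasts that with naively averaging the odds ratios; the numbers are made up for illustration.

```python
from math import exp, log
from statistics import mean

# Odds ratios for one predictor from 3 imputed datasets (illustrative values)
odds_ratios = [1.8, 2.0, 2.2]

# Pool on the log-odds scale, then back-transform
log_odds_ratios = [log(value) for value in odds_ratios]
pooled_log_or = mean(log_odds_ratios)
pooled_or = exp(pooled_log_or)

# Averaging directly on the odds-ratio scale gives a different (larger) answer
naive_or = mean(odds_ratios)

print(f"Pooled on log scale: {pooled_or:.4f}")
print(f"Naive average:       {naive_or:.4f}")
```

The same idea applies to interval bounds: compute them on the log scale from the pooled standard error, then exponentiate each bound.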

Next Steps

Read Best Practices for overall guidance
Review Rubin’s Rules for mathematical details
See complete examples in Examples