Pooling Analysis

After creating multiple imputed datasets, you need to analyze them and combine the results. This guide explains how to pool results using Rubin’s rules.

Why Pool Results?

You have multiple complete datasets (e.g., 5 or 10), each with different imputed values. To get final estimates, you must:

  1. Fit your model on each imputed dataset

  2. Pool the results to get single estimates

  3. Account for uncertainty from both within and between imputations

Averaging the imputed values into one dataset, or analyzing only one of the imputed datasets, understates uncertainty and produces invalid inferences.

Rubin’s Rules

Rubin’s rules provide a principled way to combine estimates from multiple imputed datasets, properly accounting for the uncertainty introduced by imputation.

Basic Concept

For each parameter (e.g., regression coefficient):

  1. Fit the model on each imputed dataset → get m estimates and standard errors

  2. Calculate within-imputation variance (average of squared SEs)

  3. Calculate between-imputation variance (variance of estimates across imputations)

  4. Total variance = within + between + between / m (the last term corrects for using a finite number of imputations, m)

  5. Use these to construct confidence intervals and perform inference
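The steps above can be sketched for a single parameter using only the Python standard library. This is an illustrative sketch, not mice-py’s implementation: the function name pool_scalar and the sample numbers are hypothetical, and the degrees of freedom follow Rubin’s original large-sample formula without any small-sample adjustment.

```python
import math
import statistics


def pool_scalar(estimates, std_errors):
    """Pool one parameter across m imputations with Rubin's rules (illustrative helper)."""
    m = len(estimates)
    q_bar = statistics.mean(estimates)                        # pooled point estimate
    within = statistics.mean([se ** 2 for se in std_errors])  # W: mean of squared SEs
    between = statistics.variance(estimates)                  # B: sample variance (divides by m - 1)
    total = within + (1 + 1 / m) * between                    # T = W + B + B / m

    if between == 0:
        df = math.inf  # estimates identical across imputations
    else:
        r = (1 + 1 / m) * between / within  # relative increase in variance
        df = (m - 1) * (1 + 1 / r) ** 2     # Rubin's large-sample degrees of freedom
    return {"estimate": q_bar, "std_error": math.sqrt(total), "df": df}


# Hypothetical estimates and standard errors from m = 3 imputed datasets
result = pool_scalar([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
print(result)
```

The pooled standard error here (about 1.53) exceeds the per-dataset standard error of 1.0 because the estimates disagree across imputations.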

Using mice-py for Pooling

Simple Workflow

from imputation import MICE

# 1. Perform imputation
mice = MICE(df)
mice.impute(n_imputations=5, maxit=10)

# 2. Fit a model using formula syntax
mice.fit('outcome ~ predictor1 + predictor2 + predictor3')

# 3. Pool results
pooled = mice.pool(summ=True)
print(pooled)

The fit() method uses formula syntax (like R or statsmodels) and fits the model on all imputed datasets.

The pool() method combines results using Rubin’s rules.

Understanding the Output

The pooled results include:

Estimate

The pooled coefficient (average across imputed datasets)

Std.Error

The pooled standard error (accounting for both within and between variance)

t-statistic

The test statistic for the coefficient

df

Degrees of freedom (adjusted for imputation)

P>|t|

The two-sided p-value for the coefficient

[0.025, 0.975]

Lower and upper bounds of the 95% confidence interval

FMI

Fraction of Missing Information (see below)

Example Output

                 Estimate  Std.Error  t-statistic     df   P>|t|  [0.025  0.975]    FMI
Intercept         45.234      5.321        8.502  42.15  <0.001  34.449  56.019  0.156
predictor1         0.823      0.142        5.796  38.27  <0.001   0.535   1.111  0.198
predictor2        -1.234      0.387       -3.189  51.83   0.002  -2.012  -0.456  0.089
predictor3         2.156      0.921        2.341  45.67   0.024   0.301   4.011  0.132

Fraction of Missing Information (FMI)

FMI estimates the proportion of the total sampling variance of an estimate that is attributable to missing data:

  • FMI = 0: No missing information, equivalent to complete data analysis

  • FMI = 0.1: 10% of uncertainty is due to missingness

  • FMI = 0.5: Half the uncertainty is from missingness

  • FMI = 1: Complete uncertainty from missingness (rare)

If FMI > 0.3, consider using more imputations.
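To make the FMI scale concrete, the sketch below computes it from the within- and between-imputation variances of one parameter. It is a hand-rolled illustration based on the commonly cited formulas; the function name fraction_missing_info and the input numbers are hypothetical, and mice-py may compute FMI differently. It also reports Rubin’s relative efficiency, 1 / (1 + FMI / m), which indicates how much precision a given m gives up compared with infinitely many imputations.

```python
def fraction_missing_info(within, between, m):
    """Return (lambda, FMI, relative efficiency) for one parameter (illustrative helper).

    Assumes between > 0, i.e. the estimates differ across imputations.
    """
    b_adj = (1 + 1 / m) * between
    total = within + b_adj
    lam = b_adj / total                  # share of total variance due to missing data
    r = b_adj / within                   # relative increase in variance
    df = (m - 1) * (1 + 1 / r) ** 2      # Rubin's large-sample degrees of freedom
    fmi = (r + 2 / (df + 3)) / (1 + r)   # lambda adjusted for the finite number of imputations
    efficiency = 1 / (1 + fmi / m)       # relative to infinitely many imputations
    return lam, fmi, efficiency


# Hypothetical variances for one coefficient with m = 5 imputations
lam, fmi, eff = fraction_missing_info(within=0.040, between=0.010, m=5)
print(f"lambda = {lam:.3f}, FMI = {fmi:.3f}, relative efficiency = {eff:.3f}")
```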

Formula Syntax

The fit() method uses Patsy formula syntax:

Basic Formulas

# Simple linear regression
mice.fit('y ~ x')

# Multiple predictors
mice.fit('y ~ x1 + x2 + x3')

# With interaction
mice.fit('y ~ x1 + x2 + x1:x2')

# Or equivalently
mice.fit('y ~ x1 * x2')  # Includes x1, x2, and x1:x2

# Polynomial terms
mice.fit('y ~ x + I(x**2)')

# No intercept
mice.fit('y ~ x - 1')

Categorical Variables

# Categorical predictor (automatically creates dummies)
mice.fit('income ~ age + C(education)')

# Change reference category
mice.fit('income ~ age + C(education, Treatment("High School"))')

Transformations

# Log-transformed outcome (log_y must already exist as a column in the data)
mice.fit('log_y ~ x1 + x2')

# Use numpy functions
mice.fit('y ~ np.log(x1) + np.sqrt(x2)')

Advanced Pooling

Pool Without Summary

Get detailed results for each imputation:

# Get individual results and pooled results
pooled_detailed = mice.pool(summ=False)

# Access individual imputation results
individual_results = pooled_detailed['individual']

# Access pooled results
pooled_results = pooled_detailed['pooled']

Custom Analysis

For models not supported by fit(), manually fit and pool:

import statsmodels.formula.api as smf
from imputation.pooling import pool_estimates

# Fit on each imputed dataset, keeping both estimates and standard errors
coefficients = []
std_errors = []

for dataset in mice.imputed_datasets:
    # statsmodels is used here because it reports standard errors;
    # scikit-learn's LogisticRegression does not, and Rubin's rules need them
    result = smf.logit('outcome ~ predictor1 + predictor2', data=dataset).fit(disp=False)

    coefficients.append(result.params.values)
    std_errors.append(result.bse.values)

# Pool manually using Rubin's rules
pooled = pool_estimates(coefficients, std_errors)

Interpreting Pooled Results

Statistical Significance

Use the pooled p-values and confidence intervals for inference:

pooled = mice.pool(summ=True)

# Check significance
significant = pooled[pooled['P>|t|'] < 0.05]
print("Significant predictors:")
print(significant)

The pooled standard errors are typically larger than those from any single imputed dataset because they include the between-imputation variance. As a result, a predictor that looks significant in one imputed dataset may no longer be significant once the results are properly pooled.

Effect Sizes

The pooled estimates are your best point estimates:

# Extract coefficient for predictor1
coef = pooled.loc['predictor1', 'Estimate']
ci_lower = pooled.loc['predictor1', '[0.025']
ci_upper = pooled.loc['predictor1', '0.975]']

print(f"predictor1: {coef:.3f} (95% CI: [{ci_lower:.3f}, {ci_upper:.3f}])")

Model Comparison

When comparing models, use pooled results:

# Fit two models
mice.fit('y ~ x1')
results_simple = mice.pool(summ=True)

mice.fit('y ~ x1 + x2 + x3')
results_complex = mice.pool(summ=True)

# Compare the pooled coefficients, standard errors, and FMI of the two models side by side

How Many Imputations?

General Guidelines

Minimum: 5 imputations

Acceptable for low missingness (<10%)

Recommended: 10-20 imputations

Good balance between computation and precision

High missingness: 20-100 imputations

When missingness >30% or FMI >0.3

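Rule of thumb: number of imputations ≈ percentage of incomplete cases (for example, about 30 imputations when 30% of rows have at least one missing value)

Von Hippel (2020) proposes a quadratic rule instead: m ≈ 1 + 0.5 × (FMI / CV)², where CV is the coefficient of variation you are willing to tolerate in the pooled standard error (for example, 0.05)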
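As a rough sketch, two ways of choosing m can be written as small helpers: the linear rule of thumb (m is about the percentage of incomplete cases) and a quadratic rule of the form m ≈ 1 + 0.5 × (FMI / CV)², which is how von Hippel’s (2020) proposal is usually stated. The function names and inputs are hypothetical, and the FMI passed to the quadratic rule would normally come from a small pilot run of imputations.

```python
import math


def m_linear(n_incomplete, n_total):
    """Linear rule of thumb: m is about the percentage of incomplete cases."""
    return math.ceil(100 * n_incomplete / n_total)


def m_quadratic(fmi, cv=0.05):
    """Quadratic rule: m = 1 + 0.5 * (FMI / CV)^2, rounded up.

    cv is the tolerated coefficient of variation of the pooled standard error.
    """
    return math.ceil(1 + 0.5 * (fmi / cv) ** 2)


# Hypothetical study: 30 of 200 rows have at least one missing value
print("linear rule:", m_linear(30, 200))
# Hypothetical pilot FMI of 0.3 with the default tolerance of 0.05
print("quadratic rule:", m_quadratic(0.3))
```

The quadratic rule grows quickly with FMI, so it can ask for far more imputations than the linear rule when a lot of information is missing.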

Checking If You Have Enough

If FMI is high (>0.3) and results are unstable across repeated analyses, you may need more imputations:

# Check FMI
pooled = mice.pool(summ=True)
max_fmi = pooled['FMI'].max()

if max_fmi > 0.3:
    print(f"High FMI ({max_fmi:.2f}). Consider more imputations.")
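One hedged way to quantify instability across repeated analyses is the Monte Carlo error of the pooled estimate, sqrt(B / m), where B is the between-imputation variance. It approximates how far the pooled coefficient might move if the imputation were rerun with a different random seed. The sketch below is illustrative only: the helper name, the sample numbers, and the threshold of 10% of the pooled standard error are assumptions (the threshold is a commonly quoted guideline, not a mice-py feature).

```python
import math
import statistics


def monte_carlo_error(estimates):
    """Monte Carlo standard error of the pooled estimate: sqrt(B / m)."""
    m = len(estimates)
    between = statistics.variance(estimates)  # B: variance of estimates across imputations
    return math.sqrt(between / m)


# Hypothetical per-imputation estimates of one coefficient and its pooled SE
estimates = [0.80, 0.85, 0.78, 0.88, 0.81]
pooled_se = 0.142

mc_error = monte_carlo_error(estimates)
needs_more = mc_error > 0.1 * pooled_se  # assumed guideline: MC error under 10% of the SE
print(f"Monte Carlo error: {mc_error:.4f}, more imputations suggested: {needs_more}")
```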

Common Pitfalls

Don’t Use Single Imputation

Wrong:

# Using only the first imputed dataset
import statsmodels.formula.api as smf

dataset = mice.imputed_datasets[0]
model = smf.ols('y ~ x1 + x2', data=dataset).fit()
print(model.summary())  # these standard errors ignore imputation uncertainty

Correct:

# Fit on all and pool
mice.fit('y ~ x1 + x2')
pooled = mice.pool(summ=True)
print(pooled)

Don’t Average Imputed Values

Wrong:

# Averaging imputed datasets
import pandas as pd
import statsmodels.formula.api as smf

averaged = pd.concat(mice.imputed_datasets).groupby(level=0).mean()
model = smf.ols('y ~ x1 + x2', data=averaged).fit()

This is effectively single imputation: averaging removes the variation between imputations, so the resulting standard errors are too small.

Correct: Use proper pooling with Rubin’s rules
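A small simulation illustrates why averaging imputed values is a problem. It is a simplified sketch with an assumed setup: every imputation draws the missing values from the same standard normal distribution, whereas real imputation models condition on the observed data. Averaging across imputations shrinks the spread of the filled-in values, so the averaged dataset looks more precise than the data warrant.

```python
import random
import statistics

random.seed(0)
m, n_missing = 5, 1000

# Assumed setup: each imputation draws the missing values from N(0, 1)
imputations = [[random.gauss(0, 1) for _ in range(n_missing)] for _ in range(m)]

# Average the m imputed values cell by cell, as in the "wrong" approach
averaged = [statistics.mean(cell) for cell in zip(*imputations)]

var_single = statistics.variance(imputations[0])
var_averaged = statistics.variance(averaged)
print(f"variance of one imputation:  {var_single:.3f}")
print(f"variance of averaged values: {var_averaged:.3f}")
```

With m = 5 the averaged values have roughly one fifth of the variance of a single imputation, which is the variability that Rubin’s rules are designed to preserve.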

Don’t Ignore Imputation Uncertainty

Standard errors from a single imputed dataset are too small because they treat the imputed values as if they had been observed. Always pool!

Checking Results

After pooling, check:

pooled = mice.pool(summ=True)

# Summary statistics
print(f"Mean FMI: {pooled['FMI'].mean():.3f}")
print(f"Max FMI: {pooled['FMI'].max():.3f}")
print(f"Mean df: {pooled['df'].mean():.1f}")

Tips for Better Pooling

  1. More imputations: When in doubt, use more (20-50)

  2. Check FMI: High values suggest need for more imputations

  3. Check convergence: Ensure MICE converged before pooling

  4. Include all relevant variables: In both imputation and analysis models

  5. Be cautious with transformations: Pool estimates on the scale the analysis model reports (for example, log odds) and back-transform the pooled result afterward
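One reading of tip 5 is that estimates should be pooled on the scale where the model estimates them and only then back-transformed. The sketch below uses hypothetical logistic regression coefficients (log odds ratios) from five imputed datasets and compares pooling on the log scale with naively averaging the odds ratios; it covers the point estimate only, not the standard error.

```python
import math
import statistics

# Hypothetical log odds ratios for one predictor from m = 5 imputed datasets
log_odds = [0.40, 0.55, 0.35, 0.60, 0.50]

# Pool on the scale of the model coefficient, then back-transform
pooled_log_odds = statistics.mean(log_odds)
pooled_or = math.exp(pooled_log_odds)

# Naive alternative: exponentiate first, then average the odds ratios
naive_or = statistics.mean([math.exp(b) for b in log_odds])

print(f"pooled on log scale, then exponentiated: {pooled_or:.3f}")
print(f"averaged odds ratios:                    {naive_or:.3f}")
```

The two answers differ because the exponential is nonlinear; averaging the odds ratios directly gives a value at least as large as exponentiating the pooled log odds ratio.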

Next Steps