Pooling Analysis
After creating multiple imputed datasets, you need to analyze them and combine the results. This guide explains how to pool results using Rubin’s rules.
Why Pool Results?
You have multiple complete datasets (e.g., 5 or 10), each with different imputed values. To get final estimates, you must:
1. Fit your model on each imputed dataset
2. Pool the results to get single estimates
3. Account for uncertainty from both within and between imputations
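Analyzing a single imputed dataset, or averaging the imputed values into one dataset, treats imputed values as if they had been observed. Both shortcuts underestimate uncertainty and produce invalid inferences.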
Rubin’s Rules
Rubin’s rules provide a principled way to combine estimates from multiple imputed datasets, properly accounting for the uncertainty introduced by imputation.
Basic Concept
For each parameter (e.g., regression coefficient):
1. Fit the model on each imputed dataset → get m estimates and standard errors
2. Calculate the within-imputation variance W (average of the squared SEs)
3. Calculate the between-imputation variance B (sample variance of the m estimates)
4. Total variance T = W + B + B/m, where B/m corrects for using a finite number of imputations
5. Use the pooled estimate and T to construct confidence intervals and perform inference
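The steps above can be sketched in plain Python for a single parameter. This is a minimal illustration of Rubin's rules, not mice-py's implementation: the helper name `rubin_pool` and the input numbers are made up for the example.

```python
import math
import statistics


def rubin_pool(estimates, std_errors):
    """Pool one parameter across m imputed datasets (hypothetical helper)."""
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need matching lists from at least two imputations")
    q_bar = statistics.mean(estimates)                      # pooled point estimate
    within = statistics.mean([se ** 2 for se in std_errors])  # W: mean squared SE
    between = statistics.variance(estimates)                # B: sample variance (m - 1)
    total = within + between + between / m                  # T = W + B + B/m
    return {
        "estimate": q_bar,
        "within": within,
        "between": between,
        "total": total,
        "std_error": math.sqrt(total),
    }


# Made-up estimates and standard errors from m = 5 imputed datasets
result = rubin_pool(
    [0.80, 0.85, 0.78, 0.83, 0.84],
    [0.14, 0.15, 0.14, 0.13, 0.14],
)
print(result)
```

The pooled standard error is never smaller than the square root of W, which is why a single imputed dataset tends to look more precise than it should.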
Using mice-py for Pooling
Simple Workflow
from imputation import MICE
# 1. Perform imputation
mice = MICE(df)
mice.impute(n_imputations=5, maxit=10)
# 2. Fit a model using formula syntax
mice.fit('outcome ~ predictor1 + predictor2 + predictor3')
# 3. Pool results
pooled = mice.pool(summ=True)
print(pooled)
The fit() method uses formula syntax (like R or statsmodels) and fits the model on all imputed datasets.
The pool() method combines results using Rubin’s rules.
Understanding the Output
The pooled results include:
- Estimate
The pooled coefficient (average across imputed datasets)
- Std.Error
The pooled standard error (accounting for both within and between variance)
- t-statistic
The test statistic for the coefficient
- df
Degrees of freedom (adjusted for imputation)
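- p-value (column P>|t|)
Two-sided p-value for the test that the coefficient is zero
- 95% CI Lower/Upper (columns [0.025 and 0.975])
Lower and upper bounds of the 95% confidence interval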
- FMI
Fraction of Missing Information (see below)
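To make the df column less of a black box, here is a sketch of two standard choices for the adjusted degrees of freedom: Rubin's (1987) large-sample formula and the Barnard and Rubin (1999) small-sample adjustment. Which of these mice-py uses is not stated in this guide, so treat the code as an assumption. The function names and the `complete_df` argument (the residual degrees of freedom the model would have with complete data) are hypothetical.

```python
import math


def missing_variance_share(within, between, m):
    """lambda: share of the total variance attributable to missing data."""
    extra = between + between / m
    return extra / (within + extra)


def rubin_df(within, between, m):
    """Rubin's (1987) large-sample degrees of freedom."""
    lam = missing_variance_share(within, between, m)
    return math.inf if lam == 0 else (m - 1) / lam ** 2


def barnard_rubin_df(within, between, m, complete_df):
    """Barnard-Rubin (1999) adjustment for a finite complete-data df."""
    lam = missing_variance_share(within, between, m)
    df_obs = (complete_df + 1) / (complete_df + 3) * complete_df * (1 - lam)
    df_old = rubin_df(within, between, m)
    if math.isinf(df_old):
        return df_obs
    return df_old * df_obs / (df_old + df_obs)


# Made-up variance components for one coefficient, m = 5 imputations
w, b, m = 0.01964, 0.00085, 5
df_large = rubin_df(w, b, m)
df_small = barnard_rubin_df(w, b, m, complete_df=96)
print(f"Rubin df: {df_large:.1f}, Barnard-Rubin df: {df_small:.1f}")
```

The adjusted value never exceeds the complete-data degrees of freedom, which is the motivation for the small-sample version.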
Example Output
            Estimate  Std.Error  t-statistic     df   P>|t|   [0.025   0.975]    FMI
Intercept     45.234      5.321        8.502  42.15  <0.001   34.449   56.019  0.156
predictor1     0.823      0.142        5.796  38.27  <0.001    0.535    1.111  0.198
predictor2    -1.234      0.387       -3.189  51.83   0.002   -2.012   -0.456  0.089
predictor3     2.156      0.921        2.341  45.67   0.024    0.301    4.011  0.132
Fraction of Missing Information (FMI)
FMI is the fraction of an estimate's total sampling variance that is attributable to the missing data:
- FMI = 0: No missing information, equivalent to complete data analysis
- FMI = 0.1: 10% of the uncertainty is due to missingness
- FMI = 0.5: Half the uncertainty is due to missingness
- FMI = 1: All of the uncertainty is due to missingness (rare)
If FMI > 0.3, consider using more imputations.
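For readers who want to see where the number comes from, the sketch below computes FMI from the within-imputation variance W, the between-imputation variance B, and the number of imputations m, using the standard textbook definition. The function name is hypothetical, and mice-py may compute the value slightly differently (for example with a different degrees-of-freedom formula).

```python
def fraction_missing_information(within, between, m):
    """FMI for one parameter from Rubin's variance components."""
    if between == 0:
        return 0.0                                # estimates identical across imputations
    r = (between + between / m) / within          # relative increase in variance
    df = (m - 1) * (1 + 1 / r) ** 2               # Rubin's (1987) degrees of freedom
    return (r + 2 / (df + 3)) / (r + 1)


# Made-up variance components: little vs. substantial between-imputation spread
fmi_low = fraction_missing_information(within=0.01964, between=0.00085, m=5)
fmi_high = fraction_missing_information(within=1.0, between=1.0, m=5)
print(f"low: {fmi_low:.3f}, high: {fmi_high:.3f}")
```

FMI is driven by how large B is relative to W, so it can differ a lot between coefficients in the same model.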
Formula Syntax
The fit() method uses Patsy formula syntax:
Basic Formulas
# Simple linear regression
mice.fit('y ~ x')
# Multiple predictors
mice.fit('y ~ x1 + x2 + x3')
# With interaction
mice.fit('y ~ x1 + x2 + x1:x2')
# Or equivalently
mice.fit('y ~ x1 * x2') # Includes x1, x2, and x1:x2
# Polynomial terms
mice.fit('y ~ x + I(x**2)')
# No intercept
mice.fit('y ~ x - 1')
Categorical Variables
# Categorical predictor (automatically creates dummies)
mice.fit('income ~ age + C(education)')
# Change reference category
mice.fit('income ~ age + C(education, Treatment("High School"))')
Transformations
# Log-transformed outcome, precomputed and stored as a column named log_y
mice.fit('log_y ~ x1 + x2')
# Use numpy functions
mice.fit('y ~ np.log(x1) + np.sqrt(x2)')
Advanced Pooling
Pool Without Summary
Get detailed results for each imputation:
# Get individual results and pooled results
pooled_detailed = mice.pool(summ=False)
# Access individual imputation results
individual_results = pooled_detailed['individual']
# Access pooled results
pooled_results = pooled_detailed['pooled']
Custom Analysis
For models not supported by fit(), manually fit and pool:
import statsmodels.api as sm
from imputation.pooling import pool_estimates
# statsmodels is used here instead of scikit-learn because sklearn's
# LogisticRegression does not report the standard errors Rubin's rules need
coefficients = []
std_errors = []
# Fit on each imputed dataset
for dataset in mice.imputed_datasets:
    X = sm.add_constant(dataset[['predictor1', 'predictor2']])
    y = dataset['outcome']
    result = sm.Logit(y, X).fit(disp=False)
    # Collect estimates and standard errors (intercept first)
    coefficients.append(result.params.to_numpy())
    std_errors.append(result.bse.to_numpy())
# Pool manually using Rubin's rules
pooled = pool_estimates(coefficients, std_errors)
Interpreting Pooled Results
Statistical Significance
Use the pooled p-values and confidence intervals for inference:
pooled = mice.pool(summ=True)
# Check significance
significant = pooled[pooled['P>|t|'] < 0.05]
print("Significant predictors:")
print(significant)
The pooled standard errors are generally larger than those from a single imputed dataset because they also reflect between-imputation variance, so a predictor that looks significant in one imputed dataset may not be significant once the results are properly pooled.
Effect Sizes
The pooled estimates are your best point estimates:
# Extract coefficient for predictor1
coef = pooled.loc['predictor1', 'Estimate']
ci_lower = pooled.loc['predictor1', '[0.025']
ci_upper = pooled.loc['predictor1', '0.975]']
print(f"predictor1: {coef:.3f} (95% CI: [{ci_lower:.3f}, {ci_upper:.3f}])")
Model Comparison
When comparing models, use pooled results:
# Fit two models
mice.fit('y ~ x1')
results_simple = mice.pool(summ=True)
mice.fit('y ~ x1 + x2 + x3')
results_complex = mice.pool(summ=True)
# Compare based on pooled coefficients and FMI
How Many Imputations?
General Guidelines
- Minimum: 5 imputations
Acceptable for low missingness (<10%)
- Recommended: 10-20 imputations
Good balance between computation and precision
- High missingness: 20-100 imputations
When missingness >30% or FMI >0.3
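Rule of thumb (White, Royston and Wood, 2011): number of imputations ≈ percentage of incomplete cases, i.e. m ≈ 100 × (# of incomplete cases / total # of cases)
Von Hippel (2020) argues that this linear rule can be too low when FMI is high and proposes a two-stage quadratic rule, in which the required m grows with the square of FMI.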
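Both ideas can be turned into small helpers. The sketch below assumes the rule of thumb refers to the percentage of incomplete cases out of all cases, and that the quadratic rule has the form m = 1 + 0.5 × (FMI / CV)², where CV is the coefficient of variation you are willing to accept in the pooled standard error. Both function names are hypothetical, and the floor of 5 imputations comes from the guideline above.

```python
import math


def m_from_incomplete_cases(n_incomplete, n_total, minimum=5):
    """Linear rule of thumb: m is about the percentage of incomplete cases."""
    return max(minimum, math.ceil(100 * n_incomplete / n_total))


def m_quadratic_rule(fmi, cv_se=0.05):
    """Quadratic rule (assumed form): m = 1 + 0.5 * (FMI / CV)^2."""
    return math.ceil(1 + 0.5 * (fmi / cv_se) ** 2)


# Made-up study: 120 of 400 cases have at least one missing value
m_linear = m_from_incomplete_cases(120, 400)
# FMI taken from a pilot analysis with a small number of imputations
m_quadratic = m_quadratic_rule(fmi=0.5, cv_se=0.05)
print(m_linear, m_quadratic)
```

The quadratic rule needs an FMI estimate, so it is applied in two stages: run a small pilot set of imputations, read off the FMI, then choose the final m.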
Checking If You Have Enough
If FMI is high (>0.3) and results are unstable across repeated analyses, you may need more imputations:
# Check FMI
pooled = mice.pool(summ=True)
max_fmi = pooled['FMI'].max()
if max_fmi > 0.3:
    print(f"High FMI ({max_fmi:.2f}). Consider more imputations.")
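One way to judge what extra imputations buy is Rubin's (1987) relative efficiency, 1 / (1 + FMI / m), which compares the variance of an estimate based on m imputations with one based on infinitely many. The sketch below tabulates it for a few values of m; the function name is hypothetical.

```python
def relative_efficiency(fmi, m):
    """Efficiency of m imputations relative to infinitely many."""
    return 1 / (1 + fmi / m)


# Efficiency at FMI = 0.3 for increasing numbers of imputations
efficiencies = {m: relative_efficiency(0.3, m) for m in (5, 10, 20, 50)}
for m, eff in efficiencies.items():
    print(f"m = {m:>2}: {eff:.3f}")
```

Point estimates are already fairly efficient at small m. The stronger argument for more imputations is the stability of standard errors and p-values across repeated analyses, which this measure does not capture.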
Common Pitfalls
Don’t Use Single Imputation
❌ Wrong:
import statsmodels.formula.api as smf
# Using only the first imputed dataset
dataset = mice.imputed_datasets[0]
model = smf.ols('y ~ x1 + x2', data=dataset).fit()
print(model.summary())
✓ Correct:
# Fit on all and pool
mice.fit('y ~ x1 + x2')
pooled = mice.pool(summ=True)
print(pooled)
Don’t Average Imputed Values
❌ Wrong:
import pandas as pd
import statsmodels.formula.api as smf
# Averaging imputed datasets
averaged = pd.concat(mice.imputed_datasets).groupby(level=0).mean()
model = smf.ols('y ~ x1 + x2', data=averaged).fit()
This is actually single imputation and underestimates uncertainty!
✓ Correct: Use proper pooling with Rubin’s rules
Don’t Ignore Imputation Uncertainty
Standard errors from a single imputed dataset are too small. Always pool!
Checking Results
After pooling, check:
pooled = mice.pool(summ=True)
# Summary statistics
print(f"Mean FMI: {pooled['FMI'].mean():.3f}")
print(f"Max FMI: {pooled['FMI'].max():.3f}")
print(f"Mean df: {pooled['df'].mean():.1f}")
Tips for Better Pooling
- More imputations: When in doubt, use more (20-50)
- Check FMI: High values suggest need for more imputations
- Complete convergence: Ensure MICE converged before pooling
- Include all relevant variables: In both imputation and analysis models
- Be cautious with transformations: Pool on the scale the analysis model estimates (e.g., log odds ratios rather than odds ratios) and back-transform the pooled estimate and interval afterwards
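The last tip can be illustrated with odds ratios from a logistic regression. The sketch below pools the log odds ratios with Rubin's rules and exponentiates only at the end. The helper name and the numbers are made up, and for brevity the interval uses a normal quantile, whereas a full implementation would use a t quantile with the pooled degrees of freedom.

```python
import math
import statistics


def pool_odds_ratio(log_odds_ratios, std_errors, level=0.95):
    """Pool on the log scale, then back-transform (hypothetical helper)."""
    m = len(log_odds_ratios)
    q_bar = statistics.mean(log_odds_ratios)
    within = statistics.mean([se ** 2 for se in std_errors])
    between = statistics.variance(log_odds_ratios)
    total = within + between + between / m
    # Normal approximation; assumes the pooled df is large
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * math.sqrt(total)
    return (
        math.exp(q_bar),
        math.exp(q_bar - half_width),
        math.exp(q_bar + half_width),
    )


# Made-up log odds ratios and standard errors from m = 5 imputed datasets
or_est, or_lower, or_upper = pool_odds_ratio(
    [0.40, 0.46, 0.38, 0.44, 0.42],
    [0.20, 0.21, 0.20, 0.19, 0.20],
)
print(f"OR = {or_est:.2f} (95% CI: {or_lower:.2f}, {or_upper:.2f})")
```

Pooling the odds ratios themselves would average quantities whose sampling distribution is skewed, which is what the log scale avoids.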
Next Steps
Read Best Practices for overall guidance
Review Rubin’s Rules for mathematical details
See complete examples in Examples