Pooling Analysis
================

After creating multiple imputed datasets, you need to analyze each one and combine the results. This guide explains how to pool results using Rubin's rules.

Why Pool Results?
-----------------

You have multiple complete datasets (e.g., 5 or 10), each with different imputed values. To get final estimates, you must:

1. **Fit your model** on each imputed dataset
2. **Pool the results** to get single estimates
3. **Account for uncertainty** from both within and between imputations

Averaging the estimates without combining their variances, or analyzing only one dataset, would underestimate uncertainty and produce invalid inferences.

Rubin's Rules
-------------

Rubin's rules provide a principled way to combine estimates from multiple imputed datasets, properly accounting for the uncertainty introduced by imputation.

Basic Concept
~~~~~~~~~~~~~

For each parameter (e.g., regression coefficient):

1. Fit the model on each imputed dataset → get *m* estimates and standard errors
2. Calculate **within-imputation variance** (average of squared SEs)
3. Calculate **between-imputation variance** (variance of estimates across imputations)
4. **Total variance** = within + (1 + 1/*m*) × between (the 1/*m* term corrects for using a finite number of imputations)
5. Use these to construct confidence intervals and perform inference

Using mice-py for Pooling
-------------------------

Simple Workflow
~~~~~~~~~~~~~~~

.. code-block:: python

    from imputation import MICE

    # 1. Perform imputation (df is your DataFrame with missing values)
    mice = MICE(df)
    mice.impute(n_imputations=5, maxit=10)

    # 2. Fit a model using formula syntax
    mice.fit('outcome ~ predictor1 + predictor2 + predictor3')

    # 3. Pool results
    pooled = mice.pool(summ=True)
    print(pooled)

The ``fit()`` method uses formula syntax (like R or statsmodels) and fits the model on all imputed datasets. The ``pool()`` method combines results using Rubin's rules.

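The steps listed under Basic Concept can be sketched in plain Python for a single parameter. This is a minimal illustration and not mice-py's implementation: the function name ``rubin_pool`` and the example numbers are hypothetical, and the degrees of freedom use Rubin's classic large-sample formula, which may differ from the adjustment the library applies.

.. code-block:: python

    import math
    import statistics


    def rubin_pool(estimates, std_errors):
        """Pool one parameter across m imputations (illustrative sketch)."""
        m = len(estimates)
        if m < 2 or len(std_errors) != m:
            raise ValueError("need matching inputs from at least 2 imputations")

        pooled_estimate = statistics.fmean(estimates)
        # Within-imputation variance: average of the squared standard errors
        within = statistics.fmean([se ** 2 for se in std_errors])
        # Between-imputation variance: sample variance of the m estimates
        between = statistics.variance(estimates)
        # Total variance, with the (1 + 1/m) finite-imputation correction
        total = within + (1 + 1 / m) * between

        # Share of total variance attributable to missing data
        lam = (1 + 1 / m) * between / total
        # Rubin's classic degrees of freedom (infinite if estimates agree)
        df = (m - 1) / lam ** 2 if lam > 0 else math.inf
        # Relative increase in variance, then fraction of missing information
        riv = (1 + 1 / m) * between / within
        fmi = (riv + 2 / (df + 3)) / (1 + riv)

        return {
            "estimate": pooled_estimate,
            "std_error": math.sqrt(total),
            "df": df,
            "fmi": fmi,
        }


    # Hypothetical estimates and standard errors from five imputed datasets
    result = rubin_pool(
        [0.80, 0.85, 0.78, 0.90, 0.82],
        [0.14, 0.15, 0.13, 0.14, 0.15],
    )
    for name, value in result.items():
        print(f"{name}: {value:.3f}")

The pooled standard error is never smaller than the square root of the within-imputation variance, because the between-imputation term can only add to it.
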
Understanding the Output
------------------------

The pooled results include:

**Estimate**
    The pooled coefficient (average across imputed datasets)

**Std.Error**
    The pooled standard error (accounting for both within and between variance)

**t-statistic**
    The test statistic for the coefficient

**df**
    Degrees of freedom (adjusted for imputation)

**p-value**
    Statistical significance (column ``P>|t|``)

**95% CI Lower/Upper**
    Confidence interval bounds (columns ``[0.025`` and ``0.975]``)

**FMI**
    Fraction of Missing Information (see below)

Example Output
~~~~~~~~~~~~~~

.. code-block:: text

                Estimate  Std.Error  t-statistic     df   P>|t|  [0.025  0.975]    FMI
    Intercept     45.234      5.321        8.502  42.15  <0.001  34.449  56.019  0.156
    predictor1     0.823      0.142        5.796  38.27  <0.001   0.535   1.111  0.198
    predictor2    -1.234      0.387       -3.189  51.83   0.002  -2.012  -0.456  0.089
    predictor3     2.156      0.921        2.341  45.67   0.024   0.301   4.011  0.132

Fraction of Missing Information (FMI)
-------------------------------------

FMI indicates how much of the uncertainty in your estimate is due to missing data:

- **FMI = 0**: No missing information, equivalent to complete data analysis
- **FMI = 0.1**: 10% of uncertainty is due to missingness
- **FMI = 0.5**: Half the uncertainty is from missingness
- **FMI = 1**: Complete uncertainty from missingness (rare)

If FMI > 0.3, consider using more imputations.

Formula Syntax
--------------

The ``fit()`` method uses Patsy formula syntax:

Basic Formulas
~~~~~~~~~~~~~~

.. code-block:: python

    # Simple linear regression
    mice.fit('y ~ x')

    # Multiple predictors
    mice.fit('y ~ x1 + x2 + x3')

    # With interaction
    mice.fit('y ~ x1 + x2 + x1:x2')

    # Or equivalently
    mice.fit('y ~ x1 * x2')  # Includes x1, x2, and x1:x2

    # Polynomial terms
    mice.fit('y ~ x + I(x**2)')

    # No intercept
    mice.fit('y ~ x - 1')

Categorical Variables
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    # Categorical predictor (automatically creates dummies)
    mice.fit('income ~ age + C(education)')

    # Change reference category
    mice.fit('income ~ age + C(education, Treatment("High School"))')

Transformations
~~~~~~~~~~~~~~~

.. code-block:: python

    # Log-transformed outcome (here log_y is a column already in the data)
    mice.fit('log_y ~ x1 + x2')

    # Use numpy functions inside the formula
    mice.fit('y ~ np.log(x1) + np.sqrt(x2)')

Advanced Pooling
----------------

Pool Without Summary
~~~~~~~~~~~~~~~~~~~~

Get detailed results for each imputation:

.. code-block:: python

    # Get individual results and pooled results
    pooled_detailed = mice.pool(summ=False)

    # Access individual imputation results
    individual_results = pooled_detailed['individual']

    # Access pooled results
    pooled_results = pooled_detailed['pooled']

Custom Analysis
~~~~~~~~~~~~~~~

For models not supported by ``fit()``, manually fit and pool:

.. code-block:: python

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from imputation.pooling import pool_estimates

    # Fit on each imputed dataset
    coefficients = []
    std_errors = []

    for dataset in mice.imputed_datasets:
        X = dataset[['predictor1', 'predictor2']].to_numpy()
        y = dataset['outcome']

        # A large C makes scikit-learn's default L2 penalty negligible,
        # so the standard errors below are approximately valid
        model = LogisticRegression(C=1e6, max_iter=1000)
        model.fit(X, y)
        coefficients.append(model.coef_[0])

        # Approximate std errors from the inverse Fisher information
        # (simplified; in practice, use proper methods for your model)
        p = model.predict_proba(X)[:, 1]
        X_design = np.column_stack([np.ones(len(X)), X])
        cov = np.linalg.inv(X_design.T @ (X_design * (p * (1 - p))[:, None]))
        std_errors.append(np.sqrt(np.diag(cov))[1:])  # drop the intercept

    # Pool manually using Rubin's rules
    pooled = pool_estimates(coefficients, std_errors)

Interpreting Pooled Results
---------------------------

Statistical Significance
~~~~~~~~~~~~~~~~~~~~~~~~

Use the pooled p-values and confidence intervals for inference:

.. code-block:: python

    pooled = mice.pool(summ=True)

    # Check significance
    significant = pooled[pooled['P>|t|'] < 0.05]
    print("Significant predictors:")
    print(significant)

The pooled standard errors are typically larger than those from a single dataset because they account for imputation uncertainty, so some predictors that look significant in a single imputation might not be significant when properly pooled.

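A small numeric sketch can make the last point concrete. The estimates and standard errors below are made up for illustration, and the value 1.96 is only a rough large-sample benchmark; mice-py itself reports p-values based on the *t* distribution with the adjusted degrees of freedom.

.. code-block:: python

    import math
    import statistics

    # Hypothetical estimates of one coefficient from m = 5 imputed datasets,
    # each with the same standard error within its own dataset
    estimates = [0.42, 0.15, 0.55, 0.20, 0.38]
    std_errors = [0.20, 0.20, 0.20, 0.20, 0.20]
    m = len(estimates)

    # Naive analysis: first imputed dataset only
    single_t = estimates[0] / std_errors[0]

    # Pooled analysis: total variance = within + (1 + 1/m) * between
    within = statistics.fmean([se ** 2 for se in std_errors])
    between = statistics.variance(estimates)
    pooled_se = math.sqrt(within + (1 + 1 / m) * between)
    pooled_t = statistics.fmean(estimates) / pooled_se

    print(f"single dataset: t = {single_t:.2f} (SE {std_errors[0]:.3f})")
    print(f"pooled:         t = {pooled_t:.2f} (SE {pooled_se:.3f})")
    print("single exceeds 1.96:", abs(single_t) > 1.96)
    print("pooled exceeds 1.96:", abs(pooled_t) > 1.96)

Here the single-dataset statistic clears the benchmark while the pooled one does not, because the estimates disagree across imputations and that disagreement enters the pooled standard error.
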
Effect Sizes
~~~~~~~~~~~~

The pooled estimates are your best point estimates:

.. code-block:: python

    # Extract coefficient for predictor1
    coef = pooled.loc['predictor1', 'Estimate']
    ci_lower = pooled.loc['predictor1', '[0.025']
    ci_upper = pooled.loc['predictor1', '0.975]']

    print(f"predictor1: {coef:.3f} (95% CI: [{ci_lower:.3f}, {ci_upper:.3f}])")

Model Comparison
~~~~~~~~~~~~~~~~

When comparing models, use pooled results:

.. code-block:: python

    # Fit two models
    mice.fit('y ~ x1')
    results_simple = mice.pool(summ=True)

    mice.fit('y ~ x1 + x2 + x3')
    results_complex = mice.pool(summ=True)

    # Compare based on pooled coefficients and FMI

How Many Imputations?
---------------------

General Guidelines
~~~~~~~~~~~~~~~~~~

**Minimum**: 5 imputations
    Acceptable for low missingness (<10%)

**Recommended**: 10-20 imputations
    Good balance between computation and precision

**High missingness**: 20-100 imputations
    When missingness >30% or FMI >0.3

**Rule of thumb**: Number of imputations ≈ percentage of incomplete cases

Von Hippel (2020) suggests a quadratic rule based on FMI instead: m ≈ 1 + 0.5 × (FMI / CV)², where CV is the coefficient of variation you are willing to accept in the pooled standard error (for example, 0.05).

Checking If You Have Enough
~~~~~~~~~~~~~~~~~~~~~~~~~~~

If FMI is high (>0.3) and results are unstable across repeated analyses, you may need more imputations:

.. code-block:: python

    # Check FMI
    pooled = mice.pool(summ=True)
    max_fmi = pooled['FMI'].max()

    if max_fmi > 0.3:
        print(f"High FMI ({max_fmi:.2f}). Consider more imputations.")

Common Pitfalls
---------------

Don't Use Single Imputation
~~~~~~~~~~~~~~~~~~~~~~~~~~~

❌ **Wrong**:

.. code-block:: python

    import statsmodels.formula.api as smf

    # Using only the first imputed dataset
    dataset = mice.imputed_datasets[0]
    model = smf.ols('y ~ x1 + x2', data=dataset).fit()
    print(model.summary())

✓ **Correct**:

.. code-block:: python

    # Fit on all and pool
    mice.fit('y ~ x1 + x2')
    pooled = mice.pool(summ=True)
    print(pooled)

Don't Average Imputed Values
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

❌ **Wrong**:

.. code-block:: python

    import pandas as pd
    import statsmodels.formula.api as smf

    # Averaging imputed datasets row by row
    averaged = pd.concat(mice.imputed_datasets).groupby(level=0).mean()
    model = smf.ols('y ~ x1 + x2', data=averaged).fit()

This is effectively single imputation and underestimates uncertainty.

✓ **Correct**: Use proper pooling with Rubin's rules

Don't Ignore Imputation Uncertainty
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Standard errors from a single imputed dataset are too small because they ignore the variation between imputations. Always pool.

Checking Results
----------------

After pooling, check:

.. code-block:: python

    pooled = mice.pool(summ=True)

    # Summary statistics
    print(f"Mean FMI: {pooled['FMI'].mean():.3f}")
    print(f"Max FMI: {pooled['FMI'].max():.3f}")
    print(f"Mean df: {pooled['df'].mean():.1f}")

Tips for Better Pooling
-----------------------

1. **More imputations**: When in doubt, use more (20-50)
2. **Check FMI**: High values suggest need for more imputations
3. **Check convergence**: Ensure MICE converged before pooling
4. **Include all relevant variables**: In both imputation and analysis models
5. **Be cautious with transformations**: Pool estimates on the scale where they are approximately normal (e.g., log-odds rather than odds ratios), then back-transform

Next Steps
----------

- Read :doc:`best_practices` for overall guidance
- Review :doc:`../theory/rubins_rules` for mathematical details
- See complete examples in :doc:`../examples/index`