MICE Overview
=============

This page explains how the MICE (Multiple Imputation by Chained Equations) algorithm works and how to use it in mice-py.

What is MICE?
-------------

MICE is an iterative algorithm for imputing missing data that:

1. Creates **multiple** imputed datasets (not just one)
2. Uses **chained equations** (imputes one variable at a time)
3. Accounts for **uncertainty** in the missing values
4. Enables **valid statistical inference** when data are missing at random (MAR)

The algorithm is described in van Buuren and Groothuis-Oudshoorn (2011) and is implemented in the widely used R package ``mice``.

How MICE Works
--------------

The MICE Algorithm
~~~~~~~~~~~~~~~~~~

**Step 1: Initialization**

Fill in missing values using simple imputation (e.g., random sampling from observed values or means).

**Step 2: Iteration**

For each variable with missing data (in a specified order):

a. Set the variable's imputed values back to missing
b. Use the variable's observed values as the target and the other variables as predictors
c. Fit a model and predict the missing values
d. Replace missing values with predictions (plus random variation)

**Step 3: Repeat**

Cycle through all incomplete variables multiple times until convergence.

**Step 4: Multiple Imputations**

Repeat the entire process to create multiple different completed datasets.

Visual Example
~~~~~~~~~~~~~~

Suppose you have three variables (Age, Income, Education) with missing values:

.. code-block:: text

   Iteration 1:
     1. Impute Age using Income + Education
     2. Impute Income using Age + Education
     3. Impute Education using Age + Income

   Iteration 2:
     1. Re-impute Age using updated Income + Education
     2. Re-impute Income using updated Age + Education
     3. Re-impute Education using updated Age + Income

   ... continue until convergence

Basic Usage
-----------

Simple Example
~~~~~~~~~~~~~~

.. code-block:: python

   from imputation import MICE
   import pandas as pd

   # Your data with missing values
   df = pd.read_csv('data.csv')

   # Initialize MICE
   mice = MICE(df)

   # Run imputation
   mice.impute(
       n_imputations=5,  # Create 5 complete datasets
       maxit=10,         # Run 10 iterations
       method='pmm'      # Use Predictive Mean Matching
   )

   # Access results
   imputed_datasets = mice.imputed_datasets

Key Parameters
~~~~~~~~~~~~~~

**n_imputations**
   Number of imputed datasets to create. Common choices: 5-10 for moderate missingness, 20-100 for high missingness or specific analyses.

**maxit**
   Number of iterations through all variables. Usually 10-20 is sufficient. Check convergence diagnostics to determine whether more are needed.

**method**
   Imputation method(s) to use. Can be:

   - A string (same method for all variables): ``'pmm'``, ``'cart'``, ``'rf'``, ``'midas'``, ``'sample'``
   - A dictionary mapping column names to methods

**initial**
   Method for initial imputation before iterations begin:

   - ``'sample'`` (default): Random sampling from observed values
   - ``'mean'``: Use mean for numeric, mode for categorical

**visit_sequence**
   Order in which variables are visited during each iteration:

   - ``'monotone'`` (default): Ordered by amount of missingness
   - ``'random'``: Random order in each iteration
   - A list of column names for custom order

Controlling the Imputation Process
----------------------------------

Predictor Matrix
~~~~~~~~~~~~~~~~

By default, each variable is predicted using all other variables. You can customize this with a predictor matrix, in which each row names a variable to impute and each column marked ``1`` is used as a predictor for it:

.. code-block:: python

   import numpy as np

   # Start with every variable predicting every other variable:
   # 1 everywhere except the diagonal (a variable never predicts itself)
   predictor_matrix = pd.DataFrame(
       1 - np.eye(len(df.columns), dtype=int),
       index=df.columns,
       columns=df.columns,
   )

   # Exclude certain predictors
   predictor_matrix.loc['income', 'education'] = 0  # Don't use education to predict income

   mice.impute(predictor_matrix=predictor_matrix)

See :doc:`predictor_matrices` for more details.

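
To make the interaction between the chained-equations loop and the predictor matrix concrete, the following is a minimal sketch of the idea using only NumPy and pandas. It is not how mice-py is implemented: ``chained_pass`` is a hypothetical helper, the data are simulated, and ordinary least squares with Gaussian residual noise stands in for methods such as PMM (a full implementation would also draw the model parameters, not just add noise to the predictions).

.. code-block:: python

   import numpy as np
   import pandas as pd

   def chained_pass(data, missing_mask, predictor_matrix, rng):
       """One cycle over the incomplete columns (hypothetical helper, numeric data only)."""
       data = data.copy()
       for target in data.columns:
           miss = missing_mask[target].to_numpy()
           if not miss.any():
               continue
           # Row `target` of the predictor matrix: a 1 marks a column used as predictor
           row = predictor_matrix.loc[target]
           predictors = list(row.index[row == 1])
           X = np.column_stack([np.ones(len(data)), data[predictors].to_numpy()])
           y = data[target].to_numpy()
           # Fit only on rows where the target was actually observed
           beta, *_ = np.linalg.lstsq(X[~miss], y[~miss], rcond=None)
           resid_sd = np.std(y[~miss] - X[~miss] @ beta)
           # Overwrite the previously imputed cells with prediction plus random noise
           noise = rng.normal(0.0, resid_sd, miss.sum())
           data.loc[miss, target] = X[miss] @ beta + noise
       return data

   # Simulated data (assumption: three numeric, linearly related variables)
   rng = np.random.default_rng(0)
   n = 200
   age = rng.normal(40, 10, n)
   df = pd.DataFrame({
       "age": age,
       "income": 1000 + 50 * age + rng.normal(0, 100, n),
       "education": 8 + 0.1 * age + rng.normal(0, 1, n),
   })
   mask = pd.DataFrame(rng.random(df.shape) < 0.2, index=df.index, columns=df.columns)
   df_missing = df.mask(mask)  # set roughly 20% of the cells to NaN

   # Step 1: initial fill with column means
   filled = df_missing.fillna(df_missing.mean())

   # All-ones matrix with a zero diagonal, minus one excluded predictor
   cols = df.columns
   predictor_matrix = pd.DataFrame(1 - np.eye(len(cols), dtype=int), index=cols, columns=cols)
   predictor_matrix.loc["income", "education"] = 0

   # Steps 2-3: cycle through the incomplete variables several times
   for _ in range(5):
       filled = chained_pass(filled, mask, predictor_matrix, rng)

Running the whole procedure several times with different random seeds would correspond to Step 4, producing multiple completed datasets. Observed cells are never modified; only cells flagged in ``mask`` are re-imputed on each pass.
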
Method-Specific Parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~

Different methods have specific parameters you can tune:

.. code-block:: python

   # PMM with custom number of donors
   mice.impute(method='pmm', pmm_donors=3)

   # CART with maximum depth
   mice.impute(method='cart', cart_max_depth=10)

   # Random Forest with number of trees
   mice.impute(method='rf', rf_n_estimators=50)

Accessing Results
-----------------

Imputed Datasets
~~~~~~~~~~~~~~~~

.. code-block:: python

   # List of pandas DataFrames
   imputed_datasets = mice.imputed_datasets

   # Access individual datasets
   dataset_1 = imputed_datasets[0]
   dataset_2 = imputed_datasets[1]

Convergence Diagnostics
~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   # Mean and variance chains for each variable
   chain_mean = mice.chain_mean
   chain_var = mice.chain_var

   # Visualize
   from plotting.diagnostics import plot_chain_stats
   plot_chain_stats(chain_mean, chain_var)

See :doc:`convergence_diagnostics` for details on interpreting these.

Model Fitting and Pooling
~~~~~~~~~~~~~~~~~~~~~~~~~

After imputation, fit models and pool results:

.. code-block:: python

   # Fit a model on all imputed datasets
   mice.fit('outcome ~ predictor1 + predictor2')

   # Pool using Rubin's rules
   results = mice.pool(summ=True)
   print(results)

See :doc:`pooling_analysis` for more on analyzing imputed data.

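
As a rough illustration of what pooling with Rubin's rules computes for a single coefficient, the sketch below combines per-dataset estimates with their squared standard errors. It is independent of mice-py: ``pool_rubin`` is a hypothetical helper and the input numbers are made up for illustration, so it should not be read as a description of how ``mice.pool`` is implemented.

.. code-block:: python

   import numpy as np

   def pool_rubin(estimates, variances):
       """Pool one scalar parameter across m imputed datasets (hypothetical helper)."""
       q = np.asarray(estimates, dtype=float)   # point estimate from each dataset
       u = np.asarray(variances, dtype=float)   # squared standard error from each dataset
       m = q.size
       q_bar = q.mean()                  # pooled point estimate
       within = u.mean()                 # within-imputation variance
       between = q.var(ddof=1)           # between-imputation variance
       total = within + (1 + 1 / m) * between
       return {
           "estimate": q_bar,
           "within": within,
           "between": between,
           "total": total,
           "se": np.sqrt(total),
       }

   # Made-up estimates of one coefficient from m = 5 imputed datasets
   pooled = pool_rubin(
       estimates=[1.0, 1.2, 0.8, 1.1, 0.9],
       variances=[0.04, 0.05, 0.04, 0.05, 0.04],
   )
   print(pooled)

The between-imputation term is what distinguishes this from simply averaging: the more the estimates disagree across imputed datasets, the larger the pooled standard error becomes.
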
When MICE Works Well
--------------------

MICE is effective when:

- ✓ **Data is MAR**: Missingness can be predicted from observed variables
- ✓ **Relationships are clear**: Variables have predictable relationships
- ✓ **Sufficient data**: Enough observed cases to model relationships
- ✓ **Multiple variables**: Missing data across several variables
- ✓ **Complex patterns**: Non-monotone missingness patterns

Limitations of MICE
-------------------

Be aware of potential issues:

- ✗ **MNAR data**: MICE assumes MAR; if data are missing not at random (MNAR), results may be biased
- ✗ **High missingness**: If >50% missing in key variables, predictions may be unstable
- ✗ **Small samples**: Need sufficient data to estimate relationships
- ✗ **Incompatible models**: The separate models for each variable may be theoretically inconsistent (though this rarely causes problems in practice)
- ✗ **Perfect collinearity**: Variables with perfect relationships may cause issues

The Imputation Model vs Analysis Model
--------------------------------------

An important concept: the **imputation model** (used to fill in missing values) should be at least as complex as your **analysis model** (the model you'll fit to the data).

**Imputation model**: The set of all univariate models used to impute each variable

**Analysis model**: The model you fit to the completed data (e.g., regression)

.. tip::

   Include all variables that:

   - Are in your analysis model
   - Predict missingness
   - Are correlated with incomplete variables

   This makes the MAR assumption more plausible and improves imputation quality.

Typical Workflow
----------------

1. **Explore your data**: Understand patterns and mechanisms of missingness
2. **Configure MICE**: Choose methods, predictor matrix, and parameters
3. **Run imputation**: Create multiple complete datasets
4. **Check convergence**: Ensure the algorithm has stabilized
5. **Diagnose quality**: Compare observed vs imputed distributions
6. **Analyze**: Fit your statistical model(s)
7. **Pool results**: Combine estimates using Rubin's rules

Example Workflow
~~~~~~~~~~~~~~~~

.. code-block:: python

   import pandas as pd

   from imputation import MICE, configure_logging
   from plotting.diagnostics import plot_chain_stats, stripplot
   from plotting.utils import md_pattern_like

   # Enable logging
   configure_logging(level='INFO')

   # Your data with missing values
   df = pd.read_csv('data.csv')

   # 1. Explore
   pattern = md_pattern_like(df)
   print(pattern)

   # 2-3. Configure and run
   mice = MICE(df)
   mice.impute(n_imputations=5, maxit=15, method='pmm')

   # 4. Check convergence
   plot_chain_stats(mice.chain_mean, mice.chain_var)

   # 5. Diagnose
   missing_pattern = df.notna().astype(int)
   stripplot(mice.imputed_datasets, missing_pattern)

   # 6-7. Analyze and pool
   mice.fit('outcome ~ predictor1 + predictor2')
   results = mice.pool(summ=True)
   print(results)

Next Steps
----------

- Learn about different :doc:`imputation_methods` and when to use each
- Understand :doc:`predictor_matrices` for fine control
- Read about :doc:`convergence_diagnostics` to ensure quality
- See :doc:`pooling_analysis` for analyzing imputed data