MICE Overview

This page explains how the MICE (Multiple Imputation by Chained Equations) algorithm works and how to use it in mice-py.

What is MICE?

MICE is an iterative algorithm for imputing missing data that:

  1. Creates multiple imputed datasets (not just one)

  2. Uses chained equations (imputes one variable at a time)

  3. Accounts for uncertainty in the missing values

  4. Enables valid statistical inference under MAR

The algorithm was developed by van Buuren and Groothuis-Oudshoorn (2011) and is implemented in the widely used R package mice.

How MICE Works

The MICE Algorithm

Step 1: Initialization

Fill in missing values using simple imputation (e.g., random sampling from observed values or means).

Step 2: Iteration

For each variable with missing data (in a specified order):

  1. Set the variable’s imputed values back to missing

  2. Take the rows where the variable is observed as training data, with the variable as the target and the other variables as predictors

  3. Fit a model and predict the missing values

  4. Replace missing values with predictions (plus random variation)

Step 3: Repeat

Cycle through all incomplete variables multiple times until convergence.

Step 4: Multiple Imputations

Repeat the entire process to create multiple different completed datasets.
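The four steps can be sketched for two numeric variables in plain Python. This is a toy illustration under stated assumptions, not how mice-py implements the algorithm: the helper names (fit_line, chained_imputation) are made up for this example, each variable is imputed by simple linear regression on the other, and the regression coefficients are not redrawn from their posterior as a full implementation would do.

```python
import random
import statistics

def fit_line(xs, ys):
    """Least-squares intercept, slope and residual SD for y ~ x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in resid) / max(len(xs) - 2, 1)) ** 0.5
    return intercept, slope, sd

def chained_imputation(data, maxit=10, seed=0):
    """Return one completed copy of `data` (dict of column -> list, None = missing)."""
    rng = random.Random(seed)
    cols = list(data)
    missing = {c: [i for i, v in enumerate(data[c]) if v is None] for c in cols}
    completed = {c: list(data[c]) for c in cols}

    # Step 1: initialise by sampling from the observed values of each column.
    for c in cols:
        observed = [v for v in data[c] if v is not None]
        for i in missing[c]:
            completed[c][i] = rng.choice(observed)

    # Steps 2-3: cycle through the incomplete variables maxit times.
    for _ in range(maxit):
        for target in cols:
            if not missing[target]:
                continue
            predictor = next(c for c in cols if c != target)
            # Train only on rows where the target was actually observed.
            rows = [i for i, v in enumerate(data[target]) if v is not None]
            a, b, sd = fit_line([completed[predictor][i] for i in rows],
                                [completed[target][i] for i in rows])
            # Prediction plus random noise replaces the previous imputation.
            for i in missing[target]:
                completed[target][i] = a + b * completed[predictor][i] + rng.gauss(0, sd)
    return completed

# Step 4: repeat with different seeds to get several completed datasets.
data = {"x": [1.0, 2.0, None, 4.0, 5.0, 6.0],
        "y": [2.1, None, 6.2, 8.1, None, 12.0]}
imputations = [chained_imputation(data, seed=s) for s in range(5)]
```

Observed values are never changed; only the originally missing cells are rewritten on each pass, which is why the imputations differ between seeds while the observed data stays identical.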

Visual Example

Suppose you have three variables (Age, Income, Education) with missing values:

Iteration 1:
1. Impute Age using Income + Education
2. Impute Income using Age + Education
3. Impute Education using Age + Income

Iteration 2:
1. Re-impute Age using updated Income + Education
2. Re-impute Income using updated Age + Education
3. Re-impute Education using updated Age + Income

... continue until convergence

Basic Usage

Simple Example

from imputation import MICE
import pandas as pd

# Your data with missing values
df = pd.read_csv('data.csv')

# Initialize MICE
mice = MICE(df)

# Run imputation
mice.impute(
    n_imputations=5,   # Create 5 complete datasets
    maxit=10,          # Run 10 iterations
    method='pmm'       # Use Predictive Mean Matching
)

# Access results
imputed_datasets = mice.imputed_datasets

Key Parameters

n_imputations

Number of imputed datasets to create. Common choices: 5-10 for moderate missingness, 20-100 for high missingness or specific analyses.
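One common way to reason about this choice is Rubin's relative efficiency, 1 / (1 + fmi / m), where fmi is the fraction of missing information and m the number of imputations. The sketch below only evaluates that formula; the value fmi = 0.5 is an illustrative assumption, not something mice-py reports on this page.

```python
def relative_efficiency(fmi, m):
    """Efficiency of m imputations relative to infinitely many (Rubin, 1987)."""
    return 1.0 / (1.0 + fmi / m)

# With half the information missing, compare a few choices of m.
for m in (5, 20, 100):
    print(m, round(relative_efficiency(0.5, m), 3))
# prints: 5 0.909, then 20 0.976, then 100 0.995
```

The gain from extra imputations flattens quickly, although more imputations still help when you need stable standard errors or p-values.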

maxit

Number of iterations through all variables. Usually 10-20 is sufficient. Check convergence diagnostics to determine if more are needed.

method

Imputation method(s) to use. Can be:

  • A string (same method for all variables): 'pmm', 'cart', 'rf', 'midas', 'sample'

  • A dictionary mapping column names to methods
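As a sketch of the dictionary form, the mapping below assigns a method per column. The column names are hypothetical, and the commented call only mirrors the signature shown in the Simple Example; how mice-py treats columns left out of the dictionary is not covered here.

```python
# Hypothetical column names; the method names come from the list above.
methods = {
    "age": "pmm",
    "income": "pmm",
    "education": "cart",
}

# Assumed call shape, following the Simple Example:
# mice.impute(n_imputations=5, maxit=10, method=methods)
```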

initial

Method for initial imputation before iterations begin:

  • 'sample' (default): Random sampling from observed values

  • 'mean': Use mean for numeric, mode for categorical

visit_sequence

Order to visit variables during each iteration:

  • 'monotone' (default): Ordered by amount of missingness

  • 'random': Random order in each iteration

  • A list of column names for custom order
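To build a custom order yourself, you can count missing values per column and sort. The helper below is a hypothetical illustration that visits the least incomplete column first; whether the built-in 'monotone' option sorts in ascending or descending order is an assumption you should check against the library.

```python
def order_by_missingness(data):
    """Names of incomplete columns, fewest missing values first."""
    counts = {col: sum(v is None for v in values) for col, values in data.items()}
    incomplete = [col for col, n in counts.items() if n > 0]
    return sorted(incomplete, key=lambda col: counts[col])

# Made-up data: None marks a missing value.
data = {
    "age": [34, None, 51, 29],
    "income": [None, None, 72000, None],
    "education": [12, 16, 16, 14],
}
visit_sequence = order_by_missingness(data)
print(visit_sequence)  # ['age', 'income']
```

Fully observed columns such as education are left out because they never need to be imputed.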

Controlling the Imputation Process

Predictor Matrix

By default, each variable is predicted using all other variables. You can customize this with a predictor matrix:

import numpy as np

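# Create custom predictor matrix: rows are targets, columns are predictors
# (1 = use as predictor, 0 = exclude); the diagonal is 0 so no variable predicts itself
n_cols = len(df.columns)
predictor_matrix = pd.DataFrame(
    1 - np.eye(n_cols, dtype=int), index=df.columns, columns=df.columns
)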

# Exclude certain predictors
predictor_matrix.loc['income', 'education'] = 0  # Don't use education to predict income

mice.impute(predictor_matrix=predictor_matrix)

See Predictor Matrices for more details.

Method-Specific Parameters

Different methods have specific parameters you can tune:

# PMM with custom number of donors
mice.impute(method='pmm', pmm_donors=3)

# CART with maximum depth
mice.impute(method='cart', cart_max_depth=10)

# Random Forest with number of trees
mice.impute(method='rf', rf_n_estimators=50)
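To see what the number of donors controls in predictive mean matching, here is a minimal sketch of the donor step. It assumes pmm_donors is the size of the donor pool; the function pmm_draw and the numbers are made up for illustration, and the model that produces the predicted values is left out.

```python
import random

def pmm_draw(pred_missing, pred_observed, y_observed, donors=3, rng=None):
    """For each missing case, borrow an observed value from one of the
    `donors` observed cases whose predictions are closest."""
    rng = rng or random.Random()
    imputed = []
    for p in pred_missing:
        # Rank observed cases by distance between predicted values.
        ranked = sorted(range(len(pred_observed)),
                        key=lambda i: abs(pred_observed[i] - p))
        donor = rng.choice(ranked[:donors])
        imputed.append(y_observed[donor])  # an actually observed value
    return imputed

pred_observed = [10.2, 11.9, 14.1, 15.8, 18.0]  # model predictions, observed rows
y_observed = [10, 12, 15, 15, 19]               # their real values
values = pmm_draw([12.5, 17.0], pred_observed, y_observed,
                  donors=3, rng=random.Random(1))
```

Because imputations are always copied from observed cases, they stay within the range of the observed data. A smaller donor pool follows the model more closely, while a larger one adds variability.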

Accessing Results

Imputed Datasets

# List of pandas DataFrames
imputed_datasets = mice.imputed_datasets

# Access individual datasets
dataset_1 = imputed_datasets[0]
dataset_2 = imputed_datasets[1]

Convergence Diagnostics

# Mean and variance chains for each variable
chain_mean = mice.chain_mean
chain_var = mice.chain_var

# Visualize
from plotting.diagnostics import plot_chain_stats
plot_chain_stats(chain_mean, chain_var)

See Convergence Diagnostics for details on interpreting these.
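Alongside the plots, a numeric check can help. The sketch below computes the Gelman-Rubin statistic (R-hat), a diagnostic borrowed from MCMC practice rather than something this page says mice-py provides. It assumes you have arranged one variable's chain means as a list of chains, one per imputation, each holding one value per iteration; the actual layout of mice.chain_mean is not specified here.

```python
import statistics

def r_hat(chains):
    """Gelman-Rubin statistic for equally long chains (values near 1 are good)."""
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean([statistics.variance(c) for c in chains])
    between = n * statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5

# Made-up chains: the first set overlaps, the second set sits at different levels.
mixed = [[5.0, 5.2, 4.9, 5.1], [5.1, 4.9, 5.2, 5.0], [4.9, 5.1, 5.0, 5.2]]
stuck = [[1.0, 1.1, 0.9, 1.0], [5.0, 5.1, 4.9, 5.0]]
print(r_hat(mixed) < 1.1, r_hat(stuck) < 1.1)  # True False
```

A value well above 1 means the chains disagree more with each other than they vary internally, which suggests running more iterations.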

Model Fitting and Pooling

After imputation, fit models and pool results:

# Fit a model on all imputed datasets
mice.fit('outcome ~ predictor1 + predictor2')

# Pool using Rubin's rules
results = mice.pool(summ=True)
print(results)

See Pooling Analysis for more on analyzing imputed data.
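Rubin's rules for a single coefficient can be written out directly, which makes clear what pooling does. This is a standalone sketch, not the code behind mice.pool; the function name pool_rubin and the numbers are illustrative.

```python
import statistics

def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed datasets.

    estimates: point estimate from each dataset
    variances: squared standard error from each dataset
    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    q_bar = statistics.fmean(estimates)      # pooled estimate
    u_bar = statistics.fmean(variances)      # within-imputation variance
    b = statistics.variance(estimates)       # between-imputation variance
    total = u_bar + (1 + 1 / m) * b          # total variance
    return q_bar, total ** 0.5

estimates = [1.0, 1.2, 0.8, 1.1, 0.9]
variances = [0.04, 0.05, 0.04, 0.05, 0.04]
q_bar, se = pool_rubin(estimates, variances)
```

The between-imputation term is what carries the uncertainty due to missing data; analyzing a single imputed dataset would drop it and understate the standard error.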

When MICE Works Well

MICE is effective when:

✓ Data is MAR: Missingness can be predicted from observed variables

✓ Relationships are clear: Variables have predictable relationships

✓ Sufficient data: Enough observed cases to model relationships

✓ Multiple variables: Missing data across several variables

✓ Complex patterns: Non-monotone missingness patterns

Limitations of MICE

Be aware of potential issues:

✗ MNAR data: MICE assumes MAR; with MNAR, results may be biased

✗ High missingness: If >50% missing in key variables, predictions may be unstable

✗ Small samples: Need sufficient data to estimate relationships

✗ Incompatible models: The separate models for each variable may be theoretically inconsistent (though this rarely causes problems in practice)

✗ Perfect collinearity: Variables with perfect relationships may cause issues

The Imputation Model vs Analysis Model

An important concept: the imputation model (used to fill in missing values) should be at least as complex as your analysis model (the model you’ll fit to the data).

Imputation model: The set of all univariate models used to impute each variable

Analysis model: The model you fit to the completed data (e.g., regression)

Tip

Include all variables that:

  • Are in your analysis model

  • Predict missingness

  • Are correlated with incomplete variables

This makes the MAR assumption more plausible and improves imputation quality.
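A quick way to spot variables that predict missingness is to compare a fully observed variable between the rows where the target is missing and the rows where it is observed. The sketch below uses made-up numbers and a hypothetical helper name; a large gap suggests the covariate belongs in the imputation model.

```python
import statistics

def mean_gap_by_missingness(target, covariate):
    """Mean of `covariate` where `target` is missing minus where it is observed."""
    miss = [c for t, c in zip(target, covariate) if t is None]
    obs = [c for t, c in zip(target, covariate) if t is not None]
    return statistics.fmean(miss) - statistics.fmean(obs)

# Made-up data: income is missing mostly for older respondents.
income = [52, None, 61, None, 48, None, 55, 58]
age = [31, 64, 40, 70, 28, 59, 35, 44]
gap = mean_gap_by_missingness(income, age)
print(round(gap, 1))  # 28.7
```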

Typical Workflow

  1. Explore your data: Understand patterns and mechanisms of missingness

  2. Configure MICE: Choose methods, predictor matrix, and parameters

  3. Run imputation: Create multiple complete datasets

  4. Check convergence: Ensure the algorithm has stabilized

  5. Diagnose quality: Compare observed vs imputed distributions

  6. Analyze: Fit your statistical model(s)

  7. Pool results: Combine estimates using Rubin’s rules

Example Workflow

import pandas as pd

from imputation import MICE, configure_logging
from plotting.diagnostics import plot_chain_stats, stripplot
from plotting.utils import md_pattern_like

# Enable logging
configure_logging(level='INFO')

# Load the data with missing values
df = pd.read_csv('data.csv')

# 1. Explore
pattern = md_pattern_like(df)
print(pattern)

# 2-3. Configure and run
mice = MICE(df)
mice.impute(n_imputations=5, maxit=15, method='pmm')

# 4. Check convergence
plot_chain_stats(mice.chain_mean, mice.chain_var)

# 5. Diagnose
missing_pattern = df.notna().astype(int)
stripplot(mice.imputed_datasets, missing_pattern)

# 6-7. Analyze and pool
mice.fit('outcome ~ predictor1 + predictor2')
results = mice.pool(summ=True)
print(results)

Next Steps