Imputation Methods

mice-py provides five different imputation methods. This guide helps you choose and configure the right method for your data.

Overview of Methods

Method	Best For	Data Type	Preserves	Complexity
PMM	General purpose	Numeric	Distribution	Low
CART	Non-linear	Both	Interactions	Medium
Random Forest	Complex patterns	Both	Interactions	High
MIDAS	Small samples	Numeric	Local patterns	Low
Sample	Quick & simple	Both	Observed values	Very Low

PMM: Predictive Mean Matching

When to use: Default choice for numeric data, especially when preserving the original distribution is important.

How It Works

Fit a Bayesian linear regression on observed values
Generate predictions for both observed and missing values
For each missing value, find the k closest observed values (donors) based on predicted values
Randomly select one donor and use its observed value as the imputed value
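The four steps above can be sketched in a few lines of dependency-free Python. This is an illustration only, not the mice-py implementation: the helper names fit_line and pmm_impute are hypothetical, a single predictor is used, and ordinary least squares stands in for the Bayesian regression (so it corresponds to matching on predicted values without a parameter draw).

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x (stand-in for the Bayesian fit)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def pmm_impute(x, y, donors=5, rng=None):
    """Return a copy of y in which None entries are filled by predictive mean matching."""
    rng = rng or random.Random()
    observed_rows = [i for i, yi in enumerate(y) if yi is not None]
    # Step 1: fit the regression on observed rows only
    intercept, slope = fit_line([x[i] for i in observed_rows],
                                [y[i] for i in observed_rows])
    # Step 2: predict for every row, observed and missing
    predicted = [intercept + slope * xi for xi in x]
    filled = list(y)
    for i, yi in enumerate(y):
        if yi is not None:
            continue
        # Step 3: the k observed rows whose predictions are closest are the donors
        nearest = sorted(observed_rows,
                         key=lambda j: abs(predicted[j] - predicted[i]))[:donors]
        # Step 4: copy the observed value of one randomly chosen donor
        filled[i] = y[rng.choice(nearest)]
    return filled


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, None, 8.2, 9.8, None, 14.1, 16.0]
completed = pmm_impute(x, y, donors=3, rng=random.Random(0))
```

Because every filled entry is copied from a donor, the completed list contains only values that already occur in y.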

Key advantage: Imputed values are always from the observed data, so impossible values cannot be generated.

Usage

# Basic usage
mice.impute(method='pmm')

# With custom parameters
mice.impute(
    method='pmm',
    pmm_donors=5,         # Number of donor candidates (default: 5)
    pmm_matchtype=1,      # Matching type (0, 1, or 2)
    pmm_ridge=1e-5        # Ridge regularization parameter
)

Parameters

donors (default: 5)

Number of closest donors to consider. Larger values increase variability; smaller values make imputations more deterministic.

matchtype (default: 1)

0: Match predicted values (no randomness)
1: Match using drawn parameter values (default, adds uncertainty)
2: Match using drawn parameter values for both observed and missing cases (maximum randomness)

ridge (default: 1e-5)

Regularization to stabilize estimation with collinear predictors.
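To see why a small ridge term helps, consider two perfectly collinear predictors. The sketch below assumes one common form of the penalty, where the diagonal of X'X is scaled by (1 + ridge); whether mice-py applies the penalty in exactly this way is an assumption, and the helper names are hypothetical.

```python
def normal_matrix(rows):
    """Compute X'X for a list of two-column rows."""
    a = sum(r[0] * r[0] for r in rows)
    b = sum(r[0] * r[1] for r in rows)
    d = sum(r[1] * r[1] for r in rows)
    return [[a, b], [b, d]]


def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


# The second predictor is exactly twice the first, so X'X is singular
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
xtx = normal_matrix(rows)

ridge = 1e-5
# Assumed penalty form: inflate each diagonal entry by the factor (1 + ridge)
penalised = [
    [xtx[0][0] * (1 + ridge), xtx[0][1]],
    [xtx[1][0], xtx[1][1] * (1 + ridge)],
]

singular_det = det2(xtx)          # zero: the normal equations cannot be solved
stabilised_det = det2(penalised)  # small but positive: a solution exists
```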

When PMM Works Best

✓ Numeric data with moderate to large sample size
✓ Preserving distribution properties is important
✓ Data has outliers that should be preserved
✓ MAR mechanism with linear relationships

Limitations

✗ Only generates values already in the data
✗ Assumes approximately linear relationships
✗ May struggle with highly skewed data
✗ Not suitable for categorical variables

CART: Classification and Regression Trees

When to use: Data with non-linear relationships or interactions between variables.

How It Works

Build a decision tree using complete observations
For classification (categorical): predict class probabilities
For regression (numeric): predict values
Add appropriate random variation to predictions
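As a rough illustration of the regression case, the sketch below fits a tree with a single split and imputes by drawing an observed value from the leaf that the incomplete row falls into. Drawing from the leaf is one common way to add random variation to tree predictions; whether mice-py does exactly this is an assumption, and best_split and stump_impute are hypothetical names.

```python
import random


def best_split(x, y):
    """Threshold on x that minimises the summed squared error of the two leaves."""
    def sse(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    best = None
    cut_points = sorted(set(x))
    for lo, hi in zip(cut_points, cut_points[1:]):
        threshold = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        score = sse(left) + sse(right)
        if best is None or score < best[0]:
            best = (score, threshold)
    return best[1]


def stump_impute(x_obs, y_obs, x_mis, rng):
    """Impute each missing y by sampling an observed y from the matching leaf."""
    threshold = best_split(x_obs, y_obs)
    imputed = []
    for xm in x_mis:
        # Observed rows that land on the same side of the split as the missing row
        leaf = [yi for xi, yi in zip(x_obs, y_obs)
                if (xi <= threshold) == (xm <= threshold)]
        imputed.append(rng.choice(leaf))
    return imputed


x_obs = [1, 2, 3, 10, 11, 12]
y_obs = [5.0, 6.0, 5.5, 20.0, 21.0, 19.5]
threshold = best_split(x_obs, y_obs)
imputed = stump_impute(x_obs, y_obs, [2.5, 10.5], random.Random(1))
```

A full CART model repeats the split search recursively inside each leaf, subject to limits such as a maximum depth or a minimum leaf size.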

Key advantage: Automatically captures interactions and non-linear patterns without needing to specify them.

Usage

# Basic usage
mice.impute(method='cart')

# With custom parameters
mice.impute(
    method='cart',
    cart_max_depth=None,        # Maximum tree depth
    cart_min_samples_split=2,   # Min samples to split
    cart_min_samples_leaf=1     # Min samples in leaf
)

Parameters

max_depth (default: None): Maximum depth of the tree. None allows unlimited depth. Use smaller values (e.g., 10-20) to prevent overfitting.
min_samples_split (default: 2): Minimum samples required to split an internal node.
min_samples_leaf (default: 1): Minimum samples required at a leaf node.

When CART Works Best

✓ Non-linear relationships
✓ Interaction effects between variables
✓ Mixed data types (numeric and categorical)
✓ Robust to outliers
✓ Categorical variables with many levels

Limitations

✗ Can overfit with small samples
✗ May not preserve distribution as well as PMM
✗ Less stable than other methods (high variance)

Random Forest

When to use: Complex data with many interactions and non-linear relationships.

How It Works

Build an ensemble of decision trees using bootstrap samples
Each tree uses a random subset of predictors
Average predictions across all trees
Add random variation appropriate for the data type
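The bootstrap-and-average idea can be sketched with one-split trees. This is a simplified illustration under stated assumptions: there is only one predictor, so the random predictor subset of step 2 is omitted, the final random-variation step is left out, and fit_stump and forest_predict are hypothetical names rather than mice-py functions.

```python
import random


def fit_stump(x, y):
    """Fit a one-split regression tree; returns (threshold, left_mean, right_mean)."""
    def mean(values):
        return sum(values) / len(values)

    best = None
    cut_points = sorted(set(x))
    for lo, hi in zip(cut_points, cut_points[1:]):
        t = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        ml, mr = mean(left), mean(right)
        err = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if best is None or err < best[0]:
            best = (err, t, ml, mr)
    if best is None:
        # The bootstrap sample had one distinct x value, so no split is possible
        return float("inf"), mean(y), mean(y)
    return best[1:]


def forest_predict(x, y, x_new, n_estimators=100, rng=None):
    """Average the predictions of stumps fitted to bootstrap samples."""
    rng = rng or random.Random()
    n = len(x)
    total = 0.0
    for _ in range(n_estimators):
        rows = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        t, ml, mr = fit_stump([x[i] for i in rows], [y[i] for i in rows])
        total += ml if x_new <= t else mr
    return total / n_estimators


x = [1, 2, 3, 10, 11, 12]
y = [5.0, 6.0, 5.5, 20.0, 21.0, 19.5]
low = forest_predict(x, y, 2, n_estimators=50, rng=random.Random(0))
high = forest_predict(x, y, 11, n_estimators=50, rng=random.Random(0))
```

Averaging over many resampled trees is what smooths out the instability of a single tree.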

Key advantage: Typically more stable and accurate than a single CART tree, especially with complex patterns.

Usage

# Basic usage
mice.impute(method='rf')

# With custom parameters
mice.impute(
    method='rf',
    rf_n_estimators=100,     # Number of trees
    rf_max_depth=None,       # Maximum depth
    rf_min_samples_split=2,  # Min samples to split
    rf_max_features='sqrt'   # Features per split
)

Parameters

n_estimators (default: 100): Number of trees in the forest. More trees give more stable results but take longer to fit.
max_depth (default: None): Maximum depth of each tree.
max_features (default: 'sqrt'): Number of features to consider for each split. Options: 'sqrt', 'log2', or an integer.

When Random Forest Works Best

✓ Complex, non-linear relationships
✓ Many interaction effects
✓ Large datasets
✓ High-dimensional data
✓ Mixed data types
✓ When accuracy is more important than interpretability

Limitations

✗ Computationally expensive
✗ Slower than other methods
✗ Less interpretable than simpler methods
✗ May not preserve marginal distributions as well as PMM

MIDAS: Multiple Imputation with Distant Average Substitution

When to use: Numeric data, especially with small samples or skewed distributions.

How It Works

For each missing value, identify nearby observed values using distance metrics
Take a distance-weighted average of those donors (farther donors get less weight)
Add random variation
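A minimal sketch of the distance-weighting step, assuming inverse-distance weights measured on a single predictor. The weighting scheme, the eps guard, and the function name distance_weighted_impute are illustrative assumptions, not the mice-py implementation, and the final random-variation step is omitted.

```python
def distance_weighted_impute(target, donor_x, donor_y, donors=3, eps=1e-9):
    """Average the y values of the nearest donors, weighting close donors more."""
    # Rank observed rows by their distance to the incomplete row
    ranked = sorted(zip(donor_x, donor_y), key=lambda pair: abs(pair[0] - target))
    nearest = ranked[:donors]
    # Inverse-distance weights; eps avoids division by zero for exact matches
    weights = [1.0 / (abs(dx - target) + eps) for dx, _ in nearest]
    weighted_sum = sum(w * dy for w, (_, dy) in zip(weights, nearest))
    return weighted_sum / sum(weights)


donor_x = [1.0, 3.0, 10.0]
donor_y = [10.0, 20.0, 100.0]

# Two equally close donors contribute equally
two_donors = distance_weighted_impute(2.0, donor_x, donor_y, donors=2)
# Adding the far donor moves the average only slightly toward its value
three_donors = distance_weighted_impute(2.0, donor_x, donor_y, donors=3)
```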

Key advantage: Often performs well with small samples and skewed distributions where PMM struggles.

Usage

# Basic usage
mice.impute(method='midas')

# With custom parameters
mice.impute(
    method='midas',
    midas_donors=5,      # Number of donors
    midas_ridge=1e-5     # Ridge parameter
)

When MIDAS Works Best

✓ Small sample sizes
✓ Skewed distributions
✓ Numeric data
✓ When PMM struggles with distribution

Limitations

✗ Only for numeric variables
✗ Less commonly used (less validated than PMM/CART/RF)
✗ May require parameter tuning

Sample: Random Sampling

When to use: Quick imputations, initial values, or when other methods aren’t suitable.

How It Works

Simply draws random values from the observed values of each variable.
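In plain Python the idea amounts to the following sketch, where None marks a missing entry and sample_impute is a hypothetical helper name rather than part of mice-py.

```python
import random


def sample_impute(values, rng=None):
    """Fill None entries with random draws from the observed entries."""
    rng = rng or random.Random()
    observed = [v for v in values if v is not None]
    return [v if v is not None else rng.choice(observed) for v in values]


education = ["BSc", None, "MSc", "BSc", None, "PhD"]
filled = sample_impute(education, rng=random.Random(42))
```

The same function works for numeric and categorical variables because it never looks at the other columns.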

Key advantage: Very fast and simple; imputed values follow the observed marginal distribution of each variable.

Usage

mice.impute(method='sample')

When Sample Works Best

✓ Initial imputation (before MICE iterations)
✓ Categorical variables with many levels
✓ Quick exploratory analysis
✓ When no predictive relationship exists

Limitations

✗ Ignores relationships between variables
✗ No predictive component
✗ Only useful for simple cases or initialization

Choosing a Method

Decision Tree

Is your data numeric or categorical?
│
├── Mostly numeric
│   │
│   ├── Linear relationships? → PMM
│   │
│   ├── Non-linear? → CART or RF
│   │
│   └── Small sample or skewed? → MIDAS or PMM
│
└── Mixed or mostly categorical
    │
    ├── Simple relationships? → CART
    │
    └── Complex interactions? → RF
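The tree above can also be written as a small helper that returns candidate method names. The function suggest_methods is hypothetical, and the order in which the numeric branches are checked (small or skewed first, then non-linear) is an assumption, since the diagram does not say which question takes priority.

```python
def suggest_methods(mostly_numeric, nonlinear=False,
                    small_or_skewed=False, complex_interactions=False):
    """Return candidate imputation methods following the decision tree."""
    if mostly_numeric:
        if small_or_skewed:
            return ["midas", "pmm"]
        if nonlinear:
            return ["cart", "rf"]
        return ["pmm"]  # linear relationships
    # Mixed or mostly categorical data
    return ["rf"] if complex_interactions else ["cart"]


numeric_linear = suggest_methods(mostly_numeric=True)
numeric_skewed = suggest_methods(mostly_numeric=True, small_or_skewed=True)
mixed_complex = suggest_methods(mostly_numeric=False, complex_interactions=True)
```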

General Guidelines

Start with PMM: It’s the most well-studied method and works well in most cases.
Use CART for interactions: If you know or suspect important interactions between variables.
Use RF for complexity: When you have complex patterns and computational resources.
Use MIDAS when PMM fails: Particularly with small samples or skewed data.
Use Sample for initialization: Or for very simple cases.

Using Different Methods for Different Variables

You can use different methods for different variables:

method_dict = {
    'age': 'pmm',           # Numeric, approximately normal
    'income': 'midas',      # Numeric, highly skewed
    'education': 'sample',  # Categorical with few levels
    'job_type': 'cart',     # Categorical with many levels
    'health_score': 'rf'    # Numeric with complex patterns
}

mice.impute(n_imputations=5, method=method_dict)
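Before passing such a dictionary, it can help to check it for typos. The sketch below is a hypothetical pre-flight check, not a mice-py function; the set of method names is taken from the usage examples in this guide.

```python
# Method names as used in the examples of this guide
KNOWN_METHODS = {"pmm", "cart", "rf", "midas", "sample"}


def check_method_dict(method_dict, columns):
    """Return a list of human-readable problems found in method_dict."""
    problems = []
    for column, method in method_dict.items():
        if column not in columns:
            problems.append(f"unknown column: {column}")
        if method not in KNOWN_METHODS:
            problems.append(f"unknown method for {column}: {method}")
    return problems


columns = ["age", "income", "education"]
good = check_method_dict({"age": "pmm", "income": "midas"}, columns)
bad = check_method_dict({"age": "pnm", "salary": "rf"}, columns)
```

An empty list means every key is a known column and every value is one of the five methods.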

Comparing Methods

To compare methods empirically:

from plotting.diagnostics import densityplot, stripplot

# Try PMM (assumes MICE is already imported and df is a pandas DataFrame with missing values)
mice_pmm = MICE(df)
mice_pmm.impute(method='pmm')

# Try CART
mice_cart = MICE(df)
mice_cart.impute(method='cart')

# Compare distributions (in missing_pattern, 1 = observed and 0 = missing)
missing_pattern = df.notna().astype(int)
densityplot(mice_pmm.imputed_datasets, missing_pattern,
            save_path='pmm_density.png')
densityplot(mice_cart.imputed_datasets, missing_pattern,
            save_path='cart_density.png')

Look for:

How well imputed values match observed distribution
Whether extreme values are reasonable
Smooth transitions between observed and imputed
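The visual checks can be complemented with a few summary numbers per variable. This is a sketch using only the standard library; compare_summary is a hypothetical helper, and it expects the observed and imputed values of one variable as plain lists.

```python
import statistics


def compare_summary(observed, imputed):
    """Summarise how imputed values differ from observed values for one variable."""
    lowest, highest = min(observed), max(observed)
    return {
        # Positive values mean the imputed values sit higher on average
        "mean_shift": statistics.mean(imputed) - statistics.mean(observed),
        # Ratios far from 1 suggest the spread is not being preserved
        "sd_ratio": statistics.stdev(imputed) / statistics.stdev(observed),
        # Imputed values outside the observed range deserve a closer look
        "outside_observed_range": sum(v < lowest or v > highest for v in imputed),
    }


observed = [1.0, 2.0, 3.0, 4.0, 5.0]
imputed = [2.0, 3.0, 4.0, 9.0]
summary = compare_summary(observed, imputed)
```

Under MAR the imputed and observed distributions are allowed to differ, so these numbers are prompts for inspection rather than pass or fail criteria.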

Research Findings

Based on simulation studies in the thesis:

PMM performs reliably under MCAR and mild MAR with symmetric distributions
MIDAS consistently matches or outperforms PMM with skewed data or small samples
CART/RF handle non-linear relationships effectively but may not preserve marginal distributions as well
Method choice should consider data characteristics, missingness patterns, and sample size

Next Steps

Learn about Predictor Matrices to control which variables predict which
Check Convergence Diagnostics after imputation
See practical examples in Examples