Method Details

Brief technical overview of the five imputation methods in mice-py.

PMM: Predictive Mean Matching

Algorithm:

  1. Fit Bayesian linear regression on observed cases

  2. Draw parameters from posterior

  3. Generate predictions for observed and missing cases

  4. For each missing value, find the k nearest observed cases by predicted value (k is set by pmm_donors)

  5. Randomly select one donor and use its observed value

Key feature: Imputed values come from observed data (prevents impossible values).

Best for: Numeric variables, preserving distributions, data with outliers.

Parameters: pmm_donors (default: 5), pmm_matchtype, pmm_ridge
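
The five steps above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the mice-py implementation: the function name pmm_impute and its arguments are hypothetical, the ridge term is one assumed way of stabilising the fit, and the matching rule used here (observed predictions from the point estimate, missing predictions from the posterior draw) is only one of the options that a setting like pmm_matchtype might select.

```python
import numpy as np

def pmm_impute(X, y, donors=5, ridge=1e-5, rng=None):
    """Hypothetical PMM sketch: impute NaNs in y using predictors X."""
    rng = np.random.default_rng(rng)
    obs = ~np.isnan(y)
    Xd = np.column_stack([np.ones(len(y)), X])  # add an intercept column
    Xo, yo = Xd[obs], y[obs]

    # Step 1: least-squares fit on observed cases, with a small ridge penalty
    xtx = Xo.T @ Xo
    V = np.linalg.inv(xtx + ridge * np.diag(np.diag(xtx)))
    V = (V + V.T) / 2  # symmetrise against round-off
    beta_hat = V @ Xo.T @ yo
    resid = yo - Xo @ beta_hat
    df = max(len(yo) - Xo.shape[1], 1)

    # Step 2: draw sigma^2, then beta, from their posterior
    sigma2 = (resid @ resid) / rng.chisquare(df)
    z = rng.standard_normal(len(beta_hat))
    beta_star = beta_hat + np.sqrt(sigma2) * (np.linalg.cholesky(V) @ z)

    # Step 3: predictions for observed and missing cases
    pred_obs = Xo @ beta_hat
    pred_mis = Xd[~obs] @ beta_star

    # Steps 4-5: pick one of the k nearest donors at random, copy its value
    k = min(donors, len(yo))
    imputed = y.copy()
    for i, p in zip(np.flatnonzero(~obs), pred_mis):
        nearest = np.argsort(np.abs(pred_obs - p))[:k]
        imputed[i] = yo[rng.choice(nearest)]
    return imputed
```

Because every imputed value is copied from a donor, the output can never contain a value that was not already observed.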

CART: Classification and Regression Trees

Algorithm:

  1. Build decision tree on complete observations

  2. Use tree to predict missing values

  3. Add random variation to predictions

Key feature: Automatically captures interactions and non-linear patterns.

Best for: Non-linear relationships, interactions, categorical variables.

Parameters: cart_max_depth, cart_min_samples_split, cart_min_samples_leaf
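
A rough sketch of the CART steps, assuming scikit-learn is available. The function cart_impute is hypothetical, and the way random variation is added here (drawing an observed value from the leaf that the missing case falls into, as R's mice does for its cart method) is an assumption about step 3, not a statement of what mice-py does internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def cart_impute(X, y, max_depth=None, min_samples_leaf=5, rng=None):
    """Hypothetical CART sketch: impute NaNs in y using predictors X."""
    rng = np.random.default_rng(rng)
    obs = ~np.isnan(y)
    yo = y[obs]

    # Step 1: fit a tree on the cases where y is observed
    tree = DecisionTreeRegressor(
        max_depth=max_depth, min_samples_leaf=min_samples_leaf, random_state=0
    )
    tree.fit(X[obs], yo)

    # Step 2: find the leaf each case lands in
    leaf_obs = tree.apply(X[obs])
    leaf_mis = tree.apply(X[~obs])

    # Step 3: random variation by sampling a donor from the same leaf
    imputed = y.copy()
    for i, leaf in zip(np.flatnonzero(~obs), leaf_mis):
        imputed[i] = rng.choice(yo[leaf_obs == leaf])
    return imputed
```

A larger min_samples_leaf gives each missing case a bigger donor pool, at the cost of a coarser tree.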

Random Forest

Algorithm:

  1. Build multiple decision trees on bootstrap samples

  2. Average predictions across trees

  3. Add random variation to predictions

Key feature: More stable than a single tree, because averaging over bootstrapped trees reduces variance.

Best for: Complex patterns, high-dimensional data, many interactions.

Parameters: rf_n_estimators (default: 100), rf_max_depth, rf_max_features
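
The random forest steps might look like the following, again assuming scikit-learn. Everything here is illustrative: rf_impute is a hypothetical name, and the noise model (Gaussian noise scaled by the out-of-bag residual spread) is just one plausible reading of "add random variation", chosen because in-sample forest residuals tend to understate the error.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_impute(X, y, n_estimators=100, max_depth=None, max_features=1.0, rng=None):
    """Hypothetical random forest sketch: impute NaNs in y using predictors X."""
    rng = np.random.default_rng(rng)
    obs = ~np.isnan(y)
    imputed = y.copy()
    n_mis = int((~obs).sum())
    if n_mis == 0:
        return imputed

    # Step 1: grow trees on bootstrap samples of the observed cases
    forest = RandomForestRegressor(
        n_estimators=n_estimators,
        max_depth=max_depth,
        max_features=max_features,
        bootstrap=True,
        oob_score=True,
        random_state=int(rng.integers(2**31 - 1)),
    )
    forest.fit(X[obs], y[obs])

    # Step 2: predict() averages the per-tree predictions
    pred = forest.predict(X[~obs])

    # Step 3: add noise scaled by the out-of-bag residual spread (assumption)
    resid_sd = np.std(y[obs] - forest.oob_prediction_)
    imputed[~obs] = pred + rng.normal(0.0, resid_sd, size=n_mis)
    return imputed
```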

MIDAS: Distance Aided Substitution

Algorithm:

  1. Calculate distances between cases in predictor space

  2. Weight observed cases by inverse distance

  3. Select the k donors with the highest weights (k is set by midas_donors)

  4. Use weighted average plus noise

Key feature: Uses local structure of data, good for skewed distributions.

Best for: Small samples, skewed distributions, when PMM struggles.

Parameters: midas_donors (default: 5), midas_ridge
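
The four steps listed above can be followed literally in NumPy. This is a sketch under stated assumptions rather than the library's code: midas_impute is a hypothetical name, Euclidean distance is assumed for "distance in predictor space", eps is an assumed guard against division by zero, and the noise scale (the weighted standard deviation of the selected donors) is an illustrative choice. The role of midas_ridge is not covered here.

```python
import numpy as np

def midas_impute(X, y, donors=5, eps=1e-8, rng=None):
    """Hypothetical MIDAS-style sketch: impute NaNs in y using predictors X."""
    rng = np.random.default_rng(rng)
    obs = ~np.isnan(y)
    Xo, yo = X[obs], y[obs]
    k = min(donors, len(yo))

    imputed = y.copy()
    for i in np.flatnonzero(~obs):
        # Step 1: distance from this case to every observed case
        dist = np.linalg.norm(Xo - X[i], axis=1)
        # Step 2: inverse-distance weights
        w = 1.0 / (dist + eps)
        # Step 3: keep the k donors with the highest weights
        top = np.argsort(w)[-k:]
        wk = w[top] / w[top].sum()
        # Step 4: weighted average plus noise
        mean = wk @ yo[top]
        sd = np.sqrt(wk @ (yo[top] - mean) ** 2)
        imputed[i] = mean + rng.normal(0.0, sd)
    return imputed
```

Unlike PMM, the result is a blend of donor values, so imputed values need not coincide with any observed value.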

Sample: Random Sampling

Algorithm:

  1. Pool all observed values of the variable

  2. Randomly sample one value for each missing case

Key feature: Simplest method; draws from the observed marginal distribution (preserved in expectation) and ignores relationships with other variables.

Best for: Initial imputation, categorical variables with many levels, quick exploration.

Parameters: None
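
Both steps fit in a few lines. The function sample_impute below is a hypothetical illustration, not the mice-py API.

```python
import numpy as np

def sample_impute(y, rng=None):
    """Hypothetical sketch: fill NaNs in y with random draws from observed values."""
    rng = np.random.default_rng(rng)
    obs = ~np.isnan(y)
    imputed = y.copy()
    # Step 1: the donor pool is every observed value of the variable
    pool = y[obs]
    # Step 2: draw one value, with replacement, for each missing case
    imputed[~obs] = rng.choice(pool, size=int((~obs).sum()), replace=True)
    return imputed
```

No predictors are involved, which is why the method is fast but cannot reproduce relationships between variables.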

Comparison Summary

Method    Best For                    Speed
PMM       General purpose numeric     Fast
CART      Non-linear, interactions    Fast
RF        Complex patterns            Slow
MIDAS     Skewed, small samples       Fast
Sample    Quick/simple                Very fast

See Imputation Methods for practical selection guidance.