Method Details
Brief technical overview of the five imputation methods in mice-py.
PMM: Predictive Mean Matching
Algorithm:
Fit Bayesian linear regression on observed cases
Draw parameters from posterior
Generate predictions for observed and missing cases
For each missing value, find the k observed cases whose predicted values are closest to its own prediction
Randomly select one donor and use its observed value
Key feature: Imputed values come from observed data (prevents impossible values).
Best for: Numeric variables, preserving distributions, data with outliers.
Parameters: pmm_donors (default: 5), pmm_matchtype, pmm_ridge
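The five PMM steps above can be sketched in NumPy as follows. This is a minimal illustration and not mice-py's implementation: the function name `pmm_impute`, its signature, and the `ridge` default are assumptions made for the example. It assumes fully observed predictors. Observed cases are predicted with the point estimate and missing cases with the drawn parameters, which is one common matching convention and may differ from what `pmm_matchtype` selects.

```python
import numpy as np

def pmm_impute(X, y, donors=5, ridge=1e-5, rng=None):
    """Hypothetical helper: impute NaNs in y by predictive mean matching."""
    rng = np.random.default_rng(rng)
    y = np.asarray(y, dtype=float)
    # Prepend an intercept column to the (fully observed) predictors.
    X = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    obs = ~np.isnan(y)
    Xo, yo = X[obs], y[obs]
    n, p = Xo.shape

    # Step 1: ridge-stabilised least-squares fit on the observed cases.
    XtX = Xo.T @ Xo
    V = np.linalg.inv(XtX + ridge * np.diag(np.diag(XtX)))
    beta_hat = V @ Xo.T @ yo
    resid = yo - Xo @ beta_hat

    # Step 2: draw the residual variance and coefficients from their posterior.
    sigma2 = resid @ resid / rng.chisquare(max(n - p, 1))
    chol = np.linalg.cholesky(V)
    beta_star = beta_hat + np.sqrt(sigma2) * (chol @ rng.standard_normal(p))

    # Step 3: predictions for observed and missing cases.
    pred_obs = Xo @ beta_hat
    pred_mis = X[~obs] @ beta_star

    # Steps 4-5: pick one donor at random among the k closest predictions
    # and copy that donor's observed value.
    k = min(donors, n)
    out = y.copy()
    imputed = np.empty(pred_mis.shape[0])
    for i, target in enumerate(pred_mis):
        nearest = np.argsort(np.abs(pred_obs - target))[:k]
        imputed[i] = yo[rng.choice(nearest)]
    out[~obs] = imputed
    return out

# Small synthetic example: 200 rows, 40 values set to missing.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=200)
y_missing = y.copy()
y_missing[rng.choice(200, size=40, replace=False)] = np.nan
y_imputed = pmm_impute(X, y_missing, donors=5, rng=1)
```

Because every imputed value is copied from a donor, the result contains only values that already occur in the observed data.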
CART: Classification and Regression Trees
Algorithm:
Fit a decision tree on the cases where the target variable is observed
Use tree to predict missing values
Add random variation to predictions
Key feature: Automatically captures interactions and non-linear patterns.
Best for: Non-linear relationships, interactions, categorical variables.
Parameters: cart_max_depth, cart_min_samples_split, cart_min_samples_leaf
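A rough sketch of the CART steps above, using scikit-learn's `DecisionTreeRegressor`. The function name `cart_impute` and the default values are hypothetical, not mice-py's API. One assumption deserves attention: the "random variation" step is implemented here by sampling an observed value from the leaf that each missing case falls into, which is one common way to do it and may not match what mice-py does. Predictors are assumed to be fully observed.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def cart_impute(X, y, max_depth=None, min_samples_split=2,
                min_samples_leaf=5, rng=None):
    """Hypothetical helper: impute NaNs in y with a single regression tree."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    obs = ~np.isnan(y)

    # Step 1: fit the tree on cases where the target is observed.
    tree = DecisionTreeRegressor(
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        min_samples_leaf=min_samples_leaf,
        random_state=0,
    )
    tree.fit(X[obs], y[obs])

    # Step 2: route observed and missing cases to their leaves.
    leaf_obs = tree.apply(X[obs])
    leaf_mis = tree.apply(X[~obs])

    # Step 3: add variation by drawing a random observed value from the
    # same leaf instead of using the leaf mean (assumption, see lead-in).
    yo = y[obs]
    out = y.copy()
    out[~obs] = [rng.choice(yo[leaf_obs == leaf]) for leaf in leaf_mis]
    return out

# Synthetic example with a non-linear relationship and an interaction.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(300, 2))
y = np.sin(2 * X[:, 0]) + X[:, 0] * X[:, 1] + rng.normal(scale=0.2, size=300)
y_missing = y.copy()
y_missing[rng.choice(300, size=60, replace=False)] = np.nan
y_imputed = cart_impute(X, y_missing, rng=1)
```

Every leaf contains at least `min_samples_leaf` observed cases, so the per-leaf donor pool is never empty.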
Random Forest
Algorithm:
Build multiple decision trees on bootstrap samples
Average predictions across trees
Add random variation to predictions
Key feature: Averaging over many trees gives more stable predictions than a single tree and copes well with complex patterns.
Best for: Complex patterns, high-dimensional data, many interactions.
Parameters: rf_n_estimators (default: 100), rf_max_depth, rf_max_features
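The random forest steps above might look like this with scikit-learn's `RandomForestRegressor`. The function name `rf_impute` and its signature are assumptions for illustration only. The noise step is also an assumption: here it adds a randomly drawn out-of-bag residual to each averaged prediction, since the overview does not say how mice-py generates its random variation. Predictors are assumed to be fully observed.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_impute(X, y, n_estimators=100, max_depth=None, max_features=1.0,
              rng=None):
    """Hypothetical helper: impute NaNs in y with a random forest."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    obs = ~np.isnan(y)

    # Step 1: grow trees on bootstrap samples of the observed cases.
    forest = RandomForestRegressor(
        n_estimators=n_estimators,
        max_depth=max_depth,
        max_features=max_features,
        oob_score=True,   # keep out-of-bag predictions for the noise step
        random_state=0,
    )
    forest.fit(X[obs], y[obs])

    # Step 2: average the per-tree predictions for the missing cases.
    pred = forest.predict(X[~obs])

    # Step 3: add variation by resampling out-of-bag residuals
    # (assumption, see lead-in).
    oob_resid = y[obs] - forest.oob_prediction_
    out = y.copy()
    out[~obs] = pred + rng.choice(oob_resid, size=pred.shape[0])
    return out

# Synthetic example: 200 rows, 5 predictors, 40 missing targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] * X[:, 1] + np.abs(X[:, 2]) + rng.normal(scale=0.3, size=200)
y_missing = y.copy()
y_missing[rng.choice(200, size=40, replace=False)] = np.nan
y_imputed = rf_impute(X, y_missing, rng=1)
```

Out-of-bag residuals are used rather than in-sample residuals because a forest fits its own training data very closely, which would understate the noise.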
MIDAS: Distance Aided Substitution
Algorithm:
Calculate distances between cases in predictor space
Weight observed cases by inverse distance
Select k donors with highest weights
Use weighted average plus noise
Key feature: Uses local structure of data, good for skewed distributions.
Best for: Small samples, skewed distributions, when PMM struggles.
Parameters: midas_donors (default: 5), midas_ridge
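The four steps listed above can be sketched directly in NumPy. This follows the description in this overview rather than mice-py's source, so the details are assumptions: the function name `midas_impute`, the standardisation of predictors before computing Euclidean distances, and the noise model (a normal draw scaled by the weighted spread of the donors' values). The `midas_ridge` parameter is not modelled here. Predictors are assumed to be fully observed and non-constant.

```python
import numpy as np

def midas_impute(X, y, donors=5, rng=None, eps=1e-8):
    """Hypothetical helper: distance-weighted donor imputation of NaNs in y."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    obs = ~np.isnan(y)

    # Standardise predictors so no single column dominates the distance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    Zo, yo = Z[obs], y[obs]
    k = min(donors, len(yo))

    out = y.copy()
    for i in np.flatnonzero(~obs):
        # Step 1: distances from this case to every observed case.
        dist = np.linalg.norm(Zo - Z[i], axis=1)
        # Step 2: inverse-distance weights (eps guards against division by 0).
        weights = 1.0 / (dist + eps)
        # Step 3: keep the k donors with the highest weights.
        top = np.argsort(weights)[-k:]
        w = weights[top] / weights[top].sum()
        # Step 4: weighted average plus noise scaled by the donors' spread.
        mean = w @ yo[top]
        spread = np.sqrt(w @ (yo[top] - mean) ** 2)
        out[i] = mean + rng.normal(0.0, spread)
    return out

# Synthetic example with a skewed target and a small sample.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = np.exp(0.8 * X[:, 0] + 0.3 * X[:, 1]) + rng.normal(scale=0.1, size=60)
y_missing = y.copy()
y_missing[rng.choice(60, size=12, replace=False)] = np.nan
y_imputed = midas_impute(X, y_missing, donors=5, rng=1)
```

Unlike PMM, the imputed values here are not copied from donors, so they can fall between or slightly outside the observed values.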
Sample: Random Sampling
Algorithm:
Pool all observed values of the variable
Randomly sample one value for each missing case
Key feature: Simplest method; preserves the variable's marginal distribution in expectation but ignores its relationships with other variables.
Best for: Initial imputation, categorical variables with many levels, quick exploration.
Parameters: None
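Random sampling needs only a few lines. The sketch below uses a hypothetical `sample_impute` helper (not mice-py's API) and assumes the draws are made with replacement.

```python
import numpy as np

def sample_impute(y, rng=None):
    """Hypothetical helper: fill NaNs in y with random draws from observed values."""
    rng = np.random.default_rng(rng)
    y = np.asarray(y, dtype=float)
    obs = ~np.isnan(y)
    out = y.copy()
    # Pool the observed values and draw one (with replacement) per missing case.
    out[~obs] = rng.choice(y[obs], size=int((~obs).sum()), replace=True)
    return out

y_missing = np.array([3.0, np.nan, 7.0, 1.0, np.nan, 7.0, np.nan, 2.0])
y_imputed = sample_impute(y_missing, rng=0)
```

No predictors are involved, which is why this method is fast and also why it cannot reproduce relationships between variables.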
Comparison Summary
| Method | Best For | Speed |
|---|---|---|
| PMM | General purpose numeric | Fast |
| CART | Non-linear, interactions | Fast |
| RF | Complex patterns | Slow |
| MIDAS | Skewed, small samples | Fast |
| Sample | Quick/simple | Very fast |
See Imputation Methods for practical selection guidance.