Best Practices
Quick reference for using MICE correctly.
Basic Setup
from imputation import MICE, configure_logging
from plotting.diagnostics import plot_chain_stats
from plotting.utils import md_pattern_like
# Enable logging
configure_logging(level='INFO')
# Check missing patterns
pattern = md_pattern_like(df)
print(pattern)
# Set random seed for reproducibility
import numpy as np
np.random.seed(42)
Recommended Parameters
mice = MICE(df)
mice.impute(
n_imputations=20, # Use 20+, not just 5
maxit=20, # Ensure convergence
method='pmm' # Start with PMM
)
- Number of imputations:
Minimum: 5 (only for low missingness)
Recommended: 20
High missingness (>30%): 50-100
- Number of iterations:
Minimum: 10
Recommended: 20
Check convergence to confirm sufficiency
Method Selection
# Use different methods for different variables
method_dict = {
'numeric_normal': 'pmm',
'numeric_skewed': 'midas',
'categorical': 'cart',
'complex': 'rf'
}
mice.impute(method=method_dict)
- Quick guide:
PMM: General numeric data
CART: Categorical or non-linear
RF: Complex patterns
MIDAS: Skewed or small samples
Sample: Initial imputation or simple cases
Always Check Convergence
from plotting.diagnostics import plot_chain_stats
plot_chain_stats(mice.chain_mean, mice.chain_var,
save_path='convergence.png')
Look for flat, stable chains in later iterations.
Compare Distributions
from plotting.diagnostics import stripplot, densityplot
missing_pattern = df.notna().astype(int)
stripplot(mice.imputed_datasets, missing_pattern)
densityplot(mice.imputed_datasets, missing_pattern)
Ensure imputed values are within reasonable range.
Proper Pooling
Always use Rubin’s rules:
# Correct
mice.fit('outcome ~ predictor')
pooled = mice.pool(summ=True)
print(pooled)
# Check FMI
print(f"Max FMI: {pooled['FMI'].max():.3f}")
- Never:
Use only one imputed dataset
Average the imputed datasets
Use standard analysis on single imputation
Common Mistakes
Mistake 1: Too few imputations
# Don't do this
mice.impute(n_imputations=5) # Often not enough
# Do this
mice.impute(n_imputations=20) # Better
Mistake 2: Not checking convergence
# Always check
plot_chain_stats(mice.chain_mean, mice.chain_var)
Mistake 3: Using single imputation
# Don't do this
single_dataset = mice.imputed_datasets[0]
model = fit_model(single_dataset) # Wrong!
# Do this
mice.fit('y ~ x')
pooled = mice.pool(summ=True) # Correct
Mistake 4: Imputing after transformations
# Don't do this
df['log_income'] = np.log(df['income'])
mice = MICE(df) # Imputes both income and log_income separately!
# Do this
mice = MICE(df[['income', 'other_vars']])
mice.impute(n_imputations=20)
# Then create transformations after imputation
for dataset in mice.imputed_datasets:
dataset['log_income'] = np.log(dataset['income'])
Variable Selection
- Include in imputation:
All variables in your analysis model
Variables that predict missingness
Variables correlated with incomplete variables
# If analyzing: income ~ age + education
# Impute: income, age, education, plus auxiliary variables
mice = MICE(df[['income', 'age', 'education',
'occupation', 'zip_code']]) # auxiliaries
Predictor Matrix
For most cases, use default (all variables predict each other):
mice.impute(n_imputations=20) # Uses default predictor matrix
For custom control:
import numpy as np
predictor_matrix = pd.DataFrame(1, index=df.columns, columns=df.columns)
np.fill_diagonal(predictor_matrix.values, 0)
# Customize as needed
predictor_matrix.loc['var1', 'var2'] = 0
mice.impute(predictor_matrix=predictor_matrix)
See Predictor Matrices for details.
Performance Tips
For large datasets:
# Use faster methods
mice.impute(method='cart') # Faster than RF
# Use quickpred to reduce predictors
from imputation.utils import quickpred
pred_matrix = quickpred(df, mincor=0.3)
mice.impute(predictor_matrix=pred_matrix)
For many variables:
# Automatic predictor selection
pred_matrix = quickpred(df, mincor=0.2, minpuc=0.1)
mice.impute(predictor_matrix=pred_matrix)
Essential Checklist
Before finalizing:
☐ Checked missing patterns ☐ Used appropriate methods ☐ Ran sufficient iterations (≥20) ☐ Created enough imputations (≥20) ☐ Checked convergence ☐ Compared observed vs imputed distributions ☐ Used proper pooling (Rubin’s rules) ☐ Set random seed for reproducibility
See Also
MICE Overview - How MICE works
Imputation Methods - Method details
Convergence Diagnostics - Checking convergence
Pooling Analysis - Analyzing results