Predictor Matrices
The predictor matrix controls which variables are used to predict (impute) each incomplete variable. This guide explains how predictor matrices work and when to customize them.
What is a Predictor Matrix?
A predictor matrix is a matrix of 0s and 1s, typically square (one row and one column per variable), where:
Rows represent variables to be imputed (target variables)
Columns represent predictor variables
A value of 1 means “use this column to predict this row”
A value of 0 means “don’t use this column to predict this row”
The diagonal is always 0 (a variable doesn’t predict itself)
Example
import pandas as pd
import numpy as np
# Sample predictor matrix for variables: age, income, education
predictor_matrix = pd.DataFrame(
    [[0, 1, 1],   # To impute age, use income and education
     [1, 0, 1],   # To impute income, use age and education
     [1, 1, 0]],  # To impute education, use age and income
    index=['age', 'income', 'education'],
    columns=['age', 'income', 'education']
)
print(predictor_matrix)
Output:
           age  income  education
age          0       1          1
income       1       0          1
education    1       1          0
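To make the row/column convention concrete, the sketch below lists the predictors selected for each target by reading across its row. The helper name `predictors_for` is hypothetical and only serves this illustration; it is not part of any library.

```python
import pandas as pd

names = ["age", "income", "education"]
predictor_matrix = pd.DataFrame(
    [[0, 1, 1],
     [1, 0, 1],
     [1, 1, 0]],
    index=names,
    columns=names,
)


def predictors_for(matrix: pd.DataFrame, target: str) -> list:
    """Return the column names flagged with 1 in the target's row."""
    row = matrix.loc[target]
    return row.index[(row == 1).to_numpy()].tolist()


# Rows are targets, columns are predictors
for target in predictor_matrix.index:
    print(target, "<-", predictors_for(predictor_matrix, target))
```

Reading the matrix row by row this way is a quick sanity check after any manual edit.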
Default Behavior
If you don’t specify a predictor matrix, MICE uses all other variables as predictors for each incomplete variable:
mice = MICE(df)
mice.impute(n_imputations=5) # Uses default predictor matrix
This is equivalent to:
# Create full predictor matrix
predictor_matrix = pd.DataFrame(1, index=df.columns, columns=df.columns)
np.fill_diagonal(predictor_matrix.values, 0)
mice.impute(predictor_matrix=predictor_matrix)
When to Customize
You should customize the predictor matrix when:
Variables shouldn’t predict each other (logical constraints)
Too many predictors cause computational issues
Known causal relationships suggest specific prediction structures
Auxiliary variables should be used for prediction but not imputed
Multicollinearity between predictors causes problems
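For the multicollinearity case, one possible way to find candidates for removal is to list pairs of numeric variables whose absolute correlation is very high. This is a minimal sketch: the helper `highly_correlated_pairs`, the 0.9 threshold, and the toy data are all assumptions made for illustration.

```python
import pandas as pd


def highly_correlated_pairs(df: pd.DataFrame, threshold: float = 0.9) -> list:
    """List (var_a, var_b, r) for numeric pairs with |r| >= threshold."""
    # DataFrame.corr uses pairwise-complete observations (Pearson by default)
    corr = df.select_dtypes("number").corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if pd.notna(r) and abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs


# Toy data: weight_lb is (up to rounding) a rescaling of weight_kg
demo = pd.DataFrame({
    "weight_kg": [60.0, 72.5, 81.0, 55.0, 90.0],
    "weight_lb": [132.3, 159.8, 178.6, 121.3, 198.4],
    "age": [34, 51, 29, 62, 45],
})
pairs = highly_correlated_pairs(demo)
print(pairs)
```

For each flagged pair, you might keep only one of the two variables as a predictor for a given target.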
Creating Custom Predictor Matrices
Start with Default
# Start with all-ones matrix
predictor_matrix = pd.DataFrame(1, index=df.columns, columns=df.columns)
np.fill_diagonal(predictor_matrix.values, 0)
Then modify as needed.
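The `np.fill_diagonal(predictor_matrix.values, 0)` idiom relies on `.values` returning a writable view of the DataFrame's data. Depending on the pandas version and whether Copy-on-Write is enabled, that array may be read-only. A variant that avoids in-place writes is sketched below; the helper name `full_predictor_matrix` is hypothetical.

```python
import numpy as np
import pandas as pd


def full_predictor_matrix(columns) -> pd.DataFrame:
    """All-ones matrix with a zero diagonal, built without in-place writes."""
    columns = list(columns)
    # np.eye has ones on the diagonal, so 1 - eye has zeros there and ones elsewhere
    values = 1 - np.eye(len(columns), dtype=int)
    return pd.DataFrame(values, index=columns, columns=columns)


predictor_matrix = full_predictor_matrix(["age", "income", "education"])
print(predictor_matrix)
```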
Exclude Specific Predictors
Prevent one variable from predicting another:
# Don't use education to predict income
predictor_matrix.loc['income', 'education'] = 0
Use Only Specific Predictors
# Use only age and income to predict education
predictor_matrix.loc['education', :] = 0 # First, exclude all
predictor_matrix.loc['education', ['age', 'income']] = 1 # Then include specific
Block of Variables
# Don't use any demographic variables to predict health outcomes
demographic_vars = ['age', 'gender', 'ethnicity']
health_vars = ['blood_pressure', 'cholesterol', 'bmi']
for health_var in health_vars:
    predictor_matrix.loc[health_var, demographic_vars] = 0
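The same block can also be zeroed in a single assignment, because `.loc` accepts a list of row labels and a list of column labels together. The sketch below builds its own small matrix so that it runs standalone; the variable names are the illustrative ones used above.

```python
import numpy as np
import pandas as pd

demographic_vars = ["age", "gender", "ethnicity"]
health_vars = ["blood_pressure", "cholesterol", "bmi"]
all_vars = demographic_vars + health_vars

# Default matrix: ones everywhere except the diagonal
n = len(all_vars)
predictor_matrix = pd.DataFrame(
    1 - np.eye(n, dtype=int), index=all_vars, columns=all_vars
)

# Zero the whole (health rows x demographic columns) block at once
predictor_matrix.loc[health_vars, demographic_vars] = 0
print(predictor_matrix)
```

Only the selected block changes; health variables still predict each other and still predict the demographic variables.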
Common Patterns
Include Complete Variables
Complete variables (no missing values) can be used as predictors but don’t need to be imputed:
# Identify complete variables
complete_vars = df.columns[df.isnull().sum() == 0].tolist()
incomplete_vars = df.columns[df.isnull().sum() > 0].tolist()
# Create predictor matrix only for incomplete variables
predictor_matrix = pd.DataFrame(
    1,
    index=incomplete_vars,
    columns=df.columns  # Use all variables as predictors
)
# Set diagonal to 0
for var in incomplete_vars:
    predictor_matrix.loc[var, var] = 0
Auxiliary Variables
Variables that help prediction but aren’t part of your analysis model:
# Suppose 'auxiliary_score' helps predict income but won't be in final model
# Include it as predictor for income
predictor_matrix.loc['income', 'auxiliary_score'] = 1
# But don't impute it if missing
# (Remove from rows if it has missing values you don't care about)
Quickpred: Automatic Predictor Selection
For datasets with many variables, the quickpred algorithm automatically selects predictors based on correlations:
from imputation.utils import quickpred
# Automatically select predictors
predictor_matrix = quickpred(
    df,
    mincor=0.1,    # Minimum correlation
    minpuc=0.0,    # Minimum proportion of usable cases
    include=None,  # Variables to always include
    exclude=None   # Variables to always exclude
)
mice.impute(predictor_matrix=predictor_matrix)
Parameters:
mincor: Only use predictors with absolute correlation >= this threshold
minpuc: Require minimum proportion of usable complete cases
include: List of variables to always include as predictors
exclude: List of variables to never use as predictors
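To give a feel for how the parameters above could interact, here is a simplified sketch of correlation-based selection. It is not the implementation of `imputation.utils.quickpred`. The function `quickpred_sketch` is hypothetical, handles numeric columns only, and makes two assumptions borrowed from the quickpred function in R's mice package: a predictor qualifies if it correlates either with the target's values or with the target's observed/missing indicator, and "usable cases" means the share of rows with a missing target in which the predictor is observed.

```python
import numpy as np
import pandas as pd


def quickpred_sketch(df, mincor=0.1, minpuc=0.0, include=None, exclude=None):
    """Illustrative correlation-based predictor selection (numeric columns only)."""
    cols = list(df.columns)
    observed = df.notna()

    # |corr(target values, predictor values)| on pairwise-complete rows
    v = df.corr().abs().fillna(0.0).to_numpy()

    # |corr(observed-indicator of target, predictor values)|
    u = np.zeros((len(cols), len(cols)))
    for i, target in enumerate(cols):
        indicator = observed[target].astype(float)
        if indicator.nunique() < 2:  # fully observed target: constant indicator
            continue
        for j, pred in enumerate(cols):
            r = indicator.corr(df[pred])
            u[i, j] = 0.0 if pd.isna(r) else abs(r)

    selected = (np.maximum(v, u) >= mincor).astype(int)
    matrix = pd.DataFrame(selected, index=cols, columns=cols)

    # Usable cases: among rows where the target is missing,
    # the share of rows in which the predictor is observed
    for target in cols:
        missing_rows = ~observed[target]
        if missing_rows.any():
            puc = observed.loc[missing_rows].mean()
            low = puc.index[(puc < minpuc).to_numpy()]
            if len(low):
                matrix.loc[target, low] = 0

    if exclude:
        matrix.loc[:, list(exclude)] = 0
    if include:
        matrix.loc[:, list(include)] = 1
    for col in cols:
        matrix.loc[col, col] = 0  # a variable never predicts itself
    complete = [c for c in cols if observed[c].all()]
    if complete:
        matrix.loc[complete, :] = 0  # complete variables need no imputation model
    return matrix


# Toy data: y is a linear function of x, z is an unrelated alternating signal
x = np.arange(20.0)
demo = pd.DataFrame({
    "x": x,
    "y": 2 * x + 1,
    "z": np.where(np.arange(20) % 2 == 0, 1.0, -1.0),
})
demo.loc[[3, 8, 13], "y"] = np.nan
selected_matrix = quickpred_sketch(demo, mincor=0.5)
print(selected_matrix)
```

In this toy data only `y` is incomplete, so only its row can contain 1s, and with `mincor=0.5` only `x` is selected as its predictor.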
Monotone Patterns
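If your data has a monotone missing pattern, you can use a triangular structure in which each variable is predicted only by the variables before it in the missingness ordering: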
# Variables ordered by missingness: time1, time2, time3, time4
# time2 can only use time1; time3 can use time1-2; etc.
predictor_matrix = pd.DataFrame(0, index=df.columns, columns=df.columns)
predictor_matrix.loc['time2', 'time1'] = 1
predictor_matrix.loc['time3', ['time1', 'time2']] = 1
predictor_matrix.loc['time4', ['time1', 'time2', 'time3']] = 1
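Each variable is then imputed only from variables that are at least as completely observed, which matches the monotone ordering (here it also follows the time order). Note that time1 has no predictors in this matrix, which is appropriate only if time1 is complete.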
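Before relying on a monotone structure, it can help to confirm that the data really is monotone. The check below is a sketch with a hypothetical helper name, `is_monotone`. It sorts the columns by number of missing values and verifies that a row missing in one column is also missing in every later column.

```python
import numpy as np
import pandas as pd


def is_monotone(df: pd.DataFrame) -> bool:
    """True if the columns can be ordered so that the missing sets are nested."""
    missing = df.isna()
    # Order columns from fewest to most missing values
    order = missing.sum().sort_values(kind="stable").index
    ordered = missing[order].to_numpy()
    # Once a row is missing in one column, it must be missing in all later ones
    return bool((ordered[:, :-1] <= ordered[:, 1:]).all())


dropout = pd.DataFrame({
    "time1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "time2": [1.1, 2.1, 3.1, 4.1, np.nan],
    "time3": [1.2, 2.2, 3.2, np.nan, np.nan],
    "time4": [1.3, 2.3, np.nan, np.nan, np.nan],
})
scattered = pd.DataFrame({"a": [np.nan, 1.0], "b": [1.0, np.nan]})

print(is_monotone(dropout))
print(is_monotone(scattered))
```

The `dropout` frame mimics participants leaving a study over time; `scattered` has missing values that are not nested, so no column ordering makes it monotone.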
Practical Examples
Example 1: Exclude Future Predictors
In longitudinal data, future values shouldn’t predict past values:
# Time-ordered variables
time_vars = ['baseline', 'month3', 'month6', 'month12']
predictor_matrix = pd.DataFrame(1, index=time_vars, columns=time_vars)
np.fill_diagonal(predictor_matrix.values, 0)
# Exclude future predictors
for i, target in enumerate(time_vars):
    for predictor in time_vars[i+1:]:
        predictor_matrix.loc[target, predictor] = 0
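The loop above leaves a strictly lower-triangular matrix, so the same result can be built directly with `np.tril`. This is a sketch that assumes `time_vars` is already in chronological order; the name `past_only` is illustrative.

```python
import numpy as np
import pandas as pd

time_vars = ["baseline", "month3", "month6", "month12"]
n = len(time_vars)

# k=-1 keeps only entries strictly below the diagonal:
# row i may use columns 0..i-1, i.e. earlier time points only
past_only = pd.DataFrame(
    np.tril(np.ones((n, n), dtype=int), k=-1),
    index=time_vars,
    columns=time_vars,
)
print(past_only)
```

With this structure `baseline` has no predictors at all. If `baseline` has missing values, it would need predictors from elsewhere, for example complete covariates added as extra columns.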
Example 2: Separate Domains
Variables from different domains might not predict each other well:
physical_health = ['height', 'weight', 'blood_pressure']
mental_health = ['depression_score', 'anxiety_score']
demographics = ['age', 'gender', 'education']
# Demographics predict everything
# But physical and mental health don't predict each other
predictor_matrix = pd.DataFrame(1, index=df.columns, columns=df.columns)
np.fill_diagonal(predictor_matrix.values, 0)
# Physical doesn't predict mental
for phys in physical_health:
    for mental in mental_health:
        predictor_matrix.loc[mental, phys] = 0
# Mental doesn't predict physical
for mental in mental_health:
    for phys in physical_health:
        predictor_matrix.loc[phys, mental] = 0
Example 3: High-Dimensional Data
With many variables, keep only the most strongly correlated predictors, then add back any key variables that every model should include:
from imputation.utils import quickpred
# Select predictors with correlation >= 0.3
predictor_matrix = quickpred(df, mincor=0.3)
# Always include key variables
key_vars = ['age', 'treatment_group']
for var in df.columns:
    if var not in key_vars:
        for key_var in key_vars:
            predictor_matrix.loc[var, key_var] = 1
Checking Your Predictor Matrix
Before running MICE, verify your predictor matrix:
# Check dimensions
print(f"Shape: {predictor_matrix.shape}")
# Check diagonal is zero
assert (np.diag(predictor_matrix.values) == 0).all(), "Diagonal should be 0"
# Check each incomplete variable has at least one predictor
print("Number of predictors per variable:")
print(predictor_matrix.sum(axis=1))
# Visualize
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 8))
sns.heatmap(predictor_matrix, cmap='RdYlGn', center=0.5,
            cbar_kws={'label': 'Use as predictor'})
plt.title('Predictor Matrix')
plt.tight_layout()
plt.savefig('predictor_matrix.png')
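The individual checks can also be bundled into one function that reports every problem it finds. This is a sketch: `check_predictor_matrix` is a hypothetical helper, and the rules it applies (entries are 0 or 1, every row label also appears as a column, no self-prediction, every row has at least one predictor) are the ones described in this guide.

```python
import pandas as pd


def check_predictor_matrix(matrix: pd.DataFrame) -> list:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    if not matrix.isin([0, 1]).all().all():
        problems.append("entries other than 0 and 1")
    unknown = [row for row in matrix.index if row not in matrix.columns]
    if unknown:
        problems.append(f"rows missing from columns: {unknown}")
    # Compare by label so the check also works for non-square matrices
    for var in matrix.index:
        if var in matrix.columns and matrix.loc[var, var] != 0:
            problems.append(f"{var} predicts itself")
    empty = matrix.index[(matrix.sum(axis=1) == 0).to_numpy()].tolist()
    if empty:
        problems.append(f"no predictors: {empty}")
    return problems


names = ["a", "b", "c"]
bad = pd.DataFrame([[1, 1, 0], [0, 0, 0], [1, 1, 0]], index=names, columns=names)
problems = check_predictor_matrix(bad)
print(problems)
```

A row with no predictors is not necessarily an error: it is expected for complete variables that never need imputing.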
Common Issues
Too Few Predictors
Problem: Variables have insufficient predictors, leading to poor imputations.
- Solution:
Ensure each variable has at least 2-3 relevant predictors
Use the default full matrix if unsure
Too Many Predictors
Problem: Model fitting is slow or fails due to multicollinearity.
- Solution:
Use quickpred() to select based on correlations
Manually remove redundant predictors
Increase ridge parameter in PMM
Circular Dependencies
Problem: A predicts B and B predicts A, which looks circular.
- Solution:
This is fine in MICE. Each variable is imputed in turn, conditional on the current imputations of the others, so mutual prediction is expected.
Only exclude if there’s a logical reason (e.g., temporal ordering)
Tips and Best Practices
Start simple: Use the default full matrix first
Be conservative: Only exclude predictors if you have good reason
Include analysis model variables: All variables in your final analysis model should be used as predictors
Use auxiliary variables: Include variables that help prediction even if they are not in your analysis model
Respect time order: In longitudinal data, don’t let future predict past
Check convergence: Restrictive predictor matrices may slow convergence
Testing Your Choices
Compare imputation quality with different predictor matrices:
# Default: all predictors
mice_full = MICE(df)
mice_full.impute(n_imputations=5)
# Custom: restricted predictors (custom_matrix is a predictor matrix you built earlier)
mice_custom = MICE(df)
mice_custom.impute(n_imputations=5, predictor_matrix=custom_matrix)
# Compare convergence
from plotting.diagnostics import plot_chain_stats
plot_chain_stats(mice_full.chain_mean, mice_full.chain_var,
                 save_path='full_convergence.png')
plot_chain_stats(mice_custom.chain_mean, mice_custom.chain_var,
                 save_path='custom_convergence.png')
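When interpreting differences between the two runs, it helps to know exactly which predictors the custom matrix removed. The sketch below compares two matrices by label; `dropped_predictors` is a hypothetical helper, and the small matrices are built inline so that the example runs standalone.

```python
import numpy as np
import pandas as pd


def dropped_predictors(full: pd.DataFrame, custom: pd.DataFrame) -> dict:
    """Map each target to the predictors used in `full` but not in `custom`."""
    # Align on the labels of the custom matrix so the comparison is by name
    full = full.reindex(index=custom.index, columns=custom.columns)
    removed = (full == 1) & (custom == 0)
    return {
        target: removed.columns[removed.loc[target].to_numpy()].tolist()
        for target in removed.index
        if removed.loc[target].any()
    }


names = ["age", "income", "education"]
full_matrix = pd.DataFrame(1 - np.eye(3, dtype=int), index=names, columns=names)
custom_matrix = full_matrix.copy()
custom_matrix.loc["income", "education"] = 0

differences = dropped_predictors(full_matrix, custom_matrix)
print(differences)
```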
Next Steps
Learn about Convergence Diagnostics to check if your predictor matrix works well
See Best Practices for overall guidance
Try examples in Examples