Predictor Matrices

The predictor matrix controls which variables are used to predict (impute) each incomplete variable. This guide explains how predictor matrices work and when to customize them.

What is a Predictor Matrix?

A predictor matrix is a matrix of 0s and 1s (square when every variable appears as both a row and a column) where:

Rows represent variables to be imputed (target variables)
Columns represent predictor variables
A value of 1 means “use this column to predict this row”
A value of 0 means “don’t use this column to predict this row”
The diagonal is always 0 (a variable doesn’t predict itself)

Example

import pandas as pd
import numpy as np

# Sample predictor matrix for variables: age, income, education
predictor_matrix = pd.DataFrame(
    [[0, 1, 1],   # To impute age, use income and education
     [1, 0, 1],   # To impute income, use age and education
     [1, 1, 0]],  # To impute education, use age and income
    index=['age', 'income', 'education'],
    columns=['age', 'income', 'education']
)

print(predictor_matrix)

Output:

         age  income  education
age        0       1          1
income     1       0          1
education  1       1          0
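To make the row/column convention concrete, the following minimal sketch reads the same illustrative matrix back as "target <- predictors" pairs. The name `predictors_for` is introduced here only for illustration.

```python
import pandas as pd

# Same illustrative matrix as above (age, income, education)
names = ['age', 'income', 'education']
predictor_matrix = pd.DataFrame(
    [[0, 1, 1],
     [1, 0, 1],
     [1, 1, 0]],
    index=names,
    columns=names,
)

# For each target (row), collect the columns flagged with 1
predictors_for = {
    target: [
        col for col in predictor_matrix.columns
        if predictor_matrix.loc[target, col] == 1
    ]
    for target in predictor_matrix.index
}

for target, predictors in predictors_for.items():
    print(f"{target} <- {predictors}")
```

Reading across a row answers "what is used to impute this variable?"; reading down a column answers "which variables does this one help impute?".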

Default Behavior

If you don’t specify a predictor matrix, MICE uses all other variables as predictors for each incomplete variable:

mice = MICE(df)
mice.impute(n_imputations=5)  # Uses default predictor matrix

This is equivalent to:

# Create full predictor matrix: 1 everywhere except the diagonal
n_vars = len(df.columns)
predictor_matrix = pd.DataFrame(
    1 - np.eye(n_vars, dtype=int), index=df.columns, columns=df.columns
)

mice.impute(n_imputations=5, predictor_matrix=predictor_matrix)

When to Customize

You should customize the predictor matrix when:

Variables shouldn’t predict each other (logical constraints)
Too many predictors cause computational issues
Known causal relationships suggest specific prediction structures
Auxiliary variables should be used for prediction but not imputed
Multicollinearity between predictors causes problems
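As one way to spot the multicollinearity case, the sketch below flags pairs of variables whose absolute correlation exceeds a cut-off. The toy data, the variable names, and the 0.95 threshold are assumptions chosen for illustration, not recommended defaults.

```python
import numpy as np
import pandas as pd

# Toy data: 'weight_kg' and 'weight_lb' carry the same information
rng = np.random.default_rng(0)
weight_kg = rng.normal(70, 10, size=200)
toy = pd.DataFrame({
    'weight_kg': weight_kg,
    'weight_lb': weight_kg * 2.2046,
    'age': rng.normal(40, 12, size=200),
})

# Absolute pairwise correlations (pandas uses pairwise-complete rows)
abs_corr = toy.corr().abs()

# Report each pair once (upper triangle) if it exceeds the threshold
threshold = 0.95  # illustrative cut-off only
cols = list(abs_corr.columns)
collinear_pairs = [
    (cols[i], cols[j])
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if abs_corr.iloc[i, j] > threshold
]
print(collinear_pairs)
```

For a flagged pair, one option is to keep only one of the two variables as a predictor for the remaining targets.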

Creating Custom Predictor Matrices

Start with Default

# Start with an all-ones matrix whose diagonal is zero
predictor_matrix = pd.DataFrame(
    1 - np.eye(len(df.columns), dtype=int), index=df.columns, columns=df.columns
)

Then modify as needed.

Exclude Specific Predictors

Prevent one variable from predicting another:

# Don't use education to predict income
predictor_matrix.loc['income', 'education'] = 0

Use Only Specific Predictors

# Use only age and income to predict education
predictor_matrix.loc['education', :] = 0  # First, exclude all
predictor_matrix.loc['education', ['age', 'income']] = 1  # Then include specific

Block of Variables

# Don't use any demographic variables to predict health outcomes
demographic_vars = ['age', 'gender', 'ethnicity']
health_vars = ['blood_pressure', 'cholesterol', 'bmi']

for health_var in health_vars:
    predictor_matrix.loc[health_var, demographic_vars] = 0

Common Patterns

Include Complete Variables

Complete variables (no missing values) can be used as predictors but don’t need to be imputed:

# Identify complete variables
complete_vars = df.columns[df.isnull().sum() == 0].tolist()
incomplete_vars = df.columns[df.isnull().sum() > 0].tolist()

# Create predictor matrix only for incomplete variables
predictor_matrix = pd.DataFrame(
    1,
    index=incomplete_vars,
    columns=df.columns  # Use all variables as predictors
)

# Set diagonal to 0
for var in incomplete_vars:
    predictor_matrix.loc[var, var] = 0
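A small worked case may help here. The sketch below applies the same steps to a toy DataFrame (the data and the name `pm` are made up for illustration) and shows that the resulting matrix has one row per incomplete variable and one column per variable, so it is not square.

```python
import numpy as np
import pandas as pd

# Toy data: 'age' is complete, 'income' and 'education' have gaps
toy = pd.DataFrame({
    'age': [25, 32, 47, 51],
    'income': [30000, np.nan, 52000, 61000],
    'education': [12, 16, np.nan, 18],
})

n_missing = toy.isnull().sum()
incomplete_vars = n_missing[n_missing > 0].index.tolist()

# Rows: incomplete variables only; columns: every variable
pm = pd.DataFrame(1, index=incomplete_vars, columns=toy.columns)
for var in incomplete_vars:
    pm.loc[var, var] = 0  # a variable never predicts itself

print(pm)
print(pm.shape)
```

Here `age` appears only as a column: it helps impute `income` and `education` but is never a target.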

Auxiliary Variables

Variables that help prediction but aren’t part of your analysis model:

# Suppose 'auxiliary_score' helps predict income but won't be in final model
# Include it as predictor for income
predictor_matrix.loc['income', 'auxiliary_score'] = 1

# If auxiliary_score itself has missing values that you don't need imputed,
# remove its row from the matrix so it is not treated as a target
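Removing a row could look like the sketch below. It assumes the imputation routine accepts a matrix whose rows cover only the variables to impute; that is an assumption based on the "Include Complete Variables" pattern, not a documented guarantee. The names `pm` and `pm_no_aux_target` are illustrative.

```python
import pandas as pd

variables = ['age', 'income', 'auxiliary_score']
pm = pd.DataFrame(1, index=variables, columns=variables)
for var in variables:
    pm.loc[var, var] = 0

# Keep 'auxiliary_score' as a predictor column, but drop its row
# so that it is not treated as an imputation target.
# Assumption: rows may be restricted to the variables to impute.
pm_no_aux_target = pm.drop(index='auxiliary_score')
print(pm_no_aux_target)
```

Note that an auxiliary variable with many missing values of its own contributes little as a predictor unless it is imputed too, so this pattern suits auxiliaries that are complete or nearly complete.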

Quickpred: Automatic Predictor Selection

For datasets with many variables, the quickpred algorithm automatically selects predictors based on correlations:

from imputation.utils import quickpred

# Automatically select predictors
predictor_matrix = quickpred(
    df,
    mincor=0.1,    # Minimum correlation
    minpuc=0.0,    # Minimum proportion of usable cases
    include=None,  # Variables to always include
    exclude=None   # Variables to always exclude
)

mice.impute(predictor_matrix=predictor_matrix)

Parameters:

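mincor: Only use predictors whose absolute correlation with the target is >= this threshold
minpuc: Only use predictors whose proportion of usable cases is >= this threshold (the share of rows with the target missing in which the predictor is observed)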
include: List of variables to always include as predictors
exclude: List of variables to never use as predictors
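To illustrate how the two thresholds interact, here is a deliberately simplified selection routine. It is a sketch written for this guide, not the library's `quickpred` implementation, which may apply additional criteria; the function name `simple_quickpred` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def simple_quickpred(data, mincor=0.1, minpuc=0.0):
    """Hypothetical, simplified predictor selection (not the library's quickpred)."""
    cols = list(data.columns)
    observed = data.notna()
    # Absolute pairwise correlations on pairwise-complete rows
    abs_corr = data.corr().abs().fillna(0.0)
    pm = pd.DataFrame(0, index=cols, columns=cols)
    for target in cols:
        target_missing = ~observed[target]
        n_missing = int(target_missing.sum())
        if n_missing == 0:
            continue  # complete variables need no predictors
        for pred in cols:
            if pred == target:
                continue
            # Share of rows with missing target where the predictor is observed
            puc = (target_missing & observed[pred]).sum() / n_missing
            if abs_corr.loc[target, pred] >= mincor and puc >= minpuc:
                pm.loc[target, pred] = 1
    return pm

rng = np.random.default_rng(1)
x = rng.normal(size=300)
toy = pd.DataFrame({
    'x': x,
    'y': 2 * x + rng.normal(scale=0.5, size=300),  # strongly related to x
    'noise': rng.normal(size=300),                 # unrelated to both
})
toy.loc[:29, 'y'] = np.nan  # make y incomplete (first 30 rows)

selected = simple_quickpred(toy, mincor=0.3)
print(selected)
```

In this toy case only `x` is selected as a predictor for `y`, and the complete variables keep all-zero rows because they have nothing to impute.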

Monotone Patterns

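If your data has a monotone missing pattern, you can use a lower-triangular structure in which each variable is predicted only by the variables that precede it in the ordering: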

# Variables ordered by missingness: time1, time2, time3, time4
# time2 can only use time1; time3 can use time1-2; etc.

predictor_matrix = pd.DataFrame(0, index=df.columns, columns=df.columns)

predictor_matrix.loc['time2', 'time1'] = 1
predictor_matrix.loc['time3', ['time1', 'time2']] = 1
predictor_matrix.loc['time4', ['time1', 'time2', 'time3']] = 1

This respects the ordering of the variables, which in this example is also their temporal order.
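Whether a dataset actually has a monotone pattern can be checked before choosing this structure. The helper below is a minimal sketch (the name `is_monotone` and the toy data are hypothetical): it orders columns from fewest to most missing values and verifies that, within each row, a missing value is never followed by an observed one.

```python
import numpy as np
import pandas as pd

def is_monotone(data):
    """Return True if the missingness pattern is monotone.

    Columns are ordered from fewest to most missing values; the pattern is
    monotone if, in every row, a missing value is followed only by missing
    values in the remaining (more incomplete) columns.
    """
    order = data.isnull().sum().sort_values(kind='stable').index
    missing = data[order].isnull().to_numpy()
    # Moving left to right, a row may switch from observed to missing, never back
    return bool((missing[:, 1:] >= missing[:, :-1]).all())

# Pure dropout: once a subject is missing, it stays missing
dropout = pd.DataFrame({
    'time1': [1.0, 2.0, 3.0, 4.0],
    'time2': [1.1, 2.1, 3.1, np.nan],
    'time3': [1.2, 2.2, np.nan, np.nan],
    'time4': [1.3, np.nan, np.nan, np.nan],
})

# Intermittent missingness: a gap followed by observed values
intermittent = dropout.copy()
intermittent.loc[0, 'time2'] = np.nan

print(is_monotone(dropout))
print(is_monotone(intermittent))
```

If the check fails, the lower-triangular matrix can still be used, but the justification from the monotone pattern no longer applies.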

Practical Examples

Example 1: Exclude Future Predictors

In longitudinal data, future values shouldn’t predict past values:

# Time-ordered variables
time_vars = ['baseline', 'month3', 'month6', 'month12']

predictor_matrix = pd.DataFrame(
    1 - np.eye(len(time_vars), dtype=int), index=time_vars, columns=time_vars
)

# Exclude future predictors
for i, target in enumerate(time_vars):
    for predictor in time_vars[i+1:]:
        predictor_matrix.loc[target, predictor] = 0
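The same "past only" structure can be built in one step with a strictly lower-triangular array. This is an equivalent sketch using `np.tril`; the name `past_only` is introduced here for illustration.

```python
import numpy as np
import pandas as pd

time_vars = ['baseline', 'month3', 'month6', 'month12']
n = len(time_vars)

# k=-1 keeps entries strictly below the diagonal: row i uses columns 0..i-1
past_only = pd.DataFrame(
    np.tril(np.ones((n, n), dtype=int), k=-1),
    index=time_vars,
    columns=time_vars,
)
print(past_only)
```

Note that `baseline` ends up with no predictors at all. If it has missing values, it needs predictors from outside the time series (for example, demographic covariates).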

Example 2: Separate Domains

Variables from different domains might not predict each other well:

physical_health = ['height', 'weight', 'blood_pressure']
mental_health = ['depression_score', 'anxiety_score']
demographics = ['age', 'gender', 'education']

# Demographics predict everything
# But physical and mental health don't predict each other

predictor_matrix = pd.DataFrame(
    1 - np.eye(len(df.columns), dtype=int), index=df.columns, columns=df.columns
)

# Physical doesn't predict mental
for phys in physical_health:
    for mental in mental_health:
        predictor_matrix.loc[mental, phys] = 0

# Mental doesn't predict physical
for mental in mental_health:
    for phys in physical_health:
        predictor_matrix.loc[phys, mental] = 0
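The nested loops can also be written as two block assignments, since `.loc` accepts lists of row and column labels. The sketch below builds the matrix from the variable lists directly (rather than from `df`) so that it runs on its own; `pm` and `all_vars` are illustrative names.

```python
import pandas as pd

physical_health = ['height', 'weight', 'blood_pressure']
mental_health = ['depression_score', 'anxiety_score']
demographics = ['age', 'gender', 'education']
all_vars = physical_health + mental_health + demographics

pm = pd.DataFrame(1, index=all_vars, columns=all_vars)
for var in all_vars:
    pm.loc[var, var] = 0

# Zero out both cross-domain blocks at once
pm.loc[mental_health, physical_health] = 0  # physical doesn't predict mental
pm.loc[physical_health, mental_health] = 0  # mental doesn't predict physical

print(pm)
```

Demographics still predict everything, and variables within the same domain still predict each other.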

Example 3: High-Dimensional Data

With many variables, use only the most correlated:

from imputation.utils import quickpred

# Select predictors with correlation >= 0.3
predictor_matrix = quickpred(df, mincor=0.3)

# Always include key variables
key_vars = ['age', 'treatment_group']
for var in df.columns:
    if var not in key_vars:
        for key_var in key_vars:
            predictor_matrix.loc[var, key_var] = 1

Checking Your Predictor Matrix

Before running MICE, verify your predictor matrix:

# Check dimensions
print(f"Shape: {predictor_matrix.shape}")

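# Check no variable predicts itself (compare by label, so non-square matrices work)
shared = predictor_matrix.index.intersection(predictor_matrix.columns)
assert all(predictor_matrix.loc[v, v] == 0 for v in shared), "Diagonal should be 0"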

# Check each incomplete variable has at least one predictor
print("Number of predictors per variable:")
print(predictor_matrix.sum(axis=1))

# Visualize
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 8))
sns.heatmap(predictor_matrix, cmap='RdYlGn', center=0.5,
            cbar_kws={'label': 'Use as predictor'})
plt.title('Predictor Matrix')
plt.tight_layout()
plt.savefig('predictor_matrix.png')
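These checks can be bundled into a small helper that reports problems instead of stopping at the first one. The function below is a sketch; `check_predictor_matrix` is a hypothetical name, not part of the library.

```python
import pandas as pd

def check_predictor_matrix(pm):
    """Return a list of human-readable problems; empty if none found."""
    problems = []
    # Entries should be 0 or 1
    if not pm.isin([0, 1]).all().all():
        problems.append("entries other than 0/1 found")
    # Compare by label so non-square matrices are handled correctly
    for var in pm.index:
        if var in pm.columns and pm.loc[var, var] != 0:
            problems.append(f"{var} predicts itself")
    # Every target row should have at least one predictor
    for var, count in pm.sum(axis=1).items():
        if count == 0:
            problems.append(f"{var} has no predictors")
    return problems

names = ['age', 'income', 'education']
bad = pd.DataFrame(
    [[1, 1, 1],   # age predicts itself
     [1, 0, 1],
     [0, 0, 0]],  # education has no predictors
    index=names,
    columns=names,
)
print(check_predictor_matrix(bad))
```

An empty list means none of the three checks failed; it does not say whether the chosen predictors are statistically sensible.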

Common Issues

Too Few Predictors

Problem: Variables have insufficient predictors, leading to poor imputations.

Solution:

Ensure each variable has at least 2-3 relevant predictors
Use the default full matrix if unsure

Too Many Predictors

Problem: Model fitting is slow or fails due to multicollinearity.

Solution:

Use quickpred() to select based on correlations
Manually remove redundant predictors
Increase ridge parameter in PMM

Circular Dependencies

Problem: A predicts B and B predicts A, which looks circular.

Solution:

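This is fine in MICE. The algorithm imputes each variable in turn, conditioning on the current imputations of the others, and repeats the cycle over several iterations.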
Only exclude if there’s a logical reason (e.g., temporal ordering)
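The toy loop below sketches why mutual prediction is not a problem. It is a simplified, deterministic illustration with two variables and plain linear regression (via `np.polyfit`); an actual MICE implementation adds random draws and uses the configured imputation method, so this is not the library's algorithm.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
a = rng.normal(size=n)
b = 0.8 * a + rng.normal(scale=0.6, size=n)

# Knock out different rows in each variable
a_missing = np.zeros(n, dtype=bool)
b_missing = np.zeros(n, dtype=bool)
a_missing[:40] = True
b_missing[40:80] = True

a_imp = np.where(a_missing, np.nan, a)
b_imp = np.where(b_missing, np.nan, b)

# Step 0: start from mean imputation
a_imp[a_missing] = np.nanmean(a_imp)
b_imp[b_missing] = np.nanmean(b_imp)

for iteration in range(10):
    # Impute A from B: fit on rows where A is observed, using current B
    slope, intercept = np.polyfit(b_imp[~a_missing], a_imp[~a_missing], deg=1)
    a_imp[a_missing] = intercept + slope * b_imp[a_missing]
    # Impute B from A, using the A values just updated
    slope, intercept = np.polyfit(a_imp[~b_missing], b_imp[~b_missing], deg=1)
    b_imp[b_missing] = intercept + slope * a_imp[b_missing]

print(np.isnan(a_imp).sum(), np.isnan(b_imp).sum())
```

Each pass uses the latest values of the other variable, so A and B refine each other step by step rather than depending on each other simultaneously.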

Tips and Best Practices

Start simple: Use the default full matrix first
Be conservative: Only exclude predictors if you have good reason
Include analysis model variables: All variables in your final analysis model should be used as predictors
Use auxiliary variables: Include variables that help prediction even if they are not in your analysis model
Respect time order: In longitudinal data, don’t let future predict past
Check convergence: Restrictive predictor matrices may slow convergence

Testing Your Choices

Compare imputation quality with different predictor matrices:

# Default: all predictors
mice_full = MICE(df)
mice_full.impute(n_imputations=5)

# Custom: restricted predictors
# (custom_matrix is a predictor matrix built as in the sections above)
mice_custom = MICE(df)
mice_custom.impute(n_imputations=5, predictor_matrix=custom_matrix)

# Compare convergence
from plotting.diagnostics import plot_chain_stats
plot_chain_stats(mice_full.chain_mean, mice_full.chain_var,
                 save_path='full_convergence.png')
plot_chain_stats(mice_custom.chain_mean, mice_custom.chain_var,
                 save_path='custom_convergence.png')
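When comparing runs, it helps to record exactly which predictor links differ between the two matrices. The sketch below lists those links for a small illustrative pair; `full`, `custom`, and `changed` are names made up for this example.

```python
import pandas as pd

names = ['age', 'income', 'education']
full = pd.DataFrame(
    [[0, 1, 1],
     [1, 0, 1],
     [1, 1, 0]],
    index=names,
    columns=names,
)

custom = full.copy()
custom.loc['income', 'education'] = 0  # the one restriction being tested

# List (target, predictor) links that differ between the two matrices
changed = [
    (target, predictor)
    for target in names
    for predictor in names
    if full.loc[target, predictor] != custom.loc[target, predictor]
]
print(changed)
```

Changing one link or one block at a time makes it easier to attribute any difference in convergence to a specific restriction.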

Next Steps

Learn about Convergence Diagnostics to check if your predictor matrix works well
See Best Practices for overall guidance
Try examples in Examples