Predictor Matrices
==================

The predictor matrix controls which variables are used to predict (impute) each 
incomplete variable. This guide explains how predictor matrices work and when to 
customize them.

What is a Predictor Matrix?
----------------------------

A predictor matrix is a square matrix where:

- **Rows** represent variables to be imputed (target variables)
- **Columns** represent predictor variables
- A value of **1** means "use this column to predict this row"
- A value of **0** means "don't use this column to predict this row"
- The **diagonal** is always 0 (a variable doesn't predict itself)

Example
~~~~~~~

.. code-block:: python

   import pandas as pd
   import numpy as np
   
   # Sample predictor matrix for variables: age, income, education
   predictor_matrix = pd.DataFrame(
       [[0, 1, 1],   # To impute age, use income and education
        [1, 0, 1],   # To impute income, use age and education
        [1, 1, 0]],  # To impute education, use age and income
       index=['age', 'income', 'education'],
       columns=['age', 'income', 'education']
   )
   
   print(predictor_matrix)

Output:

.. code-block:: text

            age  income  education
   age        0       1          1
   income     1       0          1
   education  1       1          0

Default Behavior
----------------

If you don't specify a predictor matrix, MICE uses **all other variables** as 
predictors for each incomplete variable:

.. code-block:: python

   mice = MICE(df)
   mice.impute(n_imputations=5)  # Uses default predictor matrix

This is equivalent to:

.. code-block:: python

   # Create full predictor matrix
   predictor_matrix = pd.DataFrame(1, index=df.columns, columns=df.columns)
   np.fill_diagonal(predictor_matrix.values, 0)
   
   mice.impute(predictor_matrix=predictor_matrix)

When to Customize
-----------------

You should customize the predictor matrix when:

1. **Variables shouldn't predict each other** (logical constraints)
2. **Too many predictors** cause computational issues
3. **Known causal relationships** suggest specific prediction structures
4. **Auxiliary variables** should be used for prediction but not imputed
5. **Multicollinearity** between predictors causes problems

Creating Custom Predictor Matrices
-----------------------------------

Start with Default
~~~~~~~~~~~~~~~~~~

.. code-block:: python

   # Start with all-ones matrix
   predictor_matrix = pd.DataFrame(1, index=df.columns, columns=df.columns)
   np.fill_diagonal(predictor_matrix.values, 0)

Then modify as needed.

Exclude Specific Predictors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Prevent one variable from predicting another:

.. code-block:: python

   # Don't use education to predict income
   predictor_matrix.loc['income', 'education'] = 0

Use Only Specific Predictors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   # Use only age and income to predict education
   predictor_matrix.loc['education', :] = 0  # First, exclude all
   predictor_matrix.loc['education', ['age', 'income']] = 1  # Then include specific

Block of Variables
~~~~~~~~~~~~~~~~~~

.. code-block:: python

   # Don't use any demographic variables to predict health outcomes
   demographic_vars = ['age', 'gender', 'ethnicity']
   health_vars = ['blood_pressure', 'cholesterol', 'bmi']
   
   for health_var in health_vars:
       predictor_matrix.loc[health_var, demographic_vars] = 0

Common Patterns
---------------

Include Complete Variables
~~~~~~~~~~~~~~~~~~~~~~~~~~

Complete variables (no missing values) can be used as predictors but don't need to 
be imputed:

.. code-block:: python

   # Identify complete variables
   complete_vars = df.columns[df.isnull().sum() == 0].tolist()
   incomplete_vars = df.columns[df.isnull().sum() > 0].tolist()
   
   # Create predictor matrix only for incomplete variables
   predictor_matrix = pd.DataFrame(
       1, 
       index=incomplete_vars,
       columns=df.columns  # Use all variables as predictors
   )
   
   # Set diagonal to 0
   for var in incomplete_vars:
       predictor_matrix.loc[var, var] = 0

Auxiliary Variables
~~~~~~~~~~~~~~~~~~~

Variables that help prediction but aren't part of your analysis model:

.. code-block:: python

   # Suppose 'auxiliary_score' helps predict income but won't be in final model
   # Include it as predictor for income
   predictor_matrix.loc['income', 'auxiliary_score'] = 1
   
   # But don't impute it if missing
   # (Remove from rows if it has missing values you don't care about)

Quickpred: Automatic Predictor Selection
-----------------------------------------

For datasets with many variables, the ``quickpred`` algorithm automatically selects 
predictors based on correlations:

.. code-block:: python

   from imputation.utils import quickpred
   
   # Automatically select predictors
   predictor_matrix = quickpred(
       df,
       mincor=0.1,    # Minimum correlation
       minpuc=0.0,    # Minimum proportion of usable cases
       include=None,  # Variables to always include
       exclude=None   # Variables to always exclude
   )
   
   mice.impute(predictor_matrix=predictor_matrix)

Parameters:

- **mincor**: Only use predictors with absolute correlation >= this threshold
- **minpuc**: Require minimum proportion of usable complete cases
- **include**: List of variables to always include as predictors
- **exclude**: List of variables to never use as predictors

Monotone Patterns
-----------------

If your data has a monotone missing pattern, you can use a block structure:

.. code-block:: python

   # Variables ordered by missingness: time1, time2, time3, time4
   # time2 can only use time1; time3 can use time1-2; etc.
   
   predictor_matrix = pd.DataFrame(0, index=df.columns, columns=df.columns)
   
   predictor_matrix.loc['time2', 'time1'] = 1
   predictor_matrix.loc['time3', ['time1', 'time2']] = 1
   predictor_matrix.loc['time4', ['time1', 'time2', 'time3']] = 1

This respects the temporal structure of the data.

Practical Examples
------------------

Example 1: Exclude Future Predictors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In longitudinal data, future values shouldn't predict past values:

.. code-block:: python

   # Time-ordered variables
   time_vars = ['baseline', 'month3', 'month6', 'month12']
   
   predictor_matrix = pd.DataFrame(1, index=time_vars, columns=time_vars)
   np.fill_diagonal(predictor_matrix.values, 0)
   
   # Exclude future predictors
   for i, target in enumerate(time_vars):
       for predictor in time_vars[i+1:]:
           predictor_matrix.loc[target, predictor] = 0

Example 2: Separate Domains
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Variables from different domains might not predict each other well:

.. code-block:: python

   physical_health = ['height', 'weight', 'blood_pressure']
   mental_health = ['depression_score', 'anxiety_score']
   demographics = ['age', 'gender', 'education']
   
   # Demographics predict everything
   # But physical and mental health don't predict each other
   
   predictor_matrix = pd.DataFrame(1, index=df.columns, columns=df.columns)
   np.fill_diagonal(predictor_matrix.values, 0)
   
   # Physical doesn't predict mental
   for phys in physical_health:
       for mental in mental_health:
           predictor_matrix.loc[mental, phys] = 0
   
   # Mental doesn't predict physical
   for mental in mental_health:
       for phys in physical_health:
           predictor_matrix.loc[phys, mental] = 0

Example 3: High-Dimensional Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

With many variables, use only the most correlated:

.. code-block:: python

   from imputation.utils import quickpred
   
   # Select predictors with correlation >= 0.3
   predictor_matrix = quickpred(df, mincor=0.3)
   
   # Always include key variables
   key_vars = ['age', 'treatment_group']
   for var in df.columns:
       if var not in key_vars:
           for key_var in key_vars:
               predictor_matrix.loc[var, key_var] = 1

Checking Your Predictor Matrix
-------------------------------

Before running MICE, verify your predictor matrix:

.. code-block:: python

   # Check dimensions
   print(f"Shape: {predictor_matrix.shape}")
   
   # Check diagonal is zero
   assert (np.diag(predictor_matrix.values) == 0).all(), "Diagonal should be 0"
   
   # Check each incomplete variable has at least one predictor
   print("Number of predictors per variable:")
   print(predictor_matrix.sum(axis=1))
   
   # Visualize
   import seaborn as sns
   import matplotlib.pyplot as plt
   
   plt.figure(figsize=(10, 8))
   sns.heatmap(predictor_matrix, cmap='RdYlGn', center=0.5, 
               cbar_kws={'label': 'Use as predictor'})
   plt.title('Predictor Matrix')
   plt.tight_layout()
   plt.savefig('predictor_matrix.png')

Common Issues
-------------

Too Few Predictors
~~~~~~~~~~~~~~~~~~

**Problem**: Variables have insufficient predictors, leading to poor imputations.

**Solution**: 
   - Ensure each variable has at least 2-3 relevant predictors
   - Use the default full matrix if unsure

Too Many Predictors
~~~~~~~~~~~~~~~~~~~~

**Problem**: Model fitting is slow or fails due to multicollinearity.

**Solution**:
   - Use ``quickpred()`` to select based on correlations
   - Manually remove redundant predictors
   - Increase ridge parameter in PMM

Circular Dependencies
~~~~~~~~~~~~~~~~~~~~~

**Problem**: Worried about A predicting B and B predicting A.

**Solution**: 
   - This is actually **fine** in MICE! The algorithm handles it iteratively.
   - Only exclude if there's a logical reason (e.g., temporal ordering)

Tips and Best Practices
------------------------

1. **Start simple**: Use the default full matrix first
2. **Be conservative**: Only exclude predictors if you have good reason
3. **Include analysis model variables**: All variables in your final analysis model 
   should be used as predictors
4. **Use auxiliary variables**: Variables that help prediction even if not in your 
   analysis model
5. **Respect time order**: In longitudinal data, don't let future predict past
6. **Check convergence**: Restrictive predictor matrices may slow convergence

Testing Your Choices
---------------------

Compare imputation quality with different predictor matrices:

.. code-block:: python

   # Default: all predictors
   mice_full = MICE(df)
   mice_full.impute(n_imputations=5)
   
   # Custom: restricted predictors
   mice_custom = MICE(df)
   mice_custom.impute(n_imputations=5, predictor_matrix=custom_matrix)
   
   # Compare convergence
   from plotting.diagnostics import plot_chain_stats
   plot_chain_stats(mice_full.chain_mean, mice_full.chain_var, 
                    save_path='full_convergence.png')
   plot_chain_stats(mice_custom.chain_mean, mice_custom.chain_var,
                    save_path='custom_convergence.png')

Next Steps
----------

- Learn about :doc:`convergence_diagnostics` to check if your predictor matrix works well
- See :doc:`best_practices` for overall guidance
- Try examples in :doc:`../examples/index`