Imputation Methods
==================

mice-py provides five imputation methods. This guide helps you choose and
configure the right method for your data.

Overview of Methods
-------------------

.. list-table::
   :header-rows: 1
   :widths: 15 20 20 20 25

   * - Method
     - Best For
     - Data Type
     - Preserves
     - Complexity
   * - PMM
     - General purpose
     - Numeric
     - Distribution
     - Low
   * - CART
     - Non-linear
     - Both
     - Interactions
     - Medium
   * - Random Forest
     - Complex patterns
     - Both
     - Interactions
     - High
   * - MIDAS
     - Small samples
     - Numeric
     - Local patterns
     - Low
   * - Sample
     - Quick & simple
     - Both
     - Observed values
     - Very Low

PMM: Predictive Mean Matching
------------------------------

**When to use**: Default choice for numeric data, especially when preserving the 
original distribution is important.

How It Works
~~~~~~~~~~~~

1. Fit a Bayesian linear regression on observed values
2. Generate predictions for both observed and missing values
3. For each missing value, find the *k* closest observed values (donors) based on 
   predicted values
4. Randomly select one donor and use its observed value as the imputed value

**Key advantage**: Imputed values are always taken from the observed data, so
impossible values (for example, a negative age) cannot be generated.
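
The four steps above can be sketched in plain Python with a single predictor.
This is an illustration only, not mice-py's implementation: it swaps the
Bayesian regression for ordinary least squares, and ``fit_line`` and
``pmm_impute`` are hypothetical helper names.

.. code-block:: python

   import random

   def fit_line(xs, ys):
       """Ordinary least squares for y = intercept + slope * x."""
       n = len(xs)
       mean_x = sum(xs) / n
       mean_y = sum(ys) / n
       sxx = sum((x - mean_x) ** 2 for x in xs)
       sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
       slope = sxy / sxx
       return mean_y - slope * mean_x, slope

   def pmm_impute(x, y, donors=5, rng=None):
       """Return a copy of y with None entries filled by predictive mean matching."""
       rng = rng or random.Random()
       obs = [i for i, v in enumerate(y) if v is not None]
       # Step 1: fit the model on the observed rows only.
       intercept, slope = fit_line([x[i] for i in obs], [y[i] for i in obs])
       # Step 2: predict for every row, observed and missing.
       predicted = [intercept + slope * xi for xi in x]
       filled = list(y)
       for i, v in enumerate(y):
           if v is not None:
               continue
           # Step 3: the donors are the observed rows with the closest predictions.
           nearest = sorted(obs, key=lambda j: abs(predicted[j] - predicted[i]))[:donors]
           # Step 4: copy the observed value of one randomly chosen donor.
           filled[i] = y[rng.choice(nearest)]
       return filled

   x = [1, 2, 3, 4, 5, 6, 7, 8]
   y = [2.1, 3.9, None, 8.2, 9.8, None, 14.1, 16.0]
   completed = pmm_impute(x, y, donors=3, rng=random.Random(0))

Because every imputed value is copied from a donor, ``completed`` contains only
values that already occur in ``y``.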

Usage
~~~~~

.. code-block:: python

   # Basic usage
   mice.impute(method='pmm')
   
   # With custom parameters
   mice.impute(
       method='pmm',
       pmm_donors=5,         # Number of donor candidates (default: 5)
       pmm_matchtype=1,      # Matching type (0, 1, or 2)
       pmm_ridge=1e-5        # Ridge regularization parameter
   )

Parameters
~~~~~~~~~~

**donors** (default: 5)
   Number of closest donors to consider. Larger values increase variability; smaller 
   values make imputations more deterministic.

**matchtype** (default: 1)
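   - 0: Observed and missing cases are both predicted from the fitted coefficients
     (no parameter draw)
   - 1: Observed cases use the fitted coefficients and missing cases use drawn
     coefficients (adds parameter uncertainty)
   - 2: Observed and missing cases both use drawn coefficients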

**ridge** (default: 1e-5)
   Regularization to stabilize estimation with collinear predictors.
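
To see why a small ridge term helps, consider two perfectly collinear
predictors. The sketch below assumes the penalty inflates the diagonal of the
cross-product matrix by a factor of ``1 + ridge``; whether mice-py applies the
penalty in exactly this form is an assumption, and both function names are
hypothetical.

.. code-block:: python

   def gram_2x2(x1, x2):
       """Entries (a, b, c) of the cross-product matrix [[a, b], [b, c]]."""
       a = sum(v * v for v in x1)
       b = sum(u * v for u, v in zip(x1, x2))
       c = sum(v * v for v in x2)
       return a, b, c

   def determinant(a, b, c, ridge=0.0):
       """Determinant after scaling the diagonal entries by (1 + ridge)."""
       return a * (1 + ridge) * c * (1 + ridge) - b * b

   x1 = [1, 2, 3, 4]
   x2 = [2 * v for v in x1]  # perfectly collinear with x1
   a, b, c = gram_2x2(x1, x2)

   det_plain = determinant(a, b, c)              # singular: cannot be inverted
   det_ridge = determinant(a, b, c, ridge=1e-5)  # strictly positive: invertible

Without the penalty the determinant is zero and the regression has no unique
solution; with it, the matrix becomes invertible while the fit is barely changed.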

When PMM Works Best
~~~~~~~~~~~~~~~~~~~

- ✓ Numeric data with moderate to large sample size
- ✓ Preserving distribution properties is important
- ✓ Data has outliers that should be preserved
- ✓ MAR mechanism with linear relationships

Limitations
~~~~~~~~~~~

- ✗ Only generates values already in the data
- ✗ Assumes approximately linear relationships
- ✗ May struggle with highly skewed data
- ✗ Not suitable for categorical variables

CART: Classification and Regression Trees
------------------------------------------

**When to use**: Data with non-linear relationships or interactions between variables.

How It Works
~~~~~~~~~~~~

1. Build a decision tree using complete observations
2. For classification (categorical): predict class probabilities
3. For regression (numeric): predict values
4. Add appropriate random variation to predictions

**Key advantage**: Automatically captures interactions and non-linear patterns without 
needing to specify them.
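
A minimal sketch of tree-based imputation for a numeric variable, using a
single-split regression tree on one predictor. It is not mice-py's
implementation: the random variation is added here by drawing an observed value
from the leaf the missing case falls into, which is one possible choice, and
``best_split`` and ``cart_impute`` are hypothetical names.

.. code-block:: python

   import random

   def sse(values):
       """Sum of squared deviations from the mean."""
       mean = sum(values) / len(values)
       return sum((v - mean) ** 2 for v in values)

   def best_split(xs, ys):
       """Threshold on x that minimises the total squared error of two leaves."""
       unique = sorted(set(xs))
       candidates = [(lo + hi) / 2 for lo, hi in zip(unique, unique[1:])]
       return min(
           candidates,
           key=lambda t: sse([y for x, y in zip(xs, ys) if x <= t])
           + sse([y for x, y in zip(xs, ys) if x > t]),
       )

   def cart_impute(x, y, rng=None):
       """Fill None entries of y by sampling from the matching leaf."""
       rng = rng or random.Random()
       obs = [i for i, v in enumerate(y) if v is not None]
       threshold = best_split([x[i] for i in obs], [y[i] for i in obs])
       left = [y[i] for i in obs if x[i] <= threshold]
       right = [y[i] for i in obs if x[i] > threshold]
       return [
           v if v is not None else rng.choice(left if x[i] <= threshold else right)
           for i, v in enumerate(y)
       ]

   # A step-shaped relationship that a straight line would fit poorly.
   x = [1, 2, 3, 4, 5, 6, 7, 8]
   y = [1.0, 1.2, None, 0.9, 5.1, None, 4.8, 5.0]
   completed = cart_impute(x, y, rng=random.Random(1))

The tree finds the jump between ``x = 4`` and ``x = 5`` on its own, so the
missing value at ``x = 3`` is filled from the low group and the one at ``x = 6``
from the high group.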

Usage
~~~~~

.. code-block:: python

   # Basic usage
   mice.impute(method='cart')
   
   # With custom parameters
   mice.impute(
       method='cart',
       cart_max_depth=None,        # Maximum tree depth
       cart_min_samples_split=2,   # Min samples to split
       cart_min_samples_leaf=1     # Min samples in leaf
   )

Parameters
~~~~~~~~~~

**max_depth** (default: None)
   Maximum depth of the tree. ``None`` allows unlimited depth. Use smaller values
   (e.g., 10-20) to reduce overfitting.

**min_samples_split** (default: 2)
   Minimum samples required to split an internal node.

**min_samples_leaf** (default: 1)
   Minimum samples required at a leaf node.

When CART Works Best
~~~~~~~~~~~~~~~~~~~~

- ✓ Non-linear relationships
- ✓ Interaction effects between variables
- ✓ Mixed data types (numeric and categorical)
- ✓ Robust to outliers
- ✓ Categorical variables with many levels

Limitations
~~~~~~~~~~~

- ✗ Can overfit with small samples
- ✗ May not preserve distribution as well as PMM
- ✗ Less stable than other methods (high variance)

Random Forest
-------------

**When to use**: Complex data with many interactions and non-linear relationships.

How It Works
~~~~~~~~~~~~

1. Build an ensemble of decision trees using bootstrap samples
2. Each tree uses a random subset of predictors
3. Average predictions across all trees
4. Add random variation appropriate for the data type

**Key advantage**: Averaging over many trees reduces the variance of a single CART
tree, which usually gives more stable and accurate imputations, especially with
complex patterns.
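
The ensemble idea can be sketched with single-split trees. This is a simplified
illustration, not mice-py's implementation: each tree sees a bootstrap sample
and one randomly chosen predictor, and the random variation is added by drawing
from the residuals of the observed rows, which is an assumption made for this
example. All function names are hypothetical.

.. code-block:: python

   import random
   import statistics

   def fit_stump(rows, targets, feature):
       """Fit a one-split regression tree; return (feature, threshold, left_mean, right_mean)."""
       values = sorted({r[feature] for r in rows})
       if len(values) < 2:  # no variation in this sample: predict the mean everywhere
           mean = statistics.fmean(targets)
           return feature, values[0], mean, mean
       best = None
       for lo, hi in zip(values, values[1:]):
           t = (lo + hi) / 2
           left = [y for r, y in zip(rows, targets) if r[feature] <= t]
           right = [y for r, y in zip(rows, targets) if r[feature] > t]
           lm, rm = statistics.fmean(left), statistics.fmean(right)
           err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
           if best is None or err < best[0]:
               best = (err, t, lm, rm)
       return feature, best[1], best[2], best[3]

   def predict(stump, row):
       feature, t, left_mean, right_mean = stump
       return left_mean if row[feature] <= t else right_mean

   def rf_impute(rows, y, n_estimators=25, rng=None):
       """Fill None entries of y with forest predictions plus a sampled residual."""
       rng = rng or random.Random()
       obs = [i for i, v in enumerate(y) if v is not None]
       forest = []
       for _ in range(n_estimators):
           sample = [rng.choice(obs) for _ in obs]  # bootstrap sample of observed rows
           feature = rng.randrange(len(rows[0]))    # random predictor for this tree
           forest.append(
               fit_stump([rows[i] for i in sample], [y[i] for i in sample], feature)
           )

       def forest_mean(row):
           # Average the predictions of all trees.
           return statistics.fmean(predict(s, row) for s in forest)

       residuals = [y[i] - forest_mean(rows[i]) for i in obs]
       return [
           v if v is not None else forest_mean(rows[i]) + rng.choice(residuals)
           for i, v in enumerate(y)
       ]

   rows = [(1, 0), (2, 1), (3, 0), (4, 1), (5, 0), (6, 1), (7, 0), (8, 1)]
   y = [1.0, 1.2, None, 0.9, 5.1, None, 4.8, 5.0]
   completed = rf_impute(rows, y, rng=random.Random(2))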

Usage
~~~~~

.. code-block:: python

   # Basic usage
   mice.impute(method='rf')
   
   # With custom parameters
   mice.impute(
       method='rf',
       rf_n_estimators=100,     # Number of trees
       rf_max_depth=None,       # Maximum depth
       rf_min_samples_split=2,  # Min samples to split
       rf_max_features='sqrt'   # Features per split
   )

Parameters
~~~~~~~~~~

**n_estimators** (default: 100)
   Number of trees in the forest. More trees give more stable imputations at the
   cost of longer run time.

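**max_depth** (default: None)
   Maximum depth of each tree. ``None`` allows unlimited depth.

**min_samples_split** (default: 2)
   Minimum samples required to split an internal node.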

**max_features** (default: 'sqrt')
   Number of features to consider for each split. Options: 'sqrt', 'log2', or an integer.
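
As a rough guide to what the string options mean, the sketch below resolves
``max_features`` to a feature count using the convention scikit-learn follows
for ``'sqrt'`` and ``'log2'``. That mice-py resolves the option the same way is
an assumption, clamping an integer to the number of available features is a
choice made for this example, and ``resolve_max_features`` is a hypothetical
helper.

.. code-block:: python

   import math

   def resolve_max_features(max_features, n_features):
       """Translate a max_features setting into a number of candidate features."""
       if max_features == 'sqrt':
           return max(1, int(math.sqrt(n_features)))
       if max_features == 'log2':
           return max(1, int(math.log2(n_features)))
       if isinstance(max_features, int):
           return max(1, min(max_features, n_features))
       raise ValueError(f"unsupported max_features: {max_features!r}")

   n_predictors = 10
   per_split_sqrt = resolve_max_features('sqrt', n_predictors)
   per_split_log2 = resolve_max_features('log2', n_predictors)
   per_split_int = resolve_max_features(4, n_predictors)

Smaller counts make the trees less correlated with each other, at the price of
weaker individual trees.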

When Random Forest Works Best
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- ✓ Complex, non-linear relationships
- ✓ Many interaction effects
- ✓ Large datasets
- ✓ High-dimensional data
- ✓ Mixed data types
- ✓ When accuracy is more important than interpretability

Limitations
~~~~~~~~~~~

- ✗ Computationally expensive and slower than the other methods
- ✗ Less interpretable than simpler methods
- ✗ May not preserve marginal distributions as well as PMM

MIDAS: Multiple Imputation with Distant Average Substitution
-------------------------------------------------------------

**When to use**: Numeric data, especially with small samples or skewed distributions.

How It Works
~~~~~~~~~~~~

1. For each missing value, identify nearby observed values (donors) using a
   distance metric
2. Compute a distance-weighted average of the donors, with farther donors
   receiving less weight
3. Add random variation

**Key advantage**: Often performs well with small samples and skewed distributions 
where PMM struggles.
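
The steps above can be illustrated with inverse-distance weights on a single
predictor. This is only a sketch of the idea as described here, not mice-py's
MIDAS implementation: the weighting formula, the Gaussian noise, the
``noise_sd`` argument, and the name ``midas_like_impute`` are all assumptions
made for the example.

.. code-block:: python

   import random

   def midas_like_impute(x, y, donors=3, noise_sd=0.1, rng=None):
       """Fill None entries of y with a distance-weighted donor average plus noise."""
       rng = rng or random.Random()
       obs = [i for i, v in enumerate(y) if v is not None]
       filled = list(y)
       for i, v in enumerate(y):
           if v is not None:
               continue
           # Step 1: the donors are the observed rows closest in x.
           nearest = sorted(obs, key=lambda j: abs(x[j] - x[i]))[:donors]
           # Step 2: farther donors get smaller weights.
           weights = [1.0 / (abs(x[j] - x[i]) + 1e-9) for j in nearest]
           average = sum(w * y[j] for w, j in zip(weights, nearest)) / sum(weights)
           # Step 3: add random variation.
           filled[i] = average + rng.gauss(0.0, noise_sd)
       return filled

   x = [1, 2, 3, 4, 5]
   y = [1.0, 2.0, None, 4.0, 5.0]
   completed = midas_like_impute(x, y, donors=2, rng=random.Random(0))
   no_noise = midas_like_impute(x, y, donors=2, noise_sd=0.0)

With ``noise_sd=0.0`` the imputed value is just the weighted average of the two
nearest donors, ``2.0`` and ``4.0``, which are equally far away here.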

Usage
~~~~~

.. code-block:: python

   # Basic usage
   mice.impute(method='midas')
   
   # With custom parameters
   mice.impute(
       method='midas',
       midas_donors=5,      # Number of donors
       midas_ridge=1e-5     # Ridge parameter
   )

When MIDAS Works Best
~~~~~~~~~~~~~~~~~~~~~

- ✓ Small sample sizes
- ✓ Skewed distributions
- ✓ Numeric data
- ✓ When PMM struggles with distribution

Limitations
~~~~~~~~~~~

- ✗ Only for numeric variables
- ✗ Less commonly used (less validated than PMM/CART/RF)
- ✗ May require parameter tuning

Sample: Random Sampling
-----------------------

**When to use**: Quick imputations, initial values, or when other methods aren't suitable.

How It Works
~~~~~~~~~~~~

Simply draws random values from the observed values of each variable.

**Key advantage**: Very fast and simple. Because imputed values are drawn from the
observed values, the observed marginal distribution is preserved in expectation.
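
In plain Python the idea amounts to a few lines. ``sample_impute`` is a
hypothetical helper written for this illustration, not part of mice-py.

.. code-block:: python

   import random

   def sample_impute(values, rng=None):
       """Replace each None with a random draw from the observed values."""
       rng = rng or random.Random()
       observed = [v for v in values if v is not None]
       return [v if v is not None else rng.choice(observed) for v in values]

   education = ['BSc', None, 'MSc', 'BSc', None, 'PhD']
   completed = sample_impute(education, rng=random.Random(0))

Values that occur more often among the observed entries are proportionally more
likely to be drawn, and no other variable is consulted.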

Usage
~~~~~

.. code-block:: python

   mice.impute(method='sample')

When Sample Works Best
~~~~~~~~~~~~~~~~~~~~~~

- ✓ Initial imputation (before MICE iterations)
- ✓ Categorical variables with many levels
- ✓ Quick exploratory analysis
- ✓ When no predictive relationship exists

Limitations
~~~~~~~~~~~

- ✗ Ignores relationships between variables
- ✗ No predictive component
- ✗ Only useful for simple cases or initialization

Choosing a Method
-----------------

Decision Tree
~~~~~~~~~~~~~

.. code-block:: text

   Is your data numeric or categorical?
   │
   ├── Mostly numeric
   │   │
   │   ├── Linear relationships? → PMM
   │   │
   │   ├── Non-linear? → CART or RF
   │   │
   │   └── Small sample or skewed? → MIDAS or PMM
   │
   └── Mixed or mostly categorical
       │
       ├── Simple relationships? → CART
       │
       └── Complex interactions? → RF
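
The same decision tree can be written as a small helper. ``suggest_methods`` is
hypothetical and not part of mice-py, and the order in which the numeric
branches are checked (small or skewed first, then non-linear) is an assumption,
since the diagram does not rank them.

.. code-block:: python

   def suggest_methods(mostly_numeric, nonlinear=False,
                       small_or_skewed=False, complex_interactions=False):
       """Return candidate method names following the decision tree above."""
       if mostly_numeric:
           if small_or_skewed:
               return ['midas', 'pmm']
           if nonlinear:
               return ['cart', 'rf']
           return ['pmm']  # linear relationships
       # Mixed or mostly categorical data.
       return ['rf'] if complex_interactions else ['cart']

   numeric_linear = suggest_methods(mostly_numeric=True)
   numeric_skewed = suggest_methods(mostly_numeric=True, small_or_skewed=True)
   mixed_complex = suggest_methods(mostly_numeric=False, complex_interactions=True)

Treat the result as a starting point and confirm it with the diagnostics
described later in this guide.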

General Guidelines
~~~~~~~~~~~~~~~~~~

**Start with PMM**
   It is the best-studied method and works well in most cases.

**Use CART for interactions**
   If you know or suspect important interactions between variables.

**Use RF for complexity**
   When you have complex patterns and computational resources.

**Use MIDAS when PMM fails**
   Particularly with small samples or skewed data.

**Use Sample for initialization**
   It is also adequate for very simple cases.

Using Different Methods for Different Variables
------------------------------------------------

You can use different methods for different variables:

.. code-block:: python

   method_dict = {
       'age': 'pmm',           # Numeric, approximately normal
       'income': 'midas',      # Numeric, highly skewed
       'education': 'sample',  # Categorical with few levels
       'job_type': 'cart',     # Categorical with many levels
       'health_score': 'rf'    # Numeric with complex patterns
   }
   
   mice.impute(n_imputations=5, method=method_dict)

Comparing Methods
-----------------

To compare methods empirically:

.. code-block:: python

   from plotting.diagnostics import densityplot

   # Assumes MICE is already imported and df is a pandas DataFrame
   # that contains missing values.

   # Try PMM
   mice_pmm = MICE(df)
   mice_pmm.impute(method='pmm')

   # Try CART
   mice_cart = MICE(df)
   mice_cart.impute(method='cart')

   # Compare distributions (in missing_pattern, 1 = observed and 0 = missing)
   missing_pattern = df.notna().astype(int)
   densityplot(mice_pmm.imputed_datasets, missing_pattern,
               save_path='pmm_density.png')
   densityplot(mice_cart.imputed_datasets, missing_pattern,
               save_path='cart_density.png')

Look for:

- How closely the imputed values follow the observed distribution
- Whether extreme imputed values are plausible
- Smooth transitions between observed and imputed values
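
A numeric check can complement the plots. The sketch below compares summary
statistics of observed and imputed values for one variable using only the
standard library; ``summarize`` and ``compare`` are hypothetical helpers, and
the two lists stand in for values you would extract from your own data.

.. code-block:: python

   import statistics

   def summarize(values):
       """Location, spread, and range of a list of numbers."""
       q1, median, q3 = statistics.quantiles(values, n=4)
       return {
           'mean': statistics.fmean(values),
           'sd': statistics.stdev(values),
           'q1': q1,
           'median': median,
           'q3': q3,
           'min': min(values),
           'max': max(values),
       }

   def compare(observed, imputed):
       """Imputed minus observed, statistic by statistic."""
       obs, imp = summarize(observed), summarize(imputed)
       return {name: imp[name] - obs[name] for name in obs}

   observed = [1.0, 2.0, 3.0, 4.0, 5.0]  # placeholder data
   imputed = [2.0, 3.0, 4.0, 5.0, 6.0]   # placeholder data
   diff = compare(observed, imputed)

Large differences are not necessarily a problem under MAR, where imputed values
may legitimately differ from observed ones, but they deserve a closer look.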

Research Findings
-----------------

Based on simulation studies in the thesis:

- **PMM** performs reliably under MCAR and mild MAR with symmetric distributions
- **MIDAS** consistently matches or outperforms PMM with skewness or small samples
- **CART/RF** handle non-linear relationships effectively but may not preserve 
  marginal distributions as well
- Method choice should consider data characteristics, missingness patterns, and sample size

Next Steps
----------

- Learn about :doc:`predictor_matrices` to control which variables predict which
- Check :doc:`convergence_diagnostics` after imputation
- See practical examples in :doc:`../examples/index`