MICE Class

The main class for performing multiple imputation by chained equations.

class imputation.MICE(data)[source]

Bases: object

Multiple Imputation by Chained Equations (MICE) class.

This class implements the MICE algorithm for handling missing data through multiple imputations using chained equations.

Parameters:

data (pd.DataFrame) – Input data with missing values. Must be a pandas DataFrame.

data

The validated and cleaned input data

Type:

pd.DataFrame

id_obs

Dictionary mapping column names to indices of observed values

Type:

Dict[str, np.ndarray]

id_mis

Dictionary mapping column names to indices of missing values

Type:

Dict[str, np.ndarray]

__init__(data)[source]

Initialize the MICE object.

Parameters:

data (pd.DataFrame) – Input data with missing values. Must be a pandas DataFrame.

Raises:

ValueError – If data is not a pandas DataFrame or contains duplicate column names
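
The two checks listed under Raises can be sketched as follows. This is only an illustration, not the library's code: `validate_input` is a hypothetical helper name, and the real error messages may differ.

```python
import pandas as pd


def validate_input(data):
    """Hypothetical helper mirroring the documented checks in MICE.__init__."""
    # Reject anything that is not a pandas DataFrame
    if not isinstance(data, pd.DataFrame):
        raise ValueError("data must be a pandas DataFrame")
    # Reject repeated column names, which would make column lookups ambiguous
    duplicated = data.columns[data.columns.duplicated()].unique().tolist()
    if duplicated:
        raise ValueError(f"duplicate column names: {duplicated}")
    return data


ok = validate_input(pd.DataFrame({"age": [25.0, None], "income": [1.0, 2.0]}))
```

Checking for duplicates up front matters because per-column bookkeeping such as id_obs and id_mis is keyed by column name.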

impute(n_imputations=5, maxit=10, predictor_matrix=None, initial='sample', method=None, visit_sequence='monotone', **kwargs)[source]

Perform multiple imputation by chained equations.

Parameters:
  • n_imputations (int, default=5) – Number of imputations to perform

  • maxit (int, default=10) – Maximum number of iterations for each imputation cycle. Must be a positive integer.

  • predictor_matrix (pd.DataFrame, optional) – Binary matrix indicating which variables should be used as predictors for each target variable. Should have column names as both index and columns. A 1 indicates that the column variable is used as predictor for the index variable. If None, a predictor matrix is estimated using _quickpred.

  • initial (str, default=DEFAULT_INITIAL_METHOD) – Initial imputation method. Must be one of SUPPORTED_INITIAL_METHODS.

  • method (Union[str, Dict[str, str]], optional) – Imputation method(s) to use. Each method must be one of SUPPORTED_METHODS.

      - str: use the same method for all columns
      - Dict[str, str]: dictionary mapping column names to their methods
      - None: use the default method for all columns

  • visit_sequence (Union[str, List[str]], default="monotone") – Sequence in which variables should be visited during imputation.

      - str: "monotone" for a monotone missing data pattern
      - List[str]: list of column names specifying the order in which to visit variables

  • **kwargs (dict) – Additional keyword arguments.

      - output_dir (str, optional): Directory to save outputs for this run. If not provided, a timestamped folder is created in output_figures.

    Parameters for specific imputation methods can also be passed. These should be prefixed with the method name and an underscore, e.g., pmm_donors=5 to pass donors=5 to the pmm imputer.

    When predictor_matrix is not specified, the following can be passed for _quickpred:

      - min_cor (float, default=0.1): Minimum correlation for a predictor.
      - min_puc (float, default=0.0): Minimum proportion of usable cases.
      - include (list, optional): Columns to always include as predictors.
      - exclude (list, optional): Columns to always exclude as predictors.
      - correlation_method (str, default="pearson"): Correlation method used to compute the correlation matrix inside _quickpred.
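
To make the predictor_matrix layout and the min_cor, include, and exclude options concrete, here is a simplified sketch of correlation-based predictor selection. It is not the library's _quickpred: `quickpred_sketch` is a hypothetical name, only numeric columns are considered, and the min_puc (usable cases) criterion is deliberately left out.

```python
import numpy as np
import pandas as pd


def quickpred_sketch(df, min_cor=0.1, include=(), exclude=(),
                     correlation_method="pearson"):
    """Simplified, hypothetical stand-in for correlation-based selection."""
    numeric = df.select_dtypes(include="number")
    # Pairwise-complete absolute correlations between numeric columns
    corr = numeric.corr(method=correlation_method).abs()
    # 1 where the correlation reaches the threshold (NaN compares as False)
    selected = (corr.to_numpy() >= min_cor).astype(int)
    np.fill_diagonal(selected, 0)  # a variable never predicts itself
    # Rows are target variables, columns are candidate predictors
    pred = pd.DataFrame(selected, index=corr.index, columns=corr.columns)
    for col in include:
        pred.loc[pred.index != col, col] = 1  # always use as predictor
    for col in exclude:
        pred[col] = 0  # never use as predictor
    return pred


df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6],
    "y": [2, 4, 6, 8, 10, np.nan],
    "z": [1, -1, 1, -1, 1, -1],
})
pred = quickpred_sketch(df, min_cor=0.5)
forced = quickpred_sketch(df, min_cor=0.5, include=["z"])
print(pred)
```

In this toy data x and y are strongly correlated while z is not, so with min_cor=0.5 only x and y select each other, and include=["z"] forces z into every other variable's predictor set.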

fit(formula)[source]

Fit a statistical model to each imputed dataset using the specified formula.

This method fits the specified statistical model to each dataset in self.imputed_datasets and stores the results in self.model_results.

Parameters:

formula (str) – A formula string in patsy syntax for statsmodels (e.g., ‘y ~ x1 + x2’)

Raises:

ValueError – If no imputed datasets are available or if variables in formula are not in data
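
Conceptually, fit repeats one analysis on every completed dataset and keeps each result for later pooling. The sketch below illustrates that loop with a plain NumPy least-squares fit standing in for the statsmodels formula interface. `fit_ols` and the toy datasets are hypothetical and are not part of mice-py.

```python
import numpy as np
import pandas as pd


def fit_ols(X, y):
    """Ordinary least squares with an intercept.

    Returns the coefficient vector and the variance of each coefficient.
    """
    X1 = np.column_stack([np.ones(len(y)), X])  # prepend intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    dof = len(y) - X1.shape[1]  # residual degrees of freedom
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(X1.T @ X1)
    return beta, np.diag(cov)


# Two toy "imputed" datasets that differ only in one filled-in value
datasets = [
    pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0], "y": [1.0, 3.0, 5.0, 7.2]}),
    pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0], "y": [1.0, 3.0, 5.0, 6.8]}),
]

# One (coefficients, variances) pair per dataset, in the same order
results = [
    fit_ols(d[["x"]].to_numpy(dtype=float), d["y"].to_numpy(dtype=float))
    for d in datasets
]
```

Keeping one result per dataset, rather than stacking the datasets and fitting once, is what lets pooling separate within-imputation from between-imputation variance.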

Examples

>>> mice_obj = MICE(data)
>>> mice_obj.impute(n_imputations=5)
>>> mice_obj.fit('outcome ~ predictor1 + predictor2')

pool(summ=False)[source]

Pool parameter estimates from fitted models using Rubin’s rules.

This method combines parameter estimates and their uncertainties from multiple imputed datasets according to Rubin’s (1987) rules for multiple imputation inference.

Parameters:

summ (bool, default=False) – If True, returns a summary of the pooled results

Returns:

If summ=False, returns a MICEresult object containing pooled estimates. If summ=True, returns a summary table of the pooled results.

Return type:

MICEresult or summary

Raises:

ValueError – If no model results are available from analysis

Notes

Rubin’s pooling rules combine:

  - Point estimates: average across imputations
  - Within-imputation variance: average of individual model variances
  - Between-imputation variance: variance of point estimates across imputations
  - Total variance: within + (1 + 1/m) * between, where m is the number of imputations
  - Fraction of missing information (FMI): proportion of uncertainty due to missingness
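
The rules above can be worked through for a single coefficient in a few lines. This sketch is independent of mice-py: `pool_rubin` is a hypothetical function, and the returned `lambda_` is the plain share of total variance attributable to missingness, whereas a reported FMI may also include a degrees-of-freedom correction.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputations (hypothetical helper)."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two estimates with matching variances")
    q_bar = mean(estimates)          # pooled point estimate
    within = mean(variances)         # average of the squared standard errors
    between = variance(estimates)    # sample variance, divisor m - 1
    total = within + (1 + 1 / m) * between
    lambda_ = (1 + 1 / m) * between / total
    return {
        "estimate": q_bar,
        "within": within,
        "between": between,
        "total": total,
        "std_error": sqrt(total),
        "lambda_": lambda_,
    }


# Coefficient estimates and their variances from m = 3 imputed datasets
pooled = pool_rubin([1.0, 1.2, 0.8], [0.04, 0.05, 0.03])
print(pooled)
```

Here the within and between variances are both about 0.04, so the total is 0.04 + (4/3) * 0.04, roughly 0.0933, and a little over half of that total comes from the between-imputation term.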

References

Rubin, D.B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: John Wiley and Sons.

Overview

The MICE class is the primary interface for multiple imputation in mice-py. It handles the entire imputation process, from initialization through to analysis and pooling.

Basic Usage

from imputation import MICE
import pandas as pd

# Load data with missing values
df = pd.read_csv('data.csv')

# Initialize MICE object
mice = MICE(df)

# Perform imputation
mice.impute(
    n_imputations=5,
    maxit=10,
    method='pmm'
)

# Access imputed datasets
imputed_datasets = mice.imputed_datasets

# Fit a statistical model
mice.fit('outcome ~ predictor1 + predictor2')

# Pool results
pooled = mice.pool(summ=True)
print(pooled)

Main Methods

__init__(data)

Initialize a MICE object with your data.

Parameters:
  • data (pandas.DataFrame): Input data with missing values

Raises:
  • ValueError: If data is not a DataFrame or has duplicate column names

impute()

Perform multiple imputation.

Parameters:
  • n_imputations (int): Number of imputed datasets (default: 5)

  • maxit (int): Number of iterations (default: 10)

  • method (str or dict): Imputation method(s) (default: ‘pmm’)

  • initial (str): Initial imputation method (default: ‘sample’)

  • predictor_matrix (DataFrame, optional): Custom predictor matrix

  • visit_sequence (str or list): Variable visit order (default: ‘monotone’)

  • seed (int, optional): Random seed for reproducibility

  • Additional method-specific parameters (see below)

Method-specific parameters:
  • PMM: pmm_donors, pmm_matchtype, pmm_ridge

  • CART: cart_max_depth, cart_min_samples_split, cart_min_samples_leaf

  • RF: rf_n_estimators, rf_max_depth, rf_max_features

  • MIDAS: midas_donors, midas_ridge
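
The prefix convention (pmm_donors=5 becomes donors=5 for the pmm imputer) can be pictured as a simple routing step. The function below is a hypothetical sketch of that idea, not the routing code used inside impute(), and its list of method names is taken from the prefixes shown above.

```python
def split_method_kwargs(kwargs, methods=("pmm", "cart", "rf", "midas")):
    """Group prefixed keyword arguments by imputation method (hypothetical)."""
    routed = {name: {} for name in methods}
    other = {}
    for key, value in kwargs.items():
        # "pmm_donors" -> prefix "pmm", remainder "donors"
        prefix, sep, remainder = key.partition("_")
        if sep and remainder and prefix in routed:
            routed[prefix][remainder] = value
        else:
            # Not a method prefix, e.g. output_dir or min_cor
            other[key] = value
    return routed, other


routed, other = split_method_kwargs({
    "pmm_donors": 5,
    "cart_max_depth": 15,
    "output_dir": "runs",
    "min_cor": 0.2,
})
```

Splitting only on the first underscore keeps multi-word parameter names such as max_depth intact after the prefix is removed.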

Returns:
  • None (modifies object in-place)

Raises:
  • ValueError: If parameters are invalid

fit(formula)

Fit a statistical model on all imputed datasets.

Parameters:
  • formula (str): Model formula in Patsy syntax (e.g., ‘y ~ x1 + x2’)

Returns:
  • None (stores results internally)

Example:

# Simple regression
mice.fit('income ~ age + education')

# With interaction
mice.fit('income ~ age * education')

# Multiple predictors
mice.fit('outcome ~ x1 + x2 + x3 + C(categorical_var)')

pool(summ=False)

Pool results from multiple imputed datasets using Rubin’s rules.

Parameters:
  • summ (bool): If True, return a summary table; if False, return a MICEresult object with the pooled estimates (default: False)

Returns:
  • If summ=True, a pandas.DataFrame of pooled results with columns:

      - Estimate: Pooled coefficient
      - Std.Error: Pooled standard error
      - t-statistic: Test statistic
      - df: Degrees of freedom
      - P>|t|: p-value
      - [0.025: Lower 95% CI bound
      - 0.975]: Upper 95% CI bound
      - FMI: Fraction of missing information

  • If summ=False, a MICEresult object containing the pooled estimates

Example:

results = mice.pool(summ=True)
print(results)

# Access specific values
coef = results.loc['age', 'Estimate']
pval = results.loc['age', 'P>|t|']
fmi = results.loc['age', 'FMI']

Attributes

data

The original input data (pandas.DataFrame).

imputed_datasets

List of imputed datasets (list of pandas.DataFrames). Available after calling impute().

chain_mean

Dictionary mapping variable names to mean chains across iterations. Used for convergence diagnostics.

chain_var

Dictionary mapping variable names to variance chains across iterations. Used for convergence diagnostics.

id_obs

Dictionary mapping variable names to boolean arrays indicating observed values.

id_mis

Dictionary mapping variable names to boolean arrays indicating missing values.
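
The id_obs and id_mis bookkeeping can be reproduced with plain pandas and NumPy. The sketch below shows both a boolean-mask form and an integer-position form, since either representation describes the same rows. The variable names are illustrative and this is not the library's internal code.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, 30, np.nan, 45],
    "income": [50000, np.nan, 60000, np.nan],
})

# Boolean masks: True where a value is missing or observed
mis_mask = {col: df[col].isna().to_numpy() for col in df.columns}
obs_mask = {col: ~mask for col, mask in mis_mask.items()}

# Equivalent integer row positions
mis_idx = {col: np.flatnonzero(mask) for col, mask in mis_mask.items()}
obs_idx = {col: np.flatnonzero(mask) for col, mask in obs_mask.items()}

print(mis_idx["income"], obs_idx["income"])
```

A mask can index the column directly, for example `df.loc[mis_mask["age"], "age"]`, while positions are convenient with `iloc`.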

Examples

Basic Imputation

from imputation import MICE
import pandas as pd
import numpy as np

# Create sample data
df = pd.DataFrame({
    'age': [25, 30, np.nan, 45, 50],
    'income': [50000, np.nan, 60000, 75000, np.nan],
    'education': ['HS', 'BS', 'MS', np.nan, 'PhD']
})

# Impute
mice = MICE(df)
mice.impute(n_imputations=5, maxit=10, method='pmm')

# Check results
print(f"Created {len(mice.imputed_datasets)} complete datasets")

Custom Methods Per Variable

method_dict = {
    'age': 'pmm',
    'income': 'cart',
    'education': 'sample'
}

mice.impute(n_imputations=10, method=method_dict)

Custom Predictor Matrix

import numpy as np

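# Create predictor matrix: 1 everywhere, 0 on the diagonal
# (built from a fresh array, since writing into .values can fail on a read-only view)
off_diagonal = 1 - np.eye(len(df.columns), dtype=int)
pred_matrix = pd.DataFrame(off_diagonal, index=df.columns, columns=df.columns)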

# Don't use education to predict income
pred_matrix.loc['income', 'education'] = 0

mice.impute(predictor_matrix=pred_matrix)

With Method-Specific Parameters

# PMM with more donors
mice.impute(method='pmm', pmm_donors=10)

# CART with depth limit
mice.impute(method='cart', cart_max_depth=15)

# Random Forest with more trees
mice.impute(method='rf', rf_n_estimators=200)

Complete Analysis Workflow

import pandas as pd

from imputation import MICE, configure_logging
from plotting.diagnostics import plot_chain_stats

# Enable logging
configure_logging(level='INFO')

# Load data
df = pd.read_csv('data.csv')

# Impute
mice = MICE(df)
mice.impute(n_imputations=20, maxit=20, method='pmm')

# Check convergence
plot_chain_stats(mice.chain_mean, mice.chain_var,
                 save_path='convergence.png')

# Fit model
mice.fit('outcome ~ age + gender + treatment')

# Pool results
results = mice.pool(summ=True)
print(results)

# Check FMI
print(f"\nMax FMI: {results['FMI'].max():.3f}")

See Also