MICE Class

The main class for performing multiple imputation by chained equations.

class imputation.MICE(data)

Bases: object

Multiple Imputation by Chained Equations (MICE) class.

This class implements the MICE algorithm for handling missing data through multiple imputations using chained equations.

Parameters: data (pd.DataFrame) – Input data with missing values. Must be a pandas DataFrame.

data

The validated and cleaned input data

Type: pd.DataFrame

id_obs

Dictionary mapping column names to indices of observed values

Type: Dict[str, np.ndarray]

id_mis

Dictionary mapping column names to indices of missing values

Type: Dict[str, np.ndarray]

__init__(data)

Initialize the MICE object.

Parameters: data (pd.DataFrame) – Input data with missing values. Must be a pandas DataFrame.
Raises: ValueError – If data is not a pandas DataFrame or contains duplicate column names.

impute(n_imputations=5, maxit=10, predictor_matrix=None, initial='sample', method=None, visit_sequence='monotone', **kwargs)

Perform multiple imputation by chained equations.

Parameters:

n_imputations (int, default=5) – Number of imputations to perform.
maxit (int, default=10) – Maximum number of iterations for each imputation cycle. Must be a positive integer.
predictor_matrix (pd.DataFrame, optional) – Binary matrix indicating which variables are used as predictors for each target variable. It should have the column names as both index and columns. A 1 indicates that the column variable is used as a predictor for the index variable. If None, a predictor matrix is estimated using _quickpred.
initial (str, default=DEFAULT_INITIAL_METHOD) – Initial imputation method ('sample' in the signature above). Must be one of SUPPORTED_INITIAL_METHODS.
method (Union[str, Dict[str, str]], optional) – Imputation method(s) to use. Each method must be one of SUPPORTED_METHODS.
    - str: use the same method for all columns
    - Dict[str, str]: dictionary mapping column names to their methods
    - None: use the default method for all columns
visit_sequence (Union[str, List[str]], default="monotone") – Sequence in which variables are visited during imputation.
    - str: "monotone" for a monotone missing data pattern
    - List[str]: list of column names specifying the order in which to visit variables
**kwargs (dict) – Additional keyword arguments.
    - output_dir (str, optional): Directory to save outputs for this run. If not provided, a timestamped folder is created in output_figures.

Parameters for specific imputation methods can also be passed. These should be prefixed with the method name and an underscore, e.g., pmm_donors=5 to pass donors=5 to the pmm imputer.

When predictor_matrix is not specified, the following can be passed for _quickpred:
    - min_cor (float, default=0.1): Minimum correlation for a predictor.
    - min_puc (float, default=0.0): Minimum proportion of usable cases.
    - include (list, optional): Columns to always include as predictors.
    - exclude (list, optional): Columns to always exclude as predictors.
    - correlation_method (str, default="pearson"): Correlation method used to compute the correlation matrix inside _quickpred.

fit(formula)

Fit a statistical model to each imputed dataset using the specified formula.

This method fits the specified statistical model to each dataset in self.imputed_datasets and stores the results in self.model_results.

Parameters: formula (str) – A formula string in patsy syntax for statsmodels (e.g., ‘y ~ x1 + x2’)
Raises: ValueError – If no imputed datasets are available or if variables in the formula are not in the data

Examples

>>> mice_obj = MICE(data)
>>> mice_obj.impute(n_imputations=5)
>>> mice_obj.fit('outcome ~ predictor1 + predictor2')

pool(summ=False)

Pool parameter estimates from fitted models using Rubin’s rules.

This method combines parameter estimates and their uncertainties from multiple imputed datasets according to Rubin’s (1987) rules for multiple imputation inference.

Parameters: summ (bool, default=False) – If True, returns a summary of the pooled results
Returns: If summ=False, returns a MICEresult object containing pooled estimates. If summ=True, returns a summary table of the pooled results.
Return type: MICEresult or summary
Raises: ValueError – If no model results are available from analysis

Notes

Rubin’s pooling rules combine:

- Point estimates: average across imputations
- Within-imputation variance: average of individual model variances
- Between-imputation variance: variance of point estimates across imputations
- Total variance: within + (1 + 1/m) * between, where m is the number of imputations
- Fraction of missing information (FMI): proportion of uncertainty due to missingness
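The pooling rules listed above can be sketched for a single coefficient with the standard library alone. The function name pool_rubin and the returned keys are hypothetical and are not part of MICE. The last quantity is the simple variance ratio (1 + 1/m) * between / total, and the FMI reported by pool() may additionally apply a degrees-of-freedom correction.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed datasets (illustrative sketch).

    estimates: point estimate from each fitted model
    variances: squared standard error from each fitted model
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two imputations and matching inputs")
    pooled_estimate = mean(estimates)        # average of point estimates
    within = mean(variances)                 # average model variance
    between = variance(estimates)            # sample variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between   # total variance
    return {
        "estimate": pooled_estimate,
        "within_var": within,
        "between_var": between,
        "total_var": total,
        "std_error": sqrt(total),
        # Share of total variance attributable to missingness (uncorrected)
        "missing_var_ratio": (1 + 1 / m) * between / total,
    }


pooled = pool_rubin(estimates=[1.0, 2.0, 3.0], variances=[0.5, 0.5, 0.5])
print(pooled)
```

A large between-imputation variance relative to the within-imputation variance pushes the ratio toward 1, which indicates that the missing data contribute most of the uncertainty in that coefficient.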

References

Rubin, D.B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: John Wiley and Sons.

Overview

The MICE class is the primary interface for multiple imputation in mice-py. It handles the entire imputation process, from initialization through to analysis and pooling.

Basic Usage

from imputation import MICE
import pandas as pd

# Load data with missing values
df = pd.read_csv('data.csv')

# Initialize MICE object
mice = MICE(df)

# Perform imputation
mice.impute(
    n_imputations=5,
    maxit=10,
    method='pmm'
)

# Access imputed datasets
imputed_datasets = mice.imputed_datasets

# Fit a statistical model
mice.fit('outcome ~ predictor1 + predictor2')

# Pool results
pooled = mice.pool(summ=True)
print(pooled)

Main Methods

__init__(data)

Initialize a MICE object with your data.

Parameters:

data (pandas.DataFrame): Input data with missing values

Raises:

ValueError: If data is not a DataFrame or has duplicate column names

impute()

Perform multiple imputation.

Parameters:

n_imputations (int): Number of imputed datasets (default: 5)
maxit (int): Number of iterations (default: 10)
method (str or dict, optional): Imputation method(s) (default: None, which applies the default method, ‘pmm’, to all columns)
initial (str): Initial imputation method (default: ‘sample’)
predictor_matrix (DataFrame, optional): Custom predictor matrix
visit_sequence (str or list): Variable visit order (default: ‘monotone’)
seed (int, optional): Random seed for reproducibility
Additional method-specific parameters (see below)

Method-specific parameters:

PMM: pmm_donors, pmm_matchtype, pmm_ridge
CART: cart_max_depth, cart_min_samples_split, cart_min_samples_leaf
RF: rf_n_estimators, rf_max_depth, rf_max_features
MIDAS: midas_donors, midas_ridge
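The prefix convention can be pictured as a routing step that strips the method name and forwards the remainder to the matching imputer. The sketch below only illustrates that idea: split_method_kwargs is a hypothetical helper, and the way impute() actually parses its keyword arguments is not shown in this documentation.

```python
def split_method_kwargs(kwargs, methods=("pmm", "cart", "rf", "midas")):
    """Hypothetical helper: group 'method_param' keys by method prefix."""
    routed = {method: {} for method in methods}
    other = {}
    for key, value in kwargs.items():
        prefix, sep, name = key.partition("_")
        if sep and name and prefix in routed:
            routed[prefix][name] = value   # e.g. pmm_donors -> pmm: donors
        else:
            other[key] = value             # not a method-prefixed option
    return routed, other


routed, other = split_method_kwargs(
    {"pmm_donors": 10, "cart_max_depth": 15, "min_cor": 0.2, "output_dir": "runs"}
)
print(routed)
print(other)
```

Only keys whose prefix matches a known method are routed. Options such as min_cor or output_dir stay in the second dictionary untouched.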

Returns:

None (modifies object in-place)

Raises:

ValueError: If parameters are invalid

fit(formula)

Fit a statistical model on all imputed datasets.

Parameters:

formula (str): Model formula in Patsy syntax (e.g., ‘y ~ x1 + x2’)

Returns:

None (stores results internally)

Example:

# Simple regression
mice.fit('income ~ age + education')

# With interaction
mice.fit('income ~ age * education')

# Multiple predictors
mice.fit('outcome ~ x1 + x2 + x3 + C(categorical_var)')

pool(summ=False)

Pool results from multiple imputed datasets using Rubin’s rules.

Parameters:

summ (bool): Return a summary table (True) or a MICEresult object with the detailed pooled results (False, the default)

Returns:

If summ=True, a pandas.DataFrame of pooled results with columns:

- Estimate: Pooled coefficient
- Std.Error: Pooled standard error
- t-statistic: Test statistic
- df: Degrees of freedom
- P>|t|: p-value
- [0.025: Lower 95% CI bound
- 0.975]: Upper 95% CI bound
- FMI: Fraction of missing information

If summ=False, a MICEresult object containing the pooled estimates.

Example:

results = mice.pool(summ=True)
print(results)

# Access specific values
coef = results.loc['age', 'Estimate']
pval = results.loc['age', 'P>|t|']
fmi = results.loc['age', 'FMI']

Attributes

data

The original input data (pandas.DataFrame).

imputed_datasets

List of imputed datasets (list of pandas.DataFrames). Available after calling impute().

chain_mean

Dictionary mapping variable names to mean chains across iterations. Used for convergence diagnostics.

chain_var

Dictionary mapping variable names to variance chains across iterations. Used for convergence diagnostics.
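Besides plotting, chains like these can be summarised numerically. The sketch below computes the Gelman-Rubin potential scale reduction factor (R-hat), a general diagnostic for parallel chains rather than a feature of this class. It assumes, purely for illustration, that the per-iteration means for one variable are available as one list per imputation; the actual layout of chain_mean is not specified here.

```python
from math import sqrt
from statistics import mean, variance


def r_hat(chains):
    """Gelman-Rubin R-hat for m parallel chains of equal length n."""
    m = len(chains)
    n = len(chains[0]) if m else 0
    if m < 2 or n < 2 or any(len(chain) != n for chain in chains):
        raise ValueError("need at least two chains of equal length >= 2")
    within = mean([variance(chain) for chain in chains])       # W
    between = n * variance([mean(chain) for chain in chains])  # B
    if within == 0:
        raise ValueError("chains are constant; R-hat is undefined")
    pooled_var = (n - 1) / n * within + between / n
    return sqrt(pooled_var / within)


# Hypothetical per-iteration means of one variable, one list per imputation
well_mixed = [[1.0, 2.0, 3.0, 2.0], [2.0, 1.0, 2.0, 3.0]]
separated = [[1.0, 2.0, 1.0, 2.0], [11.0, 12.0, 11.0, 12.0]]
print(r_hat(well_mixed), r_hat(separated))
```

Values close to 1 suggest that the chains overlap, while values well above 1 (as for the second example) suggest that more iterations are needed.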

id_obs

Dictionary mapping variable names to boolean arrays indicating observed values.

id_mis

Dictionary mapping variable names to boolean arrays indicating missing values.

Examples

Basic Imputation

from imputation import MICE
import pandas as pd
import numpy as np

# Create sample data
df = pd.DataFrame({
    'age': [25, 30, np.nan, 45, 50],
    'income': [50000, np.nan, 60000, 75000, np.nan],
    'education': ['HS', 'BS', 'MS', np.nan, 'PhD']
})

# Impute
mice = MICE(df)
mice.impute(n_imputations=5, maxit=10, method='pmm')

# Check results
print(f"Created {len(mice.imputed_datasets)} complete datasets")

Custom Methods Per Variable

method_dict = {
    'age': 'pmm',
    'income': 'cart',
    'education': 'sample'
}

mice.impute(n_imputations=10, method=method_dict)

Custom Predictor Matrix

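import numpy as np

# Create predictor matrix: all ones with a zero diagonal, so that no
# variable predicts itself. Building it from an array avoids writing
# through pred_matrix.values, which may be a copy or read-only.
pred_matrix = pd.DataFrame(
    1 - np.eye(len(df.columns), dtype=int),
    index=df.columns,
    columns=df.columns,
)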

# Don't use education to predict income
pred_matrix.loc['income', 'education'] = 0

mice.impute(predictor_matrix=pred_matrix)

With Method-Specific Parameters

# PMM with more donors
mice.impute(method='pmm', pmm_donors=10)

# CART with depth limit
mice.impute(method='cart', cart_max_depth=15)

# Random Forest with more trees
mice.impute(method='rf', rf_n_estimators=200)

Complete Analysis Workflow

import pandas as pd

from imputation import MICE, configure_logging
from plotting.diagnostics import plot_chain_stats

# Enable logging
configure_logging(level='INFO')

# Load data
df = pd.read_csv('data.csv')

# Impute
mice = MICE(df)
mice.impute(n_imputations=20, maxit=20, method='pmm')

# Check convergence
plot_chain_stats(mice.chain_mean, mice.chain_var,
                 save_path='convergence.png')

# Fit model
mice.fit('outcome ~ age + gender + treatment')

# Pool results
results = mice.pool(summ=True)
print(results)

# Check FMI
print(f"\nMax FMI: {results['FMI'].max():.3f}")
