MICE Class
The main class for performing multiple imputation by chained equations.
- class imputation.MICE(data)
Bases: object
Multiple Imputation by Chained Equations (MICE) class.
This class implements the MICE algorithm for handling missing data through multiple imputations using chained equations.
- Parameters:
data (pd.DataFrame) – Input data with missing values. Must be a pandas DataFrame.
- data
The validated and cleaned input data
- Type:
pd.DataFrame
- __init__(data)
Initialize the MICE object.
- Parameters:
data (pd.DataFrame) – Input data with missing values. Must be a pandas DataFrame.
- Raises:
ValueError – If data is not a pandas DataFrame or contains duplicate column names
- impute(n_imputations=5, maxit=10, predictor_matrix=None, initial='sample', method=None, visit_sequence='monotone', **kwargs)
Perform multiple imputation by chained equations.
- Parameters:
n_imputations (int, default=5) – Number of imputations to perform
maxit (int, default=10) – Maximum number of iterations for each imputation cycle. Must be a positive integer.
predictor_matrix (pd.DataFrame, optional) – Binary matrix indicating which variables should be used as predictors for each target variable. Should have column names as both index and columns. A 1 indicates that the column variable is used as predictor for the index variable. If None, a predictor matrix is estimated using _quickpred.
initial (str, default=DEFAULT_INITIAL_METHOD) – Initial imputation method. Must be one of SUPPORTED_INITIAL_METHODS.
method (Union[str, Dict[str, str]], optional) – Imputation method(s) to use. Each method must be one of SUPPORTED_METHODS.
- str: use the same method for all columns
- Dict[str, str]: dictionary mapping column names to their methods
- None: use the default method for all columns
visit_sequence (Union[str, List[str]], default="monotone") – Sequence in which variables should be visited during imputation.
- str: "monotone" for monotone missing data pattern
- List[str]: list of column names specifying the order to visit variables
**kwargs (dict) – Additional keyword arguments.
- output_dir (str, optional): Directory to save outputs for this run. If not provided, a timestamped folder is created in output_figures.
Parameters for specific imputation methods can also be passed. These should be prefixed with the method name and an underscore, e.g., pmm_donors=5 to pass donors=5 to the pmm imputer.
When predictor_matrix is not specified, the following can be passed for _quickpred:
- min_cor (float, default=0.1): Minimum correlation for a predictor.
- min_puc (float, default=0.0): Minimum proportion of usable cases.
- include (list, optional): Columns to always include as predictors.
- exclude (list, optional): Columns to always exclude as predictors.
- correlation_method (str, default="pearson"): Correlation method used to compute the correlation matrix inside _quickpred.
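The prefix convention for method-specific keywords can be illustrated with a short sketch. The helper split_method_kwargs below is hypothetical and is not part of mice-py; it only shows how a keyword such as pmm_donors=5 could be mapped to donors=5 for the pmm imputer, assuming method names contain no underscore. The library's internal routing may differ.

```python
def split_method_kwargs(kwargs, methods=("pmm", "cart", "rf", "midas")):
    """Group prefixed keyword arguments by imputation method.

    Hypothetical helper for illustration only; not mice-py's implementation.
    """
    routed = {name: {} for name in methods}
    remaining = {}
    for key, value in kwargs.items():
        # Split on the first underscore: "pmm_donors" -> ("pmm", "_", "donors")
        prefix, sep, param = key.partition("_")
        if sep and param and prefix in routed:
            routed[prefix][param] = value
        else:
            # Not a method prefix (e.g. min_cor for _quickpred): leave untouched
            remaining[key] = value
    return routed, remaining


routed, remaining = split_method_kwargs(
    {"pmm_donors": 5, "cart_max_depth": 15, "min_cor": 0.2}
)
print(routed["pmm"])   # keywords destined for the pmm imputer
print(routed["cart"])  # keywords destined for the cart imputer
print(remaining)       # everything else
```

Keywords without a recognized method prefix, such as min_cor, stay in the remaining dictionary, which mirrors how the _quickpred options are described separately from the method-specific ones.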
- fit(formula)
Fit a statistical model to each imputed dataset using the specified formula.
This method fits the specified statistical model to each dataset in self.imputed_datasets and stores the results in self.model_results.
- Parameters:
formula (str) – A formula string in patsy syntax for statsmodels (e.g., ‘y ~ x1 + x2’)
- Raises:
ValueError – If no imputed datasets are available or if variables in formula are not in data
Examples
>>> mice_obj = MICE(data)
>>> mice_obj.impute(n_imputations=5)
>>> mice_obj.fit('outcome ~ predictor1 + predictor2')
- pool(summ=False)
Pool parameter estimates from fitted models using Rubin’s rules.
This method combines parameter estimates and their uncertainties from multiple imputed datasets according to Rubin’s (1987) rules for multiple imputation inference.
- Parameters:
summ (bool, default=False) – If True, returns a summary of the pooled results
- Returns:
If summ=False, returns a MICEresult object containing pooled estimates. If summ=True, returns a summary table of the pooled results.
- Return type:
MICEresult or summary
- Raises:
ValueError – If no model results are available from analysis
Notes
Rubin's pooling rules combine the following quantities, where m is the number of imputations:
- Point estimates: average across imputations
- Within-imputation variance: average of individual model variances
- Between-imputation variance: variance of point estimates across imputations
- Total variance: within + (1 + 1/m) * between
- Fraction of missing information (FMI): proportion of uncertainty due to missingness
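As a minimal sketch of these rules for a single coefficient, the function below uses only the standard library. The name pool_rubin and its return keys are illustrative and not part of mice-py. The last quantity, lambda, is the share of total variance attributable to missingness; it approaches the FMI when the pooled degrees of freedom are large. Whether mice-py reports this quantity or a degrees-of-freedom-adjusted FMI is not stated here.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputations with Rubin's rules.

    estimates: point estimates, one per imputed dataset
    variances: squared standard errors, one per imputed dataset
    Illustrative helper; not mice-py's implementation.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching lengths")
    pooled = mean(estimates)        # pooled point estimate
    within = mean(variances)        # within-imputation variance
    between = variance(estimates)   # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    # Share of total variance due to missingness (lambda)
    lam = (1 + 1 / m) * between / total if total > 0 else 0.0
    return {
        "estimate": pooled,
        "within": within,
        "between": between,
        "total": total,
        "lambda": lam,
    }


result = pool_rubin([0.52, 0.48, 0.55, 0.50, 0.45], [0.010, 0.012, 0.011, 0.009, 0.013])
print(result["estimate"], result["total"] ** 0.5)  # pooled estimate and standard error
```

The pooled standard error is the square root of the total variance, which is why the usage line takes total ** 0.5.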
References
Rubin, D.B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: John Wiley and Sons.
Overview
The MICE class is the primary interface for multiple imputation in mice-py.
It handles the entire imputation process, from initialization through to analysis
and pooling.
Basic Usage
from imputation import MICE
import pandas as pd
# Load data with missing values
df = pd.read_csv('data.csv')
# Initialize MICE object
mice = MICE(df)
# Perform imputation
mice.impute(
    n_imputations=5,
    maxit=10,
    method='pmm'
)
# Access imputed datasets
imputed_datasets = mice.imputed_datasets
# Fit a statistical model
mice.fit('outcome ~ predictor1 + predictor2')
# Pool results
pooled = mice.pool(summ=True)
print(pooled)
Main Methods
__init__(data)
Initialize a MICE object with your data.
- Parameters:
data (pandas.DataFrame): Input data with missing values
- Raises:
ValueError: If data is not a DataFrame or has duplicate column names
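Because duplicate column names cause a ValueError at construction time, it can help to check for them beforehand. The sketch below is a standalone helper (find_duplicate_columns is a hypothetical name, not a mice-py function) that works on any iterable of column names.

```python
from collections import Counter


def find_duplicate_columns(columns):
    """Return column names that occur more than once, in first-seen order."""
    counts = Counter(columns)
    return [name for name, count in counts.items() if count > 1]


duplicates = find_duplicate_columns(["age", "income", "age", "education"])
if duplicates:
    print(f"Rename or drop duplicated columns before imputing: {duplicates}")
```

With a pandas DataFrame, the same helper can be called as find_duplicate_columns(df.columns).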
impute()
Perform multiple imputation.
- Parameters:
n_imputations (int): Number of imputed datasets (default: 5)
maxit (int): Number of iterations (default: 10)
method (str or dict, optional): Imputation method(s). The signature default is None, in which case the default method ('pmm') is applied to all columns
initial (str): Initial imputation method (default: ‘sample’)
predictor_matrix (DataFrame, optional): Custom predictor matrix
visit_sequence (str or list): Variable visit order (default: ‘monotone’)
seed (int, optional): Random seed for reproducibility
Additional method-specific parameters (see below)
- Method-specific parameters:
PMM: pmm_donors, pmm_matchtype, pmm_ridge
CART: cart_max_depth, cart_min_samples_split, cart_min_samples_leaf
RF: rf_n_estimators, rf_max_depth, rf_max_features
MIDAS: midas_donors, midas_ridge
- Returns:
None (modifies object in-place)
- Raises:
ValueError: If parameters are invalid
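The visit_sequence='monotone' option can be pictured with a small sketch. It assumes that 'monotone' follows the convention of R's mice package, where incomplete variables are visited in order of increasing number of missing values; this page does not confirm that mice-py does exactly this. The function name monotone_visit_sequence is hypothetical.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def monotone_visit_sequence(columns):
    """Order incomplete columns by ascending number of missing values.

    Assumed convention (as in R's mice); not taken from mice-py's source.
    """
    n_missing = {
        name: sum(is_missing(v) for v in values)
        for name, values in columns.items()
    }
    incomplete = [name for name, n in n_missing.items() if n > 0]
    # sorted() is stable, so ties keep their original column order
    return sorted(incomplete, key=lambda name: n_missing[name])


nan = float("nan")
data = {
    "age": [25, 30, nan, 45, 50],
    "income": [50000, nan, 60000, 75000, nan],
    "education": ["HS", "BS", "MS", None, "PhD"],
    "id": [1, 2, 3, 4, 5],
}
order = monotone_visit_sequence(data)
print(order)
```

Fully observed columns such as id are left out of the sequence because they need no imputation. A list like the one returned here can also be passed explicitly as visit_sequence.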
fit(formula)
Fit a statistical model on all imputed datasets.
- Parameters:
formula (str): Model formula in Patsy syntax (e.g., ‘y ~ x1 + x2’)
- Returns:
None (stores results internally)
Example:
# Simple regression
mice.fit('income ~ age + education')
# With interaction
mice.fit('income ~ age * education')
# Multiple predictors
mice.fit('outcome ~ x1 + x2 + x3 + C(categorical_var)')
pool(summ=False)
Pool results from multiple imputed datasets using Rubin’s rules.
- Parameters:
summ (bool): Return a summary table (True) or the detailed MICEresult object (False) (default: False)
- Returns:
pandas.DataFrame (when summ=True): Pooled results with columns:
- Estimate: Pooled coefficient
- Std.Error: Pooled standard error
- t-statistic: Test statistic
- df: Degrees of freedom
- P>|t|: p-value
- [0.025: Lower 95% CI bound
- 0.975]: Upper 95% CI bound
- FMI: Fraction of missing information
Example:
results = mice.pool(summ=True)
print(results)
# Access specific values
coef = results.loc['age', 'Estimate']
pval = results.loc['age', 'P>|t|']
fmi = results.loc['age', 'FMI']
Attributes
data
The original input data (pandas.DataFrame).
imputed_datasets
List of imputed datasets (list of pandas.DataFrames). Available after calling impute().
chain_mean
Dictionary mapping variable names to mean chains across iterations. Used for convergence diagnostics.
chain_var
Dictionary mapping variable names to variance chains across iterations. Used for convergence diagnostics.
id_obs
Dictionary mapping variable names to boolean arrays indicating observed values.
id_mis
Dictionary mapping variable names to boolean arrays indicating missing values.
Examples
Basic Imputation
from imputation import MICE
import pandas as pd
import numpy as np
# Create sample data
df = pd.DataFrame({
    'age': [25, 30, np.nan, 45, 50],
    'income': [50000, np.nan, 60000, 75000, np.nan],
    'education': ['HS', 'BS', 'MS', np.nan, 'PhD']
})
# Impute
mice = MICE(df)
mice.impute(n_imputations=5, maxit=10, method='pmm')
# Check results
print(f"Created {len(mice.imputed_datasets)} complete datasets")
Custom Methods Per Variable
method_dict = {
    'age': 'pmm',
    'income': 'cart',
    'education': 'sample'
}
mice.impute(n_imputations=10, method=method_dict)
Custom Predictor Matrix
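import numpy as np
# Create predictor matrix: ones everywhere except a zero diagonal,
# so that no variable is used to predict itself. Building it from a
# NumPy array avoids writing into pred_matrix.values, which can be
# read-only in recent pandas versions.
off_diagonal = 1 - np.eye(len(df.columns), dtype=int)
pred_matrix = pd.DataFrame(off_diagonal, index=df.columns, columns=df.columns)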
# Don't use education to predict income
pred_matrix.loc['income', 'education'] = 0
mice.impute(predictor_matrix=pred_matrix)
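To make the row and column roles concrete, the sketch below reads a predictor matrix the way the parameter description states it: each row is a target variable, and a 1 in a column marks that column as a predictor for it. The matrix is written as a plain nested dictionary so the example runs without pandas, and predictors_for is a hypothetical helper, not a mice-py function.

```python
def predictors_for(matrix, target):
    """List the variables flagged with 1 in the row of `target`."""
    return [name for name, flag in matrix[target].items() if flag == 1]


# Rows are targets, columns are predictors (row -> column -> flag).
# This mirrors the example above: education is not used to predict income.
matrix = {
    "age":       {"age": 0, "income": 1, "education": 1},
    "income":    {"age": 1, "income": 0, "education": 0},
    "education": {"age": 1, "income": 1, "education": 0},
}
print(predictors_for(matrix, "income"))
```

For a pandas DataFrame, pred_matrix.to_dict(orient='index') produces a nested dictionary of this shape, so the same helper can be used to inspect a matrix before passing it to impute().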
With Method-Specific Parameters
# PMM with more donors
mice.impute(method='pmm', pmm_donors=10)
# CART with depth limit
mice.impute(method='cart', cart_max_depth=15)
# Random Forest with more trees
mice.impute(method='rf', rf_n_estimators=200)
Complete Analysis Workflow
import pandas as pd
from imputation import MICE, configure_logging
from plotting.diagnostics import plot_chain_stats
# Enable logging
configure_logging(level='INFO')
# Load data
df = pd.read_csv('data.csv')
# Impute
mice = MICE(df)
mice.impute(n_imputations=20, maxit=20, method='pmm')
# Check convergence
plot_chain_stats(mice.chain_mean, mice.chain_var,
                 save_path='convergence.png')
# Fit model
mice.fit('outcome ~ age + gender + treatment')
# Pool results
results = mice.pool(summ=True)
print(results)
# Check FMI
print(f"\nMax FMI: {results['FMI'].max():.3f}")
See Also
Imputation Methods for imputation method details
Pooling Functions for pooling functions
MICE Overview for conceptual overview
Examples for more examples