Imputations
This package provides robust imputation methods including MICE, PMM, CART, and Random Forest for handling missing data in statistical analysis.
Main Class
MICE (Multiple Imputation by Chained Equations)
- class imputation.MICE.MICE(data)[source]
Bases: object
Multiple Imputation by Chained Equations (MICE) class.
This class implements the MICE algorithm for handling missing data through multiple imputations using chained equations.
- Parameters:
data (pd.DataFrame) – Input data with missing values. Must be a pandas DataFrame.
- data
The validated and cleaned input data
- Type:
pd.DataFrame
- __init__(data)[source]
Initialize the MICE object.
- Parameters:
data (pd.DataFrame) – Input data with missing values. Must be a pandas DataFrame.
- Raises:
ValueError – If data is not a pandas DataFrame or contains duplicate column names
- impute(n_imputations=5, maxit=10, predictor_matrix=None, initial='sample', method=None, visit_sequence='monotone', **kwargs)[source]
Perform multiple imputation by chained equations.
- Parameters:
n_imputations (int, default=5) – Number of imputations to perform
maxit (int, default=10) – Maximum number of iterations for each imputation cycle. Must be a positive integer.
predictor_matrix (pd.DataFrame, optional) – Binary matrix indicating which variables should be used as predictors for each target variable. Should have column names as both index and columns. A 1 indicates that the column variable is used as predictor for the index variable. If None, a predictor matrix is estimated using _quickpred.
initial (str, default='sample') – Initial imputation method. Must be one of SUPPORTED_INITIAL_METHODS.
method (Union[str, Dict[str, str]], optional) – Imputation method(s) to use. Each method must be one of SUPPORTED_METHODS.
  - str: use the same method for all columns
  - Dict[str, str]: dictionary mapping column names to their methods
  - None: use default method for all columns
visit_sequence (Union[str, List[str]], default="monotone") – Sequence in which variables should be visited during imputation:
  - str: "monotone" for monotone missing data pattern
  - List[str]: list of column names specifying the order to visit variables
**kwargs (dict) – Additional keyword arguments:
  - output_dir (str, optional): Directory to save outputs for this run. If not provided, a timestamped folder is created in output_figures.
  - Parameters for specific imputation methods can also be passed. These should be prefixed with the method name and an underscore, e.g., pmm_donors=5 to pass donors=5 to the pmm imputer.
  - When predictor_matrix is not specified, the following can be passed for _quickpred:
    - min_cor (float, default=0.1): Minimum correlation for a predictor.
    - min_puc (float, default=0.0): Minimum proportion of usable cases.
    - include (list, optional): Columns to always include as predictors.
    - exclude (list, optional): Columns to always exclude as predictors.
    - correlation_method (str, default="pearson"): Correlation method used to compute the correlation matrix inside _quickpred.
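The predictor_matrix layout described above (rows are targets, columns are predictors, 1 means "use as predictor") can be built by hand with pandas. The following is a minimal sketch: the toy data and column names are made up for illustration, and zeroing the diagonal is a common convention that is assumed here rather than taken from this package.

```python
import numpy as np
import pandas as pd

# Toy data with missing values; the column names are illustrative only.
data = pd.DataFrame({
    "age": [23.0, np.nan, 31.0, 45.0],
    "bmi": [21.5, 24.1, np.nan, 27.3],
    "chol": [180.0, 195.0, 210.0, np.nan],
})
cols = list(data.columns)

# Start from "every variable predicts every other variable".
predictor_matrix = pd.DataFrame(1, index=cols, columns=cols)

# Assumed convention: a variable does not predict itself.
for col in cols:
    predictor_matrix.loc[col, col] = 0

# Row = target, column = predictor: do not use "chol" when imputing "age".
predictor_matrix.loc["age", "chol"] = 0

print(predictor_matrix)
```

Going by the signature documented above, such a matrix would be passed as impute(predictor_matrix=predictor_matrix).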
- fit(formula)[source]
Fit a statistical model to each imputed dataset using the specified formula.
This method fits the specified statistical model to each dataset in self.imputed_datasets and stores the results in self.model_results.
- Parameters:
formula (str) – A formula string in patsy syntax for statsmodels (e.g., 'y ~ x1 + x2')
- Raises:
ValueError – If no imputed datasets are available or if variables in formula are not in data
Examples
>>> mice_obj = MICE(data)
>>> mice_obj.impute(n_imputations=5)
>>> mice_obj.fit('outcome ~ predictor1 + predictor2')
- pool(summ=False)[source]
Pool parameter estimates from fitted models using Rubin’s rules.
This method combines parameter estimates and their uncertainties from multiple imputed datasets according to Rubin’s (1987) rules for multiple imputation inference.
- Parameters:
summ (bool, default=False) – If True, returns a summary of the pooled results
- Returns:
If summ=False, returns a MICEresult object containing pooled estimates. If summ=True, returns a summary table of the pooled results.
- Return type:
MICEresult or summary
- Raises:
ValueError – If no model results are available from analysis
Notes
Rubin’s pooling rules combine:
  - Point estimates: average across imputations
  - Within-imputation variance: average of individual model variances
  - Between-imputation variance: variance of point estimates across imputations
  - Total variance: within + (1 + 1/m) * between, where m is the number of imputations
  - Fraction of missing information (FMI): proportion of uncertainty due to missingness
References
Rubin, D.B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: John Wiley and Sons.
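The pooling rules listed in the Notes can be sketched in a few lines of NumPy. This is an illustration only, not the package's implementation: pool_rubin is a hypothetical helper, and the last quantity is the large-sample share of variance due to missingness, without the degrees-of-freedom correction that a full FMI computation may apply.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool per-imputation estimates and variances (hypothetical helper)."""
    estimates = np.asarray(estimates, dtype=float)  # shape (m, n_params)
    variances = np.asarray(variances, dtype=float)  # shape (m, n_params)
    m = estimates.shape[0]                          # needs m >= 2

    q_bar = estimates.mean(axis=0)        # pooled point estimate
    u_bar = variances.mean(axis=0)        # within-imputation variance
    b = estimates.var(axis=0, ddof=1)     # between-imputation variance
    t = u_bar + (1.0 + 1.0 / m) * b       # total variance
    lam = (1.0 + 1.0 / m) * b / t         # share of variance due to missingness
    return q_bar, t, u_bar, b, lam

# Three imputations of a single parameter.
est = [[1.0], [2.0], [3.0]]
var = [[0.5], [0.5], [0.5]]
q_bar, t, u_bar, b, lam = pool_rubin(est, var)
print(q_bar, t, lam)
```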
Imputation Methods
CART (Classification and Regression Trees)
- imputation.cart.cart(y, id_obs, x, id_mis=None, min_samples_leaf=5, ccp_alpha=0.0001, rng=None, **kwargs)[source]
Impute missing values using Classification and Regression Trees (CART).
This function is designed to be compatible with the MICE framework.
- Parameters:
y (Union[pd.Series, np.ndarray]) – Target variable with missing values
id_obs (np.ndarray) – Boolean mask of observed values in y (True for observed, False for missing)
x (Union[pd.DataFrame, np.ndarray]) – Predictor variables (must be fully observed)
id_mis (np.ndarray, optional) – Boolean mask of missing values to impute. If None, uses ~id_obs
min_samples_leaf (int, default=5) – Minimum number of samples required to be at a leaf node
ccp_alpha (float, default=1e-4) – Complexity parameter for pruning
rng (np.random.Generator, optional) – Random number generator for reproducibility. If None, a fresh generator is used.
**kwargs (dict) – Additional parameters passed to the tree model
- Returns:
Imputed values for missing positions only (matching R implementation).
- Return type:
np.ndarray
Notes
The procedure follows R’s mice CART implementation:
  1. Bootstrap the observed cases (sample with replacement)
  2. Fit a classification or regression tree on the bootstrap sample
  3. For each missing value, find the terminal node it would end up in
  4. Make a random draw from the ORIGINAL observed values in that node
This adds stochasticity through both bootstrapping and donor sampling.
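The four steps above can be sketched for a numeric target with scikit-learn. This is a minimal illustration under stated assumptions: cart_impute_sketch is a hypothetical name, the use of DecisionTreeRegressor is assumed from the min_samples_leaf and ccp_alpha parameters, and the categorical case is omitted.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def cart_impute_sketch(y, id_obs, x, min_samples_leaf=5, ccp_alpha=1e-4, rng=None):
    """Hypothetical sketch of CART donor imputation for a numeric target."""
    rng = np.random.default_rng() if rng is None else rng
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    id_obs = np.asarray(id_obs, dtype=bool)
    x_obs, y_obs, x_mis = x[id_obs], y[id_obs], x[~id_obs]
    if x_mis.shape[0] == 0:
        return np.empty(0)

    # 1. Bootstrap the observed cases (sample row indices with replacement).
    boot = rng.integers(0, len(y_obs), size=len(y_obs))
    # 2. Fit a regression tree on the bootstrap sample.
    tree = DecisionTreeRegressor(
        min_samples_leaf=min_samples_leaf,
        ccp_alpha=ccp_alpha,
        random_state=int(rng.integers(0, 2**31 - 1)),
    )
    tree.fit(x_obs[boot], y_obs[boot])
    # 3. Terminal node of every observed case and every missing case.
    leaf_obs = tree.apply(x_obs)
    leaf_mis = tree.apply(x_mis)
    # 4. Draw one donor from the ORIGINAL observed values in the same node.
    return np.array([rng.choice(y_obs[leaf_obs == leaf]) for leaf in leaf_mis])

# Simulated data: the first 10 values of y are treated as missing.
sim = np.random.default_rng(0)
x = sim.normal(size=(60, 2))
y_full = x[:, 0] + sim.normal(scale=0.1, size=60)
id_obs = np.ones(60, dtype=bool)
id_obs[:10] = False
y = np.where(id_obs, y_full, np.nan)
imputed = cart_impute_sketch(y, id_obs, x, rng=np.random.default_rng(1))
```

Because donors are drawn from observed values, every imputed value is one that actually occurs in the data.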
Random Forest
- imputation.rf.rf(y, id_obs, x, id_mis=None, n_estimators=10, rng=None, **kwargs)[source]
Impute missing values using Random Forests with donor sampling.
This function is designed to be compatible with the MICE framework, following the same interface as PMM, midas, CART, and sample methods.
- Parameters:
y (Union[pd.Series, np.ndarray]) – Target variable with missing values
id_obs (np.ndarray) – Boolean mask of observed values in y (True = observed, False = missing)
x (Union[pd.DataFrame, np.ndarray]) – Predictor variables (should be the current completed columns)
id_mis (np.ndarray, optional) – Boolean mask of missing values. If None, uses ~id_obs.
n_estimators (int, default=10) – Number of trees in the forest
rng (np.random.Generator, optional) – Random number generator for reproducibility. If None, a fresh generator is used.
**kwargs (dict) – Additional parameters passed to the random forest model.
- Returns:
Imputed values for missing positions only.
- Return type:
np.ndarray
Notes
Algorithm (Doove et al., 2014; mirrors R mice):
  1. Fit a random forest on observed data.
  2. For each missing case, find terminal nodes across all trees.
  3. For each tree, collect donors (observed cases in same node).
  4. Randomly sample one donor per tree.
  5. Take final imputation as a random draw from those donor predictions.
Bootstrapping is inherent to Random Forest (bagging), so no additional bootstrap is applied (matching R mice behavior). Each tree is already built on a bootstrap sample of the data.
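The donor-sampling steps in the Notes can be sketched for a numeric target with scikit-learn. This is an illustration under stated assumptions, not the package's code: rf_impute_sketch is a hypothetical name, RandomForestRegressor is assumed as the forest, and min_samples_leaf=5 is chosen here only so that each leaf holds several donors.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_impute_sketch(y, id_obs, x, n_estimators=10, rng=None):
    """Hypothetical sketch of random-forest donor imputation (numeric y)."""
    rng = np.random.default_rng() if rng is None else rng
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    id_obs = np.asarray(id_obs, dtype=bool)
    x_obs, y_obs, x_mis = x[id_obs], y[id_obs], x[~id_obs]
    if x_mis.shape[0] == 0:
        return np.empty(0)

    # 1. Fit the forest on observed data (each tree bootstraps internally).
    forest = RandomForestRegressor(
        n_estimators=n_estimators,
        min_samples_leaf=5,  # assumption made for this sketch
        random_state=int(rng.integers(0, 2**31 - 1)),
    )
    forest.fit(x_obs, y_obs)

    # 2. Leaf index of each case in each tree: shape (n_cases, n_trees).
    leaves_obs = forest.apply(x_obs)
    leaves_mis = forest.apply(x_mis)

    imputed = np.empty(x_mis.shape[0])
    for i, leaves in enumerate(leaves_mis):
        # 3.-4. One donor per tree, drawn from observed cases in the same leaf.
        per_tree = [
            rng.choice(y_obs[leaves_obs[:, t] == leaf])
            for t, leaf in enumerate(leaves)
        ]
        # 5. Final imputation: a random draw among the per-tree donors.
        imputed[i] = rng.choice(per_tree)
    return imputed

# Simulated data: the first 10 values of y are treated as missing.
sim = np.random.default_rng(0)
x = sim.normal(size=(60, 2))
y_full = x[:, 0] + sim.normal(scale=0.1, size=60)
id_obs = np.ones(60, dtype=bool)
id_obs[:10] = False
y = np.where(id_obs, y_full, np.nan)
imputed = rf_impute_sketch(y, id_obs, x, rng=np.random.default_rng(1))
```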
Sample Method
- imputation.sample.sample(y, id_obs, x, id_mis=None, rng=None, **kwargs)[source]
Impute missing values by random sampling from observed values.
This function is designed to be compatible with the MICE framework, following the same interface as PMM, midas, and CART imputation methods.
- Parameters:
y (Union[pd.Series, np.ndarray]) – Target variable with missing values
id_obs (np.ndarray) – Boolean mask of observed values in y (True for observed, False for missing)
x (Union[pd.DataFrame, np.ndarray]) – Predictor variables (not used in this method, but kept for consistency)
id_mis (np.ndarray, optional) – Boolean mask of missing values to impute. If None, uses ~id_obs
rng (np.random.Generator, optional) – Random number generator for reproducibility. If None, a fresh generator is used.
**kwargs (dict) – Additional arguments (not used in this method)
- Returns:
Imputed values for missing positions only (matching R implementation).
- Return type:
np.ndarray
Notes
This is the simplest imputation method. No modeling is involved, only random sampling with replacement:
  1. Take all observed values in the target variable
  2. Randomly sample from them to fill in missing values
This method ignores the predictor variables (x) and only uses the observed values of the target variable for imputation.
Edge cases handled (matching R implementation):
  - If no observed values: returns random normal values for numeric data, None values for categorical data
  - If only one observed value: duplicates it to allow sampling
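The core of this method fits in a few lines of NumPy. The sketch below uses a hypothetical helper name and deliberately leaves out the edge cases listed above (no observed values, a single observed value).

```python
import numpy as np

def sample_impute_sketch(y, id_obs, rng=None):
    """Hypothetical sketch: fill missing cells with random observed values."""
    rng = np.random.default_rng() if rng is None else rng
    y = np.asarray(y)
    id_obs = np.asarray(id_obs, dtype=bool)
    donors = y[id_obs]                 # all observed values of the target
    n_mis = int((~id_obs).sum())       # number of cells to fill
    # Draw with replacement so any number of missing cells can be filled.
    return rng.choice(donors, size=n_mis, replace=True)

y = np.array([7.0, np.nan, 9.0, np.nan, 11.0])
id_obs = ~np.isnan(y)
imputed = sample_impute_sketch(y, id_obs, rng=np.random.default_rng(42))
print(imputed)
```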
PMM (Predictive Mean Matching)
- imputation.PMM.pmm(y, id_obs, x, id_mis=None, donors=5, matchtype=1, quantify=True, ridge=1e-05, matcher='NN', rng=None, **kwargs)[source]
Predictive Mean Matching (PMM) imputation.
This function imputes missing values in a variable y using predictive mean matching. The method is based on Rubin’s (1987) Bayesian linear regression and mimics the behavior of the R mice package’s PMM imputation method.
- Parameters:
y (array-like (1D), shape (n_samples,)) – Target variable to be imputed. Can be numeric or categorical.
id_obs (array-like of bool, shape (n_samples,)) – Logical array indicating which elements of y are observed (True) or missing (False).
x (array-like (2D), shape (n_samples, n_features)) – Numeric design matrix of predictors. Must have no missing values.
id_mis (array-like of bool, shape (n_samples,), optional) – Logical array indicating which values should be imputed. If None, id_mis is set to the complement of id_obs.
donors (int, default=5) – Number of donors to draw from the observed cases when imputing missing values.
matchtype (int, default=1) – Type of matching:
  - 0: Predicted value of y_obs vs predicted value of y_mis
  - 1: Predicted value of y_obs vs drawn value of y_mis (default)
  - 2: Drawn value of y_obs vs drawn value of y_mis
quantify (bool, default=True) – If True and y is categorical, factor levels are replaced by the first canonical variate (via CCA). If False, categorical values are replaced by integer codes (less accurate).
ridge (float, default=1e-5) – Ridge regularization parameter used in norm_draw() to stabilize estimation. Increase for multicollinear data, decrease to reduce bias.
matcher (str, default="NN") – Matching method. Currently only “NN” (nearest neighbor) is supported.
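rng (np.random.Generator, optional) – Random number generator for reproducibility. If None, a fresh generator is used.
**kwargs (dict) – Additional arguments passed to norm_draw(), such as ls_meth.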
- Returns:
y_imp – Imputed values for missing positions only (matching R implementation). Returns object array if y was categorical, else float array.
- Return type:
np.ndarray
Notes
Based on:
  - Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys.
  - Van Buuren, S. & Groothuis-Oudshoorn, K. (2011). mice R package.
Examples
>>> y = np.array([7, np.nan, 9, 10, 11])
>>> id_obs = ~np.isnan(y)
>>> x = np.array([[1, 2], [3, 4], [5, 7], [7, 8], [9, 10]])
>>> pmm(y=y, id_obs=id_obs, x=x, donors=3)
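To make the matching idea concrete, here is a simplified sketch of predictive mean matching for a numeric target. It corresponds most closely to matchtype=0 (predicted vs predicted values) and is an assumption-laden illustration: it uses plain least squares, with no Bayesian parameter draw, no ridge term, and no handling of categorical targets, and pmm_sketch is a hypothetical name.

```python
import numpy as np

def pmm_sketch(y, id_obs, x, donors=5, rng=None):
    """Hypothetical, simplified PMM: match on predicted means, impute observed values."""
    rng = np.random.default_rng() if rng is None else rng
    y = np.asarray(y, dtype=float)
    id_obs = np.asarray(id_obs, dtype=bool)
    # Add an intercept column to the design matrix.
    design = np.column_stack([np.ones(len(y)), np.asarray(x, dtype=float)])

    # Least-squares fit on the observed cases only.
    beta, *_ = np.linalg.lstsq(design[id_obs], y[id_obs], rcond=None)
    yhat_obs = design[id_obs] @ beta
    yhat_mis = design[~id_obs] @ beta

    y_obs = y[id_obs]
    k = min(donors, len(y_obs))
    imputed = np.empty(len(yhat_mis))
    for i, target in enumerate(yhat_mis):
        # Indices of the k observed cases with the closest predicted means.
        nearest = np.argsort(np.abs(yhat_obs - target))[:k]
        # Impute the OBSERVED value of one randomly chosen donor.
        imputed[i] = y_obs[rng.choice(nearest)]
    return imputed

y = np.array([7, np.nan, 9, 10, 11])
id_obs = ~np.isnan(y)
x = np.array([[1, 2], [3, 4], [5, 7], [7, 8], [9, 10]])
imputed = pmm_sketch(y, id_obs, x, donors=3, rng=np.random.default_rng(0))
```

The imputed value is always one of the observed values of y, which is what distinguishes PMM from imputing the regression prediction itself.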
- imputation.PMM.quantify_cca(y, id_obs, x)[source]
Factorize a categorical variable y into numeric values via optimal scaling using Canonical Correlation Analysis (CCA) with predictors x.
- Parameters:
y (array-like) – Categorical variable with missing values.
id_obs (boolean array-like) – Mask indicating observed (True) and missing (False) entries in y.
x (array-like or DataFrame) – Predictors without missing values, corresponding to y.
- Returns:
ynum (numpy.ndarray) – Numeric transformation of y with missing positions as np.nan.
id (pandas.DataFrame) – DataFrame representing the canonical components for the observed y.
Notes
This method encodes y as one-hot vectors, then applies CCA to find numeric representations that maximize correlation with predictors x.
- imputation.PMM.matcherid(d, t, matcher='NN', k=10, radius=3, rng=None)[source]
Find donor indices matching missing values based on specified matching method.
- Parameters:
d (np.array) – Numeric vector of observed values (donor pool).
t (np.array) – Numeric vector of missing values to be matched.
matcher (str, optional) – Matching method to use:
  - "NN": Randomly selects one from the k nearest neighbors (default).
  - "fixedNN": Randomly selects one donor within a fixed radius.
k (int, optional) – Number of nearest neighbors to consider (only for “NN” matcher).
radius (float, optional) – Radius threshold for fixedNN matcher (only for “fixedNN” matcher).
rng (np.random.Generator, optional) – Random number generator for reproducibility. If None, a fresh generator is used.
- Returns:
List of indices corresponding to chosen donors in d for each element in t.
- Return type:
list
- Raises:
ValueError – If an unknown matcher method is specified.
Examples
>>> d = np.array([-5, 6, 0, 10, 12])
>>> t = np.array([-6])
>>> matcherid(d, t, matcher="NN", k=3)  # random draw among the 3 nearest donors; output may vary
[0]
>>> matcherid(d, t, matcher="fixedNN", radius=5)
[0]
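The two matchers described above can be sketched as follows. match_donors_sketch is a hypothetical stand-in, not the package's matcherid; in particular, falling back to the single closest donor when no donor lies within the radius is an assumption made for this example.

```python
import numpy as np

def match_donors_sketch(d, t, matcher="NN", k=10, radius=3, rng=None):
    """Hypothetical sketch: pick one donor index in d for each target in t."""
    rng = np.random.default_rng() if rng is None else rng
    d = np.asarray(d, dtype=float)
    t = np.asarray(t, dtype=float)
    chosen = []
    for target in t:
        dist = np.abs(d - target)
        if matcher == "NN":
            # Candidate pool: the k donors closest to the target.
            pool = np.argsort(dist)[: min(k, len(d))]
        elif matcher == "fixedNN":
            # Candidate pool: every donor within the fixed radius.
            pool = np.flatnonzero(dist <= radius)
            if pool.size == 0:
                # Assumption: fall back to the single closest donor.
                pool = np.array([np.argmin(dist)])
        else:
            raise ValueError(f"Unknown matcher: {matcher!r}")
        chosen.append(int(rng.choice(pool)))
    return chosen

d = np.array([-5, 6, 0, 10, 12])
t = np.array([-6])
nn_pick = match_donors_sketch(d, t, matcher="NN", k=3, rng=np.random.default_rng(0))
fixed_pick = match_donors_sketch(d, t, matcher="fixedNN", radius=5)
```

With radius=5 only the donor at index 0 lies within range of -6, so the "fixedNN" result is deterministic, whereas "NN" with k=3 may return any of the three closest indices.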
MIDAS (Multiple Imputation by Distance Aided Donor Selection)
- imputation.midas.bootfunc_plain(n)[source]
Generates bootstrap weights for n observations using simple random sampling with replacement.
This function simulates a nonparametric bootstrap by randomly drawing n integers from the range 1 to n (inclusive), with replacement. It returns the count of how many times each index (1-based) is selected, producing a frequency table that can be used as weights in e.g. MIDAS imputation.
- Parameters:
n (int) – The number of observations to sample and also the length of the resulting weight vector.
- Returns:
weights – An array of integers indicating how often each index (1-based) was selected in the bootstrap sample.
- Return type:
ndarray of shape (n,)
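The frequency-table idea can be sketched with NumPy. The helper name and its rng argument are additions for this example (the documented bootfunc_plain takes only n), and the sketch indexes from 0 where the description above counts from 1.

```python
import numpy as np

def bootstrap_weights_sketch(n, rng=None):
    """Hypothetical sketch: counts of how often each index is drawn."""
    rng = np.random.default_rng() if rng is None else rng
    # Draw n indices with replacement from 0..n-1.
    draws = rng.integers(0, n, size=n)
    # Count occurrences; minlength keeps zero counts for undrawn indices.
    return np.bincount(draws, minlength=n)

weights = bootstrap_weights_sketch(8, rng=np.random.default_rng(0))
print(weights, weights.sum())
```

The weights always sum to n, and an index that was never drawn gets weight 0.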
- imputation.midas.midas(y, id_obs, x, id_mis=None, ridge=1e-05, midas_kappa=None, outout=True, **kwargs)[source]
MIDAS Imputation: Multiple Imputation by Distance Aided Donor Selection.
This function implements the MIDAS imputation algorithm for continuous variables, as introduced by Gaffert et al. (2018).
It operates by weighting observed donors based on the similarity between predicted values, with optional leave-one-out model estimation for increased fidelity.
- Parameters:
y (array-like of shape (n_samples,)) – The target variable with missing values to be imputed. Must be numeric.
id_obs (array-like of bool of shape (n_samples,)) – Logical array indicating observed values in y. True where y is observed, False where missing.
x (array-like of shape (n_samples, n_features)) – Design matrix of predictor variables. Must be fully observed.
id_mis (np.ndarray, optional) – Boolean mask of missing values to impute. If None, uses ~id_obs.
ridge (float, default=1e-5) – Ridge penalty used in regularized regression to stabilize the solution in the presence of multicollinearity.
  - Set lower (e.g. 1e-6) to reduce bias in noisy data.
  - Set higher (e.g. 1e-4) if collinearity is suspected.
midas_kappa (float or None, default=None) – Controls the sharpness of donor weighting. If None, the optimal value is estimated based on R² as described by Siddique and Belin (2008). A common fallback is 3.
outout (bool, default=True) – If True, uses leave-one-out regression for each donor (slow but MI-proper). If False, a single model is estimated for all donors and recipients. WARNING: Setting outout=False may produce biased estimates and is not fully supported.
**kwargs (dict) – Additional arguments (not used in this method).
- Returns:
y_imp – Imputed values for missing positions only (matching R implementation).
- Return type:
np.ndarray
Notes
Based on: Gaffert, P., Meinfelder, F., & van den Bosch, V. (2018). “Towards an MI-proper Predictive Mean Matching.”
Related: Siddique, J. & Belin, T. R. (2008). “Multiple Imputation Using an Iterative Hot-Deck with Distance-Based Donor Selection.”
Examples
>>> y = np.array([7, np.nan, 9, 10, 11])
>>> id_obs = ~np.isnan(y)
>>> x = np.array([[1, 2], [3, 4], [5, 6], [7, 13], [11, 10]])
>>> midas(y, id_obs, x)
array([9.0])
Utilities and Support
Result Classes
Pooled results container for MICE following Rubin’s rules.
It lives in its own module so that it can be reused and so that MICE.py stays lighter.
Pooling Functions
Standalone pooling module for multiple imputation results.
This module provides functions to pool descriptive statistics and model estimates from multiple imputed datasets using Rubin’s rules, without requiring coupling to any specific imputation framework.
- class imputation.pooling.PoolingResult(estimates, variances, within_variance, between_variance, frac_miss_info, param_names, n_imputations, sample_size)[source]
Bases: object
Container for pooled multiple imputation results.
- estimates
Pooled parameter estimates (q_bar)
- Type:
np.ndarray
- variances
Total variances for each parameter (t)
- Type:
np.ndarray
- within_variance
Average within-imputation variance (u_bar)
- Type:
np.ndarray
- between_variance
Between-imputation variance (b)
- Type:
np.ndarray
- frac_miss_info
Fraction of missing information for each parameter
- Type:
np.ndarray
- summary()[source]
Return a summary DataFrame with pooled statistics.
- Returns:
Summary table with estimates, standard errors, and diagnostics
- Return type:
pd.DataFrame
- __init__(estimates, variances, within_variance, between_variance, frac_miss_info, param_names, n_imputations, sample_size)
- imputation.pooling.validate_imputed_datasets(datasets)[source]
Validate that the input datasets are suitable for pooling.
- Parameters:
datasets (List[pd.DataFrame]) – List of imputed datasets to validate
- Raises:
ValueError – If datasets are invalid for pooling
- imputation.pooling.apply_rubins_rules(estimates, variances)[source]
Apply Rubin’s rules to combine estimates and variances across imputations.
- Parameters:
estimates (np.ndarray) – Array of shape (n_imputations, n_parameters) with parameter estimates
variances (np.ndarray) – Array of shape (n_imputations, n_parameters) with within-imputation variances
- Returns:
(pooled_estimates, total_variances, within_variance, between_variance)
- Return type:
tuple
- imputation.pooling.pool_descriptive_statistics(datasets, include_numeric=True, include_categorical=True)[source]
Pool descriptive statistics across multiple imputed datasets using Rubin’s rules.
For numeric columns, pools the sample mean and its variance. For categorical columns, pools the per-level proportions and their variances.
- Parameters:
datasets (List[pd.DataFrame]) – List of complete imputed datasets. All datasets must have the same shape and column names.
include_numeric (bool, default=True) – Whether to include numeric columns in pooling
include_categorical (bool, default=True) – Whether to include categorical columns in pooling
- Returns:
Object containing pooled estimates, variances, and diagnostic statistics
- Return type:
PoolingResult
- Raises:
ValueError – If datasets are invalid or no columns are available for pooling
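Before Rubin's rules can be applied, each imputed dataset has to yield an estimate and a within-imputation variance. The sketch below shows one plausible way to compute these inputs for a numeric mean and a categorical proportion. The helper names are hypothetical, and the variance formulas (s²/n for a mean, p(1-p)/n for a proportion) are standard textbook choices assumed here, not confirmed details of this package.

```python
import numpy as np
import pandas as pd

def mean_and_variance(datasets, column):
    """Per-dataset sample mean and the variance of that mean (s^2 / n)."""
    est = np.array([df[column].mean() for df in datasets])
    var = np.array([df[column].var(ddof=1) / len(df) for df in datasets])
    return est, var

def proportion_and_variance(datasets, column, level):
    """Per-dataset proportion of one level and its variance p(1-p)/n."""
    p = np.array([(df[column] == level).mean() for df in datasets])
    n = np.array([len(df) for df in datasets])
    return p, p * (1 - p) / n

# Two toy "imputed" datasets with identical shape and column names.
datasets = [
    pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "g": ["a", "b", "a", "a"]}),
    pd.DataFrame({"x": [1.0, 2.0, 3.0, 6.0], "g": ["a", "b", "b", "a"]}),
]
mean_est, mean_var = mean_and_variance(datasets, "x")
prop_est, prop_var = proportion_and_variance(datasets, "g", "a")
```

The resulting arrays have one entry per imputation and could be stacked into the (n_imputations, n_parameters) shape that apply_rubins_rules expects.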
- imputation.pooling.pool_from_files(file_paths, read_kwargs=None, **pooling_kwargs)[source]
Pool descriptive statistics from datasets stored in files.
- Parameters:
- Returns:
Pooled results from the datasets
- Return type:
- imputation.pooling.pool_subset(datasets, columns=None, **pooling_kwargs)[source]
Pool descriptive statistics for a subset of columns.
- Parameters:
datasets (List[pd.DataFrame]) – List of complete imputed datasets
columns (List[str], optional) – List of column names to include in pooling. If None, uses all columns.
**pooling_kwargs – Additional arguments to pass to pool_descriptive_statistics()
- Returns:
Pooled results for the specified columns
- Return type:
Validation Functions
- imputation.validators.check_n_imputations(n_imputations)[source]
Check if the number of imputations is valid and provide a warning if it’s high.
- Parameters:
n_imputations (int) – Number of imputations to perform
- Raises:
ValueError – If n_imputations is not a positive integer
- imputation.validators.check_maxit(maxit)[source]
Check if the maximum iterations parameter is valid and provide a warning if it’s high.
- Parameters:
maxit (int) – Maximum number of iterations for each imputation cycle
- Raises:
ValueError – If maxit is not a positive integer
- imputation.validators.check_method(method, columns)[source]
Check and process the method parameter for MICE imputation.
- Parameters:
- Returns:
Dictionary mapping each column to its imputation method
- Return type:
dict
- Raises:
ValueError – If method is invalid or references non-existent columns
- imputation.validators.check_initial_method(initial_method)[source]
Check if the initial imputation method is valid.
- Parameters:
initial_method (str) – Initial imputation method to validate
- Raises:
ValueError – If initial_method is not a valid initial imputation method
- imputation.validators.check_visit_sequence(visit_sequence, columns, columns_with_missing=None)[source]
Check and process the visit sequence parameter for MICE imputation.
- Parameters:
visit_sequence (Union[str, List[str]]) – Visit sequence specification. Can be:
  - str: "monotone" or "random" for predefined sequences
  - List[str]: list of column names specifying the order to visit variables
columns (List[str]) – List of all column names in the data
columns_with_missing (List[str], optional) – List of columns that have missing values. If provided, will validate that all these columns are included in a custom visit sequence.
- Returns:
(validated_sequence, columns_without_missing) where:
  - validated_sequence: the processed visit sequence (only for list input, None for string)
  - columns_without_missing: list of columns in sequence that don’t have missing values
- Return type:
tuple
- Raises:
ValueError – If visit_sequence is invalid or references non-existent columns
Notes
For string visit sequences (“monotone”, “random”), the actual sequence will be generated in MICE._set_visit_sequence() based on the data.
For list visit sequences, this function validates that:
  1. All columns exist in the data
  2. No duplicate columns
  3. All columns with missing values are included (if columns_with_missing provided)
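The three checks for list input can be sketched in plain Python. check_visit_list_sketch is a hypothetical helper written for illustration; the error messages and the exact return shape are assumptions modeled on the description above.

```python
def check_visit_list_sketch(visit_sequence, columns, columns_with_missing=None):
    """Hypothetical sketch of validating a list-valued visit sequence."""
    # 1. All columns must exist in the data.
    unknown = [c for c in visit_sequence if c not in columns]
    if unknown:
        raise ValueError(f"Columns not in data: {unknown}")
    # 2. No column may appear twice.
    if len(set(visit_sequence)) != len(visit_sequence):
        raise ValueError("visit_sequence contains duplicate columns")
    # 3. Every column with missing values must be visited.
    without_missing = []
    if columns_with_missing is not None:
        omitted = [c for c in columns_with_missing if c not in visit_sequence]
        if omitted:
            raise ValueError(f"Columns with missing values not visited: {omitted}")
        without_missing = [c for c in visit_sequence if c not in columns_with_missing]
    return list(visit_sequence), without_missing

result = check_visit_list_sketch(["a", "b"], ["a", "b", "c"], columns_with_missing=["a"])
print(result)
```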
- imputation.validators.validate_predictor_matrix(predictor_matrix, data_columns, data)[source]
Validate predictor matrix for MICE imputation.
- Parameters:
predictor_matrix (pd.DataFrame) – Binary matrix indicating which variables should be used as predictors for each target variable. Rows represent target variables, columns represent predictors. A 1 indicates that the column variable is used as predictor for the index variable.
data_columns (List[str]) – List of column names in the data to validate against
data (pd.DataFrame) – The data to check for missing values
- Returns:
Validated predictor matrix
- Return type:
pd.DataFrame
- Raises:
ValueError – If predictor_matrix has invalid structure or column names don’t match data
- imputation.validators.validate_columns(data)[source]
Validate and clean columns in the DataFrame.
Checks for columns with only NaN values and drops them with appropriate warnings.
- Parameters:
data (pd.DataFrame) – Input DataFrame to validate
- Returns:
DataFrame with invalid columns removed
- Return type:
pd.DataFrame
- Warns:
UserWarning – If columns with only NaN values are found and dropped
Notes
Missing data values that are treated as NaN: - pandas NaN (numpy.nan)
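Dropping all-NaN columns with a warning can be sketched with pandas and the warnings module. The helper name and the warning text are hypothetical, and the sketch assumes unique column names.

```python
import warnings

import numpy as np
import pandas as pd

def drop_all_nan_columns_sketch(data):
    """Hypothetical sketch: drop columns containing only NaN, with a warning."""
    empty = [c for c in data.columns if data[c].isna().all()]
    if empty:
        warnings.warn(f"Dropping columns with only NaN values: {empty}", UserWarning)
    return data.drop(columns=empty)

df = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, np.nan]})
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure the warning is recorded
    cleaned = drop_all_nan_columns_sketch(df)
print(list(cleaned.columns))
```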
- imputation.validators.validate_dataframe(data)[source]
Check and validate input data for MICE imputation.
- Parameters:
data (Any) – Input data to be checked and converted to DataFrame
- Returns:
Validated and cleaned DataFrame
- Return type:
pd.DataFrame
- Raises:
ValueError – If data cannot be converted to DataFrame or has duplicate column names
Notes
Missing data values that are treated as NaN: - pandas NaN (numpy.nan)
- imputation.validators.validate_formula(formula, columns)[source]
Validate that all variables in the formula exist in the dataset columns.
- Parameters:
- Raises:
ValueError – If any variables in the formula are not found in the columns
Configuration
Logging Configuration
Logging configuration module for the imputation package.
This module provides proper logging setup following Python best practices, allowing users to configure logging behavior without affecting the global logging state.
- imputation.logging_config.setup_logging(level='INFO', log_dir=None, console=True, file_logging=True, format_string=None, console_level=None, file_level=None, max_bytes=5242880, backup_count=5)[source]
Configure logging for the imputation package.
This function sets up a package-specific logger without affecting the root logger or other packages. It’s safe to call multiple times.
- Parameters:
level (str or int, default="INFO") – Base logging level for the package ('DEBUG', 'INFO', 'WARNING', 'ERROR') or logging level constant (e.g., logging.DEBUG)
log_dir (str, optional) – Directory for log files. If None, uses './logs'
console (bool, default=True) – Whether to enable console logging
file_logging (bool, default=True) – Whether to enable file logging
format_string (str, optional) – Custom format string for log messages. If None, uses default format
console_level (str or int, optional) – Logging level for console handler. If None, uses 'INFO'
file_level (str or int, optional) – Logging level for file handler. If None, uses 'DEBUG'
max_bytes (int, default=5242880) – Maximum size of log file in bytes before rotation (5 MB by default)
backup_count (int, default=5) – Number of backup log files to keep
- Returns:
The configured package logger
- Return type:
logging.Logger
Examples
Basic usage:
>>> import imputation
>>> logger = imputation.setup_logging()
Custom configuration:
>>> logger = imputation.setup_logging(
...     level='DEBUG',
...     log_dir='my_logs',
...     console_level='WARNING'
... )
Disable file logging:
>>> logger = imputation.setup_logging(file_logging=False)
- imputation.logging_config.get_logger(name)[source]
Get a logger for a specific module within the imputation package.
This function returns a child logger of the main package logger, ensuring proper hierarchy and inheritance of configuration.
- Parameters:
name (str) – Module name, typically __name__ or a descriptive string
- Returns:
A logger instance for the specified module
- Return type:
logging.Logger
Examples
In a module file:
>>> from imputation.logging_config import get_logger
>>> logger = get_logger(__name__)
>>> logger.info("This is a log message")
For simulation scripts:
>>> logger = get_logger('imputation.simulation.fdgs')