Imputation Methods

This page documents the five imputation methods available in mice-py.

PMM: Predictive Mean Matching

imputation.PMM.pmm(y, id_obs, x, id_mis=None, donors=5, matchtype=1, quantify=True, ridge=1e-05, matcher='NN', rng=None, **kwargs)[source]

Predictive Mean Matching (PMM) imputation.

This function imputes missing values in a variable y using predictive mean matching. The method is based on Rubin’s (1987) Bayesian linear regression and mimics the behavior of the R mice package’s PMM imputation method.

Parameters:
  • y (array-like (1D), shape (n_samples,)) – Target variable to be imputed. Can be numeric or categorical.

  • id_obs (array-like of bool, shape (n_samples,)) – Logical array indicating which elements of y are observed (True) or missing (False).

  • x (array-like (2D), shape (n_samples, n_features)) – Numeric design matrix of predictors. Must have no missing values.

  • id_mis (array-like of bool, shape (n_samples,), optional) – Logical array indicating which values should be imputed. If None, id_mis is set to the complement of id_obs.

  • donors (int, default=5) – Number of donors to draw from the observed cases when imputing missing values.

  • matchtype (int, default=1) – Type of matching:
      - 0: Predicted value of y_obs vs predicted value of y_mis
      - 1: Predicted value of y_obs vs drawn value of y_mis (default)
      - 2: Drawn value of y_obs vs drawn value of y_mis

  • quantify (bool, default=True) – If True and y is categorical, factor levels are replaced by the first canonical variate (via CCA). If False, categorical values are replaced by integer codes (less accurate).

  • ridge (float, default=1e-5) – Ridge regularization parameter used in norm_draw() to stabilize estimation. Increase for multicollinear data, decrease to reduce bias.

  • matcher (str, default="NN") – Matching method. Currently only “NN” (nearest neighbor) is supported.

  • **kwargs (dict) – Additional arguments passed to norm_draw(), such as ls_meth.

Returns:

y_imp – Imputed values for missing positions only (matching R implementation). Returns object array if y was categorical, else float array.

Return type:

np.ndarray

Notes

Based on:

  • Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys.

  • Van Buuren, S. & Groothuis-Oudshoorn, K. (2011). mice R package.
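The matching step can be sketched in a few lines of NumPy. This is a minimal illustration of matchtype=1 for a numeric y, not the mice-py implementation: the name pmm_sketch, the added intercept column, and the exact form of the ridge penalty and posterior draw are assumptions made for the example.

```python
import numpy as np

def pmm_sketch(y, id_obs, x, donors=5, ridge=1e-5, rng=None):
    """Simplified PMM (matchtype=1) for numeric y. Hypothetical helper."""
    rng = np.random.default_rng() if rng is None else rng
    y = np.asarray(y, dtype=float)
    id_obs = np.asarray(id_obs, dtype=bool)
    # Assumption: an intercept column is prepended to the predictors.
    x = np.column_stack([np.ones(len(y)), np.asarray(x, dtype=float)])
    x_obs, y_obs, x_mis = x[id_obs], y[id_obs], x[~id_obs]

    # Ridge-stabilised least squares on the observed cases.
    xtx = x_obs.T @ x_obs
    v = np.linalg.inv(xtx + ridge * np.diag(np.diag(xtx)))
    v = (v + v.T) / 2  # enforce symmetry before the Cholesky factorisation
    beta_hat = v @ x_obs.T @ y_obs

    # Draw sigma and beta from an approximate posterior (Bayesian regression).
    resid = y_obs - x_obs @ beta_hat
    df = max(len(y_obs) - x.shape[1], 1)
    sigma = np.sqrt(resid @ resid / rng.chisquare(df))
    beta_draw = beta_hat + sigma * (np.linalg.cholesky(v) @ rng.standard_normal(x.shape[1]))

    # matchtype=1: predicted values for donors, drawn values for recipients.
    yhat_obs = x_obs @ beta_hat
    yhat_mis = x_mis @ beta_draw

    imputed = np.empty(len(yhat_mis))
    for i, target in enumerate(yhat_mis):
        nearest = np.argsort(np.abs(yhat_obs - target))[:donors]
        imputed[i] = y_obs[rng.choice(nearest)]  # copy an observed value
    return imputed

# Synthetic demo data (not from the documentation).
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(50, 2))
y_demo = 1.0 + x_demo @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=50)
obs_demo = np.ones(50, dtype=bool)
obs_demo[:10] = False
y_demo[:10] = np.nan
imputed_demo = pmm_sketch(y_demo, obs_demo, x_demo, donors=3, rng=rng)
```

Because every imputed value is copied from a donor, the imputations always lie within the set of observed values of y.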

Examples

>>> import numpy as np
>>> from imputation.PMM import pmm
>>> y = np.array([7, np.nan, 9, 10, 11])
>>> id_obs = ~np.isnan(y)
>>> x = np.array([[1, 2], [3, 4], [5, 7], [7, 8], [9, 10]])
>>> pmm(y=y, id_obs=id_obs, x=x, donors=3)

imputation.PMM.quantify_cca(y, id_obs, x)[source]

Factorize a categorical variable y into numeric values via optimal scaling using Canonical Correlation Analysis (CCA) with predictors x.

Parameters:
  • y (array-like, categorical variable with missing values)

  • id_obs (boolean array-like, mask indicating observed (True) and missing (False) in y)

  • x (array-like or DataFrame, predictors without missing values corresponding to y)

Returns:

  • ynum (numpy.ndarray) – Numeric transformation of y with missing positions as np.nan.

  • id (pandas.DataFrame) – DataFrame representing the canonical components for the observed y.

Notes

This method encodes y as one-hot vectors, then applies CCA to find numeric representations that maximize correlation with predictors x.
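The idea of optimal scaling through CCA can be illustrated with plain NumPy. The sketch below is an assumption-laden illustration, not the code behind quantify_cca: the helper name cca_scores is hypothetical, one dummy column is dropped to avoid a singular one-hot matrix, and the first canonical variate is obtained from a QR/SVD decomposition.

```python
import numpy as np

def cca_scores(y_obs, x_obs):
    """First canonical variate of a one-hot encoded y against x. Hypothetical helper."""
    levels, codes = np.unique(y_obs, return_inverse=True)
    # One-hot encode and drop the first level so the centred matrix has full rank.
    onehot = np.eye(len(levels))[codes][:, 1:]
    yc = onehot - onehot.mean(axis=0)
    xc = x_obs - x_obs.mean(axis=0)

    # Orthonormal bases for both blocks; the singular values of Qx'Qy
    # are the canonical correlations.
    qx, _ = np.linalg.qr(xc)
    qy, ry = np.linalg.qr(yc)
    _, s, vt = np.linalg.svd(qx.T @ qy)

    # Map the leading right singular vector back to weights on the dummies.
    weights = np.linalg.solve(ry, vt[0])
    scores = yc @ weights  # one numeric value per case, constant within a level
    return levels, scores, s[0]

# Synthetic demo data (not from the documentation).
rng = np.random.default_rng(0)
y_cat = np.repeat(np.array(["low", "mid", "high"]), 20)
group_mean = np.repeat(np.array([0.0, 1.0, 3.0]), 20)
x_num = np.column_stack([
    group_mean + rng.normal(scale=0.3, size=60),
    rng.normal(size=60),
])
levels, scores, rho = cca_scores(y_cat, x_num)
```

Each category receives a single numeric score, so the result can stand in for y in a numeric method such as PMM.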

imputation.PMM.matcherid(d, t, matcher='NN', k=10, radius=3, rng=None)[source]

Find donor indices matching missing values based on specified matching method.

Parameters:
  • d (np.array) – Numeric vector of observed values (donor pool).

  • t (np.array) – Numeric vector of missing values to be matched.

  • matcher (str, optional) – Matching method to use:
      - “NN”: Randomly selects one from the k nearest neighbors (default).
      - “fixedNN”: Randomly selects one donor within a fixed radius.

  • k (int, optional) – Number of nearest neighbors to consider (only for “NN” matcher).

  • radius (float, optional) – Radius threshold for fixedNN matcher (only for “fixedNN” matcher).

  • rng (np.random.Generator, optional) – Random number generator for reproducibility. If None, a fresh generator is used.

Returns:

List of indices corresponding to chosen donors in d for each element in t.

Return type:

list of int

Raises:

ValueError – If an unknown matcher method is specified.

Examples

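>>> import numpy as np
>>> from imputation.PMM import matcherid
>>> d = np.array([-5, 6, 0, 10, 12])
>>> t = np.array([-6])
>>> matcherid(d, t, matcher="NN", k=3)  # random pick among the 3 nearest donors; output may vary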
[0]
>>> matcherid(d, t, matcher="fixedNN", radius=5)
[0]
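The two matching rules can be sketched as follows. This is an illustration under stated assumptions rather than the matcherid source: the name match_donors is hypothetical, and falling back to the single nearest donor when no donor lies within the radius is a choice made here, not documented behavior.

```python
import numpy as np

def match_donors(d, t, matcher="NN", k=10, radius=3.0, rng=None):
    """Pick one donor index in d for each target in t. Hypothetical helper."""
    rng = np.random.default_rng() if rng is None else rng
    d = np.asarray(d, dtype=float)
    chosen = []
    for target in np.atleast_1d(t):
        dist = np.abs(d - target)
        if matcher == "NN":
            # Candidate pool: the k closest donors.
            pool = np.argsort(dist)[: min(k, len(d))]
        elif matcher == "fixedNN":
            # Candidate pool: every donor within the radius.
            pool = np.flatnonzero(dist <= radius)
            if pool.size == 0:
                # Assumption: fall back to the nearest donor.
                pool = np.array([np.argmin(dist)])
        else:
            raise ValueError(f"Unknown matcher: {matcher!r}")
        chosen.append(int(rng.choice(pool)))
    return chosen

d_demo = np.array([-5, 6, 0, 10, 12])
t_demo = np.array([-6])
nn_pick = match_donors(d_demo, t_demo, matcher="NN", k=3, rng=np.random.default_rng(1))
radius_pick = match_donors(d_demo, t_demo, matcher="fixedNN", radius=5)
```

With radius=5 only the donor at index 0 lies within reach of the target, so that result is deterministic; the "NN" result is a random pick among indices 0, 2, and 1.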

CART: Classification and Regression Trees

imputation.cart.cart(y, id_obs, x, id_mis=None, min_samples_leaf=5, ccp_alpha=0.0001, rng=None, **kwargs)[source]

Impute missing values using Classification and Regression Trees (CART).

This function is designed to be compatible with the MICE framework.

Parameters:
  • y (Union[pd.Series, np.ndarray]) – Target variable with missing values

  • id_obs (np.ndarray) – Boolean mask of observed values in y (True for observed, False for missing)

  • x (Union[pd.DataFrame, np.ndarray]) – Predictor variables (must be fully observed)

  • id_mis (np.ndarray, optional) – Boolean mask of missing values to impute. If None, uses ~id_obs

  • min_samples_leaf (int, default=5) – Minimum number of samples required to be at a leaf node

  • ccp_alpha (float, default=1e-4) – Complexity parameter for pruning

  • rng (np.random.Generator, optional) – Random number generator for reproducibility. If None, a fresh generator is used.

  • **kwargs (dict) – Additional parameters passed to the tree model

Returns:

Imputed values for missing positions only (matching R implementation).

Return type:

np.ndarray

Notes

The procedure follows R’s mice CART implementation:

  1. Bootstrap the observed cases (sample with replacement).
  2. Fit a classification or regression tree on the bootstrap sample.
  3. For each missing value, find the terminal node it would end up in.
  4. Make a random draw from the ORIGINAL observed values in that node.

This adds stochasticity through both bootstrapping and donor sampling.
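The procedure can be sketched with scikit-learn for a numeric target. This is a hedged illustration, not the cart source: the name cart_sketch is hypothetical, the use of DecisionTreeRegressor is an assumption suggested by the min_samples_leaf and ccp_alpha parameters, and only the regression case is shown.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def cart_sketch(y, id_obs, x, min_samples_leaf=5, ccp_alpha=1e-4, rng=None):
    """Bootstrap + tree + donor draw for numeric y. Hypothetical helper."""
    rng = np.random.default_rng() if rng is None else rng
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    id_obs = np.asarray(id_obs, dtype=bool)
    x_obs, y_obs, x_mis = x[id_obs], y[id_obs], x[~id_obs]

    # 1. Bootstrap the observed cases.
    boot = rng.integers(0, len(y_obs), size=len(y_obs))

    # 2. Fit a regression tree on the bootstrap sample.
    tree = DecisionTreeRegressor(
        min_samples_leaf=min_samples_leaf,
        ccp_alpha=ccp_alpha,
        random_state=int(rng.integers(2**31 - 1)),
    )
    tree.fit(x_obs[boot], y_obs[boot])

    # 3. Terminal node of every original observed case and every missing case.
    leaf_obs = tree.apply(x_obs)
    leaf_mis = tree.apply(x_mis)

    # 4. Draw one donor from the original observed values in the same node.
    imputed = np.empty(len(leaf_mis))
    for i, leaf in enumerate(leaf_mis):
        imputed[i] = rng.choice(y_obs[leaf_obs == leaf])
    return imputed

# Synthetic demo data (not from the documentation).
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(60, 2))
y_demo = np.where(x_demo[:, 0] > 0, 5.0, -5.0) + rng.normal(size=60)
obs_demo = np.ones(60, dtype=bool)
obs_demo[:12] = False
cart_imputed = cart_sketch(y_demo, obs_demo, x_demo, rng=rng)
```

Every leaf of the fitted tree contains at least one bootstrap case, and bootstrap cases are original observed cases, so the donor pool in step 4 is never empty.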

Random Forest

imputation.rf.rf(y, id_obs, x, id_mis=None, n_estimators=10, rng=None, **kwargs)[source]

Impute missing values using Random Forests with donor sampling.

This function is designed to be compatible with the MICE framework, following the same interface as PMM, midas, CART, and sample methods.

Parameters:
  • y (Union[pd.Series, np.ndarray]) – Target variable with missing values

  • id_obs (np.ndarray) – Boolean mask of observed values in y (True = observed, False = missing)

  • x (Union[pd.DataFrame, np.ndarray]) – Predictor variables (should be the current completed columns)

  • id_mis (np.ndarray, optional) – Boolean mask of missing values. If None, uses ~id_obs.

  • n_estimators (int, default=10) – Number of trees in the forest

  • rng (np.random.Generator, optional) – Random number generator for reproducibility. If None, a fresh generator is used.

  • **kwargs (dict) – Additional parameters passed to the random forest model.

Returns:

Imputed values for missing positions only.

Return type:

np.ndarray

Notes

Algorithm (Doove et al., 2014; mirrors R mice):

  1. Fit a random forest on observed data.
  2. For each missing case, find terminal nodes across all trees.
  3. For each tree, collect donors (observed cases in same node).
  4. Randomly sample one donor per tree.
  5. Take final imputation as a random draw from those donor predictions.

Bootstrapping is inherent to Random Forest (bagging), so no additional bootstrap is applied (matching R mice behavior). Each tree is already built on a bootstrap sample of the data.
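The donor-sampling steps can be sketched with scikit-learn for a numeric target. As with the other sketches, this is an illustration under assumptions and not the rf source: the name rf_sketch is hypothetical and RandomForestRegressor is assumed as the underlying model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_sketch(y, id_obs, x, n_estimators=10, rng=None):
    """Random forest imputation with donor sampling. Hypothetical helper."""
    rng = np.random.default_rng() if rng is None else rng
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    id_obs = np.asarray(id_obs, dtype=bool)
    x_obs, y_obs, x_mis = x[id_obs], y[id_obs], x[~id_obs]

    # 1. Fit the forest; each tree is grown on its own bootstrap sample.
    forest = RandomForestRegressor(
        n_estimators=n_estimators,
        random_state=int(rng.integers(2**31 - 1)),
    )
    forest.fit(x_obs, y_obs)

    # 2. Terminal nodes per tree, arrays of shape (n_cases, n_estimators).
    leaves_obs = forest.apply(x_obs)
    leaves_mis = forest.apply(x_mis)

    imputed = np.empty(len(x_mis))
    for i in range(len(x_mis)):
        per_tree = []
        for tree_idx in range(n_estimators):
            # 3. Donors: observed cases sharing the node in this tree.
            same_node = leaves_obs[:, tree_idx] == leaves_mis[i, tree_idx]
            # 4. One random donor per tree.
            per_tree.append(rng.choice(y_obs[same_node]))
        # 5. Final imputation: a random draw among the per-tree donors.
        imputed[i] = rng.choice(per_tree)
    return imputed

# Synthetic demo data (not from the documentation).
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(60, 2))
y_demo = x_demo[:, 0] ** 2 + rng.normal(scale=0.2, size=60)
obs_demo = np.ones(60, dtype=bool)
obs_demo[:12] = False
rf_imputed = rf_sketch(y_demo, obs_demo, x_demo, rng=rng)
```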

MIDAS: Distance Aided Selection of Donors

imputation.midas.bootfunc_plain(n)[source]

Generates bootstrap weights for n observations using simple random sampling with replacement.

This function simulates a nonparametric bootstrap by randomly drawing n integers from the range 1 to n (inclusive), with replacement. It returns the count of how many times each index (1-based) is selected, producing a frequency table that can be used as weights in e.g. MIDAS imputation.

Parameters:

n (int) – The number of observations to sample and also the length of the resulting weight vector.

Returns:

weights – An array of integers indicating how often each index (1-based) was selected in the bootstrap sample.

Return type:

ndarray of shape (n,)
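A frequency table of this kind can be produced with two NumPy calls. The sketch below is illustrative only: the name bootstrap_weights and its rng argument are assumptions (the documented function takes only n), and indices are 0-based here.

```python
import numpy as np

def bootstrap_weights(n, rng=None):
    """Count how often each of n indices appears in a bootstrap sample. Hypothetical helper."""
    rng = np.random.default_rng() if rng is None else rng
    draws = rng.integers(0, n, size=n)      # n draws with replacement
    return np.bincount(draws, minlength=n)  # frequency of each index

weights = bootstrap_weights(8, rng=np.random.default_rng(0))
```

The weights always sum to n, and indices that were never drawn receive weight 0.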

imputation.midas.minmax(x, domin=True, domax=True)[source]

imputation.midas.compute_beta(x, m)[source]

imputation.midas.midas(y, id_obs, x, id_mis=None, ridge=1e-05, midas_kappa=None, outout=True, **kwargs)[source]

MIDAS Imputation: Multiple Imputation using Distance Aided Selection of donors.

This function implements the MIDAS imputation algorithm for continuous variables, as introduced by Gaffert et al. (2018).

It operates by weighting observed donors based on the similarity between predicted values, with optional leave-one-out model estimation for increased fidelity.

Parameters:
  • y (array-like of shape (n_samples,)) – The target variable with missing values to be imputed. Must be numeric.

  • id_obs (array-like of bool of shape (n_samples,)) – Logical array indicating observed values in y. True where y is observed, False where missing.

  • x (array-like of shape (n_samples, n_features)) – Design matrix of predictor variables. Must be fully observed.

  • id_mis (np.ndarray, optional) – Boolean mask of missing values to impute. If None, uses ~id_obs.

  • ridge (float, default=1e-5) – Ridge penalty used in regularized regression to stabilize the solution in the presence of multicollinearity.
      - Set lower (e.g. 1e-6) to reduce bias in noisy data.
      - Set higher (e.g. 1e-4) if collinearity is suspected.

  • midas_kappa (float or None, default=None) – Controls the sharpness of donor weighting. If None, the optimal value is estimated based on R² as described by Siddique and Belin (2008). A common fallback is 3.

  • outout (bool, default=True) – If True, uses leave-one-out regression for each donor (slow but MI-proper). If False, a single model is estimated for all donors and recipients. WARNING: Setting outout=False may produce biased estimates and is not fully supported.

  • **kwargs (dict) – Additional arguments (not used in this method).

Returns:

y_imp – Imputed values for missing positions only (matching R implementation).

Return type:

np.ndarray

Notes

  • Based on: Gaffert, P., Meinfelder, F., & van den Bosch, V. (2018). “Towards an MI-proper Predictive Mean Matching.”

  • Related: Siddique, J. & Belin, T. R. (2008). “Multiple Imputation Using an Iterative Hot-Deck with Distance-Based Donor Selection.”
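The role of midas_kappa in donor weighting can be illustrated with a short sketch. It assumes inverse-distance weights of the form 1 / |difference in predicted values|^kappa, optionally multiplied by bootstrap weights; the name midas_donor_probs and the small floor on distances are choices made for this example, not the midas source.

```python
import numpy as np

def midas_donor_probs(pred_obs, pred_mis, kappa=3.0, boot_weights=None):
    """Donor selection probabilities from predicted-value distances. Hypothetical helper."""
    pred_obs = np.asarray(pred_obs, dtype=float)
    pred_mis = np.asarray(pred_mis, dtype=float)
    if boot_weights is None:
        boot_weights = np.ones(len(pred_obs))
    # Pairwise distances: one row per recipient, one column per donor.
    dist = np.abs(pred_mis[:, None] - pred_obs[None, :])
    dist = np.maximum(dist, 1e-12)  # assumption: floor to avoid division by zero
    raw = boot_weights / dist**kappa  # larger kappa concentrates weight on close donors
    return raw / raw.sum(axis=1, keepdims=True)

pred_obs_demo = np.array([1.0, 2.0, 4.0, 8.0])
pred_mis_demo = np.array([2.1, 7.0])
probs = midas_donor_probs(pred_obs_demo, pred_mis_demo, kappa=3.0)

# Draw one donor index per recipient according to these probabilities.
rng = np.random.default_rng(0)
donor_idx = [int(rng.choice(len(pred_obs_demo), p=row)) for row in probs]
```

Unlike PMM with a fixed number of donors, every observed case can be selected here; distant donors simply receive a very small probability.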

Examples

>>> import numpy as np
>>> from imputation.midas import midas
>>> y = np.array([7, np.nan, 9, 10, 11])
>>> id_obs = ~np.isnan(y)
>>> x = np.array([[1, 2], [3, 4], [5, 6], [7, 13], [11, 10]])
>>> midas(y, id_obs, x)
array([9.0])

Sample: Random Sampling

imputation.sample.sample(y, id_obs, x, id_mis=None, rng=None, **kwargs)[source]

Impute missing values by random sampling from observed values.

This function is designed to be compatible with the MICE framework, following the same interface as PMM, midas, and CART imputation methods.

Parameters:
  • y (Union[pd.Series, np.ndarray]) – Target variable with missing values

  • id_obs (np.ndarray) – Boolean mask of observed values in y (True for observed, False for missing)

  • x (Union[pd.DataFrame, np.ndarray]) – Predictor variables (not used in this method, but kept for consistency)

  • id_mis (np.ndarray, optional) – Boolean mask of missing values to impute. If None, uses ~id_obs

  • rng (np.random.Generator, optional) – Random number generator for reproducibility. If None, a fresh generator is used.

  • **kwargs (dict) – Additional arguments (not used in this method)

Returns:

Imputed values for missing positions only (matching R implementation).

Return type:

np.ndarray

Notes

This is the simplest imputation method:

  1. Take all observed values in the target variable.
  2. Randomly sample from them, with replacement, to fill in the missing values.

No modeling is involved.

This method ignores the predictor variables (x) and only uses the observed values of the target variable for imputation.

Edge cases handled (matching R implementation):

  • If no observed values: returns random normal values for numeric data, None values for categorical data

  • If only one observed value: duplicates it to allow sampling
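A minimal sketch of this method for numeric data, including the empty-donor edge case, might look as follows. The name sample_sketch is hypothetical and the categorical branch is omitted.

```python
import numpy as np

def sample_sketch(y, id_obs, rng=None):
    """Fill missing entries by sampling observed values with replacement. Hypothetical helper."""
    rng = np.random.default_rng() if rng is None else rng
    y = np.asarray(y, dtype=float)
    id_obs = np.asarray(id_obs, dtype=bool)
    n_mis = int((~id_obs).sum())
    y_obs = y[id_obs]
    if len(y_obs) == 0:
        # No donors at all: fall back to standard normal draws (numeric case).
        return rng.standard_normal(n_mis)
    # rng.choice also works for a single observed value, so no explicit
    # duplication is needed in NumPy.
    return rng.choice(y_obs, size=n_mis, replace=True)

y_demo = np.array([1.0, np.nan, 3.0, np.nan, 5.0])
draws = sample_sketch(y_demo, ~np.isnan(y_demo), rng=np.random.default_rng(0))
no_donor_draws = sample_sketch(np.full(3, np.nan), np.zeros(3, dtype=bool),
                               rng=np.random.default_rng(0))
```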

See Also