Pooling Functions

Functions for combining results from multiple imputed datasets using Rubin’s rules.

MICEresult Class

Pooled results container for MICE following Rubin’s rules.

Separated into its own module so that it can be reused and so that MICE.py stays lighter.

class imputation.mice_result.MICEresult(model, params, normalized_cov_params)

Bases: LikelihoodModelResults

Holds pooled parameter estimates after multiple imputations.

__init__(model, params, normalized_cov_params)

summary(title=None, alpha=0.05)

Return a statsmodels summary object that includes a fraction of missing information (FMI) column.

Pooling Module

Standalone pooling module for multiple imputation results.

This module provides functions to pool descriptive statistics and model estimates from multiple imputed datasets using Rubin’s rules, without being coupled to any specific imputation framework.

class imputation.pooling.PoolingResult(estimates, variances, within_variance, between_variance, frac_miss_info, param_names, n_imputations, sample_size)

Bases: object

Container for pooled multiple imputation results.

estimates

Pooled parameter estimates (q_bar)

Type:

np.ndarray

variances

Total variances for each parameter (t)

Type:

np.ndarray

within_variance

Average within-imputation variance (u_bar)

Type:

np.ndarray

between_variance

Between-imputation variance (b)

Type:

np.ndarray

frac_miss_info

Fraction of missing information for each parameter

Type:

np.ndarray

param_names

Names of the pooled parameters

Type:

List[str]

n_imputations

Number of imputations used

Type:

int

sample_size

Sample size of each imputed dataset

Type:

int

summary()

Return a summary DataFrame with pooled statistics.

Returns:

Summary table with estimates, standard errors, and diagnostics

Return type:

pd.DataFrame

__init__(estimates, variances, within_variance, between_variance, frac_miss_info, param_names, n_imputations, sample_size)
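
The page does not state how frac_miss_info is computed. As a rough illustration only, one common large-sample definition is the share of the total variance that comes from the between-imputation component, (1 + 1/m) * b / t. The function name below is hypothetical, and the library may instead use a finite-sample variant with a degrees-of-freedom correction.

```python
import numpy as np


def fraction_missing_info_sketch(within_variance, between_variance, n_imputations):
    """Large-sample share of total variance attributable to missing data.

    Assumption: lambda = (1 + 1/m) * b / t with t = u_bar + (1 + 1/m) * b.
    This is not necessarily the formula used by the library.
    """
    u_bar = np.asarray(within_variance, dtype=float)
    b = np.asarray(between_variance, dtype=float)
    inflated_b = (1.0 + 1.0 / n_imputations) * b
    total = u_bar + inflated_b
    # Return 0 where the total variance is 0 instead of dividing by zero.
    return np.divide(inflated_b, total, out=np.zeros_like(total), where=total > 0)


# Two parameters pooled over three imputations; the second one has no
# between-imputation variance, so none of its variance is due to missingness.
fmi = fraction_missing_info_sketch([0.5, 2.0], [1.0, 0.0], n_imputations=3)
```

Under this definition a value near 0 means the imputations agree closely for that parameter, and a value near 1 means most of the uncertainty comes from the missing data.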

imputation.pooling.validate_imputed_datasets(datasets)

Validate that the input datasets are suitable for pooling.

Parameters:

datasets (List[pd.DataFrame]) – List of imputed datasets to validate

Raises:

ValueError – If datasets are invalid for pooling
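The exact checks are not listed here. The sketch below shows the kind of validation such a function might perform, assuming that pooling needs at least two datasets with identical shape and column names and no remaining missing values. The name validate_sketch and each individual check are assumptions, not the library's implementation.

```python
import pandas as pd


def validate_sketch(datasets):
    """Raise ValueError if the datasets cannot be pooled (assumed checks)."""
    if not isinstance(datasets, (list, tuple)) or len(datasets) < 2:
        raise ValueError("need a list of at least two imputed datasets")
    first = datasets[0]
    for i, df in enumerate(datasets):
        if not isinstance(df, pd.DataFrame):
            raise ValueError(f"dataset {i} is not a DataFrame")
        if df.shape != first.shape:
            raise ValueError(f"dataset {i} has shape {df.shape}, expected {first.shape}")
        if list(df.columns) != list(first.columns):
            raise ValueError(f"dataset {i} has different column names")
        # Imputed datasets are expected to be complete.
        if df.isna().any().any():
            raise ValueError(f"dataset {i} still contains missing values")


good = [pd.DataFrame({"a": [1.0, 2.0]}), pd.DataFrame({"a": [1.5, 2.0]})]
validate_sketch(good)  # passes silently
```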

imputation.pooling.apply_rubins_rules(estimates, variances)

Apply Rubin’s rules to combine estimates and variances across imputations.

Parameters:
  • estimates (np.ndarray) – Array of shape (n_imputations, n_parameters) with parameter estimates

  • variances (np.ndarray) – Array of shape (n_imputations, n_parameters) with within-imputation variances

Returns:

(pooled_estimates, total_variances, within_variance, between_variance)

Return type:

tuple
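For orientation, the standard form of Rubin's rules can be written in a few lines of NumPy. This is a minimal sketch under the documented input shapes, not the library's source. The name rubins_rules_sketch is hypothetical, and the return order simply mirrors the tuple documented above.

```python
import numpy as np


def rubins_rules_sketch(estimates, variances):
    """Pool per-imputation estimates and variances (illustrative only)."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    if estimates.ndim != 2 or estimates.shape != variances.shape:
        raise ValueError("expected two arrays of shape (n_imputations, n_parameters)")
    m = estimates.shape[0]
    if m < 2:
        raise ValueError("the between-imputation variance needs at least two imputations")

    q_bar = estimates.mean(axis=0)        # pooled estimate
    u_bar = variances.mean(axis=0)        # average within-imputation variance
    b = estimates.var(axis=0, ddof=1)     # between-imputation variance
    t = u_bar + (1.0 + 1.0 / m) * b       # total variance
    return q_bar, t, u_bar, b


# Three imputations, two parameters.
est = np.array([[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]])
var = np.array([[0.5, 2.0], [0.5, 2.0], [0.5, 2.0]])
q_bar, t, u_bar, b = rubins_rules_sketch(est, var)
```

The second parameter is estimated identically in every imputation, so its between-imputation variance is zero and its total variance equals the within-imputation variance.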

imputation.pooling.pool_descriptive_statistics(datasets, include_numeric=True, include_categorical=True)

Pool descriptive statistics across multiple imputed datasets using Rubin’s rules.

For numeric columns, pools the sample mean and its variance. For categorical columns, pools the per-level proportions and their variances.

Parameters:
  • datasets (List[pd.DataFrame]) – List of complete imputed datasets. All datasets must have the same shape and column names.

  • include_numeric (bool, default=True) – Whether to include numeric columns in pooling

  • include_categorical (bool, default=True) – Whether to include categorical columns in pooling

Returns:

Object containing pooled estimates, variances, and diagnostic statistics

Return type:

PoolingResult

Raises:

ValueError – If datasets are invalid or no columns are available for pooling
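To make the per-dataset step concrete, the sketch below computes a mean with its squared standard error for each numeric column and a proportion with its binomial variance for each categorical level. The results are then stacked into arrays of shape (n_imputations, n_parameters), the layout that apply_rubins_rules documents for its inputs. The helper name, the parameter naming scheme, and the variance formulas are assumptions. The sketch also assumes that every level appears in every dataset.

```python
import numpy as np
import pandas as pd


def per_dataset_statistics(df):
    """Return (names, estimates, variances) for one completed dataset."""
    n = len(df)
    names, est, var = [], [], []
    for col in df.columns:
        s = df[col]
        if pd.api.types.is_numeric_dtype(s):
            names.append(f"{col}_mean")
            est.append(s.mean())
            var.append(s.var(ddof=1) / n)      # squared standard error of the mean
        else:
            # One parameter per level, in sorted level order.
            for level, p in s.value_counts(normalize=True).sort_index().items():
                names.append(f"{col}={level}")
                est.append(p)
                var.append(p * (1.0 - p) / n)  # binomial variance of a proportion
    return names, np.array(est), np.array(var)


datasets = [
    pd.DataFrame({"age": [30.0, 40.0, 50.0, 60.0], "sex": ["f", "m", "f", "f"]}),
    pd.DataFrame({"age": [30.0, 44.0, 50.0, 60.0], "sex": ["f", "m", "m", "f"]}),
]
stats = [per_dataset_statistics(df) for df in datasets]
names = stats[0][0]
estimates = np.vstack([s[1] for s in stats])   # shape (n_imputations, n_parameters)
variances = np.vstack([s[2] for s in stats])
```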

imputation.pooling.pool_from_files(file_paths, read_kwargs=None, **pooling_kwargs)

Pool descriptive statistics from datasets stored in files.

Parameters:
  • file_paths (List[str]) – List of file paths to imputed datasets

  • read_kwargs (dict, optional) – Keyword arguments to pass to pd.read_csv()

  • **pooling_kwargs – Additional arguments to pass to pool_descriptive_statistics()

Returns:

Pooled results from the datasets

Return type:

PoolingResult
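The loading step presumably amounts to reading each path with pd.read_csv and forwarding read_kwargs unchanged. The sketch below shows that step in isolation. The helper load_imputed_datasets is hypothetical, and the example files are written to a temporary directory only so that the snippet runs on its own.

```python
import os
import tempfile

import pandas as pd


def load_imputed_datasets(file_paths, read_kwargs=None):
    """Read each path with pd.read_csv, forwarding read_kwargs unchanged."""
    read_kwargs = read_kwargs or {}
    return [pd.read_csv(path, **read_kwargs) for path in file_paths]


frames = [
    pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]}),
    pd.DataFrame({"x": [1.0, 2.5], "y": [3.0, 4.5]}),
]
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i, frame in enumerate(frames):
        path = os.path.join(tmp, f"imputed_{i}.csv")
        frame.to_csv(path, sep=";", index=False)   # semicolon-separated on purpose
        paths.append(path)
    # The separator must be passed through read_kwargs to parse these files.
    loaded = load_imputed_datasets(paths, read_kwargs={"sep": ";"})
```

Passing options such as sep, dtype, or usecols through read_kwargs keeps the file format concerns separate from the pooling options in **pooling_kwargs.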

imputation.pooling.pool_subset(datasets, columns=None, **pooling_kwargs)

Pool descriptive statistics for a subset of columns.

Parameters:
  • datasets (List[pd.DataFrame]) – List of complete imputed datasets

  • columns (List[str], optional) – List of column names to include in pooling. If None, uses all columns.

  • **pooling_kwargs – Additional arguments to pass to pool_descriptive_statistics()

Returns:

Pooled results for the specified columns

Return type:

PoolingResult

See Also