Utilities

Helper functions for configuration, validation, and predictor selection.

Logging Configuration

Logging configuration module for the imputation package.

This module provides proper logging setup following Python best practices, allowing users to configure logging behavior without affecting the global logging state.

imputation.logging_config.setup_logging(level='INFO', log_dir=None, console=True, file_logging=True, format_string=None, console_level=None, file_level=None, max_bytes=5242880, backup_count=5)

Configure logging for the imputation package.

This function sets up a package-specific logger without affecting the root logger or other packages. It’s safe to call multiple times.

Parameters:
  • level (str or int, default="INFO") – Base logging level for the package (‘DEBUG’, ‘INFO’, ‘WARNING’, ‘ERROR’) or logging level constant (e.g., logging.DEBUG)

  • log_dir (str, optional) – Directory for log files. If None, uses ‘./logs’

  • console (bool, default=True) – Whether to enable console logging

  • file_logging (bool, default=True) – Whether to enable file logging

  • format_string (str, optional) – Custom format string for log messages. If None, uses default format

  • console_level (str or int, optional) – Logging level for console handler. If None, uses ‘INFO’

  • file_level (str or int, optional) – Logging level for file handler. If None, uses ‘DEBUG’

  • max_bytes (int, default=5242880) – Maximum size of a log file in bytes before rotation. The default is 5 MB

  • backup_count (int, default=5) – Number of backup log files to keep

Returns:

The configured package logger

Return type:

logging.Logger

Examples

Basic usage:

>>> import imputation
>>> logger = imputation.setup_logging()

Custom configuration:

>>> logger = imputation.setup_logging(
...     level='DEBUG',
...     log_dir='my_logs',
...     console_level='WARNING'
... )

Disable file logging:

>>> logger = imputation.setup_logging(file_logging=False)

imputation.logging_config.get_logger(name)

Get a logger for a specific module within the imputation package.

This function returns a child logger of the main package logger, ensuring proper hierarchy and inheritance of configuration.

Parameters:

name (str) – Module name, typically __name__ or a descriptive string

Returns:

A logger instance for the specified module

Return type:

logging.Logger

Examples

In a module file:

>>> from imputation.logging_config import get_logger
>>> logger = get_logger(__name__)
>>> logger.info("This is a log message")

For simulation scripts:

>>> logger = get_logger('imputation.simulation.fdgs')
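The hierarchy that get_logger relies on is a feature of the standard logging module: a logger whose dotted name extends another logger's name becomes its child and inherits its effective level. The sketch below shows this with the placeholder name demo_package, which is an assumption and not a name used by the package.

```python
import logging

# Parent logger, standing in for the main package logger.
package_logger = logging.getLogger("demo_package")
package_logger.setLevel(logging.WARNING)

# A dotted name below the parent yields a child logger.
child = logging.getLogger("demo_package.simulation")

# The child has no level of its own, so it uses the parent's level.
effective_level = child.getEffectiveLevel()
```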

imputation.logging_config.disable_logging()

Disable logging for the imputation package.

This is useful for testing or when logging output is not desired.

imputation.logging_config.reset_logging()

Reset logging configuration to default state.

This removes all handlers and sets the logger back to default configuration with only a NullHandler.
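A reset of this kind can be sketched with the standard library alone. The helper name reset_to_null_handler and the logger name demo_package are hypothetical, and resetting the level to NOTSET is an assumption about what "default configuration" means.

```python
import logging


def reset_to_null_handler(name):
    """Hypothetical helper: strip all handlers, leave only a NullHandler."""
    logger = logging.getLogger(name)
    for handler in list(logger.handlers):
        logger.removeHandler(handler)
        handler.close()
    # A NullHandler swallows records, so nothing is emitted by default.
    logger.addHandler(logging.NullHandler())
    logger.setLevel(logging.NOTSET)  # assumption: default level is NOTSET
    return logger


demo = logging.getLogger("demo_package")
demo.addHandler(logging.StreamHandler())
demo = reset_to_null_handler("demo_package")
```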

Validation Functions

imputation.validators.check_n_imputations(n_imputations)

Check if the number of imputations is valid and provide a warning if it’s high.

Parameters:

n_imputations (int) – Number of imputations to perform

Raises:

ValueError – If n_imputations is not a positive integer

imputation.validators.check_maxit(maxit)

Check if the maximum iterations parameter is valid and provide a warning if it’s high.

Parameters:

maxit (int) – Maximum number of iterations for each imputation cycle

Raises:

ValueError – If maxit is not a positive integer
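check_n_imputations and check_maxit share one pattern: reject anything that is not a positive integer, and warn when the value is high. The sketch below illustrates that pattern. The helper name check_positive_int and the threshold of 50 are assumptions, since the documentation does not state what counts as high.

```python
import warnings


def check_positive_int(value, name, warn_above):
    """Hypothetical helper mirroring the documented validation pattern."""
    # bool is a subclass of int, so it is rejected explicitly.
    if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
        raise ValueError(f"{name} must be a positive integer, got {value!r}")
    if value > warn_above:
        warnings.warn(f"{name}={value} is high and may be slow", UserWarning)
    return value


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_positive_int(5, "maxit", warn_above=50)    # valid, no warning
    check_positive_int(500, "maxit", warn_above=50)  # valid, warns
n_warnings = len(caught)

try:
    check_positive_int(0, "maxit", warn_above=50)
    rejected = False
except ValueError:
    rejected = True
```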

imputation.validators.check_method(method, columns)

Check and process the method parameter for MICE imputation.

Parameters:
  • method (Union[str, Dict[str, str]]) – Method specification. Can be:

      - str: use the same method for all columns
      - Dict[str, str]: dictionary mapping column names to their methods

  • columns (List[str]) – List of column names in the data

Returns:

Dictionary mapping each column to its imputation method

Return type:

Dict[str, str]

Raises:

ValueError – If method is invalid or references non-existent columns
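As an illustration of the two accepted forms, the sketch below expands a method specification into a per-column dictionary. It is not the package's code: the method names in KNOWN_METHODS, the helper name expand_method, and the fallback to a default for columns missing from the dictionary are all assumptions.

```python
from typing import Dict, List, Union

# Hypothetical method names, used only for this illustration.
KNOWN_METHODS = {"pmm", "mean"}


def expand_method(method: Union[str, Dict[str, str]],
                  columns: List[str],
                  default: str = "pmm") -> Dict[str, str]:
    if isinstance(method, str):
        # One method applied to every column.
        requested = {col: method for col in columns}
    elif isinstance(method, dict):
        unknown_columns = sorted(set(method) - set(columns))
        if unknown_columns:
            raise ValueError(f"Unknown columns in method: {unknown_columns}")
        # Assumption: columns absent from the dict fall back to `default`.
        requested = {col: method.get(col, default) for col in columns}
    else:
        raise ValueError("method must be a str or a dict")

    invalid = sorted(set(requested.values()) - KNOWN_METHODS)
    if invalid:
        raise ValueError(f"Unsupported methods: {invalid}")
    return requested


same_for_all = expand_method("mean", ["age", "bmi"])
per_column = expand_method({"bmi": "mean"}, ["age", "bmi"])
```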

imputation.validators.check_initial_method(initial_method)

Check if the initial imputation method is valid.

Parameters:

initial_method (str) – Initial imputation method to validate

Raises:

ValueError – If initial_method is not a valid initial imputation method

imputation.validators.check_visit_sequence(visit_sequence, columns, columns_with_missing=None)

Check and process the visit sequence parameter for MICE imputation.

Parameters:
  • visit_sequence (Union[str, List[str]]) – Visit sequence specification. Can be:

      - str: "monotone" or "random" for predefined sequences
      - List[str]: list of column names specifying the order to visit variables

  • columns (List[str]) – List of all column names in the data

  • columns_with_missing (List[str], optional) – List of columns that have missing values. If provided, will validate that all these columns are included in a custom visit sequence.

Returns:

(validated_sequence, columns_without_missing) where:

  - validated_sequence: the processed visit sequence (only for list input, None for string)
  - columns_without_missing: list of columns in sequence that don’t have missing values

Return type:

tuple

Raises:

ValueError – If visit_sequence is invalid or references non-existent columns

Notes

For string visit sequences (“monotone”, “random”), the actual sequence will be generated in MICE._set_visit_sequence() based on the data.

For list visit sequences, this function validates that:

  1. All columns exist in the data
  2. No duplicate columns
  3. All columns with missing values are included (if columns_with_missing provided)
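The three checks for list input can be sketched as follows. The helper name validate_visit_list and the error messages are hypothetical, and returning an empty list when columns_with_missing is not given is an assumption.

```python
from typing import List, Optional, Tuple


def validate_visit_list(
    visit_sequence: List[str],
    columns: List[str],
    columns_with_missing: Optional[List[str]] = None,
) -> Tuple[List[str], List[str]]:
    # 1. Every entry must be a column of the data.
    unknown = [col for col in visit_sequence if col not in columns]
    if unknown:
        raise ValueError(f"Columns not in data: {unknown}")

    # 2. No column may appear twice.
    if len(set(visit_sequence)) != len(visit_sequence):
        raise ValueError("visit_sequence contains duplicate columns")

    # 3. Every column with missing values must be visited.
    without_missing: List[str] = []
    if columns_with_missing is not None:
        omitted = [c for c in columns_with_missing if c not in visit_sequence]
        if omitted:
            raise ValueError(f"Columns with missing values omitted: {omitted}")
        without_missing = [
            c for c in visit_sequence if c not in columns_with_missing
        ]
    return list(visit_sequence), without_missing


sequence, fully_observed = validate_visit_list(
    ["bmi", "age", "chl"],
    columns=["age", "bmi", "chl"],
    columns_with_missing=["bmi", "chl"],
)
```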

imputation.validators.validate_predictor_matrix(predictor_matrix, data_columns, data)

Validate predictor matrix for MICE imputation.

Parameters:
  • predictor_matrix (pd.DataFrame) – Binary matrix indicating which variables should be used as predictors for each target variable. Rows represent target variables, columns represent predictors. A 1 indicates that the column variable is used as predictor for the index variable.

  • data_columns (List[str]) – List of column names in the data to validate against

  • data (pd.DataFrame) – The data to check for missing values

Returns:

Validated predictor matrix

Return type:

pd.DataFrame

Raises:

ValueError – If predictor_matrix has invalid structure or column names don’t match data
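To make the row and column convention concrete, the sketch below builds a small predictor matrix with pandas. The column names are invented, and zeroing the diagonal (so that no variable predicts itself) is an assumption borrowed from common MICE practice, not a documented requirement of this function.

```python
import pandas as pd

columns = ["age", "bmi", "chl"]  # hypothetical column names

# Rows are target variables, columns are predictors.
predictor_matrix = pd.DataFrame(1, index=columns, columns=columns)

# Assumption: a variable is not used to predict itself.
for col in columns:
    predictor_matrix.loc[col, col] = 0

# Exclude "chl" as a predictor when imputing "age".
predictor_matrix.loc["age", "chl"] = 0


def predictors_for(target):
    """Return the predictors flagged with 1 in the target's row."""
    row = predictor_matrix.loc[target]
    return list(row.index[row == 1])


age_predictors = predictors_for("age")
bmi_predictors = predictors_for("bmi")
```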

imputation.validators.validate_columns(data)

Validate and clean columns in the DataFrame.

Checks for columns with only NaN values and drops them with appropriate warnings.

Parameters:

data (pd.DataFrame) – Input DataFrame to validate

Returns:

DataFrame with invalid columns removed

Return type:

pd.DataFrame

Warns:

UserWarning – If columns with only NaN values are found and dropped

Notes

Missing data values that are treated as NaN:

  - pandas NaN (numpy.nan)
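Dropping columns that contain only NaN values, with a warning, can be sketched as below. The helper name drop_all_nan_columns and the warning text are hypothetical. The sketch only illustrates the documented behavior.

```python
import warnings

import numpy as np
import pandas as pd


def drop_all_nan_columns(data: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: drop columns whose values are all NaN."""
    empty = [col for col in data.columns if data[col].isna().all()]
    if empty:
        warnings.warn(
            f"Dropping columns with only NaN values: {empty}", UserWarning
        )
    return data.drop(columns=empty)


df = pd.DataFrame({"age": [30.0, np.nan], "bmi": [np.nan, np.nan]})
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    cleaned = drop_all_nan_columns(df)
user_warnings = [w for w in caught if issubclass(w.category, UserWarning)]
```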

imputation.validators.validate_dataframe(data)

Check and validate input data for MICE imputation.

Parameters:

data (Any) – Input data to be checked and converted to DataFrame

Returns:

Validated and cleaned DataFrame

Return type:

pd.DataFrame

Raises:

ValueError – If data cannot be converted to DataFrame or has duplicate column names

Notes

Missing data values that are treated as NaN:

  - pandas NaN (numpy.nan)

imputation.validators.validate_formula(formula, columns)

Validate that all variables in the formula exist in the dataset columns.

Parameters:
  • formula (str) – The formula string to validate

  • columns (List[str]) – List of column names in the dataset

Raises:

ValueError – If any variables in the formula are not found in the columns
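The documentation does not specify the formula syntax. Assuming an R-style formula such as "bmi ~ age + log(chl)", a variable check could be sketched with a regular expression as below. Both helper names are hypothetical, and a real formula parser would handle more cases than this pattern does.

```python
import re
from typing import List


def formula_variables(formula: str) -> List[str]:
    # Identifiers that are not followed by "(" (those are function calls).
    return re.findall(r"\b[A-Za-z_]\w*\b(?!\s*\()", formula)


def check_formula_variables(formula: str, columns: List[str]) -> None:
    missing = [
        name for name in formula_variables(formula) if name not in columns
    ]
    if missing:
        raise ValueError(f"Variables not found in columns: {missing}")


names = formula_variables("bmi ~ age + log(chl)")

try:
    check_formula_variables("bmi ~ age + hgt", ["age", "bmi", "chl"])
    error_message = ""
except ValueError as exc:
    error_message = str(exc)
```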

Utility Functions

imputation.utils.get_imputer_func(method_name)

Return the imputer callable for method_name.

Parameters:

method_name (str) – Name of the imputation method. Must be one of the values defined in imputation.constants.ImputationMethod.

Returns:

The function implementing the requested imputation strategy.

Return type:

Callable

Raises:

ValueError – If method_name is unknown or not yet implemented.
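A name-to-callable lookup of this kind is commonly built from an enum plus a registry dictionary. The sketch below shows the idea with invented names: DemoMethod, impute_mean, impute_median, and get_demo_imputer are hypothetical and do not reflect the contents of imputation.constants.ImputationMethod.

```python
import statistics
from enum import Enum
from typing import Callable, Dict, List, Optional


class DemoMethod(str, Enum):
    MEAN = "mean"
    MEDIAN = "median"
    PMM = "pmm"  # declared, but deliberately not implemented in this sketch


def _fill(values: List[Optional[float]], fill_value: float) -> List[float]:
    return [fill_value if v is None else v for v in values]


def impute_mean(values):
    observed = [v for v in values if v is not None]
    return _fill(values, statistics.mean(observed))


def impute_median(values):
    observed = [v for v in values if v is not None]
    return _fill(values, statistics.median(observed))


_REGISTRY: Dict[DemoMethod, Callable] = {
    DemoMethod.MEAN: impute_mean,
    DemoMethod.MEDIAN: impute_median,
}


def get_demo_imputer(method_name: str) -> Callable:
    try:
        method = DemoMethod(method_name)  # lookup by enum value
    except ValueError:
        raise ValueError(f"Unknown imputation method: {method_name!r}") from None
    if method not in _REGISTRY:
        raise ValueError(f"Method {method_name!r} is not implemented yet")
    return _REGISTRY[method]


filled = get_demo_imputer("mean")([1.0, None, 3.0])
```

Both failure modes named in the Raises section map to a ValueError here: a name outside the enum, and a name inside the enum with no registered callable.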

Sampler Functions

imputation.sampler.sym(x)

Ensures the input square matrix is symmetric by averaging it with its transpose.

Parameters:

x (np.ndarray) – A square numpy matrix.

Returns:

A symmetric matrix computed as (x + x.T) / 2.

Return type:

np.ndarray
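The formula (x + x.T) / 2 is short enough to show directly. In the sketch below, the function name symmetrize is a stand-in for illustration. A typical reason for this step is that a covariance matrix can lose exact symmetry through floating-point rounding, which matters before a Cholesky factorization.

```python
import numpy as np


def symmetrize(x: np.ndarray) -> np.ndarray:
    """Average a square matrix with its transpose."""
    return (x + x.T) / 2


# A matrix that is symmetric only up to a small rounding-like error.
a = np.array([[2.0, 1.0],
              [1.0000001, 3.0]])
s = symmetrize(a)
```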

imputation.sampler.norm_draw(y, ry, x, rank_adjust=True, rng=None, **kwargs)

Bayesian linear regression draw of regression coefficients and residual variance, based on the least squares parameters from estimice().

This function replicates the mice.impute.norm.draw() algorithm from the R mice package, as described in Rubin (1987, p. 167).

Parameters:
  • y (np.ndarray) – Numeric vector of length n, containing the variable to be imputed.

  • ry (np.ndarray of bool) – Boolean mask vector of length n, where True indicates observed values of y and False indicates missing values.

  • x (np.ndarray) – Numeric design matrix of shape (n, p) with predictors for y. Must have no missing values.

  • rank_adjust (bool, optional) – If True, replaces any NaN coefficients with zeros. This is relevant only when the least squares method is “qr” and the predictor matrix is rank-deficient. Default is True.

  • rng (np.random.Generator, optional) – Random number generator for reproducibility. If None, a fresh generator is used.

  • **kwargs (dict) – Additional keyword arguments passed to estimice(), e.g., ls_meth to specify the least squares method.

Returns:

Dictionary containing:

  - 'coef': Least squares coefficient estimates (numpy array).
  - 'beta': Bayesian drawn regression coefficients (numpy array).
  - 'sigma': Drawn residual standard deviation (float).
  - 'estimation': Least squares method used (string).

Return type:

dict

Notes

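In the R algorithm that this function replicates, the residual standard deviation sigma is drawn as the square root of the residual sum of squares divided by a chi-square variate, which amounts to a scaled inverse chi-square draw for the residual variance. The regression coefficients beta are then drawn from a multivariate normal distribution centered at the least squares estimates, with covariance scaled by sigma squared.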
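The draw can be sketched in NumPy as below, following the steps of the R mice routine. This is an illustration under stated assumptions, not the package's code: the function name norm_draw_sketch is hypothetical, the least squares fit uses numpy.linalg.lstsq directly (not estimice()), and the design matrix is assumed to have full column rank.

```python
import numpy as np


def norm_draw_sketch(y, ry, x, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    x_obs, y_obs = x[ry], y[ry]  # use observed rows only

    # Least squares fit on the observed data.
    coef, *_ = np.linalg.lstsq(x_obs, y_obs, rcond=None)
    residuals = y_obs - x_obs @ coef
    df = max(x_obs.shape[0] - x_obs.shape[1], 1)
    v = np.linalg.inv(x_obs.T @ x_obs)  # unscaled covariance (X'X)^-1

    # sigma: sqrt of residual sum of squares over a chi-square variate.
    sigma = np.sqrt(np.sum(residuals ** 2) / rng.chisquare(df))

    # beta: multivariate normal around coef with covariance sigma^2 * v.
    chol = np.linalg.cholesky((v + v.T) / 2)  # lower-triangular factor
    beta = coef + (chol @ rng.standard_normal(x_obs.shape[1])) * sigma
    return {"coef": coef, "beta": beta, "sigma": float(sigma)}


# Synthetic data: y = 1 + 2 * x1 + noise, with every fourth y missing.
data_rng = np.random.default_rng(0)
x1 = np.linspace(0.0, 1.0, 20)
x = np.column_stack([np.ones(20), x1])
y = 1.0 + 2.0 * x1 + 0.1 * data_rng.standard_normal(20)
ry = np.ones(20, dtype=bool)
ry[::4] = False
y[~ry] = np.nan

result = norm_draw_sketch(y, ry, x, rng=np.random.default_rng(1))
```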

References

Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. Wiley, p. 167.

Examples

>>> import numpy as np
>>> from imputation.sampler import norm_draw
>>> y = np.array([1.0, 2.0, np.nan, 4.0])
>>> ry = ~np.isnan(y)
>>> x = np.array([[1, 0], [1, 1], [1, 2], [1, 3]])
>>> result = norm_draw(y, ry, x, ls_meth='qr')
>>> print(result['beta'])

imputation.sampler.estimice(x, y, ls_meth='qr', ridge=1e-05)

Computes least squares estimates, residuals, variance-covariance matrix, and degrees of freedom using different methods: ridge regression, QR decomposition, or Singular Value Decomposition.

Parameters:
  • x (np.ndarray) – Numeric design matrix with shape (n_samples, n_predictors). Must not contain missing values.

  • y (np.ndarray) – Numeric vector of responses to be imputed, with possible missing values.

  • ls_meth (str, optional) – Least squares method to use. Options are:

      - "qr": QR decomposition (default)
      - "ridge": Ridge regression
      - "svd": Singular Value Decomposition

  • ridge (float, optional) – Ridge penalty size for ridge regression. Default is 1e-5, balancing stability and bias.

Returns:

Dictionary containing:

  - 'c': Least squares coefficient estimates (numpy array).
  - 'r': Residuals (numpy array).
  - 'v': Variance-covariance matrix of coefficients (numpy array).
  - 'df': Degrees of freedom (int).
  - 'ls_meth': Method used (str).

Return type:

dict

Examples

>>> import numpy as np
>>> from imputation.sampler import estimice
>>> x = np.array([[1, 2], [3, 4], [5, 6]])
>>> y = np.array([7, np.nan, 9])
>>> # Assuming you handle missing y externally, e.g. ry = ~np.isnan(y)
>>> result = estimice(x[~np.isnan(y)], y[~np.isnan(y)], ls_meth="qr")
>>> print(result['c'])
[-6.   6.5]
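For comparison, the three least squares strategies can be sketched in plain NumPy. The function name least_squares_sketch is hypothetical and it returns only the coefficients. The ridge penalty shown here, ridge times the diagonal of X'X, follows the convention of the R mice package and is an assumption about this implementation.

```python
import numpy as np


def least_squares_sketch(x, y, ls_meth="qr", ridge=1e-5):
    xtx = x.T @ x
    if ls_meth == "qr":
        # Solve R c = Q'y from the reduced QR factorization of x.
        q, r = np.linalg.qr(x)
        coef = np.linalg.solve(r, q.T @ y)
    elif ls_meth == "ridge":
        # Assumption: penalty is ridge * diag(X'X), as in R mice.
        penalty = ridge * np.diag(np.diag(xtx))
        coef = np.linalg.solve(xtx + penalty, x.T @ y)
    elif ls_meth == "svd":
        # The pseudoinverse is computed from the SVD of x.
        coef = np.linalg.pinv(x) @ y
    else:
        raise ValueError(f"Unknown ls_meth: {ls_meth!r}")
    return coef


# Observed rows of the documented example (the row with NaN in y removed).
x_obs = np.array([[1.0, 2.0], [5.0, 6.0]])
y_obs = np.array([7.0, 9.0])

coef_qr = least_squares_sketch(x_obs, y_obs, "qr")
coef_svd = least_squares_sketch(x_obs, y_obs, "svd")
coef_ridge = least_squares_sketch(x_obs, y_obs, "ridge")
```

On this data the QR and SVD results agree with the documented output, while the ridge result differs slightly because of the penalty.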

See Also