Utilities

Helper functions for configuration, validation, and predictor selection.

Logging Configuration

Logging configuration module for the imputation package.

This module provides proper logging setup following Python best practices, allowing users to configure logging behavior without affecting the global logging state.

imputation.logging_config.setup_logging(level='INFO', log_dir=None, console=True, file_logging=True, format_string=None, console_level=None, file_level=None, max_bytes=5242880, backup_count=5)

Configure logging for the imputation package.

This function sets up a package-specific logger without affecting the root logger or other packages. It’s safe to call multiple times.

Parameters:

level (str or int, default="INFO") – Base logging level for the package (‘DEBUG’, ‘INFO’, ‘WARNING’, ‘ERROR’) or logging level constant (e.g., logging.DEBUG)
log_dir (str, optional) – Directory for log files. If None, uses ‘./logs’
console (bool, default=True) – Whether to enable console logging
file_logging (bool, default=True) – Whether to enable file logging
format_string (str, optional) – Custom format string for log messages. If None, uses default format
console_level (str or int, optional) – Logging level for console handler. If None, uses ‘INFO’
file_level (str or int, optional) – Logging level for file handler. If None, uses ‘DEBUG’
max_bytes (int, default=5242880, i.e. 5 MB) – Maximum size of a log file, in bytes, before it is rotated
backup_count (int, default=5) – Number of backup log files to keep

Returns:

The configured package logger

Return type:

logging.Logger

Examples

Basic usage:

>>> import imputation
>>> logger = imputation.setup_logging()

Custom configuration:

>>> logger = imputation.setup_logging(
...     level='DEBUG',
...     log_dir='my_logs',
...     console_level='WARNING'
... )

Disable file logging:

>>> logger = imputation.setup_logging(file_logging=False)

imputation.logging_config.get_logger(name)

Get a logger for a specific module within the imputation package.

This function returns a child logger of the main package logger, ensuring proper hierarchy and inheritance of configuration.

Parameters: name (str) – Module name, typically __name__ or a descriptive string
Returns: A logger instance for the specified module
Return type: logging.Logger

Examples

In a module file:

>>> from imputation.logging_config import get_logger
>>> logger = get_logger(__name__)
>>> logger.info("This is a log message")

For simulation scripts:

>>> logger = get_logger('imputation.simulation.fdgs')
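The parent/child relationship described above comes from the dotted names used by the standard logging module. The following sketch shows the idea; get_child_logger and the package name demo_pkg are hypothetical stand-ins, and the real get_logger may resolve names differently.

```python
import logging

PACKAGE_LOGGER_NAME = "demo_pkg"  # stand-in for the real package name


def get_child_logger(name):
    """Hypothetical helper: return a logger that lives under the package logger."""
    if name == PACKAGE_LOGGER_NAME or name.startswith(PACKAGE_LOGGER_NAME + "."):
        return logging.getLogger(name)
    # Names outside the package namespace are prefixed so they still inherit.
    return logging.getLogger(f"{PACKAGE_LOGGER_NAME}.{name}")


parent = logging.getLogger(PACKAGE_LOGGER_NAME)
parent.setLevel(logging.WARNING)

child = get_child_logger("demo_pkg.simulation")
script = get_child_logger("my_script")
```

Neither child sets its own level, so both inherit WARNING from the parent logger.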

imputation.logging_config.disable_logging()

Disable logging for the imputation package.

This is useful for testing or when logging output is not desired.

imputation.logging_config.reset_logging()

Reset logging configuration to default state.

This removes all handlers and sets the logger back to default configuration with only a NullHandler.
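One way to get the disable and reset behaviour described above with the standard library is sketched below. Both helper names are hypothetical, and raising the level above CRITICAL is just one possible way to silence a logger; the package may do it differently.

```python
import logging


def reset_logger_sketch(name):
    """Hypothetical helper: remove every handler, then add a NullHandler."""
    logger = logging.getLogger(name)
    for handler in list(logger.handlers):
        logger.removeHandler(handler)
        handler.close()
    logger.addHandler(logging.NullHandler())
    logger.setLevel(logging.NOTSET)
    return logger


def disable_logger_sketch(name):
    """Hypothetical helper: raise the threshold above CRITICAL."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.CRITICAL + 1)
    return logger


demo = logging.getLogger("demo_reset")
demo.addHandler(logging.StreamHandler())

silenced = disable_logger_sketch("demo_reset")
# Child loggers without their own level inherit the raised threshold.
child_enabled = logging.getLogger("demo_reset.child").isEnabledFor(logging.CRITICAL)

restored = reset_logger_sketch("demo_reset")
```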

Validation Functions

imputation.validators.check_n_imputations(n_imputations)

Check that the number of imputations is valid and issue a warning if it is high.

Parameters: n_imputations (int) – Number of imputations to perform
Raises: ValueError – If n_imputations is not a positive integer

imputation.validators.check_maxit(maxit)

Check that the maximum iterations parameter is valid and issue a warning if it is high.

Parameters: maxit (int) – Maximum number of iterations for each imputation cycle
Raises: ValueError – If maxit is not a positive integer
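Both checks above follow the same validate-then-warn pattern. A minimal sketch of that pattern is shown here; the helper name check_positive_int and the warn_above threshold are assumptions, since the thresholds the package actually uses are not stated.

```python
import warnings


def check_positive_int(value, name, warn_above):
    """Hypothetical helper mirroring the validate-then-warn pattern."""
    # bool is a subclass of int, so it is rejected explicitly.
    if isinstance(value, bool) or not isinstance(value, int) or value < 1:
        raise ValueError(f"{name} must be a positive integer, got {value!r}")
    if value > warn_above:
        warnings.warn(f"{name}={value} is high and may be slow", UserWarning)
    return value


accepted = check_positive_int(5, "n_imputations", warn_above=100)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_positive_int(500, "maxit", warn_above=100)

try:
    check_positive_int(0, "maxit", warn_above=100)
    rejected = False
except ValueError:
    rejected = True
```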

imputation.validators.check_method(method, columns)

Check and process the method parameter for MICE imputation.

Parameters:

method (Union[str, Dict[str, str]]) – Method specification. Can be:
- str: use the same method for all columns
- Dict[str, str]: dictionary mapping column names to their methods
columns (List[str]) – List of column names in the data

Returns:

Dictionary mapping each column to its imputation method

Return type:

Dict[str, str]

Raises:

ValueError – If method is invalid or references non-existent columns
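The expansion from a single string or a partial dictionary to one method per column can be sketched as follows. The helper expand_method, the method names "pmm" and "norm", and the fallback to a default for columns missing from the dictionary are all assumptions for illustration; the package's own rules may differ.

```python
def expand_method(method, columns, valid_methods, default="pmm"):
    """Hypothetical helper: normalise `method` to one entry per column."""
    if isinstance(method, str):
        methods = {col: method for col in columns}
    elif isinstance(method, dict):
        unknown_cols = [col for col in method if col not in columns]
        if unknown_cols:
            raise ValueError(f"Columns not found in data: {unknown_cols}")
        # Assumption: columns absent from the dict fall back to `default`.
        methods = {col: method.get(col, default) for col in columns}
    else:
        raise ValueError("method must be a str or a dict")

    invalid = sorted({m for m in methods.values() if m not in valid_methods})
    if invalid:
        raise ValueError(f"Unknown imputation methods: {invalid}")
    return methods


VALID = {"pmm", "norm"}  # placeholder set of method names
same_for_all = expand_method("norm", ["a", "b"], VALID)
per_column = expand_method({"a": "norm"}, ["a", "b"], VALID)
```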

imputation.validators.check_initial_method(initial_method)

Check if the initial imputation method is valid.

Parameters: initial_method (str) – Initial imputation method to validate
Raises: ValueError – If initial_method is not a valid initial imputation method

imputation.validators.check_visit_sequence(visit_sequence, columns, columns_with_missing=None)

Check and process the visit sequence parameter for MICE imputation.

Parameters:

visit_sequence (Union[str, List[str]]) – Visit sequence specification. Can be:
- str: “monotone” or “random” for predefined sequences
- List[str]: list of column names specifying the order to visit variables
columns (List[str]) – List of all column names in the data
columns_with_missing (List[str], optional) – List of columns that have missing values. If provided, will validate that all these columns are included in a custom visit sequence.

Returns:

(validated_sequence, columns_without_missing) where:
- validated_sequence: the processed visit sequence (only for list input, None for string)
- columns_without_missing: list of columns in sequence that don’t have missing values

Return type:

tuple

Raises:

ValueError – If visit_sequence is invalid or references non-existent columns

Notes

For string visit sequences (“monotone”, “random”), the actual sequence will be generated in MICE._set_visit_sequence() based on the data.

For list visit sequences, this function validates that:
1. All columns exist in the data
2. No duplicate columns
3. All columns with missing values are included (if columns_with_missing provided)
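The three checks for list input can be sketched as below. The function validate_custom_sequence is a hypothetical stand-in that mirrors the documented rules and return shape; it is not the package's code, and it does not handle the string options.

```python
def validate_custom_sequence(visit_sequence, columns, columns_with_missing=None):
    """Hypothetical helper applying the three list checks described above."""
    # 1. Every entry must be a column of the data.
    unknown = [col for col in visit_sequence if col not in columns]
    if unknown:
        raise ValueError(f"Columns not found in data: {unknown}")

    # 2. No column may appear twice.
    if len(set(visit_sequence)) != len(visit_sequence):
        raise ValueError("visit_sequence contains duplicate columns")

    # 3. Every column with missing values must be visited.
    without_missing = []
    if columns_with_missing is not None:
        omitted = [col for col in columns_with_missing if col not in visit_sequence]
        if omitted:
            raise ValueError(f"Columns with missing values not in sequence: {omitted}")
        without_missing = [
            col for col in visit_sequence if col not in columns_with_missing
        ]
    return list(visit_sequence), without_missing


sequence, complete_columns = validate_custom_sequence(
    ["b", "a", "c"], columns=["a", "b", "c"], columns_with_missing=["a", "b"]
)
```

Here "c" has no missing values, so it is reported in the second element of the result.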

imputation.validators.validate_predictor_matrix(predictor_matrix, data_columns, data)

Validate predictor matrix for MICE imputation.

Parameters:

predictor_matrix (pd.DataFrame) – Binary matrix indicating which variables should be used as predictors for each target variable. Rows represent target variables, columns represent predictors. A 1 indicates that the column variable is used as predictor for the index variable.
data_columns (List[str]) – List of column names in the data to validate against
data (pd.DataFrame) – The data to check for missing values

Returns:

Validated predictor matrix

Return type:

pd.DataFrame

Raises:

ValueError – If predictor_matrix has invalid structure or column names don’t match data
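To make the row/column convention concrete, the sketch below builds a small predictor matrix with pandas and runs two basic structural checks. The column names, the zero diagonal (a common MICE convention), and the helper basic_matrix_checks are assumptions for illustration; the package's validator may enforce additional rules.

```python
import numpy as np
import pandas as pd

columns = ["age", "bmi", "chl"]

# Every variable predicts every other variable; the diagonal is zero so
# that no variable is used to predict itself.
predictor_matrix = pd.DataFrame(
    1 - np.eye(len(columns), dtype=int), index=columns, columns=columns
)

# Rows are targets, columns are predictors: stop using "chl" to predict "bmi".
predictor_matrix.loc["bmi", "chl"] = 0


def basic_matrix_checks(matrix, data_columns):
    """Hypothetical checks; the package's validator may enforce more."""
    labels = set(data_columns)
    if set(matrix.index) != labels or set(matrix.columns) != labels:
        raise ValueError("Row and column labels must match the data columns")
    if not matrix.isin([0, 1]).all().all():
        raise ValueError("Predictor matrix entries must be 0 or 1")
    return matrix


checked = basic_matrix_checks(predictor_matrix, columns)
```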

imputation.validators.validate_columns(data)

Validate and clean columns in the DataFrame.

Checks for columns with only NaN values and drops them with appropriate warnings.

Parameters: data (pd.DataFrame) – Input DataFrame to validate
Returns: DataFrame with invalid columns removed
Return type: pd.DataFrame
Warns: UserWarning – If columns with only NaN values are found and dropped

Notes

Missing data values that are treated as NaN:
- pandas NaN (numpy.nan)
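Dropping all-NaN columns with a warning can be done in a few lines of pandas. The helper below is a hypothetical sketch of that behaviour with an invented warning message, not the package's implementation.

```python
import warnings

import numpy as np
import pandas as pd


def drop_all_nan_columns(data):
    """Hypothetical helper: drop columns whose values are all NaN."""
    empty = [col for col in data.columns if data[col].isna().all()]
    if empty:
        warnings.warn(f"Dropping columns with only NaN values: {empty}", UserWarning)
    return data.drop(columns=empty)


frame = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, np.nan]})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    cleaned = drop_all_nan_columns(frame)
```

Column "a" is kept because it has at least one observed value; column "b" is dropped.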

imputation.validators.validate_dataframe(data)

Check and validate input data for MICE imputation.

Parameters: data (Any) – Input data to be checked and converted to DataFrame
Returns: Validated and cleaned DataFrame
Return type: pd.DataFrame
Raises: ValueError – If data cannot be converted to DataFrame or has duplicate column names

Notes

Missing data values that are treated as NaN:
- pandas NaN (numpy.nan)

imputation.validators.validate_formula(formula, columns)

Validate that all variables in the formula exist in the dataset columns.

Parameters:

formula (str) – The formula string to validate
columns (List[str]) – List of column names in the dataset

Raises:

ValueError – If any variables in the formula are not found in the columns
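A simple way to approximate this check is to extract identifier-like tokens from the formula and compare them with the column names. The sketch below does that with a regular expression; both helper names are hypothetical, and the approach is deliberately naive (it would flag function names in terms such as log(x)), so the package's parser is likely more careful.

```python
import re


def formula_variables(formula):
    """Return the sorted identifier-like tokens found in a formula string."""
    return sorted(set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", formula)))


def check_formula_sketch(formula, columns):
    """Hypothetical check: raise if a formula token is not a known column."""
    missing = [name for name in formula_variables(formula) if name not in columns]
    if missing:
        raise ValueError(f"Variables not found in columns: {missing}")


columns = ["age", "bmi", "chl"]
check_formula_sketch("bmi ~ age + chl", columns)  # passes silently

try:
    check_formula_sketch("bmi ~ age + hyp", columns)
    error_message = ""
except ValueError as exc:
    error_message = str(exc)
```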

Utility Functions

imputation.utils.get_imputer_func(method_name)

Return the imputer callable for method_name.

Parameters: method_name (str) – Name of the imputation method. Must be one of the values defined in imputation.constants.ImputationMethod.
Returns: The function implementing the requested imputation strategy.
Return type: Callable
Raises: ValueError – If method_name is unknown or not yet implemented.
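A name-to-callable lookup of this kind is often implemented as an enum plus a registry dictionary. The sketch below shows that pattern with invented names (DemoMethod, get_demo_imputer, a toy mean-fill function); the members of the real ImputationMethod and the real dispatch logic are not shown in this page.

```python
from enum import Enum


class DemoMethod(str, Enum):
    """Placeholder enum; the real ImputationMethod members may differ."""
    MEAN = "mean"
    PMM = "pmm"  # declared but intentionally left unimplemented here


def _mean_fill(values):
    """Toy imputer: replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


_REGISTRY = {DemoMethod.MEAN: _mean_fill}


def get_demo_imputer(method_name):
    """Hypothetical lookup distinguishing unknown from unimplemented names."""
    try:
        method = DemoMethod(method_name)
    except ValueError:
        raise ValueError(f"Unknown imputation method: {method_name!r}") from None
    if method not in _REGISTRY:
        raise ValueError(f"Method {method_name!r} is not implemented yet")
    return _REGISTRY[method]


filled = get_demo_imputer("mean")([1.0, None, 3.0])
```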

Sampler Functions

imputation.sampler.sym(x)

Ensures the input square matrix is symmetric by averaging it with its transpose.

Parameters: x (np.ndarray) – A square numpy matrix.
Returns: A symmetric matrix computed as (x + x.T) / 2.
Return type: np.ndarray
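The formula above is short enough to reproduce directly. In the sketch below, sym_sketch is a stand-in name and the input is a matrix that is slightly asymmetric, as can happen to a covariance matrix after floating-point rounding.

```python
import numpy as np


def sym_sketch(x):
    """Average a square matrix with its transpose."""
    x = np.asarray(x, dtype=float)
    return (x + x.T) / 2


# Off-diagonal entries differ by a tiny amount.
a = np.array([[2.0, 1.0], [1.0000001, 3.0]])
s = sym_sketch(a)
```

The diagonal is unchanged and each off-diagonal pair is replaced by its average.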

imputation.sampler.norm_draw(y, ry, x, rank_adjust=True, rng=None, **kwargs)

Bayesian linear regression draw of regression coefficients and residual variance, based on the least squares parameters from estimice().

This function replicates the mice.impute.norm.draw() algorithm from the R mice package, as described in Rubin (1987, p. 167).

Parameters:

y (np.ndarray) – Numeric vector of length n, containing the variable to be imputed.
ry (np.ndarray of bool) – Boolean mask vector of length n, where True indicates observed values of y and False indicates missing values.
x (np.ndarray) – Numeric design matrix of shape (n, p) with predictors for y. Must have no missing values.
rank_adjust (bool, optional) – If True, replaces any NaN coefficients with zeros. This is relevant only when the least squares method is “qr” and the predictor matrix is rank-deficient. Default is True.
rng (np.random.Generator, optional) – Random number generator for reproducibility. If None, a fresh generator is used.
**kwargs (dict) – Additional keyword arguments passed to estimice(), e.g., ls_meth to specify the least squares method.

Returns:

Dictionary containing:
- ‘coef’: Least squares coefficient estimates (numpy array).
- ‘beta’: Bayesian drawn regression coefficients (numpy array).
- ‘sigma’: Drawn residual standard deviation (float).
- ‘estimation’: Least squares method used (string).

Return type:

dict

Notes

The residual standard deviation sigma is obtained by drawing the residual variance from a scaled inverse chi-square distribution and taking its square root. The regression coefficients beta are then drawn from a multivariate normal distribution centered at the least squares estimates, with covariance scaled by sigma squared.
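The two-step draw described in these notes can be sketched with NumPy as follows. This is an illustration under simplifying assumptions (plain normal equations, no ridge penalty, no rank adjustment, full-rank predictors); bayes_draw_sketch is a hypothetical name and the code is not the package's implementation.

```python
import numpy as np


def bayes_draw_sketch(x_obs, y_obs, rng):
    """Illustrative draw of (beta, sigma) using observed rows only."""
    n, p = x_obs.shape
    v = np.linalg.inv(x_obs.T @ x_obs)  # unscaled covariance of the estimates
    coef = v @ x_obs.T @ y_obs          # least squares estimates
    resid = y_obs - x_obs @ coef
    df = max(n - p, 1)

    # sigma^2 = RSS / chi2(df), i.e. a scaled inverse chi-square draw.
    sigma = np.sqrt(resid @ resid / rng.chisquare(df))

    # beta ~ N(coef, sigma^2 * v), built from a lower Cholesky factor of v.
    chol = np.linalg.cholesky((v + v.T) / 2)
    beta = coef + sigma * (chol @ rng.standard_normal(p))
    return {"coef": coef, "beta": beta, "sigma": sigma}


rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x = np.column_stack([np.ones(50), x1])           # intercept plus one predictor
y = 1.0 + 2.0 * x1 + rng.normal(scale=0.5, size=50)
draw = bayes_draw_sketch(x, y, rng)
```

Repeated calls give different beta and sigma values around the same least squares fit, which is what propagates parameter uncertainty into the imputations.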

References

Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. Wiley, p. 167.

Examples

>>> import numpy as np
>>> from imputation.sampler import norm_draw
>>> y = np.array([1.0, 2.0, np.nan, 4.0])
>>> ry = ~np.isnan(y)  # True where y is observed
>>> x = np.array([[1, 0], [1, 1], [1, 2], [1, 3]])
>>> result = norm_draw(y, ry, x, ls_meth='qr')
>>> print(result['beta'])

imputation.sampler.estimice(x, y, ls_meth='qr', ridge=1e-05)

Computes least squares estimates, residuals, variance-covariance matrix, and degrees of freedom using different methods: ridge regression, QR decomposition, or Singular Value Decomposition.

Parameters:

x (np.ndarray) – Numeric design matrix with shape (n_samples, n_predictors). Must not contain missing values.
y (np.ndarray) – Numeric vector of responses to be imputed, with possible missing values.
ls_meth (str, optional) – Least squares method to use. Options are:
- “qr”: QR decomposition (default)
- “ridge”: Ridge regression
- “svd”: Singular Value Decomposition
ridge (float, optional) – Ridge penalty size for ridge regression. Default is 1e-5, balancing stability and bias.

Returns:

Dictionary containing:
- ‘c’: Least squares coefficient estimates (numpy array).
- ‘r’: Residuals (numpy array).
- ‘v’: Variance-covariance matrix of coefficients (numpy array).
- ‘df’: Degrees of freedom (int).
- ‘ls_meth’: Method used (str).

Return type:

dict

Examples

>>> import numpy as np
>>> from imputation.sampler import estimice
>>> x = np.array([[1, 2], [3, 4], [5, 6]])
>>> y = np.array([7, np.nan, 9])
>>> # Missing y values are handled by the caller: keep only observed rows
>>> ry = ~np.isnan(y)
>>> result = estimice(x[ry], y[ry], ls_meth="qr")
>>> print(result['c'])
[-6.   6.5]
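For comparison, the three least squares options can be approximated with plain NumPy as sketched below. The function least_squares_sketch is hypothetical, and the ridge penalty shown (a multiple of the diagonal of x'x) is an assumption about how the penalty is applied; the package may compute these differently.

```python
import numpy as np


def least_squares_sketch(x, y, ls_meth="qr", ridge=1e-5):
    """Hypothetical coefficient solver for the three documented options."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if ls_meth == "qr":
        q, r = np.linalg.qr(x)                   # x = q @ r
        return np.linalg.solve(r, q.T @ y)
    if ls_meth == "ridge":
        xtx = x.T @ x
        penalty = ridge * np.diag(np.diag(xtx))  # assumed form of the penalty
        return np.linalg.solve(xtx + penalty, x.T @ y)
    if ls_meth == "svd":
        u, s, vt = np.linalg.svd(x, full_matrices=False)
        return vt.T @ ((u.T @ y) / s)
    raise ValueError(f"Unknown ls_meth: {ls_meth!r}")


# Same observed rows as in the example above.
x_obs = np.array([[1, 2], [5, 6]])
y_obs = np.array([7, 9])
coefs = {m: least_squares_sketch(x_obs, y_obs, m) for m in ("qr", "ridge", "svd")}
```

On this small full-rank system the QR and SVD solutions coincide, while the ridge solution is shrunk very slightly by the penalty.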
