Utilities
Helper functions for configuration, validation, and predictor selection.
Logging Configuration
Logging configuration module for the imputation package.
This module sets up a package-specific logger, so users can configure logging behavior without affecting the global logging state.
- imputation.logging_config.setup_logging(level='INFO', log_dir=None, console=True, file_logging=True, format_string=None, console_level=None, file_level=None, max_bytes=5242880, backup_count=5)
Configure logging for the imputation package.
This function sets up a package-specific logger without affecting the root logger or other packages. It’s safe to call multiple times.
- Parameters:
level (str or int, default="INFO") – Base logging level for the package (‘DEBUG’, ‘INFO’, ‘WARNING’, ‘ERROR’) or logging level constant (e.g., logging.DEBUG)
log_dir (str, optional) – Directory for log files. If None, uses ‘./logs’
console (bool, default=True) – Whether to enable console logging
file_logging (bool, default=True) – Whether to enable file logging
format_string (str, optional) – Custom format string for log messages. If None, uses default format
console_level (str or int, optional) – Logging level for console handler. If None, uses ‘INFO’
file_level (str or int, optional) – Logging level for file handler. If None, uses ‘DEBUG’
max_bytes (int, default=5242880) – Maximum size of a log file in bytes before rotation (5 MB by default)
backup_count (int, default=5) – Number of backup log files to keep
- Returns:
The configured package logger
- Return type:
Examples
Basic usage:
>>> import imputation
>>> logger = imputation.setup_logging()
Custom configuration:
>>> logger = imputation.setup_logging(
...     level='DEBUG',
...     log_dir='my_logs',
...     console_level='WARNING'
... )
Disable file logging:
>>> logger = imputation.setup_logging(file_logging=False)
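The behavior described above (a package-specific logger, optional console and rotating file handlers, safe repeated calls) can be approximated with the standard library alone. The following is a minimal sketch, not the package's implementation: the function name setup_package_logging, the format string, the log file name, and the choice to set propagate to False are all assumptions made for illustration.

```python
import logging
import os
from logging.handlers import RotatingFileHandler


def setup_package_logging(name="imputation", level="INFO", log_dir=None,
                          console=True, file_logging=True,
                          console_level="INFO", file_level="DEBUG",
                          max_bytes=5 * 1024 * 1024, backup_count=5):
    """Hypothetical stand-in for setup_logging(); illustrative only."""
    logger = logging.getLogger(name)   # a named logger, never the root logger
    logger.setLevel(level)
    logger.propagate = False           # assumption: keep records off root handlers

    # Drop handlers left by earlier calls so repeated calls do not duplicate output.
    for handler in list(logger.handlers):
        logger.removeHandler(handler)
        handler.close()

    formatter = logging.Formatter(
        "%(asctime)s - %(name)s - %(levelname)s - %(message)s"  # assumed format
    )
    if console:
        stream_handler = logging.StreamHandler()
        stream_handler.setLevel(console_level)
        stream_handler.setFormatter(formatter)
        logger.addHandler(stream_handler)
    if file_logging:
        log_dir = log_dir or "./logs"
        os.makedirs(log_dir, exist_ok=True)
        file_handler = RotatingFileHandler(
            os.path.join(log_dir, f"{name}.log"),  # assumed file name
            maxBytes=max_bytes,
            backupCount=backup_count,
        )
        file_handler.setLevel(file_level)
        file_handler.setFormatter(formatter)
        logger.addHandler(file_handler)
    return logger
```

Because handlers are attached to the named logger only, the root logger and loggers of other packages keep whatever configuration they already had.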
- imputation.logging_config.get_logger(name)
Get a logger for a specific module within the imputation package.
This function returns a child logger of the main package logger, ensuring proper hierarchy and inheritance of configuration.
- Parameters:
name (str) – Module name, typically __name__ or a descriptive string
- Returns:
A logger instance for the specified module
- Return type:
Examples
In a module file:
>>> from imputation.logging_config import get_logger
>>> logger = get_logger(__name__)
>>> logger.info("This is a log message")
For simulation scripts:
>>> logger = get_logger('imputation.simulation.fdgs')
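The hierarchy and inheritance mentioned above come from the dotted-name convention of the standard logging module. The sketch below shows one way such a helper could work; get_child_logger and PACKAGE_LOGGER_NAME are hypothetical names, and the real get_logger may resolve names differently.

```python
import logging

PACKAGE_LOGGER_NAME = "imputation"  # assumed name of the package logger


def get_child_logger(name):
    """Hypothetical equivalent of get_logger(); illustrative only."""
    # Names already inside the package namespace are used unchanged.
    if name == PACKAGE_LOGGER_NAME or name.startswith(PACKAGE_LOGGER_NAME + "."):
        return logging.getLogger(name)
    # Anything else is placed under the package logger.
    return logging.getLogger(f"{PACKAGE_LOGGER_NAME}.{name}")


# Configure only the parent; the child has no level of its own.
logging.getLogger(PACKAGE_LOGGER_NAME).setLevel(logging.WARNING)
child = get_child_logger("simulation.fdgs")
print(child.name)                                    # imputation.simulation.fdgs
print(child.getEffectiveLevel() == logging.WARNING)  # True, inherited from the parent
```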
Validation Functions
- imputation.validators.check_n_imputations(n_imputations)
Check if the number of imputations is valid and provide a warning if it’s high.
- Parameters:
n_imputations (int) – Number of imputations to perform
- Raises:
ValueError – If n_imputations is not a positive integer
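A check of this kind typically combines a type-and-range test with a soft warning. The sketch below is a guess at the general shape, not the package's code: check_positive_int is a hypothetical helper, and the warning threshold is an arbitrary value chosen for the example. The same pattern would fit check_maxit.

```python
import warnings


def check_positive_int(value, name, warn_above):
    """Hypothetical helper; the warning threshold is an assumption."""
    # bool is a subclass of int, so it is rejected explicitly.
    if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
        raise ValueError(f"{name} must be a positive integer, got {value!r}")
    if value > warn_above:
        warnings.warn(f"{name}={value} is high and may be slow", UserWarning)
    return value


check_positive_int(5, "n_imputations", warn_above=100)  # passes silently
```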
- imputation.validators.check_maxit(maxit)
Check if the maximum iterations parameter is valid and provide a warning if it’s high.
- Parameters:
maxit (int) – Maximum number of iterations for each imputation cycle
- Raises:
ValueError – If maxit is not a positive integer
- imputation.validators.check_method(method, columns)
Check and process the method parameter for MICE imputation.
- Parameters:
- Returns:
Dictionary mapping each column to its imputation method
- Return type:
- Raises:
ValueError – If method is invalid or references non-existent columns
- imputation.validators.check_initial_method(initial_method)
Check if the initial imputation method is valid.
- Parameters:
initial_method (str) – Initial imputation method to validate
- Raises:
ValueError – If initial_method is not a valid initial imputation method
- imputation.validators.check_visit_sequence(visit_sequence, columns, columns_with_missing=None)
Check and process the visit sequence parameter for MICE imputation.
- Parameters:
visit_sequence (Union[str, List[str]]) – Visit sequence specification. Can be:
    - str: “monotone” or “random” for predefined sequences
    - List[str]: list of column names specifying the order to visit variables
columns (List[str]) – List of all column names in the data
columns_with_missing (List[str], optional) – List of columns that have missing values. If provided, will validate that all these columns are included in a custom visit sequence.
- Returns:
(validated_sequence, columns_without_missing) where:
    - validated_sequence: the processed visit sequence (only for list input, None for string)
    - columns_without_missing: list of columns in sequence that don’t have missing values
- Return type:
- Raises:
ValueError – If visit_sequence is invalid or references non-existent columns
Notes
For string visit sequences (“monotone”, “random”), the actual sequence will be generated in MICE._set_visit_sequence() based on the data.
For list visit sequences, this function validates that:
    1. All columns exist in the data
    2. There are no duplicate columns
    3. All columns with missing values are included (if columns_with_missing is provided)
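The three list checks can be sketched in a few lines. This is an illustrative approximation, not the package's code: check_custom_visit_sequence is a hypothetical name, the error messages are invented, and returning an empty list as the second element for string input is an assumption.

```python
def check_custom_visit_sequence(visit_sequence, columns, columns_with_missing=None):
    """Hypothetical sketch of the documented checks; illustrative only."""
    if isinstance(visit_sequence, str):
        if visit_sequence not in ("monotone", "random"):
            raise ValueError(f"Unknown visit sequence: {visit_sequence!r}")
        # The actual order is generated later from the data.
        return None, []  # assumption: no extra columns reported for strings

    # 1. Every column must exist in the data.
    unknown = [c for c in visit_sequence if c not in columns]
    if unknown:
        raise ValueError(f"Columns not found in data: {unknown}")
    # 2. No column may appear twice.
    if len(set(visit_sequence)) != len(visit_sequence):
        raise ValueError("Visit sequence contains duplicate columns")
    # 3. Every column with missing values must be visited.
    without_missing = []
    if columns_with_missing is not None:
        omitted = [c for c in columns_with_missing if c not in visit_sequence]
        if omitted:
            raise ValueError(f"Columns with missing values not in sequence: {omitted}")
        without_missing = [c for c in visit_sequence if c not in columns_with_missing]
    return list(visit_sequence), without_missing


sequence, extra = check_custom_visit_sequence(
    ["b", "a", "c"], columns=["a", "b", "c"], columns_with_missing=["a", "b"]
)
print(sequence)  # ['b', 'a', 'c']
print(extra)     # ['c']
```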
- imputation.validators.validate_predictor_matrix(predictor_matrix, data_columns, data)
Validate predictor matrix for MICE imputation.
- Parameters:
predictor_matrix (pd.DataFrame) – Binary matrix indicating which variables should be used as predictors for each target variable. Rows represent target variables, columns represent predictors. A 1 indicates that the column variable is used as predictor for the index variable.
data_columns (List[str]) – List of column names in the data to validate against
data (pd.DataFrame) – The data to check for missing values
- Returns:
Validated predictor matrix
- Return type:
pd.DataFrame
- Raises:
ValueError – If predictor_matrix has invalid structure or column names don’t match data
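To make the row/column convention concrete, the sketch below builds a small predictor matrix and runs two structural checks on it. It is only an approximation: check_predictor_matrix is a hypothetical name, the requirement that labels match data_columns in the same order is an assumption, and the missing-value checks that the real function performs with its data argument are not reproduced.

```python
import pandas as pd


def check_predictor_matrix(predictor_matrix, data_columns):
    """Hypothetical structural check; illustrative only."""
    if not isinstance(predictor_matrix, pd.DataFrame):
        raise ValueError("predictor_matrix must be a pandas DataFrame")
    expected = list(data_columns)
    # Assumption: rows and columns must both match the data columns, in order.
    if list(predictor_matrix.index) != expected or list(predictor_matrix.columns) != expected:
        raise ValueError("Row and column labels must match the data columns")
    # Entries must be binary.
    if not predictor_matrix.isin([0, 1]).all().all():
        raise ValueError("predictor_matrix may contain only 0 and 1")
    return predictor_matrix.astype(int)


cols = ["age", "bmi", "chl"]
pm = pd.DataFrame(1, index=cols, columns=cols)
for col in cols:
    pm.loc[col, col] = 0  # a variable does not predict itself

checked = check_predictor_matrix(pm, cols)
print(checked.loc["bmi"])  # row "bmi": age and chl are used to impute bmi
```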
- imputation.validators.validate_columns(data)
Validate and clean columns in the DataFrame.
Checks for columns with only NaN values and drops them with appropriate warnings.
- Parameters:
data (pd.DataFrame) – Input DataFrame to validate
- Returns:
DataFrame with invalid columns removed
- Return type:
pd.DataFrame
- Warns:
UserWarning – If columns with only NaN values are found and dropped
Notes
Missing data values that are treated as NaN:
    - pandas NaN (numpy.nan)
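Dropping all-NaN columns with a warning can be expressed directly in pandas. The sketch below is a plausible reading of the description, not the package's code; drop_all_nan_columns and the warning text are hypothetical.

```python
import warnings

import numpy as np
import pandas as pd


def drop_all_nan_columns(data):
    """Hypothetical sketch of the documented behavior; illustrative only."""
    # A column is invalid when every one of its values is NaN.
    empty = [col for col in data.columns if data[col].isna().all()]
    if empty:
        warnings.warn(f"Dropping columns with only NaN values: {empty}", UserWarning)
    return data.drop(columns=empty)


df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, np.nan]})
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    cleaned = drop_all_nan_columns(df)
print(list(cleaned.columns))  # ['a']
```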
- imputation.validators.validate_dataframe(data)
Check and validate input data for MICE imputation.
- Parameters:
data (Any) – Input data to be checked and converted to DataFrame
- Returns:
Validated and cleaned DataFrame
- Return type:
pd.DataFrame
- Raises:
ValueError – If data cannot be converted to DataFrame or has duplicate column names
Notes
Missing data values that are treated as NaN:
    - pandas NaN (numpy.nan)
- imputation.validators.validate_formula(formula, columns)
Validate that all variables in the formula exist in the dataset columns.
- Parameters:
- Raises:
ValueError – If any variables in the formula are not found in the columns
Utility Functions
- imputation.utils.get_imputer_func(method_name)
Return the imputer callable for method_name.
- Parameters:
method_name (str) – Name of the imputation method. Must be one of the values defined in imputation.constants.ImputationMethod.
- Returns:
The function implementing the requested imputation strategy.
- Return type:
Callable
- Raises:
ValueError – If method_name is unknown or not yet implemented.
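A lookup that distinguishes "unknown" from "not yet implemented" is commonly built from an enum plus a registry dictionary. The sketch below illustrates that pattern under stated assumptions: the Method enum, its members, impute_mean, and lookup_imputer are all hypothetical and do not reflect the contents of imputation.constants.ImputationMethod.

```python
from enum import Enum


class Method(Enum):
    """Hypothetical stand-in for ImputationMethod; members are invented."""
    MEAN = "mean"
    PMM = "pmm"  # declared but deliberately left without an implementation


def impute_mean(values):
    """Toy imputer: replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in values]


_REGISTRY = {Method.MEAN: impute_mean}


def lookup_imputer(method_name):
    """Hypothetical equivalent of get_imputer_func(); illustrative only."""
    try:
        method = Method(method_name)
    except ValueError:
        raise ValueError(f"Unknown imputation method: {method_name!r}") from None
    if method not in _REGISTRY:
        raise ValueError(f"Imputation method {method_name!r} is not yet implemented")
    return _REGISTRY[method]


imputer = lookup_imputer("mean")
print(imputer([1.0, None, 3.0]))  # [1.0, 2.0, 3.0]
```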
Sampler Functions
- imputation.sampler.sym(x)
Ensures the input square matrix is symmetric by averaging it with its transpose.
- Parameters:
x (np.ndarray) – A square numpy matrix.
- Returns:
A symmetric matrix computed as (x + x.T) / 2.
- Return type:
np.ndarray
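The formula (x + x.T) / 2 is short enough to check by hand. The sketch below re-implements it under the hypothetical name symmetrize, only to show the effect on a small non-symmetric matrix.

```python
import numpy as np


def symmetrize(x):
    """Average a square matrix with its transpose (illustrative re-implementation)."""
    x = np.asarray(x, dtype=float)
    return (x + x.T) / 2


a = np.array([[1.0, 2.0],
              [4.0, 3.0]])
s = symmetrize(a)
print(s)  # both off-diagonal entries become (2 + 4) / 2 = 3.0
```

Symmetrizing in this way removes small asymmetries from floating-point round-off before a matrix is passed to a routine, such as a Cholesky factorization, that expects a symmetric input.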
- imputation.sampler.norm_draw(y, ry, x, rank_adjust=True, rng=None, **kwargs)
Bayesian linear regression draw of regression coefficients and residual variance, based on the least squares parameters from estimice().
This function replicates the mice.impute.norm.draw() algorithm from the R mice package, as described in Rubin (1987, p. 167).
- Parameters:
y (np.ndarray) – Numeric vector of length n, containing the variable to be imputed.
ry (np.ndarray of bool) – Boolean mask vector of length n, where True indicates observed values of y and False indicates missing values.
x (np.ndarray) – Numeric design matrix of shape (n, p) with predictors for y. Must have no missing values.
rank_adjust (bool, optional) – If True, replaces any NaN coefficients with zeros. This is relevant only when the least squares method is “qr” and the predictor matrix is rank-deficient. Default is True.
rng (np.random.Generator, optional) – Random number generator for reproducibility. If None, a fresh generator is used.
**kwargs (dict) – Additional keyword arguments passed to estimice(), e.g., ls_meth to specify the least squares method.
- Returns:
Dictionary containing:
    - ‘coef’: Least squares coefficient estimates (numpy array).
    - ‘beta’: Bayesian drawn regression coefficients (numpy array).
    - ‘sigma’: Drawn residual standard deviation (float).
    - ‘estimation’: Least squares method used (string).
- Return type:
Notes
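The residual standard deviation sigma is drawn by dividing the residual sum of squares by a chi-square draw and taking the square root, so the residual variance follows a scaled inverse chi-square distribution. The regression coefficients beta are drawn from a multivariate normal distribution centered at the least squares estimates, with covariance equal to sigma squared times the variance-covariance matrix from estimice().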
References
Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. Wiley, p. 167.
Examples
>>> import numpy as np
>>> from imputation.sampler import norm_draw
>>> y = np.array([1.0, 2.0, np.nan, 4.0])
>>> ry = ~np.isnan(y)
>>> x = np.array([[1, 0], [1, 1], [1, 2], [1, 3]])
>>> result = norm_draw(y, ry, x, ls_meth='qr')
>>> print(result['beta'])
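For readers who want to see the draw itself, the sketch below restates the steps of the R routine mice.impute.norm.draw() in NumPy. It is a simplified illustration, not the package's norm_draw: bayes_lm_draw is a hypothetical name, plain least squares replaces estimice(), and no rank adjustment or ridge fallback is included, so the design matrix is assumed to have full column rank.

```python
import numpy as np


def bayes_lm_draw(y, ry, x, rng=None):
    """Hypothetical sketch of a Bayesian linear regression draw; illustrative only."""
    rng = np.random.default_rng() if rng is None else rng
    x_obs, y_obs = x[ry], y[ry]          # fit on observed rows only

    # Least squares estimates and residuals.
    coef, *_ = np.linalg.lstsq(x_obs, y_obs, rcond=None)
    resid = y_obs - x_obs @ coef
    df = max(len(y_obs) - x_obs.shape[1], 1)

    # (X'X)^-1, symmetrized to remove round-off asymmetry.
    v = np.linalg.inv(x_obs.T @ x_obs)
    v = (v + v.T) / 2

    # sigma* = sqrt(RSS / chi-square draw with df degrees of freedom).
    sigma = np.sqrt(resid @ resid / rng.chisquare(df))
    # beta* = coef + sigma* * L z, where L L' = (X'X)^-1 and z is standard normal.
    z = rng.standard_normal(x_obs.shape[1])
    beta = coef + np.linalg.cholesky(v) @ z * sigma
    return {"coef": coef, "beta": beta, "sigma": float(sigma)}


y = np.array([1.0, 2.0, np.nan, 4.0, 5.5, 6.0])
ry = ~np.isnan(y)
x = np.column_stack([np.ones(6), np.arange(6.0)])  # intercept plus one predictor
draw = bayes_lm_draw(y, ry, x, rng=np.random.default_rng(0))
print(draw["coef"], draw["beta"], draw["sigma"])
```

Passing a seeded generator makes the draw reproducible, which is the purpose of the rng parameter documented above.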
- imputation.sampler.estimice(x, y, ls_meth='qr', ridge=1e-05)
Computes least squares estimates, residuals, variance-covariance matrix, and degrees of freedom using different methods: ridge regression, QR decomposition, or Singular Value Decomposition.
- Parameters:
x (np.ndarray) – Numeric design matrix with shape (n_samples, n_predictors). Must not contain missing values.
y (np.ndarray) – Numeric vector of responses to be imputed, with possible missing values.
ls_meth (str, optional) – Least squares method to use. Options are:
    - “qr”: QR decomposition (default)
    - “ridge”: Ridge regression
    - “svd”: Singular Value Decomposition
ridge (float, optional) – Ridge penalty size for ridge regression. Default is 1e-5, balancing stability and bias.
- Returns:
Dictionary containing:
    - ‘c’: Least squares coefficient estimates (numpy array).
    - ‘r’: Residuals (numpy array).
    - ‘v’: Variance-covariance matrix of coefficients (numpy array).
    - ‘df’: Degrees of freedom (int).
    - ‘ls_meth’: Method used (str).
- Return type:
Examples
>>> import numpy as np
>>> from imputation.sampler import estimice
>>> x = np.array([[1, 2], [3, 4], [5, 6]])
>>> y = np.array([7, np.nan, 9])
>>> # Missing values in y are handled externally: keep only the observed rows
>>> ry = ~np.isnan(y)
>>> result = estimice(x[ry], y[ry], ls_meth="qr")
>>> print(result['c'])
[-6. 6.5]
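The three least squares methods differ mainly in how they factor the design matrix. The sketch below shows one way to compute each with NumPy; it is an illustration under assumptions, not the package's estimice: least_squares_fit is a hypothetical name, the design matrix is assumed to have full column rank, and scaling the ridge penalty by the diagonal of X'X follows the R mice package rather than anything stated on this page.

```python
import numpy as np


def least_squares_fit(x, y, ls_meth="qr", ridge=1e-5):
    """Hypothetical sketch of the three methods; illustrative only."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if ls_meth == "qr":
        q, r = np.linalg.qr(x)                 # reduced QR: x = q @ r
        coef = np.linalg.solve(r, q.T @ y)
        v = np.linalg.inv(r.T @ r)             # r.T @ r equals x.T @ x
    elif ls_meth == "ridge":
        xtx = x.T @ x
        penalty = np.diag(ridge * np.diag(xtx))  # assumed penalty scaling
        v = np.linalg.inv(xtx + penalty)
        coef = v @ x.T @ y
    elif ls_meth == "svd":
        u, d, vt = np.linalg.svd(x, full_matrices=False)
        coef = vt.T @ ((u.T @ y) / d)
        v = np.linalg.inv(vt.T @ np.diag(d ** 2) @ vt)
    else:
        raise ValueError(f"Unknown least squares method: {ls_meth!r}")
    resid = y - x @ coef
    df = max(len(y) - x.shape[1], 1)
    return {"c": coef, "r": resid, "v": v, "df": df, "ls_meth": ls_meth}


x = np.array([[1, 0], [1, 1], [1, 2], [1, 3]])
y = np.array([1.0, 3.0, 5.0, 7.0])             # exactly y = 1 + 2 * x
for method in ("qr", "ridge", "svd"):
    fit = least_squares_fit(x, y, ls_meth=method)
    print(method, fit["c"], fit["df"])         # coefficients close to [1, 2], df = 2
```

With a penalty as small as 1e-5 the ridge estimates stay very close to the unpenalized ones while keeping the matrix inversion stable when predictors are nearly collinear.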
See Also
Predictor Matrices for guidance on using quickpred
Best Practices for logging recommendations