Optimizer functions#
- regmmd.optimizers._gd_backtracking_lg_loc_tilde_regression(X, y, par_v, par_c, n_step=1000, stepsize=1.0, bandwidth=1.0, alpha=0.8, eps_gd=1e-05)[source]#
Fit a LinearGaussianLoc regression model via exact gradient descent with backtracking line search on the tilde MMD criterion with a Gaussian kernel.
Computes the exact tilde MMD gradient analytically for a linear Gaussian location model with known variance par_c and a Gaussian kernel, avoiding Monte Carlo sampling entirely. In this case, the full expectation of the kernel * score can be calculated explicitly. The objective and gradient therefore reduce to closed-form expressions in the residuals, enabling deterministic (non-stochastic) gradient descent with backtracking.
- Parameters:
X (np.array, shape (n_samples, n_features)) – Training input samples (design matrix, without intercept column).
y (np.array, shape (n_samples,)) – Observed target values.
par_v (np.array, shape (n_features,)) – Initial value of the regression coefficients (beta).
par_c (float) – Known variance of the Gaussian noise (phi = sigma^2). Not optimized.
n_step (int, default=1000) – Maximum number of gradient descent iterations.
stepsize (float, default=1.0) – Initial step size. Shared across iterations: once reduced by backtracking, the smaller value carries forward to the next step.
bandwidth (float or str, default=1.0) – Bandwidth \(\gamma\) for the Gaussian kernel applied to y. If "auto", selected using the median heuristic on y.
alpha (float, default=0.8) – Backtracking reduction factor. Must satisfy 0 < alpha < 1.
eps_gd (float, default=1e-5) – Convergence tolerance on the relative change in the objective: stops when log|f_new - f_old| - log|f_old| < log(eps_gd).
- Returns:
res – Dictionary containing:
- par_v_init: initial regression coefficients.
- par_c_init: known variance.
- stepsize: initial step size.
- bandwidth: bandwidth used (resolved if "auto").
- estimator: last parameter iterate (point estimate).
- trajectory: parameter trajectory of shape (n_features, n_step_done + 1).
- convergence: 0 if converged, 1 if max iterations reached.
- Return type:
MMDResult
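The backtracking loop shared by the _gd_backtracking_* routines can be sketched generically. The following is an illustrative reimplementation, not the library's code; the stopping rule is the relative-change criterion from the docstrings, written in its equivalent non-log form:

```python
import numpy as np

def gd_backtracking(f, grad, x0, n_step=1000, stepsize=1.0, alpha=0.8, eps_gd=1e-5):
    """Gradient descent with backtracking line search; once the step size
    is reduced, the smaller value carries forward to later iterations."""
    x = np.asarray(x0, dtype=float)
    f_old = f(x)
    for _ in range(n_step):
        g = grad(x)
        # Backtracking: shrink the step until the objective decreases.
        while f(x - stepsize * g) >= f_old and stepsize > 1e-12:
            stepsize *= alpha
        x = x - stepsize * g
        f_new = f(x)
        # Equivalent to: log|f_new - f_old| - log|f_old| < log(eps_gd).
        if abs(f_new - f_old) < eps_gd * abs(f_old):
            return x, 0  # converged
        f_old = f_new
    return x, 1  # maximum number of iterations reached

# Toy quadratic with minimum at (1, -2).
target = np.array([1.0, -2.0])
f = lambda x: float(np.sum((x - target) ** 2))
grad = lambda x: 2 * (x - target)
x_hat, conv = gd_backtracking(f, grad, np.zeros(2))
```

The sketch keeps the documented behavior that a step size reduced by backtracking persists into later iterations rather than resetting.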
- regmmd.optimizers._gd_backtracking_lg_tilde_regression(X, y, par_v, n_step=1000, stepsize=1.0, bandwidth=1.0, alpha=0.8, eps_gd=1e-05)[source]#
Fit a LinearGaussian regression model via exact gradient descent with backtracking line search on the tilde MMD criterion with a Gaussian kernel.
Extends _gd_backtracking_lg_loc_tilde_regression() to the full model where both the regression coefficients beta and the noise variance phi are optimized jointly. The variance is reparametrized as log(phi) for unconstrained optimization.
- Parameters:
X (np.array, shape (n_samples, n_features)) – Training input samples (design matrix, without intercept column).
y (np.array, shape (n_samples,)) – Observed target values.
par_v (np.array, shape (n_features + 1,)) – Initial parameter vector [beta_0, ..., beta_p, phi] where phi > 0 is the noise variance.
n_step (int, default=1000) – Maximum number of gradient descent iterations.
stepsize (float, default=1.0) – Initial step size. Carried across iterations: once reduced by backtracking, the smaller value persists to the next step.
bandwidth (float or str, default=1.0) – Bandwidth h for the Gaussian kernel applied to y. If "auto", selected using the median heuristic on y.
alpha (float, default=0.8) – Backtracking reduction factor. Must satisfy 0 < alpha < 1.
eps_gd (float, default=1e-5) – Convergence tolerance on the relative change in the objective: stops when log|f_new - f_old| - log|f_old| < log(eps_gd).
- Returns:
res – Dictionary containing:
- par_v_init: initial parameter vector [beta, phi].
- par_c_init: None (no fixed parameters).
- stepsize: initial step size.
- bandwidth: bandwidth used (resolved if "auto").
- estimator: last parameter iterate [beta, phi].
- trajectory: parameter trajectory of shape (n_features + 1, n_step_done + 1), stored in the original [beta, phi] space (not log-space).
- convergence: 0 if converged, 1 if max iterations reached.
- Return type:
MMDResult
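The log(phi) reparametrization amounts to a chain-rule rescaling of the variance gradient. A minimal check, using a made-up smooth objective as a stand-in for the MMD criterion:

```python
import numpy as np

# Hypothetical smooth objective in the variance phi > 0 (stand-in for the MMD criterion).
f = lambda phi: (phi - 2.0) ** 2 + np.log(phi)
df_dphi = lambda phi: 2.0 * (phi - 2.0) + 1.0 / phi

phi = 0.7
theta = np.log(phi)                    # unconstrained parameter theta = log(phi)
# Chain rule: df/dtheta = (df/dphi) * (dphi/dtheta) = phi * df/dphi.
grad_theta = phi * df_dphi(phi)

# Finite-difference check in theta.
h = 1e-6
fd = (f(np.exp(theta + h)) - f(np.exp(theta - h))) / (2 * h)
```

Optimizing theta = log(phi) without constraints guarantees phi = exp(theta) > 0 at every iterate, which is why the trajectory can still be reported in the original [beta, phi] space.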
- regmmd.optimizers._gd_backtracking_logistic_tilde_regression(X, y, par_v, n_step=1000, stepsize=1.0, bandwidth=1.0, kernel='Gaussian', alpha=0.8, eps_gd=1e-05)[source]#
Fit a Logistic regression model via exact gradient descent with backtracking line search on the tilde MMD criterion.
Computes the exact tilde MMD gradient analytically for a logistic model, avoiding Monte Carlo sampling entirely. Since \(Y | X \sim \text{Bernoulli}(p)\) with \(p = \text{sigmoid}(X^{\top}\beta)\), the expectations involving \(Y\) reduce to closed-form expressions in \(p_i\), enabling deterministic gradient descent with backtracking.
- Parameters:
X (np.array, shape (n_samples, n_features)) – Training input samples (design matrix, without intercept column).
y (np.array, shape (n_samples,)) – Binary target values in {0, 1}.
par_v (np.array, shape (n_features,)) – Initial value of the regression coefficients (beta).
n_step (int, default=1000) – Maximum number of gradient descent iterations.
stepsize (float, default=1.0) – Initial step size. Carried across iterations: once reduced by backtracking, the smaller value persists to the next step.
bandwidth (float or str, default=1.0) – Bandwidth for the kernel applied to y. If "auto", selected using the median heuristic on y.
kernel (str, default="Gaussian") – Kernel type. Supported options are "Gaussian", "Laplace", and "Cauchy".
alpha (float, default=0.8) – Backtracking reduction factor. Must satisfy 0 < alpha < 1.
eps_gd (float, default=1e-5) – Convergence tolerance on the relative change in the objective: stops when log|f_new - f_old| - log|f_old| < log(eps_gd).
- Returns:
res – Dictionary containing:
- par_v_init: initial regression coefficients.
- par_c_init: None (no fixed parameters).
- stepsize: initial step size.
- bandwidth: bandwidth used (resolved if "auto").
- estimator: last parameter iterate (point estimate).
- trajectory: parameter trajectory of shape (n_features, n_step_done + 1).
- convergence: 0 if converged, 1 if max iterations reached.
- Return type:
MMDResult
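The closed-form expectations exploited here are easy to verify: on binary targets any translation-invariant kernel takes only two values, so \(\mathbb{E}[k(Y_i, Y_j)]\) is a finite sum over the four outcomes. A sketch (the off-diagonal kernel value c below assumes one common Gaussian-kernel convention, not necessarily the library's):

```python
import numpy as np

def expected_kernel(p_i, p_j, c):
    """E[k(Y_i, Y_j)] for independent Y_i ~ Bernoulli(p_i), Y_j ~ Bernoulli(p_j),
    where the kernel takes value 1 on equal binary labels and c otherwise."""
    same = p_i * p_j + (1.0 - p_i) * (1.0 - p_j)   # P(Y_i == Y_j)
    return same + c * (1.0 - same)

# For a Gaussian kernel, one common convention gives c = exp(-1 / (2 * bw**2)).
bw = 1.0
c = np.exp(-1.0 / (2.0 * bw ** 2))

# Exhaustive check over the four binary outcomes.
p_i, p_j = 0.3, 0.8
k = lambda a, b: 1.0 if a == b else c
brute = sum(
    (p_i if a else 1 - p_i) * (p_j if b else 1 - p_j) * k(a, b)
    for a in (0, 1) for b in (0, 1)
)
```

Because these expectations are exact functions of the sigmoid probabilities, no Monte Carlo sampling of Y is needed and the descent is deterministic.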
- regmmd.optimizers._gd_exact_logistic_hat_regression(X, y, par_v, par_c=None, kernel='Gaussian', kernel_x='Laplace', burn_in=500, n_step=1000, stepsize=1.0, bandwidth_y='auto', bandwidth_x='auto', c_det=0.2, c_rand=0.1, epsilon=0.0001, eps_sq=1e-05, rng=np.random.default_rng(10))[source]#
Fit a Logistic regression model using the hat estimator via AdaGrad with exact (analytical) gradients, as described in Universal Robust Regression via Maximum Mean Discrepancy, Alquier and Gerber (2024).
Replaces the Monte Carlo Y-sampling in _sgd_hat_regression() with closed-form gradient expressions for the logistic model. Since \(Y | X \sim \text{Bernoulli}(p)\) with \(p = \text{sigmoid}(X^{\top} \beta)\), the expectations \(\mathbb{E}[k_Y(Y_i, Y_j)]\) and \(\mathbb{E}[k_Y(Y_i, y_i)]\) reduce to simple functions of \(p_i\) and \(p_j\), so the gradient is computed analytically for each pair \((i, j)\).
- Parameters:
X (np.array, shape (n_samples, n_features)) – Training input samples.
y (np.array, shape (n_samples,)) – Binary target values in {0, 1}.
par_v (np.array, shape (n_features,)) – Initial regression coefficients (beta).
par_c (array or None, default=None) – Unused; included for API consistency.
kernel (str, default="Gaussian") – Kernel applied to y. Supported options are "Gaussian", "Laplace", and "Cauchy".
kernel_x (str, default="Laplace") – Kernel applied to X. Supported options are "Gaussian", "Laplace", and "Cauchy".
burn_in (int, default=500) – Number of burn-in AdaGrad iterations (parameters not averaged).
"Gaussian","Laplace", and"Cauchy".burn_in (int, default=500) – Number of burn-in AdaGrad iterations (parameters not averaged).
n_step (int, default=1000) – Maximum number of averaging iterations after burn-in.
stepsize (float, default=1.0) – Initial AdaGrad step size.
bandwidth_y (float or str, default="auto") – Bandwidth for the kernel on y. If "auto", uses the median heuristic.
bandwidth_x (float or str, default="auto") – Bandwidth for the kernel on X. If "auto", uses the median heuristic.
"auto", uses the median heuristic.c_det (float, default=0.2) – Fraction of n determining the number of deterministic nearest pairs.
c_rand (float, default=0.1) – Fraction of n determining the number of randomly sampled distant pairs.
epsilon (float, default=1e-4) – Initial accumulated squared gradient norm for AdaGrad stability.
eps_sq (float, default=1e-5) – Convergence threshold; stops when log(avg_grad_norm) < log(eps_sq).
rng (np.random.Generator, default=np.random.default_rng(10)) – Random number generator for sampling distant pairs.
- Returns:
res – Dictionary containing:
- par_v_init: initial regression coefficients.
- par_c_init: None.
- stepsize: step size used.
- bandwidth_y: bandwidth used for y.
- bandwidth_x: bandwidth used for X.
- estimator: Polyak-Ruppert cumulative average of iterates.
- trajectory: cumulative-average trajectory of shape (n_features, n_step_done).
- convergence: 0 if converged, 1 if max iterations reached, -1 if NaN.
- Return type:
MMDResult
- regmmd.optimizers._gd_gaussian_loc_exact_estimation(X, par_v, par_c, burn_in=500, n_step=1000, stepsize=1.0, bandwidth=1.0, epsilon=0.0001)[source]#
Estimate the location parameter of a Gaussian model with a Gaussian kernel using exact MMD gradient descent.
Computes the exact MMD gradient analytically for a Gaussian location model with known scale parameter par_c and a Gaussian kernel, avoiding Monte Carlo sampling. Uses AdaGrad updates with a burn-in phase followed by Polyak-Ruppert averaging.
- Parameters:
X (np.array, shape (n_samples,)) – Univariate observed data.
par_v (float) – Initial value of the location parameter (mean) to be estimated.
par_c (float) – Known scale parameter (standard deviation) of the Gaussian model.
burn_in (int, default=500) – Number of burn-in iterations during which parameter iterates are not averaged.
n_step (int, default=1000) – Number of averaging iterations following the burn-in phase.
stepsize (float, default=1.0) – Initial step size for the AdaGrad update.
bandwidth (float or str, default=1.0) – Bandwidth parameter for the Gaussian kernel. If "auto", the bandwidth is selected using the median heuristic.
epsilon (float, default=1e-4) – Initial accumulated squared gradient norm, used to stabilize the AdaGrad step size at the start of optimization.
- Returns:
res – Dictionary containing:
- par_v_init: initial location parameter.
- par_c_init: initial scale parameter.
- stepsize: step size used.
- bandwidth: bandwidth used (resolved if "auto").
- estimator: Polyak-Ruppert average of location parameter iterates.
- trajectory: parameter trajectory of shape (burn_in + n_step + 1,).
- Return type:
MMDResult
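The AdaGrad/Polyak-Ruppert pattern shared by the stochastic routines can be sketched as follows. This is an illustrative scalar version with a noisy quadratic gradient, not the library's implementation:

```python
import numpy as np

def adagrad_polyak(grad, x0, burn_in=500, n_step=1000, stepsize=1.0,
                   epsilon=1e-4, rng=np.random.default_rng(0)):
    """AdaGrad with a burn-in phase, then Polyak-Ruppert averaging."""
    x = float(x0)
    acc = epsilon                       # accumulated squared gradient norm
    avg, n_avg = 0.0, 0
    for t in range(burn_in + n_step):
        g = grad(x, rng)
        acc += g ** 2
        x -= stepsize * g / np.sqrt(acc)
        if t >= burn_in:                # average only after burn-in
            n_avg += 1
            avg += (x - avg) / n_avg    # running mean of the iterates
    return avg

# Noisy gradient of (x - 3)^2 / 2: the minimizer is x = 3.
noisy_grad = lambda x, rng: (x - 3.0) + 0.1 * rng.standard_normal()
est = adagrad_polyak(noisy_grad, 0.0)
```

Averaging only the post-burn-in iterates discards the large early steps, so the returned estimate reflects the stationary behavior of the iterates rather than the transient.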
- regmmd.optimizers._get_grad_estimate(set_1, set_2, X, K_X, y_sampled_1, y_sampled_2, y, model, kernel_y, bandwidth_y)[source]#
Compute a partial gradient estimate for the hat estimator objective.
Evaluates the gradient contribution from a specified subset of observation pairs \((i, j)\). When set_1 and set_2 are provided, the gradient is weighted by the covariate kernel \(k_X\) evaluated at those pairs. When both are None, the diagonal term \((i = j)\) is computed without covariate kernel weighting.
- Parameters:
set_1 (np.array[np.int32] or None) – Row indices of the first element of each pair. If None, the diagonal \((i = j)\) contribution is computed.
set_2 (np.array[np.int32] or None) – Row indices of the second element of each pair. If None, the diagonal \((i = j)\) contribution is computed.
X (np.array, shape (n_samples, n_features)) – Training input samples.
K_X (np.array or None) – Precomputed covariate kernel values for the pairs defined by set_1 and set_2. Ignored when set_1 is None.
y_sampled_1 (np.array, shape (n_samples,)) – First set of samples drawn from the model’s conditional distribution.
y_sampled_2 (np.array, shape (n_samples,)) – Second set of samples drawn from the model’s conditional distribution.
y (np.array, shape (n_samples,)) – Observed target values.
model (RegressionModel) – The regression model. Must implement score.
kernel_y (str) – Kernel type applied to the target variable y.
bandwidth_y (float) – Bandwidth for the kernel applied to y.
- Returns:
grad_estimate – Gradient estimate contributions from the specified pairs. Shape is (n_params,) when set_1 is None (diagonal term), or (len(set_1), n_params) for off-diagonal pairs.
- Return type:
np.array
- regmmd.optimizers._median_heuristic(X)[source]#
Compute the median heuristic for kernel bandwidth selection.
Estimates the bandwidth as the median of pairwise Euclidean distances divided by sqrt(2), a common data-driven heuristic for kernel methods.
- Parameters:
X (np.array, shape (n_samples,) or (n_samples, n_features)) – Input data.
- Returns:
median_dist – Estimated bandwidth. Returns 1 if fewer than two samples are provided.
- Return type:
float
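An illustrative reimplementation matching this description (median pairwise Euclidean distance over sqrt(2), falling back to 1 for fewer than two samples); the library's implementation may differ in details:

```python
import numpy as np

def median_heuristic(X):
    """Median pairwise Euclidean distance divided by sqrt(2);
    returns 1 when fewer than two samples are given."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]                 # treat a 1-D input as univariate samples
    n = X.shape[0]
    if n < 2:
        return 1.0
    # Pairwise distances over the strict upper triangle (each pair once).
    i, j = np.triu_indices(n, k=1)
    dist = np.linalg.norm(X[i] - X[j], axis=1)
    return float(np.median(dist)) / np.sqrt(2)
```

For example, on the points 0, 1, 2 the pairwise distances are 1, 1, 2, so the heuristic returns 1/sqrt(2).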
- regmmd.optimizers._sgd_estimation(X, par_v, par_c, model, kernel, burn_in=500, n_step=1000, stepsize=1, bandwidth=1.0, epsilon=0.0001)[source]#
Estimate model parameters via AdaGrad stochastic gradient descent on the MMD criterion.
Minimizes the MMD between the empirical distribution of X and the model’s distribution using a stochastic gradient algorithm. The optimization runs a burn-in phase (without averaging) followed by a main phase with Polyak-Ruppert averaging of the iterates.
- Parameters:
X (np.array, shape (n_samples,) or (n_samples, n_features)) – Observed data used to fit the model.
par_v (np.array) – Initial values of the variable (optimized) model parameters.
par_c (np.array) – Constant model parameters that are not optimized.
model (EstimationModel) – The statistical model to fit. Must implement sample_n, score, update, and _project_params.
kernel (str) – Kernel type used for the MMD computation. Supported options are "Gaussian", "Laplace", and "Cauchy".
"Gaussian","Laplace", and"Cauchy".burn_in (int, default=500) – Number of burn-in iterations during which parameter iterates are not averaged.
n_step (int, default=1000) – Number of averaging iterations following the burn-in phase.
stepsize (float, default=1) – Initial step size for the AdaGrad update.
bandwidth (float or str, default=1.0) – Bandwidth parameter for the kernel. If "auto", the bandwidth is selected using the median heuristic.
epsilon (float, default=1e-4) – Initial accumulated squared gradient norm, used to stabilize the AdaGrad step size at the start of optimization.
- Returns:
res – Dictionary containing:
- par_v_init: initial variable parameters.
- par_c_init: initial constant parameters.
- stepsize: step size used.
- bandwidth: bandwidth used (resolved if "auto").
- estimator: Polyak-Ruppert average of parameter iterates.
- trajectory: parameter trajectory of shape (n_params, burn_in + n_step + 1).
- Return type:
MMDResult
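For a univariate Gaussian location model, the stochastic MMD gradient can be formed with the score-function (REINFORCE) identity \(\nabla_\mu \mathbb{E}[g(Y)] = \mathbb{E}[g(Y)\, s_\mu(Y)]\). A self-contained sketch under assumed conventions (Gaussian kernel, known scale), not the library's code:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(2.0, 1.0, size=200)           # observed data from N(2, 1)
sigma, bw = 1.0, 1.0                         # known scale and kernel bandwidth
k = lambda a, b: np.exp(-(a - b) ** 2 / (2 * bw ** 2))

def mmd_grad(mu, m=50):
    """Score-function estimate of d MMD^2 / d mu for the model N(mu, sigma^2):
    E[k(Y, Y')(s(Y) + s(Y'))] - 2 E[k(Y, X) s(Y)], with s(y) = (y - mu) / sigma^2."""
    y1 = rng.normal(mu, sigma, m)
    y2 = rng.normal(mu, sigma, m)
    s = lambda y: (y - mu) / sigma ** 2
    model_term = np.mean(k(y1, y2) * (s(y1) + s(y2)))
    data_term = 2.0 * np.mean(k(y1[:, None], X[None, :]) * s(y1)[:, None])
    return model_term - data_term

# AdaGrad loop with Polyak-Ruppert averaging after a burn-in.
mu, acc, avg, n_avg = 0.0, 1e-4, 0.0, 0
for t in range(2000):
    g = mmd_grad(mu)
    acc += g ** 2
    mu -= 1.0 * g / np.sqrt(acc)
    if t >= 500:
        n_avg += 1
        avg += (mu - avg) / n_avg
```

Each gradient evaluation draws fresh samples from the model, so the cost per iteration is linear in the sample sizes rather than quadratic in n.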
- regmmd.optimizers._sgd_hat_regression(X, y, par_v, par_c, model, kernel_y, kernel_x='Laplace', burn_in=500, n_step=1000, stepsize=1, bandwidth_y='auto', bandwidth_x='auto', c_det=0.2, c_rand=0.1, epsilon=0.0001, eps_sq=1e-05, rng=np.random.default_rng(10))[source]#
Fit a regression model using the hat estimator via stochastic gradient descent, as described in Universal Robust Regression via Maximum Mean Discrepancy, Alquier and Gerber (2024).
Minimizes the MMD objective using the product kernel \(k = k_X \otimes k_Y\). The gradient is approximated efficiently by splitting pairs \((X_i, X_j)\) into three groups: all diagonal pairs, the M_det closest off-diagonal pairs (deterministic), and M_rand randomly selected distant pairs (stochastic). This yields a gradient estimator with cost linear in n per iteration.
Compared to the tilde estimator, this estimator is robust to adversarial contamination of the training data, but is computationally more expensive due to the preprocessing of pairwise distances. See Alquier and Gerber (2024), Section 4.
- Parameters:
X (np.array, shape (n_samples, n_features)) – Training input samples.
y (np.array, shape (n_samples,)) – Target values.
par_v (np.array) – Initial values of the variable (optimized) model parameters.
par_c (np.array) – Constant model parameters that are not optimized.
model (RegressionModel) – The regression model to fit. Must implement predict, sample_n, score, update, and _project_params.
kernel_y (str) – Kernel type applied to the target variable y. Supported options are "Gaussian", "Laplace", and "Cauchy".
kernel_x (str, default="Laplace") – Kernel type applied to the covariates X. Supported options are "Gaussian", "Laplace", and "Cauchy".
burn_in (int, default=500) – Number of burn-in iterations during which parameter iterates are not included in the running average.
n_step (int, default=1000) – Maximum number of averaging iterations following the burn-in phase. Early stopping is applied if the average gradient norm falls below eps_sq.
stepsize (float, default=1) – Initial step size for the AdaGrad update.
bandwidth_y (float or str, default="auto") – Bandwidth for the kernel applied to y. If "auto", selected using the median heuristic.
bandwidth_x (float or str, default="auto") – Bandwidth for the kernel applied to X. If "auto", selected using the median heuristic.
c_det (float, default=0.2) – Fraction of n used to determine the number of deterministic (closest) off-diagonal pairs included per iteration.
c_rand (float, default=0.1) – Fraction of n used to determine the number of randomly selected distant pairs included per iteration.
epsilon (float, default=1e-4) – Initial accumulated squared gradient norm, used to stabilize the AdaGrad step size at the start of optimization.
eps_sq (float, default=1e-5) – Convergence threshold for early stopping. Optimization stops when the log of the average gradient norm falls below log(eps_sq).
rng (np.random.Generator, default=np.random.default_rng(10)) – Random number generator used for sampling distant pairs.
- Returns:
res – Dictionary containing:
- par_v_init: initial variable parameters.
- par_c_init: initial constant parameters.
- stepsize: step size used.
- bandwidth_y: bandwidth used for y (resolved if "auto").
- bandwidth_x: bandwidth used for X (resolved if "auto").
- estimator: Polyak-Ruppert average of parameter iterates.
- trajectory: cumulative-average parameter trajectory of shape (n_params, n_step_done).
- Return type:
MMDResult
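The pair-splitting scheme can be sketched as follows. Interpreting the fractions as M_det = int(c_det * n) and M_rand = int(c_rand * n) is an assumption for illustration; this is not the library's code:

```python
import numpy as np

def select_pairs(X, c_det=0.2, c_rand=0.1, rng=np.random.default_rng(10)):
    """Split observation pairs into the three groups used by the hat-estimator
    gradient sketch: all diagonal pairs, the closest off-diagonal pairs, and
    randomly sampled distant pairs."""
    n = X.shape[0]
    i, j = np.triu_indices(n, k=1)
    dist = np.linalg.norm(X[i] - X[j], axis=1)
    order = np.argsort(dist)                 # closest pairs first
    m_det, m_rand = int(c_det * n), int(c_rand * n)
    closest = order[:m_det]
    distant_pool = order[m_det:]
    sampled = rng.choice(distant_pool, size=min(m_rand, distant_pool.size),
                         replace=False)
    return {
        "diag": np.arange(n),                                  # (i, i) pairs
        "det": np.stack([i[closest], j[closest]], axis=1),     # deterministic
        "rand": np.stack([i[sampled], j[sampled]], axis=1),    # stochastic
    }

pairs = select_pairs(np.random.default_rng(0).normal(size=(20, 2)))
```

The distance sorting itself is an O(n^2 log n) preprocessing step; once done, each iteration touches only O(n) pairs, which is the linear per-iteration cost mentioned above.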
- regmmd.optimizers._sgd_tilde_regression(X, y, par_v, par_c, model, kernel, burn_in=500, n_step=1000, stepsize=1, bandwidth=1.0, epsilon=0.0001, eps_sq=1e-05)[source]#
Fit a regression model using the tilde estimator via stochastic gradient descent, as described in Universal Robust Regression via Maximum Mean Discrepancy, Alquier and Gerber (2024).
Minimizes the MMD objective using only the kernel \(k_Y\) on the target variable. The gradient is computed using pairs of samples drawn from the model’s conditional distribution at the observed \(X\) values, giving a cost linear in \(n\) per iteration.
Compared to the hat estimator, this estimator is computationally cheaper as it avoids the \(O(n^2)\) preprocessing of pairwise covariate distances, but enjoys slightly weaker robustness guarantees; see Alquier and Gerber (2024), Section 4.
- Parameters:
X (np.array, shape (n_samples, n_features)) – Training input samples.
y (np.array, shape (n_samples,)) – Target values.
par_v (np.array) – Initial values of the variable (optimized) model parameters.
par_c (np.array) – Constant model parameters that are not optimized.
model (RegressionModel) – The regression model to fit. Must implement predict, sample_n, score, update, and _project_params.
kernel (str) – Kernel type applied to the target variable y. Supported options are "Gaussian", "Laplace", and "Cauchy".
burn_in (int, default=500) – Number of burn-in iterations during which parameter iterates are not included in the running average.
n_step (int, default=1000) – Maximum number of averaging iterations following the burn-in phase. Early stopping is applied if the average gradient norm falls below eps_sq.
stepsize (float, default=1) – Initial step size for the AdaGrad update.
bandwidth (float or str, default=1.0) – Bandwidth for the kernel applied to y. If "auto", selected using the median heuristic on X.
epsilon (float, default=1e-4) – Initial accumulated squared gradient norm, used to stabilize the AdaGrad step size at the start of optimization.
eps_sq (float, default=1e-5) – Convergence threshold for early stopping. Optimization stops when the log of the average gradient norm falls below log(eps_sq).
- Returns:
res – Dictionary containing:
- par_v_init: initial variable parameters.
- par_c_init: initial constant parameters.
- stepsize: step size used.
- bandwidth: bandwidth used (resolved if "auto").
- estimator: Polyak-Ruppert average of parameter iterates.
- trajectory: cumulative-average parameter trajectory of shape (n_params, n_step_done).
- Return type:
MMDResult
- regmmd.optimizers.sort_obs(X)[source]#
Sort all pairs of observations by their pairwise Euclidean distance.
Computes pairwise distances between all rows of \(X\) and returns the upper-triangular index pairs and corresponding distances, sorted from closest to most distant. Used to efficiently select the nearest pairs in the hat estimator gradient computation.
- Parameters:
X (np.array, shape (n_samples, n_features)) – Input data whose rows are the observations to be paired.
- Returns:
result – Dictionary with two keys:
- "DIST": np.array of shape (n*(n-1)//2,), pairwise distances sorted in ascending order.
- "IND": np.array of shape (n*(n-1)//2, 2), row index pairs (i, j) with i < j corresponding to each distance in "DIST".
- Return type:
dict
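A minimal NumPy equivalent of this routine (illustrative; the library's implementation may differ):

```python
import numpy as np

def sort_obs_sketch(X):
    """Upper-triangular index pairs (i, j), i < j, and their Euclidean
    distances, both sorted from closest to most distant."""
    n = X.shape[0]
    i, j = np.triu_indices(n, k=1)               # each unordered pair once
    dist = np.linalg.norm(X[i] - X[j], axis=1)
    order = np.argsort(dist)                     # ascending distances
    return {"DIST": dist[order],
            "IND": np.stack([i[order], j[order]], axis=1)}

res = sort_obs_sketch(np.array([[0.0], [3.0], [1.0]]))
```

On the three univariate points 0, 3, 1 the sorted distances are 1, 2, 3 with index pairs (0, 2), (1, 2), (0, 1), which the hat estimator can then scan front-to-back to pick the nearest pairs.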