Optimizer functions#
- regmmd.optimizers._gd_backtracking_lg_loc_tilde_regression(X, y, par_v, par_c, n_step=1000, stepsize=1.0, bandwidth=1.0, alpha=0.8, eps_gd=1e-05)[source]#
Fit a LinearGaussianLoc regression model via exact gradient descent with backtracking line search on the tilde MMD criterion with a Gaussian kernel.
Computes the exact tilde MMD gradient analytically for a linear Gaussian location model with known variance par_c and a Gaussian kernel, avoiding Monte Carlo sampling entirely. In this case, the full expectation of the kernel * score can be calculated explicitly. The objective and gradient therefore reduce to closed-form expressions in the residuals, enabling deterministic (non-stochastic) gradient descent with backtracking.
- Parameters:
X (np.array, shape (n_samples, n_features)) – Training input samples (design matrix, without intercept column).
y (np.array, shape (n_samples,)) – Observed target values.
par_v (np.array, shape (n_features,)) – Initial value of the regression coefficients (beta).
par_c (float) – Known variance of the Gaussian noise (phi = sigma^2). Not optimized.
n_step (int, default=1000) – Maximum number of gradient descent iterations.
stepsize (float, default=1.0) – Initial step size. Shared across iterations: once reduced by backtracking, the smaller value carries forward to the next step.
bandwidth (float or str, default=1.0) – Bandwidth \(\gamma\) for the Gaussian kernel applied to y. If "auto", selected using the median heuristic on y.
alpha (float, default=0.8) – Backtracking reduction factor. Must satisfy 0 < alpha < 1.
eps_gd (float, default=1e-5) – Convergence tolerance on the relative change in the objective: stops when log|f_new - f_old| - log|f_old| < log(eps_gd).
- Returns:
res – Dictionary containing:
- par_v_init: initial regression coefficients.
- par_c_init: known variance.
- stepsize: initial step size.
- bandwidth: bandwidth used (resolved if "auto").
- estimator: last parameter iterate (point estimate).
- trajectory: parameter trajectory of shape (n_features, n_step_done + 1).
- convergence: 0 if converged, 1 if max iterations reached.
- Return type:
MMDResult
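The backtracking loop shared by the _gd_backtracking_* routines can be sketched generically. The following is an illustrative reimplementation, not the library's code; the stopping rule is the relative-change criterion from the docstrings, written in its equivalent non-log form:

```python
import numpy as np

def gd_backtracking(f, grad, x0, n_step=1000, stepsize=1.0, alpha=0.8, eps_gd=1e-5):
    """Gradient descent with backtracking line search; once the step size
    is reduced, the smaller value carries forward to later iterations."""
    x = np.asarray(x0, dtype=float)
    f_old = f(x)
    for _ in range(n_step):
        g = grad(x)
        # Backtracking: shrink the step until the objective decreases.
        while f(x - stepsize * g) >= f_old and stepsize > 1e-12:
            stepsize *= alpha
        x = x - stepsize * g
        f_new = f(x)
        # Equivalent to: log|f_new - f_old| - log|f_old| < log(eps_gd).
        if abs(f_new - f_old) < eps_gd * abs(f_old):
            return x, 0  # converged
        f_old = f_new
    return x, 1  # maximum number of iterations reached

# Toy quadratic with minimum at (1, -2).
target = np.array([1.0, -2.0])
f = lambda x: float(np.sum((x - target) ** 2))
grad = lambda x: 2 * (x - target)
x_hat, conv = gd_backtracking(f, grad, np.zeros(2))
```

The sketch keeps the documented behavior that a step size reduced by backtracking persists into later iterations rather than resetting.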
- regmmd.optimizers._gd_backtracking_lg_tilde_regression(X, y, par_v, n_step=1000, stepsize=1.0, bandwidth=1.0, alpha=0.8, eps_gd=1e-05)[source]#
Fit a LinearGaussian regression model via exact gradient descent with backtracking line search on the tilde MMD criterion with a Gaussian kernel.
Extends _gd_backtracking_lg_loc_tilde_regression() to the full model where both the regression coefficients beta and the noise variance phi are optimized jointly. The variance is reparametrized as log(phi) for unconstrained optimization.
- Parameters:
X (np.array, shape (n_samples, n_features)) – Training input samples (design matrix, without intercept column).
y (np.array, shape (n_samples,)) – Observed target values.
par_v (np.array, shape (n_features + 1,)) – Initial parameter vector [beta_0, ..., beta_p, phi] where phi > 0 is the noise variance.
n_step (int, default=1000) – Maximum number of gradient descent iterations.
stepsize (float, default=1.0) – Initial step size. Carried across iterations: once reduced by backtracking, the smaller value persists to the next step.
bandwidth (float or str, default=1.0) – Bandwidth h for the Gaussian kernel applied to y. If "auto", selected using the median heuristic on y.
alpha (float, default=0.8) – Backtracking reduction factor. Must satisfy 0 < alpha < 1.
eps_gd (float, default=1e-5) – Convergence tolerance on the relative change in the objective: stops when log|f_new - f_old| - log|f_old| < log(eps_gd).
- Returns:
res – Dictionary containing:
- par_v_init: initial parameter vector [beta, phi].
- par_c_init: None (no fixed parameters).
- stepsize: initial step size.
- bandwidth: bandwidth used (resolved if "auto").
- estimator: last parameter iterate [beta, phi].
- trajectory: parameter trajectory of shape (n_features + 1, n_step_done + 1), stored in the original [beta, phi] space (not log-space).
- convergence: 0 if converged, 1 if max iterations reached.
- Return type:
MMDResult
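The log(phi) reparametrization amounts to a chain-rule rescaling of the variance gradient. A minimal check, using a made-up smooth objective as a stand-in for the MMD criterion:

```python
import numpy as np

# Hypothetical smooth objective in the variance phi > 0 (stand-in for the MMD criterion).
f = lambda phi: (phi - 2.0) ** 2 + np.log(phi)
df_dphi = lambda phi: 2.0 * (phi - 2.0) + 1.0 / phi

phi = 0.7
theta = np.log(phi)                    # unconstrained parameter theta = log(phi)
# Chain rule: df/dtheta = (df/dphi) * (dphi/dtheta) = phi * df/dphi.
grad_theta = phi * df_dphi(phi)

# Finite-difference check in theta.
h = 1e-6
fd = (f(np.exp(theta + h)) - f(np.exp(theta - h))) / (2 * h)
```

Optimizing theta = log(phi) without constraints guarantees phi = exp(theta) > 0 at every iterate, which is why the trajectory can still be reported in the original [beta, phi] space.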
- regmmd.optimizers._gd_backtracking_logistic_tilde_regression(X, y, par_v, n_step=1000, stepsize=1.0, bandwidth=1.0, kernel='Gaussian', alpha=0.8, eps_gd=1e-05)[source]#
Fit a Logistic regression model via exact gradient descent with backtracking line search on the tilde MMD criterion.
Computes the exact tilde MMD gradient analytically for a logistic model, avoiding Monte Carlo sampling entirely. Since \(Y | X \sim \text{Bernoulli}(p)\) with \(p = \text{sigmoid}(X^{\top}\beta)\), the expectations involving \(Y\) reduce to closed-form expressions in \(p_i\), enabling deterministic gradient descent with backtracking.
- Parameters:
X (np.array, shape (n_samples, n_features)) – Training input samples (design matrix, without intercept column).
y (np.array, shape (n_samples,)) – Binary target values in {0, 1}.
par_v (np.array, shape (n_features,)) – Initial value of the regression coefficients (beta).
n_step (int, default=1000) – Maximum number of gradient descent iterations.
stepsize (float, default=1.0) – Initial step size. Carried across iterations: once reduced by backtracking, the smaller value persists to the next step.
bandwidth (float or str, default=1.0) – Bandwidth for the kernel applied to y. If "auto", selected using the median heuristic on y.
kernel (str, default="Gaussian") – Kernel type. Supported options are "Gaussian", "Laplace", and "Cauchy".
alpha (float, default=0.8) – Backtracking reduction factor. Must satisfy 0 < alpha < 1.
eps_gd (float, default=1e-5) – Convergence tolerance on the relative change in the objective: stops when log|f_new - f_old| - log|f_old| < log(eps_gd).
- Returns:
res – Dictionary containing:
- par_v_init: initial regression coefficients.
- par_c_init: None (no fixed parameters).
- stepsize: initial step size.
- bandwidth: bandwidth used (resolved if "auto").
- estimator: last parameter iterate (point estimate).
- trajectory: parameter trajectory of shape (n_features, n_step_done + 1).
- convergence: 0 if converged, 1 if max iterations reached.
- Return type:
MMDResult
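The closed-form expectations exploited here are easy to verify: on binary targets any translation-invariant kernel takes only two values, so \(\mathbb{E}[k(Y_i, Y_j)]\) is a finite sum over the four outcomes. A sketch (the off-diagonal kernel value c below assumes one common Gaussian-kernel convention, not necessarily the library's):

```python
import numpy as np

def expected_kernel(p_i, p_j, c):
    """E[k(Y_i, Y_j)] for independent Y_i ~ Bernoulli(p_i), Y_j ~ Bernoulli(p_j),
    where the kernel takes value 1 on equal binary labels and c otherwise."""
    same = p_i * p_j + (1.0 - p_i) * (1.0 - p_j)   # P(Y_i == Y_j)
    return same + c * (1.0 - same)

# For a Gaussian kernel, one common convention gives c = exp(-1 / (2 * bw**2)).
bw = 1.0
c = np.exp(-1.0 / (2.0 * bw ** 2))

# Exhaustive check over the four binary outcomes.
p_i, p_j = 0.3, 0.8
k = lambda a, b: 1.0 if a == b else c
brute = sum(
    (p_i if a else 1 - p_i) * (p_j if b else 1 - p_j) * k(a, b)
    for a in (0, 1) for b in (0, 1)
)
```

Because these expectations are exact functions of the sigmoid probabilities, no Monte Carlo sampling of Y is needed and the descent is deterministic.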
- regmmd.optimizers._gd_exact_logistic_hat_regression(X, y, par_v, par_c=None, kernel='Gaussian', kernel_x='Laplace', burn_in=500, n_step=1000, stepsize=1.0, bandwidth_y='auto', bandwidth_x='auto', c_det=0.2, c_rand=0.1, epsilon=0.0001, eps_sq=1e-05, rng=np.random.default_rng(10))[source]#
Fit a Logistic regression model using the hat estimator via AdaGrad with exact (analytical) gradients, as described in Universal Robust Regression via Maximum Mean Discrepancy, Alquier and Gerber (2024).
Replaces the Monte Carlo Y-sampling in _sgd_hat_regression() with closed-form gradient expressions for the logistic model. Since \(Y | X \sim \text{Bernoulli}(p)\) with \(p = \text{sigmoid}(X^{\top} \beta)\), the expectations \(\mathbb{E}[k_Y(Y_i, Y_j)]\) and \(\mathbb{E}[k_Y(Y_i, y_i)]\) reduce to simple functions of \(p_i\) and \(p_j\), so the gradient is computed analytically for each pair \((i, j)\).
- Parameters:
X (np.array, shape (n_samples, n_features)) – Training input samples.
y (np.array, shape (n_samples,)) – Binary target values in {0, 1}.
par_v (np.array, shape (n_features,)) – Initial regression coefficients (beta).
par_c (array or None, default=None) – Unused; included for API consistency.
kernel (str, default="Gaussian") – Kernel applied to y. Supported options are "Gaussian", "Laplace", and "Cauchy".
kernel_x (str, default="Laplace") – Kernel applied to X. Supported options are "Gaussian", "Laplace", and "Cauchy".
burn_in (int, default=500) – Number of burn-in AdaGrad iterations (parameters not averaged).
"Gaussian","Laplace", and"Cauchy".burn_in (int, default=500) – Number of burn-in AdaGrad iterations (parameters not averaged).
n_step (int, default=1000) – Maximum number of averaging iterations after burn-in.
stepsize (float, default=1.0) – Initial AdaGrad step size.
bandwidth_y (float or str, default="auto") – Bandwidth for the kernel on y. If "auto", uses the median heuristic.
bandwidth_x (float or str, default="auto") – Bandwidth for the kernel on X. If "auto", uses the median heuristic.
"auto", uses the median heuristic.c_det (float, default=0.2) – Fraction of n determining the number of deterministic nearest pairs.
c_rand (float, default=0.1) – Fraction of n determining the number of randomly sampled distant pairs.
epsilon (float, default=1e-4) – Initial accumulated squared gradient norm for AdaGrad stability.
eps_sq (float, default=1e-5) – Convergence threshold; stops when log(avg_grad_norm) < log(eps_sq).
rng (np.random.Generator, default=np.random.default_rng(10)) – Random number generator for sampling distant pairs.
- Returns:
res – Dictionary containing:
- par_v_init: initial regression coefficients.
- par_c_init: None.
- stepsize: step size used.
- bandwidth_y: bandwidth used for y.
- bandwidth_x: bandwidth used for X.
- estimator: Polyak-Ruppert cumulative average of iterates.
- trajectory: cumulative-average trajectory of shape (n_features, n_step_done).
- convergence: 0 if converged, 1 if max iterations reached, -1 if NaN.
- Return type:
MMDResult
- regmmd.optimizers._gd_gaussian_loc_exact_estimation(X, par_v, par_c, burn_in=500, n_step=1000, stepsize=1.0, bandwidth=1.0, epsilon=0.0001)[source]#
Estimate the location parameter of a Gaussian model with a Gaussian kernel using exact MMD gradient descent.
Computes the exact MMD gradient analytically for a Gaussian location model with known scale parameter par_c and a Gaussian kernel, avoiding Monte Carlo sampling. Uses AdaGrad updates with a burn-in phase followed by Polyak-Ruppert averaging.
- Parameters:
X (np.array, shape (n_samples,)) – Univariate observed data.
par_v (float) – Initial value of the location parameter (mean) to be estimated.
par_c (float) – Known scale parameter (standard deviation) of the Gaussian model.
burn_in (int, default=500) – Number of burn-in iterations during which parameter iterates are not averaged.
n_step (int, default=1000) – Number of averaging iterations following the burn-in phase.
stepsize (float, default=1.0) – Initial step size for the AdaGrad update.
bandwidth (float or str, default=1.0) – Bandwidth parameter for the Gaussian kernel. If "auto", the bandwidth is selected using the median heuristic.
epsilon (float, default=1e-4) – Initial accumulated squared gradient norm, used to stabilize the AdaGrad step size at the start of optimization.
- Returns:
res – Dictionary containing:
- par_v_init: initial location parameter.
- par_c_init: initial scale parameter.
- stepsize: step size used.
- bandwidth: bandwidth used (resolved if "auto").
- estimator: Polyak-Ruppert average of location parameter iterates.
- trajectory: parameter trajectory of shape (burn_in + n_step + 1,).
- Return type:
MMDResult
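The AdaGrad/Polyak-Ruppert pattern shared by the stochastic routines can be sketched as follows. This is an illustrative scalar version with a noisy quadratic gradient, not the library's implementation:

```python
import numpy as np

def adagrad_polyak(grad, x0, burn_in=500, n_step=1000, stepsize=1.0,
                   epsilon=1e-4, rng=np.random.default_rng(0)):
    """AdaGrad with a burn-in phase, then Polyak-Ruppert averaging."""
    x = float(x0)
    acc = epsilon                       # accumulated squared gradient norm
    avg, n_avg = 0.0, 0
    for t in range(burn_in + n_step):
        g = grad(x, rng)
        acc += g ** 2
        x -= stepsize * g / np.sqrt(acc)
        if t >= burn_in:                # average only after burn-in
            n_avg += 1
            avg += (x - avg) / n_avg    # running mean of the iterates
    return avg

# Noisy gradient of (x - 3)^2 / 2: the minimizer is x = 3.
noisy_grad = lambda x, rng: (x - 3.0) + 0.1 * rng.standard_normal()
est = adagrad_polyak(noisy_grad, 0.0)
```

Averaging only the post-burn-in iterates discards the large early steps, so the returned estimate reflects the stationary behavior of the iterates rather than the transient.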
- regmmd.optimizers._get_grad_estimate(set_1, set_2, X, K_X, y_sampled_1, y_sampled_2, y, model, kernel_y, bandwidth_y)[source]#
Compute a partial gradient estimate for the hat estimator objective.
Evaluates the gradient contribution from a specified subset of observation pairs \((i, j)\). When set_1 and set_2 are provided, the gradient is weighted by the covariate kernel \(k_X\) evaluated at those pairs. When both are None, the diagonal term \((i = j)\) is computed without covariate kernel weighting.
- Parameters:
set_1 (np.array[np.int32] or None) – Row indices of the first element of each pair. If None, the diagonal \((i = j)\) contribution is computed.
set_2 (np.array[np.int32] or None) – Row indices of the second element of each pair. If None, the diagonal \((i = j)\) contribution is computed.
X (np.array, shape (n_samples, n_features)) – Training input samples.
K_X (np.array or None) – Precomputed covariate kernel values for the pairs defined by set_1 and set_2. Ignored when set_1 is None.
y_sampled_1 (np.array, shape (n_samples,)) – First set of samples drawn from the model’s conditional distribution.
y_sampled_2 (np.array, shape (n_samples,)) – Second set of samples drawn from the model’s conditional distribution.
y (np.array, shape (n_samples,)) – Observed target values.
model (RegressionModel) – The regression model. Must implement score.
kernel_y (str) – Kernel type applied to the target variable y.
bandwidth_y (float) – Bandwidth for the kernel applied to y.
- Returns:
grad_estimate – Gradient estimate contributions from the specified pairs. Shape is (n_params,) when set_1 is None (diagonal term), or (len(set_1), n_params) for off-diagonal pairs.
- Return type:
np.array
- regmmd.optimizers._median_heuristic(X)[source]#
Compute the median heuristic for kernel bandwidth selection.
Estimates the bandwidth as the median of pairwise Euclidean distances divided by sqrt(2), a common data-driven heuristic for kernel methods.
- Parameters:
X (np.array, shape (n_samples,) or (n_samples, n_features)) – Input data.
- Returns:
median_dist – Estimated bandwidth. Returns 1 if fewer than two samples are provided.
- Return type:
float
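An illustrative reimplementation matching this description (median pairwise Euclidean distance over sqrt(2), falling back to 1 for fewer than two samples); the library's implementation may differ in details:

```python
import numpy as np

def median_heuristic(X):
    """Median pairwise Euclidean distance divided by sqrt(2);
    returns 1 when fewer than two samples are given."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]                 # treat a 1-D input as univariate samples
    n = X.shape[0]
    if n < 2:
        return 1.0
    # Pairwise distances over the strict upper triangle (each pair once).
    i, j = np.triu_indices(n, k=1)
    dist = np.linalg.norm(X[i] - X[j], axis=1)
    return float(np.median(dist)) / np.sqrt(2)
```

For example, on the points 0, 1, 2 the pairwise distances are 1, 1, 2, so the heuristic returns 1/sqrt(2).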
- regmmd.optimizers._sgd_estimation(X, par_v, par_c, model, kernel, burn_in=500, n_step=1000, stepsize=1, bandwidth=1.0, epsilon=0.0001)[source]#
Estimate model parameters via AdaGrad stochastic gradient descent on the MMD criterion.
Minimizes the MMD between the empirical distribution of X and the model’s distribution using a stochastic gradient algorithm. The optimization runs a burn-in phase (without averaging) followed by a main phase with Polyak-Ruppert averaging of the iterates.
- Parameters:
X (np.array, shape (n_samples,) or (n_samples, n_features)) – Observed data used to fit the model.
par_v (np.array) – Initial values of the variable (optimized) model parameters.
par_c (np.array) – Constant model parameters that are not optimized.
model (EstimationModel) – The statistical model to fit. Must implement sample_n, score, update, and _project_params.
kernel (str) – Kernel type used for the MMD computation. Supported options are "Gaussian", "Laplace", and "Cauchy".
"Gaussian","Laplace", and"Cauchy".burn_in (int, default=500) – Number of burn-in iterations during which parameter iterates are not averaged.
n_step (int, default=1000) – Number of averaging iterations following the burn-in phase.
stepsize (float, default=1) – Initial step size for the AdaGrad update.
bandwidth (float or str, default=1.0) – Bandwidth parameter for the kernel. If "auto", the bandwidth is selected using the median heuristic.
epsilon (float, default=1e-4) – Initial accumulated squared gradient norm, used to stabilize the AdaGrad step size at the start of optimization.
- Returns:
res – Dictionary containing:
- par_v_init: initial variable parameters.
- par_c_init: initial constant parameters.
- stepsize: step size used.
- bandwidth: bandwidth used (resolved if "auto").
- estimator: Polyak-Ruppert average of parameter iterates.
- trajectory: parameter trajectory of shape (n_params, burn_in + n_step + 1).
- Return type:
MMDResult
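For a univariate Gaussian location model, the stochastic MMD gradient can be formed with the score-function (REINFORCE) identity \(\nabla_\mu \mathbb{E}[g(Y)] = \mathbb{E}[g(Y)\, s_\mu(Y)]\). A self-contained sketch under assumed conventions (Gaussian kernel, known scale), not the library's code:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(2.0, 1.0, size=200)           # observed data from N(2, 1)
sigma, bw = 1.0, 1.0                         # known scale and kernel bandwidth
k = lambda a, b: np.exp(-(a - b) ** 2 / (2 * bw ** 2))

def mmd_grad(mu, m=50):
    """Score-function estimate of d MMD^2 / d mu for the model N(mu, sigma^2):
    E[k(Y, Y')(s(Y) + s(Y'))] - 2 E[k(Y, X) s(Y)], with s(y) = (y - mu) / sigma^2."""
    y1 = rng.normal(mu, sigma, m)
    y2 = rng.normal(mu, sigma, m)
    s = lambda y: (y - mu) / sigma ** 2
    model_term = np.mean(k(y1, y2) * (s(y1) + s(y2)))
    data_term = 2.0 * np.mean(k(y1[:, None], X[None, :]) * s(y1)[:, None])
    return model_term - data_term

# AdaGrad loop with Polyak-Ruppert averaging after a burn-in.
mu, acc, avg, n_avg = 0.0, 1e-4, 0.0, 0
for t in range(2000):
    g = mmd_grad(mu)
    acc += g ** 2
    mu -= 1.0 * g / np.sqrt(acc)
    if t >= 500:
        n_avg += 1
        avg += (mu - avg) / n_avg
```

Each gradient evaluation draws fresh samples from the model, so the cost per iteration is linear in the sample sizes rather than quadratic in n.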
- regmmd.optimizers._sgd_hat_regression(X, y, par_v, par_c, model, kernel_y, kernel_x='Laplace', burn_in=500, n_step=1000, stepsize=1, bandwidth_y='auto', bandwidth_x='auto', c_det=0.2, c_rand=0.1, epsilon=0.0001, eps_sq=1e-05, rng=np.random.default_rng(10))[source]#
Fit a regression model using the hat estimator via stochastic gradient descent, as described in Universal Robust Regression via Maximum Mean Discrepancy, Alquier and Gerber (2024).
Minimizes the MMD objective using the product kernel \(k = k_X \otimes k_Y\). The gradient is approximated efficiently by splitting pairs \((X_i, X_j)\) into three groups: all diagonal pairs, the M_det closest off-diagonal pairs (deterministic), and M_rand randomly selected distant pairs (stochastic). This yields a gradient estimator with cost linear in n per iteration.
Compared to the tilde estimator, this estimator is robust to adversarial contamination of the training data, but is computationally more expensive due to the preprocessing of pairwise distances. See Alquier and Gerber (2024), Section 4.
- Parameters:
X (np.array, shape (n_samples, n_features)) – Training input samples.
y (np.array, shape (n_samples,)) – Target values.
par_v (np.array) – Initial values of the variable (optimized) model parameters.
par_c (np.array) – Constant model parameters that are not optimized.
model (RegressionModel) – The regression model to fit. Must implement predict, sample_n, score, update, and _project_params.
kernel_y (str) – Kernel type applied to the target variable y. Supported options are "Gaussian", "Laplace", and "Cauchy".
kernel_x (str, default="Laplace") – Kernel type applied to the covariates X. Supported options are "Gaussian", "Laplace", and "Cauchy".
burn_in (int, default=500) – Number of burn-in iterations during which parameter iterates are not included in the running average.
n_step (int, default=1000) – Maximum number of averaging iterations following the burn-in phase. Early stopping is applied if the average gradient norm falls below eps_sq.
stepsize (float, default=1) – Initial step size for the AdaGrad update.
bandwidth_y (float or str, default="auto") – Bandwidth for the kernel applied to y. If "auto", selected using the median heuristic.
bandwidth_x (float or str, default="auto") – Bandwidth for the kernel applied to X. If "auto", selected using the median heuristic.
c_det (float, default=0.2) – Fraction of n used to determine the number of deterministic (closest) off-diagonal pairs included per iteration.
c_rand (float, default=0.1) – Fraction of n used to determine the number of randomly selected distant pairs included per iteration.
epsilon (float, default=1e-4) – Initial accumulated squared gradient norm, used to stabilize the AdaGrad step size at the start of optimization.
eps_sq (float, default=1e-5) – Convergence threshold for early stopping. Optimization stops when the log of the average gradient norm falls below log(eps_sq).
rng (np.random.Generator, default=np.random.default_rng(10)) – Random number generator used for sampling distant pairs.
- Returns:
res – Dictionary containing:
- par_v_init: initial variable parameters.
- par_c_init: initial constant parameters.
- stepsize: step size used.
- bandwidth_y: bandwidth used for y (resolved if "auto").
- bandwidth_x: bandwidth used for X (resolved if "auto").
- estimator: Polyak-Ruppert average of parameter iterates.
- trajectory: cumulative-average parameter trajectory of shape (n_params, n_step_done).
- Return type:
MMDResult
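The pair-splitting scheme can be sketched as follows. Interpreting the fractions as M_det = int(c_det * n) and M_rand = int(c_rand * n) is an assumption for illustration; this is not the library's code:

```python
import numpy as np

def select_pairs(X, c_det=0.2, c_rand=0.1, rng=np.random.default_rng(10)):
    """Split observation pairs into the three groups used by the hat-estimator
    gradient sketch: all diagonal pairs, the closest off-diagonal pairs, and
    randomly sampled distant pairs."""
    n = X.shape[0]
    i, j = np.triu_indices(n, k=1)
    dist = np.linalg.norm(X[i] - X[j], axis=1)
    order = np.argsort(dist)                 # closest pairs first
    m_det, m_rand = int(c_det * n), int(c_rand * n)
    closest = order[:m_det]
    distant_pool = order[m_det:]
    sampled = rng.choice(distant_pool, size=min(m_rand, distant_pool.size),
                         replace=False)
    return {
        "diag": np.arange(n),                                  # (i, i) pairs
        "det": np.stack([i[closest], j[closest]], axis=1),     # deterministic
        "rand": np.stack([i[sampled], j[sampled]], axis=1),    # stochastic
    }

pairs = select_pairs(np.random.default_rng(0).normal(size=(20, 2)))
```

The distance sorting itself is an O(n^2 log n) preprocessing step; once done, each iteration touches only O(n) pairs, which is the linear per-iteration cost mentioned above.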
- regmmd.optimizers._sgd_tilde_regression(X, y, par_v, par_c, model, kernel, burn_in=500, n_step=1000, stepsize=1, bandwidth=1.0, epsilon=0.0001, eps_sq=1e-05)[source]#
Fit a regression model using the tilde estimator via stochastic gradient descent, as described in Universal Robust Regression via Maximum Mean Discrepancy, Alquier and Gerber (2024).
Minimizes the MMD objective using only the kernel \(k_Y\) on the target variable. The gradient is computed using pairs of samples drawn from the model’s conditional distribution at the observed \(X\) values, giving a cost linear in \(n\) per iteration.
Compared to the hat estimator, this estimator is computationally cheaper as it avoids the \(O(n^2)\) preprocessing of pairwise covariate distances, but enjoys slightly weaker robustness guarantees; see Alquier and Gerber (2024), Section 4.
- Parameters:
X (np.array, shape (n_samples, n_features)) – Training input samples.
y (np.array, shape (n_samples,)) – Target values.
par_v (np.array) – Initial values of the variable (optimized) model parameters.
par_c (np.array) – Constant model parameters that are not optimized.
model (RegressionModel) – The regression model to fit. Must implement predict, sample_n, score, update, and _project_params.
kernel (str) – Kernel type applied to the target variable y. Supported options are "Gaussian", "Laplace", and "Cauchy".
burn_in (int, default=500) – Number of burn-in iterations during which parameter iterates are not included in the running average.
n_step (int, default=1000) – Maximum number of averaging iterations following the burn-in phase. Early stopping is applied if the average gradient norm falls below eps_sq.
stepsize (float, default=1) – Initial step size for the AdaGrad update.
bandwidth (float or str, default=1.0) – Bandwidth for the kernel applied to y. If "auto", selected using the median heuristic on X.
epsilon (float, default=1e-4) – Initial accumulated squared gradient norm, used to stabilize the AdaGrad step size at the start of optimization.
eps_sq (float, default=1e-5) – Convergence threshold for early stopping. Optimization stops when the log of the average gradient norm falls below log(eps_sq).
- Returns:
res – Dictionary containing:
- par_v_init: initial variable parameters.
- par_c_init: initial constant parameters.
- stepsize: step size used.
- bandwidth: bandwidth used (resolved if "auto").
- estimator: Polyak-Ruppert average of parameter iterates.
- trajectory: cumulative-average parameter trajectory of shape (n_params, n_step_done).
- Return type:
MMDResult
- regmmd.optimizers.sort_obs(X)[source]#
Sort all pairs of observations by their pairwise Euclidean distance.
Computes pairwise distances between all rows of \(X\) and returns the upper-triangular index pairs and corresponding distances, sorted from closest to most distant. Used to efficiently select the nearest pairs in the hat estimator gradient computation.
- Parameters:
X (np.array, shape (n_samples, n_features)) – Input data whose rows are the observations to be paired.
- Returns:
result – Dictionary with two keys:
- "DIST": np.array of shape (n*(n-1)//2,), pairwise distances sorted in ascending order.
- "IND": np.array of shape (n*(n-1)//2, 2), row index pairs (i, j) with i < j corresponding to each distance in "DIST".
- Return type:
dict
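A minimal NumPy equivalent of this routine (illustrative; the library's implementation may differ):

```python
import numpy as np

def sort_obs_sketch(X):
    """Upper-triangular index pairs (i, j), i < j, and their Euclidean
    distances, both sorted from closest to most distant."""
    n = X.shape[0]
    i, j = np.triu_indices(n, k=1)               # each unordered pair once
    dist = np.linalg.norm(X[i] - X[j], axis=1)
    order = np.argsort(dist)                     # ascending distances
    return {"DIST": dist[order],
            "IND": np.stack([i[order], j[order]], axis=1)}

res = sort_obs_sketch(np.array([[0.0], [3.0], [1.0]]))
```

On the three univariate points 0, 3, 1 the sorted distances are 1, 2, 3 with index pairs (0, 2), (1, 2), (0, 1), which the hat estimator can then scan front-to-back to pick the nearest pairs.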