Principle of kernel-based sampling methods

This section presents the kernel-based sampling methods developed in otkerneldesign for design of experiments, numerical integration, and quantization.

Introduction.

Let us denote by \vect{X}_n = \left\{\vect{x}^{(1)},\ldots, \vect{x}^{(n)}\right\} \in \cD_\vect{X} \subset \Rset^p the n-sample of input realizations, also called input "design of experiments" (DoE) or simply design.

From a computer experiment point of view, let us consider a costly function g:\cD_\vect{X} \rightarrow \Rset. A first goal of the following DoE methods is to explore the input space in a space-filling way (e.g., to build a generic machine learning model of g). However, this work exploits these methods with a more specific purpose in mind: numerically integrating g against the probability density function f_{\vect{X}}, which amounts to the central tendency estimation of an output random variable Y=g(\vect{X}) resulting from an uncertainty propagation step.

Sampling methods based on a notion of discrepancy between distributions in a kernel-based functional space have been used to approximate integrals. More precisely, one can mention the use of the distance called the maximum mean discrepancy (MMD) as a core ingredient of advanced sampling methods such as Support points (Mak & Joseph, 2018) and Kernel herding (Chen et al., 2010). Manipulating the MMD is convenient since its expression is simple; it only depends on an arbitrarily chosen kernel. Let us set up the introduction of the Kernel herding and Support points methods by briefly defining a few mathematical concepts.

Reproducing kernel Hilbert space.

Let k: \cD_\vect{X} \times \cD_\vect{X} \rightarrow \Rset be a symmetric and positive definite function, later called a "reproducing kernel" or simply a "kernel". A reproducing kernel Hilbert space (RKHS) is an inner product space \cH(k) of functions g:\cD_\vect{X} \rightarrow \Rset with the following properties:

  • k(\cdot, \vect{x}) \in \cH(k), \quad \forall \vect{x} \in \cD_\vect{X}.

  • \langle g, k(\cdot, \vect{x}) \rangle_{\cH(k)} = g(\vect{x}), \quad \forall \vect{x} \in \cD_\vect{X}, \forall g \in \cH(k).

Note that a given reproducing kernel defines a unique RKHS, and conversely (see C. Oates, 2021).
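
To make the notion of kernel concrete, here is a minimal Python sketch (assuming numpy is available) of the Matérn 5/2 kernel mentioned later in this section; the function name and the default correlation length theta are illustrative choices, not the otkerneldesign implementation.

    import numpy as np

    def matern52_kernel(x, y, theta=1.0):
        # Matérn 5/2 kernel between two points of R^p (theta: correlation length)
        r = np.linalg.norm(np.asarray(x, dtype=float) - np.asarray(y, dtype=float)) / theta
        return (1.0 + np.sqrt(5.0) * r + 5.0 * r ** 2 / 3.0) * np.exp(-np.sqrt(5.0) * r)

This function is reused as the default kernel in the sketches below.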

Potential.

For any target distribution \mu, its potential (also called "kernel mean embedding") associated with the kernel k is defined as:

(1)P_{\mu}(\vect{x}) := \int_{\cD_\vect{X}} k(\vect{x}, \vect{x}') \di \mu(\vect{x}').

Then, the potential of a discrete distribution \zeta_n = \frac1n \sum_{i=1}^{n} \delta(\vect{x}^{(i)}) (uniform mixture of Dirac distributions at the design points \vect{X}_n) associated with the kernel k can be expressed as:

(2)P_{\zeta_n}(\vect{x}) = \frac1n \sum_{i=1}^{n} k(\vect{x}, \vect{x}^{(i)}).

When the two potentials are close, the design \vect{X}_n can be interpreted as adequately quantizing \mu.
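
As a minimal sketch, the potential (2) of a discrete design can be evaluated by averaging kernel values over the design points; this reuses the illustrative matern52_kernel defined above.

    def design_potential(x, design, kernel=matern52_kernel):
        # Empirical potential P_{zeta_n}(x) of the design at point x, see (2)
        return np.mean([kernel(x, x_i) for x_i in design])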

Maximum mean discrepancy.

The MMD offers a metric of both discrepancy and quadrature error. This distance between two distributions \mu and \zeta is given by the maximal quadrature error committed for any function within the unit ball of an RKHS:

(3)\mathrm{MMD}_k(\mu, \zeta) :=
\sup_{\lVert g \rVert_{\cH(k)} \leq 1}
        \left| \int_{\cD_{\vect{X}}} g(\vect{x}) \di \mu(\vect{x}) - \int_{\cD_{\vect{X}}} g(\vect{x}) \di \zeta(\vect{x}) \right|.

Using the Cauchy-Schwarz inequality, one can show that \mathrm{MMD}_k(\mu, \zeta) = \left\lVert P_{\mu} - P_{\zeta} \right\rVert_{\cH(k)} (see C. Oates, 2021). Moreover, a kernel k is called characteristic if \mathrm{MMD}_k(\mu, \zeta) = 0 is equivalent to \mu = \zeta.
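
Since the MMD between empirical distributions reduces to kernel evaluations, it can be computed directly from two samples; the following sketch (based on the illustrative kernel above) makes this explicit.

    def mmd_squared(sample_a, sample_b, kernel=matern52_kernel):
        # Squared MMD between the empirical distributions of two samples, see (3):
        # mean of k over a x a - 2 * mean of k over a x b + mean of k over b x b
        k_aa = np.mean([[kernel(a, a2) for a2 in sample_a] for a in sample_a])
        k_bb = np.mean([[kernel(b, b2) for b2 in sample_b] for b in sample_b])
        k_ab = np.mean([[kernel(a, b) for b in sample_b] for a in sample_a])
        return k_aa - 2.0 * k_ab + k_bb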

Kernel herding

This section introduces Kernel herding (KH) (Chen et al., 2010), a sampling method which iteratively adds points so as to minimize a squared MMD. Considering a design \vect{X}_n and its corresponding discrete distribution \zeta_n= \frac{1}{n} \sum_{i=1}^{n} \delta(\vect{x}^{(i)}), a KH iteration can be written as an optimization over the point \vect{x}^{(n+1)} \in \cD_{\vect{X}} of the following criterion:

(4)\vect{x}^{(n+1)} \in \argmin_{\vect{x} \in \mathcal{S}} \left(P_{\zeta_n}(\vect{x}) - P_{\mu}(\vect{x})\right)

considering a kernel k and a given set \cS\subseteq\cD_{\vect{X}} of candidate points (e.g., a fairly dense finite subset of size N \gg n that emulates the target distribution). This compact criterion derives from the expression of a descent algorithm with respect to \vect{x}^{(n+1)} (see (Pronzato & Zhigljavsky, 2020) for the full proof).

In practice, P_{\mu}(\vect{x}) can be expressed analytically for specific combinations of input distribution and kernel (e.g., for independent uniform or normal inputs and a Matérn 5/2 kernel (Fekhari et al., 2022)), making its computation very fast. Alternatively, the potential can be evaluated against an empirical measure \mu_N substituting \mu, formed by a dense, large sample drawn from \mu (e.g., the candidate set \mathcal{S}). P_{\mu}(\vect{x}) is then approximated by P_{\mu_N}(\vect{x}) = (1/N)\, \sum_{j=1}^N k(\vect{x}, \vect{x}'^{(j)}), which can be injected in (4) to solve the following optimization:

(5)\vect{x}^{(n+1)} \in \argmin_{\vect{x}\in\mathcal{S}} \left( \frac{1}{n} \sum_{i=1}^{n} k(\vect{x},\vect{x}^{(i)})
  - \frac{1}{N} \sum_{j=1}^N k(\vect{x},\vect{x}'^{(j)}) \right) \,.
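
As an illustration of criterion (5), the following sketch performs greedy kernel herding on a finite candidate set (standing both for \mathcal{S} and for the empirical measure \mu_N); it is a simplified illustration under the assumptions above, not the otkerneldesign implementation.

    def kernel_herding(candidates, n_points, kernel=matern52_kernel):
        # Greedily pick design points in the candidate set by minimizing (5)
        candidates = np.asarray(candidates, dtype=float)  # shape (N, p)
        N = len(candidates)
        gram = np.array([[kernel(x, y) for y in candidates] for x in candidates])
        target_potential = gram.mean(axis=1)  # P_{mu_N} evaluated on the candidate set
        design_indices = []
        for _ in range(n_points):
            if design_indices:
                current_potential = gram[:, design_indices].mean(axis=1)  # P_{zeta_n}
            else:
                current_potential = np.zeros(N)
            design_indices.append(int(np.argmin(current_potential - target_potential)))
        return candidates[design_indices]

For instance, kernel_herding(np.random.rand(1000, 2), 20) selects 20 well-spread points out of a uniform candidate set of size N = 1000 on the unit square.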

When no observation is available, which is the common situation at the design stage, the kernel hyperparameters (e.g., correlation lengths) have to be set to heuristic values. MMD minimization is quite versatile and was explored in detail by (Teymur et al., 2021) and (Pronzato & Zhigljavsky, 2020); however, the method is very sensitive to the chosen kernel and its tuning. Support points, presented next, are a closely related method with a more rigid mathematical structure but appealing performance.

Greedy support points

Support points (SP) (Mak & Joseph, 2018) are such that their associated empirical distribution \zeta_n has minimum energy distance with respect to a target distribution \mu. This criterion can be seen as a particular case of the MMD for the characteristic "energy-distance" kernel given by:

(6)k_E(\vect{x},\vect{x}') = \frac{1}{2}\, \left(\| \vect{x} \| + \| \vect{x}' \| - \| \vect{x}-\vect{x}' \|\right)\,.

Compared to more heuristic methods for solving quantization problems, Support points benefit from the theoretical guarantees of MMD minimization in terms of convergence of \zeta_n to \mu as n\to\infty.
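
For reference, the energy-distance kernel (6) can be written in a few lines; plugged into the kernel herding sketch above, it yields a greedy procedure close to the Support points iteration described below.

    def energy_distance_kernel(x, y):
        # Characteristic energy-distance kernel, see (6)
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        return 0.5 * (np.linalg.norm(x) + np.linalg.norm(y) - np.linalg.norm(x - y))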

At first sight, this optimization problem seems intractable, but (Mak & Joseph, 2018) propose to rewrite the criterion as a difference of convex functions in \vect{X}_n, which yields a difference-of-convex program. To simplify the algorithm and keep an iterative design, a different approach is used here. At iteration n+1, the algorithm greedily solves the MMD minimization between \zeta_n and \mu over the candidate set \mathcal{S}:

(7)\vect{x}^{(n+1)} \in \argmin_{\vect{x}\in\mathcal{S}} \Bigg( \frac{1}{N} \sum_{j=1}^N \|\vect{x}-\vect{x}'^{(j)}\|
- \frac{1}{n+1} \sum_{i=1}^{n} \|\vect{x}-\vect{x}^{(i)}\| \Bigg) \,.
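
The greedy iteration (7) only involves Euclidean distances, which the following sketch exploits by precomputing the pairwise distance matrix of the candidate set; again, this is an illustration rather than the otkerneldesign implementation.

    def greedy_support_points(candidates, n_points):
        # Greedily pick design points in the candidate set by minimizing (7)
        candidates = np.asarray(candidates, dtype=float)  # shape (N, p)
        # Pairwise Euclidean distances between candidate points
        dist = np.linalg.norm(candidates[:, None, :] - candidates[None, :, :], axis=-1)
        target_term = dist.mean(axis=1)  # (1/N) sum_j ||x - x'^(j)||
        design_indices = []
        for n in range(n_points):
            criterion = target_term.copy()
            if design_indices:
                criterion -= dist[:, design_indices].sum(axis=1) / (n + 1)
                criterion[design_indices] = np.inf  # avoid reselecting a design point
            design_indices.append(int(np.argmin(criterion)))
        return candidates[design_indices]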

One can notice that this criterion is almost identical to the KH one in (4) when taking the energy-distance kernel given in (6). These two iterative methods were exploited in (Fekhari et al., 2022) to study new ways of constructing a validation set for machine learning models, conveniently selecting a test set for a better estimation of model performance.

References

  • Chen, Y., M. Welling, & A. Smola (2010). Super-samples from kernel herding. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, pp. 109–116. PDF

  • Mak, S. & V. R. Joseph (2018). Support points. The Annals of Statistics 46, 2562–2592. PDF

  • Fekhari, E., B. Iooss, J. Mure, L. Pronzato, & M. Rendas (2022). Model predictivity assessment: incremental test-set selection and accuracy evaluation. Preprint. PDF

  • Briol, F.-X., C. Oates, M. Girolami, M. Osborne, & D. Sejdinovic (2019). Probabilistic Integration: A Role in Statistical Computation? Statistical Science 34, 1–22. PDF

  • Pronzato, L. & A. Zhigljavsky (2020). Bayesian quadrature and energy minimization for space-filling design. SIAM/ASA Journal on Uncertainty Quantification 8, 959–1011. PDF

  • Huszár, F. & D. Duvenaud (2012). Optimally-Weighted Herding is Bayesian Quadrature. In Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, pp. 377–386. PDF

  • Teymur, O., J. Gorham, M. Riabiz, & C. Oates (2021). Optimal quantisation of probability measures using maximum mean discrepancy. In International Conference on Artificial Intelligence and Statistics, pp. 1027–1035. PDF