The Gaussian Process (GP) is an important nonparametric model in ML with applications in numerous fields; nonetheless, I would like to begin this article with some motivation from my own point of view.

My main exposure to this topic is through spatial statistics, which is probably the most natural field to relate it to: on a 2-dimensional domain such as a bounded region of the Earth's surface (or 3-dimensional if you add time), it is often reasonable to abstract data that arises with spatial (geographical) correlations as spatial processes. Among such models, the GP is probably the most widely applied for continuous data (with extensions to count/binary data via link functions, such as the log-Gaussian Cox process), owing to properties that combine theoretical sophistication with computational tractability. In particular, the joint distribution of the random variables at any finite subset of points in the domain is multivariate Gaussian, whose finite-dimensional Gram matrix is obtained by evaluating the covariance function of the GP. Together with the fact that the Gaussian distribution is conjugate to itself, this allows for prediction at any point in the domain given finitely many observations.

Of course, it is impossible to know the exact second-order structure of the GP (or whether it is stationary), so the most classical approach is to assume a class of priors with closed-form (second-order stationary/homogeneous) covariance functions whose hyperparameters are to be inferred, together with a linear regression model as the mean to capture the trend. Prediction is then carried out through what is called kriging in geostatistics, a.k.a. GP regression. It is also possible to relax the stationarity assumption on the covariance function to allow for more complicated second-order structure [3], which may vary over time, over the spatial domain, or both (e.g., in many environmental problems under climate change).

Much of the GP literature nowadays focuses on addressing its scalability issues: for $N$ observations, computing the posterior (prediction) requires $\calO(N^3)$ computation to invert the Gram matrix and $\calO(N^2)$ memory to store it. Different views of the GP motivate different approaches to tackling this issue. For reviews of scalable GPs developed by the spatial statistics and ML communities, please refer to [4, 6].
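To make the bottleneck concrete, here is a minimal sketch of exact GP regression (the squared-exponential kernel, the noise level, and the synthetic data are arbitrary choices for illustration): the Cholesky factorisation of the $N\times N$ Gram matrix is the $\calO(N^3)$ step, and storing the matrix is the $\calO(N^2)$ one.

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance k(x, x') on 1-d inputs."""
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

def gp_posterior(x_train, y_train, x_test, noise=1e-2):
    """Exact GP regression (kriging with zero mean)."""
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    L = np.linalg.cholesky(K)              # O(N^3) time; K itself is O(N^2) memory
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    K_star = rbf_kernel(x_test, x_train)
    mean = K_star @ alpha                  # posterior mean at the test points
    v = np.linalg.solve(L, K_star.T)
    cov = rbf_kernel(x_test, x_test) - v.T @ v  # posterior covariance
    return mean, cov

x = np.random.uniform(0, 10, size=500)
y = np.sin(x) + 0.1 * np.random.randn(500)
mu, Sigma = gp_posterior(x, y, np.linspace(0, 10, 100))
```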

In the sequel, we will look at different definitions of the Gaussian field/process, first through a non-rigorous introduction and then through a fully theoretical setup. In particular, we make connections between generalised Gaussian random fields and RKHSs, which will be similar to what we have seen before (by identifying a compact self-adjoint non-negative operator). In fact, under such circumstances, we are able to find a Gaussian random field that solves an elliptic SPDE driven by Gaussian white noise. However, we will delay that discussion to the next post, where the extension from bounded Euclidean domains to compact Riemannian manifolds is also presented.

Non-rigorous intro to GP

This section serves as a less rigorous presentation of the basics of GPs following [1]; we will see later that the function space view and the weight space view are intuitive abstractions of the formal definitions in [2].

Let $\calX$ be some metric space and $f: \calX \to \R$ a random function. Then $f\sim \mathrm{GP}(m, k)$ is a Gaussian process if for any $\mathbf{x}\in \calX^n$, $\mathbf{f} = f(\mathbf{x}) \sim \calN(\mathbf{\mu}, K_{\mathbf{xx}})$, where $\mathbf{\mu} = m(\mathbf{x})$ and $K_{\mathbf{xx}} = k(\mathbf{x}, \mathbf{x})$, i.e. $\mathbf{f}$ has a multivariate Gaussian distribution. Moreover, $f$ is completely determined by its mean function $m$ and covariance function $k$. WLOG, we assume $m\equiv 0$ in the following discussion.
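This finite-dimensional characterisation is all we need to sample a GP at a finite set of inputs; here is a minimal sketch (the squared-exponential kernel and the grid are arbitrary choices):

```python
import numpy as np

def k(x1, x2, ell=0.5):
    # squared-exponential covariance; any positive-definite kernel would do
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / ell**2)

x = np.linspace(0, 1, 200)               # any finite subset of the domain
K = k(x, x) + 1e-9 * np.eye(len(x))      # small jitter for numerical stability
f = np.random.multivariate_normal(np.zeros(len(x)), K, size=3)  # three draws of f(x)
```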

An alternative view from [1] gives another natural characterisation of the GP, termed the weight space view. Intuitively, it can be regarded as a generalisation of the Bayesian linear regression (BLR) model: the data are first transformed via a map $\mathbf{\Phi}$ (consisting of a set of $m$ basis functions, i.e. a feature map) into a higher-dimensional space, on which the BLR model is then constructed. In particular,

$$ \begin{align*} f(X) &= \mathbf{\Phi}(X)^T \mathbf{w}, \;\mathbf{y} = f(X) + \mathbf{\epsilon}\\ \mathbf{w} &\sim \calN(0, \Sigma_p), \;\;\mathbf{\epsilon} \sim \calN(0, \sigma^2I_n) \end{align*} $$

where $X$ is the design matrix for a set of samples $\{\mathbf{x}_i\}_{i=1}^n \subset \calX$, and $\Sigma_p$ is the prior covariance of the weights $\mathbf{w}$.
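Here is a minimal sketch of the weight space construction under these assumptions; the radial basis features, their centres, and the prior $\Sigma_p = I$ are arbitrary choices for illustration.

```python
import numpy as np

def features(x, centres, ell=0.3):
    """Feature map Phi: m radial basis functions evaluated at the inputs x."""
    return np.exp(-0.5 * (x[:, None] - centres[None, :]) ** 2 / ell**2)

n, m, sigma2 = 50, 20, 0.01
x = np.random.uniform(0, 1, n)
y = np.sin(2 * np.pi * x) + np.sqrt(sigma2) * np.random.randn(n)

centres = np.linspace(0, 1, m)
Phi = features(x, centres)        # n x m design matrix in feature space
Sigma_p = np.eye(m)               # prior covariance of the weights w

# Standard Bayesian linear regression: the posterior over w is Gaussian
A = Phi.T @ Phi / sigma2 + np.linalg.inv(Sigma_p)
w_cov = np.linalg.inv(A)
w_mean = w_cov @ Phi.T @ y / sigma2

# Predictive mean at new inputs is Phi(x*) @ w_mean
x_test = np.linspace(0, 1, 100)
f_mean = features(x_test, centres) @ w_mean
```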

For example, consider a 1-hidden-layer NN with logistic activation functions $\{\phi_j\}_j$, which can be formulated as follows:

$$ f(X) = \theta_0 + \sum_{j=1}^m \theta_j\phi_j(X^T\mathbf{\eta}_j) $$

This converges to a GP as $m\to\infty$, as shown by Neal [5]. In addition, we can also employ harmonic functions, wavelets, piecewise linear functions (as in FEM), etc. as the basis functions, which allows for approximating the covariance function $k$ with a truncated basis.
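A quick empirical sketch of Neal's limit (the weights and biases are drawn as standard Gaussians, the output weights are scaled by $1/\sqrt{m}$ as Neal's argument requires, and the bias $\theta_0$ is dropped for brevity): as $m$ grows, the joint distribution of $(f(x_1),\dots,f(x_n))$ approaches a multivariate Gaussian.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def one_layer_nn_samples(x, m, n_samples=2000):
    """Draw random networks f(x) = (1/sqrt(m)) sum_j theta_j * sigmoid(eta_j x + b_j)."""
    out = np.empty((n_samples, len(x)))
    for s in range(n_samples):
        eta = np.random.randn(m)       # input-to-hidden weights
        b = np.random.randn(m)         # hidden biases
        theta = np.random.randn(m)     # hidden-to-output weights
        out[s] = sigmoid(np.outer(x, eta) + b) @ theta / np.sqrt(m)
    return out

x = np.array([-1.0, 0.0, 1.0])
samples = one_layer_nn_samples(x, m=2000)
print(np.cov(samples.T))   # empirical covariance of the (approximately Gaussian) vector f(x)
```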

Gaussian Field and Covariance Operator

We start with the basic definition of a Gaussian field and make connections to the preceding section. Most of the content is based on [2], with details added and some changes of notation and phrasing so that it adheres to previous posts.

Recall that one way to define a real-valued random variable $\eta$ is as a measurable mapping from a probability space $(\Omega, \calF, \P)$ to $(\R, \calB(\R))$, with its probability distribution defined as $\mu_{\eta}(A)=\P(\eta\in A) = \P(\omega: \eta(\omega)\in A)$ for any $A$ in the Borel sigma algebra of the real line. Conversely, we can start from a cumulative distribution function $F$ and its associated Lebesgue-Stieltjes measure on $(\R, \calB(\R))$, which defines a probability distribution via $\P((a,b])=F(b) - F(a)$. Then the random variable $\eta$ given by the identity map $\eta(x)=x$ for $x\in \R$ has induced measure $\mu_{\eta}=\P$ on $(\R, \calB(\R))$.

Similarly, this extends to random elements taking values in topological spaces. In particular, consider a measurable space $(V, \calB(V))$, where $V$ is a linear topological space. Then a $V$-valued random element $X$ induces a probability measure on $(V, \calB(V))$, where

$$ \mu(A) = \P(X\in A), \;\; A\in \calB(V) $$

Conversely, if $\P$ is a probability measure on $(V, \calB(V))$, then $X(v)=v$ defines a canonical $V$-valued random element on the probability space $(V, \calB(V), \P)$ such that $\mu_X=\P$. If $V$ consists of functions of time, $X$ is called a canonical process; e.g., the canonical process under the Wiener measure on the space $V=\{f\in\calC([0, T]): f(0)=0\}$, equipped with the sup norm, is standard Brownian motion.
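As a toy illustration, discretised draws from the Wiener measure (i.e. sample paths of the canonical process) can be simulated from independent Gaussian increments; the grid size and number of paths are arbitrary choices:

```python
import numpy as np

T, n_steps, n_paths = 1.0, 1000, 5
dt = T / n_steps
# Standard Brownian motion: cumulative sums of independent N(0, dt) increments, started at 0
increments = np.sqrt(dt) * np.random.randn(n_paths, n_steps)
paths = np.concatenate([np.zeros((n_paths, 1)), np.cumsum(increments, axis=1)], axis=1)
```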

We know that a Gaussian measure $\mu$ on $(\R, \calB(\R))$ is identified by its characteristic function (i.e. Fourier transform), with real mean $m$ and standard deviation $\sigma >0$, such that

$$ \int_{\R} e^{itx} \mu(dx) = e^{itm-\frac{1}{2}t^2\sigma^2}, \;\; \forall t\in \R; $$

$\mu$ is called centered if $m=0$; a standard Gaussian random variable has Gaussian measure $\mu$ with $m=0, \sigma=1$. To extend this to topological spaces, we will see that a Gaussian measure on $(V, \calB(V))$ is characterised through the dual space $V'$.

Definition 1 (Gaussian Measure, Def 3.2.1 [2]). A probability measure $\mu$ on $(V, \calB(V))$ is called Gaussian if for every linear functional $\ell\in V'$, $\ell:V\to \R$, the pushforward measure $\ell^{\ast}\mu$ on $(\R, \calB(\R))$, defined by $\ell^{\ast}\mu((a,b]) = \mu\{v\in V: \ell(v)\in(a,b]\}$, is Gaussian. In addition, $\mu$ is called centered if $\ell^{\ast} \mu$ is centered for every $\ell\in V'$.

Moreover, for a centered Gaussian measure $\mu$, one can define a covariance function $C_{\mu}$ (which is positive definite) on $V'$ by

$$ \begin{align} \label{Eq:cov} C_{\mu}(\ell, h) = \int_V \ell(v)h(v)\;\mu(dv) \end{align} $$

On the other hand, $C_{\mu}$ can be viewed as an operator $\ell \mapsto C_{\mu}(\ell, \cdot)$ from $V'$ to $(V')'$. If we assume $V=\calH$ is a Hilbert space identified with its dual, i.e. $\calH=\calH'$, we obtain the map

$$ \begin{align*} C_{\mu}: \calH &\to \calH\\ \ell &\mapsto C_{\mu}(\ell, \cdot) := \ell_{\mu}, \end{align*} $$

and $C_{\mu}(\ell, h) = \inner{\ell_{\mu}, h}_{\calH}$ by the Riesz representation theorem. Later, we will see that for separable $\calH$, one can identify an integral operator $\calK$ (which, confusingly, is also called the covariance operator) with a corresponding real-valued reproducing kernel, thus giving rise to an RKHS for the Gaussian measure $\mu$. The following shows that any such $\calK$ is in fact nuclear (i.e. of trace class), and conversely that any such operator corresponds to the covariance function $C_{\mu}$ of some Gaussian measure $\mu$.

Theorem 1 (Thm 3.2.5 [2]). Let $\calH$ be a separable Hilbert space identified with its dual, then

  1. If $\mu$ is a Gaussian measure on $\calH$, then its covariance operator is nuclear.
  2. Conversely, if $\calK$ is a self-adjoint non-negative nuclear operator, then there exists a Gaussian measure $\mu$ on $\calH$ such that $C_{\mu}(f, g) = \inner{\calK f, g}_{\calH}$ for all $f,g\in\calH$.

Notably, in the case of 2., since any nuclear operator is compact, the set of eigenfunctions $\{\phi_k\}_{k\geq 1}$ of $\calK$, with eigenvalues $\{\lambda_k\}_{k\geq 1}$ such that $\calK \phi_k = \lambda_k \phi_k$, forms a CONS for $\calH$, and we are able to define a Gaussian measure $\mu$ as the probability distribution of the random element $X$, where

$$ X = \sum_{k\geq 1} \sqrt{\lambda_k} \xi_k \phi_k $$

and $\{\xi_k\}_{k\geq 1}$ is a set of i.i.d. standard Gaussian random variables. The $\lambda_k$ can be viewed as the variances of the coefficients $\sqrt{\lambda_k}\xi_k$. This corresponds to the weight space view of the GP from the last section (indeed, we may have $X\sim \mathrm{GP}(m, k)$). Another equivalent definition of a Gaussian random element, corresponding to the function space view in [1], is as follows.

Definition 2 (Gaussian field, Def 3.2.6 [2]). A Gaussian field on $G\subset \R^d$ is a collection of random variables $X = \{X(P), P\in G\}$ such that for any finite $\{P_i\}_{i=1}^n \subset G$, $(X(P_i))_{i=1}^n$ forms a Gaussian random vector.

Similarly, a zero mean Gaussian field satisfies $\E X(P) = 0\; \forall P\in G$, and is characterised by its covariance function

$$ C(P, Q) = \E[X(P)X(Q)]. $$

Now, equipped with a fundamental understanding of Gaussian random elements in topological spaces, we give a general definition of generalised random fields, which can be regarded as continuous linear maps from $V$ to the space of random variables.

Definition 3 (Generalised random field, Def 3.2.9 [2]). A generalised field $\frakX$ over a linear topological space $V$ is a collection of random variables $\{\frakX(v), v\in V\}$ with the following properties:

  1. $\frakX(au+ bv) = a\frakX(u) + b\frakX(v)$, $a,b\in\R,\;\; u,v \in V$;
  2. if $\lim_{n\to\infty} v_n = v$ in the topology of $V$, then $\lim_{n\to \infty}\frakX(v_n) = \frakX(v)$ in probability.

In particular, for generalised Gaussian fields over a Hilbert space, the following equivalent definition serves as an alternative to Definition 3.

Definition 4 (Generalised Gaussian field, Def 3.2.10 [2]). A zero mean generalised Gaussian field $\frakX$ over a Hilbert space $\calH$ is a collection of Gaussian random variables $\{\frakX (f): f\in\calH\}$ with

  1. $\E\frakX (f) = 0\;\forall f\in \calH$;
  2. There exists a bounded, linear, self-adjoint, non-negative operator $\calK$ on $\calH$ such that
$$ \E[\frakX (f)\frakX (g)] = \inner{\calK f, g}_{\calH} $$

This is of a similar spirit to Definition 1, where the RHS involves the covariance operator of the Gaussian random variables. In particular, when $\calK = \Id$, $\frakX$ is Gaussian white noise on $\calH$. One can show that for a separable Hilbert space $\calH$ with CONS $\{m_k\}_{k\geq 1}$, $\{\frakX(m_k)\}_{k\geq 1}$ is a collection of i.i.d. standard Gaussian random variables.
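A minimal sketch of (truncated) Gaussian white noise on $\calL^2([0,1])$, with the sine CONS as an arbitrary choice: $\frakX(f) = \sum_k \xi_k \inner{f, m_k}$ with i.i.d. standard Gaussian $\xi_k$, so that over realisations $\E[\frakX(f)\frakX(g)] \approx \inner{f, g}_{\calL^2}$.

```python
import numpy as np

t = np.linspace(0, 1, 2000)
dt = t[1] - t[0]
K_trunc = 500                               # truncation level of the CONS

# Sine CONS of L^2([0,1]) (an arbitrary choice) and i.i.d. N(0,1) coefficients
m_k = np.sqrt(2) * np.sin(np.pi * np.outer(np.arange(1, K_trunc + 1), t))
xi = np.random.randn(K_trunc)

def white_noise(f_vals):
    """Truncated white noise functional: X(f) = sum_k xi_k <f, m_k>."""
    coeffs = m_k @ f_vals * dt              # inner products <f, m_k> by quadrature
    return xi @ coeffs

f = np.sin(2 * np.pi * t)
g = np.cos(2 * np.pi * t)
# Over repeated draws of xi, Cov(X(f), X(g)) is approximately <f, g>_{L^2} (= 0 here)
print(white_noise(f), white_noise(g))
```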

As mentioned in the comment following Theorem 1, a Gaussian measure $\mu$ (or a Gaussian random element $X$) on separable $\calH$ corresponds to a trace-class operator $\calK$; a similar result concerning the regularity of the generalised Gaussian field $\frakX$ exists. In particular, we choose to work with $\calH = \calL^2(G), \; G\subset \R^d$.

Definition 5 (Regularity, Def 3.2.13 [2]). A generalised field $\frakX$ over a Hilbert space $\calH$ is regular if there exists an $\calH$-valued random element $X$ with $\E\norm{X}^2_{\calH}<\infty$ such that

$$ \frakX(f) = \inner{X, f}_{\calH}\;\; \forall f\in\calH. $$

Theorem 2 (Thm 3.2.15 [2]). A generalised Gaussian field $\frakX$ over the separable Hilbert space $\calL^2(G)$, $G\subset \R^d$, is regular iff its covariance operator $\calK$ is nuclear.

Proof. $(\Rightarrow)$ Assume $\frakX(f) = \inner{X, f}_{\calH}$, where $\E\norm{X}^2_{\calH}<\infty$, and let $\{m_k\}_k$ be a CONS of $\calH$; then

$$ \sum_k \inner{\calK m_k, m_k}_{\calH} = \sum_k \E\inner{X, m_k}^2 = \E\norm{X}^2_{\calH} < \infty $$

by the definition of $\frakX$ and Parseval's identity for $X$. Clearly this implies that $\calK$ is nuclear.

$(\Leftarrow)$ Assume now that $\calK$ is nuclear (thus compact), and let $\{\phi_k\}_k$ be the set of eigenfunctions of $\calK$ (a CONS of $\calH$) with eigenvalues $\{\lambda_k\}_k$. Define i.i.d. standard Gaussian random variables in the following way:

$$ \xi_k = \frac{1}{\sqrt{\lambda_k}}\frakX(\phi_k) $$

where $\lambda_k = \inner{\calK\phi_k, \phi_k}_{\calH}$ is the variance of $\frakX(\phi_k)$. Similarly to Theorem 1, we can construct the $\calH$-valued random element $X$ satisfying $\frakX(f) = \inner{X, f}_{\calH}$ in the following way, which is called the Karhunen-Loève expansion:

$$ X = \sum_k \inner{X, \phi_k}_{\calH}\phi_k = \sum_k \frakX(\phi_k)\phi_k = \sum_k \sqrt{\lambda_k}\xi_k \phi_k $$

Indeed, $\E\norm{X}^2_{\calH} = \sum_k \lambda_k <\infty$ since $\calK$ is nuclear. Finally, it is left to show that such an $X$ indeed has covariance operator $\calK$, i.e. $\E[\frakX(f)\frakX(g)] = \inner{\calK f, g}_{\calH}$ for any $f,g\in\calH$. In particular, denote $f_k = \inner{f, \phi_k}_{\calH}$, so that $f = \sum_k f_k \phi_k$, and similarly for $g$. Now,

$$ \begin{align*} \E[\frakX(f)\frakX(g)] &= \E[\inner{X, f}_{\calH}\inner{X, g}_{\calH}]\\ &= \sum_k\sum_j f_k g_j \E[\inner{X, \phi_k}\inner{X, \phi_j}]\\ &= \sum_k\sum_j f_k g_j \int\int \E[X(x)X(y)] \phi_k(x) \phi_j(y) \;dx dy\\ &= \sum_k\sum_j f_k g_j \inner{\calK \phi_k, \phi_j}\\ &= \sum_k \lambda_k f_k g_k = \inner{\calK f, g} \end{align*} $$

where $k(x,y):= \E[X(x)X(y)] = \sum_k \lambda_k \phi_k(x)\phi_k(y)$ is the reproducing kernel (covariance function), so

$$ \calK f(x) = \int k(x,y) f(y) \;dy $$

is the Covariance operator for Gaussian measure $\mu$.

$\square$
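To make the construction in the proof concrete, here is a discretised sketch on $G=[0,1]$ (a Nyström-type quadrature approximation; the exponential covariance function is an arbitrary example): the eigenpairs of the integral operator $\calK$ are approximated from the kernel matrix, and $X$ is then sampled via the truncated Karhunen-Loève expansion.

```python
import numpy as np

n = 400
t = np.linspace(0, 1, n)
dt = t[1] - t[0]

# Covariance function k(x, y) = exp(-|x - y|), an arbitrary choice
K_mat = np.exp(-np.abs(t[:, None] - t[None, :]))

# Discretise (K f)(x) = int k(x, y) f(y) dy by quadrature; its eigendecomposition
# approximates the eigenpairs (lambda_k, phi_k) of the covariance operator
evals, evecs = np.linalg.eigh(K_mat * dt)
idx = np.argsort(evals)[::-1]
lams, phis = evals[idx], evecs[:, idx] / np.sqrt(dt)   # normalise so <phi_k, phi_k> ~ 1

# Truncated Karhunen-Loeve sample: X = sum_k sqrt(lambda_k) xi_k phi_k
K_trunc = 50
xi = np.random.randn(K_trunc)
X = phis[:, :K_trunc] @ (np.sqrt(np.maximum(lams[:K_trunc], 0)) * xi)
```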

It is worth noting that the RKHS $H_{\mu}$ for the Gaussian measure $\mu$ can be found with inner product defined as

$$ \inner{f, g}_{H_{\mu}} = \inner{\calT_k^{\ast} f, \calT_k^{\ast} g}_{\calH}, \;\;f,g\in H_{\mu} $$

where $\calK = \calT_k \calT_k^{\ast}$, and $\calT_k: H_{\mu} \to \calL^2(G)$ is an isomorphism (similar to the consequences of Mercer's theorem).
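Concretely, in terms of the spectral decomposition of $\calK$ used in the proof (a formal sketch, assuming all $\lambda_k>0$ and writing $f_k = \inner{f, \phi_k}_{\calL^2(G)}$), this inner product takes the familiar Mercer form, under which $k(x,\cdot) = \sum_k \lambda_k\phi_k(x)\phi_k(\cdot)$ indeed reproduces point evaluation:

$$ \inner{f, g}_{H_{\mu}} = \sum_k \frac{f_k g_k}{\lambda_k}, \qquad \inner{k(x,\cdot), f}_{H_{\mu}} = \sum_k \frac{\lambda_k \phi_k(x) f_k}{\lambda_k} = \sum_k f_k \phi_k(x) = f(x). $$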

Remark. This definition of the generalised Gaussian random field enables a natural way of finding/constructing the RKHS corresponding to the solution of elliptic SPDEs (similar in spirit to Green's functions). This leverages the spectral properties of the Laplace(-Beltrami) operator on appropriately chosen domains, such as compact Riemannian manifolds without boundary. By truncating the infinite series (ordered by the variances/eigenvalues), we obtain a global approximation to the random field that is less prone to noise, since the high-frequency components are filtered out, as sketched below.
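As a small illustration of that truncation (a sketch only; the circle $[0, 2\pi)$, the Fourier/Laplacian eigenfunctions, and the spectral decay $\lambda_k=(1+k^2)^{-s}$ are arbitrary choices standing in for the SPDE construction of the next post):

```python
import numpy as np

t = np.linspace(0, 2 * np.pi, 1000, endpoint=False)

def truncated_field(K_trunc, s=2.0, seed=0):
    """Sample X(t) = sum_{k<=K} sqrt(lambda_k) (xi_k cos(kt) + zeta_k sin(kt)) / sqrt(pi),
    using Laplacian eigenfunctions on the circle and spectral decay lambda_k = (1 + k^2)^(-s)."""
    rng = np.random.default_rng(seed)
    X = np.zeros_like(t)
    for k in range(1, K_trunc + 1):
        lam = (1.0 + k**2) ** (-s)
        xi, zeta = rng.standard_normal(2)
        X += np.sqrt(lam) * (xi * np.cos(k * t) + zeta * np.sin(k * t)) / np.sqrt(np.pi)
    return X

rough = truncated_field(K_trunc=200)   # keeps more high-frequency components
smooth = truncated_field(K_trunc=10)   # with the same seed, exactly a truncation of `rough`
```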

References

  1. Rasmussen CE. Gaussian processes in machine learning. In: Summer School on Machine Learning 2003 Feb 2 (pp. 63-71). Springer, Berlin, Heidelberg.
  2. Lototsky SV, Rozovsky BL. Stochastic partial differential equations. New York: Springer; 2017 Jul 6.
  3. Fuglstad GA, Simpson D, Lindgren F, Rue H. Constructing priors that penalize the complexity of Gaussian random fields. Journal of the American Statistical Association. 2019 Jan 2;114(525):445-52.
  4. Heaton MJ, Datta A, Finley AO, Furrer R, Guinness J, Guhaniyogi R, Gerber F, Gramacy RB, Hammerling D, Katzfuss M, Lindgren F. A case study competition among methods for analyzing large spatial data. Journal of Agricultural, Biological and Environmental Statistics. 2019 Sep 15;24(3):398-425.
  5. Neal RM. Bayesian learning for neural networks. Springer Science & Business Media; 2012 Dec 6.
  6. Liu H, Ong YS, Shen X, Cai J. When Gaussian process meets big data: A review of scalable GPs. IEEE Transactions on Neural Networks and Learning Systems. 2020 Jan 7.