Neural network Gaussian process



Left: a Bayesian neural network with two hidden layers, transforming a 3-dimensional input (bottom) into a two-dimensional output

Bayesian networks are a modeling tool for assigning probabilities to events, and thereby characterizing the uncertainty in a model’s predictions. Deep learning and artificial neural networks are approaches used in machine learning to build computational models which learn from training examples. Bayesian neural networks merge these fields. They are a type of artificial neural network whose parameters and predictions are both probabilistic.[1][2] While standard artificial neural networks often assign high confidence even to incorrect predictions,[3] Bayesian neural networks can more accurately evaluate how likely their predictions are to be correct.
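
Concretely, a Bayesian neural network's prediction is an average over a distribution of parameter settings rather than the output of a single setting. The following is a minimal sketch of that idea, not code from the article: `predict_proba(x, theta)` stands for an ordinary network forward pass returning class probabilities, and `theta_samples` stands for parameter samples drawn from the network's (approximate) posterior; both names are hypothetical.

```python
import numpy as np

def bayesian_predict(predict_proba, theta_samples, x):
    """Predictive class probabilities of a Bayesian neural network.

    Averages the network output over parameter samples; the spread of the
    per-sample predictions also indicates how uncertain the model is.
    """
    per_sample = np.stack([predict_proba(x, theta) for theta in theta_samples])
    return per_sample.mean(axis=0), per_sample.std(axis=0)
```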

Neural Network Gaussian Processes (NNGPs) are equivalent to Bayesian neural networks in a particular limit,[4][5][6][7][8][9][10][11][12] and provide a closed-form way to evaluate Bayesian neural networks. An NNGP is a Gaussian process probability distribution which describes the distribution over predictions made by the corresponding Bayesian neural network. Computation in artificial neural networks is usually organized into sequential layers of artificial neurons. The number of neurons in a layer is called the layer width. The equivalence between NNGPs and Bayesian neural networks occurs when the layers in a Bayesian neural network become infinitely wide (see figure). This large width limit is of practical interest, since finite width neural networks typically perform strictly better as layer width is increased.[13][14][8][15]
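
The closed-form evaluation works through standard Gaussian process regression: once the NNGP kernel for an architecture is known, the posterior over the outputs of the corresponding infinitely wide Bayesian neural network at test inputs follows from the usual Gaussian conditioning formulas. The sketch below illustrates this under stated assumptions: `nngp_kernel` is a hypothetical stand-in for an architecture-specific NNGP kernel function, and observation noise is modeled with a simple variance term.

```python
import numpy as np

def nngp_predict(nngp_kernel, X_train, y_train, X_test, noise_var=1e-2):
    """Closed-form GP regression with a given NNGP kernel.

    nngp_kernel(A, B) is assumed to return the matrix of kernel values
    between the rows of A and the rows of B.
    """
    K_tt = nngp_kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_st = nngp_kernel(X_test, X_train)
    K_ss = nngp_kernel(X_test, X_test)

    # Posterior mean and covariance via a Cholesky factorization of K_tt.
    L = np.linalg.cholesky(K_tt)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_st @ alpha
    v = np.linalg.solve(L, K_st.T)
    cov = K_ss - v.T @ v
    return mean, cov
```

For some nonlinearities the NNGP kernel itself has an analytic expression, and for deep networks it can be built up layer by layer, as the fully connected case discussed below illustrates.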

The NNGP also appears in several other contexts: it describes the distribution over predictions made by wide non-Bayesian artificial neural networks after random initialization of their parameters, but before training; it appears as a term in neural tangent kernel prediction equations; it is used in deep information propagation to characterize whether hyperparameters and architectures will be trainable.[16]
It is related to other large width limits of neural networks.


A cartoon illustration

Every setting of a neural network's parameters $\theta$ corresponds to a specific function computed by the neural network. A prior distribution $p(\theta)$ over neural network parameters therefore corresponds to a prior distribution over functions computed by the network. As neural networks are made infinitely wide, this distribution over functions converges to a Gaussian process for many architectures.

The figure to the right plots the one-dimensional outputs $z^L(\cdot;\theta)$ of a neural network for two inputs $x$ and $x^*$ against each other. The black dots show the function computed by the neural network on these inputs for random draws of the parameters from $p(\theta)$. The red lines are iso-probability contours for the joint distribution over network outputs $z^L(x;\theta)$ and $z^L(x^*;\theta)$ induced by $p(\theta)$. This is the distribution in function space corresponding to the distribution $p(\theta)$ in parameter space, and the black dots are samples from this distribution. For infinitely wide neural networks, since the distribution over functions computed by the neural network is a Gaussian process, the joint distribution over network outputs is a multivariate Gaussian for any finite set of network inputs.
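
The sampling procedure behind this figure can be reproduced numerically. The sketch below is an illustration under assumed settings, not code from the article: it draws the parameters of a one-hidden-layer fully connected network from Gaussian priors whose weight variance scales inversely with the incoming layer width, and records the output pair $(z^L(x;\theta), z^L(x^*;\theta))$ for each draw. As the hidden width grows, the scatter of these pairs approaches a bivariate Gaussian whose covariance is the NNGP kernel evaluated at $x$ and $x^*$.

```python
import numpy as np

def sample_output_pairs(x, x_star, width=1000, n_draws=2000,
                        sigma_w=1.0, sigma_b=0.1, seed=0):
    """Draw (z^L(x), z^L(x*)) for random parameter samples from the prior.

    One hidden layer with a tanh nonlinearity and a scalar readout; the
    weight standard deviation is scaled by 1/sqrt(incoming width).
    """
    rng = np.random.default_rng(seed)
    d = len(x)
    pairs = np.empty((n_draws, 2))
    for i in range(n_draws):
        W0 = rng.normal(0.0, sigma_w / np.sqrt(d), size=(width, d))   # hidden weights
        b0 = rng.normal(0.0, sigma_b, size=width)                     # hidden biases
        W1 = rng.normal(0.0, sigma_w / np.sqrt(width), size=width)    # readout weights
        b1 = rng.normal(0.0, sigma_b)                                 # readout bias
        for j, inp in enumerate((x, x_star)):
            y_hidden = np.tanh(W0 @ inp + b0)      # post-nonlinearity activations
            pairs[i, j] = W1 @ y_hidden + b1       # scalar output z^L(inp; theta)
    return pairs

# The empirical covariance of the sampled pairs approximates the 2x2 NNGP kernel.
samples = sample_output_pairs(np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0]))
print(np.cov(samples, rowvar=False))
```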

The notation used in this section is the same as that used below to derive the correspondence between NNGPs and fully connected networks, where more details can be found.

Architectures which correspond to an NNGP

The equivalence between infinitely wide Bayesian neural networks and NNGPs has been shown to hold for: single hidden layer[4] and deep[6][7] fully connected networks as the number of units per layer is taken to infinity; convolutional neural networks as the number of channels is taken to infinity;[8][9][10] transformer networks as the number of attention heads is taken to infinity;[17] and recurrent networks as the number of units is taken to infinity.[12]
In fact, this NNGP correspondence holds for almost any architecture: generally, if an architecture can be expressed solely via matrix multiplication and coordinatewise nonlinearities (i.e. a tensor program), then it has an infinite-width GP.[12]
This in particular includes all feedforward or recurrent neural networks composed of multilayer perceptrons, recurrent neural networks (e.g. LSTMs, GRUs), (nD or graph) convolution, pooling, skip connections, attention, batch normalization, and/or layer normalization.

Correspondence between an infinitely wide fully connected network and a Gaussian process

This section expands on the correspondence between infinitely wide neural networks and Gaussian processes for the specific case of a fully connected architecture. It provides a proof sketch outlining why the correspondence holds, and introduces the specific functional form of the NNGP for fully connected networks. The proof sketch closely follows the approach in Novak et al., 2018.[8]

Network architecture specification

An NNGP is derived which is equivalent to a Bayesian neural network with this fully connected architecture.

Consider a fully connected artificial neural network with inputs $x$, parameters $\theta$ consisting of weights $W^l$ and biases $b^l$ for each layer $l$ in the network, pre-activations (pre-nonlinearity) $z^l$, activations (post-nonlinearity) $y^l$, pointwise nonlinearity $\phi(\cdot)$, and layer widths $n^l$. For simplicity, the width $n^{L+1}$ of the readout vector $z^L$ is taken to be 1. The parameters of this network have a prior distribution $p(\theta)$, which consists of an isotropic Gaussian for each weight and bias, with the variance of the weights scaled inversely with layer width. This network is illustrated in the figure to the right, and is described by the following set of equations:
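
A standard way to write these equations, consistent with the description above, is the following sketch; the weight and bias variance hyperparameters $\sigma_w^2$ and $\sigma_b^2$, which set the prior scale, are introduced here for concreteness and are assumptions of this sketch.

```latex
\begin{align}
  y^{0}(x) &= x, \\
  z^{l}(x) &= W^{l}\, y^{l}(x) + b^{l}, \\
  y^{l+1}(x) &= \phi\!\left(z^{l}(x)\right), \\
  W^{l}_{ij} &\sim \mathcal{N}\!\left(0,\ \frac{\sigma_w^{2}}{n^{l}}\right), \qquad
  b^{l}_{i} \sim \mathcal{N}\!\left(0,\ \sigma_b^{2}\right),
  \quad \text{all drawn independently.}
\end{align}
```

The $1/n^{l}$ factor is the "variance of the weights scaled inversely with layer width" mentioned above; it keeps the pre-activations $z^{l}$ at an order-one scale as the layer widths are taken to infinity.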
