Learning with Backpropagation
Learning with backpropagation is much like the delta rule: sensitivities are used to correct the weights, scaled by a constant learning rate, or step size, parameter $\gamma$. Although the correction is proportional to the sensitivity, we wish to reduce the error $E^n$, and so we move the weight in the direction opposite to the gradient $\frac{\partial E^n}{\partial w_{ji}}$.
Gradient Descent
Formally, the weight change rule is given by,
\[\begin{align} \Delta w^n_{ji} &= -\gamma \frac{\partial E^n}{\partial w_{ji}}\\ &= -\gamma \, \delta^n_j y_i,\label{eqn:backproplearningrule} \end{align}\]where $\delta^n_j$ is the sensitivity of neuron $j$, as defined in eq. (13) of the backpropagation derivation, and $y_i$ is the output of neuron $i$.
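As a concrete sketch of \eqref{eqn:backproplearningrule}, assuming the sensitivities $\delta^n_j$ and previous-layer outputs $y_i$ have already been computed for the current sample (the function and array names here are illustrative, not from the original derivation):

```python
import numpy as np

def weight_update(W, deltas, y_prev, lr):
    """Per-sample backprop update: dW[j, i] = -lr * delta_j * y_i.

    W      : (J, I) weights connecting layer outputs y_i to neurons j
    deltas : (J,)   sensitivities delta_j for sample x^n
    y_prev : (I,)   outputs y_i of the previous layer
    lr     : learning rate gamma
    """
    grad = np.outer(deltas, y_prev)  # dE^n/dW, one entry per weight
    return W - lr * grad             # step against the gradient
```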
Backpropagation is a method of steepest descent. This is illustrated in fig. 1, where the backpropagation learning rule, \eqref{eqn:backproplearningrule}, specifies a step along the negative gradient, with the learning rate scaling the step size, i.e. the magnitude of the weight change vector. Fig. 1 also illustrates the effect of the learning rate on gradient descent: too small a learning rate can result in very slow learning, as for $\gamma_0$, while too large a step size can result in bouncing around the minimum ($\gamma_2, \gamma_3$), or missing it altogether.
In order to settle into a local minimum, the learning rate must also be decreased as training progresses. However, if the learning rate is decreased too quickly, the search may never reach the basin of attraction of the local minimum, as with $\gamma_0$, while if it is decreased too slowly, it will take a very long time to enter the basin of attraction, as with $\gamma_3$.
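A minimal sketch of one such decay policy, an exponential schedule; the form and constants are illustrative assumptions, not a recommendation from the text:

```python
def exponential_decay(lr0, decay, step):
    """Illustrative schedule: lr_t = lr0 * decay**step.

    A decay close to 1.0 shrinks the step size slowly (risking a long
    crawl toward the basin of attraction), while a small decay can
    freeze learning before the basin is ever reached.
    """
    return lr0 * decay ** step

# e.g. lr starts at 0.1 and roughly halves every ~70 steps:
lrs = [exponential_decay(0.1, 0.99, t) for t in range(200)]
```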
The balance of finding an appropriate learning rate and learning rate policy is unfortunately part of the “black magic” behind training DNNs that comes with experience, but Bottou et al., 2012 and Goodfellow et al., 2016 are excellent references on some of the common approaches taken to make this task simpler.
The Problem with First-Order Optimization
The underlying reason the learning rate and learning policy have such a large effect is that gradient descent is a first-order optimization method, which considers only the first-order partial derivatives, i.e. for a 2D error surface $E(x, y)$, gradient descent moves in the opposite direction of the gradient,
\[\begin{equation} \nabla E(x, y) = {\left(\frac{\partial E}{\partial x}, \frac{\partial E}{\partial y}\right)}. \end{equation}\]This gradient tells us the direction of maximum increase at a given point on the error surface, but it does not tell us any information about the curvature of the surface at that point. The curvature of the surface is described by higher-order derivatives such as the second-order partial derivatives, e.g. $\frac{\partial^2 E}{\partial x^2}$, and mixed partial derivatives, e.g. $\frac{\partial^2 E}{\partial x\,\partial y}$.
These second-order partials give important information about the curvature of the error surface $E$. For example, in fig. 1, the error surface takes on an elliptical shape, which causes problems when we only consider the direction of maximum decrease, $-\nabla E$. The classic example of such a pathological error surface for first-order methods is one that looks like a long, narrow valley, as shown in fig. 2. With an initialization outside the bottom of the valley, gradient descent will bounce between the walls of the valley, leading to very slow convergence.
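To make the valley problem concrete, here is a small sketch using the Rosenbrock function, a standard stand-in for this kind of narrow curved valley (it is not the surface in fig. 2, and the starting point and learning rate are illustrative assumptions):

```python
import numpy as np

def rosenbrock_grad(x, y):
    """Gradient of f(x, y) = (1 - x)**2 + 100 * (y - x**2)**2."""
    dfdx = -2 * (1 - x) - 400 * x * (y - x**2)
    dfdy = 200 * (y - x**2)
    return np.array([dfdx, dfdy])

w = np.array([-1.5, 2.0])   # initialized up on the valley wall
lr = 1e-3
for t in range(10_000):
    w = w - lr * rosenbrock_grad(*w)
print(w)  # creeps very slowly along the valley toward the minimum at (1, 1)
```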
For well-behaved surfaces where the scaling of parameters is similar, basins of attraction around a minimum are roughly circular, and thus avoid this problem, since the first-order gradients point almost directly at the minimum from any location on the error surface.
There are second-order optimization methods based on Newton’s method; however, they do not scale to the size of any practical DNN. Any full second-order optimization method requires the matrix of second-order partial derivatives of a scalar-valued function, the Hessian $\mathbf{H}$, which is square in the number of parameters in the network. For networks with millions of parameters, merely storing the Hessian is infeasible.
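For a rough sense of scale (the parameter count here is an illustrative assumption): a network with $N = 10^7$ parameters has a Hessian with $N^2 = 10^{14}$ entries; at 4 bytes per entry that is
\[\begin{equation} 10^{14} \times 4\ \text{bytes} = 4 \times 10^{14}\ \text{bytes} \approx 400\ \text{TB}, \end{equation}\]far more than can be stored, let alone inverted.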
There is a whole slew of optimization tricks for gradient descent, often attempting to compensate for the shortcomings of first-order optimization without using the Hessian, or while using some approximation to it. We will not cover those here, since none of them were used in our experiments. A full background of the issues of optimization in DNNs is outside the scope of this dissertation; however, interested readers should refer to Goodfellow et al., 2016 to learn more about these methods, and Martens et al., 2010 for an excellent introduction to the problems of first- and second-order optimization in DNNs.
Momentum
A common improvement to gradient descent is momentum (Polyak et al., 1964, Rumelhart et al., 1986), a trick for minimizing the effect of pathological curvature on gradient descent, which also helps with variance in the gradients. The name comes from the analogy of the update to the physical momentum of a moving particle, $\mathbf{p}=m\mathbf{v}$, where we assume unit mass, $m=1$.
In momentum, the gradients over multiple iterations are accumulated into a velocity gradient,
\[\begin{align} \mathbf{v}_{t+1} &= \alpha \mathbf{v}_{t} - \gamma\nabla E(\mathbf{w}_t)\\ \mathbf{w}_{t+1} &= \mathbf{w}_t + \mathbf{v}_{t+1}, \end{align}\]where $\alpha$ is the momentum coefficient, $\gamma$ is the learning rate, $t$ is the iteration, $\nabla E$ is the gradient of the error surface $E(\mathbf{w})$ being minimized, and $\mathbf{w}$ is the weight vector being optimized. Momentum in effect stores some information on the gradients found in past iterations, and uses this to damp the effect of a new gradient on the search direction, as illustrated in fig. 1. For error surfaces with pathological curvature, this can dramatically speed up learning.
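A minimal sketch of the momentum update above; the elliptical-bowl example and all constants are assumptions for illustration:

```python
import numpy as np

def momentum_step(w, v, grad, lr=0.01, alpha=0.9):
    """One momentum step: v_{t+1} = alpha * v_t - lr * grad,
    then w_{t+1} = w_t + v_{t+1}."""
    v = alpha * v - lr * grad
    return w + v, v

# Elliptical bowl E(w) = 0.5 * (w[0]**2 + 10 * w[1]**2): gradient
# components along the steep axis largely cancel across iterations,
# while progress along the shallow axis accumulates.
w, v = np.array([5.0, 5.0]), np.zeros(2)
for t in range(100):
    grad = np.array([w[0], 10 * w[1]])
    w, v = momentum_step(w, v, grad)
```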
Batch and Stochastic Gradient Descent
Although the backpropagation weight change rule, \eqref{eqn:backproplearningrule}, tells us how to change the weights given a single training sample $x^n$, in practice this method is rarely used. The reason is simply that the gradient from a single sample is too noisy (a high-variance estimate of the true gradient), and not representative of the dataset in general; $\Delta w^n_{ji}$ is only an approximation to the gradient we want, since it is computed from just one sample, $x^n$, of the training dataset $X$.
Batch Gradient Descent
At the opposite end of the spectrum there is batch training where the gradient is computed over all data samples in the training set,
\[\begin{align} \Delta w_{ji} &= -\gamma\frac{1}{N} \sum^N_{n=1} \frac{\partial E^n}{\partial w_{ji}}, \label{eqn:batchlearningrule} \end{align}\]where $N$ is the number of training samples in $X$. Batch training gives us the true gradient over the training set, but it is also very expensive, since it requires us to perform the forward pass of the network over all training samples for every update.
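As a sketch, here is a full-batch update for a toy linear model with squared error; the model and variable names are assumptions, chosen only to make the averaging in \eqref{eqn:batchlearningrule} explicit:

```python
import numpy as np

def batch_step(w, X, t, lr):
    """One full-batch update for the linear model y = X @ w with
    per-sample error E^n = 0.5 * (y^n - t^n)**2.

    Every update averages the per-sample gradients over all N samples,
    so it costs a forward pass over the entire training set.
    """
    N = len(X)
    err = X @ w - t          # per-sample errors, shape (N,)
    grad = X.T @ err / N     # mean of the per-sample gradients
    return w - lr * grad
```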
Stochastic Gradient Descent
Instead of computing the gradient on only one training sample, or over the entire training set, we might instead use a modestly sized subset of the training set, a mini-batch. This approach is called Stochastic Gradient Descent (SGD).
When using SGD, we randomly sample (without replacement) a subset of the training set $X_{\textrm{mb}} \subset X$, such that
\[\begin{align} \Delta w_{ji} &= -\gamma \frac{1}{|X_{\textrm{mb}}|} \sum_{\{n\,|\,\mathbf{x}^n \in X_{\textrm{mb}}\}} \frac{\partial E^n}{\partial w_{ji}},\label{eqn:sgdrule} \end{align}\]where the size of the mini-batch, $|X_{\textrm{mb}}|$, should be large enough to represent the statistics of the training set distribution, i.e. for a classification problem the mini-batch should capture a significant number of the classes in the training set. Using a mini-batch size of one, i.e. a single sample as in \eqref{eqn:backproplearningrule}, is a special case of SGD.
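Continuing the toy linear model from the batch sketch above (again an illustrative assumption), one epoch of mini-batch SGD with sampling without replacement might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_epoch(w, X, t, lr, batch_size):
    """One epoch of mini-batch SGD on the linear model y = X @ w.

    Shuffling once per epoch and slicing gives sampling without
    replacement; batch_size=1 recovers the single-sample rule, and
    batch_size=len(X) recovers full-batch gradient descent.
    """
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        mb = idx[start:start + batch_size]
        err = X[mb] @ w - t[mb]
        w = w - lr * X[mb].T @ err / len(mb)
    return w
```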
It has been observed in practice that the noise added to the gradient by stochastic gradient descent often helps generalization compared to batch gradient descent, perhaps by preventing overfitting. Note that even the true gradient over the training dataset is only an approximation to the one we ultimately want: the training set $X$ is itself only a sample of the population distribution we want our network to generalize to.