Backpropagation Derivation - Delta Rule

I enjoyed writing my background, however the bit I was really surprised to have enjoyed writing up is the derivation of back-propagation. I’ve read many books, articles and blogs that of course venture to do the same but I didn’t find any of them particularly intuitive. The best I did find were probably that of Bishop (1995) and Haykin (1994), which I based my derivation on.

Below I include this derivation of back-propagation, starting with deriving the so-called `delta rule’, the update rule for a network with a single hidden layer, and expanding the derivation to multiple-hidden layers, i.e. back-propagation.

The Delta Rule: Learning with a Single Hidden Layer

We start by describing how to learn with a single hidden layer, a method known as the delta rule. The delta rule is a straight-forward application of gradient descent (i.e. hill climbing), and is easy to do because in a neural network with a single hidden layer, the neurons have direct access to the error signal.

Detailed illustration of a single-layer neural network
Detailed illustration of a single-layer neural network trainable with the delta rule. The input layer consists of a set of inputs, $\{ X_{0}, \ldots, X_{N} \}$. The layer has weights $\{w_{j0}, \ldots, w_{jN}\}$, bias $b_j$, net neuron activation $a_j = \sum_i w_{ji}$, activation function $f$, and output $y_j$. The error for output $y_j$, $e_j$, is calculated using the target label $t_j$.

The delta rule for single-layered neural networks is a gradient descent method, using the derivative of the network’s weights with respect to the output error to adjust the weights to better classify training examples.

Training is performed on a training dataset $X$, where each training sample $\mathbf{x}^n\in X$ is a vector $\mathbf{x}^n = (X^n_0, \ldots, X^n_N)$. Assume that for a given training sample $\mathbf{x}^n$, the $i$th neuron in our single-layer neural network has output $y^n_j$, target (desired) output $t^n_j$, and weights $w=(w_{j0}, \ldots, w_{jM})$, as shown in the above figure. We can consider the bias to be an extra weight with a unit input, and thus we can omit the explicit bias from the derivation.

We want to know how to change a given weight $w_{ji}$ given the output of node $j$ for a given input data sample $\mathbf{x}^n$,

\[\begin{equation} y^n_j = f\left( a^n_j \right),\label{eqn:output} \end{equation}\]

where the net activation $a^n_j$ is,

\[\begin{equation} a^n_j = \sum_i w_{ji} X^n_{i}.\label{eqn:weightsum} \end{equation}\]

To do so, we must use the error of our prediction for each output $y_j$ and training sample $X^n$ as compared to the known label $t_j$, \(\begin{equation} e^n_j = y^n_j - t^n_j.\label{eqn:error} \end{equation}\)

For this derivation, we assume the error for a single sample is calculated by the sum of squared errors of each output. In fact, the derivation holds as long as our error function is in the form of an average (Bishop, 1995),

\[\begin{equation} E^n = \frac{1}{2} \sum_j {\left(e^n_j\right)}^2.\label{eqn:errorsum} \end{equation}\]

The chain rule allows us to calculate the sensitivity of the error to each weight $w_{ji}$ in the network,

\[\begin{align} \frac{\partial E^n}{\partial w_{ji}} &= \frac{\partial E^n}{\partial e^n_j}\frac{\partial e^n_j}{\partial y^n_j} \frac{\partial y^n_j}{\partial a^n_j} \frac{\partial a^n_j}{\partial w_{ji}}. \end{align}\]

Differentiating \eqref{eqn:errorsum} with respect to $e^n_j$,

\[\begin{align} \frac{\partial E^n}{\partial e^n_j} &= e^n_j, \end{align}\]

\eqref{eqn:error} with respect to $y^n_j$,

\[\begin{align} \frac{\partial e^n_j}{\partial y^n_j} &= 1, \end{align}\]

\eqref{eqn:output} with respect to $a^n_j$,

\[\begin{align} \frac{\partial y^n_j}{\partial a^n_j} &= f'\left( a^n_j \right), \label{eqn:partialdyda} \end{align}\]

and finally \eqref{eqn:weightsum} with respect to $w_{ji}$,

\[\begin{align} \frac{\partial a^n_j}{\partial w_{ji}} &= \frac{\partial}{\partial w_{ji}} \left( \sum_i w_{ji} X_{i} \right)\\ &= X_i,\label{eqn:sumonetermpartial} \end{align}\]

since only one of the terms in the sum is related to the specific weight $w_{ji}$.

Thus the sensitivity is, \(\begin{align} \frac{\partial E^n}{\partial w_{ji}} &= e^n_j f'\left( a^n_j \right) X_i. \label{eqn:deltasensitivity} \end{align}\)

Typically what is variously called the local gradient, error, or simply delta, is then defined,

\[\begin{align} \delta^n_j &\equiv \frac{\partial E^n}{\partial a^n_j} \\ &= \frac{\partial E^n}{\partial e^n_j}\frac{\partial e^n_j}{\partial y^n_j} \frac{\partial y^n_j}{\partial a^n_j} \\ &= e^n_j f'\left( a^n_j \right),\label{eqn:delta} \end{align}\]

such that \eqref{eqn:deltasensitivity} can be rewritten,

\[\begin{align} \frac{\partial E^n}{\partial w_{ji}} &= \delta^n_j X_i. \end{align}\]

The delta rule adjusts each weight $w_{ji}$ proportional to the sensitivity,

\[\begin{align} \Delta w_{ji} &= -\gamma \frac{\partial E^n}{\partial w_{ji}}, \end{align}\]

where $\gamma$ is a constant called the learning rate or step size. Using the delta defined in \eqref{eqn:delta}, this is simply written,

\[\begin{align} \Delta w_{ji} &= -\gamma \delta^n_j x_i. \end{align}\]

Continue to Multi-layer Backprogatation Derivation

Leave a Comment

Your email address will not be published. Required fields are marked *