Backpropagation Derivation - Multi-layer Neural Networks
The Limitations of Single-Layer Networks
A single-layer neural network, such as the perceptron shown in fig. 1, is only a linear classifier, and as such is ineffective at learning a large variety of tasks. Most notably, in the 1969 book Perceptrons, Minsky and Papert showed that single-layer perceptrons could not learn to model functions as simple as the XOR function, amongst other non-linearly separable classification problems.
As shown in fig. 2, no single line can separate even a sparse sampling of the XOR function — i.e. it is not linearly separable. Instead, only a composition of lines is able to correctly separate and classify this function, and other non-linearly separable problems.
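To see this concretely, the sketch below checks by brute force that no single linear threshold unit over a grid of weights reproduces XOR on its four input patterns, while a hand-built composition of three such units (OR, NAND, and an AND of the two) does. This is only an illustration; the weight grid and the particular composition are choices made here, not something prescribed by the discussion above.

```python
import numpy as np
from itertools import product

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t_xor = np.array([0, 1, 1, 0])

# brute-force search over single linear threshold units: y = step(w1*x1 + w2*x2 + b)
grid = np.linspace(-2, 2, 21)
found = any(
    np.array_equal(((X @ [w1, w2] + b) > 0).astype(int), t_xor)
    for w1, w2, b in product(grid, grid, grid)
)
print(found)  # False: no single line separates the four XOR points

# a composition of linear threshold units succeeds: XOR = AND(OR(x1, x2), NAND(x1, x2))
h_or = ((X @ [1, 1] - 0.5) > 0).astype(int)      # OR unit
h_nand = ((X @ [-1, -1] + 1.5) > 0).astype(int)  # NAND unit
y = ((h_or + h_nand - 1.5) > 0).astype(int)      # AND of the two hidden units
print(np.array_equal(y, t_xor))  # True
```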
At the time, it was not obvious how to train networks with more than one layer of neurons, since the existing methods for learning neuron weights, the perceptron learning rule (Rosenblatt, 1961) for perceptrons and the delta rule (Widrow and Hoff, 1960) for general neurons (as derived in the previous post), only applied to single-layer networks. This became known as the credit-assignment problem.
Backpropagation
The credit-assignment problem was solved with the discovery of backpropagation (also known as the generalized delta rule), which allows learning in multi-layer neural networks. It is somewhat controversial who first “discovered” backpropagation, since it is essentially the application of the chain rule to neural networks; however, it is generally accepted that it was first demonstrated experimentally in neural networks by Rumelhart et al., 1986. Although it is “just the chain rule”, to dismiss this first demonstration is to understate the importance of the discovery to the field, and to dismiss the practical difficulties of first implementing the algorithm, as anyone who has since attempted it can attest.
The following is a derivation of backpropagation loosely based on the excellent references of Bishop (1995) and Haykin (1994), although with different notation. This derivation builds upon the derivation for the delta rule in the previous section, although it is important to note that, as shown in fig. 3, the indexing we will use to refer to neurons of different layers differs from that in fig. 1 for the single-layer case.
We are interested in finding the sensitivity of the error $E$ to a given weight in the network $w_{ij}$. There are two classes of weights for which we must derive different rules,
- Output Neurons: those belonging to output layer neurons, i.e. neurons lying directly before the output, such as $w_{kj}$ in fig. 3, and
- Hidden Neurons: those belonging to hidden layer neurons, i.e. neurons whose outputs connect only to other neurons rather than directly to the output, such as $w_{ji}$ in fig. 3.
Output Layer
The output weights are relatively easy to deal with, since they are the same type of weight found in single-layer networks, and have direct access to the error signal, i.e. $e^n_k$ in the indexing of fig. 3.
Indeed our derivation of the delta rule also describes the sensitivity of the weights in the output layer of a multi-layer neural network. With some change of notation (now indexing by $k$ rather than $j$ to match fig. 3), we can use the same delta rule sensitivity,
\[\begin{align} \frac{\partial E^n}{\partial w_{kj}} &= \frac{\partial E^n}{\partial a^n_k} \frac{\partial a^n_k}{\partial w_{kj}}\\ \frac{\partial E^n}{\partial w_{kj}} &= \frac{\partial E^n}{\partial e^n_k}\frac{\partial e^n_k}{\partial y^n_k} \frac{\partial y^n_k}{\partial a^n_k} \frac{\partial a^n_k}{\partial w_{kj}}\\ &= - e^n_k f'\left( a^n_k \right) x^n_{kj}\\ &= \delta^n_k x^n_{kj}\\ &= \delta^n_k y^n_j.\label{eqn:outputlayer} \end{align}\]
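As a concrete check of \eqref{eqn:outputlayer}, the following sketch evaluates the output-layer gradient numerically for a tiny network and compares one entry against a finite difference. It is a minimal illustration only; the layer sizes, the logistic activation (for which $f'(a) = f(a)(1 - f(a))$) and the variable names are assumptions made here rather than part of the derivation.

```python
import numpy as np

def f(a):
    """Logistic activation; its derivative is f'(a) = f(a) * (1 - f(a))."""
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
y_hidden = rng.normal(size=3)             # outputs y_j of the previous layer (treated as fixed)
W_out = rng.normal(size=(2, 3))           # output-layer weights w_kj
t = np.array([1.0, 0.0])                  # targets t_k for pattern n

a_k = W_out @ y_hidden                    # net activations a_k = sum_j w_kj y_j
y_k = f(a_k)                              # outputs y_k = f(a_k)
e_k = t - y_k                             # error signals e_k = t_k - y_k

delta_k = -e_k * y_k * (1.0 - y_k)        # delta_k = -e_k f'(a_k)
grad_W_out = np.outer(delta_k, y_hidden)  # dE/dw_kj = delta_k y_j

# finite-difference check of one entry of the gradient
E = lambda W: 0.5 * np.sum((t - f(W @ y_hidden)) ** 2)
eps = 1e-6
W_pert = W_out.copy()
W_pert[0, 0] += eps
print(grad_W_out[0, 0], (E(W_pert) - E(W_out)) / eps)  # the two values agree closely
```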
Hidden Layer
We will first derive the partial derivative $\frac{\partial E^n}{\partial w_{ji}}$ for a single hidden layer network, such as that illustrated in fig. 3. Unlike in the case of a single-layer network, as covered in the previous derivation of the delta rule, the weights belonging to hidden neurons have no direct access to the error signal; instead, we must accumulate the error signal through all of the neurons that indirectly connect the weight to the error (i.e. every output neuron $y_k$).
Following from the chain rule, we can write the partial derivative of the error $E^n$ with respect to a hidden weight $w_{ji}$,
\[\begin{align} \frac{\partial E^n}{\partial w_{ji}} &= \underbrace{\left( \sum_k \frac{\partial E^n}{\partial e^n_{k}} \frac{\partial e^n_{k}}{\partial y^n_{k}} \frac{\partial y^n_{k}}{\partial a^n_k} \frac{\partial a^n_k}{\partial y^n_{j}}\right)}_\text{output neurons} \underbrace{\frac{\partial y^n_{j}}{\partial a^n_{j}} \frac{\partial a^n_{j}}{\partial w_{ji}}}_\text{hidden neuron},\label{eqn:twolayer1} \end{align}\]where the sum arises from the fact that, unlike in delta rule: (10) where the weight $w_{kj}$ affects only a single output, the hidden weight $w_{ji}$ affects all neurons in the subsequent layer (see fig. 3).
We already know how to calculate the partial derivatives for the output layer from the derivation of the delta rule for single-layer networks: the first three factors inside the sum are exactly the output-layer delta $\delta^n_k$ appearing in \eqref{eqn:outputlayer}. Substituting,
\[\begin{align} \frac{\partial E^n}{\partial w_{ji}} &= \left( \sum_k \delta^n_k \frac{\partial a^n_k}{\partial y^n_{j}}\right) \frac{\partial y^n_{j}}{\partial a^n_{j}} \frac{\partial a^n_{j}}{\partial w_{ji}}.\label{eqn:twolayer2} \end{align}\]Recall from delta rule: (2) that the net activation $a^n_k$ is a weighted sum of the previous layer's outputs. Thus,
\[\begin{align*} \frac{\partial a^n_k}{\partial y^n_{j}} &= \frac{\partial}{\partial y^n_{j}} \left(\sum_j w_{kj} y^n_{j} \right)\\ &= w_{kj}, \end{align*}\]and substituting from delta rule: (8) and delta rule: (10) into \eqref{eqn:twolayer2},
\[\begin{align} \frac{\partial E^n}{\partial w_{ji}} &= \left( \sum_k \delta^n_k w_{kj}\right) f'\left( a^n_j \right) x_i. \label{eqn:twolayer3} \end{align}\]This bears some resemblance to the expression derived for a single layer, and just as in delta rule: (14), we can use our definition of the delta to simplify it. For a hidden neuron the delta evaluates as \(\begin{align} \delta^n_j &\equiv \frac{\partial E^n}{\partial a^n_j}\\ &= \left( \sum_k \frac{\partial E^n}{\partial e^n_{k}} \frac{\partial e^n_{k}}{\partial y^n_{k}} \frac{\partial y^n_{k}}{\partial a^n_k} \frac{\partial a^n_k}{\partial y^n_{j}} \right) \frac{\partial y^n_{j}}{\partial a^n_{j}}\\ &= \left(\sum_k \delta^n_k w_{kj} \right) f'\left( a^n_j \right).\label{eqn:deltahidden} \end{align}\)
This leaves us with the more convenient expression (as we will see when deriving for an arbitrary number of hidden layers), \(\begin{align} \frac{\partial E^n}{\partial w_{ji}} &= \delta^n_j x_i. \label{eqn:twolayer4} \end{align}\)
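The same kind of numerical check applies to \eqref{eqn:deltahidden} and \eqref{eqn:twolayer4}. The sketch below adds a hidden layer to the previous snippet, forms $\delta^n_j$ by summing the output deltas through the weights $w_{kj}$, and compares one entry of the hidden-layer gradient with a finite difference. As before, the shapes, the logistic activation and the names are assumptions made for illustration.

```python
import numpy as np

def f(a):
    """Logistic activation; f'(a) = f(a) * (1 - f(a))."""
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(1)
x = rng.normal(size=4)                   # inputs x_i
W_hid = rng.normal(size=(3, 4))          # hidden-layer weights w_ji
W_out = rng.normal(size=(2, 3))          # output-layer weights w_kj
t = np.array([1.0, 0.0])                 # targets t_k

# forward pass
a_j = W_hid @ x
y_j = f(a_j)
a_k = W_out @ y_j
y_k = f(a_k)
e_k = t - y_k

# backward pass
delta_k = -e_k * y_k * (1.0 - y_k)                 # output deltas, as in the previous snippet
delta_j = (W_out.T @ delta_k) * y_j * (1.0 - y_j)  # delta_j = f'(a_j) sum_k delta_k w_kj
grad_W_hid = np.outer(delta_j, x)                  # dE/dw_ji = delta_j x_i

# finite-difference check of one entry of the hidden-layer gradient
E = lambda Wh: 0.5 * np.sum((t - f(W_out @ f(Wh @ x))) ** 2)
eps = 1e-6
W_pert = W_hid.copy()
W_pert[0, 0] += eps
print(grad_W_hid[0, 0], (E(W_pert) - E(W_hid)) / eps)  # the two values agree closely
```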
Arbitrary Number of Hidden Layers
The derivation above was based on the specific case of a single hidden layer network, but it is trivial to extend this result to multiple hidden layers. There is a recursion in the calculation of the partial derivatives in \eqref{eqn:deltahidden} which holds for a network with any number of hidden layers, and which we will now make explicit.
The delta is defined,
\[\begin{equation} \delta^n_j = \begin{cases} - e^n_j f'\left( a^n_j \right) & \textrm{when neuron $j$ is an output neuron}\\ f'\left( a^n_j \right) \left( \sum_k \delta^n_k w_{kj} \right) & \textrm{when neuron $j$ is a hidden neuron}, \end{cases}\label{eqn:fulldeltadefinition} \end{equation}\]where the sum over $k$ runs over the neurons in the layer immediately following neuron $j$. This holds for any pair of adjacent layers $i, j$ (with layer $j$ following layer $i$), including the case where $j$ indexes the output layer. The sensitivity is then,
\[\begin{align} \frac{\partial E^n}{\partial w_{ji}} &= \delta^n_j y_i. \label{eqn:sensitivity} \end{align}\]
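The recursion in \eqref{eqn:fulldeltadefinition} maps directly onto a loop that walks backwards through the layers. The sketch below is one plain way of writing that loop for a fully-connected network with a logistic activation at every layer and no bias terms; the layer widths, the function names and the finite-difference sanity check are assumptions made here for illustration, not a prescription from the references above.

```python
import numpy as np

def f(a):
    """Logistic activation; f'(a) = f(a) * (1 - f(a))."""
    return 1.0 / (1.0 + np.exp(-a))

def forward(Ws, x):
    """Return the output y of every layer; ys[0] is the input itself."""
    ys = [x]
    for W in Ws:
        ys.append(f(W @ ys[-1]))
    return ys

def backprop(Ws, x, t):
    """Return dE/dW for every weight matrix, for a single pattern (x, t)."""
    ys = forward(Ws, x)
    e = t - ys[-1]
    delta = -e * ys[-1] * (1.0 - ys[-1])          # output layer: delta = -e f'(a)
    grads = [None] * len(Ws)
    for layer in reversed(range(len(Ws))):
        grads[layer] = np.outer(delta, ys[layer])  # dE/dw_ji = delta_j y_i
        if layer > 0:
            # hidden recursion: delta_j = f'(a_j) * sum_k delta_k w_kj
            delta = (Ws[layer].T @ delta) * ys[layer] * (1.0 - ys[layer])
    return grads

# sanity check against a finite difference for a three-hidden-layer network
rng = np.random.default_rng(0)
sizes = [4, 5, 6, 3, 2]                           # layer widths, chosen arbitrarily
Ws = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
x, t = rng.normal(size=4), np.array([1.0, 0.0])

E = lambda Ws: 0.5 * np.sum((t - forward(Ws, x)[-1]) ** 2)
grads = backprop(Ws, x, t)
eps = 1e-6
W_pert = [W.copy() for W in Ws]
W_pert[1][2, 3] += eps
print(grads[1][2, 3], (E(W_pert) - E(Ws)) / eps)  # the two values agree closely
```

With the gradients in hand, learning proceeds just as in the single-layer case, by stepping each weight against its gradient.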