Deep learning analytics

Bernd Porr


Motivation

We have a dataset $\vec{x}_1, \ldots, \vec{x}_N$ which should be mapped by the function

$\displaystyle \vec{y}=f(\vec{x})$ (1)

onto outputs $\vec{y}$. The goal is to learn this function with the help of example pairs $(\vec{x}_1,\vec{y}_1), (\vec{x}_2,\vec{y}_2), \ldots$.

The following sections show how learning of $f$ with the help of examples can be achieved. This is called “inductive learning”.

We first introduce a linear neuron, then derive the mathematics for a linear deep network, and finally introduce non-linearities into the deep network. This allows the mathematics to be introduced in a more gradual way.

Linear neuron

Figure 1: Simple linear neuron
\includegraphics[width=\textwidth]{one_layer}

Fig. 1A shows a neuron in a single layer (Rosenblatt, 1958). It is the $i$th neuron in this layer and its activation is $y_i(n)$, where $n$ is the current time-step. The input activities to the neuron $x_j(n) = y_j(n)$ are weighted by $\omega_{ji}$ and summed up to:

$\displaystyle y_i(n) = \sum_j y_j(n) \omega_{ji}$ (2)

Our function $f$ from the introduction is here a simple linear combination of input activities, where the weights $\omega_{ji}$ have to be learned to approximate $f$.
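As a concrete illustration, Eq. 2 can be written in a few lines of numpy; the input and weight values here are made up for the example.

```python
import numpy as np

# A linear neuron (Eq. 2): the output y_i is the weighted sum of the
# input activities. Input and weight values are made up for this demo.
def linear_neuron(y_in, w):
    """Return y_i = sum_j y_j * omega_ji for a single neuron."""
    return float(np.dot(y_in, w))

y_in = np.array([1.0, 2.0, 3.0])   # input activities y_j(n)
w = np.array([0.5, -1.0, 0.25])    # weights omega_ji
print(linear_neuron(y_in, w))      # 0.5 - 2.0 + 0.75 = -0.75
```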

The error signal

The task is to learn a function $f$ which converts the input activities to output activities (Eq. 1). This is achieved by minimising the error:

$\displaystyle e_i(n) = d_i(n) - y_i(n)$ (3)

where $d_i(n)$ is the desired output value and $y_i(n)$ the actual output value. The goal is to minimise the square of the error:

$\displaystyle E = \frac{1}{2} e_i(n)^2$ (4)

Gradient descent

The central trick here is to take the partial derivative of the error with respect to the weights $\omega_{ji}$:

$\displaystyle \Delta\omega_{ji} = - \mu \frac{\partial E}{\partial \omega_{ji}}$ (5)

$\displaystyle \omega_{ji} \leftarrow \omega_{ji} + \Delta\omega_{ji}$ (6)

Why does this make sense? Fig. 1B shows the relationship between the error $E$ and the weight $\omega_{ji}$. In this example, if the weight is slightly increased then the squared error also increases, which is not desirable. However, if increasing $\omega_{ji}$ reduces the squared error $E$, then it is a good idea to keep changing the weight in this direction. This approach is called gradient descent.
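The argument of Fig. 1B can be tried out directly. The following sketch runs gradient descent (Eqs. 5 and 6) on a one-dimensional squared error; the input $x$, target $d$ and learning rate $\mu$ are made-up values.

```python
# Gradient descent (Eqs. 5 and 6) on the one-dimensional squared error
# E(w) = 0.5 * (d - w*x)^2; input x, target d and mu are made up.
x, d = 2.0, 1.0
w = 0.0                    # initial weight
mu = 0.05                  # learning rate
for _ in range(100):
    e = d - w * x          # error
    dE_dw = -e * x         # gradient dE/dw
    w += -mu * dE_dw       # weight change (Eqs. 5 and 6)
print(w)                   # converges towards d/x = 0.5
```

Each step moves $w$ against the gradient, so the squared error shrinks monotonically towards its minimum.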

Learning rule for the single layer

We can now derive the learning rule which changes the weights $\omega_{ji}$. We simply insert Eqs. 2, 3 and 4 into Eq. 5:

$\displaystyle \Delta\omega_{ji} = - \mu \frac{1}{2} \frac{\partial \left( d_i(n) - y_i(n) \right)^2 }{\partial \omega_{ji}}$ (7)

$\displaystyle = - \mu \frac{1}{2} \frac{\partial \left( d_i(n) - \sum_j y_j(n) \omega_{ji} \right)^2 }{\partial \omega_{ji}}$ (8)

$\displaystyle = \mu \underbrace{\left(d_i(n) - \sum_j y_j(n) \omega_{ji}\right)}_{e_i(n)} \cdot y_j(n)$ (9)

$\displaystyle = - \mu \underbrace{\frac{\partial E}{\partial y_i}}_{-e_i(n)} \underbrace{\frac{\partial y_i}{\partial \omega_{ji}}}_{y_j(n)}$ (10)

$\displaystyle = \mu \cdot e_i(n) \cdot y_j(n)$ (11)

where $\mu \ll 1$ is the learning rate or the “step change”. The learning rule Eq. 11 is simply a multiplication of the input activity $y_j(n)$ with the error $e_i(n)$ (Widrow and Hoff, 1960).
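A minimal sketch of this rule in numpy, where the target weights `w_true` and all sizes are made up to generate the desired outputs:

```python
import numpy as np

# Delta rule (Eq. 11): delta_omega_ji = mu * e_i * y_j. The target
# weights w_true and all sizes are made up for this demonstration.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
w = np.zeros(3)                 # weights omega_ji, initially zero
mu = 0.1                        # learning rate
for _ in range(2000):
    y_in = rng.normal(size=3)   # input activities y_j(n)
    d = w_true @ y_in           # desired output d_i(n)
    y = w @ y_in                # actual output (Eq. 2)
    e = d - y                   # error (Eq. 3)
    w += mu * e * y_in          # weight change (Eq. 11)
print(w)                        # close to w_true
```

Because the update is proportional to the error, the weight changes vanish as the neuron's output approaches the desired output.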

Multi-layer network: error backpropagation and learning rule

Figure 2: Multi-layer or deep neural network
\includegraphics[width=\textwidth]{multi_layer}

Fig. 2 now shows a network with multiple layers. The forward propagation of the signals $y_j$ from the input to the output follows exactly the same recipe as for the single-layer network above, using Eq. 2. The error $e_i$ at the output layer can also simply be calculated as the difference between the desired output $d_i$ and the actual output $y_i$ (see Eq. 3). The problem is how to calculate the internal errors $e_j$ and $e_k$ and how they change the hidden weights $\omega_{kj}$ (see Eq. 10):

$\displaystyle \frac{\partial E}{\partial \omega_{kj}} = \underbrace{\frac{\partial E}{\partial y_j}}_\textrm{trick!} \frac{\partial y_j}{\partial \omega_{kj}}$ (12)

The central trick here is to express $\frac{\partial E}{\partial y_j}$ with the help of the activities $y_i$ at the output and then to identify the resulting terms with our linear sum Eq. 2 and the chain rule Eq. 10:

$\displaystyle \frac{\partial E}{\partial y_j} = \sum_i \underbrace{\frac{\partial E}{\partial y_i}}_{-e_i} \cdot \underbrace{\frac{\partial y_i}{\partial y_j}}_{\omega_{ji}}$ (13)

Eq. 13 is again substituted into Eq. 12 which gives us:

$\displaystyle \frac{\partial E}{\partial \omega_{kj}} = - \left( \sum_i e_i \omega_{ji} \right) \frac{\partial y_j}{\partial \omega_{kj}}$ (14)

with

$\displaystyle y_j = \sum_k y_k(n) \omega_{kj}$ (15)

and leads to:
$\displaystyle \frac{\partial E}{\partial \omega_{kj}} = - \left( \sum_i e_i \omega_{ji} \right) \underbrace{\frac{\partial \left(\sum_k y_k(n) \omega_{kj}\right)}{\partial \omega_{kj}}}_{y_k}$ (16)

$\displaystyle = - \left( \sum_i e_i \omega_{ji} \right) y_k$ (17)

The change of the internal weight $\omega_{kj}$ leads then to:

$\displaystyle \Delta\omega_{kj} = \mu \cdot y_k \cdot \underbrace{\sum_i e_i \omega_{ji}}_{e_j}$ (18)

where $e_j$ is the internal error, and we can now use this recipe recursively to calculate all internal errors. This is called error backpropagation, which allows us to build networks with an arbitrary number of layers (Widrow and Hoff, 1960; Widrow and Lehr, 1990; Rumelhart et al., 1986).
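The backpropagated error of Eq. 18 can be checked against a finite-difference gradient. The following sketch uses made-up sizes and random values; `E` is the squared error of Eq. 4 seen as a function of the hidden weights.

```python
import numpy as np

# Backpropagation in a linear two-layer network: the internal error
# e_j = sum_i e_i * omega_ji (Eq. 18) gives the gradient of E with
# respect to the hidden weights (Eq. 17). Sizes are illustrative and
# the result is checked against a finite-difference gradient.
rng = np.random.default_rng(1)
y_k = rng.normal(size=4)          # input activities
W_kj = rng.normal(size=(4, 3))    # hidden weights omega_kj
W_ji = rng.normal(size=(3, 2))    # output weights omega_ji
d = rng.normal(size=2)            # desired outputs

y_j = y_k @ W_kj                  # hidden activities (Eq. 15)
y_i = y_j @ W_ji                  # output activities (Eq. 2)
e_i = d - y_i                     # output errors (Eq. 3)
e_j = W_ji @ e_i                  # backpropagated internal errors
grad_kj = -np.outer(y_k, e_j)     # dE/d omega_kj (Eq. 17)

def E(Wkj):
    """Squared error (Eq. 4) as a function of the hidden weights."""
    return 0.5 * np.sum((y_k @ Wkj @ W_ji - d) ** 2)

eps = 1e-6
W_pert = W_kj.copy()
W_pert[0, 0] += eps               # perturb one hidden weight
num_grad = (E(W_pert) - E(W_kj)) / eps
print(grad_kj[0, 0], num_grad)    # the two values agree
```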

Figure 3: Non-linear neuron
\includegraphics[width=0.5\textwidth]{nonlin}

Non-linear networks

Deeper networks only make sense if the computations in the individual neurons are non-linear: any linear deep network can be reduced to an equivalent single-layer network, rendering the additional layers pointless. Consequently, we now need to introduce non-linearities. Fig. 3 shows a non-linear neuron, the central building block of any deep network, where after the linear summation we apply a non-linearity $\Theta$ which we call the activation function:

$\displaystyle y_i(n) = \Theta\left(\underbrace{\sum_j y_j(n) \omega_{ji}}_{v_i} \right)$ (19)

The crucial question is whether the introduction of the activation function changes the learning rule. Luckily, it changes very little: looking at Eq. 10, it becomes immediately clear that the chain rule is simply extended by another term:

$\displaystyle \frac{\partial E}{\partial \omega_{ji}} = \frac{\partial E}{\partial y_i} \underbrace{\frac{\partial y_i}{\partial v_i}}_{\Theta^\prime} \frac{\partial v_i}{\partial \omega_{ji}}$ (20)

where $\Theta^\prime$ is simply the derivative of the activation function $\Theta$. Consequently the weight change at the output is now calculated as:

$\displaystyle \Delta\omega_{ji} = \mu \cdot \Theta^\prime(v_i) \cdot y_j \cdot e_i$ (21)

and the internal error as:

$\displaystyle \Delta\omega_{kj} = \mu \cdot \Theta^\prime(v_j) \cdot y_k \cdot \underbrace{\sum_i e_i \omega_{ji}}_{e_j}$ (22)
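The two update rules can again be checked numerically. The following sketch uses tanh as a stand-in activation function and made-up sizes; it computes Eqs. 21 and 22 for one step and compares the implied hidden-weight gradient with a finite-difference estimate. Note that the output error is multiplied by $\Theta^\prime(v_i)$ before it is backpropagated.

```python
import numpy as np

# One backprop step for a non-linear two-layer network (Eqs. 19, 21, 22)
# with tanh as the activation function; all sizes are illustrative.
rng = np.random.default_rng(2)
theta = np.tanh
theta_prime = lambda v: 1.0 - np.tanh(v) ** 2   # derivative of tanh

y_k = rng.normal(size=4)          # input activities
W_kj = rng.normal(size=(4, 3))    # hidden weights omega_kj
W_ji = rng.normal(size=(3, 2))    # output weights omega_ji
d = rng.normal(size=2)            # desired outputs
mu = 0.01                         # learning rate

v_j = y_k @ W_kj
y_j = theta(v_j)                  # hidden activities (Eq. 19)
v_i = y_j @ W_ji
y_i = theta(v_i)                  # output activities
e_i = d - y_i                     # output errors

dW_ji = mu * np.outer(y_j, theta_prime(v_i) * e_i)    # Eq. 21
e_j = W_ji @ (theta_prime(v_i) * e_i)                 # internal errors
dW_kj = mu * np.outer(y_k, theta_prime(v_j) * e_j)    # Eq. 22

def E(Wkj):
    """Squared error as a function of the hidden weights."""
    return 0.5 * np.sum((theta(theta(y_k @ Wkj) @ W_ji) - d) ** 2)

eps = 1e-6
W_pert = W_kj.copy()
W_pert[0, 0] += eps
num_grad = (E(W_pert) - E(W_kj)) / eps
print(-dW_kj[0, 0] / mu, num_grad)   # the two values agree
```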

Strictly speaking, the activation function needs to be differentiable. However, it has turned out that the one-way rectifier (Rectified Linear Unit = ReLU) works extremely well (Fukushima, 1975):

\begin{displaymath}\Theta(v) =
\begin{cases}
0, & \text{if}\ v < 0 \\
v, & \text{otherwise}
\end{cases}\end{displaymath} (23)

Its derivative is not defined at the origin, so one needs to decide whether the derivative is zero or one there. Other popular activation functions are $\Theta(v)=\tanh(v)$ or the logistic function $\Theta(v)=\frac{1}{1+e^{-v}}$.
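A possible numpy version of the ReLU (Eq. 23) and its derivative, choosing zero at the origin:

```python
import numpy as np

# ReLU (Eq. 23) and its derivative; at v = 0 we pick the derivative
# zero, one of the two admissible choices discussed in the text.
def relu(v):
    return np.maximum(0.0, v)

def relu_prime(v):
    return np.where(v > 0, 1.0, 0.0)

v = np.array([-2.0, 0.0, 3.0])
print(relu(v))        # [0. 0. 3.]
print(relu_prime(v))  # [0. 0. 1.]
```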

References

Fukushima, K. (1975).
Cognitron: A self-organizing multilayered neural network.
Biol. Cybern., 20(3–4):121–136.

Rosenblatt, F. (1958).
The perceptron: a probabilistic model for information storage and organization in the brain.
Psychol. Rev., 65(6):386–408.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986).
Learning representations by back-propagating errors.
Nature, 323(6088):533–536.

Widrow, B. and Lehr, M. (1990).
30 years of adaptive neural networks: perceptron, madaline, and backpropagation.
Proceedings of the IEEE, 78(9):1415–1442.

Widrow, B. and Hoff, M. E. (1960).
Adaptive switching circuits.
IRE WESCON Convention Record, 4:96–104.
