Deep learning analytics
Bernd Porr
We have a dataset $\vec{x}$ which should be converted
via the function $f$

$$\vec{y} = f(\vec{x}) \qquad (1)$$

into $\vec{y}$. The goal is to learn this function with the help of
examples $(\vec{x}_1, \vec{y}_1); (\vec{x}_2, \vec{y}_2); \ldots$
The following sections show how learning of $f$ with the help
of examples can be achieved. This is called “inductive learning”.
We first introduce a linear neuron, then derive the math for a linear deep
network and finally introduce non-linearities into the deep network.
This allows us to introduce the math in a more gradual way.
Figure 1: Simple linear neuron
Fig. 1A shows a neuron in a single layer (Rosenblatt, 1958).
Its activation is $v_j$, where $j$ indexes the neurons in this layer and
$n$ is the current time step (omitted below for brevity). The input activity $x_i$ to the neuron
is weighted by $\omega_{ij}$ and summed up to:

$$v_j(n) = \sum_i \omega_{ij}\, x_i(n) \qquad (2)$$
Our function $f$ from the introduction is here a simple linear
combination of input activities where the weights $\omega_{ij}$ have to be learned
to approximate $f$.
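As a minimal sketch of Eq. 2, the weighted sum of a linear neuron is just a dot product; the weights and input activities below are illustrative, not taken from the text:

```python
import numpy as np

x = np.array([0.5, -1.0, 2.0])   # input activities (illustrative)
w = np.array([0.1, 0.4, -0.2])   # weights omega_i (illustrative)

v = np.dot(w, x)                 # Eq. 2: v = sum_i omega_i * x_i
print(v)                         # -0.75
```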
The task is to learn a function
which converts the input activities $x_i$ to output activities $y_j$
(Eq. 1).
This is achieved by minimising the error:

$$e_j = d_j - y_j \qquad (3)$$

where $d_j$ is the desired output value and $y_j$ the actual output value.
The goal is to minimise the square of the error:

$$E = e_j^2 \qquad (4)$$
The central trick here is to take the partial derivative of the squared error with respect to the weights $\omega_{ij}$.
Why does this make sense? Fig. 1B shows the relationship
between the squared error $e^2$ and a weight $\omega$. In this example, if
the weight is slightly increased the squared error also increases,
which is not desirable. However, if increasing $\omega$ reduces
the squared error then it is a good idea to keep changing the
weight in this direction. This approach is called gradient
descent.
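The gradient descent idea of Fig. 1B can be sketched for a single weight; the desired output, input activity and learning rate below are illustrative:

```python
d = 1.0    # desired output (illustrative)
x = 0.5    # input activity (illustrative)
w = 0.0    # initial weight
mu = 0.5   # learning rate (illustrative)

for _ in range(50):
    y = w * x             # Eq. 2 with a single input
    e = d - y             # Eq. 3
    # the derivative of e^2 with respect to w is -2*e*x,
    # so stepping against the gradient means adding 2*mu*e*x
    w += mu * 2 * e * x

print(w * x)  # close to the desired output d
```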
We can now derive the learning rule which changes the weights
$\omega_{ij}$ by stepping against the gradient of the squared error:

$$\Delta\omega_{ij} = -\mu \frac{\partial e_j^2}{\partial \omega_{ij}} \qquad (5)$$

where $\mu$ is the learning rate or the “step change”. We simply insert Eq. 2,
3 and 4 into Eq. 5 and apply the chain rule:

$$\frac{\partial e_j^2}{\partial \omega_{ij}} = \frac{\partial e_j^2}{\partial v_j} \frac{\partial v_j}{\partial \omega_{ij}} = \frac{\partial e_j^2}{\partial v_j}\, x_i \qquad (10)$$

With $y_j = v_j$ for the linear neuron we have $\partial e_j^2 / \partial v_j = -2 e_j$, which gives (after absorbing the factor of 2 into $\mu$):

$$\Delta\omega_{ij} = \mu\, e_j\, x_i \qquad (11)$$

The learning rule Eq. 11 is simply a multiplication of the
input activity $x_i$ with the error $e_j$ (Widrow and Hoff, 1960).
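A minimal sketch of the Widrow-Hoff rule (Eq. 11) in action; the target weights and the random example inputs are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([0.3, -0.7, 0.2])   # hypothetical linear function to be learned
w = np.zeros(3)                       # weights omega_i, starting at zero
mu = 0.1                              # learning rate

for _ in range(2000):
    x = rng.normal(size=3)            # example input activities
    d = np.dot(w_true, x)             # desired output for this example
    y = np.dot(w, x)                  # actual output (Eq. 2)
    e = d - y                         # error (Eq. 3)
    w += mu * e * x                   # learning rule (Eq. 11)

print(w)  # converges towards w_true
```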
Figure 2: Multi-layer or deep neural network
Fig. 2 shows now a network with multiple layers.
The forward progression of the signals from the input to
the output follows exactly the same recipe as for the single
layer network above, using Eq. 2.
Also the error $e_k$ at the output layer can simply be calculated
as the difference between the actual output $y_k$ and the desired output $d_k$ (see
Eq. 3). The problem is how to calculate the
internal errors $e_j$ and how they change the hidden weights $\omega_{ij}$ (see Eq. 10):

$$\Delta\omega_{ij} = -\mu \frac{\partial e^2}{\partial \omega_{ij}} = -\mu \frac{\partial e^2}{\partial v_j}\, x_i \qquad (12)$$

where $e^2 = \sum_k e_k^2$ is the summed squared error at the output.
The central trick here is to express $e^2$
with the help of the activities $v_k$
at the output and then to
identify the resulting terms with our linear sum Eq. 2
and the chain rule Eq. 10:

$$\frac{\partial e^2}{\partial v_j} = \sum_k \frac{\partial e_k^2}{\partial v_k} \frac{\partial v_k}{\partial v_j} = \sum_k \frac{\partial e_k^2}{\partial v_k}\, \omega_{jk} \qquad (13)$$

where $v_k = \sum_j \omega_{jk}\, v_j$ in the linear network.
Eq. 13 is again substituted into Eq. 12 which gives us:

$$\Delta\omega_{ij} = -\mu \sum_k \frac{\partial e_k^2}{\partial v_k}\, \omega_{jk}\, x_i \qquad (14)$$

with

$$\frac{\partial e_k^2}{\partial v_k} = -2 e_k \qquad (15)$$

and leads to:

$$\Delta\omega_{ij} = 2\mu \sum_k e_k\, \omega_{jk}\, x_i$$

The change of the internal weight $\omega_{ij}$ leads then to (again absorbing the factor of 2 into $\mu$):

$$\Delta\omega_{ij} = \mu\, e_j\, x_i \quad \textrm{with} \quad e_j = \sum_k e_k\, \omega_{jk} \qquad (18)$$
where $e_j$ is the internal error, and we can now use this recipe to
calculate all internal errors. This is called error
backpropagation, which now allows us to establish networks with an
arbitrary number of layers (Widrow and Hoff, 1960; Widrow and Lehr, 1990; Rumelhart et al., 1986).
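One forward and backward pass through the linear two-layer network can be sketched as follows; all weights, inputs and the desired output are illustrative:

```python
import numpy as np

x  = np.array([1.0, 0.5])            # input activities (illustrative)
W1 = np.array([[0.2, -0.1],
               [0.4,  0.3]])         # hidden weights omega_ij
W2 = np.array([[0.5],
               [-0.6]])              # output weights omega_jk
d  = np.array([1.0])                 # desired output
mu = 0.1                             # learning rate

# forward pass: Eq. 2 applied layer by layer (linear network)
y_hidden = x @ W1
y_out = y_hidden @ W2

# output error (Eq. 3) and backpropagated internal error (Eq. 18)
e_out = d - y_out
e_hidden = W2 @ e_out                # e_j = sum_k e_k * omega_jk

# weight changes: Eq. 11 at the output, Eq. 18 for the hidden layer
W2 += mu * np.outer(y_hidden, e_out)
W1 += mu * np.outer(x, e_hidden)
```

A second forward pass after these weight changes yields a smaller output error, as expected from gradient descent.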
Figure 3: Non-linear neuron
Deeper networks only make sense if the computations in the individual neurons
are non-linear: any linear deep network could always be
reduced to a single layer network, rendering any deeper network
pointless. Consequently, we now need to introduce non-linearities.
Fig. 3 shows a non-linear neuron which is the central building block
of any deep network, where after the linear summation we introduce a non-linearity
$\Theta$ which we call the activation function:

$$y_j = \Theta(v_j) = \Theta\left(\sum_i \omega_{ij}\, x_i\right) \qquad (19)$$

The crucial question is whether the introduction of the activation function
changes the learning rule. Luckily, not much.
Looking at Eq. 10 it becomes directly clear that the
chain rule is simply expanded by another term:

$$\frac{\partial e_j^2}{\partial \omega_{ij}} = \frac{\partial e_j^2}{\partial y_j}\, \Theta'(v_j)\, \frac{\partial v_j}{\partial \omega_{ij}} \qquad (20)$$

where $\Theta'$ is simply the derivative of the activation function $\Theta$. Consequently the
weight change at the output is now calculated as:

$$\Delta\omega_{jk} = \mu\, e_k\, \Theta'(v_k)\, y_j \qquad (21)$$

and the internal error as:

$$e_j = \Theta'(v_j) \sum_k e_k\, \omega_{jk} \qquad (22)$$
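One complete learning step of the non-linear network can be sketched as follows, here with tanh chosen as an example activation function (its derivative is $1-\tanh^2 v$); all weights, inputs and the desired output are illustrative:

```python
import numpy as np

def theta(v):
    return np.tanh(v)           # example activation function

def theta_d(v):
    return 1.0 - np.tanh(v)**2  # its derivative

x  = np.array([1.0, 0.5])            # input activities (illustrative)
W1 = np.array([[0.2, -0.1],
               [0.4,  0.3]])         # hidden weights omega_ij
W2 = np.array([[0.5],
               [-0.6]])              # output weights omega_jk
d  = np.array([0.8])                 # desired output
mu = 0.1                             # learning rate

# forward pass: Eq. 19 applied layer by layer
v1 = x @ W1
y1 = theta(v1)
v2 = y1 @ W2
y2 = theta(v2)

# output error and its version scaled by Theta'(v), as used in Eq. 21
e2 = d - y2
delta2 = e2 * theta_d(v2)
e1 = theta_d(v1) * (W2 @ delta2)     # internal error, Eq. 22

# weight changes: Eq. 21 at the output, and mu * e_j * x_i inside
W2 += mu * np.outer(y1, delta2)
W1 += mu * np.outer(x, e1)
```

Note that the internal error is computed before the output weights are changed, so that the backpropagated error refers to the weights used in the forward pass.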
Strictly, the activation function needs to be differentiable. However,
it has turned out that the one-way rectifier (Rectifying Linear Unit = ReLU)
works extremely well (Fukushima, 1975):

$$\Theta(v) = \max(0, v) \qquad (23)$$

Its derivative is not defined at the origin, so one needs
to decide whether the derivative there is zero or one.
Other popular activation functions are
$\tanh(v)$ or
the sigmoid $\sigma(v) = 1/(1+e^{-v})$.
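The ReLU (Eq. 23) and the other activation functions mentioned above can be sketched together with their derivatives; choosing zero for the ReLU derivative at the origin is one of the two options discussed above:

```python
import numpy as np

def relu(v):
    """One-way rectifier, Eq. 23."""
    return np.maximum(v, 0.0)

def relu_d(v):
    """ReLU derivative; zero is chosen at the origin."""
    return np.where(v > 0, 1.0, 0.0)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def sigmoid_d(v):
    s = sigmoid(v)
    return s * (1.0 - s)

def tanh_d(v):
    return 1.0 - np.tanh(v)**2

v = np.array([-2.0, 0.0, 3.0])
print(relu(v))    # [0. 0. 3.]
print(relu_d(v))  # [0. 0. 1.]
```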
- Fukushima, K. (1975). Cognitron: A self-organizing multilayered neural network. Biol. Cybern., 20(3–4):121–136.
- Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychol. Rev., 65(6):386–408.
- Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088):533–536.
- Widrow, B. and Lehr, M. (1990). 30 years of adaptive neural networks: perceptron, madaline, and backpropagation. Proceedings of the IEEE, 78(9):1415–1442.
- Widrow, B. and Hoff, M. (1960). Adaptive switching circuits. IRE WESCON Convention Record, 4:96–104.