Fundamentals of Activation Functions

Whenever we build deep neural networks, we have to choose a suitable model architecture, and part of that choice is picking suitable activation functions.

Mathematically, a neuron looks like this: \(z = w^{T} X + b\), followed by some activation function that “determines” whether the neuron “fires”, giving the output \(\hat{y} = a(z) = a(w^{T} X + b)\). This is the forward propagation/calculation. However, we want to train our neural network, so we have to evaluate the resulting predictions with some error function and propagate backwards through the network to update the weights \(w\) and bias \(b\). For this we calculate the derivatives.
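A minimal NumPy sketch of this forward and backward pass for a single neuron might look as follows; the sigmoid activation, the squared-error loss and all variable names are illustrative assumptions, not taken from the text above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))        # 3 features, 4 samples (one sample per column)
y = rng.normal(size=(1, 4))
w = rng.normal(size=(3, 1))
b = 0.0

# Forward pass: z = w^T X + b, y_hat = a(z)
z = w.T @ X + b
y_hat = sigmoid(z)

# Backward pass for a squared-error loss L = 0.5 * (y_hat - y)^2,
# via the chain rule: dL/dw = dL/dy_hat * dy_hat/dz * dz/dw
dL_dz = (y_hat - y) * y_hat * (1.0 - y_hat)   # sigmoid derivative: a(z) * (1 - a(z))
dL_dw = X @ dL_dz.T / X.shape[1]              # averaged over the samples
dL_db = dL_dz.mean()

# Gradient-descent update of w and b
lr = 0.1
w -= lr * dL_dw
b -= lr * dL_db
```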

This leads us to two major problems with training neural networks:

  1. The problem of vanishing (diminishing) gradients. This describes a phenomenon caused by gradients that become too small during backpropagation. The deeper the network, the less of the gradient reaches the first layers to update (train) their weights.

  2. The problem of exploding gradients. In this case the gradients can cause the weights to become too large and hence produce NaN (not a number) errors.

Understanding activation functions and choosing an appropriate one can help us minimize these two problems. Other approaches are proper normalization, weight regularization, gradient clipping and, of course, changing the model architecture.
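To make the two problems concrete, here is a small, hedged toy example (the layer count and weights are made up for illustration): in a deep chain of sigmoid units the backpropagated gradient is roughly a product of local derivatives, each of which is at most 0.25, so it shrinks exponentially; with large weights the product can instead blow up until it overflows.

```python
import numpy as np

def dsigmoid(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

# Vanishing: 20 single-unit sigmoid layers with pre-activations at z = 0,
# where the local derivative is at its maximum of 0.25.
print(np.prod(dsigmoid(np.zeros(20))))   # 0.25 ** 20 ~ 9.1e-13

# Exploding: repeatedly multiplying by local factors > 1 overflows to inf,
# which can then turn into NaN in subsequent updates.
print(np.prod(np.full(700, 3.0)))        # inf
```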

Activation Functions

This is an incomplete list of activation functions, with a focus on a visual understanding of the forward function as well as the derivative used for backpropagation. Wikipedia has a more comprehensive list, with a stronger focus on the mathematics, of probably all activation functions; it also includes the ranges and orders of continuity of these functions.

Sigmoid

In general, a sigmoid function is a function whose plot results in an S-shaped curve. We can find many examples of S-shaped functions in the Wikipedia article on “Sigmoid function”. However, in machine learning the logistic function is usually what is meant by the sigmoid function. It is defined as follows:

\[f(x) = \sigma(x) = \frac{1}{1 + e^{-x}}\] \[\frac{d}{dx}f(x) = \frac{d}{dx}\sigma(x) = \frac{e^{-x}}{(1 + e^{-x})^{2}}\]
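A minimal sketch of the sigmoid and its derivative in NumPy (the function names are my own); note that the derivative above can be rewritten as \(\sigma(x)(1 - \sigma(x))\), which is how it is usually implemented.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    # e^{-x} / (1 + e^{-x})^2  ==  sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.linspace(-6, 6, 5)
print(sigmoid(x))        # values in (0, 1)
print(sigmoid_prime(x))  # maximum of 0.25 at x = 0
```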

SoftPlus

A nice property of the SoftPlus function is that its derivative is the sigmoid function.

\[f(x) = \ln(1 + e^{x})\] \[\frac{d}{dx}f(x) = \frac{1}{1 + e^{-x}}\]
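A quick, hedged numerical check of that claim (names are illustrative): a finite-difference approximation of the SoftPlus derivative should match the sigmoid.

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))          # ln(1 + e^x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-4, 4, 9)
h = 1e-6
numeric = (softplus(x + h) - softplus(x - h)) / (2 * h)
print(np.allclose(numeric, sigmoid(x), atol=1e-5))   # True
```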

Softsign

\[f(x) = \frac{x}{1 + |x|}\] \[\frac{d}{dx}f(x) = \frac{1}{(1 + |x|)^{2}}\]
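A short Softsign sketch (illustrative names, not from the text): unlike tanh it saturates towards -1 and 1 only polynomially.

```python
import numpy as np

def softsign(x):
    return x / (1.0 + np.abs(x))

def softsign_prime(x):
    return 1.0 / (1.0 + np.abs(x)) ** 2

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(softsign(x))        # approaches -1 and 1 only slowly
print(softsign_prime(x))  # decays like 1 / (1 + |x|)^2
```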

tanh

\[f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \ \ \ \left(= \frac{\sinh(x)}{\cosh(x)} \right) \ \ \ \left(= \frac{2}{1 + e^{-2x}} - 1 \right)\] \[\frac{d}{dx}f(x) = \frac{d}{dx} \tanh(x) = \frac{4}{(e^{-x} + e^{x})^{2}}\]
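A small tanh sketch using NumPy's built-in implementation; the derivative given above is equal to \(1 - \tanh^{2}(x)\), which is the form used below.

```python
import numpy as np

def tanh_prime(x):
    # 4 / (e^{-x} + e^{x})^2  ==  1 - tanh(x)^2
    return 1.0 - np.tanh(x) ** 2

x = np.linspace(-3, 3, 7)
print(np.tanh(x))     # zero-centred output in (-1, 1)
print(tanh_prime(x))  # maximum of 1 at x = 0
```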

ReLU

\[f(x) = \begin{cases} x, & \text{if } x \geq 0 \\ 0, & \text{if } x < 0 \end{cases} \ (= \max(x,0))\] \[\frac{d}{dx}f(x) = \begin{cases} 1, & \text{if } x \geq 0 \\ 0, & \text{if } x < 0 \end{cases}\]

NB! The plot shows an approximation of the step function (\(\frac{d}{dx}\)) and is therefore inaccurate between -0.01 and 0.

Another approach is to concatenate ReLUs; the result is called CReLU, which stands for Concatenated ReLU and was introduced by Shang et al. (2016). A sketch of both is given below.
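A hedged sketch of ReLU, its derivative as given above (taking the derivative at \(x = 0\) to be 1), and CReLU as the concatenation \([\mathrm{ReLU}(x), \mathrm{ReLU}(-x)]\):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_prime(x):
    # Subgradient convention: the derivative at x = 0 is taken as 1, as in the text.
    return np.where(x >= 0, 1.0, 0.0)

def crelu(x, axis=-1):
    # Concatenated ReLU (Shang et al. 2016): [ReLU(x), ReLU(-x)],
    # doubling the feature dimension.
    return np.concatenate([relu(x), relu(-x)], axis=axis)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x), relu_prime(x))
print(crelu(x))
```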

ReLU6

ReLU-n is a ReLU capped at n and was introduced by Krizhevsky (2010); ReLU6 is the special case n = 6. ReLU-n limits the maximum output value to n: \( \min(\mathrm{ReLU}(x), n) \). It seems to shift learning sparse features earlier (https://news.ycombinator.com/item?id=16540743).
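A one-liner sketch of the capped variant (the function name is mine):

```python
import numpy as np

def relu_n(x, n=6.0):
    # ReLU capped at n: min(max(x, 0), n); n = 6 gives ReLU6.
    return np.minimum(np.maximum(x, 0.0), n)

x = np.array([-3.0, 0.0, 3.0, 6.0, 9.0])
print(relu_n(x))   # [0. 0. 3. 6. 6.]
```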

Leaky ReLU

\[f(x) = \begin{cases} x, & \text{if } x \geq 0 \\ 0.01x, & \text{if } x < 0 \end{cases}\] \[\frac{d}{dx}f(x) = \begin{cases} 1, & \text{if } x \geq 0 \\ 0.01, & \text{if } x < 0 \end{cases}\]

NB! The plot shows an approximation of the step function (\(\frac{d}{dx}\)) and is therefore inaccurate between -0.01 and 0.
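A Leaky ReLU sketch with the fixed slope of 0.01 from the definition above (the parameter name alpha is my choice):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x >= 0, x, alpha * x)

def leaky_relu_prime(x, alpha=0.01):
    return np.where(x >= 0, 1.0, alpha)

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(leaky_relu(x))        # small negative slope instead of a hard zero
print(leaky_relu_prime(x))  # gradient is 0.01 instead of 0 for x < 0
```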

ELU

ELU stands for “Exponential Linear Unit” and was introduced by Clevert et al. (2015). It gained popularity due to NVIDIA's self-driving car paper in 2016.
The ELU function is defined as:

\[f(x) = \begin{cases} x, & \text{if } x > 0 \\ \alpha (e^{x} -1), & \text{if } x \leq 0 \end{cases}\] \[\frac{d}{dx}f(x) = \begin{cases} 1, & \text{if } x > 0 \\ \alpha e^{x} \ (= f(x) + \alpha), & \text{if } x \leq 0 \end{cases}\]

with \( \alpha > 0 \)
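An ELU sketch with the default \(\alpha = 1\) (an assumption on my part; any \(\alpha > 0\) works):

```python
import numpy as np

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def elu_prime(x, alpha=1.0):
    # For x <= 0 the derivative is alpha * e^x, i.e. f(x) + alpha.
    return np.where(x > 0, 1.0, alpha * np.exp(x))

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(elu(x))        # saturates towards -alpha for very negative inputs
print(elu_prime(x))  # the derivative stays positive everywhere
```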

SELU

SELU stands for Scaled Exponential Linear Unit and is a slightly modified version of ELU introduced by Klambauer et al. (2017).

\[f(x) = \lambda \begin{cases} \alpha(e^{x}-1), & \text{if } x < 0 \\ x, & \text{if } x \geq 0 \end{cases}\] \[\frac{d}{dx}f(x) = \lambda \begin{cases} \alpha e^{x}, & \text{if } x < 0 \\ 1, & \text{if } x \geq 0 \end{cases}\]

NB! The plot shows an approximation of the step function (\(\frac{d}{dx}\)) and is therefore inaccurate between -0.01 and 0.
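A SELU sketch using the constants \(\alpha \approx 1.6733\) and \(\lambda \approx 1.0507\) published by Klambauer et al. (2017):

```python
import numpy as np

# Constants from Klambauer et al. (2017) for self-normalizing networks.
SELU_ALPHA = 1.6732632423543772
SELU_LAMBDA = 1.0507009873554805

def selu(x, alpha=SELU_ALPHA, lam=SELU_LAMBDA):
    return lam * np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

def selu_prime(x, alpha=SELU_ALPHA, lam=SELU_LAMBDA):
    return lam * np.where(x >= 0, 1.0, alpha * np.exp(x))

x = np.array([-2.0, 0.0, 2.0])
print(selu(x), selu_prime(x))
```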

Swish

Swish originates from work by Ramachandran et al. (2017) and aims to replace ReLU. A notable property is that the \(\beta\) scaling factor can be trained as well.

\[f(x,\beta) = \text{Swish}(x,\beta) = \frac{x}{1 + e^{-x\beta}}\] \[\frac{d}{dx}f(x,\beta) = \frac{d}{dx}\text{Swish}(x,\beta) = \frac{1}{1 + e^{-x\beta}} + \frac{\beta xe^{-x\beta}}{(e^{-\beta x} + 1)^{2}}\]

If \(\beta = 1\), we end up with the Sigmoid-weighted Linear Unit (SiLU) by Elfwing et al. (2017).
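A Swish sketch with a fixed \(\beta\) (in practice \(\beta\) may also be a trainable parameter); \(\beta = 1\) gives the SiLU:

```python
import numpy as np

def swish(x, beta=1.0):
    return x / (1.0 + np.exp(-beta * x))

def swish_prime(x, beta=1.0):
    sig = 1.0 / (1.0 + np.exp(-beta * x))
    # Equivalent to the derivative above:
    # sigma(beta*x) + beta*x*sigma(beta*x)*(1 - sigma(beta*x))
    return sig + beta * x * sig * (1.0 - sig)

x = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(swish(x))         # beta = 1 corresponds to the SiLU
print(swish_prime(x))
```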

Mish

Mish is a fairly new activation function, introduced by Misra (2019), that also aims to replace ReLU.

\[f(x) = \text{Mish}(x) = x \tanh\left(\ln\left(\mathrm{e}^x+1\right)\right)\] \[\frac{d}{dx}f(x) = \frac{d}{dx}\text{Mish}(x) = \tanh\left(\ln\left(\mathrm{e}^x+1\right)\right)+\dfrac{x\mathrm{e}^x\operatorname{sech}^2\left(\ln\left(\mathrm{e}^x+1\right)\right)}{\mathrm{e}^x+1}\]
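A Mish sketch (names are mine), together with a finite-difference check of the derivative given above:

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def mish(x):
    return x * np.tanh(softplus(x))

def mish_prime(x):
    sp = softplus(x)
    sech2 = 1.0 / np.cosh(sp) ** 2   # sech^2(ln(e^x + 1))
    return np.tanh(sp) + x * np.exp(x) * sech2 / (np.exp(x) + 1.0)

x = np.linspace(-4, 4, 9)
h = 1e-6
print(np.allclose((mish(x + h) - mish(x - h)) / (2 * h), mish_prime(x), atol=1e-5))  # True
```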

GELU

Have a look at Hendrycks and Gimpel (2018): Gaussian Error Linear Units (GELUs). arXiv:1606.08415
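A GELU sketch: the exact form is \(x \, \Phi(x)\) with the standard normal CDF \(\Phi\); the tanh-based approximation below is the one given by Hendrycks and Gimpel (both written from memory, so treat this as a sketch rather than a reference implementation).

```python
import math

def gelu(x):
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh_approx(x):
    # Tanh approximation from Hendrycks and Gimpel (2018).
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

for v in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(v, round(gelu(v), 4), round(gelu_tanh_approx(v), 4))
```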

Comparison of Activation Functions

It is wise to choose an activation function whose derivative is not close to zero around zero. This is one reason why ReLU is very popular. However, it may be wiser to use Leaky ReLU to preserve some gradient once the input value is smaller than 0, as the small comparison below illustrates.
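A hedged comparison of the derivatives at a few points (the input values are chosen for illustration): the sigmoid gradient is at most 0.25, ReLU passes the gradient through unchanged for positive inputs but blocks it completely for negative ones, and Leaky ReLU keeps a small gradient there.

```python
import numpy as np

def sigmoid_prime(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

def relu_prime(x):
    return np.where(x >= 0, 1.0, 0.0)

def leaky_relu_prime(x, alpha=0.01):
    return np.where(x >= 0, 1.0, alpha)

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print("sigmoid'   ", sigmoid_prime(x))     # at most 0.25, even at x = 0
print("ReLU'      ", relu_prime(x))        # exactly 0 for x < 0
print("LeakyReLU' ", leaky_relu_prime(x))  # keeps a small gradient for x < 0
```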

Select the activation functions you want to compare. The upper graph shows the forward propagation, the lower graph the backward propagation (the derivative). Plots of single activation functions can be displayed or turned off by clicking on the function name in the legend.

References

  • Elfwing, S.; Uchibe, E. & Doya, K. (2017): Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning. arXiv:1702.03118

  • Klambauer, G.; Unterthiner, T.; Mayr, A. & Hochreiter, S. (2017): Self-Normalizing Neural Networks. In: Advances in Neural Information Processing Systems (NIPS). Preprint available at: https://arxiv.org/abs/1706.02515.

  • Krizhevsky, A. (2010): Convolutional Deep Belief Networks on CIFAR-10. Online available at: https://www.cs.utoronto.ca/~kriz/conv-cifar10-aug2010.pdf.

  • Misra, D. (2019): Mish: A Self Regularized Non-Monotonic Neural Activation Function. arXiv:1908.08681

  • Ramachandran, P.; Zoph, B. & Le, Q.V. (2017): Searching for Activation Functions. arXiv:1710.05941

  • Shang, W.; Sohn, K.; Almeida, D. & Lee, H. (2016): Understanding and Improving Convolutional Neural Networks using Concatenated Rectified Linear Units. Preprint available at: https://arxiv.org/abs/1603.05201.