Neural Network: Activation Functions

Sigmoid and Derivative

1) Sigmoid:

It is also called the logistic activation function.

f(x) = 1/(1 + exp(-x)); the function's range is (0, 1).

Derivative of sigmoid:

Just apply the simple u/v (quotient) rule, i.e. d(u/v) = (v*du - u*dv)/v², with u = 1 and v = 1 + exp(-x):

df(x) = ((1+exp(-x))*d(1) - 1*d(1+exp(-x))) / (1+exp(-x))²

d(1) = 0,

d(1+exp(-x)) = d(1) + d(exp(-x)) = -exp(-x), so

df(x) = exp(-x)/(1+exp(-x))²

df(x) = (1/(1+exp(-x))) * (1 - 1/(1+exp(-x))), since exp(-x)/(1+exp(-x)) = 1 - 1/(1+exp(-x))

df(x) = f(x)*(1-f(x))
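To verify the closed form above, here is a minimal NumPy sketch (the function names sigmoid and sigmoid_derivative are my own, not from the original notebook):

import numpy as np

def sigmoid(x):
    # f(x) = 1/(1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # df(x) = f(x)*(1 - f(x))
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-5.0, 0.0, 5.0])
print(sigmoid(x))             # ~[0.0067, 0.5, 0.9933]
print(sigmoid_derivative(x))  # ~[0.0066, 0.25, 0.0066], near zero at the tails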


Observations:

(i) The sigmoid function's output values lie between 0 and 1.

(ii) The output is not Zero-Centered

(iii) Sigmoids saturate and kill gradients.

(iv) At the top and bottom of the sigmoid curve the function changes slowly; if you calculate the slope (gradient) there, it is close to zero, as shown in the derivative curve above.

Problem with sigmoid:

Due to this, when the x value is very small or very large the slope is nearly zero, so the gradients vanish and there is no learning.

When to use Sigmoid:

(i) If you want an output value between 0 and 1, use sigmoid at the output-layer neuron only.

(ii) Use sigmoid when you are working on a binary classification problem.

Otherwise, sigmoid is not preferred.

Tanh and Derivative

The tanh function is just another possible function that can be used as a nonlinear activation between the layers of a neural network. It shares a few things in common with the sigmoid activation function, and the two curves look very similar. But while a sigmoid function maps input values to be between 0 and 1, tanh maps values to be between -1 and 1.

tanh(z) = (e^z - e^(-z))/(e^z + e^(-z))

Derivative of tanh(z):

a = (e^z - e^(-z))/(e^z + e^(-z))

Use the same u/v rule, noting that d(e^z - e^(-z)) = e^z + e^(-z) and d(e^z + e^(-z)) = e^z - e^(-z):

da = ((e^z + e^(-z))*d(e^z - e^(-z)) - (e^z - e^(-z))*d(e^z + e^(-z))) / (e^z + e^(-z))²

da = ((e^z + e^(-z))*(e^z + e^(-z)) - (e^z - e^(-z))*(e^z - e^(-z))) / (e^z + e^(-z))²

da = ((e^z + e^(-z))² - (e^z - e^(-z))²) / (e^z + e^(-z))²

da = 1 - ((e^z - e^(-z))/(e^z + e^(-z)))²

da = 1 - a²
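A similar minimal sketch for tanh and its derivative, using NumPy's built-in np.tanh (function names are my own):

import numpy as np

def tanh(z):
    # (e^z - e^(-z))/(e^z + e^(-z)); np.tanh computes the same thing
    return np.tanh(z)

def tanh_derivative(z):
    # da = 1 - a²
    a = np.tanh(z)
    return 1.0 - a**2

z = np.array([-3.0, 0.0, 3.0])
print(tanh(z))             # ~[-0.995, 0.0, 0.995]
print(tanh_derivative(z))  # ~[0.0099, 1.0, 0.0099], again near zero at the tails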


Observations:

(i) Its output is zero-centered because its range is between -1 and 1, i.e. -1 < output < 1.

(ii) Optimization is therefore easier, so in practice it is generally preferred over the sigmoid function.

But it still suffers from the vanishing gradient problem.

When to use Tanh:

Tanh is usually used in the hidden layers of a neural network. Because its values lie between -1 and 1, the mean of the hidden-layer activations comes out to be 0 or very close to it, which helps center the data. This makes learning for the next layer much easier.

ReLu Activation Function

Equation :- A(x) = max(0,x). It gives an output x if x is positive and 0 otherwise.
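A minimal sketch of ReLu and its gradient (function names are my own; the gradient at exactly x = 0 is taken as 0 by convention here):

import numpy as np

def relu(x):
    # A(x) = max(0, x)
    return np.maximum(0.0, x)

def relu_derivative(x):
    # 1 for x > 0, 0 otherwise
    return (x > 0).astype(float)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))             # [0. 0. 3.]
print(relu_derivative(x))  # [0. 0. 1.]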


Value Range :- [0, inf)

Nature :- non-linear, which means we can easily backpropagate the errors and have multiple layers of neurons being activated by the ReLU function.

Uses :- ReLu is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations. At any given time only a few neurons are activated, which makes the network sparse and therefore efficient and easy to compute.

It avoids the vanishing gradient problem, since the gradient does not saturate for positive inputs. Almost all deep learning models use ReLu nowadays.

But its limitation is that it should only be used within the hidden layers of a neural network model.

Another problem with ReLu is that some gradients can be fragile during training and can die: a weight update can push a neuron into a region where it never activates on any data point again. Put simply, ReLu can result in dead neurons.

To fix the problem of dying neurons, another modification called Leaky ReLu was introduced. It gives negative inputs a small slope to keep the updates alive.

Another variant, built from both ReLu and Leaky ReLu, is called the Maxout function.

Leaky ReLu Activation Function
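Leaky ReLu returns x for positive inputs and a small multiple of x for negative inputs. A minimal sketch, assuming the commonly used slope alpha = 0.01 (the slope used in the original plot is not specified here):

import numpy as np

def leaky_relu(x, alpha=0.01):
    # x for x > 0, alpha*x otherwise; the small negative slope keeps gradients alive
    return np.where(x > 0, x, alpha * x)

def leaky_relu_derivative(x, alpha=0.01):
    # 1 for x > 0, alpha otherwise
    return np.where(x > 0, 1.0, alpha)

x = np.array([-2.0, 0.0, 3.0])
print(leaky_relu(x))             # [-0.02  0.    3.  ]
print(leaky_relu_derivative(x))  # [0.01 0.01 1.  ]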


Softmax Function

Softmax turns arbitrary real values into probabilities, which are often useful in Machine Learning. The math behind it is pretty simple: given some numbers, raise e to the power of each of them, then divide each result by the sum of all those exponentials.

The outputs of the Softmax transform are always in the range (0, 1) and add up to 1. Hence, they form a probability distribution.

f(xs) = np.exp(xs) / sum(np.exp(xs))
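As a sketch, the same formula with the usual max-subtraction trick for numerical stability (subtracting the maximum does not change the result because the shift cancels in the ratio):

import numpy as np

def softmax(xs):
    # np.exp(xs) / sum(np.exp(xs)), with xs shifted by max(xs) for numerical stability
    exps = np.exp(xs - np.max(xs))
    return exps / np.sum(exps)

xs = np.array([-1.0, 0.0, 3.0, 5.0])
probs = softmax(xs)
print(probs)        # each value lies in (0, 1)
print(probs.sum())  # 1.0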
