Activation Functions — Core of Neural Networks Explained

Ruman
13 min readApr 13, 2024


In this article we’ll explore the crucial role of activation functions as the core building blocks of artificial neural networks. We’ll look at their functionality and significance within neural networks.

Photo by Google DeepMind

Outline

  • Activation Functions and Biological Neurons
  • How does Information Propagate in a Neural Network
  • Activation Functions, and why do we Need them
  • Linear Activation Functions
  • Non-linear Activation Functions
  • Sigmoid, Tanh, ReLU, Leaky ReLU, and Softmax
  • Conclusion

⚠️ Feel free to skip directly to “Activation Functions, and why do we Need them”. The next two sections only provide context to better understand Activation Functions.

Activation Functions and Biological Neurons

Diagram of neuron and synapse. Information transfer occurs at the synapse, a junction between the axon terminal of the current neuron and the dendrite of the next neuron.
Image Source : Researchgate

The inspiration for the deep neural networks we use today in LLMs, computer vision, and others comes from our understanding of the human brain and its fundamental building blocks — biological neurons.

In the human brain, there are billions of neurons connected to each other. Each neuron receives and processes information from other neurons through electrical and chemical (electrochemical) signals transmitted across synapses. Before passing information to the next neuron, the receiving neuron decides whether the incoming signal is strong enough to trigger its own activation. This process of a neuron “firing” or becoming activated is a key part of how information is processed and transmitted in the brain.

Network of interconnected nodes (or artificial neurons) passing information to each other.

Similarly, in artificial neural networks, we have a network of interconnected nodes (or artificial neurons) that pass information to each other. Within each artificial neuron, there is an activation function that determines whether the input signal is strong enough to trigger the neuron’s activation and allow the information to be passed on to the next layer of the network.

These activation functions are crucial to the performance of the entire artificial neural network, as they control which signals are amplified and propagated through the network.

Before jumping directly to Activation Functions, let’s take a moment to understand how information flows through Artificial Neural Networks.

How does Information Propagate in a Neural Network

Image Source : researchgate.net

As the name suggests, a Neural Network is a network of interconnected artificial neurons (also known as nodes) working in conjunction with each other to solve complex problems.

Each of these connections in the neural network has a weight (w_i) associated with it, and each of these neurons has a “Bias (b)” component along with its own activation function.

A neural network typically has three main components: the Input layer (which receives the input data), the Hidden layers (the intermediate layers), and the Output layer (the final layer).

There are two ways in which information propagates through these layers in a neural network:

During Forward Pass:

  • The input data is fed into the input layer of the neural network.
  • The data is then propagated through the hidden layers, where the inputs are multiplied by the layer’s weights, the bias is added, and a non-linear activation function is applied to the result.
  • This transformation process continues through the hidden layers, creating a higher-level representation of the input data.
  • Finally, the transformed data from the last hidden layer is passed to the output layer, where a final transformation is applied (e.g., Softmax for classification) to produce the network’s output.
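The forward pass steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, not a production implementation: the layer sizes, the toy weights and biases, and the helper names (`dense`, `forward`) are all assumptions made for the example, with ReLU in the hidden layer and Softmax at the output.

```python
import math

def relu(z):
    # Hidden-layer activation: keep positive values, zero out the rest
    return max(0.0, z)

def softmax(scores):
    # Subtract the max score before exponentiating for numerical stability
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dense(inputs, weights, biases):
    # One weighted sum plus bias per neuron; weights[j] holds neuron j's weights
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def forward(x, hidden_w, hidden_b, out_w, out_b):
    # Input -> hidden layer (weighted sum, bias, ReLU) -> output layer (Softmax)
    hidden = [relu(z) for z in dense(x, hidden_w, hidden_b)]
    return softmax(dense(hidden, out_w, out_b))

# Hypothetical toy parameters: 2 inputs, 3 hidden neurons, 2 output classes
x = [1.0, 2.0]
hidden_w = [[0.5, -0.25], [-1.0, 0.5], [0.25, 0.25]]
hidden_b = [0.0, 0.1, -0.1]
out_w = [[1.0, -1.0, 0.5], [-0.5, 0.5, 1.0]]
out_b = [0.0, 0.0]

probs = forward(x, hidden_w, hidden_b, out_w, out_b)
print(probs)  # two class probabilities that sum to 1
```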

During Backward Pass (Backpropagation):

  • The error between the predicted output and the desired output is calculated at the output layer.
  • This error signal is then propagated backwards through the network, from the output layer to the hidden layers.
  • At each hidden layer, the error signal is used to compute the gradients of the error with respect to that layer’s weights and biases.
  • These gradients are then used to update the weights and biases of the layer, in order to minimize the overall error.
  • The process continues until the gradients reach the input layer, allowing the network to update all of its parameters to improve its performance.
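To make the backward pass concrete, here is a minimal sketch for the smallest possible case: a single sigmoid neuron trained with a squared error. The learning rate, the starting values, and the function name `train_step` are assumptions chosen for illustration; a real network repeats the same chain-rule step layer by layer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(w, b, x, target, lr=0.5):
    # Forward pass: weighted sum, activation, error
    z = w * x + b
    y = sigmoid(z)
    error = 0.5 * (y - target) ** 2

    # Backward pass: chain rule from the error back to the parameters
    d_error_d_y = y - target          # derivative of the squared error
    d_y_d_z = y * (1.0 - y)           # derivative of the sigmoid
    d_error_d_z = d_error_d_y * d_y_d_z
    grad_w = d_error_d_z * x          # gradient of the error w.r.t. the weight
    grad_b = d_error_d_z              # gradient of the error w.r.t. the bias

    # Gradient descent update
    return w - lr * grad_w, b - lr * grad_b, error

w, b = 0.1, 0.0  # hypothetical starting parameters
errors = []
for _ in range(50):
    w, b, err = train_step(w, b, x=1.5, target=1.0)
    errors.append(err)

print(errors[0], errors[-1])  # the error shrinks as the parameters are updated
```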

Now let’s have a look at Activation Functions.

Activation Functions, and why do we Need them

Photo by Bret Kavanaugh on Unsplash

Artificial neurons form the fundamental building blocks of deep neural networks, enabling them to perform complex tasks. At the core of each neuron, there is an activation function that applies a mathematical transformation to the data received by the neuron, and then passes the transformed data on to the next neurons in the network.

This is why activation functions are also known as transfer functions. They translate the input signal of a neuron into an output signal, which is then propagated through the network. The choice of activation function can significantly impact the performance and capabilities of a deep neural network.

Let’s look at the image below:

A neuron receiving input from multiple sources, computing the weighted sum of these inputs before transmitting the result to an activation function. Image source

When a signal is received at the input of a neuron, the neuron performs a series of operations on that input. First, the neuron takes the weights associated with the connections from the input sources and multiplies them with the corresponding input data. This results in a weighted sum of the inputs.

Next, the neuron adds a bias term to this weighted sum. The mathematical formulation for this can be written as:

Weighted sum of inputs with a Bias term; Image by author

Where “x_i” is the data received at the neuron’s input, “w_i” is the weight of the connection between that input and the neuron, and “b” is the bias.

This weighted sum with the bias term represents the combined information that the neuron receives from its inputs. The neuron then applies its activation function to this value to produce the final output, which is passed on to the next layer of the neural network. The mathematical formulation for this is:

Image by author

The f() term in the above equation represents the activation function used by the neuron. This activation function is an important component that we can configure and modify to control how the information is processed and passed on to the next neurons in the network.
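Written out, the two steps are neuron_info = w_1*x_1 + ... + w_n*x_n + b, followed by output = f(neuron_info). A minimal Python sketch of a single neuron, with made-up inputs, weights, and bias (the function name `neuron_output` is an assumption for this example):

```python
import math

def neuron_output(inputs, weights, bias, activation):
    # Step 1: weighted sum of the inputs plus the bias term
    neuron_info = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Step 2: apply the activation function f() to that value
    return activation(neuron_info)

# Hypothetical values, chosen only for illustration
inputs = [0.5, -1.0, 2.0]
weights = [0.4, 0.3, -0.2]
bias = 0.1

identity_out = neuron_output(inputs, weights, bias, lambda z: z)
relu_out = neuron_output(inputs, weights, bias, lambda z: max(0.0, z))
tanh_out = neuron_output(inputs, weights, bias, math.tanh)
print(identity_out, relu_out, tanh_out)
```

Swapping the `activation` argument is all it takes to change how the same weighted sum is passed on to the next layer.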

There are various activation functions available, each with its own unique characteristics and intended use cases. We will explore some of these common activation functions in more detail later in this article.

What will happen if we don’t have the activation function in a neural network?

The activation function is a critical component in neural networks, and removing it would make the neural network a linear model, significantly impacting its ability to learn complex patterns and make accurate predictions.

In simpler terms—

removing the Activation Function would result in the Neural Network losing its ability to learn complex patterns.

But how exactly 😕 ? Let’s see.

These models, whether simple machine learning models or state-of-the-art deep learning models, are essentially functions that take inputs and produce predictions. Their primary task is to learn patterns from data and make accurate predictions during inference.

During training, these models need to learn patterns from the data, which can range from simple to highly complex. Let’s consider an example for better understanding:

Image Source : https://www.statistics4u.info/fundstat_eng/cc_linvsnonlin.html

In the first and second graphs, where the data is simpler and easily separable, a simple linear model (a linear function represented by a line) would be sufficient. However, in the third graph, where the data is more complex, a linear function (a line) would not be enough. To solve this, we would need a non-linear model, which is essentially a non-linear function.

Applying the same reasoning to neural networks: if we don’t use an activation function, the network becomes linear. Without the activation function, each neuron simply passes its weighted sum (plus bias) on to the next neurons, without any non-linear transformation.

Let’s look at the following mathematical formulation:

Weighted sum of inputs with a Bias term; Image by author

If there’s no activation function, then “neuron_info” (the weighted sum of the neuron’s inputs plus the bias) is passed directly as the output. This formulation is linear, and a linear function of a linear function is still linear. So no matter how many layers we stack, the entire network collapses into a single linear transformation of its input, and it can only learn and represent linear relationships in the data.
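A small sketch makes this collapse visible. It uses one-input, one-output layers with made-up weights so the algebra stays readable: w2*(w1*x + b1) + b2 is the same as (w2*w1)*x + (w2*b1 + b2). The function names are assumptions for this example.

```python
def linear(x, w, b):
    return w * x + b

# Hypothetical parameters for two stacked layers
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5

def two_layers_no_activation(x):
    # Layer 2 applied directly to the output of layer 1
    return linear(linear(x, w1, b1), w2, b2)

# The same mapping expressed as ONE linear layer
w_combined = w2 * w1
b_combined = w2 * b1 + b2

def single_layer(x):
    return linear(x, w_combined, b_combined)

def two_layers_with_relu(x):
    # Inserting a non-linearity between the layers breaks the collapse
    return linear(max(0.0, linear(x, w1, b1)), w2, b2)

for x in [-2.0, -0.5, 0.0, 1.0, 3.0]:
    print(x, two_layers_no_activation(x), single_layer(x), two_layers_with_relu(x))
```

The first two columns always agree, while the ReLU version differs wherever the first layer’s output is negative.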

I hope it now makes sense why we need Activation Functions.

Now, let’s look at Linear and Non-linear activation functions.

Linear Activation Functions

Identity (Linear) Activation function

This is a linear activation function that passes its input on to the next neurons in the network unchanged.

Mathematical formulation :

Image by Author

Here, the function f() is the identity activation function, which takes the inputs x and returns the same value as the output, without applying any additional non-linear operation.

Image by Author

The identity activation function has a few key use cases in neural networks:

  • Regression Tasks: When training for a regression task, the identity activation function is often used in the output layer. This allows the network to produce continuous, numerical outputs without introducing any non-linear distortions.
  • Extracting Embeddings: If we have a trained model, the activations from the last few layers, where the identity activation function is applied, can be used as the embeddings for the input data.
  • And more

It’s important to note that while the identity activation function can be useful in specific scenarios, it is generally not the default choice for most hidden layers in a neural network.

Non-linear Activation Functions

Photo by Ryan Hutton on Unsplash

As we discussed earlier, removing the activation function would make the entire neural network a linear model, severely limiting its ability to learn complex patterns and relationships in the data. This is because the core strength of neural networks lies in their capacity to capture and represent non-linear information.

To maintain this non-linearity, neural networks employ a variety of non-linear activation functions, with the exception of a few specialized cases where linear activation functions (such as the identity function) may be used, like in the output layer of a regression model.

Some of the commonly used non-linear activation functions in neural networks include:

Sigmoidal Activation Function

The sigmoid activation function is a non-linear, S-shaped function that maps any input value to the range between 0 and 1. It is also known as the logistic sigmoid function. The Sigmoidal Function can be employed in either the hidden or output layer of neural networks.

Mathematical Formulation :

Image by Author

Derivative of Sigmoid Activation Function :

Image by Author

The derivative of the sigmoid function ranges from 0 to 0.25, with the maximum value of 0.25 occurring at x = 0, where the sigmoid function has the steepest slope.
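Assuming the standard definitions sigmoid(x) = 1 / (1 + e^(-x)) and sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)), a minimal Python sketch looks like this. The branch on the sign of x is an implementation detail added here to avoid overflow for very negative inputs; it does not change the result.

```python
import math

def sigmoid(x):
    # Two algebraically equivalent forms, chosen so exp() never overflows
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(0.0))             # 0.5
print(sigmoid_derivative(0.0))  # 0.25, the maximum slope
print(sigmoid_derivative(10.0)) # very close to 0: the curve is almost flat here
```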

Image by Author

Benefits of Sigmoidal Function:

  • Bounded Output Range : The bounded output range (0, 1) can be useful in Binary Classification and to normalize the inputs or outputs of a neural network.
  • Interpretable Outputs: The output of the sigmoid function can be interpreted as a probability.
  • Gradient Behaviour: The derivative of the sigmoid function has a maximum value of 0.25 at x = 0. This means the gradient passing through a sigmoid is bounded and can never become too large, although it can become very small when the input is far from 0.

Major Drawback :

  • It can suffer from the vanishing gradient problem, where the gradients become very small for large positive or negative inputs, making it difficult to train deep neural networks effectively. We typically avoid using this function in hidden layers, especially in very deep neural networks.
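As a rough illustration of why depth makes this worse: during backpropagation the gradient is multiplied by one activation derivative per layer, and for sigmoid each of those factors is at most 0.25. The sketch below computes only that best-case factor and deliberately ignores the weights, which also scale the gradient; the function name is hypothetical.

```python
def max_gradient_factor(num_layers, max_derivative=0.25):
    # Upper bound on the product of sigmoid derivatives across the layers
    return max_derivative ** num_layers

for depth in [1, 5, 10, 20]:
    print(depth, max_gradient_factor(depth))
```

Even in this best case, ten sigmoid layers scale the gradient by less than one millionth.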

Tanh Activation Function

The tanh activation function, also known as the Hyperbolic Tangent Function, maps the input to a value between -1 and 1. Tanh is often used in the hidden layers of Neural Networks.

Mathematical Formulation :

Image by Author

Derivative of tanh Activation Function :

Image by Author

The derivative of the tanh function ranges from 0 to 1, with the maximum value of 1 occurring at x = 0, where the tanh function has the steepest slope.
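Assuming the standard definitions tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)) and tanh'(x) = 1 - tanh(x)^2, a minimal sketch using Python’s built-in `math.tanh`:

```python
import math

def tanh(x):
    return math.tanh(x)

def tanh_derivative(x):
    t = math.tanh(x)
    return 1.0 - t * t

print(tanh(0.0))             # 0.0, the output is centered on zero
print(tanh_derivative(0.0))  # 1.0, the maximum slope
print(tanh_derivative(5.0))  # close to 0 for large inputs
```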

Image by Author

Benefits of Tanh Function:

  • Zero-Centered Output : The tanh function maps its input to the interval (-1, 1), with the output being zero-centered. This can be advantageous compared to the sigmoid function, which maps its input to the range (0, 1).
  • Zero-centered outputs can help with the optimization process during training. When the inputs to a layer have a mean close to zero, the gradients flowing through the network tend to have a more stable distribution, which can lead to faster convergence.

The zero-centered output of tanh can be particularly useful in RNNs, such as LSTMs and GRUs, where it can improve the stability and performance of the network when modeling long-term dependencies.

  • Gradient Behavior: The derivative of the tanh function has a maximum value of 1, four times larger than the sigmoid’s maximum of 0.25. These larger gradients can help the model learn more effectively, especially in the initial stages of training.

Major Drawback :

  • Tanh function, like the sigmoid function, can still suffer from the vanishing gradient problem in deep neural networks, where the gradients become very small for large positive or negative inputs.

ReLU Activation Function

The ReLU (Rectified Linear Unit) activation function outputs the input directly if it is positive, and 0 if the input is negative, introducing non-linearity. ReLU is mostly used in the hidden layers of Neural Networks.

Mathematical Formulation :

Image by Author

Derivative of ReLU Activation Function :

Image by Author

The derivative of the ReLU function is 0 for negative inputs and 1 for positive inputs. At exactly x = 0 the derivative is undefined, and in practice it is usually set to 0 there.
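Assuming the standard definition ReLU(x) = max(0, x), a minimal Python sketch follows. Returning 0 for the derivative at exactly x = 0 is a convention chosen for this example.

```python
def relu(x):
    return max(0.0, x)

def relu_derivative(x):
    # 1 for positive inputs, 0 otherwise (including x == 0, by convention here)
    return 1.0 if x > 0 else 0.0

for x in [-3.0, -0.5, 0.0, 0.5, 3.0]:
    print(x, relu(x), relu_derivative(x))
```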

Image by Author

Benefits of ReLU Function:

  • Vanishing Gradient Mitigation : One of the key advantages of ReLU is its ability to mitigate the vanishing gradient problem, which can occur with activation functions like sigmoid and tanh.

The derivative of the ReLU function is 1 for positive inputs and 0 for negative inputs. This means that the gradients flowing through the network can be preserved, particularly in the deeper layers, allowing the model to learn effectively.

  • Sparse Representations: The ReLU function can help the network learn sparse representations, where many of the neuron activations are zero. This sparsity can lead to more efficient and compact models, as well as improved generalization performance.
  • Computational Efficiency: The ReLU function is computationally efficient and simple to implement, as it simply sets negative inputs to 0 and passes positive inputs unchanged. This efficiency can be advantageous, especially in large-scale neural network models.

Major Drawback :

  • ReLU can sometimes suffer from the “dying ReLU” problem, where some neurons become permanently inactive (output 0) during training, reducing the network’s representation capacity.

Leaky ReLU Activation Function

Leaky ReLU is a variant of the ReLU activation function that introduces a small, non-zero slope for negative input values.

Mathematical Formulation:

Image by Author

Derivative of Leaky ReLU Activation Function :

Image by Author

The derivative of the Leaky ReLU function is 0.01 for negative inputs and 1 for non-negative inputs.
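Assuming the usual definition, Leaky ReLU returns x for non-negative inputs and alpha * x for negative inputs, with alpha = 0.01 as in the text. A minimal sketch, where the parameter name `alpha` is an assumption for this example:

```python
def leaky_relu(x, alpha=0.01):
    # Non-negative inputs pass through; negative inputs are scaled by alpha
    return x if x >= 0 else alpha * x

def leaky_relu_derivative(x, alpha=0.01):
    # Slope is 1 for non-negative inputs and alpha for negative inputs
    return 1.0 if x >= 0 else alpha

for x in [-2.0, -0.5, 0.0, 1.5]:
    print(x, leaky_relu(x), leaky_relu_derivative(x))
```

Unlike plain ReLU, the derivative never becomes exactly zero, so a neuron with negative input still receives a small gradient.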

Image by Author

Benefits of Leaky ReLU Function:

  • Fix for “Dying ReLU” : Leaky ReLU introduces a small, non-zero slope (usually 0.01) for negative inputs, which allows a small amount of gradient to flow through even for negative activations. This helps prevent the “dying ReLU” issue and keeps the neurons active, even if they are not strongly activated.
  • It also keeps most of the benefits of the ReLU function, such as computational efficiency and well-preserved gradients for positive inputs.

Not major but still a drawback :

  • The non-zero slope value (0.01) for negative inputs can sometimes lead to slower convergence during training compared to other activation functions. Additionally, the choice of the slope value (typically 0.01) can affect the performance of the model, and there is no universally optimal value.

Softmax Activation Function

The softmax activation function is used for multi-class classification problems, as it outputs a probability distribution over the classes.

Mathematical Formulation :

Image by Author

The softmax activation function is commonly used in the output layer of multi-class classification models, such as image recognition or natural language processing tasks, where the model needs to predict the probability of an input belonging to each class.
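Assuming the standard definition softmax(z_i) = e^(z_i) / (sum over j of e^(z_j)), a minimal Python sketch is shown below. Subtracting the largest score before exponentiating is a common numerical-stability trick added here; it does not change the resulting probabilities. The example scores are made up.

```python
import math

def softmax(scores):
    # Shift by the max score so exp() cannot overflow; the result is unchanged
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores (logits) for three classes
probs = softmax([1.0, 2.0, 3.0])
print(probs)       # roughly [0.090, 0.245, 0.665]
print(sum(probs))  # the probabilities sum to 1 (up to rounding)
```

The class with the highest raw score gets the highest probability, and every output lies strictly between 0 and 1.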

Conclusion

We started with biological neurons, discussed why Activation Functions are fundamental to deep neural networks, explored their functionality and significance, and looked at various types of linear and non-linear activation functions.

If you enjoyed this article, your applause would be greatly appreciated!
