The concept of neural networks is not new. In fact, it dates back to 1943, when the first neural network model was published by McCulloch and Pitts. The concept was refined further with the development of the Perceptron in 1957. But it remained a subject of academic study - mainly due to the extensive computational power required to implement it, and the lack of that power in those days.

Now that we have immense computational power available to us, neural networks have become a hot topic of study and implementation. But the core concept of the Perceptron - the fundamental unit of the neural network - has not changed.

From the mathematical point of view, there is a limit to what a linear function can do, and it was noticed that the results of polynomial models were not good enough to justify their computational expense. Hence the concept of neurons picked up momentum. A lot of Machine Learning is inspired by how the human mind works. The concept of Neural Networks goes one step further - taking inspiration from the way neurons are laid out in the human brain.

As you can see in the image above, a neuron receives multiple inputs and, based on these, gives out one output that feeds into another set of neurons, and so on. The nervous system is built of many such neurons connected to each other. Each neuron contributes to the decision process by appropriately forwarding the input signal - based on the training it has gathered. Each neuron has minimal functionality on its own, yet a model built from many of them has the potential to capture much of what a human brain does.

Neurons are implemented as a linear function with a non-linear topping - called the activation function. Thus, each neuron is defined by a weight for each input and a bias. The result of the linear operation is fed into the activation function, and the final output becomes the input for the next set of neurons. Such an artificial neuron is called a Perceptron.
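A single neuron of this kind can be sketched in a few lines of Python. This is only an illustrative sketch - the weights, bias and input values below are arbitrary, and the sigmoid is just one possible activation function:

```python
import numpy as np

def sigmoid(z):
    # squash the linear output into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def perceptron(x, w, b):
    # weighted sum of the inputs plus a bias, then a non-linear activation
    return sigmoid(np.dot(w, x) + b)

# three inputs into one neuron, with hypothetical weights and bias
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.4, 0.3, 0.1])
b = 0.2
out = perceptron(x, w, b)   # a single scalar in (0, 1)
```

The scalar `out` would then be one of the inputs to each neuron in the next layer.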

Often the network has multiple layers of such Perceptrons. That is called an MLP (Multi-Layer Perceptron). In an MLP, we have an input layer, an output layer and zero or more hidden layers.

Each Perceptron has an array of inputs and an array of weights that are multiplied with the inputs to generate a scalar. This processing is linear, and a composition of linear operations is itself linear - so it cannot fit a non-linear curve, irrespective of the depth of the network. If the network has to fit non-linear curves, we need some non-linear element in each perceptron. Hence, perceptrons are tipped with a non-linear activation function. This could be a sigmoid, tanh, ReLU and so on. Researchers have proposed several activation functions that have specific advantages.
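The claim that stacking linear layers adds no expressive power can be checked directly: two linear layers with no activation between them collapse into a single equivalent linear layer. A small sketch with arbitrary random weights:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)

# two "layers" with no activation function in between
W1, b1 = rng.standard_normal((3, 4)), rng.standard_normal(3)
W2, b2 = rng.standard_normal((2, 3)), rng.standard_normal(2)
two_layer = W2 @ (W1 @ x + b1) + b2

# ...which collapse into one equivalent linear layer
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

# the two computations agree for every input x
```

This is exactly why the non-linear activation between layers is essential.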

With everything in place, a neural network looks like this:

The layout, width and depth of the network is one of the most interesting topics of research. Experts have developed different kinds of networks for different kinds of problems. The deeper and larger the network, the greater its capacity. The human brain has around 100 billion neurons. Neural Networks are nowhere near that - some researchers quote experiments with millions. This concept of large neural networks, or Deep Learning, is not new. But it was long limited to mathematical curiosity and research papers. The recent boom in the availability of massive training data and computing power has made it a big success.

Building, training and tuning Neural Networks is a massive domain and deserves many blogs dedicated to each topic.

The perceptron, better known today as the artificial neuron, is a processing unit composed of two components. The first component is the adder, which computes a weighted sum of the input values. This is followed by the activation function, which adds non-linearity to the output.

The activation function we use is a major component of the neuron (perceptron) and hence of the neural network as a whole. The rest of the neuron is just the weights that we train. But the efficiency and trainability of the neuron is primarily defined by the activation function it uses.

There are different types of activation functions we can use. Each has its own ups and downs in terms of complexity, processing power and trainability. We can choose the right one based on the problem at hand.

This is the simplest and most primitive of all. It was used by the McCulloch-Pitts neuron. Here, the output value is 1 if the input is greater than 0; else, it is 0. It is discontinuous at x=0, and its gradient is zero everywhere else. So it is not possible to use this with gradient descent.
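As a function this is a one-liner (a plain Python sketch):

```python
def step(x):
    # McCulloch-Pitts style threshold: 1 if the input exceeds 0, else 0
    return 1 if x > 0 else 0
```

Because its gradient is zero wherever it is defined, there is no useful error signal to propagate back through it.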

This is a frequently used activation function. It defines the output by the function g(x) = 1 / (1 + e^{-x}). The value of e^{-x} varies from 0 (for x approaching infinity) to infinity (for x approaching negative infinity).

So g(x) is a continuous and differentiable function whose value lies between 0 and 1. It is interesting to note that g(x) is almost 1 for large positive numbers and almost 0 for large negative numbers - the major change occurs in a narrow band around 0. The importance of this property will be more evident when we look at feature normalization. Because the gradient is nearly zero outside that band, the sigmoid causes the problem of vanishing gradients - reducing trainability.
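A short sketch makes the saturation behaviour concrete - for large |x| the output flattens out and the derivative drops to almost zero:

```python
import math

def sigmoid(x):
    # g(x) = 1 / (1 + e^{-x}), output in (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # derivative of the sigmoid: g(x) * (1 - g(x)), maximal at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)
```

At x = 0 the gradient is at its maximum of 0.25; by x = 10 it has all but vanished, which is the vanishing-gradient problem in miniature.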

The tanh - or hyperbolic tangent - activation function is based on similar concepts as the sigmoid. Tanh is defined as (1 - e^{-2x}) / (1 + e^{-2x}). One can see that the output of this function varies between -1 and 1. Also, just like the sigmoid activation, tanh has the problem of vanishing gradients.
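The formula above translates directly into code, and matches the standard library's `math.tanh`:

```python
import math

def tanh(x):
    # (1 - e^{-2x}) / (1 + e^{-2x}), the hyperbolic tangent
    e = math.exp(-2.0 * x)
    return (1.0 - e) / (1.0 + e)
```

Unlike the sigmoid, its output is centered on 0, which often helps training - but it saturates towards -1 and 1 in just the same way.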

Also known as the rectified linear function, it is defined as g(x) = (x > 0) ? x : 0

This may seem too trivial to be valuable. But it is one of the most popular activation functions today. It adds non-linearity as well as sparsity to the model - because at any time, only a few of the nodes actually contribute to the output; the rest are zeroed out. This can also cause a problem where some nodes remain inactive and always zeroed out - but there are other ways to handle that problem.
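The sparsity effect is easy to see on a hypothetical layer output - only the positive entries survive:

```python
import numpy as np

def relu(x):
    # zero out negative activations, pass positive ones through unchanged
    return np.maximum(x, 0.0)

# an illustrative layer output: the negative nodes are silenced
z = np.array([-1.5, 0.3, -0.2, 2.0])
a = relu(z)
```

Here half of the nodes end up contributing nothing, which is exactly the sparsity the text describes.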

This is quite different from the other activation functions we have seen so far. All the above functions depend only on the input to the given neuron. But the softmax activation also accounts for all the other neurons in the layer. It is a normalized exponential function.

Mathematically, the softmax activation function can be defined as y_i = e^{x_i} / Σ_j e^{x_j}
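In code, the definition above looks like this (the max is subtracted first - a standard trick that leaves the result unchanged but avoids overflow in the exponentials):

```python
import numpy as np

def softmax(x):
    # subtract the max for numerical stability before exponentiating
    e = np.exp(x - np.max(x))
    # normalize so the outputs sum to 1
    return e / e.sum()

# illustrative scores for a three-class output layer
scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
```

The outputs behave like probabilities: they sum to 1, and the largest score gets the largest share.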

The normalization ensures that the gradient never blows up and the training algorithm does not stumble on the gradients. But on the other hand, it can also defeat the purpose of having several nodes in the layer. So it is rarely used in the inner layers - but it works out to be the best choice for the final output layer.

Most common architectures have a Softmax for the final output layer with a ReLU for the inner layers.

Neural Networks offered a major breakthrough in building non-linear models. But that alone was not enough. No amount of training and data helps if the model itself is not rich enough. After all, the amount of information contained in the model is limited by the number of weights therein. A simple network with a hundred weights cannot define a model for complicated tasks like face recognition. We need a lot more.

It may seem simple: just increasing the number of perceptrons in a network increases the count of weights, so what is the big deal? But it is not that simple. As the network grows larger, many other problems start creeping in. In general, the ability of the network does not grow linearly with the number of perceptrons. In fact, it can decrease beyond a point - unless we take care of some important aspects.

The capacity of a network is a lot better when the network is deeper rather than wider - that is, when the network has a lot more layers rather than too many perceptrons in the same layer. Such deep neural networks have enabled miraculous innovations in the past few years. Deep Learning is the branch of Machine Learning that deals with these deep Neural Networks.