Recurrent Neural Networks


Life can be understood backwards, but must be lived forwards - Soren Kierkegaard

That is a fact: we can learn only from the past and apply it to our decisions, hoping that the past is still relevant in the future. This is the basic assumption behind all of machine learning.

In fact, most of our learning is based on a sequence of events in time. It comes not from a single moment or a set of isolated moments, but from a sequence of moments. The sequence itself matters, and it has a combined impact on what follows.

The basic machine learning algorithms we have seen so far take a single chunk of inputs and produce an output. We train the models on a huge set of such input-output pairs, and we get a model ready to generate the appropriate output for a new set of inputs. But this works only for a fixed set of simultaneous inputs.

But we may not always be able to provide that. Consider, for example, an application that predicts the next shot in a tennis match. Each shot is potentially influenced by everything that has happened so far. If we want to train a model on such a continuous sequence of events, we cannot wait for the whole sequence to complete before we start training; that does not make sense. Also, the output may depend on an input that arrived a long time back, so we need all the previous inputs for the training.

Well, we could build a massive model with a huge number of inputs that we fill in as the data becomes available. But given the size and complexity of such a model, that is hardly practical. We need a better way of handling such a problem.

Recurrent Neural Networks (RNNs) are the solution to this problem. In simple words, an RNN takes its own output as one of the inputs for the next training step. Looking at this a bit more closely, the output of one step carries the information content of all the inputs prior to that step. By passing it into the input of the next step, we effectively pass along everything that has happened so far. Each new step therefore receives information from all previous steps as part of its input.

As you can see in the figure above, the output from one step is sequentially fed into the next step.
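To make the idea concrete, here is a minimal sketch of one recurrent step in Python, assuming a plain tanh-based RNN cell; the weight names (W_xh, W_hh, b_h), the sizes, and the random toy sequence are illustrative choices, not something prescribed above.

```python
import numpy as np

# A minimal sketch of a plain (Elman-style) RNN cell with a tanh activation.
# The names and sizes below are illustrative assumptions.

input_size, hidden_size = 3, 4
rng = np.random.default_rng(0)

W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden (the recurrent part)
b_h = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """Combine the new input x_t with the previous step's output h_prev."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Feed a toy sequence through the cell: the hidden state h carries
# information from every earlier input into the next step.
h = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):
    h = rnn_step(x_t, h)
print(h)
```

The loop is the whole point: the hidden state produced at one step is handed back in at the next step, so every new step sees a summary of everything that came before it.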

Of course, this is an oversimplified picture. Real RNNs are quite complex, and the potential outcomes justify that complexity.

Vanishing & Exploding Gradients


The core concept of the RNN sounds great, but there are some problems. The information content from older inputs fades gradually. That is fine if we are sure the output is reasonably determined by the last few inputs and nothing beyond that; simple RNNs work great in such a scenario.

But if that is not certain, it is difficult to retain the information from an old input. At each step, the information from the new input squeezes in, leaving less space for the previous ones. We could avoid this by giving higher weights to the history, but then we end up ignoring the present - which is not good either.

If we give a higher weight to the history, we get exploding gradients - the older the data, the more value it accumulates. If we give a lower weight to the history, we get vanishing gradients - the old information is forgotten too soon. One might consider tuning the hyperparameters to a golden mean that avoids both extremes. But that does not work either, because we do not need the whole of the history, only some specific element of it - and we do not know in advance which one.
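A rough numerical toy (my own illustration, not from the text) shows why this happens: if we think of the recurrent connection as a single weight w, backpropagating through T steps multiplies the gradient by w once per step, so it scales like w**T.

```python
# Toy sketch of vanishing vs exploding gradients in a simple recurrence.
# Treat the recurrent connection as one scalar weight w; the gradient that
# reaches an input T steps back has been multiplied by w a total of T times.

for w in (0.5, 1.0, 1.5):          # low, balanced, and high weight on the history
    print(w, [w ** T for T in (1, 10, 50)])

# w = 0.5 -> about 8.9e-16 after 50 steps: the old input's signal vanishes.
# w = 1.5 -> about 6.4e+8 after 50 steps: the gradient explodes instead.
```

A weight below 1 drives the old signal toward zero, while a weight above 1 blows it up. Real RNNs use weight matrices rather than a single number, but the compounding effect over many steps is the same.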

For example, consider this page. The phrase "Neural Network" has not been used for a long stretch of text. Yet, from the context, we all know we are talking about Neural Networks. For this, we do not need to remember each and every word in the document; but that particular word sequence, which appeared long ago, is needed to define the context of all the text that follows.

Thus, simple RNNs do a good job when we are okay with vanishing gradients, i.e., when we are sure we need only the recent history. They get into trouble as the history grows and the parts of it that matter become harder to pin down.