Overfitting is one major problem that can lead to disappointment after a lot of effort. Underfitting is not so bad because we know about it when training. Overfitting is a surprise on the fields! Hence, it is important to eliminate possibility of overfitting. There are various techniques to limit this problem. Few of the important ones are:
The input data is split into 3 parts: The training set - that is used to train the hypothesis. Once this is done, the trained hypothesis is verified on the dev set. If there is no underfitting or overfitting, it is expected that the cost for the dev set would be similar to what was observed in the training set. If the cost for Dev set is higher, it means that the hypothesis is overfitting the training set.
If either cost is much higher that acceptable, it means that the hypothesis underfits the available data. In such a case, we need to try out a richer hypothesis. Once the hypothesis is finalized by working on the train/dev set, then it should be tested on the test set - to make sure all is well.
The golden rule for splitting the input data into these three was 60:20:20 %. But, lately, we have seen a huge burst in the available input data. When we have millions of records in the input data, it does not make sense to push 40% of it into dev/test sets. Hence, most of it is used in the train set, so long as we have a few thousand records for the dev/test set each.
Three fold splitting of available data into the train-dev-test proved quite useful. So people developed the concept of cross validation. This takes the concept another step further.
Here, we divide the available data into N folds. Each of these folds is then sequentially used to test the training by the rest of the data. For example, we start with the first fold. Here we train the data with the remaining N-1 folds. Once we have trained it, we test it with the first fold that we had reserved. We then do the same with the second fold, then the third . . and so on.
Thus we run the process N times - to get a much better result. Of course this also requires a lot more processing. Another important limitation is that we need to gather the entire data set in one pool and then play with it over and over. This may seem simple with a small amount of data. But when we talk in Tera Bytes and Peta Bytes, the calculations are quite different.
A common symptom of an overfitting curve is that some of the higher weights are too large. "Large and small" are very vague terms. But for any two models, if the training error is similar, one can reasonably guess that the one with larger values for higher weights is overfitted. In other weights, for the same error in the training, one can consider the "Error" is higher in the model with larger weights. To account for this we can make a small change to the error function and add a component that tends to penalize higher weights. Thus, the gradient descent will naturally move to the model with lower weights - thus avoiding overfitting.
Depending upon how the weight error is calculated, we have two types of weithg penalty - L1 & L2. L1 penalty is calculated by summing over the absolute value of the weights and the L2 penalty is calculated using their mean square value. Each has its own peculiarities and advantages. We cannot use derivatives with L1 penalty - so it cannot be used with simple gradient methods. It requires solving a convex optimization problem. But it has some major advantages. It drives some weights exactly to zero and it learns sparse models. This is particularly useful when we have too many features. Regularization based on L1 penalty helps us identify the important ones. L2 penalty is based on mean square sum of the weights. This is differentiable and can gracefully fit into the gradient descent algorithm.
It is often found that all the features are not orthogonally independent. In such cases, it does help to spend some time trying to rework the features into orthogonal features - that are much less in number. With the reduction in the number of features, the problem of gradient descent is much simpler and the chances of overfitting are significantly reduced.
It is not a simple task to identify orthogonal features when we have thousands of them - and we know nothing about their interdependence. But we have some interesting algorithms that help us with the task.
At times, we notice that the value added by some features is less than the problem. In such scenario, it helps to just discard some dimensions.
The root of overfitting is the limitted training data set - because the amount of data available is much less than what is required to train the model for all its features. Naturally there are two ways to solve this problem - reduce the number of features or increase the data. Dimensionality reduction takes the first approach and Data augmentation takes the second.
It is often possible to use the existing data set to generate a larger amount of data to better train the model. Of course it is not possible to generate more information that is already available in the available data. But it is possible to pump up the information using your knowledge about the domain and the model being trained. For example, image data can be easily augmented by flipping the images of sliding about etc - because we know how the images work.
Another way to ensure that no single weight can drive the model crazy is skipping or forcing some weights to zero on every sample processed in the of the iteration. This makes sure each weight independently learns the information well enough to contribute to the net output rather than developing a tendency to drag it away.
Again, this may not be so intuitive. To understand this, we can look at overfitting from another perspective. Overfitting is caused when the model is too rich. In such a scenario, we end up with some parts of the model that do not add value to the model. We need to ensure that each component of the model adds value - thus the overall model is better than each component.
To do this, we train parts of the model and put them together. This works well. But if the parts of the model are not united, we get little value compared to the scenario that entire model goes together. To get the best of both the worlds, we implement dropouts. Here, we randomly drop parts of the model - set some weights to 0 for each epoch.
This ensures that each part of the model works in coordination with the rest of the model and also each weight contributes value to the entire model.
Of course, we need good amount of data to train the models well and avoid overfitting. The above methods can help us get more and more value out of the available data. But that cannot eradicate the need for data. If nothing else works, we have no choice but to go back and collect more data.