Theory and implementation are often poles apart. It is important to understand the theory, but it is even more important to know how to put it into practice. For that, we need a sense of the issues involved in solving real-world problems and the typical solutions employed.

It is wise not to waste time solving every problem from first principles. There are times when we must question the basics and try to think differently, but more often it is wiser to start from established best practices and see how things work.

Machine Learning experts have defined several such best practices. Many of them may seem intuitive or trivial; others seem nerdy. But, as we said, it is important to have them in mind when we start. That helps us focus on the real problem rather than the peripherals.

A single evaluation metric does a lot to simplify the journey. Most real-life problems require a multi-factor evaluation; in machine learning as elsewhere, a solution is good only if it runs fast and gives good values for all the output parameters.

How would we compare two solutions that trade these factors off against each other? How do we compare a solution that gives better accuracy on one output at the cost of another? How would we weigh that against a solution that gives a similar, but slightly higher, error on both outputs?

These are questions we have to ask ourselves and translate our answers into numbers. This ensures we do not get stuck in adjectives like "similar" or "better". Once we have a single numeric value for evaluation, development becomes a lot easier and far less dependent on human monitoring.

There are different types of solution metrics. Sometimes we have multiple outputs, each of which is important, and we want to optimize all of them. At other times, optimizing a given output is no longer meaningful once it crosses a given limit.

Thus, we have satisficing metrics and optimizing metrics. For a satisficing metric, we just need to ensure that it crosses a given threshold. For an optimizing metric, we want to keep improving it. We can also have situations where satisfying some metrics is more important than satisfying others.
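As a rough sketch of this idea, model selection can be expressed as: filter out candidates that fail the satisficing constraint, then pick the best on the optimizing metric. The candidate names, metric values, and latency limit below are illustrative assumptions, not from the text.

```python
# Hypothetical candidate models with two metrics each:
# accuracy (optimizing) and latency (satisficing).
candidates = [
    {"name": "A", "accuracy": 0.92, "latency_ms": 80},
    {"name": "B", "accuracy": 0.95, "latency_ms": 160},
    {"name": "C", "accuracy": 0.90, "latency_ms": 40},
]

LATENCY_LIMIT_MS = 100  # satisficing threshold (assumed for illustration)

def pick_best(models):
    # Keep only models that satisfy the latency constraint...
    ok = [m for m in models if m["latency_ms"] <= LATENCY_LIMIT_MS]
    # ...then optimize accuracy among the survivors.
    return max(ok, key=lambda m: m["accuracy"]) if ok else None

best = pick_best(candidates)
```

Note that model B, despite the highest accuracy, is rejected outright: once the satisficing constraint is violated, no amount of accuracy compensates.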

Common examples of multiple evaluation parameters are:

- Positive and negative errors
- Short term and long term errors
- Output accuracy and run time

Based on the problem at hand, we need to identify the category, importance, and relationship of each evaluation metric. From these, we should derive a single numeric formula that can be used to evaluate a given model. Until we have this single number, we will need human intervention at each step to see how a change affects the model.

It is true that machine learning is meaningless without a well-sized network and a huge amount of data. But it is also true that we do not need everything right away. In fact, there is a lot to do before we can fruitfully use all that we have.

It is very useful to just start with a small setup, to get a feel for how things move. The basic structure of the network can be discovered using a small amount of data. Doing this gives us an initial stepping stone at a relatively small processing cost. With this in place, we can start adding more data and gradually enrich our model.

This has many advantages.

When we start with a small network and a simpler model, we naturally reduce the chances of overfitting. Moreover, when we train the model on a small subset of the whole data with a small network, we are left with generous dev and test sets. All of this reduces the chances of overfitting.

Once we have an underfitting model, we have a baseline. With this baseline in place, we can start training the bigger model on the real data set. But now we have an advantage: against the baseline, we can be sure we have improved the model rather than lost ground to overfitting.
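A minimal sketch of this baseline-first workflow, with a deliberately trivial constant predictor standing in for the "small model" (the synthetic data and the mean predictor are illustrative assumptions):

```python
import random

# Synthetic data: y is roughly 2x plus noise (assumed for illustration).
random.seed(0)
data = [(x, 2 * x + random.gauss(0, 0.1)) for x in range(1000)]

# First iteration: work with a small subset only.
subset = data[:100]

# Baseline "model": always predict the subset's mean target value.
mean_y = sum(y for _, y in subset) / len(subset)

def mean_abs_error(dataset):
    # Mean absolute error of the constant predictor over a dataset.
    return sum(abs(y - mean_y) for _, y in dataset) / len(dataset)

# Any bigger model trained later must beat this number
# for us to call it genuine progress.
baseline = mean_abs_error(data)
```

The point is not the model, which is intentionally weak, but the number it produces: every subsequent, more expensive iteration is judged against it.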

No model is trained in a single attempt. Many iterations are required to reach a stable design that can be improved with further training, and each iteration can be very expensive. Many of the finer aspects of training require a lot of data, but some grosser aspects really do not. If we start with a small model and a subset of the data, we can quickly fix those gross aspects and then scale up to handle the finer ones. We need fewer iterations with the whole data.

Machine learning is all about data, and if the data is not good, the outcome can never be good. Along with the training set, the dev / test sets also play a significant role - often the dev / test sets have a much larger impact on the outcome.

Some important points that we must note in this regard:

Choose dev and test sets from a distribution that reflects the data you expect to get in the future and want to do well on. This may not be the same as your training data's distribution. As far as possible, draw the dev and test sets from the same distribution as each other.

There are several traditional heuristics - 60:20:20, 70:10:20, and some even say 99:0.5:0.5! It all depends on the actual size of the available data. Your dev set should be large enough to detect meaningful changes in the accuracy of your algorithm, but not necessarily much larger. Your test set should be big enough to give you a confident estimate of the final performance of your system. And never forget that there is no point evaluating a model without training it: the training set should form the significant part of the story. We should extract dev/test sets just large enough to serve the purpose.
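The ratio-based splits above can be sketched as a small helper. The function and its default 60:20:20 ratios are illustrative, not a prescribed API:

```python
def split(data, train_frac=0.6, dev_frac=0.2):
    # Split a dataset into train/dev/test by fraction.
    # Defaults follow the traditional 60:20:20 heuristic;
    # with very large datasets, ratios like 99:0.5:0.5 are
    # common, since dev/test only need to be big enough to
    # detect meaningful differences between models.
    n = len(data)
    n_train = int(n * train_frac)
    n_dev = int(n * dev_frac)
    return (
        data[:n_train],
        data[n_train:n_train + n_dev],
        data[n_train + n_dev:],
    )

train_set, dev_set, test_set = split(list(range(100)))
```

In practice you would shuffle (or stratify) before slicing so that all three sets come from the same distribution; the slicing here is kept plain for clarity.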

The dev set was meant to guard against overfitting, but after many iterations we can end up overfitting the dev set as well. One of the prominent symptoms of overfitting is that error levels vary significantly across different data sets drawn from the same distribution. Thus, if we notice a significant disparity between the dev set error and the test set error, it is quite likely that we have overfit the dev set. We should refresh the dev set periodically to avoid this problem.
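This symptom check can be expressed as a one-line guard. The tolerance value is an assumption for illustration; a reasonable threshold depends on the dev set's size and the noise in the metric:

```python
def dev_overfit_suspected(dev_error, test_error, tolerance=0.02):
    # If the test error exceeds the dev error by more than the
    # tolerance, the dev set has likely been overfit through
    # repeated rounds of tuning against it.
    return (test_error - dev_error) > tolerance

dev_overfit_suspected(0.05, 0.12)  # large disparity: suspicious
dev_overfit_suspected(0.05, 0.06)  # within tolerance: fine
```

When the check fires, refreshing the dev set with new samples from the same distribution restores its value as an honest proxy for the test set.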