As the name suggests, Supervised Learning is learning under supervision. Consider, for example, the problem of predicting real estate prices from various parameters like the number of rooms, the floor, the locality, and so on. We do not have a precise equation for this, but we know these parameters and the price are related. We also have a good amount of data about recent real estate transactions. We can use this data to identify the relation between the parameters and the prices - the input and the output.

Essentially it involves five steps:

- Guess an equation that relates the input and output values.
- Use the equation to calculate the output for the given input.
- Compare this calculated output with the real values available to us and identify the "error".
- Alter the equation in a way that we expect will reduce the error.
- Continue doing this until the error is within limits, or until any further change only increases the error again.
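As a rough sketch, these five steps can be written as a tiny training loop. The dataset, the linear guess `price = a * rooms + b`, and the learning rate below are all made-up illustrations, not a real pricing model:

```python
import numpy as np

# Made-up data: prices that happen to follow price = 50 * rooms + 10
rooms = np.array([1.0, 2.0, 3.0, 4.0])
price = np.array([60.0, 110.0, 160.0, 210.0])

a, b = 0.0, 0.0          # Step 1: guess an equation, price = a * rooms + b
learning_rate = 0.02

for _ in range(10000):
    predicted = a * rooms + b      # Step 2: calculate the output
    error = predicted - price      # Step 3: compare with the real values
    # Step 4: alter the equation in a way that reduces the error
    a -= learning_rate * (error * rooms).mean()
    b -= learning_rate * error.mean()
    # Step 5: keep iterating until the error is within limits

print(round(a, 1), round(b, 1))    # → 50.0 10.0
```

The update rule used in step 4 is the gradient descent idea discussed further below.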

This is called Supervised Learning. For this to work, we need:

- A good amount of data that can help us map the input to the output.
- A good set of parameters that genuinely determine the price.

In technical terms, the training data is used to train a model until we reach a point of minimum cost.

There are two major types of problems in supervised learning: regression and classification. Regression deals with continuous output values, while classification (as the name suggests) deals with sorting data into discrete output classes.

In order to understand supervised learning, we should first understand some very basic concepts. There is nothing new about these concepts, but having a concrete name assigned to each underlying idea makes it much easier to work with.

The hypothesis is the assumed relation between the input and output. This hypothesis is verified and improved on each iteration. At the implementation level, it could be the individual coefficients of a polynomial expression, the nodes of a neural network, and so on. Essentially, it is a relation that we propose, then correct and refine through the process of learning.

The hypothesis relation is defined by parameters that can be altered to make it fit the given data set. For example, in a polynomial expression, the coefficients are the weights. Typically these are denoted by θ_{0}, θ_{1}, and so on.
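For instance, a linear hypothesis with two weights might be sketched like this (the names `theta0` and `theta1` simply mirror the θ_{0}, θ_{1} notation):

```python
import numpy as np

# h(x) = θ0 + θ1 * x — the weights θ0 and θ1 are what learning will adjust
def hypothesis(x, theta0, theta1):
    return theta0 + theta1 * x

x = np.array([1.0, 2.0, 3.0])
print(hypothesis(x, 5.0, 2.0))   # → [ 7.  9. 11.]
```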

The cost of a hypothesis is a measure of the overall gap between the actual expected output and the output calculated by the hypothesis. This cost is naturally related to the weights that define the hypothesis. The cost function is the formal definition of this cost - as a function of these weights. Given the cost function, a good hypothesis is one with minimal cost.
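A minimal sketch of a cost function, here using the mean squared gap over an invented dataset:

```python
import numpy as np

def hypothesis(x, theta0, theta1):
    return theta0 + theta1 * x

# Cost as a function of the weights, for a fixed dataset
def cost(theta0, theta1, x, y):
    return ((hypothesis(x, theta0, theta1) - y) ** 2).mean()

x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 5.0, 7.0])      # generated by y = 1 + 2x

print(cost(1.0, 2.0, x, y))        # the true weights give zero cost
print(cost(0.0, 0.0, x, y))        # poor weights give a much higher cost
```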

Typically, the regression process of supervised learning begins with proposing a relation between the input and output. This relation could be as simple as a linear equation, or it could be a polynomial or even a massive neural network. Each such relation has many parameters. For example, a linear equation y = ax + b has parameters a and b, which define the equation. The choice of proposal depends upon the analysis of the situation, the availability of data and, of course, the experience of the developer. But once a good model is defined, the next step is to identify the values of these parameters.

Thus, the process of regression consists of multiple iterations of forward propagation and backward propagation. These two steps consist of first finding the error based on the current weights (forward propagation) and then updating the weights based on the last calculated error (backward propagation). Multiple iterations of forward and backward propagation can gradually refine the model. This is the essence of regression - the most fundamental concept in supervised learning.

Gradient descent is one of the most popular methods for identifying the parameters of a learning model. Essentially, gradient descent consists of identifying the error content of the model for the available data, then gradually updating the parameters of the model in a way that ensures the best error reduction on every step. This can be visualized as a ball rolling down a curved surface: at every point it moves in the direction that gives the best reduction in its height. In order to perform gradient descent, we need a good measure of the error function.
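The ball-rolling-downhill picture can be sketched with a single-variable cost J(θ) = (θ − 3)², whose gradient is 2(θ − 3); the starting point and learning rate below are arbitrary choices for illustration:

```python
theta = 0.0          # the ball starts away from the minimum at θ = 3
learning_rate = 0.1

for _ in range(100):
    gradient = 2 * (theta - 3)           # slope of J(θ) = (θ - 3)²
    theta -= learning_rate * gradient    # roll a small step downhill

print(round(theta, 4))   # → 3.0
```

Each step moves θ against the gradient, so the cost shrinks until θ settles at the minimum.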

In traditional gradient descent, we run the forward and backward propagation over the entire dataset to identify the error and then iteratively optimize the model. If this is computationally expensive, why not run through one data point at a time? Stochastic gradient descent trains on one data sample at a time. Its learning curve is not as clean as that of the classical gradient descent algorithm, but it converges very quickly if the hyperparameters are chosen well. However, stochastic gradient descent can be disastrous if we get stuck with the wrong hyperparameters.
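A sketch of the per-sample update style, on an invented linear dataset - the weights change after every single point rather than after a full pass:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1                     # synthetic data from y = 2x + 1

a, b = 0.0, 0.0
learning_rate = 0.05

for _ in range(10000):
    i = rng.integers(len(x))      # pick a single random data point
    error = (a * x[i] + b) - y[i]
    a -= learning_rate * error * x[i]   # update immediately, per sample
    b -= learning_rate * error

print(round(a, 2), round(b, 2))   # converges close to a = 2, b = 1
```

Because each step sees only one point, the path toward the minimum is noisy - which is exactly why the learning curve is less clean than in full-batch gradient descent.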

A major advantage of stochastic gradient descent appears when we do not have all the data available right away - for example, when data streams into the system over the internet.

Stochastic gradient descent is one extreme, just as full-batch gradient descent is the other - and extremes have their limitations. Mini-batch gradient descent tries to merge the benefits of both by going through the process in small batches.
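A sketch of the mini-batch variant under the same kind of invented setup - each step averages the gradient over a small random batch instead of using one point or the whole dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=100)
y = 2 * x + 1                     # synthetic data, again y = 2x + 1

a, b = 0.0, 0.0
learning_rate = 0.05
batch_size = 10

for _ in range(3000):
    idx = rng.integers(len(x), size=batch_size)   # a small random batch
    error = (a * x[idx] + b) - y[idx]
    a -= learning_rate * (error * x[idx]).mean()  # average gradient over the batch
    b -= learning_rate * error.mean()

print(round(a, 2), round(b, 2))   # converges close to a = 2, b = 1
```

Averaging over a batch smooths out much of the per-sample noise while still keeping each step cheap.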

The error function is an important part of the story. The efficiency of the gradient descent method depends heavily on the function that we use to represent the error for the proposed parameters. Two common choices are:

- Mean Square Error - used for continuous regression
- Cross Entropy - used for classification

The cost function can be seen as a cumulative measure of the error function over the training data.
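A minimal sketch of both error measures (the labels and predicted probabilities are invented):

```python
import numpy as np

# Mean Square Error, for regression over continuous values
def mse(y_true, y_pred):
    return ((y_true - y_pred) ** 2).mean()

# Binary cross entropy, for classification - it punishes confident wrong
# predictions far more heavily than a squared gap would
def cross_entropy(y_true, p_pred):
    return -(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred)).mean()

y = np.array([1.0, 0.0, 1.0])          # true class labels
good = np.array([0.9, 0.1, 0.8])       # confident, mostly correct probabilities
bad = np.array([0.4, 0.6, 0.3])        # predictions leaning the wrong way

print(cross_entropy(y, good) < cross_entropy(y, bad))   # → True
```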

No amount of effort on minimizing the cost function will help if the hypothesis is not rich enough. For example, if you try to fit complex data with a simple linear expression, it just cannot work. This scenario is called underfitting.

Overfitting is the other extreme. It is also possible that the hypothesis is excessively rich. In such a case, the hypothesis perfectly fits the available dataset, but it curves heavily between those points, making it very bad for any data that is not in the training set. Such a hypothesis seems perfect while training, but makes absurd predictions when tested on another dataset. For example, we can have a high-order polynomial fitting a set of points that lie on a straight line. While training, we may feel we have done a great job - the fit has zero error. But for any point outside the training set, the predictions go for a toss.
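The straight-line example can be demonstrated directly. Here, eight points near the line y = 2x + 1 (with a small, invented wobble standing in for noise) are fit by a degree-7 polynomial, which has enough parameters to pass through all of them exactly:

```python
import numpy as np

x_train = np.linspace(0, 1, 8)
noise = 0.1 * (-1.0) ** np.arange(8)        # a small alternating wobble
y_train = 2 * x_train + 1 + noise           # points near the line y = 2x + 1

# Degree 7 with 8 points: the polynomial can interpolate every point
overfit = np.polynomial.Polynomial.fit(x_train, y_train, deg=7)
train_error = ((overfit(x_train) - y_train) ** 2).mean()
print(train_error < 1e-10)        # → True: "zero error" while training

# Just outside the training range, the prediction swings far off the line
x_test = 1.05
gap = abs(overfit(x_test) - (2 * x_test + 1))
print(gap > 0.5)                  # → True: a huge gap compared to the 0.1 wobble
```

The training error is essentially zero, yet a point barely outside the training range is off by far more than the noise that was in the data - the curve is bending wildly between and beyond the points it memorized.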