Unsupervised learning, as the name suggests, is about learning without supervision. What does it mean to learn without supervision? What can you learn without supervision? As we saw in the regression, the supervision comes from a part of the input data set - that is required to ascertain the correctness of your hypothesis. The input data has the questions as well as correct answers - the machine just needs to map them.

Unsupervised learning is quite different from this. Here, we do not have any answers. No questions either! Then what do we have? We have raw data and we have no idea about its structure. There is nothing to learn from. Unsupervised learning is just making sense of the data in hand. This is not just a theoretical fantacy. Unsupervised learning has a great application for analyzing data that we know nothing about.

When faced with such a situation, of huge chunk of unknown data, that natural tendency is to categorize it into different parts based on the parameters we already know. Then, we can identify the tendencies of each of these clusters so that we can get some meaningful mapping of the data. If your parameters are wide enough, your predictions will be correct too.

Clustering is one of the most important problem solved using unsupervised learning. It is used to discover the inherent groupings in the input data. For example: Grouping companies based on their revenues or grouping countries based on their GDP, etc. Since it is a form of unsupervised learning, there is no feedback or verification while learning. The grouping is purely based on the values associated with the input samples.

In theory, data points that are in the same group should have similar properties and/or features, while data points in different groups should have highly dissimilar properties and/or features. Such clustering is a common technique for statistical data analysis used in many fields.

Clustering and Unsupervised Learning in general forms an important part of any real life AI development. Most of the data inputs for today's AI development is from the internet and mobile devices. This is raw data fetched out of the chaos called internet. Not everything is useful for the particular problem. The available data has to be assorted to identify what could be useful. Clustering is one of the primary approaches of doing this.

There are many different algorithms used for clustering. Principally, there are three main types of algorithms for doing this.

In agglomerative hierarchical algorithms, we start by defining each data point as a cluster. Then, the two closest clusters are combined into a new cluster. In each subsequent step, two existing clusters are merged into a single cluster. In divisive hierarchical algorithms, we start by putting all data points into a single cluster. Then we divide this cluster into two clusters. At each subsequent step, we divide an existing cluster into two clusters. Agglomerative methods are used much more often than divisive methods. Hierarchical methods can be adapted to cluster variables rather than observations. This is a common use for hierarchical methods.

There are different ways we can connect two points in the hierarchy. Each has its own nuances.

- Single Linkage: In single linkage, we define the distance between two clusters as the minimum distance between any single data point in the first cluster and any single data point in the second cluster. On the basis of this definition of distance between clusters, at each stage of the process we combine the two clusters with the smallest single linkage distance.
- Complete Linkage: In complete linkage, we define the distance between two clusters to be the maximum distance between any single data point in the first cluster and any single data point in the second cluster. On the basis of this definition of distance between clusters, at each stage of the process we combine the two clusters that have the smallest complete linkage distance.
- Average Linkage: In average linkage, we define the distance between two clusters to be the average distance between data points in the first cluster and data points in the second cluster. On the basis of this definition of distance between clusters, at each stage of the process we combine the two clusters that have the smallest average linkage distance.

In centroid method, the distance between two clusters is the distance between the two mean vectors of the clusters. At each stage of the process we combine the two clusters that have the smallest centroid distance.

This method does not directly define a measure of distance between two points or clusters. It is an ANOVA based approach. One-way univariate ANOVAs are done for each variable with groups defined by the clusters at that stage of the process. At each stage, two clusters merge that provide the smallest increase in the combined error sum of squares.

In a non-hierarchical method, the data are initially partitioned into a set of K clusters. This may be a random partition or a partition based on a first "good" guess at seed points which form the initial centers of the clusters. Then data points are iteratively moved into different clusters until there is no sensible reassignment possible. The initial number of clusters (K) may be specified by the user or by the software algorithm. The most commonly used non-hierarchical method is MacQueen's K-means method.

This has some advantages and some disadvantages compared to the Hierarchical algorithms. The main question here is guessing the right value for k. If there are k+1 distinct clusters and we start off with k, we will never have a good result. However, if the cluster chunks are not so distinct, a non-hierarchical cluster can help in segregating the data rather quickly.

A model based method uses a mixture model to specify the density function of the x-variables. In a mixture model, a population is modeled as a mixture of different subpopulations, each with the same general form for its probability density function and possibly different values for parameters, such as the mean vector. For instance, the model may be a mixture of multivariate normal distributions. In cluster analysis, the algorithm provides a partition of the dataset that maximizes the likelihood function as defined by the mixture model. Examples are Affinity Propagation, MeanShift, DBSCAN.

Any clustering is meaningful in the context of a given measure of association or a measure of distance. There are several definitions of the association or distance between two points.

In mathematics, the Euclidean distance or Euclidean metric is the "ordinary" straight-line distance between two points in Euclidean space. With this distance, Euclidean space becomes a metric space. The associated norm is called the Euclidean norm

`d(X1, X2) = (sum((X1(i) - X2(i))^2))^(1/2)`

The Minkowski distance is a metric in a normed vector space which can be considered as a generalization of both the Euclidean distance and the Manhattan distance. Essentially, it is a generalization of the Euclidean distance to power of n

`d(X1, X2) = (sum((X1(i) - X2(i))^n))^(1/n)`

The Canberra Metric is intuitively quite different from the Euclidean distance. In a gross sense, one can think of it as a measure of angle in the Euclidean space.

`d(X1,X2) = sum(abs(X1(i) - X2(i)) / (abs(X1(i)) + abs(X2(i))))`

Similar to Canberra Metric, the MetricCzekanowski Coefficient also measures an angular distance, but sums over the values before taking a ratio.

`d(X1,X2) = 1 - 2 * sum(abs(X1(i) - X2(i))) / sum(abs(X1(i)) + abs(X2(i)))`

Or we can create our own. Any distance measure should satisfy some properties

Distance between point X1 and X2 should be same as distance between point X2 and X1.

`d(X1, X2) = d(X2, X1)`

Distance between two distinct points X1 and X2 should be greater than 0

`d(X1, X2) > 0 if X1 != X2`

Distance of a point with itself should be 0

`d(X, X) = 0`

The generic triangular inequality should also hold for the new measure of distance

`d(X1, X2) + d(X2, X3) >= d(X1, X3)`