Unsupervised learning, literally means learning without supervision. Supervised learning is based on training data. We have a large set of observations and the model is trained in order to generalize the relation between the input and output. But what would we do if such "labeled data" is not available?

It is a common experience that we simply observe our surroundings to make some sense out of it. We have no details about what we would observe, but based on the observations, we get some information about the surroundings. Introspection would tell us that the first and foremost step in such a situation is identifying groups among observations - identifying if an observation is more like one than the others.

This is called clustering. Below are some of the important algorithms that are used for clustering a large data set.

K-Means is an interesting way of identifying clusters in the given data. Conceptually, we can think of the process as follows:

If we want to identify K clusters in the data set; we start with picking K separate points in the data space. Assuming these K points are the centroids, identify the K clusters. Any point in the data space belongs to the cluster defined by the closest centroid. Now, based on these clusters identify the new centroids. And then identify the clusters again. If we do this recursively, we should ideally end up with a situation where the K centroids do not move much on each iteration. The clusters are thus identified using the K-Means algorithm.

Ofcourse, there are several important factors here. How do you choose K? If K is very large, we might end up with a cluster for every data point. That defeats the purpose of clustering. On the other hand, if we use K = 1, we have one big cluster for the entire data set. This again defeats the purpose. Ideally, we should choose K such that the clusters identified are indeed different from each other - That is the distance of all points from the respective centroid is significantly less than the distance between the centroids.

More formally, sum of square of distance between centroid and the data points within each cluster is considered a metric for optimizing the model. As the value of K increases, the sum of square of distances decreases for every step. But this decreases is very fast for the initial values of K, and then it slows down. This defines the right value of K.

In real life, it is very difficult to know the number of such details before we start. But we still have use cases for K-Means - in cass where we want to break the data into a predefined number of clusters. Else, if we have enough computational power, we can try over various number of clusters - regressively trying to reduce the mismatch. That is certainly not the best way out.

We have other more effective algorithms. But they are not as simple.

This is one of the most important among the different clustering algorithms. It is also more elaborate than the others. Affinity Propagation, takes the similarity between pairs of data points as input. It considers all data points as potential exemplars. It works by exchanging real valued messages between data points until a high-quality set of exemplars and corresponding clusters gradually emerges. This is not as complicated as it sounds. Let's look at it in detail.

The algorithm requires us to provide two sets of data:

- Similarities between data points, representing how well-suited a point is to be another one’s exemplar. If there’s no similarity between two points, as in they cannot belong to the same cluster, this similarity can be omitted or set to -Infinity depending on implementation.
- Preferences, representing each data point’s suitability to be an exemplar. We may have some initial information which points could be favored for this role, and so we can represent it through preferences.

All this information is normally represented through a single matrix. The main diagonal of this matrix represents preferences. Another approach is to keep the similarities between points in the memory. (The latter works better when the data is sparse). The 'exchanging messages between points' is nothing more than matrices manipulation. The algorithm then runs through a number of iterations, until it converges.

Thus, each iteration has two steps:

- Calculating responsibilities
- Calculating availabilities.

Responsibility and Availability are two basic components of Affinity Propagation. To understand them, consider the scenario of a group of employees working with an employer. From the team to work, you have to ensure that the employees find the employer better than other employers around. Additionally you have to ensure that the employees get along well. The first is measured in terms of Responsibility while the second is measured in terms of Availability.

In mathematical terms, the Responsibility r(i, k) reflects the accumulated evidence of k's potential for being an exemplar of i - considering the various different possible exemplars for i. On the other hand, Availability - a(i, k) is the accumulated evidence for k's potential of being an exemplar of i - considering the various different points that consider k as an exemplar. Essentially, the "Responsibility" measures k's responsibility to take up i under its hood. On the other hand, "Availability" measures k's availability for the new point i. While working on the classification, the Responsibility message is passed from i to k and Availability message is passed from k to i. That is, i tells k if it thinks k is responsible for i and k replies if it feels that it is available for i.

In order to calculate responsibilities, the algorithm uses original similarities and availabilities calculated in the previous iteration (initially, all availabilities are set to zero). Responsibilities are set to the input similarity between point i and point k as its exemplar, minus the largest of the similarity and availability sum between point i and other candidate exemplars. The logic behind calculating how suitable a point is for an exemplar is that it is favored more if the initial apriori preference was higher, but the responsibility gets lower when there is a similar point that considers itself a good candidate, so there is a ‘competition’ between the two until one is decided in some iteration.

Calculating availabilities, then, uses calculated responsibilities as evidence whether each candidate would make a good exemplar. Availability a(i, k) is set to the self-responsibility r(k, k) plus the sum of the positive responsibilities that candidate exemplar k receives from other points.

Finally, we can have different stopping criteria to terminate the procedure, such as when changes in values fall below some threshold, or the maximum number of iterations is reached. At any point through Affinity Propagation procedure, summing Responsibility (r) and Availability (a) matrices gives us the clustering information we need: for point i, the k with maximum r(i, k) + a(i, k) represents point i’s exemplar. Or, if we just need the set of exemplars, we can scan the main diagonal. If r(i, i) + a(i, i) > 0, point i is an exemplar.

There are two major hyper parameters that are required to make the Affinity Propagation effective.

Each new iteration of the Responsibility and Availability leads to recalculation of the clusters. Because of this, it is possible that the system gets into an oscillation without any convergence. In order to avoid this problem, the algorithm introduces Damping. This forces inertia on the changes by allocating some weight to the current position of the data point.

The measure of affinity is perhaps the most important component in this exercise of clustering. Generally the euclidean affinity is used. But it is possible to push in any other measure of affinity - so long as it is consistent.

Mean shift clustering uses sliding-window to find dense areas in the data points. It is a centroid-based algorithm. The goal of the algorithm is to locate the centre points of each group/class, which works by updating candidates for centre points to be the mean of the points within the sliding-window. These candidate windows are then filtered in a post-processing stage to eliminate near-duplicates, forming the final set of centre points and their corresponding groups. The assumption is that the population is dense at the centre of each cluster.

Mean-shift algorithm starts by selecting a random point in the data space. Then, check the data in a fixed radius around the point, to identify the direction where the density increases most. The point continues to shift in the data space in order to move in the direction where the density increases most. This continues till you reach a point where the density decreases on every side. This is the centre of the group. This process is done with many sliding windows until all points lie within a window. When multiple windows overlap the merge and the maximum is preserved. The data points are then clustered according to the sliding window in which they reside.

This is different from the K-means algorithm because we do not select the number of clusters. The mean shift algorithm identifies this for us. That adds a lot of value. Ofcourse it leaves the room for configuration (hence room for doubt) because we have to select the radius of the sliding window. The radius in turn impacts the count of clusters identified.

This limitation of the algorithm is that it identifies the cluster based on the density of points and if an entire cluster is not dense enough, it is pushed into another dense cluster. Thus, it may miss some clusters. But the algorithm offers a very good computational performance.