Affinity Propagation with Scikit-Learn


This is one of the most important clustering algorithms, and also one of the more elaborate. Affinity Propagation takes the similarities between pairs of data points as input and considers every data point a potential exemplar. It works by exchanging real-valued messages between data points until a high-quality set of exemplars, and the corresponding clusters, gradually emerges. This is not as complicated as it sounds. Let's look at it in detail. The algorithm requires us to provide two sets of data:

  • Similarities between data points, representing how well-suited a point is to be another point’s exemplar. If two points cannot belong to the same cluster, their similarity can be omitted or set to -Infinity, depending on the implementation.
  • Preferences, representing each data point’s suitability to be an exemplar. If we have prior information about which points are likely candidates for this role, we can express it through the preferences (see the sketch after this list).
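
To make these two inputs concrete, here is a minimal sketch of how they map onto Scikit-Learn's AffinityPropagation API. The similarity matrix S and the preference value below are made-up illustrative numbers, not data from the example that follows.

import numpy as np
from sklearn.cluster import AffinityPropagation

# A hand-crafted pairwise similarity matrix (higher means more similar).
# Pairs that must never share a cluster could be set to -np.inf instead.
S = np.array([[ 0.0, -1.0, -6.0],
              [-1.0,  0.0, -5.0],
              [-6.0, -5.0,  0.0]])

# affinity='precomputed' says S already holds similarities; preference
# controls how eager each point is to become an exemplar.
ap = AffinityPropagation(affinity='precomputed', preference=-3.0, random_state=0)
labels = ap.fit_predict(S)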

Implementation


Let us now check out how this can be implemented in Python using Scikit-Learn. To generate the training data, Scikit-Learn provides us with the method make_blobs.

We can start by importing the required modules:

from sklearn.datasets import make_blobs
from sklearn.cluster import AffinityPropagation

Now we can generate the required data. Note that since this is a clustering application, we do not really need the target output. But we generate it anyway, just to make sure our model produces good clusters.
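
A call along the following lines produces data matching the shapes we inspect next; the number of centers and the random seed here are arbitrary choices.

X, y = make_blobs(n_samples=500, n_features=5, centers=3, random_state=1)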

We can have a look at the data we got:

>>> X.shape
(500, 5)
>>> y.shape
(500,)
>>> X[:2,]
array([[ 0.7927672 ,  4.69609538,  2.29690233,  1.00159957, -2.3564752 ],
       [ 5.60928139, -0.0376682 , -0.68774591,  9.60509046, -8.52337837]])
>>> y[:2,]
array([0, 2])

Now, create an instance of AffinityPropagation and fit the data:

ap = AffinityPropagation()
ap.fit(X)
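
Once fitted, the model exposes the exemplars it settled on through its cluster_centers_indices_ and labels_ attributes, which gives us a quick sanity check on the number of clusters found:

print(ap.cluster_centers_indices_)        # indices of the points chosen as exemplars
print(ap.labels_)                         # cluster label assigned to each point
print(len(ap.cluster_centers_indices_))   # number of clusters found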

Now we need to check how good this clustering is. The values in y are the cluster labels assigned by make_blobs. They may not match the labels generated by Affinity Propagation exactly. For example, cluster 1 in the original data could be cluster 2 here. That is not important. What matters is that the clusterings are similar: if two elements were in the same cluster in the first set, they should be in the same cluster in the second set as well. To verify this, we can write a small script:

y1 = ap.predict(X)
mismatch = 0
# Count every pair that shares a cluster in y but is split apart in y1.
for i in range(y1.shape[0]):
    for j in range(y1.shape[0]):
        if (y[i] == y[j]) and (y1[i] != y1[j]):
            mismatch = mismatch + 1

print(mismatch)

The output is 0! That means there is not a single mismatch: Affinity Propagation is very effective. But it is also very expensive, since the message passing scales quadratically with the number of samples in both time and memory. Everything comes at a cost!
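
Incidentally, Scikit-Learn ships a ready-made, permutation-invariant score for exactly this kind of comparison, so the double loop above can be replaced with a one-liner:

from sklearn.metrics import adjusted_rand_score

# 1.0 means the two labelings agree perfectly, up to a renaming of clusters.
print(adjusted_rand_score(y, ap.predict(X)))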

There are several hyperparameters that determine how effective Affinity Propagation is.

Damping


Each new iteration of the responsibility and availability messages leads to a recalculation of the clusters. Because of this, it is possible for the system to fall into an oscillation and never converge. To avoid this problem, the algorithm introduces damping, which forces inertia on the updates by giving some weight to the previous value of each message.
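
In Scikit-Learn this is exposed as the damping parameter, a factor between 0.5 and 1 that weights the previous value of each message; higher values give slower but more stable updates. As a sketch, the values below are illustrative choices, not recommendations:

# Heavier damping (0.9) with a larger iteration budget for stubborn data.
ap = AffinityPropagation(damping=0.9, max_iter=500)
ap.fit(X)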

Affinity


The measure of affinity is perhaps the most important component in this clustering exercise. By default, Scikit-Learn uses the Euclidean affinity (the negative squared Euclidean distance between points). But it is possible to plug in any other measure of affinity, so long as it is consistent.
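
For example, to cluster on cosine similarity instead, we can precompute the similarity matrix ourselves and pass affinity='precomputed'. Cosine similarity here is our own choice of measure, not anything prescribed by the algorithm:

from sklearn.metrics.pairwise import cosine_similarity

# Build the pairwise similarity matrix ourselves, then let
# Affinity Propagation cluster directly on it.
S = cosine_similarity(X)
ap = AffinityPropagation(affinity='precomputed')
labels = ap.fit_predict(S)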