Classification is an important topic in supervised learning. A large share of machine learning applications deals with discrete outputs, often binary ones, which call for classification rather than regression. The KNN (K-Nearest Neighbors) classifier is one of the many classification algorithms.
Like many machine learning algorithms, KNN is inspired by a tendency of the human mind: going along with the crowd. Conceptually, KNN just looks at the known points around the query point and predicts that its outcome is similar to theirs. More precisely, for any new point, it finds the K training points that are closest under the chosen distance metric. The outcome of each of those neighbors is then looked up in the training set, and the outcome of the new point is decided by majority vote among them. For example, if we look at the 5 nearest neighbors of a given test point and 3 of them are positive while 2 are negative, the outcome is predicted as positive, since that is the majority.
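To make the voting step concrete, here is a minimal from-scratch sketch in plain Python. It is not how scikit-learn implements KNN; the function name knn_predict and the toy points are made up for illustration, and Euclidean distance is assumed.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=5):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [math.dist(p, query) for p in train_points]
    # Indices of the k closest training points
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
    # Majority vote over the labels of those neighbors
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): three positive points close to the query,
# two negative points a little further away, and two negative points far away
train_points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.5, 1.5), (1.6, 1.4), (5.0, 5.0), (6.0, 5.5)]
train_labels = ["positive", "positive", "positive", "negative", "negative", "negative", "negative"]

prediction = knn_predict(train_points, train_labels, query=(1.1, 1.0), k=5)
print(prediction)  # positive: 3 of the 5 nearest neighbors are positive
```

With k=5 the two far-away negative points never get a vote, which is the whole idea: only the neighborhood of the query matters.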
In Python, we can do this very easily with scikit-learn. The first step is to import the necessary packages:
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
The scikit-learn package comes with several data sets built in. We can use the breast cancer data set for this example.
cancer = load_breast_cancer()
You can get a glimpse of what goes into this data set:
>>> cancer.data.shape
(569, 30)
It has 569 records with 30 features per record. That is too small to train a model for real-world use, but good enough for learning. These are the features for each record in the data set (available as cancer.feature_names):
['mean radius' 'mean texture' 'mean perimeter' 'mean area' 'mean smoothness' 'mean compactness' 'mean concavity' 'mean concave points' 'mean symmetry' 'mean fractal dimension' 'radius error' 'texture error' 'perimeter error' 'area error' 'smoothness error' 'compactness error' 'concavity error' 'concave points error' 'symmetry error' 'fractal dimension error' 'worst radius' 'worst texture' 'worst perimeter' 'worst area' 'worst smoothness' 'worst compactness' 'worst concavity' 'worst concave points' 'worst symmetry' 'worst fractal dimension']
It has only two target values for classification: a record is either malignant or benign.
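The two classes can be inspected directly on the loaded data set. The sketch below assumes scikit-learn and NumPy are installed; target_names and target are attributes of the object returned by load_breast_cancer().

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# Class names, in the order of the integer labels 0 and 1
print(cancer.target_names)   # ['malignant' 'benign']

# Number of records per class (index 0 = malignant, index 1 = benign)
class_counts = np.bincount(cancer.target)
print(class_counts)          # [212 357]
```

The classes are somewhat imbalanced, which is worth keeping in mind when reading an accuracy score later on.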
Now we can use the KNN classifier provided by scikit-learn to process the data. Start by splitting the data into training and test sets.
X_train, X_test, Y_train, Y_test = train_test_split(cancer.data, cancer.target, random_state=42)
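It can help to check how the records were divided. The sketch below repeats the split and prints the resulting shapes; it relies on the default behavior of train_test_split, which holds out 25% of the records when neither test_size nor train_size is given.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, Y_train, Y_test = train_test_split(
    cancer.data, cancer.target, random_state=42)

# With the default split, 25% of the 569 records (rounded up) go to the test set
print(X_train.shape)  # (426, 30)
print(X_test.shape)   # (143, 30)
```

Fixing random_state makes the shuffle reproducible, so repeated runs give the same split.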
Next, instantiate an object of the KNN Classifier. For now, just let it pick up the default values.
knn = KNeighborsClassifier()
Next, call the fit() method to train the model on the training set. For KNN, this mostly means storing the training data so that neighbors can be looked up later.
>>> knn.fit(X_train, Y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=1, n_neighbors=5, p=2, weights='uniform')
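The parameter listing above shows metric='minkowski' with p=2. The Minkowski distance with p=2 is the ordinary Euclidean distance, and with p=1 it is the Manhattan distance. A small plain-Python sketch of the formula follows; the helper name minkowski_distance is made up for illustration and is not a scikit-learn function.

```python
def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p between two points of equal length."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

a, b = (0.0, 0.0), (3.0, 4.0)

euclidean = minkowski_distance(a, b, p=2)  # sqrt(3^2 + 4^2) = 5.0
manhattan = minkowski_distance(a, b, p=1)  # |3| + |4| = 7.0
print(euclidean, manhattan)
```

The default settings therefore mean: look at the 5 nearest points by straight-line distance and give each of them an equal vote (weights='uniform').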
You can now evaluate the model using the scores for the training and the test sets.
>>> knn.score(X_train, Y_train)
0.93427230046948362
>>> knn.score(X_test, Y_test)
0.965034965034965
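For a classifier, score() reports the mean accuracy, that is, the fraction of records whose predicted label matches the true label. A minimal sketch of that calculation, using a made-up helper named accuracy and made-up labels:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical labels, only to illustrate the arithmetic
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 1]

acc = accuracy(y_true, y_pred)
print(acc)  # 0.75, since 6 of the 8 predictions are correct
```

Read this way, the test score above corresponds to 138 correct predictions out of 143 test records, assuming the default 25% test split.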
Note that the score for the training set is close to the score for the test set, and the test score is in fact slightly higher. This suggests that the model is not overfitting. Whether an accuracy of roughly 0.93 to 0.97 is good enough depends on the requirements of the application. We can try to improve on it by tuning some of the hyperparameters, such as n_neighbors.
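One simple way to explore a hyperparameter is to refit the model for several values of n_neighbors and compare the scores. The sketch below is only an illustration: the range 1 to 10 is an arbitrary choice, and in practice the comparison is better done with a separate validation set or cross-validation, so that the test set is not used for tuning.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

cancer = load_breast_cancer()
X_train, X_test, Y_train, Y_test = train_test_split(
    cancer.data, cancer.target, random_state=42)

# Fit one model per value of K and record (training score, test score)
scores = {}
for k in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, Y_train)
    scores[k] = (knn.score(X_train, Y_train), knn.score(X_test, Y_test))

for k, (train_score, test_score) in scores.items():
    print(f"k={k:2d}  train={train_score:.3f}  test={test_score:.3f}")
```

A large gap between the training and test columns for small K would be a sign of overfitting, while very large K tends to smooth the decision boundary too much.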
The KNN classifier is a good tool for classification when the number of records and the number of features are manageable; otherwise the computation can get very expensive, because every prediction involves distances to the training points. Its accuracy rests on the assumption that similar points are geometrically close to each other, which may not always be the case. Consider, for example, a data set in the form of two concentric circles, with the inner circle positive and the outer circle negative. In such a case, defining a custom distance metric may help to an extent. But the cost of computation increases very rapidly with the complexity of the distance metric. The cost also increases rapidly with the number of features.
But it is a very elegant and intuitive way of classifying when the data is good.
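The concentric-circles case can be sketched in a few lines of plain Python. Everything here is made up for illustration: the sparse sample points, the helper names, the choice of K=1, and the "radial" metric that compares how far two points are from the origin. With densely sampled rings, plain Euclidean KNN would usually do fine; the example only shows how a metric that matches the structure of the data can change the outcome.

```python
import math

def nearest_label(points, labels, query, distance):
    """Label of the single nearest training point (KNN with K=1)."""
    best = min(range(len(points)), key=lambda i: distance(points[i], query))
    return labels[best]

def on_circle(radius, degrees):
    """Point on a circle around the origin at the given angle."""
    angle = math.radians(degrees)
    return (radius * math.cos(angle), radius * math.sin(angle))

def radial_distance(a, b):
    """Hypothetical custom metric: difference in distance from the origin."""
    return abs(math.hypot(*a) - math.hypot(*b))

# Sparse samples: inner ring (radius 1.0) is positive, outer ring (radius 1.5) is negative
inner = [on_circle(1.0, d) for d in (0, 90, 180, 270)]
outer = [on_circle(1.5, d) for d in (45, 135, 225, 315)]
points = inner + outer
labels = ["positive"] * 4 + ["negative"] * 4

# The query lies on the inner ring, so its true label is positive
query = on_circle(1.0, 45)

euclidean_guess = nearest_label(points, labels, query, math.dist)
radial_guess = nearest_label(points, labels, query, radial_distance)
print(euclidean_guess)  # negative: the closest sample sits on the outer ring
print(radial_guess)     # positive: every inner-ring sample is at radial distance ~0
```

scikit-learn's KNeighborsClassifier also accepts a callable as its metric parameter, though a Python-level metric is typically much slower than the built-in ones, which is the cost trade-off described above.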