The K-nearest neighbor (KNN) algorithm is a type of supervised machine learning algorithm. KNN is extremely easy to implement in its most basic form, while also performing quite complex classification tasks. It is a lazy learning algorithm as there is no specialized training phase. Instead, it uses all the data for training when classifying a new data point or instance. KNN is a non-parametric learning algorithm, meaning it doesn’t assume anything about the underlying data. This is an extremely useful function because most of the real data does not meet theoretical assumptions, e.g. linear separation, homogeneous distribution, etc.
The intuition behind the KNN algorithm is one of the simplest of all supervised machine learning algorithms. It simply calculates the distance of the new data point to all other training data points. The distance can be of any type, such as Euclid or Manhattan etc. It then selects the closest data points K, where K can be any integer. Finally, it assigns the data point to the class to which most of the K data points belong
The k-nearest neighbor algorithm is based on the simple idea of predicting unknown values by matching them to the most similar known values.
Let’s say we have 3 different types of fruit. We know their name, color and size. And we get a new fruit to classify, determine the color and size, and put it in a set of other fruits. We choose the number K (here: 3), we look for the 3 closest neighbors of our new fruit (taking into account the color and size) and assign the same name to the fruit as most of the neighbors have.
The number of neighbors we use for the k nearest neighbors (k) can be any value less than the number of rows in our dataset. In practice, looking at several neighbors makes the algorithm work better because the less similar your neighbors are to your data, the worse the prediction will be.
KNN can be used for both classification and regression problems. However, it is more widely used in classification problems in the industry. To evaluate any technique, we look at 3 important aspects:
- Ease of interpretation of results
- Time of calculations
- The power of predictability
The measure of similarity depends on the type of data. For real-valued data, the Euclidean distance can be used. Other other types of data can be used, such as categorical or binary data, Hamming distance. For regression problems, you can return the mean of the predicted attribute. For classification, the most popular class may be returned.
Our data contain numerous animals, their attributes and are assigned to some class – mammal (1), bird (2), reptile (3), fish (4), amphibian (5), insect (6), invertebrate (7).
import pandas as pd data = pd.read_csv('D:\machinelearning\zoo.csv');data.head()
We will reduce the number of features for our model, then divide the data into the dependent variable (y) and independent (X), and then divide the entire set into training and test:
import numpy as np df = data[['hair', 'milk', 'aquatic', 'eggs', 'class_type']] X = np.array(df.iloc[:, 0:4]) y = np.array(df['class_type'])
from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
There is a ready made function in scikit-learn for KNN computation, but first let’s see how it’s done from scratch. We will create some predefined functions that calculate the Euclidean distance between the „new” observation and all data points in the training set, then select the closest K that got the majority of votes and assign a new point (observation) to the class.
def train(X_train, y_train): return
from collections import Counter def pred(X_train, y_train, x_test, k): distances =  targets =  for i in range(len(X_train)): distance = np.sqrt(np.sum(np.square(x_test - X_train[i, :]))) distances.append([distance, i]) distances = sorted(distances) for i in range(k): index = distances[i] targets.append(y_train[index]) return Counter(targets).most_common(1)
def KNN(X_train, y_train, X_test, predictions, k): train(X_train, y_train) for i in range(len(X_test)): predictions.append(pred(X_train, y_train, X_test[i, :], k))
from sklearn.metrics import accuracy_score predictions =  KNN(X_train, y_train, X_test, predictions, 3) print(accuracy_score(y_test, predictions))
Our model does not look good, now in a few lines we will get a similar result (not necessarily the same – random) with a ready function for KNN from scikit-learn, it looks like this – at the beginning our k = 3:
from sklearn.neighbors import KNeighborsClassifier knn = KNeighborsClassifier(n_neighbors=3)
We will adjust our model to the data:
knn.fit(X_train, y_train) pred = knn.predict(X_test) print (accuracy_score(y_test, pred))
The result is not the best, let’s see if we can improve it by choosing the optimal number of k.
from sklearn.model_selection import cross_val_score list_a = list(range(1,15)) cv_scores =  for k in list_a: knn = KNeighborsClassifier(n_neighbors=k) scores = cross_val_score(knn, X_train, y_train, cv=5, scoring='accuracy') cv_scores.append(scores.mean())
We can visualize the optimal value of k taking into account the classification error:
import matplotlib.pyplot as plt MSE = [1 - x for x in cv_scores] optimal_k = list_a[MSE.index(min(MSE))] print ("Optimal k - %d" % optimal_k) plt.plot(list_a, MSE) plt.xlabel('value K') plt.ylabel('Error') plt.show()
Optimal k - 1
We get the lowest error when k = 12, so we use this value in our model:
knn = KNeighborsClassifier(n_neighbors=12) knn.fit(X_train, y_train) pred = knn.predict(X_test) print (accuracy_score(y_test, pred))
Choosing the optimal k (12), the result increased to the accuracy of 50%, which is better than k = 3, but as a predictive model it is rather poor.
KNN may suffer from skewed class schedules. For example, if a certain class is very common in a training set, that class will dominate the majority vote in the new example (large number = more common). Finally, KNN accuracy can be severely degraded by large dimension data as there is not much difference between the nearest and farthest neighbor.