K-Nearest Neighbors (KNN) Algorithm
A Brief Introduction
In this blog, we’ll talk about one of the most widely used machine learning algorithms for classification: the K-Nearest Neighbors (KNN) algorithm. K-Nearest Neighbors (K-NN) is a simple, easy-to-understand, versatile algorithm that finds applications in a variety of fields.
We’ll try to understand what KNN is, how it works, some common distance metrics used in KNN, its advantages and disadvantages, and some of its modern applications.
What is K-NN?
K-NN is a non-parametric, lazy learning algorithm. Non-parametric means there is no assumption about the underlying data distribution, i.e. the model structure is determined from the dataset itself.
It is called a lazy algorithm because it does not build an explicit model from the training data. All of the training data is instead used during the testing phase, which makes training fast but makes the testing phase slower and costlier.
K-Nearest Neighbor (K-NN) is a simple algorithm that stores all the available cases and classifies the new data or case based on a similarity measure.
K-NN classification
In K-NN classification, the output is a class membership. An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.
To determine which of the K instances in the training dataset are most similar to a new input, a distance measure is used. For real-valued input variables, the most popular distance measure is the Euclidean distance.
The Euclidean distance
- The Euclidean distance is the most common distance metric used in low-dimensional data sets. It is also known as the L2 norm, and it is the usual manner in which distance is measured in the real world.
d(p, q) = √((p1 − q1)² + (p2 − q2)² + … + (pn − qn)²)
where p and q are n-dimensional vectors, denoted by p = (p1, p2, …, pn) and q = (q1, q2, …, qn), representing the n attribute values of two records.
- While Euclidean distance is useful in low dimensions, it does not work well in high dimensions or for categorical variables. A drawback of Euclidean distance is that it ignores the similarity between attributes; each attribute is treated as completely independent of all the other attributes. A minimal implementation is sketched below.
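As a concrete illustration, the Euclidean distance can be computed in a few lines of plain Python. This is only a minimal sketch; the function name euclidean_distance is an illustrative choice rather than part of any library:

```python
import math

def euclidean_distance(p, q):
    """Euclidean (L2) distance between two n-dimensional points p and q."""
    return math.sqrt(sum((p_i - q_i) ** 2 for p_i, q_i in zip(p, q)))

# Example: two 3-dimensional records
print(euclidean_distance((1, 2, 3), (4, 6, 3)))  # 5.0
```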
Other popular distance measures:
- Hamming Distance: Calculates the distance between binary vectors as the number of positions at which they differ.
- Manhattan Distance: Calculates the distance between real-valued vectors as the sum of their absolute differences. Also called City Block Distance.
- Minkowski Distance: A generalization of the Euclidean and Manhattan distances (see the sketch after this list).
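The other metrics listed above can be sketched in the same style. Again, these are minimal illustrative implementations, and the function names are not taken from any particular library:

```python
def hamming_distance(p, q):
    """Number of positions at which two binary vectors differ."""
    return sum(p_i != q_i for p_i, q_i in zip(p, q))

def manhattan_distance(p, q):
    """City Block (L1) distance: sum of absolute coordinate differences."""
    return sum(abs(p_i - q_i) for p_i, q_i in zip(p, q))

def minkowski_distance(p, q, r=2):
    """Minkowski distance of order r: r=1 gives Manhattan, r=2 gives Euclidean."""
    return sum(abs(p_i - q_i) ** r for p_i, q_i in zip(p, q)) ** (1 / r)
```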
The steps carried out by the K-NN algorithm are as follows:
- Divide the data into training and test data.
- Select a value of K.
- Determine which distance function is to be used.
- Choose a sample from the test data that needs to be classified and compute its distance to each of the n training samples.
- Sort the distances obtained and take the k-nearest data samples.
- Assign the test sample to the class with the majority vote among its k nearest neighbors, as sketched in the code below.
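Putting the steps together, a minimal K-NN classifier might look like the sketch below. It assumes numeric features and the Euclidean distance, and the function and variable names (knn_classify, training_data) are illustrative only:

```python
import math
from collections import Counter

def euclidean_distance(p, q):
    return math.sqrt(sum((p_i - q_i) ** 2 for p_i, q_i in zip(p, q)))

def knn_classify(test_point, training_data, k=3, distance=euclidean_distance):
    """Classify test_point by a majority vote of its k nearest training samples.

    training_data is a list of (features, label) pairs.
    """
    # Compute the distance from the test point to every training sample.
    distances = [(distance(test_point, features), label)
                 for features, label in training_data]
    # Sort the distances and keep the k nearest samples.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the labels of the k nearest neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Tiny usage example with two classes.
training_data = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
                 ((4.0, 4.2), "B"), ((3.8, 4.0), "B")]
print(knn_classify((1.1, 0.9), training_data, k=3))  # -> "A"
```

With k = 1 this reduces to assigning the class of the single nearest neighbor, as described earlier.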
Performance of the K-NN algorithm is influenced by three main factors:
- The distance function or distance metric used to determine the nearest neighbors.
- The decision rule used to derive a classification from the K-nearest neighbors.
- The number of neighbors used to classify the new example.
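These three factors correspond directly to the parameters exposed by off-the-shelf implementations. As an illustration only (scikit-learn and the Iris dataset are not part of this post, just a convenient stand-in), the number of neighbors could be chosen by cross-validation roughly like this:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try several values of k and keep the one with the best cross-validated accuracy.
scores = {}
for k in range(1, 16):
    model = KNeighborsClassifier(n_neighbors=k, metric="euclidean")
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```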
Advantages of K-NN:
- The K-NN algorithm is very easy to implement.
- Nearly optimal in the large sample limit.
- Uses local information, which can yield highly adaptive behavior.
- Lends itself very easily to parallel implementation.
Disadvantages of K-NN:
- Large storage requirements.
- Computationally intensive at prediction time.
- Highly susceptible to the curse of dimensionality.
The K-NN algorithm finds applications in:
- Finance — financial institutions use it to predict the credit ratings of customers.
- Healthcare — gene expression analysis.
- Political Science — classifying potential voters into two classes: will vote or won’t vote.
- Handwriting recognition.
- Image Recognition.
- Video Recognition.
- Pattern Recognition.