k-NN Algorithm

k-NN Algorithm is a type of instance based learning method, commonly used in various machine learning algorithms. I have illustrated this concept using an example in MATLAB. Let’s see what Instance based learning is before moving to k-NN Algorithm.

Instance based learning

Instance based training methods store the training examples instead of forming an explicit description of target function using the training data provided. Whenever a new query instance comes, its relationship with the previous training data is evaluated to classify the new query instance. Nearest neighbor method and locally weighted regression method are some of the example of instance based learning methods. These methods assume that the instances can be represented as points in a Euclidean space. Instance based methods are also sometimes called lazy learning methods because the processing is delayed until a new instance needs to be classified.

One key advantage of these methods is that these approaches can create a different approximation to the target function for each distinct instance query. Some of the other advantages are

  1. Training is very fast
  2. Learn complex function easily
  3. Don’t lose information

One disadvantage is the high computation cost that might be required, since all computation takes place at classification time rather than when the training examples were obtained. This could be reduced by efficiently indexing training examples. Some of the other disadvantages are:

  1. Slow at query time
  2. Lots of storage
  3. Easily fooled by irrelevant attributes

k-NN Algorithm

k-NN is a type of instance based learning method where the function is approximated locally. The algorithm assumes all instances correspond to points in an n-dimensional space. The k-NN algorithm is among the simplest of all machine learning algorithms. It is also a non-parametric method. This means that we do not make any assumptions about the probability distribution of the underlying data and the variables being taken into account. Since fewer assumptions are made, the applicability is much wider, and due to reliance on fewer assumptions, more robust. Such methods could be of great help in dealing with real world, as in most situations the data does not follow a fixed theoretical assumption or trend.

k-NN for Classification:

In classification problems, we are given a query vector and we need to identify a particular class. That is, check particular class this query data would belong. Here, in the code shared below, I have taken

Query vector as: <Temp, Wind, Humidity>

Classification as Precipitation: <High, Low>

We are using the values of temperature (degree Celsius), Wind (kmph) and humidity (percentage) to predict if the chances of precipitation are high(>50%) or low(<50%).

Another term commonly used in k-NN algorithms is feature vector. Feature vector is a vector capable of defining a point in the coordinate space completely, which represents a past or future input.

Proceeding with the code:

k is the number of nearest neighbors we want to consider to evaluate a class/label. It would be advised to keep it as an odd number if the number of classes is even, so that we could have a majority.

 k = 7; % Number of nearest neighbours we want to evaluate to find the result of current search query. 

We next form a matrix with all the input. Most of the real world scenarios would involve continuous streaming of input data, based on time or some quanta. Hence we may like to make our input collection strategy compatible with an online system. Here, for brevity, I have saved the data into an excel sheet and extracting my data from there.

temp          = xlsread('E:\Work\kNN Exmple\TemperatureSheet.xlsx','sheet1', 'a2:a24');
wind          = xlsread('E:\Work\kNN Exmple\TemperatureSheet.xlsx','sheet1', 'b2:b24');
humidity      = xlsread('E:\Work\kNN Exmple\TemperatureSheet.xlsx','sheet1', 'c2:c24');
precipitation = xlsread('E:\Work\kNN Exmple\TemperatureSheet.xlsx','sheet1', 'd2:d24');

Next, we proceed by calculating the distance between the query vector and training data. Here I have used the weather condition for the last 24 hours.

For distance, we can use any one of the below stated distance measures. I have used the Euclidean Distance. Using different distance measures and comparing the output could be a good practice, especially in regression models.

%Start(0) Calculating Euclidean Distance between the query points and
%previous data.

z1 = (query(1) - temp).^2;
z2 = (query(2) - wind).^2;
z3 = (query(3) - humidity).^2;

euclideanDistance = z1 + z2 + z3; %Euclidean distance in square units

%End (0)

Once we have got the distance measures between the query vector and the training data, we sort it in ascending order. This would give us a measure of which point lies closest to our test data.

Next step would be to find the class/label which is occurring the most number of times. On close examination, I realized that this is similar to finding the mode of the data. Hence, if we have discrete values in our data, we can use the mode utility of the MATLAB to arrive at the most frequent class. Here, I had coded High as 1 and Low as 0.

distance       = [euclideanDistance precipitation]; % Appending the output(precipitation) vector to distance.
sortedDistance = sortrows(distance); % sort the distance vector based on the proximity to search query.

precipitation  = mode(sortedDistance(1:k,2)); % Find the most frequently occuring class value( here, precipitation)

if(precipitation == 1)

Thus, this precipitation would be our answer to the query vector.

You can get the full code from:


Note: In many real time problems, it might be required to add this data to the set of training data. In such a scenario, we can append the query vector and the result at the end of our input sheet and refresh our training set.

Distance Measures



Mahalanobis Distance:


where S is the covariance matrix.

Ln-norm distance:


Voronoi Diagram

Voronoi Diagram depicts region for the instance points such that if a query point x falls in that region, it would be closest to that point only. For a 1 Nearest Neighbor Algorithm, we can form this generalized diagram with all the test data points

In the figure below, if say a query point Y falls in the region A, then it can be classified as one with nearest characteristics with x1. The boundary denotes the nearest distance or the region closest to corresponding x



Other Related Topics:

Weighted k-NN Algorithm, k-NN Regression, locally weighted regression, case-based reasoning, collaborative filtering




Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s