k-NN Algorithm is a type of instance based learning method, commonly used in various machine learning algorithms. I have illustrated this concept using an example in MATLAB. Let’s see what Instance based learning is before moving to k-NN Algorithm.
Instance based learning
Instance based training methods store the training examples instead of forming an explicit description of target function using the training data provided. Whenever a new query instance comes, its relationship with the previous training data is evaluated to classify the new query instance. Nearest neighbor method and locally weighted regression method are some of the example of instance based learning methods. These methods assume that the instances can be represented as points in a Euclidean space. Instance based methods are also sometimes called lazy learning methods because the processing is delayed until a new instance needs to be classified.
One key advantage of these methods is that these approaches can create a different approximation to the target function for each distinct instance query. Some of the other advantages are
- Training is very fast
- Learn complex function easily
- Don’t lose information
One disadvantage is the high computation cost that might be required, since all computation takes place at classification time rather than when the training examples were obtained. This could be reduced by efficiently indexing training examples. Some of the other disadvantages are:
- Slow at query time
- Lots of storage
- Easily fooled by irrelevant attributes
k-NN is a type of instance based learning method where the function is approximated locally. The algorithm assumes all instances correspond to points in an n-dimensional space. The k-NN algorithm is among the simplest of all machine learning algorithms. It is also a non-parametric method. This means that we do not make any assumptions about the probability distribution of the underlying data and the variables being taken into account. Since fewer assumptions are made, the applicability is much wider, and due to reliance on fewer assumptions, more robust. Such methods could be of great help in dealing with real world, as in most situations the data does not follow a fixed theoretical assumption or trend.
k-NN for Classification:
In classification problems, we are given a query vector and we need to identify a particular class. That is, check particular class this query data would belong. Here, in the code shared below, I have taken
Query vector as: <Temp, Wind, Humidity>
Classification as Precipitation: <High, Low>
We are using the values of temperature (degree Celsius), Wind (kmph) and humidity (percentage) to predict if the chances of precipitation are high(>50%) or low(<50%).
Another term commonly used in k-NN algorithms is feature vector. Feature vector is a vector capable of defining a point in the coordinate space completely, which represents a past or future input.
Proceeding with the code:
k is the number of nearest neighbors we want to consider to evaluate a class/label. It would be advised to keep it as an odd number if the number of classes is even, so that we could have a majority.
k = 7; % Number of nearest neighbours we want to evaluate to find the result of current search query.
We next form a matrix with all the input. Most of the real world scenarios would involve continuous streaming of input data, based on time or some quanta. Hence we may like to make our input collection strategy compatible with an online system. Here, for brevity, I have saved the data into an excel sheet and extracting my data from there.
temp = xlsread('E:\Work\kNN Exmple\TemperatureSheet.xlsx','sheet1', 'a2:a24'); wind = xlsread('E:\Work\kNN Exmple\TemperatureSheet.xlsx','sheet1', 'b2:b24'); humidity = xlsread('E:\Work\kNN Exmple\TemperatureSheet.xlsx','sheet1', 'c2:c24'); precipitation = xlsread('E:\Work\kNN Exmple\TemperatureSheet.xlsx','sheet1', 'd2:d24');
Next, we proceed by calculating the distance between the query vector and training data. Here I have used the weather condition for the last 24 hours.
For distance, we can use any one of the below stated distance measures. I have used the Euclidean Distance. Using different distance measures and comparing the output could be a good practice, especially in regression models.
%Start(0) Calculating Euclidean Distance between the query points and %previous data. z1 = (query(1) - temp).^2; z2 = (query(2) - wind).^2; z3 = (query(3) - humidity).^2; euclideanDistance = z1 + z2 + z3; %Euclidean distance in square units %End (0)
Once we have got the distance measures between the query vector and the training data, we sort it in ascending order. This would give us a measure of which point lies closest to our test data.
Next step would be to find the class/label which is occurring the most number of times. On close examination, I realized that this is similar to finding the mode of the data. Hence, if we have discrete values in our data, we can use the mode utility of the MATLAB to arrive at the most frequent class. Here, I had coded High as 1 and Low as 0.
distance = [euclideanDistance precipitation]; % Appending the output(precipitation) vector to distance. sortedDistance = sortrows(distance); % sort the distance vector based on the proximity to search query. precipitation = mode(sortedDistance(1:k,2)); % Find the most frequently occuring class value( here, precipitation) if(precipitation == 1) disp('High') else disp('Low') end
Thus, this precipitation would be our answer to the query vector.
You can get the full code from:
Note: In many real time problems, it might be required to add this data to the set of training data. In such a scenario, we can append the query vector and the result at the end of our input sheet and refresh our training set.
where S is the covariance matrix.
Voronoi Diagram depicts region for the instance points such that if a query point x falls in that region, it would be closest to that point only. For a 1 Nearest Neighbor Algorithm, we can form this generalized diagram with all the test data points
In the figure below, if say a query point Y falls in the region A, then it can be classified as one with nearest characteristics with x1. The boundary denotes the nearest distance or the region closest to corresponding xn
Other Related Topics:
Weighted k-NN Algorithm, k-NN Regression, locally weighted regression, case-based reasoning, collaborative filtering
- Machine Learning by Tom Mitchell
- Coursera: Machine Learning by Pedro Domingos