
KNN (K-Nearest Neighbour) algorithm, the maths behind it, and how to find the best value for K

KNN is a powerful classifier and regressor. Yes, you read that right: we can do both regression and classification with this algorithm. For its implementation in Python, please visit this link.


What KNN is and how it works:

Let’s start by setting some definitions and notation. We will take x to denote a feature and y to denote the target.

KNN is a supervised learning algorithm. This means that we have a dataset of labelled training measurements (x, y) and want to find the relationship between x and y. Our goal is to learn a function h: X→Y so that, given an unseen observation x, h(x) can confidently predict the corresponding output y.

Working

First, we will talk about how the KNN classification algorithm works. In a classification problem, the K-nearest neighbor algorithm says that, for a given value of K, we find the K nearest neighbors of the unseen data point and then assign it the class that has the highest number of data points among those K neighbors.

For the distance metric, we will use the Euclidean distance.
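As a quick illustration, here is a minimal sketch of the Euclidean distance in plain Python. The function name `euclidean_distance` is my own choice for this example, and points are assumed to be equal-length sequences of numbers.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of features")
    # Square the per-feature differences, add them up, take the square root.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
```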


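Finally, the input x is assigned to the class with the largest probability, where the probability of a class is estimated as the fraction of the K nearest neighbors that belong to it.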
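The voting step can be sketched as follows. This is a small illustration rather than a tuned implementation: the function names and the toy data are made up for the example, and each class probability is estimated as the share of the K neighbors carrying that label.

```python
import math
from collections import Counter

def euclidean_distance(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(train_X, train_y, query, k):
    """Return (predicted label, estimated class probabilities) for one query point."""
    # Sort the training points by distance to the query and keep the k closest.
    nearest = sorted(
        zip(train_X, train_y),
        key=lambda pair: euclidean_distance(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in nearest)
    # Probability of a class = fraction of the k neighbors with that label.
    probabilities = {label: count / len(nearest) for label, count in votes.items()}
    predicted = max(probabilities, key=probabilities.get)
    return predicted, probabilities

# Hypothetical toy data: two small groups of 2-D points.
train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.0), (6.5, 8.0)]
train_y = ["A", "A", "A", "B", "B", "B"]

label, probs = knn_classify(train_X, train_y, (4.0, 4.0), k=3)
print(label, probs)
```

For the query (4.0, 4.0) with K = 3, the closest point belongs to class B but the next two belong to class A, so A wins the vote with an estimated probability of 2/3.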


For regression the technique is the same, except that instead of the neighbors' classes we take their target values, and we predict the target for the unseen data point by taking the mean (average) or any other suitable function of those values.
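Here is a minimal sketch of the regression variant. The `aggregate` parameter is my own addition to show that the mean can be swapped for another function, and the data is invented for the example.

```python
import math
import statistics

def euclidean_distance(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_regress(train_X, train_y, query, k, aggregate=statistics.mean):
    """Predict a numeric target by aggregating the targets of the k nearest points."""
    nearest = sorted(
        zip(train_X, train_y),
        key=lambda pair: euclidean_distance(pair[0], query),
    )[:k]
    neighbor_targets = [target for _, target in nearest]
    return aggregate(neighbor_targets)

# Hypothetical toy data: one feature per point, with one far-away point.
train_X = [(1.0,), (2.0,), (3.0,), (10.0,)]
train_y = [1.0, 2.0, 3.0, 10.0]

prediction = knn_regress(train_X, train_y, (2.2,), k=3)
print(prediction)  # 2.0
```

With K = 3 the three closest points have targets 2.0, 3.0 and 1.0, so the mean is 2.0 and the far-away point at 10.0 is ignored.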

Ideal Value for K

Now you are probably wondering how to decide the value of K and how it will affect your classifier. Well, as in most machine learning algorithms, the K in KNN is a hyperparameter that you, as a data scientist, must choose in order to get the most suitable fit for the data set.

When K is small, we are restricting the region of a given prediction and forcing our classifier to be “more blind” to the overall distribution. A small value of K provides the most flexible fit, which will have low bias but high variance. Graphically, our decision boundary will be more irregular. On the other hand, a higher K averages more voters in each prediction and hence is more resilient to outliers. Larger values of K will have smoother decision boundaries, which means lower variance but increased bias.
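One common way to pick K in practice is to try several candidates and keep the one that scores best on held-out data. The sketch below uses leave-one-out cross-validation, which is my assumption here and not something the post prescribes. The toy data is invented: two clean groups plus one point labelled B sitting inside the A group.

```python
import math
from collections import Counter

def euclidean_distance(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(train_X, train_y, query, k):
    nearest = sorted(
        zip(train_X, train_y),
        key=lambda pair: euclidean_distance(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def leave_one_out_accuracy(X, y, k):
    """Hold out each point in turn, predict it from the rest, and report the hit rate."""
    correct = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        if knn_classify(rest_X, rest_y, X[i], k) == y[i]:
            correct += 1
    return correct / len(X)

# Hypothetical data: class A group, class B group, and one "B" outlier inside A.
X = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0),
     (6.0, 6.0), (6.0, 7.0), (7.0, 6.0), (7.0, 7.0),
     (1.5, 1.5)]
y = ["A", "A", "A", "A", "B", "B", "B", "B", "B"]

# Odd values of K avoid tied votes when there are two classes.
scores = {k: leave_one_out_accuracy(X, y, k) for k in (1, 3, 5, 7)}
best_k = max(scores, key=scores.get)  # ties go to the smaller K listed first
print(scores, best_k)
```

On this toy data K = 1 is fooled by the single outlier, K = 7 is so large that the vote is swamped by the other group, and the middle values do best, which is the bias and variance trade-off described above in miniature.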


Improvements

  • A simple way to deal with skewed class distributions is to use weighted voting.
  • Changing the distance metric (e.g. Hamming distance for text classification).
  • Dimensionality reduction techniques like PCA can be applied before KNN to help make the distance metric more meaningful.
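To illustrate the weighted voting idea from the first bullet, here is a minimal sketch in which each neighbor's vote is weighted by the inverse of its distance. The inverse-distance weighting, the `eps` guard against division by zero, and the toy data are all assumptions made for this example; other weighting schemes are possible.

```python
import math
from collections import Counter, defaultdict

def euclidean_distance(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def k_nearest(train_X, train_y, query, k):
    """Return the k closest training points as (distance, label) pairs."""
    pairs = [(euclidean_distance(x, query), label) for x, label in zip(train_X, train_y)]
    return sorted(pairs, key=lambda pair: pair[0])[:k]

def majority_vote(train_X, train_y, query, k):
    votes = Counter(label for _, label in k_nearest(train_X, train_y, query, k))
    return votes.most_common(1)[0][0]

def weighted_vote(train_X, train_y, query, k, eps=1e-9):
    # Closer neighbors get a larger say: weight = 1 / (distance + eps).
    scores = defaultdict(float)
    for distance, label in k_nearest(train_X, train_y, query, k):
        scores[label] += 1.0 / (distance + eps)
    return max(scores, key=scores.get)

# Hypothetical skewed data: class A is common, class B has a single point.
train_X = [(0.1, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (-1.0, 0.0)]
train_y = ["B", "A", "A", "A", "A"]
query = (0.0, 0.0)

plain = majority_vote(train_X, train_y, query, k=3)
weighted = weighted_vote(train_X, train_y, query, k=3)
print(plain, weighted)
```

With K = 3 the plain vote goes to the majority class A (two votes to one), while the weighted vote goes to B because the single B point is ten times closer to the query than the A points.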

Thanks for reading my post, and I hope it benefits you in theory and in practice!

