
Overfitting and Underfitting (Bugs in ML Models)

 


Poor performance in machine learning (ML) models is largely due to overfitting and underfitting. As we saw previously, generalization is what every model should aim for, but overfitting and underfitting come along with it, so we have to be careful that the model does neither. How well a model generalizes to new data is a vital factor in learning the objective function from the training data. Generalization matters because the data we receive is only a sample: it is incomplete and noisy.

Overfitting in Machine Learning


Overfitting is a bug in ML models in which the model fits the training data set too closely.

Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively influences the model’s performance on brand-new data. In other words, the noise in the training data set is picked up and learned as concepts by the model. The dilemma is that these concepts do not apply to the test data set and hurt the model’s ability to generalize.
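As a minimal sketch of this train/test gap, consider fitting a very flexible model to a small, noisy sample. Everything here is an illustrative assumption (the sine target, the noise level, and the degree-15 polynomial), not part of the original post:

```python
# A high-degree polynomial "memorizes" the noise in a small training set,
# so training error looks great while test error is poor.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.uniform(0, 1, (30, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 30)  # noisy samples

X_train, y_train = X[:20], y[:20]
X_test, y_test = X[20:], y[20:]

# Degree-15 polynomial: flexible enough to chase the noise.
model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
model.fit(X_train, y_train)

print("train MSE:", mean_squared_error(y_train, model.predict(X_train)))
print("test MSE:", mean_squared_error(y_test, model.predict(X_test)))
# Typically: tiny train MSE, much larger test MSE -> overfitting.
```

The large gap between the two errors is the signature of noise being learned as if it were signal.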

Overfitting is more likely with nonparametric and nonlinear models that have more flexibility when learning a target function. Many models can run into this problem, e.g. random forest.

For example, decision trees are a nonparametric machine learning algorithm that is very flexible and prone to overfitting the training data. This can be addressed by pruning a tree after it has been learned, in order to remove some of the detail it has picked up. This problem shows up frequently in decision trees.
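Here is a minimal sketch of pruning in practice, using scikit-learn’s cost-complexity pruning (`ccp_alpha`, available in scikit-learn 0.22+). The breast-cancer dataset and the alpha value are illustrative assumptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fully grown tree: fits every detail of the training set.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pruned tree: cost-complexity pruning removes detail learned from noise.
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

print("full tree   train/test:", full.score(X_train, y_train), full.score(X_test, y_test))
print("pruned tree train/test:", pruned.score(X_train, y_train), pruned.score(X_test, y_test))
```

The unpruned tree usually scores perfectly on the training data but worse on the test data; the pruned tree trades a little training accuracy for better generalization.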

Underfitting in Machine Learning


Underfitting is the opposite bug in ML: the model can neither generalize to new data nor fit the training data set.

An underfit machine learning model is clearly not a suitable model: it will have poor performance on the training data and will perform just as badly on the test data set.

Underfitting is often not discussed, because it is easy to detect given a good performance metric. The remedy is to move on and try alternate machine learning algorithms. Nevertheless, it provides a good contrast to the problem of overfitting. So my only suggestion for you is to try a new model, and that will help most of the time (see the sketch below).
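As a sketch of how easy underfitting is to spot, here a straight line is fit to clearly nonlinear data: it scores poorly even on its own training set, and swapping in a more flexible model fixes it. The sine target and the random forest are illustrative choices:

```python
# A linear model cannot fit a nonlinear target, so its score is low even
# on the training data itself -- the telltale sign of underfitting.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, (200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)  # nonlinear target

linear = LinearRegression().fit(X, y)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# R^2 on the training data: a low score here signals underfitting.
print("linear train R^2:", linear.score(X, y))   # poor -> underfit
print("forest train R^2:", forest.score(X, y))   # much better fit
```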

Now that we have seen the definitions of both underfitting and overfitting, we can say that underfitting is not that big a threat, since it is solvable by simply trying a new model. Overfitting, however, is a major bug in machine learning models, so to resolve it here are a few tips from my side to yours.

Overfitting: Solutions for This Bug


Overfitting is such a problem because evaluating machine learning algorithms on the training data is different from evaluating them on the data we actually want the model to perform well on.

There are two powerful procedures that one can use when evaluating machine learning algorithms to smash overfitting:

  1. Resampling technique.
  2. Validation dataset.

The most popular resampling technique is k-fold cross-validation. It allows you to train and test your model k times on different subsets of the training data and build up an estimate of how a machine learning model performs on unseen data.
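A minimal sketch of k-fold cross-validation with scikit-learn’s `cross_val_score`; the iris dataset, the decision tree, and k = 5 are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# Each of the 5 folds is held out once while the model trains on the
# other 4, giving 5 estimates of performance on unseen data.
scores = cross_val_score(model, X, y, cv=5)
print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean(), "+/-", scores.std())
```

A model that overfits will show a noticeably lower cross-validated score than its score on the full training data.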

A validation dataset is simply a subset of your training data that you hold back from your machine learning algorithms until the very end of your project. After you have selected and tuned your machine learning algorithms on your training dataset you can evaluate the learned models on the validation dataset to get a final objective idea of how the models might perform on unseen data.
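A minimal sketch of holding back a validation set: split off the validation data first, tune on the remaining train/test splits, and only touch the held-back split at the very end. The split sizes and the model are illustrative choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Carve off the validation set first, then split the rest into train/test.
X_rest, X_val, y_rest, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# Tune and select using only the train/test splits.
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print("test accuracy (used while tuning):", model.score(X_test, y_test))

# Only after model selection is done do we look at the held-back split.
print("validation accuracy (final check):", model.score(X_val, y_val))
```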

I hope this helps you understand the biggest bug in machine learning models and also helps you smash it. #bug_smash
