
How to be a HERO in Machine Learning/Data Science Competitions

These days, one of the best ways to master machine learning models is to participate in the competitions hosted on various platforms. So how can somebody who is new to ML go from zero to hero? This article lays out a guideline.


The idea is not too hard; it just requires patience and some hard work. I will use as an example a competition I recently finished in the top 10. A competition generally gives you a problem in which the meaning of some features is hidden, because the hosts want you to explore the data and come up with the features that explain the target value. By exploring, I mean a few things (a short Python sketch follows the list):

  1. Look at the data and get a sense of it.
  2. Find the correlation of each feature with the target value.
  3. Try new features made up of existing features.
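
For example, here is a minimal exploration sketch in Python with pandas. The file name train.csv, the column target, and the derived feature are only placeholders for illustration, not the competition's actual data:

import pandas as pd

# Load the competition data (file and column names are placeholders).
train = pd.read_csv("train.csv")

# 1. Look at the data and get a sense of it.
print(train.head())
print(train.describe())

# 2. Correlation of every numeric feature with the target value.
corr_with_target = train.corr(numeric_only=True)["target"].sort_values(ascending=False)
print(corr_with_target)

# 3. Try a new feature made up of existing ones (hypothetical columns).
train["rooms_per_area"] = train["rooms"] / train["area"]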

Exploration also requires some cleaning of the data, because in general the host adds noise to the data, which makes it harder to achieve good accuracy. By cleaning, I mean (a sketch follows the list):

  1. Dealing with NaN values.
  2. Finding and removing outliers from the training data.
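
A minimal cleaning sketch in the same spirit, again with placeholder file and column names:

import pandas as pd

train = pd.read_csv("train.csv")  # placeholder file name

# 1. Deal with NaN values: fill numeric columns with their median.
numeric_cols = train.select_dtypes(include="number").columns
train[numeric_cols] = train[numeric_cols].fillna(train[numeric_cols].median())

# 2. Find and remove outliers from training, e.g. rows more than
#    3 standard deviations away from the mean of one column.
col = "area"  # hypothetical column
mean, std = train[col].mean(), train[col].std()
train = train[(train[col] - mean).abs() <= 3 * std]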

One of the most essential steps is selecting the features for training. To select features, one should look at their correlation with the target and know how to generate new features that are highly impactful. These new features can be the mean of several existing features, the sum of some features, and so on. There are many ways to look at this; one is sketched below.
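
One possible way to do this in Python, assuming a numeric target column and hypothetical feature names:

import pandas as pd

train = pd.read_csv("train.csv")  # placeholder file name

# Keep features whose absolute correlation with the target passes a threshold.
corr = train.corr(numeric_only=True)["target"].abs()
selected = corr[corr > 0.1].index.drop("target")

# Generate new, potentially impactful features: the mean of a group of
# related columns, or the sum of two columns (names are hypothetical).
train["sensor_mean"] = train[["sensor_1", "sensor_2", "sensor_3"]].mean(axis=1)
train["total_area"] = train["area"] + train["garden_area"]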

Last but not least is selecting the model. Selecting the model does not mean just picking one and training it; the most important part is training it with good values for its hyperparameters. In my experience, even a small tweak of the parameters can noticeably improve accuracy. A short tuning sketch follows.
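
As one example of what tuning can look like, here is a simple grid search over a random forest with scikit-learn; the file name, target column, and grid values are assumptions for illustration, not a recipe:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

train = pd.read_csv("train.csv")           # placeholder file name
X = train.drop(columns=["target"])         # placeholder target column
y = train["target"]

# A small grid of hyperparameters; even modest tweaks like these
# can change the score noticeably.
param_grid = {
    "n_estimators": [200, 500],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring="neg_mean_absolute_error",
)
search.fit(X, y)
print(search.best_params_, search.best_score_)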

For beginners:
1- Start learning the concepts of math and statistics.
2- Learn a programming tool for data science, either R or Python.
3- Learn machine learning algorithms (part 1).

And here are the resources for the previous list:
Descriptive Statistics from Udacity: https://www.udacity.com/course/intro-to-descriptive-statistics--ud827
A good statistics book (Optional): http://onlinestatbook.com/2/index.html
Introduction to Probability course from edX: https://www.edx.org/course/introduction-probability-science-mitx-6-041x-2
Introduction to Probability book: https://www.stat.berkeley.edu/~aldous/134/grinstead.pdf
Intro to Inferential Statistics course from Udacity: https://www.udacity.com/course/intro-to-inferential-statistics--ud201

For professionals:
1- Learn the concepts of deep learning and build at least one project.
2- Learn data visualization.


Here is a glimpse of the result that I achieved after following these steps.

[Image: leaderboard result after following these steps]

I hope these steps help you. Give this a clap if it helped you in some way.

 
