*A saga of details, important yet usually forgotten.*

If you have finished your first MOOC on Machine Learning (most likely Andrew Ng's course on Coursera) and want to explore the field further, or if you have been watching the buzz around Data Science and ML and want to check whether this field is for you, you are in the right place (virtually) to learn about a few ingredients of the recipe that will help you shine bright with ML.

Dividing ML into its components, the three you should be concerned about right now are:

- Data, because, well it’s “data” science,
- The Math that goes behind the model,
- Programming.

In this article, most of the focus will be on data, so let us cover the other two first.

**The Math that goes behind the model**

Knowing how an algorithm works is not necessary the first time you run model.fit() and model.predict(), but to leverage the actual power of an algorithm, you need to understand how it works. For example, Logistic Regression is a type of generalized linear model: a linear regression model, which outputs continuous values, can easily be modified to give categories as the result. Accuracy measures, parameter values: everything is a mathematical function in some form or the other.
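As a minimal sketch of that idea (with made-up toy numbers, not a trained model), the continuous output of a linear model can be squashed through a sigmoid into a probability and then thresholded into a category:

```python
import numpy as np

# Logistic regression = linear regression whose continuous output is
# passed through a sigmoid to become a class probability.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical learned parameters of the underlying linear model
w, b = np.array([0.8, -0.4]), 0.1

x = np.array([2.0, 1.0])         # one example with two features
linear_output = w @ x + b        # what plain linear regression returns
probability = sigmoid(linear_output)
label = int(probability >= 0.5)  # threshold into a category

print(linear_output, round(probability, 3), label)
```

The same mathematical machinery produces both the continuous and the categorical answer; only the final transformation differs.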

*Fig. The torch-bearer of model.fit() and model.predict()*

So the next question that arises is, "How much Math do I need to know?" The answer somewhat depends on your goals. If you want to be a researcher who improves on the present state-of-the-art models, you obviously need to know every bit of the Math that goes into your Machine Learning model, because that is how you improve upon what already exists: by finding new mathematical functions that work on the data to produce better results. If you just want predictions good enough for your objective, knowing just enough Math to sail through the course of the project will do.

**Programming**

The programming of an ML model starts after you have figured out the problem you wish to solve and its type (classification, regression, clustering, ranking, etc.), done the necessary preprocessing, and decided which algorithm to start with. It is important to find a good starting point: read about any related previous work and implement it. Once you have a baseline accuracy, try different algorithms or build upon the same one. This helps in understanding what is going on and what you need to do next to achieve your purpose.
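The baseline idea above can be sketched with scikit-learn (assuming it is installed; the bundled iris toy dataset stands in for a real project's data): fit a trivial baseline first, then check that a real model actually beats it.

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

# Baseline: always predict the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# A first real model to build upon.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("baseline accuracy:", baseline.score(X_test, y_test))
print("model accuracy:   ", model.score(X_test, y_test))
```

Any algorithm you try next only earns its keep if it improves on this baseline number.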

Document as much as possible, because there is a high chance you will not remember the logic behind your workflow six months from now. And following good coding practices always helps, whether you are the coder or the one reading the code.

**Data**

This is the part that heavily decides the success of an ML model, so spending the maximum time here makes the most sense.

A prerequisite of making useful models is the knowledge of the terminology used in Machine Learning.

- A *label* is the thing we're predicting: the y variable in simple linear regression.
- A *feature* is an input variable: the x variable in simple linear regression.
- An *example* is a particular instance of data, x.

This always helps in better reports and documentation of the model, and also in error correction. You can check out the Machine Learning Glossary by Google.

Now that you are good to go, the first thing in the workflow of an ML project is the problem statement. Focus on problems that would be difficult to solve with traditional programming; it is a great practice to ask whether the problem you are trying to solve actually requires Machine Learning. If you are starting out with a Kaggle competition, the objective is already stated. But if you want to work on a real-world dataset, make the most basic plots first, try to find patterns and relationships between variables, and based on those inferences, frame a problem statement for yourself. This is how statistical projects are mostly approached.

Sometimes a client may require a specific algorithm for a problem. Say the data at hand is a time series of the daily mean temperature of a city over a certain period, and the problem statement is to forecast the mean temperature for the next 6 days using supervised learning.

*Fig. Sample data*

Here, all that is given is a series of values, which need to be somehow converted into features and labels. The temperature on a certain day is not entirely different from that of the previous few days: if the temperature today is 20 °C, it is unlikely to be 10 °C or 15 °C tomorrow; it should lie somewhere close to 20 °C, barring extraordinary conditions. This idea can be used to do the job.

Using a sliding window method, the previous 2 temperatures can be used as features for the current temperature (the label). The number of features can also be increased, and any supervised algorithm can then be applied to the problem.
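The sliding-window construction can be sketched in a few lines (the temperature values here are made up for illustration):

```python
import numpy as np

# Turn a plain series of daily mean temperatures into a supervised
# dataset: the previous `window` values become the features, and the
# current value becomes the label.

def sliding_window(series, window=2):
    X, y = [], []
    for i in range(window, len(series)):
        X.append(series[i - window:i])  # previous `window` temperatures
        y.append(series[i])             # temperature to predict
    return np.array(X), np.array(y)

temps = [20.1, 19.8, 20.5, 21.0, 20.7, 19.9]
X, y = sliding_window(temps, window=2)
print(X)  # first row: [20.1, 19.8]
print(y)  # first label: 20.5
```

Increasing `window` gives the model more context per example, at the cost of fewer usable examples from the same series.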

Defining a problem statement is basically figuring out what you want to do and how you want to do it. This article provides more clarity on the same.

Now that the problem to be solved is clear, the first step towards the solution is Exploratory Data Analysis, or EDA. In crude terms, EDA is getting an idea of the dataset: looking for discrepancies and missing values, finding trends and patterns, if any; building a general understanding so that you can answer any descriptive question about the dataset without having to look it up every single time.

Daniel Bourke’s article provides a great EDA checklist.
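A first pass over any tabular dataset usually starts with a handful of pandas calls; the tiny dataframe below is made up for illustration (a real dataset would be loaded with pd.read_csv instead):

```python
import pandas as pd

# A minimal first-pass EDA sketch on a toy dataframe.
df = pd.DataFrame({
    "temperature": [20.1, 19.8, None, 21.0],
    "city": ["A", "A", "B", "B"],
})

print(df.shape)           # size of the dataset
print(df.dtypes)          # datatype of each column
print(df.isnull().sum())  # missing values per column
print(df.describe())      # descriptive statistics
```

These few numbers already answer most descriptive questions and immediately surface the missing temperature value.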

In any ML project, visualization is key. Given that two variables x and y share the same descriptive statistics (means, variances, and correlation), how many vastly different datasets can you think of?

Behold, Anscombe's quartet!

Although the four datasets share the same descriptive measures, they are entirely different from each other. The common regression line, with an R² of about 0.67, will not give good predictions for the 2nd or the 4th dataset.
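You can verify the matching statistics yourself; the quartet's values are well known, and a few lines of NumPy confirm the claim:

```python
import numpy as np

# Anscombe's quartet: four datasets with (nearly) identical descriptive
# statistics but entirely different shapes when plotted.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
ys = [
    [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
]
xs = [x123, x123, x123, x4]

for i, (x, y) in enumerate(zip(xs, ys), start=1):
    r = np.corrcoef(x, y)[0, 1]
    print(f"dataset {i}: mean(x)={np.mean(x):.2f}, "
          f"mean(y)={np.mean(y):.2f}, corr={r:.2f}")
# every dataset prints mean(x)=9.00, mean(y)=7.50, corr=0.82
```

Only a plot reveals how different the four actually are, which is exactly the point of visualization in EDA.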

And in case you feel that EDA helps only with tabular datasets, you can read the observations of Andrej Karpathy on manually classifying the CIFAR10 dataset.

Another aspect of EDA that deserves a mention is the importance of the datatypes of the features in the dataset. If you scrape a website, the data you collect from the relevant tags is often in string format, which might go unnoticed and then cause problems while training your model. Moreover, not every algorithm can work on every data type. It is, hence, good practice to look at the data types and convert them to ones suitable for the ML model.

Sometimes the dataset is not large enough and hence may not suit some ML algorithms. It is not always necessary to use one specific algorithm unless explicitly specified; go with the one that suits your purpose. A decision tree on a simple classification task may provide better accuracy than a 5-layer neural network, and in less time.

One major component of an ML workflow is splitting the dataset into training, test and/or validation sets. If the examples are sequential or not randomized, the division needs to be done carefully, or bias will creep into the model. You also need to ensure class balance in classification tasks while splitting.
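Both concerns can be handled in one call with scikit-learn (assuming it is installed; the imbalanced labels below are toy values): shuffling guards against sequential examples, and `stratify` preserves the class ratio in both halves.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

X = [[i] for i in range(12)]
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]  # imbalanced toy labels (2:1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=True, stratify=y, random_state=0)

print(Counter(y_train))  # the 2:1 class ratio is preserved...
print(Counter(y_test))   # ...in both splits
```

Without `stratify`, a small test set can easily end up with no examples at all of the minority class.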

In 2015, a case came to light in which the object-recognition feature of the Google Photos app tagged two Black people as gorillas. This is a real-world scenario where a product failed because it was trained on a dataset with quite a high imbalance for some classes. As a developer, you never want to be in such a position, and a little extra care always helps.

A concept that has recently been getting much attention from researchers and data scientists building better models is feature engineering.

“Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data.”

- Jason Brownlee, in *Discover Feature Engineering, How to Engineer Features and How to Get Good at It*

In 2014, in a competition by DrivenData, participants had to build a machine learning algorithm that could automate the process of attaching labels to different purchase items, to understand how schools were spending money and tailor strategy recommendations to improve outcomes for students, teachers, and administrators.

If you think of an algorithm with the potential to win a competition, which ones come to mind? The model that won was a logistic regression, built on a lot of carefully created features. This shows the power of feature engineering, and is a reason why you should check it out. You can learn more about this challenge here.
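As a small illustration of the kind of work involved (the purchase descriptions below are hypothetical, not the actual competition data), simple derived features often help a linear model far more than the raw columns do:

```python
import numpy as np
import pandas as pd

# Toy purchase records: raw text, amounts, and dates rarely help a
# linear model directly, but simple engineered features often do.
df = pd.DataFrame({
    "description": ["PENCILS 24 PK", "bus fuel", "Teacher Salary"],
    "amount": [12.5, 300.0, 4200.0],
    "date": pd.to_datetime(["2014-01-03", "2014-06-15", "2014-09-01"]),
})

df["desc_len"] = df["description"].str.len()        # text length
df["desc_upper"] = df["description"].str.isupper()  # all-caps flag
df["log_amount"] = np.log(df["amount"])             # tame the skew
df["month"] = df["date"].dt.month                   # seasonality signal

print(df[["desc_len", "desc_upper", "log_amount", "month"]])
```

Each new column encodes a hypothesis about what distinguishes the classes; the model then only has to weigh them.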

Now that you have a problem statement and a clean, engineered dataset, you are good to go with your first ML model. Something that can never be stated enough is that this field leaves a lot of room for trying out new and different ideas. With data being generated in unbelievable quantities, there is immense scope to build new and useful applications, or to improve upon existing ones. Remember that in data science there is no single correct way of solving a problem; whatever works for your purpose, gives you the desired results, and helps you grow and learn in an honest manner is the right way.

P. S. A great resource for learning more on this topic can be found here.

“Things To Know Before You Make Your 1st Machine Learning Model” – Ankita Prakash