How to Start a Machine Learning Project


Jul 17, 2020 · 7 minute read

In a previous article, we discussed a lot about what machine learning is, how machine learning works, and examples of the implementation of machine learning. If you are new to machine learning you might ask, how do you start a machine learning project?

Many people experience this, including me. I read a lot of books but was still confused about where to start. I found that the best way to learn is by designing and completing small projects, because to become big we have to start small. The process can, of course, differ from person to person.

In this article, I will share several stages for working on a machine learning project based on the experience that I have.

Understanding the problem


Photo by You X Ventures on Unsplash

The initial stage in starting a project is knowing what problems we will solve. This applies to any project, including the machine learning project. Of course, everything starts with a problem, because if there are no problems, nothing needs to be solved.

We must determine the problem we will solve. For example, suppose we want to know the sentiment of user opinions about a product on social media in real time, but the volume of opinions is so large that it is impossible for humans to process them all.

Then we determine the goal. From the problem above, we can decide that the goal is to build a machine learning system that classifies user opinions (sentiment analysis) in real time and predicts the sentiment of future opinions.

“If you can’t solve a problem, then there is an easier problem you can solve: find it.” George Pólya

I suggest you learn about decomposition in computational thinking, which is the process of breaking down complex problems into smaller parts. With decomposition, problems that seem overwhelming at first are easier to manage.

Data acquisition


Once you have identified a problem and the goal you want to achieve, the next step is to obtain the required data. There are many ways to collect data; I will explain some of them.
  • The first way is to download open datasets from the internet, such as Kaggle, Google Dataset Search, the UCI Machine Learning Repository, etc. Pay attention to the license terms, though, because some datasets may only be used for personal or research purposes.
  • The next way is to crawl or scrape websites. For example, if you want to collect hate-speech comments, you can crawl social media such as Twitter or Instagram; if you need data from a news site, you can scrape that site.
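As a rough illustration of the scraping idea, the sketch below extracts comment text from an HTML fragment using only Python's standard library. In a real project you would fetch pages with an HTTP client and likely use a parser such as BeautifulSoup; the page structure here (one `<p class="comment">` per comment) is a made-up example.

```python
from html.parser import HTMLParser

class CommentExtractor(HTMLParser):
    """Collects the text of every <p class="comment"> element."""

    def __init__(self):
        super().__init__()
        self.in_comment = False
        self.comments = []

    def handle_starttag(self, tag, attrs):
        if tag == "p" and ("class", "comment") in attrs:
            self.in_comment = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_comment = False

    def handle_data(self, data):
        if self.in_comment:
            self.comments.append(data.strip())

# A stand-in for HTML you would normally download from a site.
html = """
<div>
  <p class="comment">Great product, works as advertised!</p>
  <p class="comment">Terrible support, would not buy again.</p>
</div>
"""

parser = CommentExtractor()
parser.feed(html)
print(parser.comments)
```

Always check a site's terms of service and robots.txt before scraping it.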

In some cases you may need to label the data, especially if you obtained it by crawling/scraping and the machine learning method you will use is supervised learning. Be aware of possible bias when labeling data, because it can affect the performance of the model.

If you want to validate the data you have labeled, you can ask people who are experts in the field for help. For example, if you build a dataset in the health domain, you can ask a doctor to validate it.

Data preparation


Photo by Luke Chesser on Unsplash

After we get the data we need, the next step is to prepare the data before entering the training phase. Why is that?

Let’s compare data to ingredients for cooking. Before we cook, we process the raw ingredients first: washing, cleaning, removing unnecessary parts, cutting into pieces, etc. We can’t just throw raw ingredients into a frying pan.

So it is with data. We need to prepare the data so that, when it enters the training phase, it does not contain noise that would degrade the performance of the model.

Before going further, note that there are two types of data: structured and unstructured. Structured data is a data model that organizes elements of data and standardizes how they relate to one another; this type of data usually comes in the form of tables, JSON, etc. Unstructured data is information that either does not have a pre-defined data model or is not organized in a pre-defined manner; it usually comes in the form of free text, images, signals, etc.


Structured data vs Unstructured data

The following are some of the processes commonly used in preparing data.
  • Data cleaning is used to eliminate data or features that are not needed. For each data type, the treatment is different. In structured data, it is used to clean inconsistent data, missing values, or duplicate values. Whereas in unstructured data, for example in the case of text, it is used to clean symbols, numbers, punctuation marks, or words that are actually less necessary in the process of forming the model.
  • Data transformation is used to change the structure of the data. This is especially necessary for unstructured data (e.g. text), because a classifier cannot accept raw text as input, so the text must first be transformed into another representation. Some commonly used methods are PCA, LDA, TF-IDF, etc.
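The cleaning step for text can be sketched very simply. The example below (with made-up opinion strings) lowercases text, strips numbers and punctuation, and drops duplicate records, which are the kinds of operations described above.

```python
import re

# Assumed example data; a real dataset would come from your collection step.
raw_opinions = [
    "I LOVE this product!!! 10/10",
    "I LOVE this product!!! 10/10",   # duplicate entry
    "Worst purchase ever... #disappointed",
]

def clean(text: str) -> str:
    text = text.lower()
    text = re.sub(r"[0-9]+", " ", text)    # remove numbers
    text = re.sub(r"[^a-z\s]", " ", text)  # remove symbols and punctuation
    return " ".join(text.split())          # collapse extra whitespace

# dict.fromkeys preserves order while removing duplicates.
cleaned = list(dict.fromkeys(clean(t) for t in raw_opinions))
print(cleaned)
```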

Another important step is exploratory data analysis (EDA). EDA is an initial investigation of the data to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of statistical summaries and graphical representations.
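A minimal EDA pass can be as simple as computing summary statistics on a feature and flagging values far from the rest. The word counts below are invented for illustration; in practice you would run this over your real columns (and plot them too).

```python
import statistics

# Assumed feature: word count per review. The value 220 looks anomalous.
review_lengths = [12, 15, 14, 13, 220, 16, 11, 15]

mean = statistics.mean(review_lengths)
median = statistics.median(review_lengths)
stdev = statistics.stdev(review_lengths)

# A crude anomaly check: values far from the median relative to the spread.
outliers = [x for x in review_lengths if abs(x - median) > 2 * stdev]
print(mean, median, outliers)
```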

Modelling


Photo by Roman Kraft on Unsplash

This is the part that you may have been waiting for the most. At this stage, we will make a machine learning model. As we discussed in the previous article, there are several approaches in machine learning, namely supervised learning, unsupervised learning, and reinforcement learning. We can determine the approach we take based on the data/problem we observed before.

Another tip from me: find out the strengths and weaknesses of each machine learning method before modelling, because that will save a lot of time. Each method has advantages and disadvantages for certain data characteristics. For example, some methods work well only if the input is normalized first, some overfit easily, and some require very large amounts of data.

Don’t spend your time trying every method one by one in search of the best result, because a single machine learning method usually has many parameters we can tune. For example, if we build a model with a neural network, we can change the learning rate, the number of hidden layers, the activation function, etc.
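To see why exhaustive trial-and-error gets expensive, consider how fast even a small hyperparameter grid grows. The parameter names below mirror common neural-network settings and are purely illustrative.

```python
from itertools import product

# A small, illustrative hyperparameter grid.
grid = {
    "learning_rate": [0.1, 0.01, 0.001],
    "hidden_layers": [1, 2, 3],
    "activation": ["relu", "tanh"],
}

# Every combination is one model you would have to train and evaluate.
combinations = list(product(*grid.values()))
print(len(combinations))  # 3 * 3 * 2 = 18 models
```

Adding just one more option per parameter multiplies the count again, which is why understanding each method first beats blind search.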

Many people may also suggest using deep learning, because it has repeatedly produced good results in experiments. Before that, however, consider whether your case really requires deep learning. Don’t use deep learning just to follow the trend, because the cost involved is very large.

If traditional machine learning methods can handle the case, use them. If the case is very complex and cannot be handled by traditional methods, then you can use deep learning.

Evaluate


Photo by Kaleidico on Unsplash

The final stage is the evaluation process. You certainly don’t want the model you trained to perform poorly, especially if your prediction model makes a lot of mistakes.


One of the fastest ways to determine whether the model we are building is good is to measure its performance. There are several metrics for this, including accuracy, F1 score, silhouette score, etc. Another way to evaluate your model is to validate it with people who are experts in the field.
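For binary classification such as the sentiment example, accuracy and F1 can be computed by hand. The labels below are made up (1 = positive sentiment); in a real project you would use a library like scikit-learn instead of writing these out.

```python
# Assumed ground-truth labels and model predictions.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# Accuracy: fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# F1 score: harmonic mean of precision and recall.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, f1)
```

Which metric matters depends on the problem: accuracy can be misleading on imbalanced classes, where F1 is usually more informative.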