Data Scientists Without Data Engineering Skills Will Face the Harsh Truth

Soner Yıldırım
Jul 05, 2021



OPINION.

You have probably read an article about the difference between a data scientist and a data engineer. I always thought the distinction was clear. Data engineers make the data ready for use and then data scientists work on that data.

However, my opinion on this distinction changed dramatically after I started working as a data scientist.


Photo by Ben White on Unsplash


Everything in data science starts with data. Your machine learning model is only as good as the data fed into it. Garbage in, garbage out! A data scientist cannot work magic to create a valuable product without proper data.

Proper data is not always readily available to data scientists. In most cases, it will be the responsibility of the data scientist to convert the raw data to a proper format.

Unless you work for a big tech company that has separate teams of data engineers and data scientists, you should possess the ability and skills to handle some data engineering tasks. These tasks cover a broad range of operations and I will elaborate on this in the remaining part of the article.

What is the difference anyway?

I would like to state my opinion on the relationship between the job of a data engineer and a data scientist.

A data engineer is a data engineer. A data scientist should be both a data scientist and a data engineer.

It may seem like an arguable statement. However, I would like to emphasize that my opinion was different before I started working as a data scientist. I used to think of data engineers and data scientists as separate entities.

In the remaining part of the article, I will try to explain what I mean when I say that a data scientist should be both a data scientist and a data engineer.

For instance, data engineers perform a set of operations known as ETL (extract, transform, load). It covers the procedures for collecting data from one or more sources, applying some transformations, and then loading the result into a different destination.
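To make the three ETL steps concrete, here is a minimal sketch using only the Python standard library. The two inline sources, the `payments` table, and the function names are hypothetical and chosen purely for illustration; a real pipeline would read from files, APIs, or databases and would likely use heavier tooling.

```python
import csv
import io
import json
import sqlite3

# Hypothetical raw inputs standing in for two different data sources.
CSV_SOURCE = """customer_id,amount
1,10.50
2,not_available
3,7.25
"""
JSON_SOURCE = '[{"customer_id": 4, "amount": "3.00"}, {"customer_id": 1, "amount": "4.50"}]'


def extract():
    """Extract: read raw rows from both sources as dictionaries."""
    rows = list(csv.DictReader(io.StringIO(CSV_SOURCE)))
    rows.extend(json.loads(JSON_SOURCE))
    return rows


def transform(rows):
    """Transform: cast types and drop rows whose amount is not numeric."""
    clean = []
    for row in rows:
        try:
            clean.append((int(row["customer_id"]), float(row["amount"])))
        except ValueError:
            continue  # skip malformed records such as "not_available"
    return clean


def load(rows, conn):
    """Load: write the cleaned rows into a single destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments (customer_id INTEGER, amount REAL)"
    )
    conn.executemany("INSERT INTO payments VALUES (?, ?)", rows)
    conn.commit()


# An in-memory SQLite database plays the role of the destination.
conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)

totals = conn.execute(
    "SELECT customer_id, SUM(amount) FROM payments "
    "GROUP BY customer_id ORDER BY customer_id"
).fetchall()
print(totals)
```

Keeping each step in its own function makes it easier to test the cleaning logic separately from where the data comes from or where it ends up.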

I would definitely not be surprised if a data scientist is expected to perform ETL operations. Data science is still evolving and most companies do not have clearly separated data engineer and data scientist roles. As a result, a data scientist should be able to perform some data engineering tasks.

If you expect to only work on running machine learning algorithms with ready-to-use data, you will face the harsh truth soon after you start working as a data scientist.

You may have to write some stored procedures in SQL to preprocess the client data. It is also possible that you receive the client data from a few different sources. It will be your job to extract and combine them. Then, you will need to load them into a single source. In order to write efficient stored procedures, you need extensive SQL skills.

The transform part of ETL procedures involves many data cleaning and manipulation steps. SQL may not be the best choice if you work with large-scale data. Distributed computing is a better alternative in such cases. Therefore, a data scientist should also be familiar with distributed computing.

Your best friend in distributed computing might be Spark. It is an analytics engine used for large-scale data processing. We can distribute both data and computations over clusters to achieve a substantial performance increase.
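The idea of distributing both data and computation can be illustrated on a single machine. The sketch below is not Spark: it is a standard-library analogy in which threads stand in for cluster workers, and the dataset, the partition count, and the function names are all assumptions made for illustration. It only shows the shape of the work (split the data, compute per partition, combine the partial results), not the performance you would get from a real cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(values, num_partitions):
    """Split a list into roughly equal chunks, one per worker."""
    size = -(-len(values) // num_partitions)  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def partial_sum_of_squares(partition):
    """Work done independently by each worker on its own partition."""
    return sum(x * x for x in partition)


values = list(range(1, 1001))  # hypothetical dataset
partitions = split_into_partitions(values, num_partitions=4)

# Each partition is handed to a separate worker.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_results = list(pool.map(partial_sum_of_squares, partitions))

# The partial results are combined into the final answer.
total = sum(partial_results)
print(len(partitions), total)
```

On a real cluster, the partitions live on different machines and the engine takes care of scheduling, data movement, and failure recovery, which is a large part of what Spark provides.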

If you are familiar with Python and SQL, you won’t have a hard time getting used to Spark. You can use Spark features through PySpark, the Python API for Spark.


When it comes to working with clusters, the optimal environment is the cloud. There are various cloud providers, but AWS, Azure, and Google Cloud Platform (GCP) lead the way.

Although the PySpark code is the same for all cloud providers, how you set up the environment and create clusters differs between them. They allow you to create clusters using either scripts or the user interface.

Distributed computing over clusters is a whole different world. It is nothing like doing analysis on your own computer. It has very different dynamics. Evaluating cluster performance and choosing the optimal number of workers for a cluster will be your predominant concerns.
