How To Think About Data

Kovid Rathee
Jun 05, 2020

The real difference between a data engineer and a data scientist — how they think

About a decade ago, when data science jobs started going mainstream, there was a flood of opportunities in the tech world. However, most companies didn’t know what to make of it. At one of my earlier stints, I repeatedly heard phrases like "we’re doing big data" and "we’re doing data science". Because data scientists were advertised as getting big paychecks, data analysts, database administrators and data engineers all wanted to be data scientists, without understanding what it takes to be one.

This is not the age of specialization. One needs to be a generalist who specialises in something, just like in life: one can be a neurosurgeon and still drive a car. It’s conceivable for one person to be both a data engineer and a data scientist, but it rarely happens in practice because the combined area of responsibility is too broad. Similarly, you are highly unlikely to find a neurosurgeon who operates at night and drives an Uber during the day.

"Specialization is for insects" —
Robert A. Heinlein

Being a data engineer and a data scientist in one also comes with the challenge of diving into the vast ocean of knowledge in both fields. A data engineer should be able to do basic data sciency stuff, and a data scientist should be able to do basic data engineering. The same can be said about other fields of software: a data engineer should be able to do basic frontend work, and so on.

Having said that, what distinguishes these fields is not so much the skills as the thought process.

"It doesn’t matter so much what you think, but how you think it"— Christopher Hitchens

Plumbers Or Not

One of my managers used to draw an interesting analogy between data engineering and plumbing. Data engineers move data from one place to another. Just as cooking gas or drinking water needs a pipeline to move from the plant to your house, data needs a pipeline to move from one system to another. At the risk of sounding rude and engineer-splaining, I don’t want to push this analogy too far, but it is rather true if you think about it.

Data engineers are the plumbers building a data pipeline, while data scientists are the painters and storytellers, giving meaning to an otherwise static entity — Dave Bianco

Data engineers are plumbers. But they are also more than that. In addition to making sure that data is transported from one place to another, data engineers make sure that the quality of data is good for use.

They also gauge how the data is going to be used and based on that they make decisions on how to store it, how best to retrieve it, to process it and so on. Some examples are choosing between traditional relational databases, data warehouses and NoSQL data stores or choosing between columnar and row-oriented data stores, choosing task schedulers, choosing data processing infrastructure.
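The row-oriented versus columnar choice mentioned above can be illustrated with a small sketch. This is only a toy model using plain Python structures, not how any particular database stores data; the `rows` and `columns` datasets and the `get_order` and `total_amount` helpers are hypothetical names made up for this example.

```python
# Row-oriented layout: all fields of one record are kept together.
# This suits lookups that need a whole record at once.
rows = [
    {"order_id": 1, "customer": "a", "amount": 20.0},
    {"order_id": 2, "customer": "b", "amount": 35.5},
    {"order_id": 3, "customer": "a", "amount": 12.5},
]

# Column-oriented layout: all values of one field are kept together.
# This suits aggregations that touch only a few columns.
columns = {
    "order_id": [1, 2, 3],
    "customer": ["a", "b", "a"],
    "amount": [20.0, 35.5, 12.5],
}


def get_order(rows, order_id):
    """Fetch one complete record; natural in the row-oriented layout."""
    return next(row for row in rows if row["order_id"] == order_id)


def total_amount(columns):
    """Sum a single column; natural in the column-oriented layout."""
    return sum(columns["amount"])


print(get_order(rows, 2))
print(total_amount(columns))
```

Which layout fits better depends on how the data will be read, which is exactly the kind of usage question a data engineer has to weigh before picking a store.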

While a data engineer might be a plumber, a data scientist is the one who accesses the water through the plumbed pipes and makes lemonade.

Further reading: Robert Chang’s three-part introduction to data engineering.

Probabilistic vs. Deterministic Thinking

Let’s come to the main point of difference between a data engineer and a data scientist. Obviously, the job titles are different and the KRAs (key result areas) are different, although they can certainly overlap. The main quality that distinguishes these two creatures is how they think.

"A data engineer thinks in terms of movement, strictness, predictability, cleanliness and resilience, both of the data and of the systems carrying the data."

There’s a striking difference between how these two approach handling data. Movement of data, for example, should be deterministic. If some data is supposed to move from one location to another, it should arrive. If a transformation is to be applied to a dataset for cleaning or modification, it should happen. Data engineering, in that sense, should be predictable, dependable and resilient: deterministic.
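To make the deterministic mindset concrete, here is a minimal sketch of a cleaning step and a load step that give the same result no matter how many times they run. The `clean_records` and `load` functions, the in-memory `target` dictionary standing in for a destination table, and the sample records are all assumptions invented for illustration, not a real pipeline.

```python
def clean_records(records):
    """Drop rows without an id and normalise names.

    A pure function: the same input always produces the same output.
    """
    cleaned = []
    for row in records:
        if row.get("id") is None:
            continue  # rows without an id cannot be loaded reliably
        cleaned.append({"id": row["id"], "name": row["name"].strip().lower()})
    # Sort so the output order does not depend on the input order.
    return sorted(cleaned, key=lambda row: row["id"])


def load(target, records):
    """Idempotent load: upsert by id, so a re-run never duplicates rows."""
    for row in records:
        target[row["id"]] = row
    return target


source = [
    {"id": 2, "name": "  Bob "},
    {"id": 1, "name": "ALICE"},
    {"id": None, "name": "ghost"},
]

target = {}  # hypothetical destination, keyed by id
load(target, clean_records(source))
first_run = dict(target)

# Re-running the same job (for example after a retry) changes nothing.
load(target, clean_records(source))
assert target == first_run
```

The design choice worth noting is the upsert by key: it is what lets a failed job be retried safely, which is one common way to get the predictability described above.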

"A data scientist thinks in terms of deriving value, process improvement, decision making, cost and forecasting."

A data scientist doesn’t care about the movement of data from one place to another, at least not as the main part of the work. A data scientist answers questions using data, recognises patterns (hidden or obvious), makes predictions, helps make decisions and helps uncover things that a person simply looking at the same data can’t see. Because a data scientist works with all of that, the work becomes probabilistic.
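A small sketch of what a probabilistic answer can look like: instead of a single exact value, the result is an estimate with a range around it. The example assumes a made-up scenario of 120 conversions out of 1,000 visitors and uses the standard normal approximation for a proportion; the function name `conversion_rate_interval` is hypothetical.

```python
import math


def conversion_rate_interval(conversions, visitors, z=1.96):
    """Return a point estimate and an approximate 95% interval.

    Uses the normal approximation for a proportion, which is only
    reasonable when the sample is fairly large.
    """
    rate = conversions / visitors
    standard_error = math.sqrt(rate * (1 - rate) / visitors)
    low = max(0.0, rate - z * standard_error)
    high = min(1.0, rate + z * standard_error)
    return rate, low, high


# Made-up numbers for illustration only.
rate, low, high = conversion_rate_interval(conversions=120, visitors=1000)
print(f"estimated rate: {rate:.3f}, plausible range: {low:.3f} to {high:.3f}")
```

The same data always yields the same numbers here, but the answer itself is a statement about uncertainty: the true rate is probably somewhere in the range, not exactly at the estimate.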


There’s going to be more and more overlap between these two domains in the future. Data engineers and software developers will automate a lot of the repetitive work done by data scientists. Data scientists will upskill so that they can work independently of a data engineer. A future data scientist or data engineer will wear both hats and have a very good understanding of both domains, and probably of even more. As the Robert A. Heinlein quote goes: specialization is for insects.


