How To Think About Data

Kovid Rathee
Jun 05, 2020

The real difference between a data engineer and a data scientist — how they think

About a decade ago, when data science jobs started going mainstream, there was a flood of opportunities in the tech world. However, most companies didn’t understand what to make of it. At one of my earlier stints, I used to hear the same phrases repeatedly: "we’re doing big data" and "we’re doing data science". Because it was advertised that data scientists get big paychecks, data analysts, database administrators and data engineers all wanted to be data scientists, without understanding what it takes to be one.

This is not the age of specialization. One needs to be a generalist who specialises in something, just like in life: one can be a neurosurgeon and still drive a car. It’s possible to find a data engineer and a data scientist in the same person, but it’s rare in practice because the combined area of responsibility is too broad. Similarly, it’s highly unlikely to find a neurosurgeon who operates at night and drives an Uber during the day.

"Specialization is for insects" - Robert A. Heinlein

Being a data engineer and a data scientist, both in one, also comes with a challenge of diving into the vast ocean of knowledge in both these fields related to data. A data engineer should be able to do basic data sciency stuff and a data scientist should be able to do basic data engineering. The same can be said about other fields of software. As in, the data engineer should be able to do basic frontend work and so on.

Having said that, it’s not so much that the skill is the distinguisher between all these fields, rather it is the thought process.

"It doesn’t matter so much what you think, but how you think it"— Christopher Hitchens

Plumbers Or Not


One of my managers used to draw an interesting analogy between data engineering and plumbing. Data engineers move data from one place to another. Just as cooking gas or drinking water needs a pipeline to move from the plant to your house, data needs a pipeline to move from one system to another. At the risk of sounding rude and engineer-splaining, I don’t want to carry this analogy too far, but it is rather true if you think about it.

Data engineers are the plumbers building a data pipeline, while data scientists are the painters and storytellers, giving meaning to an otherwise static entity — Dave Bianco

Data engineers are plumbers. But they are also more than that. In addition to making sure that data is transported from one place to another, data engineers make sure that the quality of data is good for use.

They also gauge how the data is going to be used and, based on that, decide how to store it, how best to retrieve it, how to process it and so on. Examples include choosing between traditional relational databases, data warehouses and NoSQL data stores, choosing between columnar and row-oriented data stores, and choosing task schedulers and data processing infrastructure.

"While a data engineer might be a plumber, a data scientist is the one who accesses the water through the plumbed pipes and makes lemonade."

Read Robert Chang’s three-part introduction to data engineering.


Probabilistic vs. Deterministic Thinking


Let’s come to the main point of difference between a data engineer and a data scientist. Obviously, the job titles and the key result areas (KRAs) are different, although they can overlap. The main quality that distinguishes these two creatures is how they think.

"A data engineer thinks in terms of movement, strictness, predictability, cleanliness and resilience, both of the data and of the systems carrying the data."

There’s a striking difference between how these two approach handling data. Movement of data, for example, should be deterministic. If some data is supposed to arrive at one location from another, it should. If a transformation is to be applied to a dataset for cleaning or modification, it should happen, and it should produce the same result every time it runs on the same input. Data engineering, in that sense, should be predictable, dependable and resilient: deterministic.

"A data scientist thinks in terms of deriving value, process improvement, decision making, cost and forecasting."

A data scientist doesn’t care about the movement of data from one place to another, at least not as the main part of the work. A data scientist answers questions using data, recognises patterns (hidden or obvious), makes predictions, helps make decisions and helps understand things that even a human looking at the same data can’t. These answers are estimates that carry uncertainty rather than guaranteed outcomes. Hence, their work becomes probabilistic.
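
To make the contrast concrete, here is a minimal Python sketch, not a description of any real pipeline. The function names `clean_record` and `bootstrap_mean_interval`, the record fields and the order counts are all made up for illustration. The first function is a deterministic cleaning step: the same input always gives the same output. The second gives a probabilistic answer: an approximate range for an average, obtained by bootstrap resampling, rather than a single guaranteed number.

```python
import random
import statistics


def clean_record(record: dict) -> dict:
    """Deterministic step: the same input always yields the same output."""
    return {
        "user_id": int(record["user_id"]),
        "country": record["country"].strip().upper(),
    }


def bootstrap_mean_interval(values, n_resamples=1000, seed=0):
    """Probabilistic step: estimate a plausible range for the mean.

    Returns an approximate 95% interval from bootstrap resampling.
    """
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    lower = means[int(0.025 * n_resamples)]
    upper = means[int(0.975 * n_resamples) - 1]
    return lower, upper


# Hypothetical raw record, as it might arrive from a source system.
raw = {"user_id": "42", "country": " in "}
cleaned = clean_record(raw)
# Re-running the pipeline step gives an identical result.
assert cleaned == clean_record(raw)

# Hypothetical daily order counts used only for illustration.
daily_orders = [12, 15, 9, 20, 14, 11, 17, 13]
low, high = bootstrap_mean_interval(daily_orders)
print(cleaned)
print(f"mean daily orders is likely between {low:.1f} and {high:.1f}")
```

The cleaning step can be checked with a plain equality assertion, while the estimate is reported as a range. The seed is fixed only so that the sketch prints the same interval on every run.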


Afterword


There’s going to be more and more overlap between these two domains in the future. Data engineers and software developers will automate a lot of the repetitive work done by data scientists. Data scientists will upskill so that they can work independently of a data engineer. A future data scientist or data engineer will wear both these hats and have a very good understanding of both domains, and probably even more. As the Robert A. Heinlein quote goes: specialization is for insects.