How To Think About Data

Kovid Rathee
Jun 05, 2020




The real difference between a data engineer and a data scientist — how they think

About a decade ago, when data science jobs started going mainstream, there was a flood of opportunities in the tech world. However, most companies didn’t know what to make of it. At one of my earlier stints, I repeatedly heard phrases like "we’re doing big data" and "we’re doing data science". Because data scientists were advertised as getting big paychecks, data analysts, database administrators and data engineers all wanted to be data scientists, without understanding what it takes to be one.

This is not the age of pure specialization. One needs to be a generalist who specializes in something. Just like in life: one can be a neurosurgeon and still drive a car. It’s conceivable for one person to be both a data engineer and a data scientist, but it’s rare in practice because the combined area of responsibility is too broad. Similarly, it’s highly unlikely to find a neurosurgeon who operates at night and drives an Uber during the day.

"Specialization is for insects" —
Robert A. Heinlein

Being both a data engineer and a data scientist also means diving into the vast ocean of knowledge in both fields. A data engineer should be able to do basic data science work, and a data scientist should be able to do basic data engineering. The same goes for other areas of software: a data engineer should be able to do basic frontend work, and so on.

Having said that, what distinguishes these fields is not so much the skill set as the thought process.

"It doesn’t matter so much what you think, but how you think it"— Christopher Hitchens

Plumbers Or Not


One of my managers used to draw an interesting analogy between data engineering and plumbing. Data engineers move data from one place to another. Just as cooking gas or drinking water needs a pipeline to move from the plant to your house, data needs a pipeline to move from one system to another. At the risk of sounding rude and engineer-splaining, I don’t want to push this analogy too far, but it is rather true if you think about it.

Data engineers are the plumbers building a data pipeline, while data scientists are the painters and storytellers, giving meaning to an otherwise static entity — Dave Bianco

Data engineers are plumbers. But they are also more than that. In addition to making sure that data is transported from one place to another, data engineers make sure that the quality of data is good for use.

They also gauge how the data is going to be used and, based on that, decide how to store it, how best to retrieve it, how to process it and so on. Some examples are choosing between traditional relational databases, data warehouses and NoSQL data stores, choosing between columnar and row-oriented data stores, and choosing task schedulers and data processing infrastructure.

"While a data engineer might be a plumber, a data scientist is the one who accesses the water through the plumbed pipes and makes lemonade."

Read Robert Chang’s three-part introduction to data engineering.


Probabilistic vs. Deterministic Thinking


Let’s come to the main point of difference between a data engineer and a data scientist. Obviously, the job titles are different and the KRAs are different, although they can surely overlap. The main quality that distinguishes these two creatures is how they think.

"A data engineer thinks in terms of movement, strictness, predictability, cleanliness and resilience — of the data and, of the systems carrying the data".

There’s a striking difference between how these two approach handling data. Movement of data, for example, should be deterministic. If some data is supposed to move from one location to another, it should arrive. If a transformation is to be applied to a dataset for cleaning or modification, it should happen. Data engineering, in that sense, should be predictable, dependable and resilient: deterministic.
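To make the deterministic mindset concrete, here is a minimal sketch of a cleaning step that always produces the same output for the same input, and that can safely be re-run on its own output. The record layout (`id`, `email`) and the function name `clean_records` are assumptions made up for this illustration, not something from a real pipeline.

```python
from typing import Dict, List

Record = Dict[str, str]


def clean_records(records: List[Record]) -> List[Record]:
    """Normalize, de-duplicate and sort records (hypothetical example).

    The same input always yields the same output, and running the
    function on its own output changes nothing (it is idempotent).
    """
    seen = set()
    cleaned = []
    for rec in records:
        # Normalize: trim whitespace and lowercase the email.
        normalized = {
            "id": rec["id"].strip(),
            "email": rec["email"].strip().lower(),
        }
        key = (normalized["id"], normalized["email"])
        # Drop records that are duplicates after normalization.
        if key not in seen:
            seen.add(key)
            cleaned.append(normalized)
    # Sort so the result does not depend on arrival order.
    return sorted(cleaned, key=lambda r: (r["id"], r["email"]))


raw = [
    {"id": "2", "email": " Bob@Example.com "},
    {"id": "1", "email": "alice@example.com"},
    {"id": "2", "email": "bob@example.com"},
]

once = clean_records(raw)
twice = clean_records(once)
assert once == twice  # re-running the step is safe
```

Sorting the output and de-duplicating on the normalized key are the design choices that make a retry or a replay harmless, which is the kind of predictability the paragraph above describes.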

"A data scientist thinks in terms of deriving value, process improvement, decision making, cost and forecasting".

A data scientist doesn’t care about the movement of data from one place to another, at least not as the main part of the work. A data scientist answers questions using data, recognises patterns (hidden or obvious), makes predictions, helps make decisions and helps understand things that even a human looking at the same data can’t. Hence, their work becomes probabilistic.
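As a contrast, here is a small sketch of what a probabilistic answer can look like: an estimate that comes with a range instead of a single guaranteed value. It uses the textbook normal approximation for a proportion. The scenario (visitors and conversions), the numbers and the function name `conversion_rate_interval` are all assumptions for illustration.

```python
import math
from typing import Tuple


def conversion_rate_interval(
    conversions: int, visitors: int, z: float = 1.96
) -> Tuple[float, float, float]:
    """Estimate a conversion rate with an approximate 95% interval.

    Uses the normal approximation for a proportion; z = 1.96
    corresponds to roughly 95% confidence. Hypothetical example.
    """
    if visitors <= 0:
        raise ValueError("visitors must be positive")
    if not 0 <= conversions <= visitors:
        raise ValueError("conversions must be between 0 and visitors")
    rate = conversions / visitors
    # Standard error of a sample proportion.
    stderr = math.sqrt(rate * (1 - rate) / visitors)
    low = max(0.0, rate - z * stderr)
    high = min(1.0, rate + z * stderr)
    return rate, low, high


# Made-up numbers: 120 conversions out of 1,000 visitors.
rate, low, high = conversion_rate_interval(120, 1000)
print(f"estimated rate: {rate:.3f}, plausible range: {low:.3f} to {high:.3f}")
```

The answer here is a best estimate plus a range of plausible values, and collecting more data narrows the range without ever turning it into a certainty.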


Afterword


There’s going to be more and more overlap between these two domains in the future. Data engineers and software developers will automate a lot of the repetitive work done by data scientists. Data scientists will upskill so that they can work independently of data engineers. A future data scientist or data engineer will wear both hats and have a very good understanding of both domains, and probably more. As the Robert A. Heinlein quote goes: specialization is for insects.
