About a decade ago, when data science jobs started going mainstream, there was a flood of opportunities in the tech world. However, most companies didn’t know what to make of them. In one of my earlier stints, I used to hear phrases repeatedly,
This is not the age of specialization. One needs to be a generalist who specialises in something. Just like in life: one can be a neurosurgeon and still drive a car. A data engineer and a data scientist in the same person is not an odd idea, but it’s rare in practice because the combined area of responsibility is too broad. Similarly, it’s highly unlikely to find a neurosurgeon who operates at night and drives an Uber during the day.
"Specialization is for insects" — Robert A. Heinlein
Being a data engineer and a data scientist in one also comes with the challenge of diving into the vast ocean of knowledge in both of these data-related fields. A data engineer should be able to do basic data sciency stuff, and a data scientist should be able to do basic data engineering. The same can be said about other fields of software: a data engineer should be able to do basic frontend work, and so on.
Having said that, what distinguishes these fields is not so much the skill set as the thought process.
"It doesn’t matter so much what you think, but how you think it"— Christopher Hitchens
Plumbers Or Not
One of my managers used to draw an interesting analogy between data engineering and plumbing. Data engineers move data from one place to another. Just as cooking gas or drinking water needs a pipeline to move from the plant to your house, data needs a pipeline to move from one system to another. At the risk of sounding rude and engineer-splaining, I don’t want to push this analogy too far, but it is rather true if you think about it.
Data engineers are the plumbers building a data pipeline, while data scientists are the painters and storytellers, giving meaning to an otherwise static entity — Dave Bianco
Data engineers are plumbers. But they are also more than that. In addition to making sure that data is transported from one place to another, data engineers make sure that the quality of the data is good enough for use.
They also gauge how the data is going to be used and, based on that, decide how to store it, how best to retrieve it, how to process it and so on. Examples include choosing between traditional relational databases, data warehouses and NoSQL data stores; choosing between columnar and row-oriented data stores; and choosing task schedulers and data processing infrastructure.
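To make the columnar versus row-oriented choice a little more concrete, here is a minimal sketch in plain Python. It is not how any real data store is implemented; the order records, field names and the `to_columns` helper are all made up for illustration. It only shows why the access pattern drives the storage decision.

```python
# Hypothetical order records; the field names are illustrative only.
rows = [
    {"order_id": 1, "customer": "a", "amount": 20.0},
    {"order_id": 2, "customer": "b", "amount": 35.5},
    {"order_id": 3, "customer": "a", "amount": 14.5},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns


columns = to_columns(rows)

# Row-oriented access: fetch one whole record, as a transactional lookup would.
second_order = rows[1]

# Column-oriented access: aggregate one field without reading the others,
# as an analytical query would.
total_amount = sum(columns["amount"])
```

If most of the work looks like `second_order`, a row-oriented layout tends to fit; if it looks like `total_amount`, a columnar layout usually does.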
"While a data engineer might be a plumber, a data scientist is the one who accesses the water through the plumbed pipes and makes lemonade".
Read Robert Chang’s three-part introduction to data engineering.
Probabilistic vs. Deterministic Thinking
Let’s come to the main point of difference between a data engineer and a data scientist. Obviously, the job titles are different and the KRAs (key result areas) are different, though they can certainly overlap. The main quality that distinguishes these two creatures is how they think.
"A data engineer thinks in terms of movement, strictness, predictability, cleanliness and resilience — of the data and, of the systems carrying the data".
There’s a striking difference between how these two approach handling data. Movement of data, for example, should be deterministic. If some data is supposed to arrive at one location from another, it should. If a transformation is to be applied to a dataset for cleaning or modification, it should happen. Data engineering, in that sense, should be predictable, dependable and resilient: deterministic.
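As a rough sketch of what "deterministic" can mean in code, the example below cleans some records and moves them into a destination. Everything here is an assumption made for illustration: the record shape, the `clean_records` and `move` helpers, and the use of a plain dict as the destination system. The point is only that the same input always yields the same output, and that a re-run does not change the result.

```python
def clean_records(records):
    """Deterministic cleaning step: the same input always gives the same output."""
    cleaned = []
    for record in records:
        name = record["name"].strip().lower()
        if not name:
            continue  # drop rows whose name is empty after trimming
        cleaned.append({"id": record["id"], "name": name})
    # Sort by id so the output order never depends on arrival order.
    return sorted(cleaned, key=lambda r: r["id"])


def move(records, destination):
    """Write cleaned records into `destination` and check that each one arrived."""
    cleaned = clean_records(records)
    for record in cleaned:
        # Keyed write: running the same move twice leaves the same state.
        destination[record["id"]] = record
    not_arrived = [r["id"] for r in cleaned if destination.get(r["id"]) != r]
    if not_arrived:
        raise RuntimeError(f"records did not arrive intact: {not_arrived}")
    return len(cleaned)


# Hypothetical source data and an in-memory stand-in for the target system.
source = [
    {"id": 2, "name": " Bob "},
    {"id": 1, "name": "ALICE"},
    {"id": 3, "name": "   "},
]
target = {}
moved = move(source, target)
```

A real pipeline would add retries, logging and schema checks, but the expectation stays the same: what was supposed to arrive either arrives or the run fails loudly.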
"A data scientist thinks in terms of deriving value, process improvement, decision making, cost and forecasting".
A data scientist doesn’t care about the movement of data from one place to another, at least not as the main part of the work. A data scientist answers questions using data, recognises patterns (hidden or obvious), makes predictions, helps make decisions and helps understand things that even a human looking at the same data can’t. Answers of this kind are estimates that carry uncertainty. Hence, their work becomes probabilistic.
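One way to picture the probabilistic side is an estimate that comes with a range rather than a single certain number. The sketch below uses a bootstrap percentile interval, which is just one of many possible techniques and not something the quotes above prescribe. The `daily_orders` figures and the `bootstrap_mean_interval` helper are made up for illustration.

```python
import random
import statistics


def bootstrap_mean_interval(values, confidence=0.95, resamples=1000, seed=0):
    """Estimate the mean of `values` with a bootstrap percentile interval."""
    rng = random.Random(seed)  # seeded only so that the sketch is repeatable
    # Resample with replacement many times and record each resample's mean.
    means = sorted(
        statistics.mean(rng.choices(values, k=len(values)))
        for _ in range(resamples)
    )
    alpha = (1 - confidence) / 2
    low = means[int(alpha * resamples)]
    high = means[int((1 - alpha) * resamples)]
    return statistics.mean(values), low, high


# Hypothetical daily order counts, invented for this example.
daily_orders = [112, 98, 130, 105, 121, 99, 140, 117]
estimate, low, high = bootstrap_mean_interval(daily_orders)
print(f"mean is about {estimate:.1f}, 95% interval roughly [{low:.1f}, {high:.1f}]")
```

The answer is "probably somewhere in this range", not "exactly this value", and that is the kind of statement a data scientist has to be comfortable making.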
There’s going to be more and more overlap between these two domains in the future. Data engineers and software developers will automate a lot of the repetitive work done by data scientists. Data scientists will upskill themselves so that they can work independently of a data engineer. A future data scientist or data engineer will wear both these hats and have a very good understanding of both domains, and probably even more. As the Robert A. Heinlein quote goes: specialization is for insects.