To be a real “full-stack” data scientist, or what many bloggers and employers call a “unicorn” you’ve to master every step of the data science process — all the way from storing your data, to putting your finished product (typically a predictive model) in production.
But the bulk of data science training focuses on machine/deep learning techniques; data management knowledge is often treated as an afterthought.
Data science students usually learn modeling skills with processed and cleaned data in text files stored on their laptop, ignoring how the data sausage is made. Students often don’t realize that in industry settings, getting the raw data from various sources to be ready for modeling is usually 80% of the work.
And because enterprise projects usually involve a massive amount of data that their local machine is not equipped to handle, the entire modeling process often takes place in the cloud, with most of the applications and databases hosted on servers in data centers elsewhere.
Even after the student landed a job as a data scientist, data management often becomes something that a separate data engineering team takes care of.
As a result, too many data scientists know too little about data storage and infrastructure, often to the detriment of their ability to make the right decisions at their jobs.
The goal of this article is to provide a roadmap of what a data scientist in 2019 should know about data management — from types of databases, where and how data is stored and processed, to the current commercial options — so the aspiring “unicorns” could dive deeper on their own, or at least learn enough to sound like one at interviews and cocktail parties.
The Rise of Unstructured Data & Big Data Tools
IBM 305 RAMAC (Source: WikiCommons)
The story of data science is really the story of data storage. In the pre-digital age, data was stored in our heads, on clay tablets, or on paper, which made aggregating and analyzing data extremely time-consuming.
In 1956, IBM introduced the first commercial computer with a magnetic hard drive, 305 RAMAC. The entire unit required 30 ft x 50 ft of physical space, weighed over a ton, and for $3,200 a month, companies could lease the unit to store up to 5 MB of data.
In the 60 years since, prices per gigabyte in DRAM has dropped from a whopping $2.64 billion in 1965 to $4.9 in 2017. Besides being magnitudes cheaper, data storage also became much denser/smaller in size.
A disk platter in the 305 RAMAC stored a hundred bits per square inch, compared to over a trillion bits per square inch in a typical disk platter today.
This combination of dramatically reduced cost and size in data storage is what makes today’s big data analytics possible.
With ultra-low storage cost, building the data science infrastructure to collect and extract insights from huge amount of data became a profitable approach for businesses.
And with the profusion of IoT devices that constantly generate and transmit users’ data, businesses are collecting data on an ever increasing number of activities, creating a massive amount of high-volume, high-velocity, and high-variety information assets (or the “three Vs of big data”).
Most of these activities (e.g. emails, videos, audio, chat messages, social media posts) generate unstructured data, which accounts for almost 80% of total enterprise data today and is growing twice as fast as structured data in the past decade.
125 Exabytes of enterprise data was stored in 2017; 80% was unstructured data. (Source: Credit Suisse)
This massive data growth dramatically transformed the way data is stored and analyzed, as the traditional tools and approaches were not equipped to handle the “three Vs of big data.” New technologies were developed with the ability to handle the ever increasing volume and variety of data, and at a faster speed and lower cost.
These new tools also have profound effects on how data scientists do their job — allowing them to monetize the massive data volume by performing analytics and building new applications that were not possible before. Below are the major big data management innovations that we think every data scientist should know about.
Relational Databases & NoSQL
Relational Database Management Systems (RDBMS) emerged in the 1970’s to store data as tables with rows and columns, using Structured Query Language (SQL) statements to query and maintain the database.
A relational database is basically a collection of tables, each with a schema that rigidly defines the attributes and types of data that they store, as well as keys that identify specific columns or rows to facilitate access.
The RDBMS landscape was once ruled by Oracle and IBM, but today many open source options, like MySQL, SQLite, and PostgreSQL are just as popular.
RDBMS ranked by popularity (Source: DB-Engines)
Relational databases found a home in the business world due to some very appealing properties. Data integrity is absolutely paramount in relational databases.
RDBMS satisfy the requirements of Atomicity, Consistency, Isolation, and Durability (or ACID-compliant) by imposing a number of constraints to ensure that the stored data is reliable and accurate, making them ideal for tracking and storing things like account numbers, orders, and payments. But these constraints come with costly tradeoffs.
Because of the schema and type constraints, RDBMS are terrible at storing unstructured or semi-structured data.
The rigid schema also makes RDBMS more expensive to set up, maintain and grow. Setting up a RDBMS requires users to have specific use cases in advance; any changes to the schema are usually difficult and time-consuming.
In addition, traditional RDBMS were designed to run on a single computer node, which means their speed is significantly slower when processing large volumes of data. Sharding RDBMS in order to scale horizontally while maintaining ACID compliance is also extremely challenging. All these attributes make traditional RDBMS ill-equipped to handle modern big data.
By the mid-2000’s, the existing RDBMS could no longer handle the changing needs and exponential growth of a few very successful online businesses, and many non-relational (or NoSQL) databases were developed as a result (here’s a story on how Facebook dealt with the limitations of MySQL when their data volume started to grow).
Without any known solutions at the time, these online businesses invented new approaches and tools to handle the massive amount of unstructured data they collected: Google created GFS, MapReduce, and BigTable; Amazon created DynamoDB; Yahoo created Hadoop; Facebook created Cassandra and Hive; LinkedIn created Kafka.
Some of these businesses open sourced their work; some published research papers detailing their designs, resulting in a proliferation of databases with the new technologies, and NoSQL databases emerged as a major player in the industry.
An explosion of database options since the 2000’s. Source: Korflatis et. al (2016)
NoSQL databases are schema agnostic and provide the flexibility needed to store and manipulate large volumes of unstructured and semi-structured data.
Users don’t need to know what types of data will be stored during set-up, and the system can accommodate changes in data types and schema.
Designed to distribute data across different nodes, NoSQL databases are generally more horizontally scalable and fault-tolerant.
However, these performance benefits also come with a cost — NoSQL databases are not ACID compliant and data consistency is not guaranteed. They instead provide “eventual consistency”: when old data is getting overwritten, they’d return results that are a little wrong temporarily.
For example, Google’s search engine index can’t overwrite its data while people are simultaneously searching a given term, so it doesn’t give us the most up-to-date results when we search, but it gives us the latest, best answer it can.
While this setup won’t work in situations where data consistency is absolutely necessary (such as financial transactions); it’s just fine for tasks that require speed rather than pin-point accuracy.
There are now several different categories of NoSQL, each serving some specific purposes. Key-Value Stores, such as Redis, DynamoDB, and Cosmos DB, store only key-value pairs and provide basic functionality for retrieving the value associated with a known key.
They work best with a simple database schema and when speed is important. Wide Column Stores, such as Cassandra, Scylla, and HBase, store data in column families or tables, and are built to manage petabytes of data across a massive, distributed system.
Document Stores, such as MongoDB and Couchbase, store data in XML or JSON format, with the document name as key and the contents of the document as value.
The documents can contain many different value types, and can be nested, making them particularly well-suited to manage semi-structured data across distributed systems.
Graph Databases, such as Neo4J and Amazon Neptune, represent data as a network of related nodes or objects in order to facilitate data visualizations and graph analytics.
Graph databases are particularly useful for analyzing the relationships between heterogeneous data points, such as in fraud prevention or Facebook’s friends graph.
MongoDB is currently the most popular NoSQL database, and has delivered substantial values for some businesses that have been struggling to handle their unstructured data with the traditional RDBMS approach.
Here are two industry examples: after MetLife spent years trying to build a centralized customer database on a RDBMS that could handle all its insurance products, someone at an internal hackathon built one with MongoDB within hours, which went to production in 90 days.
YouGov, a market research firm that collects 5 gigabits of data an hour, saved 70 percent of the storage capacity it formerly used by migrating from RDBMS to MongoDB.
Data Warehouse, Data Lake, & Data Swamp
As data sources continue to grow, performing data analytics with multiple databases became inefficient and costly. One solution called Data Warehouse emerged in the 1980’s, which centralizes an enterprise’s data from all of its databases.
Data Warehouse supports the flow of data from operational systems to analytics/decision systems by creating a single repository of data from various sources (both internal and external). In most cases, a Data Warehouse is a relational database that stores processed data that is optimized for gathering business insights.
It collects data with predetermined structure and schema coming from transactional systems and business applications, and the data is typically used for operational reporting and analysis.
But because data that goes into data warehouses needs to be processed before it gets stored — with today’s massive amount of unstructured data, that could take significant time and resources.
In response, businesses started maintaining Data Lakes in the 2010's, which store all of an enterprise’s structured and unstructured data at any scale. Data Lakes store raw data, and could be set up without having to first define the data structure and schema.
Data Lakes allow users to run analytics without having to move the data to a separate analytics system, enabling businesses to gain insights from new sources of data that was not available for analysis before, for instance by building machine learning models using data from log files, click-streams, social media, and IoT devices.
By making all of the enterprise data readily available for analysis, data scientists could answer a new set of business questions, or tackle old questions with new data.
Data Warehouse and Data Lake Comparisons (Source: AWS)
A common challenge with the Data Lake architecture is that without the appropriate data quality and governance framework in place, when terabytes of structured and unstructured data flow into the Data Lakes, it often becomes extremely difficult to sort through their content.
The Data Lakes could turn into Data Swamps as the stored data become too messy to be usable. Many organizations are now calling for more data governance and metadata management practices to prevent Data Swamps from forming.
Distributed & Parallel Processing: Hadoop, Spark, & MPP
While storage and computing needs grew by leaps and bounds in the last several decades, traditional hardware has not advanced enough to keep up.
Enterprise data no longer fits neatly in standard storage, and the computation power required to handle most big data analytics tasks might take weeks, months, or simply not possible to complete on a standard computer.
To overcome this deficiency, many new technologies have evolved to include multiple computers working together, distributing the database to thousands of commodity servers. When a network of computers are connected and work together to accomplish the same task, the computers form a cluster.
A cluster can be thought of as a single computer, but can dramatically improve the performance, availability, and scalability over a single, more powerful machine, and at a lower cost by using commodity hardware.
Apache Hadoop is an example of distributed data infrastructures that leverage clusters to store and process massive amounts of data, and what enables the Data Lake architecture.
Evolution of database technologies (Source: Business Analytic 3.0)
When you think Hadoop, think “distribution.” Hadoop consists of three main components: Hadoop Distributed File System (HDFS), a way to store and keep track of your data across multiple (distributed) physical hard drives; MapReduce, a framework for processing data across distributed processors; and Yet Another Resource Negotiator (YARN), a cluster management framework that orchestrates the distribution of things such as CPU usage, memory, and network bandwidth allocation across distributed computers.
Hadoop’s processing layer is an especially notable innovation: MapReduce is a two step computational approach for processing large (multi-terabyte or greater) data sets distributed across large clusters of commodity hardware in a reliable, fault-tolerant way.
The first step is to distribute your data across multiple computers (Map), with each performing a computation on its slice of the data in parallel.
The next step is to combine those results in a pair-wise manner (Reduce). Google published a paper on MapReduce in 2004, which got picked up by Yahoo programmers who implemented it in the open source Apache environment in 2006, providing every business the capability to store an unprecedented volume of data using commodity hardware.
Even though there are many open source implementations of the idea, the Google brand name MapReduce has stuck around, kind of like Jacuzzi or Kleenex.
Hadoop is built for iterative computations, scanning massive amounts of data in a single operation from disk, distributing the processing across multiple nodes, and storing the results back on disk.
Querying zettabytes of indexed data that would take 4 hours to run in a traditional data warehouse environment could be completed in 10–12 seconds with Hadoop and HBase. Hadoop is typically used to generate complex analytics models or high volume data storage applications such as retrospective and predictive analytics; machine learning and pattern matching; customer segmentation and churn analysis; and active archives.
But MapReduce processes data in batches and is therefore not suitable for processing real-time data. Apache Spark was built in 2012 to fill that gap.
Spark is a parallel data processing tool that is optimized for speed and efficiency by processing data in-memory. It operates under the same MapReduce principle, but runs much faster by completing most of the computation in memory and only writing to disk when memory is full or the computation is complete.
This in-memory computation allows Spark to “run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.” However, when the data set is so large that insufficient RAM becomes an issue (usually hundreds of gigabytes or more), Hadoop MapReduce might outperform Spark.
Spark also has an extensive set of data analytics libraries covering a wide range of functions: Spark SQL for SQL and structured data; MLib for machine learning, Spark Streaming for stream processing, and GraphX for graph analytics.
Since Spark’s focus is on computation, it does not come with its own storage system and instead runs on a variety of storage systems such as Amazon S3, Azure Storage, and Hadoop’s HDFS.
In an MPP system, all the nodes are interconnected and data could be exchanged across the network (Source: IBM)
Hadoop and Spark are not the only technologies that leverage clusters to process large volumes of data.
Another popular computational approach to distributed query processing is called Massively Parallel Processing (MPP).
Similar to MapReduce, MPP distributes data processing across multiple nodes, and the nodes process the data in parallel for faster speed. But unlike Hadoop, MPP is used in RDBMS and utilizes a “share-nothing” architecture — each node processes its own slice of the data with multi-core processors, making them many times faster than traditional RDBMS.
Some MPP databases, like Pivotal Greenplum, have mature machine learning libraries that allow for in-database analytics. However, as with traditional RDBMS, most MPP databases do not support unstructured data, and even structured data will require some processing to fit the MPP infrastructure; therefore it takes additional time and resources to set up the data pipeline for an MPP database.
Since MPP databases are ACID-compliant and deliver much faster speed than traditional RDBMS, they are usually employed in high-end enterprise data warehousing solutions such as Amazon Redshift, Pivotal Greenplum, and Snowflake. As an industry example, the New York Stock Exchange receives four to five terabytes of data daily and conducts complex analytics, market surveillance, capacity planning and monitoring.
The company had been using a traditional database that couldn’t handle the workload, which took hours to load and had poor query speed. Moving to an MPP database reduced their daily analysis run time by eight hours.
Another innovation that completely transformed enterprise big data analytics capabilities is the rise of cloud services.
In the bad old days before cloud services were available, businesses had to buy on-premises data storage and analytics solutions from software and hardware vendors, usually paying upfront perpetual software license fees and annual hardware maintenance and service fees. On top of those are the costs of power, cooling, security, disaster protection, IT staff, etc, for building and maintaining the on-premises infrastructure.
Even when it was technically possible to store and process big data, most businesses found it cost prohibitive to do so at scale.
Scaling with on-premises infrastructure also require an extensive design and procurement process, which takes a long time to implement and requires substantial upfront capital. Many potentially valuable data collection and analytics possibilities were ignored as a result.
“As a Service” providers: e.g. Infrastructure as a Service (IaaS) and Storage as a Service (STaaS) (Source: IMELGRAT.ME)
The on-premises model began to lose market share quickly when cloud services were introduced in the late 2000’s — the global cloud services market has been growing 15% annually in the past decade.
Cloud service platforms provide subscriptions to a variety of services (from virtual computing to storage infrastructure to databases), delivered over the internet on a pay-as-you-go basis, offering customers rapid access to flexible and low-cost storage and virtual computing resources.
Cloud service providers are responsible for all of their hardware and software purchases and maintenance, and usually have a vast network of servers and support staff to provide reliable services.
Many businesses discovered that they could significantly reduce costs and improve operational efficiencies with cloud services, and are able to develop and productionize their products more quickly with the out-of-the-box cloud resources and their built-in scalability.
By removing the upfront costs and time commitment to build on-premises infrastructure, cloud services also lower the barriers to adopt big data tools, and effectively democratized big data analytics for small and med-size businesses.
There are several cloud services models, with public clouds being the most common. In a public cloud, all hardware, software, and other supporting infrastructure are owned and managed by the cloud service provider.
Customers share the cloud infrastructure with other “cloud tenants” and access their services through a web browser.
A private cloud is often used by organizations with special security needs such as government agencies and financial institutions. In a private cloud, the services and infrastructure are dedicated solely to one organization and are maintained on a private network.
The private cloud can be on-premises, or hosted by a third-party service provider elsewhere. Hybrid clouds combine private clouds with public clouds, allowing organizations to reap the advantages of both.
In a hybrid cloud, data and applications can move between private and public clouds for greater flexibility: e.g. the public cloud could be used for high-volume, lower-security data, and the private cloud for sensitive, business-critical data like financial reporting.
The multi-cloud model involves multiple cloud platforms, each delivers a specific application service. A multi-cloud can be a combination of public, private, and hybrid clouds to achieve the organization’s goals. Organizations often choose multi-cloud to suit their particular business, locations, and timing needs, and to avoid vendor lock-in.
If you want to know about a case study in data management, go to this link
“Everything a Data Scientist Should Know About Data Management”– Phoebe Wong Tweet