Following the last post about data management and everything a data scientist should know, we now have a case study.
It requires an understanding of how all the parts of the enterprise’s ecosystem work together: where and how data flows into the data team, the environment where the data is processed and transformed, the enterprise’s conventions for visualizing and presenting data, and how the model output will be converted into input for other enterprise applications.
The main goals are to build a process that is easy to maintain; where models can be iterated on and their performance reproduced; and where the model’s output can be easily understood and visualized by other stakeholders so that they can make better-informed business decisions.
Achieving those goals requires selecting the right tools, as well as an understanding of what others in the industry are doing and what the best practices are.
Let’s illustrate with a scenario: suppose you just got hired as the lead data scientist for a vacation recommendation app startup that is expected to collect hundreds of gigabytes of both structured (customer profiles, temperatures, prices, and transaction records) and unstructured (customers’ posts/comments and image files) data from users daily.
Your predictive models would need to be retrained with new data weekly and make recommendations instantaneously on demand. Since you expect your app to be a huge hit, your data collection, storage, and analytics capacity would have to be extremely scalable.
How would you design your data science process and productionize your models? What are the tools that you’d need to get the job done? Since this is a startup and you are the lead — and perhaps the only — data scientist, it’s on you to make these decisions.
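One pattern that fits the requirements above (weekly retraining, instantaneous recommendations) is to separate offline training from online serving: retrain on a schedule, persist the model as an artifact, and have the serving path simply load and query it. Here is a minimal, hedged sketch of that split — the toy `MeanRecommender` and the file paths are purely illustrative, not a real recommendation algorithm:

```python
import pickle

class MeanRecommender:
    """Toy model: ranks destinations by their average past rating."""

    def fit(self, ratings):
        # ratings: iterable of (destination, score) pairs
        totals = {}
        for dest, score in ratings:
            s, n = totals.get(dest, (0.0, 0))
            totals[dest] = (s + score, n + 1)
        self.scores = {d: s / n for d, (s, n) in totals.items()}
        return self

    def recommend(self, k=3):
        return sorted(self.scores, key=self.scores.get, reverse=True)[:k]

def retrain_and_save(ratings, path):
    """Offline step: run weekly on fresh data, write the model artifact."""
    model = MeanRecommender().fit(ratings)
    with open(path, "wb") as f:
        pickle.dump(model, f)
    return model

def load_and_serve(path):
    """Online step: load the latest artifact and answer requests from it."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

The key design choice is that the serving path never trains; it only reads an artifact, so recommendation latency stays independent of training cost.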
First, you’d have to figure out how to set up the data pipeline that takes in the raw data from data sources, processes the data, and feeds the processed data to databases.
The ideal data pipeline has the following qualities:

- Low event latency: the ability to query data as soon as it’s been collected.
- Scalability: able to handle massive amounts of data as your product scales.
- Interactive querying: support for both batch queries and smaller interactive queries that allow data scientists to explore the tables and schemas.
- Versioning: the ability to make changes to the pipeline without bringing it down and losing data.
- Monitoring: the pipeline should generate alerts when data stops coming in.
- Testing: the ability to test the pipeline without interruptions.
Perhaps most importantly, it had better not interfere with daily business operations — e.g. heads will roll if the new model you’re testing causes your operational database to grind to a halt.
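To make the monitoring requirement concrete, here is a minimal sketch of a freshness check — the kind of logic a pipeline would run on a timer to alert when data stops coming in. The 15-minute threshold is an illustrative assumption, not a standard:

```python
from datetime import datetime, timedelta, timezone

def check_pipeline_health(last_event_time, now=None,
                          max_silence=timedelta(minutes=15)):
    """Return an alert message if no events have arrived within
    max_silence, or None if the pipeline looks healthy."""
    now = now or datetime.now(timezone.utc)
    silence = now - last_event_time
    if silence > max_silence:
        return f"ALERT: no events for {silence}"
    return None
```

In a real deployment the `last_event_time` would come from pipeline metadata (e.g. the newest ingestion timestamp), and the alert would be routed to a paging or chat system rather than returned as a string.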
Building and maintaining the data pipeline is usually the responsibility of a data engineer (for more details, this article has an excellent overview on building the data pipeline for startups), but a data scientist should at least be familiar with the process, its limitations, and the tools needed to access the processed data for analysis.
Next, you’d have to decide if you want to set up on-premises infrastructure or use cloud services.
For a startup, the top priority is to scale data collection without scaling operational resources. As mentioned earlier, on-premises infrastructure requires huge upfront and maintenance costs, so cloud services tend to be a better option for startups.
Cloud services allow scaling to match demand and require minimal maintenance efforts, so that your small team of staff could focus on the product and analytics instead of infrastructure management.
In order to choose a cloud service provider, you’d have to first establish the data that you’d need for analytics, and the databases and analytics infrastructure most suitable for those data types.
Since there’d be both structured and unstructured data in your analytics pipeline, you might want to set up both a Data Warehouse and a Data Lake.
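The division of labor between the two stores can be sketched in a few lines: structured records (e.g. transactions) go into warehouse tables with a fixed schema, while unstructured records (posts, comments) land in the lake as raw files. This is a hedged stand-in — SQLite plays the warehouse and a local directory plays the lake, and the record fields are hypothetical:

```python
import json
import sqlite3
import tempfile
from pathlib import Path

def route_record(record, conn, lake_dir):
    """Send structured transaction records to a warehouse table,
    and drop unstructured records into the lake as raw JSON files."""
    if record.get("type") == "transaction":
        conn.execute(
            "INSERT INTO transactions (customer_id, amount) VALUES (?, ?)",
            (record["customer_id"], record["amount"]),
        )
        conn.commit()
    else:
        Path(lake_dir, f"{record['id']}.json").write_text(json.dumps(record))

# Usage sketch with an in-memory warehouse and a temp-directory lake:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customer_id TEXT, amount REAL)")
lake = tempfile.mkdtemp()
route_record({"type": "transaction", "customer_id": "c1", "amount": 99.0}, conn, lake)
route_record({"type": "post", "id": "p1", "text": "loved the beach"}, conn, lake)
```

The point of the pattern is that the lake preserves raw, schema-less data for later processing, while the warehouse holds cleaned, queryable tables.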
An important thing to consider for data scientists is whether the storage layer supports the big data tools that are needed to build the models, and if the database provides effective in-database analytics.
For example, some ML libraries such as Spark’s MLlib cannot be used effectively with databases as the main interface for data — the data would have to be unloaded from the database before it can be operated on, which could be extremely time-consuming as data volume grows and might become a bottleneck when you have to retrain your models regularly (thus causing another “heads-rolling” situation).
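The “unload before training” step above is worth visualizing, since its cost grows linearly with data volume. Here is a minimal sketch of that pattern — streaming rows out of a database in batches before any model can see them (SQLite and the chunk size are illustrative stand-ins for whatever database and batch size you actually use):

```python
import sqlite3

def unload_in_chunks(conn, query, chunk_size=10_000):
    """Stream query results out of the database in fixed-size batches —
    the 'unload' step that precedes training when the ML library
    can't operate on data inside the database."""
    cur = conn.execute(query)
    while True:
        rows = cur.fetchmany(chunk_size)
        if not rows:
            break
        yield rows
```

Every weekly retrain pays this full export cost, which is exactly why in-database analytics (or a storage layer your ML tools can read directly) matters at scale.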
For data science in the cloud, most cloud providers are working hard to develop their native machine learning capabilities that allow data scientists to build and deploy machine learning models easily with data stored in their own platform (Amazon has SageMaker, Google has BigQuery ML, Microsoft has Azure Machine Learning).
But the toolsets are still developing and often incomplete: for example, BigQuery ML currently only supports linear regression, binary and multiclass logistic regression, K-means clustering, and TensorFlow model importing.
If you decide to use these tools, you’d have to test their capabilities thoroughly to make sure they do what you need them to do.
Another major consideration when choosing a cloud provider is vendor lock-in. If you choose a proprietary cloud database solution, you most likely won’t be able to access the software or the data in your local environment, and switching vendors would require migrating to a different database, which could be costly.
One way to address this problem is to choose vendors that support open source technologies (here’s Netflix explaining why they use open source software). Another advantage of using open source technologies is that they tend to attract a larger community of users, meaning it’d be easier for you to hire someone who has the experience and skills to work within your infrastructure.
Another way to address the problem is to choose third-party vendors (such as Pivotal Greenplum and Snowflake) that provide cloud database solutions using other major cloud providers as storage backend, which also allows you to store your data in multiple clouds if that fits your startup’s needs.
Finally, since you expect the company to grow, you’d have to put in place a robust cloud management practice to secure your cloud and prevent data loss and leakages — such as managing data access and securing interfaces and APIs.
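Managing data access typically starts with something like role-based permissions. The sketch below is a deliberately minimal illustration of the idea — the role names, resources, and actions are all hypothetical, and a real deployment would use your cloud provider’s IAM rather than an in-process dictionary:

```python
# Hypothetical role -> resource -> allowed-actions mapping.
ROLE_PERMISSIONS = {
    "data_scientist": {"warehouse": {"read"}},
    "data_engineer": {"warehouse": {"read", "write"},
                      "lake": {"read", "write"}},
}

def is_allowed(role, resource, action):
    """Check whether a role may perform an action on a resource;
    anything not explicitly granted is denied."""
    return action in ROLE_PERMISSIONS.get(role, {}).get(resource, set())
```

The default-deny stance (unknown roles and resources get an empty permission set) is the property worth carrying over to whatever access-management system you actually adopt.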
You’d also want to implement data governance best practices to maintain data quality and ensure your Data Lake won’t turn into a Data Swamp.
As you can see, there’s so much more in an enterprise data science project than tuning the hyperparameters in your machine learning models!
We hope this high-level overview has gotten you excited to learn more about data management, and maybe pick up a few things to impress the data engineers at the water cooler.
“Building the End-to-End Data Science Infrastructure for a Recommendation App Startup” – Phoebe Wong