Organizations combine similar technologies in different ways to create their own unique stacks. But there are some clear trends, and if you're starting a new team, organization, or company, it may serve you to emulate one of the existing stacks in the early days and then adapt it to your own needs as you see fit. There are also plenty of antiquated technologies out there due for an upgrade.
For each of the following stacks I've included the most-used technology in each layer. This does not cover application and model deployment — cloud choice, containers, CI/CD tooling, etc. I'll leave that for my engineering and DevOps friends to explore. This info comes from conversations with fellow data people at each company, supplemented by publicly available data.
Here are some data stacks I’ve encountered recently in talks with various Data Engineers, Data Scientists, and Analysts:
**Stack 1**
- Database: MySQL
- Warehouse: PostgreSQL, Snowflake
- ETL: Embulk, Python, Airflow
- Visualizations: Redash, Metabase
- AI/ML: None
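The "Python" in an ETL layer like this usually means small extract-transform-load scripts gluing databases together. As a minimal, hypothetical sketch using only the standard library (table and field names are invented for illustration):

```python
import sqlite3

def extract(conn):
    # Pull raw rows from a hypothetical operational table
    return conn.execute("SELECT id, amount FROM raw_orders").fetchall()

def transform(rows):
    # Normalize amounts from cents to dollars, drop rows with missing values
    return [(rid, amt / 100) for rid, amt in rows if amt is not None]

def load(conn, rows):
    # Write the cleaned rows into a warehouse-style table
    conn.executemany("INSERT INTO orders_clean VALUES (?, ?)", rows)
    conn.commit()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_orders (id INTEGER, amount INTEGER)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?)",
                     [(1, 1250), (2, None), (3, 300)])
    conn.execute("CREATE TABLE orders_clean (id INTEGER, amount REAL)")
    load(conn, transform(extract(conn)))
    print(conn.execute("SELECT * FROM orders_clean").fetchall())
    # → [(1, 12.5), (3, 3.0)]
```

In practice a scheduler like Airflow would run each of these steps as a task on a cadence; the script itself stays this simple.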
**Stack 2**
- Database: PostgreSQL
- Warehouse: PostgreSQL + Stitch
- ETL: Lots and lots of Python
- Visualizations: Matplotlib, TensorBoard (sorta counts?)
- AI/ML: TensorFlow everywhere, with some Scikit-learn and from-scratch implementations sprinkled in.
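"From scratch" in a stack like this can be as simple as hand-rolled model fitting. A minimal, hypothetical sketch of one-variable linear regression by gradient descent — data and hyperparameters are invented for illustration:

```python
def fit_line(xs, ys, lr=0.01, steps=5000):
    """Fit y = w*x + b by minimizing mean squared error with gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of mean squared error with respect to w and b
        dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * dw
        b -= lr * db
    return w, b

if __name__ == "__main__":
    xs = [1, 2, 3, 4]
    ys = [3, 5, 7, 9]  # generated from y = 2x + 1
    w, b = fit_line(xs, ys)
    print(round(w, 2), round(b, 2))  # → 2.0 1.0
```

The same loop is what libraries like TensorFlow automate (with autodiff and hardware acceleration); writing it by hand is mostly useful for odd loss functions or learning purposes.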
**Stack 3**
- Database: MongoDB (NoSQL), moving to DynamoDB (NoSQL)
- Warehouse: Amazon Redshift
- ETL: Airflow, Python
- Visualizations: Little bit of everything
- AI/ML: Decent amount of everything
**Stack 4**
- Database: SQL Server (almost exclusively Azure SQL Database)
- Warehouse: Azure Synapse (SQL DW), Snowflake
- ETL: Azure Data Factory, Python
- Visualizations: Tableau, Power BI
- Analytics: Little bit of everything
- AI/ML: Little bit of everything
**Stack 5**
- Database: Redis, SQL Server
- Warehouse: Azure Databricks (Spark)
- ETL: Azure Data Factory, Python
- Visualizations: Redash
- AI/ML: random one-offs, user’s preference
**Stack 6**
- Database: MySQL (others wandering around with lower use)
- Warehouse: Hive (Hive as primary but others are roaming about)
- ETL: 50 different tools (exaggeration, but really no structure here)
- Visualizations: subscriptions to all the major viz tools
- AI/ML: Everything under the sun, depends on the user’s preference
**Stack 7**
- Database: MySQL, Cassandra (NoSQL), custom built off another DB
- Warehouse: Hadoop & custom/from scratch
- ETL: Many different use cases led to many different tools in this layer of the stack. This company is extremely thoughtful about every decision in its stack, and has developed much of its ETL from scratch or on top of existing tools.
- Visualizations: Everyday tools like Python libraries, R, and Tableau, but they've also developed many of their own tools and open-sourced some of them.
- AI/ML: TensorFlow for deep learning, standard libraries for everyday ML, and tons of custom tooling for managing models, tracking metrics, and more.
The best way to get proficient quickly is to emulate. To be great, you need to figure out what works for you. Sure, learning some of LeBron's moves could make you a good basketball player; you might even spend countless hours trying to emulate his game. But you're not LeBron. Mimicking parts of his game might make you really good, but if, like me, you're nowhere near LeBron's superhuman capability and can't jump through the ceiling, you need to figure out what works best for your own game to become great.
Note: there are many technologies I didn't list here. Some popular ones you might not have seen include Impala (SQL engine for Hadoop), RapidMiner (analytics tool), R (programming language), PyTorch (ML library), and many others. Please don't be mad if you didn't see your favorite technology listed! It just means the small sample of people I've talked to recently don't use it day to day.
Thanks for reading!
Let’s continue the conversation on Twitter!
— “Some Common Data Science Stacks,” Luke Posey