Six Recommendations for Aspiring Data Scientists

Ben Weber
Apr 26, 2020


Building experience before landing a job


Data science is a field in huge demand, in part because it seems to require experience as a data scientist to be hired as a data scientist. But many of the best data scientists I’ve worked with have diverse backgrounds ranging from the humanities to neuroscience, and it takes demonstrated experience to stand out. As a new grad or an analytics professional making the jump to a data science career, it can be challenging to build a portfolio of work that demonstrates expertise in this space. I’ve been on both sides of the hiring process for data science roles and wanted to call out some of the key experiences that can help land a job as a data scientist:

  1. Get hands-on with cloud computing
  2. Create a new data set
  3. Glue things together
  4. Stand up a service
  5. Create a stunning visualization
  6. Write a white paper

I’ll expand on these topics in detail, but the key theme in data science is being able to build data products that add value to a company. A data scientist who can build these end-to-end data products is a valuable asset, and it’s useful to demonstrate these skills when pursuing a data science career.

Get hands-on with cloud computing

Many companies are looking for data scientists with past experience in cloud computing environments, because these platforms provide tools that enable data workflows and predictive models to scale to massive volumes. It’s also likely that you’ll be using a cloud platform, such as Amazon Web Services (AWS) or Google Cloud Platform (GCP), in your everyday work.

The good news is that many of these platforms provide free tiers for becoming familiar with them. For example, AWS has free-tier EC2 instances and free usage of services such as Lambda for low-volume requests, GCP offers $300 of free credit to try out most of the platform, and Databricks provides a community edition that you can use to get hands-on with the platform. With these free options you won’t be able to work with massive data sets, but you can build experience on these platforms.

One of my recommendations is to try out different features on these platforms, and see if you can use some of the tools to train and deploy models. For example, for my model services post I leveraged a tool I was already familiar with, SKLearn, and investigated how to wrap a model as a Lambda function.
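To make this concrete, here’s a minimal sketch of what wrapping a model as a Lambda function can look like, assuming a pickled SKLearn model bundled with the deployment package. The model file name and payload format here are hypothetical:

```python
# Minimal sketch of an AWS Lambda handler serving a scikit-learn model.
# The file name "model.pkl" and the {"features": [...]} payload are
# assumptions for illustration, not a fixed convention.
import json
import pickle

# Load the model once, outside the handler, so warm invocations reuse it.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

def lambda_handler(event, context):
    # Expect a JSON body such as {"features": [1.0, 2.0, 3.0]}
    body = json.loads(event.get("body", "{}"))
    features = [body["features"]]
    prediction = model.predict(features)[0]
    return {
        "statusCode": 200,
        "body": json.dumps({"prediction": float(prediction)}),
    }
```

In practice, fitting SKLearn and its dependencies within Lambda’s package size limits is often the trickiest part of this kind of setup.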

Create a new data set

In academic courses and data science competitions, you’re often provided a clean data set where the focus of the project is on exploratory data analysis or modeling. However, for most real-world projects you’ll need to perform some data munging in order to clean a raw data set into a transformed data set that is more useful for an analysis or modeling task. Often, data munging requires collecting additional data sets in order to transform the data. For example, I worked with Federal Reserve data in a past role in order to better understand the asset allocation of affluent households in the US.

This was an interesting project, because I worked with third-party data in order to measure the accuracy of first-party data. My second recommendation is to actually go a step further and build a data set of your own. This can include scraping a website, sampling data from an endpoint (e.g. SteamSpy), or aggregating different data sources into a new data set. For example, I created a custom data set of StarCraft replays during my graduate studies, which demonstrated my ability to perform data munging on a novel data set.
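As an illustration, here’s a short sketch of building a data set by sampling a public endpoint. It uses SteamSpy’s appdetails request; treat the exact parameters and returned fields as assumptions to verify against the current API documentation:

```python
# Sketch of building a small data set by sampling the SteamSpy endpoint.
# The appdetails request and the app IDs below are examples; verify the
# parameters against the current SteamSpy API docs before relying on them.
import time

import pandas as pd
import requests

app_ids = [570, 730, 440]  # example Steam app IDs
records = []

for app_id in app_ids:
    resp = requests.get(
        "https://steamspy.com/api.php",
        params={"request": "appdetails", "appid": app_id},
    )
    records.append(resp.json())
    time.sleep(1)  # be polite: space out requests to the endpoint

# Aggregate the sampled responses into a new data set
df = pd.DataFrame(records)
df.to_csv("steamspy_sample.csv", index=False)
```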


Glue things together

One of the skills that I like to see data scientists demonstrate is the ability to make different components or systems work together in order to accomplish a task. In a data science role, there may not be a clear path to productizing a model, and you may need to build something unique in order to get a system up and running. Ideally a data science team will have engineering support for getting systems up and running, but prototyping is a great skill for a data scientist who needs to move quickly.

My recommendation here is to try to get different systems or components to integrate within a data science workflow. This can involve getting hands-on with tools such as Airflow in order to prototype a data pipeline. It can involve creating a bridge between different systems, such as the JNI-BWAPI project I started to interface the StarCraft Brood War API library with Java. Or it can involve gluing different components together within a platform, such as using GCP Dataflow to pull data from BigQuery, apply a predictive model, and store the results in Cloud Datastore.
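For the Airflow case, a prototype pipeline can be as small as the sketch below, with one task that extracts data and one that applies a model. The task bodies are placeholders, and the DAG id and schedule are assumptions; note that import paths vary across Airflow versions:

```python
# Prototype Airflow DAG gluing two steps together: extract data, then
# apply a model. In Airflow 2.x the operator import path moved to
# airflow.operators.python.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def extract():
    # placeholder: e.g. pull rows from a warehouse into a staging file
    pass


def score():
    # placeholder: e.g. apply a predictive model to the staged rows
    pass


with DAG(
    dag_id="prototype_scoring_pipeline",
    start_date=datetime(2020, 4, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    score_task = PythonOperator(task_id="score", python_callable=score)

    # the extract step must complete before scoring runs
    extract_task >> score_task
```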

Stand up a service

As a data scientist, you’ll often need to stand up services that other teams can use within your company. For example, this could be a Flask app that provides results from a deep learning model. Being able to prototype services means that other teams will be able to use your data products more quickly.

My recommendation is to get hands-on experience with tools such as Flask or Gunicorn, in order to set up web endpoints, and Dash, in order to create interactive web applications in Python. It’s also useful practice to try setting up one of these services in a Docker container.
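Here’s a minimal sketch of a Flask endpoint that serves predictions from a pickled model; the model file name and payload format are placeholders:

```python
# Minimal sketch of a Flask service that returns model predictions.
# The "model.pkl" file and the {"features": [...]} payload are assumptions.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the model once at startup
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body such as {"features": [1.0, 2.0, 3.0]}
    payload = request.get_json()
    features = [payload["features"]]
    prediction = model.predict(features)[0]
    return jsonify({"prediction": float(prediction)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

To mirror the Gunicorn suggestion, the same app could be served with `gunicorn app:app` (assuming the file is named `app.py`), and wrapping that command in a Dockerfile is a natural next step.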

Create a stunning visualization

While great work should stand on its own, it’s often necessary to first get your audience’s attention before explaining why an analysis or model is important. My recommendation here is to learn a variety of visualization tools in order to create compelling visualizations that stand out.

Creating visualizations is also a useful way of building up a portfolio of work. I’ve previously shared a sample of the different tools and datasets I explored over 10 years as a data scientist.
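As a starting point, here’s a small matplotlib sketch showing the kind of labeling and styling touches that help a chart stand out; the data is synthetic:

```python
# Sketch of a polished chart with matplotlib: clear title, labeled axes,
# and decluttered spines. The data here is synthetic, for illustration only.
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
x = np.arange(100)
y = np.cumsum(rng.normal(0, 1, size=100))  # simulated metric over time

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(x, y, color="steelblue", linewidth=2)
ax.set_title("Simulated Metric Over Time")
ax.set_xlabel("Day")
ax.set_ylabel("Cumulative value")

# Remove chart clutter so the data stands out
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)

fig.tight_layout()
fig.savefig("sample_viz.png", dpi=150)
```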

Write a white paper

One of the data science skills that I’ve been advocating for recently is the ability to explain projects in the form of a white paper that provides an executive summary, discusses how the work can be used, details the methodology, and presents the results. The goal is to make your research digestible by a wide audience and self-explanatory, so that other data scientists can build upon it.

Blogging and other forms of writing are great ways to build experience and improve your written communication. My recommendation here is to try writing data science articles for broad audiences, in order to get experience conveying ideas at different levels of detail.

Conclusion

Data science requires hands-on experience with a number of tools. Luckily, many of these tools are becoming more accessible, making it easier than ever to build out a data science portfolio.