Why you’re not a job-ready data scientist (yet)

Jeremie Harris
Jul 06, 2020

If there’s one thing I’ve learned from the data science mentorship startup I work at, it’s this: getting feedback on your data science job application or interview is virtually impossible.

There are good reasons that companies are cagey about giving feedback. For one, every piece of feedback a company gives to a rejected applicant is a potential lawsuit. Plus, there’s the fact that many people don’t respond well to negative feedback, and some get downright combative.

And just imagine the time it would take for a recruiter to send a thoughtful feedback email to you, and to the dozens (or hundreds) of other applicants they also have to consider. On top of that, at the end of the day, companies get absolutely nothing out of giving feedback, no matter how helpful or obvious it may be.

The tragic end result of all this is a huge number of confused, directionless aspiring data scientists. But here’s some good news: there aren’t actually that many reasons why applicants get turned down from data science roles, and there’s a lot you can do to cover those bases.

And those reasons — the technical and nontechnical skills that most applicants don’t have but that companies most badly want — are what this post is all about.

Reason 1: Python-for-data-science skills

The vast majority of data science roles are Python-based, so that’s what I’ll focus on here. A few tools distinguish novices from job-ready pros when it comes to Python for DS. They’re great differentiators if you want to build outstanding projects that get noticed by employers.

To force yourself to improve your data science theory and implementation game, use these in a few projects, if you haven’t already:
  • Data exploration. You should have pandas functions like .corr(), scatter_matrix(), .hist() and .plot.bar() at your fingertips. You should always be looking for opportunities to visualize your data using PCA or t-SNE, using sklearn's PCA and TSNE classes.
  • Feature selection. 90% of the time, your dataset will have way more features than you need (which leads to excessive training time, and a heightened risk of overfitting). Get familiar with basic filter methods (look up scikit-learn’s VarianceThreshold and SelectKBest classes), and more sophisticated model-based feature selection methods (look up SelectFromModel).
  • Hyperparameter search for model optimization. You definitely should know what GridSearchCV does and how it works. Likewise for RandomizedSearchCV. To really stand out, try experimenting with skopt's BayesSearchCV to learn how you can apply Bayesian optimization to your hyperparameter search.
  • Pipelines. Use sklearn's pipeline module to wrap your preprocessing, feature selection and modeling steps together. Discomfort with pipelines is a huge tell that a data scientist needs to get more familiar with their modeling toolkit.
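
To make the last three items concrete, here is a minimal sketch of how feature selection, a hyperparameter search and a pipeline can fit together. The synthetic dataset, the step names, the choice of LogisticRegression and the parameter grid are all assumptions picked for illustration, not a recommended setup.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): 30 features, only 5 informative.
X, y = make_classification(
    n_samples=500, n_features=30, n_informative=5, n_redundant=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each step is fitted on the training folds only, which avoids leaking
# information from the validation folds into preprocessing or selection.
pipeline = Pipeline(
    steps=[
        ("drop_constant", VarianceThreshold()),  # removes zero-variance features
        ("scale", StandardScaler()),
        ("select", SelectKBest(score_func=f_classif)),  # simple filter method
        ("model", LogisticRegression(max_iter=1000)),
    ]
)

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "select__k": [5, 10, 20],
    "model__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Held-out F1:", search.score(X_test, y_test))
```

Because the search wraps the whole pipeline, the number of selected features is tuned alongside the model's own hyperparameters, and swapping GridSearchCV for RandomizedSearchCV only changes the search object.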

Reason 2: probability and statistics knowledge

Probability and statistics don’t always come up explicitly on the job, but they’re foundational to all data science work. As a result, it’s easy to bomb an interview if you haven’t read up on:
  • Bayes’ theorem. It’s a foundational pillar of probability theory, and it comes up all the time in interviews. You should practice doing some basic Bayes’ theorem whiteboarding problems, and read the first chapter of this famous book to get a rock-solid understanding of the origin and meaning of the rule (bonus: it’s actually a fun read!).
  • Basic probability. You should be able to answer questions like these.
  • Model evaluation. In classification problems, for example, most n00bs default to using model accuracy as their metric, which is usually a terrible choice. Get comfortable with sklearn's precision_score, recall_score, f1_score, and roc_auc_score functions, and the theory behind them. For regression tasks, understanding why you would use mean_squared_error rather than mean_absolute_error (and vice-versa) is also crucial. It’s really worth taking the time to check out all the model evaluation metrics listed in sklearn's official documentation.
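
A typical whiteboard problem ties the first and last items together: a test for a rare condition comes back positive, so how likely is it that the condition is really there? The sketch below works that out with Bayes' theorem and compares it with the test's accuracy. The function names and the numbers (1% prevalence, 90% recall, 10% false positive rate) are hypothetical, chosen only for illustration.

```python
def bayes_posterior(prior, recall, false_positive_rate):
    """Return P(condition | positive) using Bayes' theorem.

    prior: P(condition)
    recall: P(positive | condition)
    false_positive_rate: P(positive | no condition)
    """
    true_positives = recall * prior
    false_positives = false_positive_rate * (1 - prior)
    return true_positives / (true_positives + false_positives)


def accuracy(prior, recall, false_positive_rate):
    """Return the fraction of all cases the test labels correctly."""
    true_positives = recall * prior
    true_negatives = (1 - false_positive_rate) * (1 - prior)
    return true_positives + true_negatives


# Hypothetical numbers, chosen only for illustration.
PRIOR = 0.01  # 1% of the population has the condition
RECALL = 0.90  # the test catches 90% of real cases
FALSE_POSITIVE_RATE = 0.10  # and wrongly flags 10% of healthy cases

posterior = bayes_posterior(PRIOR, RECALL, FALSE_POSITIVE_RATE)
test_accuracy = accuracy(PRIOR, RECALL, FALSE_POSITIVE_RATE)

print(f"P(condition | positive) = {posterior:.3f}")
print(f"accuracy = {test_accuracy:.3f}")
```

With these numbers the test is right about 90% of the time, yet only about 8% of positive results are real. That posterior is exactly what precision measures, which is why accuracy alone can look great on imbalanced data while the model is close to useless.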

Reason 3: software engineering know-how

Increasingly, data scientists are required to take on software engineering work. Many employers insist that applicants understand how to manage their code and keep clean notebooks and scripts. In particular:
  • Version control. You should know how to use git, and interact with your remote GitHub repos using the command line. If you don’t, I suggest starting with this tutorial.
  • Web development. Some companies like their data scientists to be comfortable accessing data that’s stored on their web app, or via an API. Getting comfortable with the basics of web development is important, and the best way to do that is to learn a bit of Flask.
  • Web scraping. Sort of related to web development: sometimes, you’ll need to automate data collection by scraping data from live websites. Two great tools to consider for this are BeautifulSoup and scrapy.
  • Clean code. Learn how to use docstrings. Don’t overuse inline comments. Break your functions up into smaller functions. Way smaller. There shouldn’t be functions in your code longer than 10 lines of code. Give your functions good, descriptive names (function_1 is not a good name). Follow pythonic convention and name your variables with underscores like_this and not LikeThis or likeThis. Don’t write Python modules (.py files) with more than 400 lines of code. Each module should have a clear purpose (e.g. data_processing.py, predict.py). Learn what an if __name__ == '__main__': code block does and why it’s important. Use list comprehensions. Don’t over-use for loops. Add a README file to your project.
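
As a rough illustration of the clean-code points, here is a tiny module in that style: short functions, docstrings, descriptive snake_case names, a list comprehension instead of a loop, and a main guard. The module and its function names are hypothetical, made up for this sketch.

```python
"""Summarize word lengths in a piece of text (a hypothetical example module)."""


def split_into_words(text):
    """Return the whitespace-separated words in `text`."""
    return text.split()


def word_lengths(words):
    """Return the length of each word, using a list comprehension."""
    return [len(word) for word in words]


def mean(values):
    """Return the arithmetic mean of `values`, or 0.0 if it is empty."""
    return sum(values) / len(values) if values else 0.0


def mean_word_length(text):
    """Return the average word length in `text`."""
    return mean(word_lengths(split_into_words(text)))


if __name__ == "__main__":
    # Runs only when the file is executed directly, not when it is imported.
    print(mean_word_length("clean code is short code"))  # prints 4.0
```

The guard at the bottom is what lets another script import mean_word_length without triggering the print call as a side effect.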

Reason 4: business instinct

An alarming number of people seem to think that getting hired is about showing that you’re the most technically competent applicant to a role. It’s not. In reality, companies want to hire people who can help them make more money, faster.

In general that means moving beyond just technical ability, and building a number of additional skills:
  • Making something people want. When most people are in “data science learning mode”, they follow a very predictable series of steps: import data, explore data, clean data, visualize data, model data, evaluate model. And that’s fine when you’re focused on learning a new library or technique, but going on autopilot is a really bad habit in a business environment, where everything you do costs the company time (money). You’ll want to get good at thinking like a business, and making good guesses as to how you can best leverage your time to make meaningful contributions to your team and company. A great way to do this is to decide on some questions that you want your data science projects to answer before you begin them (so that you don’t get carried away with irrelevant tasks that form part of the otherwise “standard” DS workflow). Make these questions as practical as possible, and after you’ve completed your project, reflect on how well you were able to answer them.
  • Asking the right questions. Companies want to hire people who are able to keep the big picture in mind while they tune their models, and ask themselves questions like, “am I building this because it’s going to be legitimately helpful to my team and company, or because it’s a cool use case for an algorithm I really like?” and “what key business metric am I trying to optimize, and is there a better way to do that?”.
  • Explaining your results. Management needs you to tell them what products are selling well, or which users are leaving for a competitor and why, but they have no idea what a precision/recall curve is (and don’t care), or how hard it was for you to avoid overfitting your model. For that reason, a key skill is the ability to convey your results and their implications to nontechnical audiences. Try building a project and explaining it to a friend who hasn’t taken math since high school (hint: your explanation shouldn’t involve any algorithm names, or refer to hyperparameter tuning. Simple words are better words.).

Of course, no list like this one can be exhaustive, but from what I’ve seen coaching hundreds of early career data scientists through the job application and interview process (and from talking to our hiring partners themselves), it probably accounts for 70% of the rejections people get.

Keep in mind that other, less well-defined things like personality fit can often be a factor, too. If you didn’t get along with your interviewer, or if the conversation felt strained or awkward, it’s always possible that your technical qualifications are solid, but that you didn’t check the culture fit box. Companies regularly turn down applicants who would have been amazing technical performers for exactly this reason, so don’t take a rejection or two too much to heart!

If you are ready, you can find a data science job here.

If you want to get in touch, connect with me on Twitter anytime at @jeremiecharris.