“It was Friday evening. I clearly remember how excited I was to spend the rest of the day with my family. My parents had traveled to Bangalore for the first time, and I already had plans to show them the city. I had completed my work for the day, and Friday evenings are usually less hectic in any organization. I was about to leave when I suddenly received an email from my stakeholder asking for a very old report that we had stopped delivering a year ago. I was disappointed, but I knew that standard codes usually don’t take much effort to run. Alas! My hypothesis was wrong. Karma did bite me, and I ended up spending the entire evening……”
You must be wondering how this story relates to the coding skills required for a data science job, or a few of you might have already guessed the connection between my story and the rest of the article. Let me start by making a statement: “Organizations today are using data science as one of the key levers at every stage of the decision cycle to drive key business strategies.” But what constitutes a data science problem? How does a Data Analyst, a Business Analyst, or a Data Scientist work in an organization?
Any data science problem is broken down into two parts: a set of “activities” and a handful of “best practice processes”. Data gathering, data cleaning, data wrangling, hypothesis testing, model development, validation, and many more come under “activities”. Most of us working in analytics organizations, or aspiring to become data scientists, are quite familiar with such terminology, but when it comes to “best practice processes”, little is known or followed.
From my 4 years of experience working in this industry, I can tell you that as part of “best practice processes”, a lot of emphasis is placed on project management, repository building, documentation, communication, and code maintenance. As a data scientist, you are expected to adhere to the “3 C’s: Consistency, Communication, and Consumption”. Your work should be consistent, you should communicate every business intricacy to your stakeholders, and most importantly, your work should be consumed.
Coding in the data science industry is very different from software development. Your coding skills are not restricted to your technical know-how; they also require a good amount of data and business understanding. Today I am going to talk about “Consistency” and how to bring it into your coding practices. A lot of these best practices are based on my 3 years of experience working with Mu Sigma Business Solutions and the challenges I have faced to date. The following 5 sniff tests will give you a brief understanding of how “Math + Business + Data + Technology = Data Science”.
1. Is your code readable enough?
Well-formatted and commented code is a code master’s heaven. It helps you debug your code easily and ensures smoother quality checks. Data science teams follow the concept of Peer Quality Checks (QC) to ensure error-free output, and it is considered a best practice to get your code reviewed by a peer before you deliver the final output to your stakeholder. Readable code includes:
- A header block at the beginning with the Project Name, Purpose of the Code, Version, Author Name, Date of Creation, Date of Modification, Last Modified By, and a summary of changes
- A one-line description of every code snippet before it begins (this is how coders keep track of the explicit business rules or filters being used)
- Proper indentation of every snippet, with enough space between two snippets
- A proper naming convention for table names. Instead of writing a snippet like “create table a”, one can always write “create table customer_volume_summary”. This makes the table intuitive even without going through the rest of the code segment
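As an illustration, the points above can be sketched in a short Python snippet. The header fields, table/variable names, and the business rule in the comments are all hypothetical, not from a real project:

```python
# -----------------------------------------------------------------
# Project       : Customer Analytics (hypothetical example)
# Purpose       : Summarize transaction volume per customer
# Version       : 1.1 | Author : A. Analyst
# Created       : 2020-01-10 | Last Modified By : A. Analyst
# Changes       : 1.1 - exclude failed transactions from volume
# -----------------------------------------------------------------

# Snippet 1: keep only successful transactions
# (business rule: failed payment attempts must not count towards volume)
transactions = [
    {"customer_id": "C1", "status": "SUCCESS"},
    {"customer_id": "C1", "status": "FAILED"},
    {"customer_id": "C2", "status": "SUCCESS"},
]
successful_transactions = [t for t in transactions if t["status"] == "SUCCESS"]

# Snippet 2: count transaction volume per customer, stored under an
# intuitive name instead of an opaque one like "a" or "tmp"
customer_volume_summary = {}
for t in successful_transactions:
    cid = t["customer_id"]
    customer_volume_summary[cid] = customer_volume_summary.get(cid, 0) + 1

print(customer_volume_summary)  # {'C1': 1, 'C2': 1}
```

Even on a toy example like this, a reviewer doing a peer QC can follow the business rules from the comments alone, without reverse-engineering the logic.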
2. Does your code have reusable modules?
Many times you will end up working on datasets with a similar schema but different filters on different attributes (columns), depending on the business problem at stake. It might also be the other way round, where you end up using the same table to summarize different scenarios for the same business problem.
E.g. say you have a Customer dataset that contains customer id, transaction id, purchase date, product type, and sales figures. You are asked to find the top customers contributing to 80% of your sales for each product type. In a normal scenario, you would create one set of code that does this for a specific product type, copy-paste it, and reuse it subsequently with a different filter on the product type. A good coder would instead create a user-defined module that takes the product type and sales threshold as inputs and generates the desired output. Reusable modules can be created on all platforms; they save unnecessary lines and ensure an easy QC.
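A minimal sketch of such a reusable module in Python. The function name, record layout, and default threshold are my own illustration, not from the article:

```python
def top_customers(sales_records, product_type, sales_threshold=0.8):
    """Return the customers who together contribute `sales_threshold`
    (e.g. 0.8 = 80%) of total sales for the given product type."""
    # Aggregate sales per customer for this product type only
    per_customer = {}
    for rec in sales_records:
        if rec["product_type"] == product_type:
            cid = rec["customer_id"]
            per_customer[cid] = per_customer.get(cid, 0.0) + rec["sales"]

    total = sum(per_customer.values())
    # Walk customers from biggest to smallest until the threshold is met
    top, running = [], 0.0
    for cust, amount in sorted(per_customer.items(), key=lambda kv: -kv[1]):
        top.append(cust)
        running += amount
        if running >= sales_threshold * total:
            break
    return top


records = [
    {"customer_id": "C1", "product_type": "laptop", "sales": 700},
    {"customer_id": "C2", "product_type": "laptop", "sales": 200},
    {"customer_id": "C3", "product_type": "laptop", "sales": 100},
]
print(top_customers(records, "laptop"))  # ['C1', 'C2'] (700 + 200 >= 80% of 1000)
```

One call per product type replaces one copy-pasted block per product type, and a fix made once in the module applies everywhere.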
3. Does your output meet data and business sense checks?
In any organization, delivering correct figures is key to a project’s success. A lot of business decisions are taken based on the numbers we report, and even a small inaccuracy can have a big impact on the organization’s decisions.
Let’s take the above example of the Customer dataset. You are asked to identify the top 100 loyal customers based on their transaction volume. Based on your recommendation, the organization will give these customers a 30% discount coupon for the upcoming Summer Sale.
The whole idea of this analysis is to identify loyal customers and influence them to purchase more; the organization is taking a hit on its sales to ensure a higher volume of transactions. Now, a lot of us are unaware that the datasets used by most organizations don’t come in the cleanest form and need to be manipulated before they can be consumed. E.g. a customer paying his bills might experience a failed transaction, which is still recorded by the system. While calculating transaction volume, such transaction ids should be eliminated, or else we will end up with an incorrect estimate of every customer’s transaction volume. The challenge is to identify such anomalies. Here are some of the checks you must perform and capture while coding:
- Check the level of the dataset before you start any manipulation. A level is defined as a single column, or a combination of columns, that uniquely identifies a record in a database/table. This will help you identify duplicate entries and prevent double counting
- Run quick descriptive statistics on your dataset. This helps you figure out the distribution of your data and any missing values
- While joining two or more tables, ensure that they are at the same level. Keep track of the number of records before and after every join statement. This will help you identify multiple mappings or double counting, if any
- Have your organization’s key performance indicators at your fingertips. This will help you baseline numbers at every step. In the above example, the total number of customers making a purchase is a key indicator of your organization’s performance. Based on your organization’s annual report, you know that ~12,000 customers purchased a product from the website. However, on querying the dataset, you find only 8,000 customer ids. Is your data correct? Recheck your code or flag a data issue to your stakeholder
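The first check above, verifying the level of a table, can be sketched in plain Python. The `check_level` helper and the toy Customer records are hypothetical:

```python
from collections import Counter


def check_level(records, level_cols):
    """Return the keys that appear more than once; an empty result
    means `level_cols` uniquely identify every record (i.e. they
    are a valid level for this table)."""
    keys = [tuple(rec[col] for col in level_cols) for rec in records]
    return [key for key, count in Counter(keys).items() if count > 1]


customer_data = [
    {"customer_id": "C1", "transaction_id": "T1", "sales": 50},
    {"customer_id": "C1", "transaction_id": "T2", "sales": 30},
    {"customer_id": "C1", "transaction_id": "T2", "sales": 30},  # duplicate entry
]

# customer_id alone is clearly not the level (C1 repeats), and the
# duplicate row means even (customer_id, transaction_id) does not
# hold until the table is de-duplicated:
print(check_level(customer_data, ["customer_id", "transaction_id"]))  # [('C1', 'T2')]
```

Running a check like this before and after every join, together with a simple record count, catches most double-counting issues before they reach a stakeholder.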
4. Is your code Input Resilient?
Input resilience implies that your code should be able to produce an output irrespective of the input type. The hardest part of any coding exercise is inducing input resilience, which in turn makes code reusable. An analyst might receive similar data requests from different stakeholders, so the ideal scenario is to write code that can work across different business requirements.
E.g. the Sales Heads of both the Electronics and Cosmetics divisions want to understand their respective product sales for different customer ids. Since the analyst has worked on laptops and tablets before, he/she knows that all product names linked to electronic items are captured in lower case. However, to check cosmetic products, he/she might first have to filter out all products and see how transactions related to cosmetics are captured in the Customer dataset. Instead, a simple use of UPPER() on the product column can prevent such unnecessary checks. In a real-world scenario, it is difficult to make every piece of code input resilient, but one needs to think of all possible exceptions that can be handled.
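In Python, the same normalization idea might look like this. The record layout and the function itself are illustrative assumptions:

```python
def count_product_transactions(records, product_name):
    """Count transactions for a product regardless of how the
    product name was cased (or padded) when the data was captured."""
    target = product_name.strip().upper()
    return sum(
        1
        for rec in records
        if rec["product_name"].strip().upper() == target
    )


records = [
    {"product_name": "laptop"},    # electronics captured in lower case
    {"product_name": "Lipstick"},  # cosmetics captured in mixed case
    {"product_name": "LAPTOP "},   # stray trailing whitespace
]

print(count_product_transactions(records, "Laptop"))  # 2, regardless of casing
```

Normalizing case (and whitespace) at the point of comparison means the same function serves both divisions, with no per-division investigation of how values were captured.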
5. Induce Exception Handling
Input resilience and exception handling might sound similar, but they work differently. Let me ask you a simple question: how many records do you think the Customer dataset of an organization has? 20,252 records, as shown in the example above? You’ve got to be kidding me! Any Customer dataset that holds transaction information for different products will have millions of records. Querying such datasets in SQL, R, Python, or even Alteryx takes hours. Now imagine a scenario where you need to query such tables for different products. How do you induce exception handling?
- All programming languages allow you to set up your execution process so that one failed query stops the execution of subsequent snippets. This lets you make the necessary corrections instantly once a condition fails or an error is thrown, instead of waiting a long time for the entire code to execute
- Keep a check on your query time. Querying larger datasets can often take more time than expected due to concurrent usage, i.e. multiple users querying the same dataset at the same time. Ensure your code auto-stops once execution goes beyond a certain run time
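Both points can be combined in a fail-fast sketch in Python, under the assumption that each query is wrapped as a callable. The names `run_queries` and `QueryTimeout` are mine, not a library API:

```python
import time


class QueryTimeout(Exception):
    """Raised when the run-time budget for the whole batch is exhausted."""


def run_queries(queries, max_seconds=2.0):
    """Run (name, callable) queries in order. Stop at the first failure,
    or once the total run-time budget is exhausted, instead of letting
    later snippets run against bad or missing intermediate results."""
    results = []
    start = time.monotonic()
    for name, query_fn in queries:
        if time.monotonic() - start > max_seconds:
            raise QueryTimeout(f"run-time budget exhausted before '{name}'")
        try:
            results.append((name, query_fn()))
        except Exception as exc:
            # Fail fast: subsequent snippets depend on this output
            raise RuntimeError(f"query '{name}' failed: {exc}") from exc
    return results


queries = [
    ("volume", lambda: 42),
    ("broken", lambda: 1 / 0),  # simulated failed query
    ("never_runs", lambda: 99),
]
# run_queries(queries) raises RuntimeError at "broken";
# "never_runs" is never executed, so no time is wasted on it.
```

In real pipelines the callables would issue SQL over a database connection, and the timeout would typically be enforced server-side as well, but the fail-fast structure is the same.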
If your coding skills are good enough, you can find a job as a data scientist here.
“Are Your Coding Skills Good Enough For A Data Science Job?” – Angel Das