Know your Data

Ankita Prakash
Aug 04, 2020

The most important aspect of data science is, without a doubt, data. Knowing what forms data can take and how to scrutinize it always helps. This article covers three basic types of data and then moves on to the issues that are commonly present in datasets and need to be dealt with.

The numerical datasets that you work with can be divided into three categories:

1. Time Series Data

Data varying over time for a fixed geographical area. For example, the population of India over the last 100 years.

2. Spatial Data

Data varying over different geographical areas (say, countries, states, etc.) at a fixed point in time. For example, the population of each country in the world in the year 2020.

When the geographical area has specific boundaries and refers to a particular location, such as a factory or a hospital, the data is known as cross-sectional data.

3. Categorical Data

Data varying over different categories at a fixed point in time and for a fixed geographical area. For example, the population of males and females in India in the year 2020.

The combination of categorical or spatial data with a time series is known as panel data. For example, the population of the different countries of the world over the last 10 years, or the population of males and females in India over the last 10 years.
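As a rough illustration, the shapes of data described above could be laid out in plain Python as follows. The figures are made-up placeholders rather than real population statistics, and the names `time_series`, `spatial`, `categorical`, `panel`, `slice_at_year`, and `series_for` are hypothetical, chosen only for this sketch.

```python
# Illustrative placeholder figures (in millions), NOT real population statistics.

# Time series: one place, many time points.
time_series = {2018: 10.0, 2019: 10.5, 2020: 11.0}

# Spatial: one time point, many places.
spatial = {"Country A": 11.0, "Country B": 4.2, "Country C": 7.9}

# Categorical: one place, one time point, many categories.
categorical = {"male": 5.6, "female": 5.4}

# Panel: spatial (or categorical) data observed over time,
# keyed here by (unit, year).
panel = {
    ("Country A", 2019): 10.5,
    ("Country A", 2020): 11.0,
    ("Country B", 2019): 4.1,
    ("Country B", 2020): 4.2,
}


def slice_at_year(panel_data, year):
    """Fix the time point of a panel, leaving spatial data."""
    return {unit: value for (unit, y), value in panel_data.items() if y == year}


def series_for(panel_data, unit):
    """Fix the unit of a panel, leaving a time series."""
    return {y: value for (u, y), value in panel_data.items() if u == unit}


print(slice_at_year(panel, 2020))
print(series_for(panel, "Country A"))
```

Fixing the year of a panel gives back spatial data, and fixing the unit gives back a time series, which is one way to see why panel data is described as a combination of the two.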

Photo by Luke Chesser on Unsplash

Now that you know what forms the data can take, the next step is scrutiny. There are three points that you need to look for:

1. Consistency

Say, out of 100 students appearing in an exam, 40 students pass with grade A. 5% of the total students could not pass the exam. 50 students got a grade B and the remaining students failed.

The second statement says that the number of failures is 5% of 100 students, i.e., 5 students. As per the first and last statements, however, 100 - 40 - 50 = 10 students failed. The given data is therefore not consistent: the figures do not agree with each other, so there is a discrepancy somewhere. Such data needs to be either corrected or removed.
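The arithmetic behind this check can be sketched in a few lines of Python. The helper name `exam_failure_counts` is hypothetical; the numbers are the ones from the example.

```python
def exam_failure_counts(total, grade_a, grade_b, fail_percent):
    """Return the failure count implied by each statement.

    `exam_failure_counts` is a hypothetical helper for this example only.
    """
    # Failures according to "5% of the total students could not pass".
    from_percent = total * fail_percent / 100
    # Failures according to "the remaining students failed".
    from_remainder = total - grade_a - grade_b
    return from_percent, from_remainder


from_percent, from_remainder = exam_failure_counts(100, 40, 50, 5)
is_consistent = from_percent == from_remainder
print(from_percent, from_remainder, is_consistent)
```

The two counts disagree (5 versus 10), so `is_consistent` comes out false and the record would be flagged for correction or removal.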

2. Irregularity or Abnormality

Certain phenomena behave in a regular way and follow a pattern. Consider the temperature recorded in a specific city over a year. The data is expected to have lower values in winter and higher values in summer, and the change should be gradual over the year. If there is an abrupt fluctuation, say a temperature of 12 degrees Celsius recorded in the summer, then you need to find the cause of this irregularity and consider remedial steps.
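One simple way to surface such a reading is to compare each value with the median of its neighbours. The sketch below is only one possible heuristic, not a standard method. The function name `flag_abrupt_readings`, the `window` and `max_jump` parameters, and the sample temperatures are all assumptions made up for illustration.

```python
from statistics import median


def flag_abrupt_readings(readings, window=2, max_jump=8.0):
    """Return indices of readings far from the median of nearby readings.

    `window` (neighbours on each side) and `max_jump` (degrees Celsius)
    are assumed thresholds; suitable values depend on the data.
    """
    flagged = []
    for i, value in enumerate(readings):
        lo = max(0, i - window)
        hi = min(len(readings), i + window + 1)
        # Neighbouring readings, excluding the reading itself.
        neighbours = readings[lo:i] + readings[i + 1:hi]
        if neighbours and abs(value - median(neighbours)) > max_jump:
            flagged.append(i)
    return flagged


# Hypothetical daily summer temperatures in degrees Celsius.
summer = [31.0, 32.5, 33.0, 12.0, 32.0, 31.5, 30.5]
print(flag_abrupt_readings(summer))  # [3]
```

Flagging is only the first step: the reading at index 3 still has to be investigated before deciding whether it is a recording error or a genuine event.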

Irregularity can also refer to values that cannot possibly occur in the dataset, for example, an age of 500 years for a human being. Such values are usually removed from the dataset rather than corrected, because the intended value cannot be known for certain: in this example, it could have been 0, 5, or 50.
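A minimal sketch of removing impossible values might look like this. The name `split_plausible_ages` and the upper bound of 120 years are assumptions for illustration; the appropriate bound depends on the field of enquiry.

```python
def split_plausible_ages(ages, max_age=120):
    """Separate plausible ages from impossible ones.

    `max_age=120` is an assumed cut-off, not a rule from the article.
    """
    kept = [a for a in ages if 0 <= a <= max_age]
    removed = [a for a in ages if not 0 <= a <= max_age]
    return kept, removed


# Hypothetical ages in years; 500 is the impossible value.
kept, removed = split_plausible_ages([23, 41, 500, 5, 67])
print(kept)     # [23, 41, 5, 67]
print(removed)  # [500]
```

Returning the removed values alongside the kept ones keeps a record of what was dropped, which helps when the cause of the abnormal entries is investigated later.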

3. Spurious regularity

Doubts also arise when the data fits the ideal values exactly. Chance causes are always present and lead to some fluctuation in the data values. So when experimental data is too regular, it raises the suspicion that the experiment was not actually performed and that the data was instead constructed from theoretical knowledge.
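A crude first check for this is to see whether every observation matches its theoretical value with no deviation at all. The sketch below assumes a made-up relationship (expected value equal to twice the input) and hypothetical measurements; the name `looks_too_regular` and the `tolerance` parameter are also assumptions. A match is a reason to look closer, not proof that the data was fabricated.

```python
def looks_too_regular(observed, expected, tolerance=1e-9):
    """Return True if every observation matches theory within `tolerance`.

    A hypothetical helper: real data normally shows some chance variation,
    so a perfect match is a warning sign worth investigating.
    """
    if not observed or len(observed) != len(expected):
        raise ValueError("observed and expected must be non-empty and equal in length")
    return all(abs(o - e) <= tolerance for o, e in zip(observed, expected))


# Assumed theoretical values: twice the inputs 1, 2, 3, 4.
expected = [2.0, 4.0, 6.0, 8.0]

# Hypothetical measurements.
suspicious = [2.0, 4.0, 6.0, 8.0]  # no fluctuation at all
realistic = [2.1, 3.9, 6.2, 7.8]   # small chance deviations

print(looks_too_regular(suspicious, expected))  # True
print(looks_too_regular(realistic, expected))   # False
```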

It should, however, be clear that no hard and fast rules can be laid down for the scrutiny of data. You must use your common sense, judgment, and whatever knowledge you have about the field of enquiry to assess the reliability of the data.

This article is part of a series “From A Statistician” where I talk about different tiny yet important details from the world of Statistics which will surely help one in becoming a better data scientist.
