The numerical datasets you work with can be divided into three categories:
1. Time Series Data
Data that varies over time for a fixed geographical area. For example, the population of India over the last 100 years.
2. Spatial Data
Data that varies across different geographical areas (say, countries, states, etc.) at a fixed point in time. For example, the population of every country in the world in the year 2020.
When the geographical area is defined by specific boundaries and points to a certain location like a factory, hospital, etc., it is known as cross-sectional data.
3. Categorical Data
Data that varies across different categories at a fixed point in time and for a fixed geographical area. For example, the male and female populations of India in the year 2020.
The combination of categorical or spatial data with a time series is known as panel data. For example, the population of the different countries of the world over the last 10 years, or the male and female populations of India over the last 10 years.
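The categories above can be sketched as a small check on which dimension of a dataset actually varies. This is only an illustrative sketch: the function name `classify_dataset`, the `(time, place, category, value)` record layout, and the placeholder values are assumptions made for this example, not real figures.

```python
def classify_dataset(records):
    """Classify records of the form (time, place, category, value).

    The record layout is an assumption made for this sketch.
    """
    varies_time = len({r[0] for r in records}) > 1
    varies_place = len({r[1] for r in records}) > 1
    varies_category = len({r[2] for r in records}) > 1

    # Time combined with place or category gives panel data.
    if varies_time and (varies_place or varies_category):
        return "panel"
    if varies_time:
        return "time series"
    if varies_place and not varies_category:
        return "spatial"
    if varies_category and not varies_place:
        return "categorical"
    return "unclassified"


# Placeholder values only; places "A" and "B" are hypothetical.
one_place_over_time = [(2019, "A", "all", 1.0), (2020, "A", "all", 1.1)]
many_places_one_year = [(2020, "A", "all", 1.1), (2020, "B", "all", 2.3)]
places_over_time = one_place_over_time + [(2019, "B", "all", 2.2), (2020, "B", "all", 2.3)]

print(classify_dataset(one_place_over_time))   # time series
print(classify_dataset(many_places_one_year))  # spatial
print(classify_dataset(places_over_time))      # panel
```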
Now that you know what kind of data you have, the next step is to scrutinize it. These are the points you need to look for:
1. Inconsistency
Say, out of 100 students appearing in an exam, 40 students pass with grade A. 5% of the total students could not pass the exam. 50 students got grade B and the remaining students failed.
The second statement says that the number of failures is 5% of 100 students, i.e., 5 students, whereas the last statement implies that 100 - 40 - 50 = 10 students failed. The given data is therefore not consistent: the figures do not agree with one another, so there is a discrepancy somewhere. Such data needs to be either corrected or removed.
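The arithmetic in this example can be written as a quick cross-check. The sketch below simply derives the number of failures in the two ways the statements allow and compares them; the function name `check_exam_counts` is made up for illustration.

```python
def check_exam_counts(total, grade_a, grade_b, fail_percent):
    """Derive the number of failures in two ways and compare them."""
    # Failures according to the stated percentage of all students.
    failed_from_percent = total * fail_percent / 100
    # Failures according to "the remaining students failed".
    failed_from_remainder = total - grade_a - grade_b
    consistent = failed_from_percent == failed_from_remainder
    return failed_from_percent, failed_from_remainder, consistent


# Figures taken from the exam example above.
print(check_exam_counts(100, 40, 50, 5))  # (5.0, 10, False)
```

A `False` in the last position signals that at least one of the reported figures must be wrong, although it does not tell you which one.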
2. Irregularity or Abnormality
Certain phenomena behave in a regular way and follow a pattern. Consider the temperature recorded in a specific city over a year. The data is expected to have lower values in the winter season and higher values in the summer season, and this change should be gradual over the year. If there is a sudden fluctuation, say a temperature of 12 degrees Celsius recorded in the summer season, then you need to find out the cause behind this irregularity and consider remedial steps.
This can also refer to values that cannot possibly occur in the dataset, for example, an age of 500 years for a human being. Such values are usually removed from the dataset rather than corrected, because the intended value could have been any of several candidates, such as 0, 5, or 50 in this example. Hence, it cannot be known for certain what the abnormal value should be replaced with.
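Both kinds of abnormality can be screened for with a simple range check. The sketch below is a minimal illustration: the helper `flag_out_of_range`, the sample readings, and the plausibility bounds (0 to 120 years for age, 25 to 45 degrees Celsius for a summer reading) are all assumptions chosen for the example. In practice the bounds should come from your knowledge of the field.

```python
def flag_out_of_range(values, low, high):
    """Return (index, value) pairs for readings outside [low, high]."""
    return [(i, v) for i, v in enumerate(values) if not (low <= v <= high)]


# Hypothetical ages in years; 0-120 is an assumed plausibility limit.
ages = [34, 500, 27]
print(flag_out_of_range(ages, 0, 120))          # [(1, 500)]

# Hypothetical summer readings in degrees Celsius; 25-45 is an assumed seasonal range.
summer_temps = [31, 33, 12, 32]
print(flag_out_of_range(summer_temps, 25, 45))  # [(2, 12)]
```

The check only flags values for review. Whether a flagged value is investigated, corrected, or removed is still a judgment call, as described above.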
3. Spurious Regularity
Doubts also arise when the data matches the ideal values exactly. Very regular experimental data raises the suspicion that the experiment was never actually performed and that the data was instead constructed from theoretical knowledge, because chance causes are always present and lead to some fluctuation in the data values.
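One rough way to screen for this is to compare the observed values with the theoretical ones and see whether every deviation is essentially zero. The sketch below is only an illustration: the function name `looks_too_regular`, the tolerance, and the sample values are assumptions, and a match flagged this way is a reason to ask questions, not proof that the data was made up.

```python
def looks_too_regular(observed, expected, tolerance=1e-9):
    """Return True if every observation matches its theoretical value.

    Real measurements normally show some chance fluctuation, so an
    exact match on every point deserves a closer look.
    """
    return all(abs(o - e) <= tolerance for o, e in zip(observed, expected))


# Hypothetical theoretical values and two hypothetical sets of readings.
expected = [10.0, 20.0, 30.0]
print(looks_too_regular([10.0, 20.0, 30.0], expected))  # True
print(looks_too_regular([10.2, 19.7, 30.1], expected))  # False
```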
It should, however, be obvious that no hard and fast rules can be laid down for the scrutiny of data. You must use your common sense, judgment, and whatever knowledge you have about the field of enquiry to assess the reliability of the data.
This article is part of a series “From A Statistician” where I talk about different tiny yet important details from the world of Statistics which will surely help one in becoming a better data scientist.
“Know your Data” – Ankita Prakash