Using Pandas Profiling to Accelerate Our Exploratory Analysis

Pandas Profiling is a library that generates reports from a pandas DataFrame. The df.describe() function that we normally use in Pandas is great, but it is a bit basic for a more serious and detailed exploratory data analysis. pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis.

The following statistics for each column are presented in an interactive report:

  • Type inference: detect the types of columns in a DataFrame
  • Essentials: type, unique values, missing values
  • Quantile statistics such as minimum value, Q1, median, Q3, maximum, range, interquartile range
  • Descriptive statistics such as mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
  • Most frequent values
  • Histograms
  • Highlighting of highly correlated variables, with Spearman, Pearson and Kendall matrices
  • Missing values: matrix, count, heatmap and dendrogram of missing values
  • Text analysis: learn about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data
  • File and image analysis: extract file sizes, creation dates and dimensions, and scan for images that are truncated or contain EXIF information

For this notebook we will be working with the dataset found at the following link: Meteorite landings

This comprehensive dataset from The Meteoritical Society contains information on all known meteorite landings. It is interesting to observe the places on Earth where these objects have fallen, using the coordinates in the dataset.

Meteorite fall map


Let's start by importing the dataset, in order to get a feel for the data we will be working with.

In [1]:
import pandas as pd

We have saved the Meteorite Landings dataset in the 'datasets' folder of the current working environment, so we select the corresponding path for the import.

In [2]:
df = pd.read_csv("datasets/Meteorite_Landings.csv")

And now we check the data

In [4]:
df.head()
Out[4]:
name id nametype recclass mass (g) fall year reclat reclong GeoLocation
0 Aachen 1 Valid L5 21.0 Fell 01/01/1880 12:00:00 AM 50.77500 6.08333 (50.775, 6.08333)
1 Aarhus 2 Valid H6 720.0 Fell 01/01/1951 12:00:00 AM 56.18333 10.23333 (56.18333, 10.23333)
2 Abee 6 Valid EH4 107000.0 Fell 01/01/1952 12:00:00 AM 54.21667 -113.00000 (54.21667, -113.0)
3 Acapulco 10 Valid Acapulcoite 1914.0 Fell 01/01/1976 12:00:00 AM 16.88333 -99.90000 (16.88333, -99.9)
4 Achiras 370 Valid L6 780.0 Fell 01/01/1902 12:00:00 AM -33.16667 -64.95000 (-33.16667, -64.95)
In [5]:
df.shape
Out[5]:
(45716, 10)

It is a very interesting dataset, where we can observe the name that scientists gave to the meteorite, its class (recclass), its weight in grams (mass (g)), the date on which it fell, and the coordinates where it fell.

It is also worth noting that this is a fairly complete dataset, with 45,716 records and 10 columns. This information is given by .shape.

For more information about the dataset, see: Meteorite - Nasa

Now, as we mentioned at the beginning, with Pandas we are used to running the .describe() method to generate a descriptive analysis of the dataset in question. Descriptive statistics summarize the central tendency, dispersion, and shape of a dataset's distribution, excluding NaN values.

It analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output varies depending on what is provided. For more information about .describe() and the parameters we can pass, see: Pandas describe
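As a quick illustration of how those parameters change the output, here is a minimal sketch on an invented toy DataFrame (the frame and its values are hypothetical, not taken from the meteorite dataset): by default describe() summarizes only numeric columns, while include='all' also reports count/unique/top/freq for the object columns.

```python
import pandas as pd

# Hypothetical toy DataFrame, invented for this example
toy = pd.DataFrame({
    "mass_g": [21.0, 720.0, 107000.0],
    "recclass": ["L5", "H6", "EH4"],
})

# Default: numeric columns only
default_summary = toy.describe()

# include='all' adds unique/top/freq statistics for object columns
full_summary = toy.describe(include="all")
print(full_summary)
```

The same idea applies to our meteorite DataFrame: passing include='all' is the quickest way to get describe() to say anything at all about string columns such as recclass.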

Now let's run this method on our dataset and see the result.

In [6]:
df.describe()
Out[6]:
id mass (g) reclat reclong
count 45716.000000 4.558500e+04 38401.000000 38401.000000
mean 26889.735104 1.327808e+04 -39.122580 61.074319
std 16860.683030 5.749889e+05 46.378511 80.647298
min 1.000000 0.000000e+00 -87.366670 -165.433330
25% 12688.750000 7.200000e+00 -76.714240 0.000000
50% 24261.500000 3.260000e+01 -71.500000 35.666670
75% 40656.750000 2.026000e+02 0.000000 157.166670
max 57458.000000 6.000000e+07 81.166670 354.473330

The describe() method skips the categorical (string) columns and produces descriptive statistics for the numeric columns. Here we can see that the id column is probably not useful for this analysis, since it is just a unique identifier for each row (the table's primary key), while the mass is useful and interesting to understand through, for example, its minimum and maximum values, mean, and percentiles (25, 50, 75).

As we can see, this is a very basic analysis, without much relevant information. If we want more, we have to start writing code.

This is where Pandas profiling comes in, and where its usefulness lies. The documentation of this library can be found at the following link: Pandas profiling. The installation is done as follows.

Installing and importing Pandas Profiling

In [7]:
!pip3 install 'pandas-profiling[notebook,html]'
Requirement already satisfied: pandas-profiling[html,notebook] in ./lib/python3.6/site-packages (2.9.0)
  WARNING: pandas-profiling 2.9.0 does not provide the extra 'html'
Requirement already satisfied: pandas!=1.0.0,!=1.0.1,!=1.0.2,!=1.1.0,>=0.25.3 in ./lib/python3.6/site-packages (from pandas-profiling[html,notebook]) (1.1.4)
Requirement already satisfied: numpy>=1.16.0 in ./lib/python3.6/site-packages (from pandas-profiling[html,notebook]) (1.19.4)
Requirement already satisfied: joblib in ./lib/python3.6/site-packages (from pandas-profiling[html,notebook]) (0.17.0)
Requirement already satisfied: confuse>=1.0.0 in ./lib/python3.6/site-packages (from pandas-profiling[html,notebook]) (1.3.0)
Requirement already satisfied: attrs>=19.3.0 in ./lib/python3.6/site-packages (from pandas-profiling[html,notebook]) (20.3.0)
Requirement already satisfied: tangled-up-in-unicode>=0.0.6 in ./lib/python3.6/site-packages (from pandas-profiling[html,notebook]) (0.0.6)
Requirement already satisfied: missingno>=0.4.2 in ./lib/python3.6/site-packages (from pandas-profiling[html,notebook]) (0.4.2)
Requirement already satisfied: phik>=0.9.10 in ./lib/python3.6/site-packages (from pandas-profiling[html,notebook]) (0.10.0)
Requirement already satisfied: seaborn>=0.10.1 in ./lib/python3.6/site-packages (from pandas-profiling[html,notebook]) (0.11.0)
Requirement already satisfied: ipywidgets>=7.5.1 in ./lib/python3.6/site-packages (from pandas-profiling[html,notebook]) (7.5.1)
Requirement already satisfied: scipy>=1.4.1 in ./lib/python3.6/site-packages (from pandas-profiling[html,notebook]) (1.5.4)
Requirement already satisfied: visions[type_image_path]==0.5.0 in ./lib/python3.6/site-packages (from pandas-profiling[html,notebook]) (0.5.0)
Requirement already satisfied: requests>=2.23.0 in ./lib/python3.6/site-packages (from pandas-profiling[html,notebook]) (2.24.0)
Requirement already satisfied: jinja2>=2.11.1 in ./lib/python3.6/site-packages (from pandas-profiling[html,notebook]) (2.11.2)
Requirement already satisfied: tqdm>=4.43.0 in ./lib/python3.6/site-packages (from pandas-profiling[html,notebook]) (4.51.0)
Requirement already satisfied: htmlmin>=0.1.12 in ./lib/python3.6/site-packages (from pandas-profiling[html,notebook]) (0.1.12)
Requirement already satisfied: matplotlib>=3.2.0 in ./lib/python3.6/site-packages (from pandas-profiling[html,notebook]) (3.3.2)
Requirement already satisfied: jupyter-client>=6.0.0; extra == "notebook" in ./lib/python3.6/site-packages (from pandas-profiling[html,notebook]) (6.1.7)
Requirement already satisfied: jupyter-core>=4.6.3; extra == "notebook" in ./lib/python3.6/site-packages (from pandas-profiling[html,notebook]) (4.6.3)
Requirement already satisfied: python-dateutil>=2.7.3 in ./lib/python3.6/site-packages (from pandas!=1.0.0,!=1.0.1,!=1.0.2,!=1.1.0,>=0.25.3->pandas-profiling[html,notebook]) (2.8.1)
Requirement already satisfied: pytz>=2017.2 in ./lib/python3.6/site-packages (from pandas!=1.0.0,!=1.0.1,!=1.0.2,!=1.1.0,>=0.25.3->pandas-profiling[html,notebook]) (2020.4)
Requirement already satisfied: pyyaml in ./lib/python3.6/site-packages (from confuse>=1.0.0->pandas-profiling[html,notebook]) (5.3.1)
Requirement already satisfied: numba>=0.38.1 in ./lib/python3.6/site-packages (from phik>=0.9.10->pandas-profiling[html,notebook]) (0.51.2)
Requirement already satisfied: traitlets>=4.3.1 in ./lib/python3.6/site-packages (from ipywidgets>=7.5.1->pandas-profiling[html,notebook]) (4.3.3)
Requirement already satisfied: ipython>=4.0.0; python_version >= "3.3" in ./lib/python3.6/site-packages (from ipywidgets>=7.5.1->pandas-profiling[html,notebook]) (7.16.1)
Requirement already satisfied: ipykernel>=4.5.1 in ./lib/python3.6/site-packages (from ipywidgets>=7.5.1->pandas-profiling[html,notebook]) (5.3.4)
Requirement already satisfied: nbformat>=4.2.0 in ./lib/python3.6/site-packages (from ipywidgets>=7.5.1->pandas-profiling[html,notebook]) (5.0.8)
Requirement already satisfied: widgetsnbextension~=3.5.0 in ./lib/python3.6/site-packages (from ipywidgets>=7.5.1->pandas-profiling[html,notebook]) (3.5.1)
Requirement already satisfied: networkx>=2.4 in ./lib/python3.6/site-packages (from visions[type_image_path]==0.5.0->pandas-profiling[html,notebook]) (2.5)
Requirement already satisfied: imagehash; extra == "type_image_path" in ./lib/python3.6/site-packages (from visions[type_image_path]==0.5.0->pandas-profiling[html,notebook]) (4.1.0)
Requirement already satisfied: Pillow; extra == "type_image_path" in ./lib/python3.6/site-packages (from visions[type_image_path]==0.5.0->pandas-profiling[html,notebook]) (8.0.1)
Requirement already satisfied: certifi>=2017.4.17 in ./lib/python3.6/site-packages (from requests>=2.23.0->pandas-profiling[html,notebook]) (2020.11.8)
Requirement already satisfied: chardet<4,>=3.0.2 in ./lib/python3.6/site-packages (from requests>=2.23.0->pandas-profiling[html,notebook]) (3.0.4)
Requirement already satisfied: idna<3,>=2.5 in ./lib/python3.6/site-packages (from requests>=2.23.0->pandas-profiling[html,notebook]) (2.10)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in ./lib/python3.6/site-packages (from requests>=2.23.0->pandas-profiling[html,notebook]) (1.25.11)
Requirement already satisfied: MarkupSafe>=0.23 in ./lib/python3.6/site-packages (from jinja2>=2.11.1->pandas-profiling[html,notebook]) (1.1.1)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.3 in ./lib/python3.6/site-packages (from matplotlib>=3.2.0->pandas-profiling[html,notebook]) (2.4.7)
Requirement already satisfied: cycler>=0.10 in ./lib/python3.6/site-packages (from matplotlib>=3.2.0->pandas-profiling[html,notebook]) (0.10.0)
Requirement already satisfied: kiwisolver>=1.0.1 in ./lib/python3.6/site-packages (from matplotlib>=3.2.0->pandas-profiling[html,notebook]) (1.3.1)
Requirement already satisfied: tornado>=4.1 in ./lib/python3.6/site-packages (from jupyter-client>=6.0.0; extra == "notebook"->pandas-profiling[html,notebook]) (6.1)
Requirement already satisfied: pyzmq>=13 in ./lib/python3.6/site-packages (from jupyter-client>=6.0.0; extra == "notebook"->pandas-profiling[html,notebook]) (19.0.2)
Requirement already satisfied: six>=1.5 in ./lib/python3.6/site-packages (from python-dateutil>=2.7.3->pandas!=1.0.0,!=1.0.1,!=1.0.2,!=1.1.0,>=0.25.3->pandas-profiling[html,notebook]) (1.15.0)
Requirement already satisfied: llvmlite<0.35,>=0.34.0.dev0 in ./lib/python3.6/site-packages (from numba>=0.38.1->phik>=0.9.10->pandas-profiling[html,notebook]) (0.34.0)
Requirement already satisfied: setuptools in ./lib/python3.6/site-packages (from numba>=0.38.1->phik>=0.9.10->pandas-profiling[html,notebook]) (50.2.0)
Requirement already satisfied: ipython-genutils in ./lib/python3.6/site-packages (from traitlets>=4.3.1->ipywidgets>=7.5.1->pandas-profiling[html,notebook]) (0.2.0)
Requirement already satisfied: decorator in ./lib/python3.6/site-packages (from traitlets>=4.3.1->ipywidgets>=7.5.1->pandas-profiling[html,notebook]) (4.4.2)
Requirement already satisfied: pygments in ./lib/python3.6/site-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.5.1->pandas-profiling[html,notebook]) (2.7.2)
Requirement already satisfied: pickleshare in ./lib/python3.6/site-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.5.1->pandas-profiling[html,notebook]) (0.7.5)
Requirement already satisfied: prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0 in ./lib/python3.6/site-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.5.1->pandas-profiling[html,notebook]) (3.0.8)
Requirement already satisfied: pexpect; sys_platform != "win32" in ./lib/python3.6/site-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.5.1->pandas-profiling[html,notebook]) (4.8.0)
Requirement already satisfied: jedi>=0.10 in ./lib/python3.6/site-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.5.1->pandas-profiling[html,notebook]) (0.17.2)
Requirement already satisfied: backcall in ./lib/python3.6/site-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.5.1->pandas-profiling[html,notebook]) (0.2.0)
Requirement already satisfied: jsonschema!=2.5.0,>=2.4 in ./lib/python3.6/site-packages (from nbformat>=4.2.0->ipywidgets>=7.5.1->pandas-profiling[html,notebook]) (3.2.0)
Requirement already satisfied: notebook>=4.4.1 in ./lib/python3.6/site-packages (from widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[html,notebook]) (6.1.5)
Requirement already satisfied: PyWavelets in ./lib/python3.6/site-packages (from imagehash; extra == "type_image_path"->visions[type_image_path]==0.5.0->pandas-profiling[html,notebook]) (1.1.1)
Requirement already satisfied: wcwidth in ./lib/python3.6/site-packages (from prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0->ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.5.1->pandas-profiling[html,notebook]) (0.2.5)
Requirement already satisfied: ptyprocess>=0.5 in ./lib/python3.6/site-packages (from pexpect; sys_platform != "win32"->ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.5.1->pandas-profiling[html,notebook]) (0.6.0)
Requirement already satisfied: parso<0.8.0,>=0.7.0 in ./lib/python3.6/site-packages (from jedi>=0.10->ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.5.1->pandas-profiling[html,notebook]) (0.7.1)
Requirement already satisfied: pyrsistent>=0.14.0 in ./lib/python3.6/site-packages (from jsonschema!=2.5.0,>=2.4->nbformat>=4.2.0->ipywidgets>=7.5.1->pandas-profiling[html,notebook]) (0.17.3)
Requirement already satisfied: importlib-metadata; python_version < "3.8" in ./lib/python3.6/site-packages (from jsonschema!=2.5.0,>=2.4->nbformat>=4.2.0->ipywidgets>=7.5.1->pandas-profiling[html,notebook]) (2.0.0)
Requirement already satisfied: Send2Trash in ./lib/python3.6/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[html,notebook]) (1.5.0)
Requirement already satisfied: nbconvert in ./lib/python3.6/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[html,notebook]) (6.0.7)
Requirement already satisfied: terminado>=0.8.3 in ./lib/python3.6/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[html,notebook]) (0.9.1)
Requirement already satisfied: prometheus-client in ./lib/python3.6/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[html,notebook]) (0.8.0)
Requirement already satisfied: argon2-cffi in ./lib/python3.6/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[html,notebook]) (20.1.0)
Requirement already satisfied: zipp>=0.5 in ./lib/python3.6/site-packages (from importlib-metadata; python_version < "3.8"->jsonschema!=2.5.0,>=2.4->nbformat>=4.2.0->ipywidgets>=7.5.1->pandas-profiling[html,notebook]) (3.4.0)
Requirement already satisfied: testpath in ./lib/python3.6/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[html,notebook]) (0.4.4)
Requirement already satisfied: defusedxml in ./lib/python3.6/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[html,notebook]) (0.6.0)
Requirement already satisfied: bleach in ./lib/python3.6/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[html,notebook]) (3.2.1)
Requirement already satisfied: pandocfilters>=1.4.1 in ./lib/python3.6/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[html,notebook]) (1.4.3)
Requirement already satisfied: nbclient<0.6.0,>=0.5.0 in ./lib/python3.6/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[html,notebook]) (0.5.1)
Requirement already satisfied: jupyterlab-pygments in ./lib/python3.6/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[html,notebook]) (0.1.2)
Requirement already satisfied: entrypoints>=0.2.2 in ./lib/python3.6/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[html,notebook]) (0.3)
Requirement already satisfied: mistune<2,>=0.8.1 in ./lib/python3.6/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[html,notebook]) (0.8.4)
Requirement already satisfied: cffi>=1.0.0 in ./lib/python3.6/site-packages (from argon2-cffi->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[html,notebook]) (1.14.3)
Requirement already satisfied: packaging in ./lib/python3.6/site-packages (from bleach->nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[html,notebook]) (20.4)
Requirement already satisfied: webencodings in ./lib/python3.6/site-packages (from bleach->nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[html,notebook]) (0.5.1)
Requirement already satisfied: nest-asyncio in ./lib/python3.6/site-packages (from nbclient<0.6.0,>=0.5.0->nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[html,notebook]) (1.4.2)
Requirement already satisfied: async-generator in ./lib/python3.6/site-packages (from nbclient<0.6.0,>=0.5.0->nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[html,notebook]) (1.10)
Requirement already satisfied: pycparser in ./lib/python3.6/site-packages (from cffi>=1.0.0->argon2-cffi->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[html,notebook]) (2.20)
WARNING: You are using pip version 20.2.3; however, version 20.2.4 is available.
You should consider upgrading via the '/home/daniel/Desktop/narrativetext_project/notebooks/bin/python -m pip install --upgrade pip' command.

Note that the package name must be passed in quotes, followed by notebook,html; this is because we will need these two extras of the library.

If you are using Conda, these are the other ways to install it: Installing Pandas profiling

Creating relevant columns to the analysis

Now we are going to create a series of columns that will be relevant to the analysis we will do with Pandas Profiling. The first step is to create a constant variable for all the records; this time we will say that all the records belong to NASA, which we do as follows:

In [8]:
df['source'] = "NASA"
In [9]:
df.head()
Out[9]:
name id nametype recclass mass (g) fall year reclat reclong GeoLocation source
0 Aachen 1 Valid L5 21.0 Fell 01/01/1880 12:00:00 AM 50.77500 6.08333 (50.775, 6.08333) NASA
1 Aarhus 2 Valid H6 720.0 Fell 01/01/1951 12:00:00 AM 56.18333 10.23333 (56.18333, 10.23333) NASA
2 Abee 6 Valid EH4 107000.0 Fell 01/01/1952 12:00:00 AM 54.21667 -113.00000 (54.21667, -113.0) NASA
3 Acapulco 10 Valid Acapulcoite 1914.0 Fell 01/01/1976 12:00:00 AM 16.88333 -99.90000 (16.88333, -99.9) NASA
4 Achiras 370 Valid L6 780.0 Fell 01/01/1902 12:00:00 AM -33.16667 -64.95000 (-33.16667, -64.95) NASA

As we can see, the column was created successfully. We are now going to create a boolean variable at random, simulating some kind of boolean output for each record.

Remember that this is done so that our exploratory analysis can identify this type of data in the result.

In [11]:
# we import numpy; it should have been installed along with Pandas. If you don't have it,
# you can install it with the `pip3 install numpy` command

import numpy as np
In [12]:
# numpy is going to help us create those random booleans in the next line of code

df['boolean'] = np.random.choice([True, False], df.shape[0])
In [13]:
df.head()
Out[13]:
name id nametype recclass mass (g) fall year reclat reclong GeoLocation source boolean
0 Aachen 1 Valid L5 21.0 Fell 01/01/1880 12:00:00 AM 50.77500 6.08333 (50.775, 6.08333) NASA False
1 Aarhus 2 Valid H6 720.0 Fell 01/01/1951 12:00:00 AM 56.18333 10.23333 (56.18333, 10.23333) NASA False
2 Abee 6 Valid EH4 107000.0 Fell 01/01/1952 12:00:00 AM 54.21667 -113.00000 (54.21667, -113.0) NASA True
3 Acapulco 10 Valid Acapulcoite 1914.0 Fell 01/01/1976 12:00:00 AM 16.88333 -99.90000 (16.88333, -99.9) NASA False
4 Achiras 370 Valid L6 780.0 Fell 01/01/1902 12:00:00 AM -33.16667 -64.95000 (-33.16667, -64.95) NASA False

As we can see, the boolean column was created with random True or False values for each row of our dataset. This is thanks to df.shape[0], which gives the number of rows in the dataset; in other words, the choice was made 45,716 times, once per record.

Let's now do something similar, but mixing numeric and categorical (string) data types.

In [14]:
df['mixed'] = np.random.choice([1, 'A'], df.shape[0])
In [15]:
df.head()
Out[15]:
name id nametype recclass mass (g) fall year reclat reclong GeoLocation source boolean mixed
0 Aachen 1 Valid L5 21.0 Fell 01/01/1880 12:00:00 AM 50.77500 6.08333 (50.775, 6.08333) NASA False A
1 Aarhus 2 Valid H6 720.0 Fell 01/01/1951 12:00:00 AM 56.18333 10.23333 (56.18333, 10.23333) NASA False 1
2 Abee 6 Valid EH4 107000.0 Fell 01/01/1952 12:00:00 AM 54.21667 -113.00000 (54.21667, -113.0) NASA True A
3 Acapulco 10 Valid Acapulcoite 1914.0 Fell 01/01/1976 12:00:00 AM 16.88333 -99.90000 (16.88333, -99.9) NASA False 1
4 Achiras 370 Valid L6 780.0 Fell 01/01/1902 12:00:00 AM -33.16667 -64.95000 (-33.16667, -64.95) NASA False 1

As we can see, here we are simulating a column with two types of data mixed together, numeric and categorical. This is something we can find in real datasets, and Pandas' describe() will simply ignore it, giving us no analysis of that column (remember that describe() only reports on numeric columns; it even ignores boolean columns too).

Now let's do something even more interesting. We are going to create a new column simulating a high correlation with an existing column. In particular, we will base it on the reclat column, the latitude where the meteorite fell, and add normally distributed noise with a standard deviation of 5 and a sample size equal to the length of the dataset.

If you want to see how to generate normally distributed random numbers with NumPy, check this link: Random normal numpy

In [16]:
df['reclat_city'] = df['reclat'] + np.random.normal(scale=5, size=(len(df)))
In [17]:
df.head()
Out[17]:
name id nametype recclass mass (g) fall year reclat reclong GeoLocation source boolean mixed reclat_city
0 Aachen 1 Valid L5 21.0 Fell 01/01/1880 12:00:00 AM 50.77500 6.08333 (50.775, 6.08333) NASA False A 45.108012
1 Aarhus 2 Valid H6 720.0 Fell 01/01/1951 12:00:00 AM 56.18333 10.23333 (56.18333, 10.23333) NASA False 1 44.845134
2 Abee 6 Valid EH4 107000.0 Fell 01/01/1952 12:00:00 AM 54.21667 -113.00000 (54.21667, -113.0) NASA True A 56.421836
3 Acapulco 10 Valid Acapulcoite 1914.0 Fell 01/01/1976 12:00:00 AM 16.88333 -99.90000 (16.88333, -99.9) NASA False 1 20.536132
4 Achiras 370 Valid L6 780.0 Fell 01/01/1902 12:00:00 AM -33.16667 -64.95000 (-33.16667, -64.95) NASA False 1 -24.426722

Looking at the result of the last command, we can see that the reclat_city column now has a high correlation with reclat: when one value is positive the other tends to be positive, and when one is negative, so is the other.

To analyze correlations with Pandas we use a different method from describe(): the corr() method. With Pandas profiling, however, we will obtain both analyses (descriptive statistics and correlations) with a single command. We will see this in a moment when we run our exploratory analysis.
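To make the corr() idea concrete, here is a minimal, self-contained sketch on synthetic data that mimics the construction above (the column names mirror reclat and reclat_city, but the values are invented for the example):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic "latitudes" and a noisy copy, as in the reclat_city construction
lat = rng.uniform(-90, 90, size=1000)
noisy = lat + rng.normal(scale=5, size=lat.size)

toy = pd.DataFrame({"reclat": lat, "reclat_city": noisy})

# corr() returns the pairwise correlation matrix (Pearson by default)
corr_matrix = toy.corr()
print(corr_matrix.loc["reclat", "reclat_city"])
```

Because the noise (standard deviation 5) is small relative to the spread of the latitudes, the Pearson coefficient comes out close to 1, which is exactly the kind of relationship the profiling report will flag for us automatically.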

Remember that for now what we are doing is adding columns to the dataframe in order to see all the possibilities offered by the Pandas profiling tool.

We are now going to simulate another scenario that is common in real datasets: duplicate observations or rows. We will do it like this:

In [18]:
duplicates_to_add = pd.DataFrame(df.iloc[0:10])
In [19]:
duplicates_to_add
Out[19]:
name id nametype recclass mass (g) fall year reclat reclong GeoLocation source boolean mixed reclat_city
0 Aachen 1 Valid L5 21.0 Fell 01/01/1880 12:00:00 AM 50.77500 6.08333 (50.775, 6.08333) NASA False A 45.108012
1 Aarhus 2 Valid H6 720.0 Fell 01/01/1951 12:00:00 AM 56.18333 10.23333 (56.18333, 10.23333) NASA False 1 44.845134
2 Abee 6 Valid EH4 107000.0 Fell 01/01/1952 12:00:00 AM 54.21667 -113.00000 (54.21667, -113.0) NASA True A 56.421836
3 Acapulco 10 Valid Acapulcoite 1914.0 Fell 01/01/1976 12:00:00 AM 16.88333 -99.90000 (16.88333, -99.9) NASA False 1 20.536132
4 Achiras 370 Valid L6 780.0 Fell 01/01/1902 12:00:00 AM -33.16667 -64.95000 (-33.16667, -64.95) NASA False 1 -24.426722
5 Adhi Kot 379 Valid EH4 4239.0 Fell 01/01/1919 12:00:00 AM 32.10000 71.80000 (32.1, 71.8) NASA False A 35.951071
6 Adzhi-Bogdo (stone) 390 Valid LL3-6 910.0 Fell 01/01/1949 12:00:00 AM 44.83333 95.16667 (44.83333, 95.16667) NASA True A 48.626966
7 Agen 392 Valid H5 30000.0 Fell 01/01/1814 12:00:00 AM 44.21667 0.61667 (44.21667, 0.61667) NASA False 1 43.906622
8 Aguada 398 Valid L6 1620.0 Fell 01/01/1930 12:00:00 AM -31.60000 -65.23333 (-31.6, -65.23333) NASA True A -33.367825
9 Aguila Blanca 417 Valid L 1440.0 Fell 01/01/1920 12:00:00 AM -30.86667 -64.55000 (-30.86667, -64.55) NASA True A -18.912785

What we just did was create a new DataFrame from the first 10 rows of our original DataFrame. To do this we use iloc, which selects rows by position, with a slice that selects rows 0 through 9.
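A minimal sketch of this slicing behavior, on an invented toy frame (the data is hypothetical), showing that the stop index of an iloc slice is exclusive:

```python
import pandas as pd

# Hypothetical toy frame with 12 rows
toy = pd.DataFrame({"name": list("abcdefghijkl")})

# iloc slices by position; the stop index is exclusive,
# so 0:10 selects rows 0 through 9 (ten rows in total)
first_ten = toy.iloc[0:10]
print(len(first_ten))  # 10
```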

Now let's change the names so we can identify them later; the other values remain the same.

In [20]:
duplicates_to_add['name'] = duplicates_to_add['name'] + " copy"
In [21]:
duplicates_to_add
Out[21]:
name id nametype recclass mass (g) fall year reclat reclong GeoLocation source boolean mixed reclat_city
0 Aachen copy 1 Valid L5 21.0 Fell 01/01/1880 12:00:00 AM 50.77500 6.08333 (50.775, 6.08333) NASA False A 45.108012
1 Aarhus copy 2 Valid H6 720.0 Fell 01/01/1951 12:00:00 AM 56.18333 10.23333 (56.18333, 10.23333) NASA False 1 44.845134
2 Abee copy 6 Valid EH4 107000.0 Fell 01/01/1952 12:00:00 AM 54.21667 -113.00000 (54.21667, -113.0) NASA True A 56.421836
3 Acapulco copy 10 Valid Acapulcoite 1914.0 Fell 01/01/1976 12:00:00 AM 16.88333 -99.90000 (16.88333, -99.9) NASA False 1 20.536132
4 Achiras copy 370 Valid L6 780.0 Fell 01/01/1902 12:00:00 AM -33.16667 -64.95000 (-33.16667, -64.95) NASA False 1 -24.426722
5 Adhi Kot copy 379 Valid EH4 4239.0 Fell 01/01/1919 12:00:00 AM 32.10000 71.80000 (32.1, 71.8) NASA False A 35.951071
6 Adzhi-Bogdo (stone) copy 390 Valid LL3-6 910.0 Fell 01/01/1949 12:00:00 AM 44.83333 95.16667 (44.83333, 95.16667) NASA True A 48.626966
7 Agen copy 392 Valid H5 30000.0 Fell 01/01/1814 12:00:00 AM 44.21667 0.61667 (44.21667, 0.61667) NASA False 1 43.906622
8 Aguada copy 398 Valid L6 1620.0 Fell 01/01/1930 12:00:00 AM -31.60000 -65.23333 (-31.6, -65.23333) NASA True A -33.367825
9 Aguila Blanca copy 417 Valid L 1440.0 Fell 01/01/1920 12:00:00 AM -30.86667 -64.55000 (-30.86667, -64.55) NASA True A -18.912785

If we look, all the names now have the word 'copy' at the end. This new dataset is ready to be concatenated to the original one, so that we have duplicated data. Let's now do the append.

In [22]:
df = df.append(duplicates_to_add, ignore_index=True)
In [23]:
df.head()
Out[23]:
name id nametype recclass mass (g) fall year reclat reclong GeoLocation source boolean mixed reclat_city
0 Aachen copy 1 Valid L5 21.0 Fell 01/01/1880 12:00:00 AM 50.77500 6.08333 (50.775, 6.08333) NASA False A 45.108012
1 Aarhus copy 2 Valid H6 720.0 Fell 01/01/1951 12:00:00 AM 56.18333 10.23333 (56.18333, 10.23333) NASA False 1 44.845134
2 Abee copy 6 Valid EH4 107000.0 Fell 01/01/1952 12:00:00 AM 54.21667 -113.00000 (54.21667, -113.0) NASA True A 56.421836
3 Acapulco copy 10 Valid Acapulcoite 1914.0 Fell 01/01/1976 12:00:00 AM 16.88333 -99.90000 (16.88333, -99.9) NASA False 1 20.536132
4 Achiras copy 370 Valid L6 780.0 Fell 01/01/1902 12:00:00 AM -33.16667 -64.95000 (-33.16667, -64.95) NASA False 1 -24.426722
In [24]:
df.shape
Out[24]:
(45726, 14)

The original dataset contains 45,716 rows; now we have 10 more, which are the duplicate rows. In fact we can see some of them in the display above!
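With plain Pandas, one way to surface such near-duplicates is the duplicated() method. A minimal sketch on an invented toy frame (rows and ids are hypothetical): because we changed the names, full-row comparison finds nothing, but restricting the comparison to the id column exposes the copies.

```python
import pandas as pd

# Hypothetical toy frame: row 2 duplicates row 0's id under a changed name
toy = pd.DataFrame({
    "id": [1, 2, 1],
    "name": ["Aachen", "Aarhus", "Aachen copy"],
})

# Full-row comparison finds no duplicates, since the names differ
assert not toy.duplicated().any()

# Comparing only on 'id' flags both copies (keep=False marks every occurrence)
dupes = toy[toy.duplicated(subset=["id"], keep=False)]
print(dupes)
```

The profiling report we are about to generate performs this kind of duplicate-row check for us, without having to pick the right subset of columns by hand.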

Using Pandas profiling

Now we have arrived at the long-awaited moment. We have added some columns to the dataset that will let us see interesting analyses of it. But first, to be fair to Pandas' describe(), let's see what analysis it gives us on the resulting dataset.

In [25]:
df.describe()
Out[25]:
id mass (g) reclat reclong reclat_city
count 45726.000000 4.559500e+04 38411.000000 38411.000000 38411.000000
mean 26883.906202 1.327843e+04 -39.107095 61.052594 -39.086822
std 16863.445566 5.749260e+05 46.386011 80.655258 46.606354
min 1.000000 0.000000e+00 -87.366670 -165.433330 -104.333501
25% 12681.250000 7.200000e+00 -76.713770 0.000000 -78.167480
50% 24256.500000 3.261000e+01 -71.500000 35.666670 -68.929999
75% 40653.500000 2.029000e+02 0.000000 157.166670 4.711421
max 57458.000000 6.000000e+07 81.166670 354.473330 80.190247

As we can see, there is very little difference; it gives us no additional information about:

  • Boolean columns
  • Mixed columns
  • Correlations
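A minimal sketch of this limitation, using an invented toy frame with the same kinds of columns (the values are hypothetical): the boolean and mixed columns simply disappear from the default describe() output.

```python
import pandas as pd

# Hypothetical toy frame mixing numeric, boolean and mixed-type columns
toy = pd.DataFrame({
    "mass_g": [21.0, 720.0],
    "boolean": [True, False],
    "mixed": [1, "A"],
})

# Only the purely numeric column survives the default describe()
summary = toy.describe()
print(summary.columns.tolist())  # ['mass_g']
```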

This is where Pandas profiling shines, thanks to the simplicity with which it performs an exploratory analysis of our datasets. Without further ado, let's run the following commands.

In [26]:
## we already have the library installed, now we need to import it

import pandas_profiling
from pandas_profiling.utils.cache import cache_file
In [27]:
## now we run the report

report = df.profile_report(sort='None', html={'style':{'full_width':True}})
In [28]:
report