Pandas Essentials For Data Science

Mahbubul Alam
Jul 13, 2020



Python is a popular language in data science, and of course, the most popular language for machine learning in production.

However, if you look at the whole landscape of data science, analytics and business intelligence across industries and academia, you’ll realize there are opportunities for Python to grow with new tools and techniques.

As an example, time series analysis has made tremendous progress in the R environment, thanks to a low barrier to entry and rich libraries such as fpp2. Python is popular in some areas of time series work, but it still has a long way to go before it has an fpp2 equivalent for forecasting.

Closing that gap first requires more practitioners doing data science in Python. For beginners, the most important thing is a low barrier to entry.

The purpose of this article is to show how Python's pandas library helps you set up your environment for data analysis. It is an effort to show how learning only a few commands can get you started toward more advanced modeling.

What is Pandas?

pandas is a powerful, flexible and accessible data analysis and manipulation library for Python. It was originally developed at a financial management company. Anyone familiar with the finance sector knows that a lot of its data science is actually time series analysis.

In fact, the name pandas comes from panel data, an econometrics term for data that tracks the same entities over multiple time periods.

Pandas for data wrangling

Data scientists spend the bulk of their time (some would say 80%) wrangling and preparing data for analysis. pandas was designed specifically for this part of the analytics pipeline, so if you know how it works and how to make the best use of it for data prep, the rest gets much easier.

So here is a pipeline for making data analysis-ready, whether for basic analytics or advanced modeling.

Importing data

The first order of business is importing the pandas library, assuming it is already installed (for example with pip install pandas).
import pandas as pd

Now you can import data from a variety of sources: a file on your computer, text from the web, or the result of querying a SQL database.

Data also comes in a variety of formats, such as CSV, Excel and JSON.

So where the data lives and which format it is in determine which command to use. Here are a couple of examples.

# import a csv from the local machine or from the web
df = pd.read_csv("Your-Data-Path.csv")
# importing an excel file from the computer
df = pd.read_excel("Your-Data-Path.xlsx")
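
If you want to try the import step without any files on disk, here is a minimal sketch. It reads CSV text from an in-memory buffer and then runs a query against an in-memory SQLite database that stands in for a real SQL server. The column names, the values and the "sales" table name are all made up for illustration.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical CSV content; in practice this would be a file path or a URL
csv_text = "name,city,sales\nAnn,Oslo,120\nBob,Lima,95\nCara,Oslo,143\n"

# read_csv accepts a path, a URL, or any file-like object
df_csv = pd.read_csv(io.StringIO(csv_text))

# An in-memory SQLite database stands in for a real SQL database here
conn = sqlite3.connect(":memory:")
df_csv.to_sql("sales", conn, index=False)
df_sql = pd.read_sql_query(
    "SELECT name, sales FROM sales WHERE sales > 100", conn
)
conn.close()

print(df_csv.shape)
print(df_sql)
```

The same pd.read_sql_query call should work against other databases, but then you would normally pass a SQLAlchemy connection instead of a sqlite3 one.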

Data inspection

After importing data you want to check out a few things such as the data structure, number of rows and columns, unique values, NaN values etc.

# description of index, entries, columns, data types, memory info
df.info()
# know the number of rows and columns
df.shape
# check out first few rows
df.head()
# if too many columns, list all of them
df.columns
# number of unique values of a column
df["column_name"].nunique()
# show all unique values of ONE column
df["column_name"].unique()
# number of unique values in EACH column
df.nunique()
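
To see what these inspection commands return, here is a small sketch on a toy dataframe. The data is invented for the example; the point is only to show the shape of each result.

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({
    "hero": ["Batman", "Robin", "Batman", "Joker"],
    "city": ["Gotham", "Gotham", "Gotham", None],
    "score": [9.5, 7.0, np.nan, 8.2],
})

n_rows, n_cols = df.shape              # tuple of (rows, columns)
unique_heroes = df["hero"].nunique()   # distinct values in ONE column
per_column = df.nunique()              # distinct values in EACH column

df.info()          # index, columns, non-null counts, dtypes, memory
print(per_column)  # missing values are not counted as distinct values
```

Note that nunique() skips missing values by default, so a column with one real value and one missing value reports 1.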

Missing values

Having missing values in a dataset should come as no surprise. First, you need to check if there are missing values:

# checking out number of missing values in each column
df.isnull().sum()
# number of missing values as a percentage of total observations
df.isnull().sum()*100/len(df)

Once you have identified missing values, there are a few things you could do: drop the rows that contain them, drop an entire column, or substitute values, all depending on your analytical/modeling needs. Here are some basic commands:

# drop all rows containing null (returns a new dataframe; df itself is unchanged)
df.dropna()
# fill na values with strings
df.fillna("data missing")
# fill na values with the mean of each numeric column
df.fillna(df.mean(numeric_only=True))
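
Here is a minimal sketch that puts the checks and the fixes together on a toy dataframe. The data is invented, and filling each column with its own replacement value through a dictionary is just one reasonable choice, not the only way to do it.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column
df = pd.DataFrame({
    "city": ["Oslo", None, "Lima", "Oslo"],
    "temp": [3.0, np.nan, 21.0, 5.0],
})

# Share of missing values per column, as a percentage
missing_pct = df.isnull().sum() * 100 / len(df)

# dropna and fillna return new dataframes; df itself stays unchanged
dropped = df.dropna()
filled = df.fillna({"city": "unknown", "temp": df["temp"].mean()})

print(missing_pct)
print(filled)
```

Because neither command modifies df in place, you need to assign the result to a variable if you want to keep it.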

Column operations

By column operations I mean one of several things: selecting columns, dropping columns, renaming, adding new ones, sorting etc. In advanced analytics you might want to create a new column calculated from existing columns (e.g. creating an “age” column based on an existing “date_of_birth” column).

# select a column by name
df["column_name"]
# select multiple columns by column name
df[["column_name1", "column_name2"]] # notice the double brackets
# select first 3 columns based on column locations (the end of the slice is excluded)
df.iloc[:, 0:3]
# select the columns at positions 1, 2 and 5 (positions start at 0)
df.iloc[:, [1, 2, 5]]
# drop a column
df.drop("column_name", axis = 1)
# create a list of all columns in a dataframe
df.columns.tolist()
# rename a column
df.rename(columns = {"old_name": "new_name"})
# create a new column by multiplying an old column by 2
df["new_column_name"] = df["existing_column"] * 2
# sort rows by the values of a column in ascending order
df.sort_values(by = "column_name", ascending = True)
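
The “age” from “date_of_birth” idea mentioned above can be sketched as follows. The names, dates and the fixed reference date are assumptions made for the example; with live data you would more likely use pd.Timestamp.today().

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cara"],
    "date_of_birth": ["1990-06-15", "1985-01-02", "2000-12-31"],
})

# Strings must become datetimes before date arithmetic works
df["date_of_birth"] = pd.to_datetime(df["date_of_birth"])

# Fixed reference date so the result is reproducible
as_of = pd.Timestamp("2020-07-13")

# Has the birthday already happened in the reference year?
dob = df["date_of_birth"]
had_birthday = (dob.dt.month < as_of.month) | (
    (dob.dt.month == as_of.month) & (dob.dt.day <= as_of.day)
)

# Year difference, minus one if the birthday is still to come
df["age"] = as_of.year - dob.dt.year - (~had_birthday).astype(int)

# Sort by the new column, oldest first
df = df.sort_values(by="age", ascending=False)
print(df)
```

Comparing month and day explicitly avoids the off-by-one you get from simply dividing a day count by 365.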

Row operations
Once you have taken care of the columns, rows are next. You need to work with rows for a number of reasons, most notably filtering or slicing data. In addition, you might want to add new observations to your dataframe or remove some existing ones. Below are some commands you’d need to filter data:

# select rows at positions 3 to 9 (the end of the slice is excluded)
df.iloc[3:10]
# select rows at positions 3 to 9 AND columns at positions 2 to 4
df.iloc[3:10, 2:5]
# take a random sample of 10 rows
df.sample(10)
# select rows where a column matches specific values
df[df["column_name"].isin(["Batman"])]
# conditional filtering: keep rows where column_name > 5
df.query("column_name > 5")
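
Here is a small sketch of filtering, plus the adding and removing of observations mentioned above. The data is invented, and using pd.concat to add a row is one option among several (DataFrame.append was removed in pandas 2.0).

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({
    "hero": ["Batman", "Robin", "Joker", "Batman", "Alfred"],
    "score": [9, 4, 7, 6, 3],
})

# Filter on membership, on a condition, and on both combined
batman = df[df["hero"].isin(["Batman"])]
high = df.query("score > 5")
both = df[(df["hero"] == "Batman") & (df["score"] > 5)]

# Add a new observation by concatenating a one-row dataframe
new_row = pd.DataFrame({"hero": ["Catwoman"], "score": [8]})
df_added = pd.concat([df, new_row], ignore_index=True)

# Remove observations by their index labels
df_removed = df_added.drop(index=[1, 4])

print(high)
print(df_removed)
```

When you combine conditions with & or |, each condition needs its own parentheses.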

Special case: preparing time series data
Time series data needs some extra preparation. Raw data is usually not formatted for time series analysis, so pandas reads it as an ordinary dataframe in which the time dimension is stored as strings rather than datetime objects. You need to convert that column to datetimes and set it as the index before the time series functionality becomes available.

# convert Date column to a datetime object
df["Date"] = pd.to_datetime(df["Date"])
# set Date as the index
df = df.set_index("Date")
# add new columns derived from the datetime index
df["Year"] = df.index.year
df["Month"] = df.index.month
df["Weekday"] = df.index.day_name()  # weekday_name was removed in pandas 1.0
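
To show what a datetime index buys you, here is a minimal sketch on invented data. The "sales" column and the dates are assumptions for the example; the date-string selection and monthly resampling are two common uses, not a complete recipe.

```python
import pandas as pd

# Hypothetical data; dates arrive as plain strings
df = pd.DataFrame({
    "Date": ["2020-07-10", "2020-07-11", "2020-07-13", "2020-08-01"],
    "sales": [10, 12, 9, 20],
})

# Convert to datetimes and use them as the index
df["Date"] = pd.to_datetime(df["Date"])
df = df.set_index("Date")

# Derive a weekday column from the index
df["Weekday"] = df.index.day_name()

# Select all rows in July 2020 with a partial date string
july = df.loc["2020-07"]

# Aggregate to monthly totals ("MS" labels each month by its first day)
monthly = df["sales"].resample("MS").sum()

print(df)
print(monthly)
```

Neither the partial-string selection nor resample would work while the dates are still stored as strings.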

Final word

Everyone begins their data science journey from a different starting point. The level of understanding they reach, and the time it takes to get there, differ significantly because everyone follows a different learning path. Learning the fundamentals of data wrangling in Python shouldn’t be hard if you follow a logical learning process. In this article, I outlined that logical order (data imports, inspection, missing values, column operations, row operations, and time series preparation) along with some of the most frequently used commands. I hope it is useful in your data science journey.
