Python is a popular language in data science and, arguably, the most popular language for machine learning in production.
However, if you look at the whole landscape of data science, analytics and business intelligence across industries and academia, you’ll realize there are opportunities for Python to grow with new tools and techniques.
Time series analysis, for example, has made tremendous progress in the R environment, largely because of R's low barrier to entry and rich libraries such as fpp2. Python is popular in some areas of time series work, but it still has a long way to go before it offers an fpp2 equivalent in the forecasting arena.
Closing that gap requires, first of all, more practitioners interested in doing data science in Python, and for beginners the most important thing is a low barrier to entry.
The purpose of this article is to walk through the utilities of Python's pandas library for setting up your environment for data analysis. It is an effort to show how learning just a few commands can kick off more advanced modeling.
What is Pandas?
pandas is a powerful, flexible and accessible data analysis library in Python. It was originally developed at an investment management firm, and anyone familiar with the finance sector knows that a lot of its data science is actually time series analysis.
In fact, the name pandas comes from "panel data", a special type of time series data used in econometrics, which is worth exploring if you are interested in econometrics and its applications in data science.
Pandas for data wrangling
Data scientists spend the bulk of their time (some would say 80%) wrangling and preparing data for analysis. Since pandas was designed specifically for this part of the analytics pipeline, if you know how it works and how to make the best use of it for data prep, the rest comes easier.
So here is a pipeline for getting data analysis-ready, from basic analytics to advanced modeling.
Importing data
The first order of business is, of course, importing the pandas library (install it first with pip install pandas if you haven't).
import pandas as pd
Now you can import data from a variety of sources: a file on your computer, text from the web, or a query against a SQL database. Data also comes in a variety of formats such as CSV, Excel and JSON. Knowing where the data comes from and what format it is in determines which command to use. Here are a couple of examples.
# import a csv from the local machine or from the web
df = pd.read_csv("Your-Data-Path.csv")

# import an excel file from the computer
df = pd.read_excel("Your-Data-Path.xlsx")
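pandas has similar readers for most other formats and sources too. Here is a hedged sketch for JSON and SQL; the file name, database name and table name below are placeholders:

import sqlite3

# import a json file (path is a placeholder)
df = pd.read_json("Your-Data-Path.json")

# query a SQL database; a local SQLite file is used here as an example
conn = sqlite3.connect("Your-Database.db")
df = pd.read_sql("SELECT * FROM your_table", conn)
conn.close()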
Data inspection
After importing data you want to check out a few things, such as the data structure, the number of rows and columns, unique values and NaN values.
# description of index, entries, columns, data types, memory info
df.info()

# number of rows and columns
df.shape

# check out the first few rows
df.head()

# if there are too many columns, list all of them
df.columns

# number of unique values in a column
df["column_name"].nunique()

# show all unique values of ONE column
df["column_name"].unique()

# number of unique values in EACH column
df.nunique()
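If you want to try these commands on something concrete before touching your own data, here is a minimal, self-contained sketch with a made-up dataframe (the column names and values are invented for illustration):

import pandas as pd

# a tiny made-up dataset to experiment with
df = pd.DataFrame({
    "city": ["Dhaka", "Dhaka", "Miami", None],
    "population": [8.9, 8.9, 0.44, 2.7],
})

print(df.shape)               # (4, 2)
print(df["city"].nunique())   # 2, since nunique() ignores the missing value
df.info()                     # dtypes, non-null counts, memory usage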
Missing values
Having missing values in a dataset should come as no surprise. First, you need to check if there are missing values:
# number of missing values in each column
df.isnull().sum()

# missing values as a percentage of total observations
df.isnull().sum() * 100 / len(df)
Once you have identified that there are missing values, there are a few things you could do: drop the rows containing them, drop an entire column, or substitute values, all depending on your analytical/modeling needs. Here are some basic commands:
# drop all rows containing any null value
df.dropna()

# fill na values with a string
df.fillna("data missing")

# fill na values with the mean of each numeric column
df.fillna(df.mean(numeric_only=True))
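In practice you often mix these strategies column by column. A minimal sketch, assuming a numeric "price" column and a text "category" column (both hypothetical names):

# drop rows only where a critical column is missing
df = df.dropna(subset=["price"])

# fill a numeric column with its mean, a text column with a label
df["price"] = df["price"].fillna(df["price"].mean())
df["category"] = df["category"].fillna("category missing")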
I wrote an entire article about dealing with missing values, if you'd like to dig deeper.
Column operations
By column operations I mean one of several things: selecting columns, dropping columns, renaming, adding new ones, sorting and so on. In advanced analytics you might want to create a new column calculated from existing ones, e.g. deriving an "age" column from an existing "date_of_birth" column (see the sketch after the commands below).
# select a column by name
df["column_name"]

# select multiple columns by name
df[["column_name1", "column_name2"]]  # notice the double brackets

# select the first 3 columns by position
df.iloc[:, 0:3]

# select columns 1, 2 and 5 by position
df.iloc[:, [1, 2, 5]]

# drop a column
df.drop("column_name", axis=1)

# create a list of all columns in a dataframe
df.columns.tolist()

# rename a column
df.rename(columns={"old_name": "new_name"})

# create a new column by multiplying an old column by 2
df["new_column_name"] = df["existing_column"] * 2

# sort a column's values in ascending order
df.sort_values(by="column_name", ascending=True)
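And here is the "age" example mentioned above as a minimal sketch, assuming a hypothetical "date_of_birth" column stored as strings:

# hypothetical: derive an "age" column from a "date_of_birth" column
df["date_of_birth"] = pd.to_datetime(df["date_of_birth"])

# rough age in whole years
df["age"] = (pd.Timestamp("today") - df["date_of_birth"]).dt.days // 365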
Row operations
Once you have taken care of the columns, rows are next. You need to work with rows for a number of reasons, most notably for filtering or slicing data. In addition, you might want to add new observations to your dataframe or remove existing ones (a sketch for that follows the commands below). Here are some commands you'd need to filter data:
# select rows at positions 3 through 9
df.iloc[3:10]

# select rows 3 through 9 AND columns 2 through 4
df.iloc[3:10, 2:5]

# take a random sample of 10 rows
df.sample(10)

# select rows containing a specific value
df[df["column_name"].isin(["Batman"])]

# conditional filtering: rows where the column value is greater than 5
df.query("column_name > 5")
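For adding and removing observations, a hedged sketch (the column names and index labels below are placeholders):

# append a new observation (row) to the dataframe
new_row = pd.DataFrame({"column_name1": [7], "column_name2": ["Batman"]})
df = pd.concat([df, new_row], ignore_index=True)

# remove existing observations by index label
df = df.drop(index=[0, 3])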
Special case: preparing time series data
Time series is a different kind of object, unlike an ordinary dataframe. Raw data is usually not formatted for time series analysis, so pandas treats it like a normal dataframe in which the time dimension is stored as strings rather than as a datetime object. You therefore need to transform the ordinary dataframe into a time series object.
# convert the Date column to a datetime object
df["Date"] = pd.to_datetime(df["Date"])

# set Date as the index
df = df.set_index("Date")

# add new columns by splitting the index
df["Year"] = df.index.year
df["Month"] = df.index.month
df["Weekday"] = df.index.day_name()
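To see the whole transformation end to end, here is a minimal, self-contained example with made-up data:

import pandas as pd

# raw data usually arrives with dates as plain strings
df = pd.DataFrame({
    "Date": ["2021-01-01", "2021-02-01", "2021-03-01"],
    "Sales": [100, 120, 90],
})

df["Date"] = pd.to_datetime(df["Date"])
df = df.set_index("Date")

print(df.index.year)        # all 2021
print(df.index.day_name())  # e.g. 2021-01-01 was a Friday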
For more about time series data preparation, there is a separate article on the topic.
Final word
Everyone starts their data science journey from a different starting point. However, the level of understanding and the time it takes to reach the goal differ significantly because everyone takes a different learning path. Learning the fundamentals of data wrangling in Python shouldn't be hard if you follow a logical learning process. In this article I outlined that logical order: data imports, inspection, missing values, column operations and row operations, along with some of the most frequently used commands. I hope it was useful in your data science journey.