Pandas vs SQL. When Data Scientists Should Use One Over the Other

Matt Przybyla
Jul 23, 2021


A deep dive into the benefits of each tool


Table of Contents

  1. Introduction
  2. Pandas
  3. SQL
  4. Summary
  5. References

Introduction


Both of these tools are important to not only data scientists, but also to those in similar positions like data analytics and business intelligence. With that being said, when should data scientists specifically use pandas over SQL and vice versa? In some situations, you can get away with just using SQL, and some other times, pandas is much easier to use, especially for data scientists who focus on research in a Jupyter Notebook setting. Below, I will discuss when you should use SQL and when you should use pandas. Keep in mind that both of these tools have specific use cases, but there are many times where their functionality overlap, and that is what I will be comparing below as well.

Pandas



Photo by Kalen Kemp on Unsplash [2].

Pandas
[3] is an open-source data analysis tool in the Python programing language. The benefit of pandas starts when you already have your main dataset, usually from a SQL query. This main difference can mean that the two tools are separate, however, you can also perform several of the same functions in each respective tool, for example, you can create new features from existing columns in pandas, perhaps easier and faster than in SQL.

It is important to note that I am not comparing what Pandas does that SQL cannot do and vice versa. I will be picking the tool that can do the function more efficiently or preferable for data science work — in my opinion, from personal experience.

Here are times where using pandas is more beneficial than SQL — while also having the same functionality as SQL:

  • creating calculated fields from existing features

When incorporating a more complex SQL query, you often are incorporating subqueries as well in order to divide values from different columns. In pandas you can simply divide features much easier like the following:

df["new_column"] = df["first_column"]/df["second_column"]

The code above is showing how you can divide two separate columns, and assign those values to a new column — in this case, you are performing the feature creation on the whole entire dataset or dataframe. You can use this function in both feature exploration and feature engineering in the process of data science.

  • grouping by

Also referring to subqueries, grouping by in SQL can become quite complex and require lines and lines of code that can be visually overwhelming. In pandas, you can simply group by one line of code. I am not referring to the group by at the end of a simple select from table query, but one where there are multiple subqueries involved.

df.groupby(by="first_column").mean()

This result would be returning the mean of the first_column for every column in the dataframe. There are many other ways to use this grouping function, of which are outlined nicely in the pandas documentation linked below.

  • checking data types

In SQL, you will often have to cast types, but sometimes it can be a little clearer to see the way pandas lays out data types in a vertical format, rather than scrolling through a horizontal output in SQL. You can expect some examples of data types returned to be int64, float64, datetime64[ns], and object.

df.dtypes

While these are all fairly simple functions of pandas and SQL, in SQL, they are particularly tricky, and sometimes just much easier to implement in a pandas dataframe. Now, let’s look at what SQL is better at performing.

SQL



Photo by Caspar Camille Rubin on Unsplash [4].

SQL is probably the language that is used most by the most amount of different positions. For example, a data engineer could use SQL, a Tableau developer, or a product manager. With that being said, data scientists tend to use SQL frequently. It is important to note that there are several different versions of SQL, usually all having a similar function, just slightly formatted differently.

Here are times where using SQL is more beneficial than pandas — while also having the same functionality as pandas:
  • WHERE clause

This clause in SQL is used frequently and can also be performed in pandas. In pandas, however, it is slightly more difficult, or less intuitive. For example, you have to write out redundant code, whereas in SQL, you simply need the WHERE.

SELECT ID
FROM TABLE
WHERE ID > 100

In pandas, it would be something like:

df[df["ID"] > 100]["ID"]

Yes, both are simple, one is just a little more intuitive.

  • JOINS

Pandas has a few ways to join, which can be a little overwhelming, whereas in SQL you can perform simple joins like the following: INNER, LEFT, RIGHT

SELECT
one.column_A,
two.column_B
FROM FIRST_TABLE one
INNER JOIN SECOND_TABLE two on two.ID = one.ID

In this code, joining is slightly easier to read, than in pandas, where you have to merge dataframes, and especially as you merge more than two dataframes, it can be quite complex in pandas. SQL can perform multiple joins whether it be INNER, etc., all in the same query.

All of these examples, whether it be SQL or pandas, can be used in at least the exploratory data analysis portion of the data science process, as well as in feature engineer, and querying model results once they are stored in a database.

Summary


This comparison of pandas versus SQL is more of a personal preference. With that being said, you may feel the opposite of my opinion. However, I hope it still sheds light on the differences between pandas and SQL, as well as what you can perform the same in both tools, using slightly different coding techniques and a different language altogether.

To summarize, we have compared the benefits of using pandas over SQL and vice versa for a few of their shared functions:
* creating calculated fields from existing features
* grouping by
* checking data types
* WHERE clause
* JOINS

I hope you found my article both interesting and useful. Please feel free to comment down below if you agree with these comparisons — why or why not? Do you think one tool, in particular, is better than the other? What other data science tools can you think of that would have a similar comparison? What other functions of pandas and SQL could we compare?

“Pandas vs SQL. When Data Scientists Should Use One Over the Other”
– Matt Przybyla twitter social icon Tweet


Share this article:

0 Comments

Post a comment
Log In to Comment

Related Stories

Sep 25, 2021

10 Highly Probable Data Scientist Interview Questions

The popularity of data science attracts a lot of people from a wide range of professions to make a career change with the goal of becoming a data s...

Soner Yıldırım
By Soner Yıldırım
Sep 17, 2021

5 Google Chrome Extensions Every Data Scientist Should Know About

In this new post we will talk about the best Google Chrome extensions that as data scientists make certain tasks easier for us. You should at least...

Daniel Morales
By Daniel Morales
Sep 10, 2021

Data Scientists are Really Just Product Managers. Here’s Why.

Unpopular opinion?Table of ContentsIntroductionBusiness and Product UnderstandingStakeholder CollaborationSummaryReferencesIntroductionAs mentioned...

Matt Przybyla
By Matt Przybyla
Icon

Join our private community in Slack

Keep up to date by participating in our global community of data scientists and AI enthusiasts. We discuss the latest developments in data science competitions, new techniques for solving complex challenges, AI and machine learning models, and much more!

 
We'll send you an invitational link to your email immediatly.
arrow-up icon