This blog post was originally intended to be a side-note in my Pandas Join vs. Merge post. But it turned out to be long enough to warrant its own post (and way too verbose for a side-note). It’s not meant to be a full-on primer on SQL joins, but rather an example to help those new to SQL and relational databases begin to grasp what it means to join 2 tables.
Why Do We Join?
Why bother with joining at all? Can’t we just dump everything into a spreadsheet and sort things out there? Perhaps… but it would be incredibly time consuming, tedious, and error prone.
Relational databases are designed to be joined. Each table in the database contains data of a specific form or function. For example, one table might have basic data on a company’s customers such as customer ID (a unique ID that can be used to identify each customer), name, age, gender, date of first purchase, and address. While a separate much larger table stores detailed transaction level data — transaction ID, date of transaction, customer ID, product category, product ID, units sold, and price.
A given customer (or customer ID) could have hundreds or even thousands of transactions, so it would be extremely redundant to store that customer’s basic information over and over again for each row in the transactions table. The transactions table should be only for data relevant to transactions. Having too much overlapping data between tables is wasteful and can negatively impact system performance.
But that doesn’t mean we don’t care about the linkages between tables. Given how specific each table is, analyses that involve only a single table are generally not useful. The interesting analyses come from datasets that combine multiple tables. For example, we might want to segment transactions by age or geography. To do this, we would need data from both tables. And that’s where join comes in.
How Do We Join?
When we join two tables, we are linking them together via a selected characteristic. Let’s say we have two tables. The first one, Employee, lists out an employee’s unique ID number, name, and job title. The second one, Sale, lists out data on who made what sale by attaching the employee’s ID number and the units sold to a unique sales number:
SELECT * FROM Employee SELECT * FROM Sale
Our 2 tables, Employee (left) and Sale (right)
(I omitted the underscores from the column names in my graphics for legibility)
Now let’s join the two tables. To link the two tables, we need to pick a column (or combination of columns) that serves as the point of intersection — let’s call the chosen column the join index. Table entries that share the same value for the join index are joined together. Note that the intersection does not have to be one to one. For example, Tony has made 2 sales, so upon joining the tables, both of his sales will be linked to Tony (a.k.a. Employee ID 1).
When we join tables, we generally want the join index to be unique. If the join index were not unique, quirky stuff might occur. For example, let’s say we had a second employee named Tony (along with the tables below), and he was a megastar salesman. If instead of joining on “Employee ID” we joined on “Name”, then we would mistakenly link Tony the Megastar’s sales to me, making my bonus way too high:
Joining on non-unique columns is not recommended
And Tony the Megastar would get credit for my pitiful sales as well (not that he needs it). So to avoid this, we join on a column with unique values such as “Employee ID” (I removed Tony the Megastar as he was only there to illustrate what not to do, and his incredible successes made me feel unworthy):
The “Employee ID” column provides the link between the 2 tables
There are various types of SQL joins and I will not go into the details of all of them here. In this example, we will use a left join, meaning that we prioritize the rows in the left table. So our output will include every row in the left table (the one with “Name” and “Title”) regardless of whether there is a match with the right table — thus employees that have not made a sale will still have a row in our output, but there will be no values (NULLs to be exact) for the “Sale Number” and “Units Sold” columns.
Let’s take a look at our output (we are selecting only “Sale Number” and “Units Sold” from the right table and sorting by “Employee ID”):
SELECT e.*, s.Sale_Number, s.Units_Sold FROM Employee as e LEFT JOIN Sale as s ON e.Employee_ID=s.Employee_ID ORDER BY e.Employee_ID
The result of our left join
The output of our join now includes data from both tables. Tony’s sales data has been linked with his employee data (thanks to Employee ID) as has Lisa’s. Notice 2 things:
- It looks a bit repetitive because “Employee ID”, “Name”, and “Title” are repeated for as many times as the employee has sales. In reality, we wouldn’t stop here though. Next, we would most likely do a group by in order to count up how many sales each employee made, or calculate the average number of units sold each time an employee makes a sale.
- Employee ID 3 is missing from the output because we did a left join and there was no entry for Employee ID 3 in the left table. Thus, it was omitted.
And that concludes our brief example. Hopefully, this gives you a rudimentary idea of why we need joins and how they work.
- Database tables generally contain very specific information. Therefore, meaningful analyses usually combine data from multiple tables.
- This is accomplished via the join operation, which combines two tables by matching them up based on a specified column.
- The column used to combine the tables should contain only unique values.
- There are various types of joins. The one in this example is a left join, which returns every row in the left table whether or not there is a match.
“SQL Joins: A Brief Example Understand The Why And How Of SQL Joins”– Tony Yiu Tweet