Merging tables using SQL

This article discusses about merging multiple tables by rows and columns using SQL along with a few examples

KSV Muralidhar
Towards Data Science

--

Image by author

In practice, it is very rare to have an SQL query involving a single table. We may need to merge multiple tables by rows (records) or columns (fields) to get the desired result. In this article, we’ll discuss the operators/commands in SQL that enable use to merge tables by rows or columns.

Merging tables by columns

Multiple tables can be merged by columns in SQL using joins. Joins merge two tables based on the specified columns (generally, the primary key of one table and a foreign key of the other). Below is the generic syntax of SQL joins.

SELECT 
*
FROM table_1
JOIN table_2
USING (id);

In the above syntax, table_1 and table_2 are the two tables with the key column (matching column in both the tables), id. We use the keyword USING only when the key column has the same name in both the tables. Otherwise, we need to explicitly mention the key columns of both the tables as shown below.

SELECT 
*
FROM table_1 t1
JOIN table_2 t2
ON t1.t1_id = t2.t2_id;

In the above syntax, t1 is an alias of table_1 and t2 is of table_2. When the names of the key columns are not same in both the tables, we need to match them using the ON keyword as shown above. We’ll now discuss a few important joins in SQL.

Inner Join

Inner join merges two tables by columns and returns only the matching records (based on the specified columns) in both the tables. In the below query result, we can see that only the records with common id in both left_table and right_table are returned.

SELECT 
*
FROM left_table
INNER JOIN right_table
USING (id);

Or

SELECT 
*
FROM left_table l
INNER JOIN right_table r
ON l.id = r.id;
Inner Join (Image by author)

Left Join

Left join merges two tables by columns and returns all the records in the left table but only the matching records (based on the specified columns) from the right table. In the below query result, we can see the records with common id in both the tables along with all the records of the left_table. Records in the right_table with no matching id in the left_table have NULL.

SELECT 
*
FROM left_table
LEFT JOIN right_table
USING (id);

Or

SELECT 
*
FROM left_table l
LEFT JOIN right_table r
ON l.id = r.id;
Left Join (Image by author)

Right Join

Right join merges two tables by columns and returns all the records in the right table but only the matching records (based on the specified columns) from the left table. In the below query result, we can see the records with common id in both the tables along with all the records of the right_table. Records in the left_table with no matching id in the right_table have NULL.

SELECT 
*
FROM left_table
RIGHT JOIN right_table
USING (id);

Or

SELECT 
*
FROM left_table l
RIGHT JOIN right_table r
ON l.id = r.id;
Right Join (Image by author)

Full Join

Full join can be considered as a combination of left and right joins. Full join merges two tables by columns and returns all the records in both the left and right tables. In the below query result, we can see that all the records of both the tables are returned. Records with no matching id in the other table have NULL.

SELECT 
*
FROM left_table
FULL JOIN right_table
USING (id);

Or

SELECT 
*
FROM left_table l
FULL JOIN right_table r
ON l.id = r.id;
Full Join (Image by author)

Cross Join

Cross join returns the cartesian product of two tables. Cartesian product of two sets A = {1, 2}, B = {3, 4} is A x B = {(1, 3), (1, 4), (2, 3), (2, 4)}. We need not specify a key column in cross joins.

SELECT 
*
FROM left_table
CROSS JOIN right_table
Cross Join (Image by author)

Semi Join

Semi join is technically not an SQL join but works like a join. Semi join returns the matching records in the left table based on a key column in the right table. Semi join doesn’t include the columns of the right table in the query result. In the below example, we want to return the records from the left_table with matching id in the right_table. In other words, we want the records in the left_table whose id is present in the right_table.

SELECT 
*
FROM left_table
WHERE
id IN
(
SELECT id FROM right_table
)
Semi Join (Image by author)

Anti Join

Anti join is also technically not an SQL join but works like a join. Anti join returns the non-matching records in the left table based on a key column in the right table. Anti join also doesn’t include the columns of the right table in the query result. In the below example, we want to return the records from the left_table whose id doesn’t match with the id of the right_table. In other words, we want the records in the left_table whose id is not present in the right_table.

SELECT 
*
FROM left_table
WHERE
id
NOT IN
(
SELECT id FROM right_table
)
Anti Join (Image by author)

Self Join

Self join enables us to join a table with itself. In the below query, we need to find the records with the same left value. For this, we have joined the table with itself and filtered the records with same left value but different id.

SELECT 
*
FROM left_table l1, left_table l2
WHERE
l1.left = l2.left
AND
l1.id <> l2.id
ORDER BY l1.left
Self join (Image by author)

Merging tables by rows

Union

Union merges two tables by rows, provided the data types of the columns of one table matches with that of the other. We cannot merge a table having column data types as integer and text with a table having column data types as text and integer. However, we can merge two tables even if the column names of one table doesn’t match with that of the other. Union returns only the unique records of both the tables.

(
SELECT
*
FROM left_table
)
UNION
(
SELECT
*
FROM right_table
)
Union (Image by author)

Union All

Similar to Union, Union All also merges tables by rows. Unlike Union, Union All retains the duplicate records of both the tables. In the below query result, we have merged the id of left_table and right_table. We can see a few duplicates in the result.

(
SELECT
id
FROM left_table
)
UNION ALL
(
SELECT
id
FROM right_table
)
Union All (Image by author)

Intersect

Intersect returns the common records of both the tables. In the below query result, we can see the common ids of left_table and right_table.

(
SELECT
id
FROM left_table
)
INTERSECT
(
SELECT
id
FROM right_table
)
Intersect (Image by author)

EXCEPT

Except returns the records from the first table (left table) which are not present in the second table (right table). In the below query result, we can see the ids of left_table which aren’t present in the right_table.

Except (Image by author)

Example Queries

We’ll use the dvd_rental database downloaded from here and restore it. Below is the documentation to restore a database in PostgreSQL.

1. Top 5 frequent renters

In this example, we need to find the top 5 customers who rented the most. For this, we’ll

  1. Join the customer and rental tables using customer_id.
  2. Count the customers (as rental_count) by grouping customer_id.
  3. Sort the result according to rental_count in descending order.
  4. Limit the results to first 5 records.
SELECT 
c.customer_id,
c.first_name,
c.last_name,
COUNT(c.customer_id) AS rental_count
FROM customer c
INNER JOIN rental r
USING (customer_id)
GROUP BY customer_id
ORDER BY
COUNT(c.customer_id) DESC
LIMIT 5;
Image by author

2. Top & bottom 5 customers by revenue generated

In this example, we’ll use common table expressions (CTE). With CTEs, we can create temporary table that exist for a particular query. Below is the official Postgres documentation on CTEs.

In this example, we need to find out top and bottom 5 customers who generated the most revenue. For this, we’ll

1. Create a CTE named revenue_per_customer by

  • Joining the customer and rental tables using customer_id.
  • Joining the resultant table with payment table using rental_id.
  • Computing the total amount paid by customers for each rental transaction (as total_amount) grouping by customer_id.
  • Finally, selecting the customer_id, first_name, last_name and total_amount.

2. Select top 5 customers by revenue from the above CTE by

  • Sorting total_amount in the revenue_per_customer (CTE result) in descending order.
  • Limiting the result to first 5 records.
  • Adding a comment specifying the records as ‘Top 5’.

3. Select bottom 5 customers by revenue from the above CTE by

  • Sorting total_amount in the revenue_per_customer (CTE result) in ascending order.
  • Limiting the result to first 5 records.
  • Adding a comment specifying the records as ‘Bottom 5’.

4. Merging the above two results using UNION.

WITH revenue_per_customer AS 
(SELECT
c.customer_id,
c.first_name,
c.last_name,
SUM(p.amount) AS "total_amount"
FROM customer c
INNER JOIN rental r
USING (customer_id)
INNER JOIN payment p
USING (rental_id)
GROUP BY c.customer_id)
(SELECT
*,
'Top 5' AS comment
FROM revenue_per_customer
ORDER BY total_amount DESC
LIMIT 5)
UNION(SELECT
*,
'Bottom 5' AS comment
FROM revenue_per_customer
ORDER BY total_amount ASC
LIMIT 5)
ORDER BY comment DESC, total_amount DESC;
Image by author

We can also get the above query result using window functions. Below is the official Postgres documentation on window functions.

To find out the top and bottom 5 customers who generated the most revenue using window functions, we’ll

1. Create a CTE named total_amt_rank by

  • Joining the customer and rental tables using customer_id.
  • Joining the resultant table with payment table using rental_id.
  • Computing the total amount paid by customers for each rental transaction (as total_amount) grouping by customer_id.
  • Finally, selecting the customer_id, first_name, last_name, total_amount and rank of total_amount (as total_amount_rank) by sorting it in descending order. This gives rank 1 to the highest amount and so on.

2. Select top 5 customers by revenue by selecting the customers whose total_amount_rank is BETWEEN 1 and 5 from the above CTE.

3. Select bottom 5 customers by revenue from the above CTE by

  • Sorting total_amount_rank in the total_amt_rank (CTE result) in descending order.
  • Limiting the result to first 5 records.

4. Merging the above two results using UNION.

WITH total_amt_rank AS
(
SELECT
c.customer_id,
c.first_name,
c.last_name,
SUM(p.amount) AS "total_amount",
RANK() OVER (ORDER BY SUM(p.amount) DESC) AS total_amount_rank
FROM customer c
INNER JOIN rental r
USING (customer_id)
INNER JOIN payment p
USING (rental_id)
GROUP BY c.customer_id
)
(
SELECT *
FROM total_amt_rank
WHERE
total_amount_rank BETWEEN 1 AND 5
)

UNION

(
SELECT *
FROM total_amt_rank
ORDER BY total_amount_rank DESC
LIMIT 5
)

ORDER BY total_amount_rank;
Image by author

3. Top 5 countries having the highest rentals

In this example, we need to find the top 5 countries with the highest rentals. For this, we’ll

  1. Join the country and city tables using country_id.
  2. Join the resultant table with address table using city_id.
  3. Join the resultant table with customer table using address_id.
  4. Join the resultant table with rental table using customer_id.
  5. Count the country_id (as rental_count) by grouping country_id. We may also use rental_id to get rental_count.
  6. Sort the result by rental_count in descending order.
  7. Limit the results to 5 records.
SELECT 
co.country_id,
co.country,
COUNT(co.country_id) AS rental_count
FROM country co
INNER JOIN city ci
USING (country_id)
INNER JOIN address a
USING (city_id)
INNER JOIN customer cu
USING (address_id)
INNER JOIN rental r
USING (customer_id)
GROUP BY co.country_id
ORDER BY
COUNT(co.country_id) DESC
LIMIT 5;
Image by author

There are a few addresses and cities with no customers. Using inner join omits such records. In the below query, we’ll look at how the result will include addresses without customers on using left join.

4. City/address with no customers?

There are a few cities and addresses without any customers (these may be store addresses). Using inner joins would have omitted them from the results as there are no matching entries in the other table. For example, a city named London in Canada has no matching city_id in the address table. Using inner join would have omitted London in Canada from the result. Similarly, four addresses in Canada and Australia have no matching address_id in the customer table.

SELECT 
co.country,
ci.city,
a.address,
cu.customer_id
FROM country co
LEFT JOIN city ci
USING (country_id)
LEFT JOIN address a
USING (city_id)
LEFT JOIN customer cu
USING (address_id)
WHERE cu.address_id IS NULL;
Image by author

5. Countries with no customer

In this example, we’ll find the countries with no customers by

1. Creating a subquery to find the countries with at least one customer by

  • Joining the country table with city table using country_id.
  • Joining the remainder table with address table using city_id.
  • Joining the remainder table with customer table using address_id.

2. Selecting country from country table where country_id is not present in the country_id of the above subquery.

SELECT 
country
FROM country
WHERE country_id
NOT IN
(
SELECT
co.country_id
FROM country co
INNER JOIN city ci
USING (country_id)
INNER JOIN address a
USING (city_id)
INNER JOIN customer
USING (address_id)
);
Image by author

6. Are there any stores in Australia?

In the above example, we saw that Australia has no customers. In this example, we’ll see are there any stores in Australia by

  1. Joining the country table with city table using country_id.
  2. Joining the resultant table with address table using city_id.
  3. Joining the resultant table with store table using address_id.
  4. Selecting records where store_id IS NOT NULL in Australia.

Left join ensures that countries with no cities and cities with no stores are also included in the query result.

SELECT 
st.store_id,
co.country,
ad.address
FROM country co
LEFT JOIN city ci
USING (country_id)
LEFT JOIN address ad
USING (city_id)
LEFT JOIN store st
USING (address_id)
WHERE
(st.store_id IS NOT NULL)
AND
(co.country = 'Australia');
Image by author

There is one store in Australia. In fact, there are just two stores in the whole database. We’ll view them using the below query.

SELECT * FROM store;
Image by author

7. Languages with no films

In this example, we’ll see if there are any languages with no films by

  1. Joining the language table with film table using language_id. The left join ensures languages without any films are also includes.
  2. Filtering records where film_id IS NULL.
SELECT 
*
FROM language l
LEFT JOIN film f
USING (language_id)
WHERE f.film_id IS NULL;
Image by author

We see a few languages with no films in the database. We’ll make sure that it’s not an error by selecting the films with language_id in (2,3,4,5,6) from the film table. The query result should return no records.

SELECT 
*
FROM film
WHERE language_id IN (2,3,4,5,6);
Image by author

8. Popularity of films by category in India

In this example, we’ll find the number of rentals per film category in India by joining the required tables as discussed in the earlier examples and

  1. Grouping by country and category and filtering records from India and counting the film category name (as film_category_count).
  2. Ordering the result by country in ascending order and film_category_count in descending order.
SELECT 
co.country,
cat.name AS film_category,
COUNT(cat.name) AS film_category_count
FROM country co
INNER JOIN city ci
USING (country_id)
INNER JOIN address ad
USING (city_id)
INNER JOIN customer cu
USING (address_id)
INNER JOIN rental re
USING (customer_id)
INNER JOIN inventory inv
USING (inventory_id)
INNER JOIN film fi
USING (film_id)
INNER JOIN film_category fc
USING (film_id)
INNER JOIN category cat
USING (category_id)
/*
Using
WHERE co.country = 'India'
here, instead of
HAVING co.country = 'India'
reduces the query execution time.
*/

GROUP BY (co.country, cat.name)
HAVING co.country = 'India'
ORDER BY
co.country ASC,
COUNT(cat.name) DESC;
Image by author

9. Films with only a single actor

In this example, we‘ll find the films with a single actor by

  1. Joining the film table with film_actor table using film_id.
  2. Grouping by film_id and counting the number of actors (as actor_count).
  3. Filtering records where actor_count is 1.
SELECT 
f.film_id,
f.title,
COUNT(fa.actor_id) AS actor_count
FROM film f
INNER JOIN film_actor fa
USING (film_id)
GROUP BY f.film_id
HAVING COUNT(fa.actor_id) = 1;
Image by author

10. Number of films of an actor by category

In this example, we’ll find the number of films of an actor by film category by

  1. Creating a CTE named actor_cat_cnt that returns the number of films for each actor_id and category_id.
  2. Joining the above CTE with category table using category_id.
  3. Joining the resultant table with actor table using actor_id.
  4. Sort actor name (concatenation of first_name and last_name) in ascending order and film_count in descending order.
WITH 
actor_cat_cnt AS
(
SELECT
fa.actor_id,
fc.category_id,
COUNT(f.film_id) AS film_count
FROM film_actor fa
INNER JOIN film f
USING (film_id)
INNER JOIN film_category fc
USING (film_id)
GROUP BY
fa.actor_id,
fc.category_id
)
SELECT
CONCAT(ac.first_name, ' ', ac.last_name) AS actor,
ca.name AS category,
film_count
FROM actor_cat_cnt
INNER JOIN category ca
USING (category_id)
INNER JOIN actor ac
USING (actor_id)
ORDER BY
CONCAT(ac.first_name, ' ', ac.last_name) ASC,
film_count DESC;
Image by author

11. Popular categories of an actor

In the above example, we found the number of films of an actor by film category. In this example, we’ll find the popular categories of an actor (i.e. the categories in which an actor has the most films) by

  1. Creating a CTE named actor_cat_cnt that returns the number of films for each actor_id and category_id and rank the categories of each actor by the count of films in descending order (as cat_rank).
  2. Joining the above CTE with category table using category_id.
  3. Joining the resultant table with actor table using actor_id.
  4. Filtering the records with cat_rank = 1.
  5. Sort actor name (concatenation of first_name and last_name) in ascending order and film_count in descending order.
WITH 
actor_cat_cnt AS
(
SELECT
fa.actor_id,
fc.category_id,
COUNT(f.film_id) AS film_count,
RANK() OVER
(PARTITION BY fa.actor_id
ORDER BY COUNT(f.film_id) DESC) AS cat_rank
FROM film_actor fa
INNER JOIN film f
USING (film_id)
INNER JOIN film_category fc
USING (film_id)
GROUP BY
fa.actor_id,
fc.category_id
)
SELECT
CONCAT(ac.first_name, ' ', ac.last_name) AS actor,
ca.name AS category,
film_count
FROM actor_cat_cnt
INNER JOIN category ca
USING (category_id)
INNER JOIN actor ac
USING (actor_id)
WHERE cat_rank = 1
ORDER BY
CONCAT(ac.first_name, ' ', ac.last_name) ASC,
film_count DESC;
Image by author

This brings this article to an end. We’ve discussed ways of merging tables by rows or columns using SQL along with a few examples using the dvd_rental database. These are the fundamental concepts that are used in almost every query we write in SQL. We may not frequently use a few of them in practice, but knowing them is necessary.

Know more about my work at https://ksvmuralidhar.in/

--

--

Data Science | ML | DL | NLP | CV | Web scraping | Kaggler | Python | SQL | Excel VBA | Tableau | About Me: https://ksvmuralidhar.in/