How to derive summary statistics using PostgreSQL

Deriving summary statistics of numerical and categorical fields using PostgreSQL

KSV Muralidhar
Towards Data Science

--

Photo by Campaign Creators on Unsplash

In this article, we’ll discuss how to derive summary statistics of numerical and categorical columns/fields using SQL. We’ll use the Netflix Movies and TV Shows dataset that’s downloaded from Tableau Public sample datasets (https://public.tableau.com/en-us/s/resources and go to Sample Data). This dataset consists of TV shows and movies available on Netflix as of 2019 and is sourced by Tableau from Kaggle and the updated version of the dataset can be found on Kaggle. However, for this article, we’ll use the older version available in Tableau sample datasets and import it into Postgres. This article assumes that the reader has basic knowledge of SQL queries and statistics. Below is a snapshot of the netflix_titles that we’ll be using throughout this article.

netflix_titles table (Image by author)

Summary statistics of numerical variables

Frequently used summary statistics for numerical variables are mean, median, minimum, maximum, range, standard deviation, variance, Q1, Q3, IQR and skewness.

Mean

In PostgreSQL, the mean of a numerical field/column is computed using the AVG() function. We’ll compute the mean of duration_minutes field as shown below.

SELECT 
AVG(duration_minutes) AS mean
FROM netflix_titles;
Image by author

Median

In PostgreSQL, there is no function to directly compute the median of a numerical field/column. However, since median is the 50th percentile, we can use it as a proxy to median. Percentile of a numerical variable is computed using the PERCENTILE_CONT() function. We’ll compute the median of duration_minutes field as shown below.

SELECT 
PERCENTILE_CONT
(0.5) WITHIN GROUP (ORDER BY duration_minutes) AS median
FROM netflix_titles;
Image by author

WITHIN GROUP clause creates an ordered subset of data that can be used to perform aggregations. PERCENTILE_CONT takes the percentile required as an argument, in this case it is 0.5 i.e. the 50th percentile.

Minimum

In PostgreSQL, the minimum value of a numerical field/column is found using the MIN() function. We’ll find the minimum value of duration_minutes field as shown below.

SELECT 
MIN(duration_minutes) AS min
FROM netflix_titles;
Image by author

Maximum

In PostgreSQL, the maximum value of a numerical field/column is found using the MAX() function. We’ll find the maximum value of duration_minutes field as shown below.

SELECT 
MAX(duration_minutes) AS max
FROM netflix_titles;
Image by author

Range

In PostgreSQL, there is no function to directly compute the range of a numerical field/column. However, since range is the difference between maximum and minimum values, we can use it as a proxy to range.

SELECT 
MAX(duration_minutes) - MIN(duration_minutes) AS range
FROM netflix_titles;
Image by author

Standard deviation

In PostgreSQL, the standard deviation of a numerical field/column is computed using the STDDEV() function. We’ll compute the standard deviation of duration_minutes field as shown below.

SELECT 
ROUND(STDDEV(duration_minutes), 2) AS standard_deviation
FROM netflix_titles;
Image by author

OR

We can also compute standard deviation as the square root of variance as shown below.

SELECT 
ROUND(SQRT(VARIANCE(duration_minutes)), 2) AS stddev_using_variance
FROM netflix_titles;
Image by author

Variance

In PostgreSQL, the variance of a numerical field/column is computed using the VARIANCE() function. We’ll compute the variance of duration_minutes field as shown below.

SELECT 
ROUND(VARIANCE(duration_minutes), 2) AS variance
FROM netflix_titles;
Image by author

OR

We can also compute variance as the square of standard deviation as shown below.

SELECT 
ROUND(POWER(STDDEV(duration_minutes), 2), 2) AS variance_using_stddev
FROM netflix_titles;
Image by author

Q1

In PostgreSQL, there is no function to directly compute the first quartile (Q1) of a numerical field/column. However, since Q1 is the 25th percentile, we can use it as a proxy to Q1. We’ll compute the Q1 of duration_minutes field as shown below.

SELECT
PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY duration_minutes) AS q1
FROM netflix_titles;
Image by author

Q3

In PostgreSQL, there is no function to directly compute the third quartile (Q3) of a numerical field/column. However, since Q3 is the 75th percentile, we can use it as a proxy to Q3. We’ll compute the Q3 of duration_minutes field as shown below.

SELECT
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY duration_minutes) AS q3
FROM netflix_titles;
Image by author

IQR

In PostgreSQL, there is no function to directly compute the interquartile range (IQR) of a numerical field/column. However, since IQR is the difference between Q3 and Q1, we can use it as a proxy to IQR. We’ll compute the IQR of duration_minutes field as shown below.

SELECT
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY duration_minutes) -
PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY duration_minutes) AS iqr
FROM netflix_titles;
Image by author

Skewness

In PostgreSQL, there is no function to directly compute the skewness of a numerical field/column. However, since skewness is 3 * (mean-median) / standard_deviation, we can use it as a proxy to skewness. We’ll compute the skewness of duration_minutes field as shown below. To keep the code more readable we’ll use CTE to compute the mean, median and standard deviation.

WITH mean_median_sd AS
(
SELECT
AVG(duration_minutes) AS mean,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY duration_minutes) AS median,
STDDEV(duration_minutes) AS stddev
FROM netflix_titles
)
SELECT
ROUND(3 * (mean - median)::NUMERIC / stddev, 2) AS skewness
FROM mean_median_sd;
Image by author

Putting it all together

We’ll put all the functions discussed earlier together and return them in a single table. In the below query, we’ll create two CTEs namely summary_stats (which returns the previously discussed statistics one per each column) and row_summary_stats (which returns the statistics one per each row).

WITH RECURSIVE
summary_stats AS
(
SELECT
ROUND(AVG(duration_minutes), 2) AS mean,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY duration_minutes) AS median,
MIN(duration_minutes) AS min,
MAX(duration_minutes) AS max,
MAX(duration_minutes) - MIN(duration_minutes) AS range,
ROUND(STDDEV(duration_minutes), 2) AS standard_deviation,
ROUND(VARIANCE(duration_minutes), 2) AS variance,
PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY duration_minutes) AS q1,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY duration_minutes) AS q3
FROM netflix_titles
),
row_summary_stats AS
(
SELECT
1 AS sno,
'mean' AS statistic,
mean AS value
FROM summary_stats
UNION
SELECT

2,
'median',
median
FROM summary_stats
UNION
SELECT

3,
'minimum',
min
FROM summary_stats
UNION
SELECT
4,
'maximum',
max
FROM summary_stats
UNION
SELECT

5,
'range',
range
FROM summary_stats
UNION
SELECT
6,
'standard deviation',
standard_deviation
FROM summary_stats
UNION
SELECT
7,
'variance',
variance
FROM summary_stats
UNION
SELECT
9,
'Q1',
q1
FROM summary_stats
UNION
SELECT
10,
'Q3',
q3
FROM summary_stats
UNION
SELECT
11,
'IQR',
(q3 - q1)
FROM summary_stats
UNION
SELECT
12,
'skewness',
ROUND(3 * (mean - median)::NUMERIC / standard_deviation, 2) AS skewness
FROM summary_stats
)
SELECT *
FROM row_summary_stats
ORDER BY sno;
Image by author

Summary statistics of categorical variables

Frequently used summary statistics for categorical variables are mode, cardinality (number of unique categories), frequency and relative frequency.

Mode

In PostgreSQL, mode of a categorical field/column is computed using the MODE() function. We’ll compute the mode of rating field as shown below.

SELECT 
MODE() WITHIN GROUP (ORDER BY rating) AS mode
FROM netflix_titles;
Image by author

Cardinality

In PostgreSQL, there is no function to directly compute the number of unique values of a categorical field/column. However, we can count the distinct values of a variable as shown below.

SELECT 
COUNT(DISTINCT rating) AS cardinality
FROM netflix_titles;
Image by author

Frequency and relative frequency

In Postgres, we can compute the frequency of the categories in a categorical variable using GROUP BY and COUNT. For relative frequency, we’ll use a CTE to count the total number of values in rating field and use it to compute the relative frequency. Since, not all databases support window functions we’ll use CTE. We’ll also discuss how to calculate relative frequency using window functions.

WITH total_count AS
(
SELECT
COUNT(rating) AS total_cnt
FROM netflix_titles
)
SELECT
rating,
COUNT(rating) AS frequency,
ROUND(COUNT(rating)::NUMERIC /
(SELECT total_cnt FROM total_count), 4) AS relative_frequency
FROM netflix_titles
GROUP BY rating
ORDER BY frequency DESC;
Image by author

In the above example, we’ve created a CTE that captures the count of values in the rating field. We then used it to calculate the percentage/relative frequency of each category in the rating field. Since, Postgres supports window functions, we’ll see a simpler way of calculating relative frequency using window functions. We’ll use OVER() to compute the sum of the counts of rating across each group which in turn is the total number of values in the rating field.

SELECT 
rating,
COUNT(rating) AS frequency,
ROUND(COUNT(rating)::NUMERIC / SUM(COUNT(rating)) OVER(), 4) AS relative_frequency
FROM netflix_titles
GROUP BY rating
ORDER BY frequency DESC;
Image by author

This brings the article to an end. We’ve discussed various functions to compute the frequently used summary statistics of numerical fields like mean, median, min, max, Q1, Q2, IQR, etc. and of categorical fields like cardinality, frequency, etc. Summary statistics can be used to get an idea of the data distribution and also aid in identifying potential outliers.

Know more about my work at https://ksvmuralidhar.in/

--

--

Data Science | ML | DL | NLP | CV | Web scraping | Kaggler | Python | SQL | Excel VBA | Tableau | About Me: https://ksvmuralidhar.in/