Writing SQL Tests
Goal: Work with an unfamiliar dataset and be confident in the final analysis.

Setup
Let's image that we were just hired by a buzzy tech startup that's disrupting the streaming video market. At StreamBuster we believe that there's a billion dollar market for the game-changing convenience that comes from driving to a store, picking up a physical disk (if there's a good one in stock anyway) and driving home with it to enjoy with friends and family.Think of the thrill of the hunt, when a movie you want to watch is in stock. Think about how much accomplishment you'll feel about having "brought home the bacon" to an adoring spouse and children. Streaming is dead, people want StreamBuster stores.
As the new hot-shot analyst, your team has asked you to figure out: how much payment has the company received from rentals from the the last week? Today is 09-August-2005.
Get a copy of the StreamBuster database, so you can following along at home:Startup your database:curl \ --location \ --remote-header-name \ --remote-name \ https://github.com/robconery/dvdrental/raw/master/dvdrental.tar > dvdrental.tar
export PG_PASSWORD="password" export PG_PORT="5432" docker run -it --rm \ --publish "$PG_PORT":5432 \ --env POSTGRES_PASSWORD="$PG_PASSWORD" \ postgres
Load the data into your database:
docker run -it --rm \ --env PGPASSWORD="$PG_PASSWORD" \ --volume "$PWD":/work \ --workdir /work \ postgres pg_restore \ -h host.docker.internal \ -p "$PG_PORT" \ -U postgres \ --verbose \ --dbname postgres \ ./dvdrental.tar
Get to work
Let's get to work. It's a good practice to start with SQL Testing Framework, before you start writing queries:
mkdir queries
cd queries
git clone git@github.com:markjlorenz/sqltestingframework.git
Now we can connect to StreamBuster's database, and have a quick look around to see how they've structured their database:
docker run -it --rm \
--env PGPASSWORD="$PG_PASSWORD" \
--volume "$PWD":/work \
--workdir /work \
postgres psql \
-h host.docker.internal \
-p "$PG_PORT" \
-U postgres
postgres=# \dt
List of relations
Schema | Name | Type | Owner
--------+------------------+-------+----------
public | actor | table | postgres
public | address | table | postgres
public | category | table | postgres
public | city | table | postgres
public | country | table | postgres
public | customer | table | postgres
public | film | table | postgres
public | film_actor | table | postgres
public | film_category | table | postgres
public | inventory | table | postgres
public | language | table | postgres
public | payment | table | postgres
public | rental | table | postgres
public | staff | table | postgres
public | store | table | postgres
public | tmp_test_results | table | postgres
(16 rows)
It looks like a pretty nicely normalized database. Feeling good about the decision to join this startup, we are crushing it already. What infomration is available to us in the rental
table?
postgres=# \d rental
Table "public.rental"
Column | Type | Collation | Nullable | Default
--------------+-----------------------------+-----------+----------+-------------------------------------------
rental_id | integer | | not null | nextval('rental_rental_id_seq'::regclass)
rental_date | timestamp without time zone | | not null |
inventory_id | integer | | not null |
customer_id | smallint | | not null |
return_date | timestamp without time zone | | |
staff_id | smallint | | not null |
last_update | timestamp without time zone | | not null | now()
This will work. We should be able to go home early today, drink an IPA and trade x-men cards with our friends! Let's write a query in the queries/
directory called last-week-sales.sql
We can find the dollar amount of rentals by:
-- queries/last-week-sales.sql
--
SELECT
SUM(payment.amount)::money
FROM rental
JOIN payment
ON payment.rental_id = rental.rental_id
WHERE rental.rental_date > '2005-08-02' -- today is 09-August-2005
;
sum
------------
$21,767.03
(1 row)
Very nice. This company is doing great, and I'm sure your stock options will be worth billions. Before we hand this in to the CFO, let's use the SQL Testing Framework to sanity check our work.
We might want to know if a rental
can have more than one payment
. We don't think StreamBuster offers consumer financing, but that could be the next pivot. Let's use the SQL Testing Framework to check:
-- queries/test/last-week-sales.sql
--
WITH text AS (
SELECT 'Rentals can only have one payment' AS value
), expect AS (
SELECT 1 AS value
), actual AS (
SELECT
COUNT(rental_id) AS value
FROM payment
GROUP BY rental_id
ORDER BY COUNT(rental_id) DESC
LIMIT 1
)
:evaluate_test;
The SQL Tesing Framework has just a few things that it needs from us as users:
- A Common table expression (CTE) for
text
. This will be how the test is identified in the test report. - A CTE for
expect
. This is the value we expect to get. - A CTE for
actual
. The result ofacutal
will be compared for equality with the value ofexpect
. SQL Testing Framwork usesIS NOT DISTINCT FROM
, rather than=
for the check soNULL
can be used inexpect
oractual
. - A call to
:evaluate_test;
This is how you tell SQL Testing Framekwork that it should evaluate a test case on the CTEs above it.
$ cd ~/queries/test
$ ./runner.sh
.
actual | expect | did_pass | text
--------+--------+----------+-----------------------------------
5 | 1 | f | Rentals can only have one payment
(1 row)
Oh no! It looks like StreamBuster has been offering consumer credit. In some cases 5 payments were made for one rental. This is concerning, let's check a few more things. Can a payment happen before a rental?
WITH text AS (
SELECT 'Payments can not happen before rentals' AS value
), expect AS (
SELECT 0 AS value
), actual AS (
SELECT
COUNT(*) AS value
FROM rental
JOIN payment
ON payment.rental_id = rental.rental_id
WHERE payment.payment_date :< rental.rental_date
)
:evaluate_test;
$ ./runner.sh
.
actual | expect | did_pass | text
--------+--------+----------+----------------------------------------
5 | 1 | f | Rentals can only have one payment
0 | 0 | t | Payments can not happen before rentals
(2 rows)
Ok, good- at least StreamBuster isn't accepting pre-payments for rentals. I wonder how much time StreamBuster gives their customers to pay. The CEO mentioned something about cash flow at the All-Hands bar crawl.
WITH text AS (
SELECT 'Most renters make their last payment within 60 days' AS value
), expect AS (
SELECT 60 AS value
), payment_periods AS (
SELECT
rental.rental_id,
payment.payment_id,
payment.payment_date - rental.rental_date AS payment_period
FROM rental
JOIN payment
ON payment.rental_id = rental.rental_id
), actual AS (
SELECT
EXTRACT(day FROM AVG(payment_period)) AS value
FROM payment_periods
)
:evaluate_test;
$ ./runner.sh
.
actual | expect | did_pass | text
--------+--------+----------+-----------------------------------------------------
5 | 1 | f | Rentals can only have one payment
0 | 0 | t | Payments can not happen before rentals
607 | 60 | f | Most renters make their last payment within 60 days
4554 | 0 | f | Rentals can not happen in the future
(3 rows)
This is not good. It looks like on average, it's taking renters 607 days to make their last payment. This startup has fantastic revenue, but there probably isn't enough cash in the bank to pay the sea shanty quitntet that was in the office yestereday.
Let's check one more thing before handing in our resignation. In the original query, we looked at revenue in the last week with WHERE rental.rental_date > '2005-08-02' -- today is 09-August-2005
, assuming that rentals in the future are not possible. Let's check that:
\set today '\'2005-08-09\'::date'
WITH text AS (
SELECT 'Rentals can not happen in the future' AS value
), expect AS (
SELECT 0 AS value
), actual AS (
SELECT
COUNT(*) AS value
FROM rental
WHERE rental.rental_date > :today
)
:evaluate_test;
\unset today
$ ./runner.sh
.
actual | expect | did_pass | text
--------+--------+----------+-----------------------------------------------------
5 | 1 | f | Rentals can only have one payment
0 | 0 | t | Payments can not happen before rentals
607 | 60 | f | Most renters make their last payment within 60 days
4554 | 0 | f | Rentals can not happen in the future
(4 rows)
Since rentals are happening in the future, we need to re-think our original query. If we want to know how many payments were recieved for rentals in the last week, well need to change our query like this:
\set today '\'2005-08-09\'::date'
SELECT
SUM(payment.amount)::money
FROM rental
JOIN payment
ON payment.rental_id = rental.rental_id
WHERE rental.rental_date BETWEEN :today - '7 days'::interval AND :today
;
sum
-----------
$2,726.57
(1 row)
This is significantly less exciting than we had originally thought ($21,767.03
), and we should probably ask the CFO how they want to handle payments on rental that are being made in the future 🤷♂️. It's not obvoius what the GAAP rules are for that!
SQL Testing Framework has given us a simple workflow to check the assumptions we made about the dataset and prevented us from delivering a wildly over-inflated payments number!
The code for this example is available on Github