Writing SQL Tests

Goal: Work with an unfamiliar dataset and be confident in the final analysis.

Getting the wrong query result

Setup

Let's image that we were just hired by a buzzy tech startup that's disrupting the streaming video market. At StreamBuster we believe that there's a billion dollar market for the game-changing convenience that comes from driving to a store, picking up a physical disk (if there's a good one in stock anyway) and driving home with it to enjoy with friends and family.Think of the thrill of the hunt, when a movie you want to watch is in stock. Think about how much accomplishment you'll feel about having "brought home the bacon" to an adoring spouse and children. Streaming is dead, people want StreamBuster stores.

As the new hot-shot analyst, your team has asked you to figure out: how much payment has the company received from rentals from the the last week? Today is 09-August-2005.

Get a copy of the StreamBuster database, so you can following along at home:
curl \
  --location \
  --remote-header-name \
  --remote-name \
https://github.com/robconery/dvdrental/raw/master/dvdrental.tar > dvdrental.tar
Startup your database:
export PG_PASSWORD="password"
export PG_PORT="5432"

docker run -it --rm \
  --publish "$PG_PORT":5432 \
  --env POSTGRES_PASSWORD="$PG_PASSWORD" \
postgres
 

Load the data into your database:

docker run -it --rm \
  --env PGPASSWORD="$PG_PASSWORD" \
  --volume "$PWD":/work \
  --workdir /work \
postgres pg_restore \
  -h host.docker.internal \
  -p "$PG_PORT" \
  -U postgres \
  --verbose \
  --dbname postgres \
  ./dvdrental.tar

Get to work

Let's get to work. It's a good practice to start with SQL Testing Framework, before you start writing queries:

 mkdir queries
cd queries
git clone git@github.com:markjlorenz/sqltestingframework.git

Now we can connect to StreamBuster's database, and have a quick look around to see how they've structured their database:

docker run -it --rm \
  --env PGPASSWORD="$PG_PASSWORD" \
  --volume "$PWD":/work \
  --workdir /work \
postgres psql \
  -h host.docker.internal \
  -p "$PG_PORT" \
  -U postgres

postgres=# \dt

                List of relations
 Schema |       Name       | Type  |  Owner
--------+------------------+-------+----------
 public | actor            | table | postgres
 public | address          | table | postgres
 public | category         | table | postgres
 public | city             | table | postgres
 public | country          | table | postgres
 public | customer         | table | postgres
 public | film             | table | postgres
 public | film_actor       | table | postgres
 public | film_category    | table | postgres
 public | inventory        | table | postgres
 public | language         | table | postgres
 public | payment          | table | postgres
 public | rental           | table | postgres
 public | staff            | table | postgres
 public | store            | table | postgres
 public | tmp_test_results | table | postgres
(16 rows)

It looks like a pretty nicely normalized database. Feeling good about the decision to join this startup, we are crushing it already. What infomration is available to us in the rental table?

postgres=# \d rental

                                               Table "public.rental"
    Column    |            Type             | Collation | Nullable |                  Default
--------------+-----------------------------+-----------+----------+-------------------------------------------
 rental_id    | integer                     |           | not null | nextval('rental_rental_id_seq'::regclass)
 rental_date  | timestamp without time zone |           | not null |
 inventory_id | integer                     |           | not null |
 customer_id  | smallint                    |           | not null |
 return_date  | timestamp without time zone |           |          |
 staff_id     | smallint                    |           | not null |
 last_update  | timestamp without time zone |           | not null | now()

This will work. We should be able to go home early today, drink an IPA and trade x-men cards with our friends! Let's write a query in the queries/ directory called last-week-sales.sql We can find the dollar amount of rentals by:

-- queries/last-week-sales.sql
--
SELECT
  SUM(payment.amount)::money
FROM rental
JOIN payment
  ON payment.rental_id = rental.rental_id
WHERE rental.rental_date > '2005-08-02'    -- today is 09-August-2005
;

    sum
------------
 $21,767.03
(1 row)

Very nice. This company is doing great, and I'm sure your stock options will be worth billions. Before we hand this in to the CFO, let's use the SQL Testing Framework to sanity check our work.

We might want to know if a rental can have more than one payment. We don't think StreamBuster offers consumer financing, but that could be the next pivot. Let's use the SQL Testing Framework to check:

-- queries/test/last-week-sales.sql
--
WITH text AS (
  SELECT 'Rentals can only have one payment' AS value
), expect AS (
  SELECT 1 AS value
), actual AS (
  SELECT
    COUNT(rental_id) AS value
  FROM payment
  GROUP BY rental_id
  ORDER BY COUNT(rental_id) DESC
  LIMIT 1
)
:evaluate_test;

The SQL Tesing Framework has just a few things that it needs from us as users:

  • A Common table expression (CTE) for text. This will be how the test is identified in the test report.
  • A CTE for expect. This is the value we expect to get.
  • A CTE for actual. The result of acutal will be compared for equality with the value of expect. SQL Testing Framwork uses IS NOT DISTINCT FROM, rather than = for the check so NULL can be used in expect or actual.
  • A call to :evaluate_test; This is how you tell SQL Testing Framekwork that it should evaluate a test case on the CTEs above it.
$ cd ~/queries/test
$ ./runner.sh
.
 actual | expect | did_pass |               text
--------+--------+----------+-----------------------------------
 5      | 1      | f        | Rentals can only have one payment 
(1 row)

Oh no! It looks like StreamBuster has been offering consumer credit. In some cases 5 payments were made for one rental. This is concerning, let's check a few more things. Can a payment happen before a rental?

WITH text AS (
  SELECT 'Payments can not happen before rentals' AS value
), expect AS (
  SELECT 0 AS value
), actual AS (
  SELECT
    COUNT(*) AS value
  FROM rental
  JOIN payment
    ON payment.rental_id = rental.rental_id
  WHERE payment.payment_date :< rental.rental_date
      )
      :evaluate_test;
$ ./runner.sh
.
 actual | expect | did_pass |                  text
--------+--------+----------+----------------------------------------
 5      | 1      | f        | Rentals can only have one payment      
 0      | 0      | t        | Payments can not happen before rentals
(2 rows)

Ok, good- at least StreamBuster isn't accepting pre-payments for rentals. I wonder how much time StreamBuster gives their customers to pay. The CEO mentioned something about cash flow at the All-Hands bar crawl.

WITH text AS (
  SELECT 'Most renters make their last payment within 60 days' AS value
), expect AS (
  SELECT 60 AS value
), payment_periods AS (
  SELECT
    rental.rental_id,
    payment.payment_id,
    payment.payment_date - rental.rental_date AS payment_period
  FROM rental
  JOIN payment
    ON payment.rental_id = rental.rental_id
), actual AS (
  SELECT
    EXTRACT(day FROM AVG(payment_period)) AS value
  FROM payment_periods
)
:evaluate_test;
$ ./runner.sh
.
  actual | expect | did_pass |                        text
--------+--------+----------+-----------------------------------------------------
 5      | 1      | f        | Rentals can only have one payment                   
 0      | 0      | t        | Payments can not happen before rentals
 607    | 60     | f        | Most renters make their last payment within 60 days 
 4554   | 0      | f        | Rentals can not happen in the future                
(3 rows)

This is not good. It looks like on average, it's taking renters 607 days to make their last payment. This startup has fantastic revenue, but there probably isn't enough cash in the bank to pay the sea shanty quitntet that was in the office yestereday.

Let's check one more thing before handing in our resignation. In the original query, we looked at revenue in the last week with WHERE rental.rental_date > '2005-08-02' -- today is 09-August-2005, assuming that rentals in the future are not possible. Let's check that:

\set today '\'2005-08-09\'::date'
WITH text AS (
  SELECT 'Rentals can not happen in the future' AS value
), expect AS (
  SELECT 0 AS value
), actual AS (
  SELECT
    COUNT(*) AS value
  FROM rental
  WHERE rental.rental_date > :today
)
:evaluate_test;
\unset today
$ ./runner.sh
.
  actual | expect | did_pass |                        text
--------+--------+----------+-----------------------------------------------------
 5      | 1      | f        | Rentals can only have one payment                   
 0      | 0      | t        | Payments can not happen before rentals
 607    | 60     | f        | Most renters make their last payment within 60 days 
 4554   | 0      | f        | Rentals can not happen in the future                
(4 rows)

Since rentals are happening in the future, we need to re-think our original query. If we want to know how many payments were recieved for rentals in the last week, well need to change our query like this:

\set today '\'2005-08-09\'::date'
SELECT
  SUM(payment.amount)::money
FROM rental
JOIN payment
  ON payment.rental_id = rental.rental_id
WHERE rental.rental_date BETWEEN :today - '7 days'::interval AND :today
;

    sum
-----------
 $2,726.57
(1 row)

This is significantly less exciting than we had originally thought ($21,767.03), and we should probably ask the CFO how they want to handle payments on rental that are being made in the future 🤷‍♂️. It's not obvoius what the GAAP rules are for that!

SQL Testing Framework has given us a simple workflow to check the assumptions we made about the dataset and prevented us from delivering a wildly over-inflated payments number!

The code for this example is available on Github

Up next: Testing queries that don't write to the database. Learn how