<?xml version="1.0" encoding="UTF-8" ?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" version="2.0"><channel><title>Christopher Winslett | CrunchyData Blog</title>
<atom:link href="https://www.crunchydata.com/blog/author/christopher-winslett/rss.xml" rel="self" type="application/rss+xml" />
<link>https://www.crunchydata.com/blog/author/christopher-winslett</link>
<image><url>https://www.crunchydata.com/build/_assets/christopher-winslett.png-53E7CT5Z.webp</url>
<title>Christopher Winslett | CrunchyData Blog</title>
<link>https://www.crunchydata.com/blog/author/christopher-winslett</link>
<width>335</width>
<height>322</height></image>
<description>PostgreSQL experts from Crunchy Data share advice, performance tips, and guides on successfully running PostgreSQL and Kubernetes solutions</description>
<language>en-us</language>
<pubDate>Fri, 24 Oct 2025 09:00:00 EDT</pubDate>
<dc:date>2025-10-24T13:00:00.000Z</dc:date>
<dc:language>en-us</dc:language>
<sy:updatePeriod>hourly</sy:updatePeriod>
<sy:updateFrequency>1</sy:updateFrequency>
<item><title><![CDATA[ Temporal Joins ]]></title>
<link>https://www.crunchydata.com/blog/temporal-joins</link>
<description><![CDATA[ How do you return the Nth related record from joined tables in PostgreSQL? Did you know that Postgres gives you a few more options than other databases? Use DISTINCT ON and window functions to solve common temporal join challenges and avoid N+1 query problems. ]]></description>
<content:encoded><![CDATA[ <p>My first thought on seeing a temporal join in 2008 was, “Why is this query so complex?” The company I was at relied heavily on database queries, as it was a CRM and student success tracking system for colleges and universities. The query returned a filtered list of users and their last associated record from a second table. The hard part wasn’t returning the last timestamp or even performing joins; it was returning <em>only their last associated record</em> from a second table.<p>Back in 2008, we didn’t have window functions or CTEs, so the query was a series of nested subqueries that looked like this:<pre><code class=language-sql>SELECT
    *
FROM users, ( -- find the record for the last second_table by created_at and user_id
                SELECT
                    second_table.*
                FROM second_table, ( -- find the last second_table created_at per user_id
                                        SELECT
                                            user_id,
                                            max(created_at) AS created_at
                                        FROM second_table
                                        GROUP BY 1
                                    ) AS last_second_table_at
                WHERE
                    last_second_table_at.user_id = second_table.user_id
                    AND second_table.created_at = last_second_table_at.created_at
            ) AS last_second_table
WHERE users.id = last_second_table.user_id;
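
-- With the sample data below, this query returns 4 rows instead of 3:
-- Alice (user_id 1) appears twice because two of her rows share the same
-- max(created_at). You can surface the offending tie directly:
SELECT user_id, created_at, count(*)
FROM second_table
GROUP BY 1, 2
HAVING count(*) > 1;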
</code></pre><p><strong>See the Sample Code section below for the schema and data to run these queries.</strong><p>But, even that query was wrong because the second table may have records with <strong>duplicate</strong> <code>created_at</code> values. That was the source of a bug back in 2008 that resulted in duplicate rows being listed.<p>Obviously, we weren't using Postgres at the time because there has always been a simpler way to do this in Postgres using <code>DISTINCT ON</code>:<pre><code class=language-sql>SELECT DISTINCT ON (u.id)
    u.id,
    u.name,
    s.created_at AS last_action_time,
    s.action_type
FROM users u
JOIN second_table s ON u.id = s.user_id
ORDER BY u.id, s.created_at DESC, s.id DESC;
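
-- DISTINCT ON keeps only the first row per u.id under the ORDER BY, so the
-- s.id DESC tie-breaker guarantees one row per user even when created_at
-- values collide. A quick sanity check against the sample data:
SELECT DISTINCT ON (user_id) user_id, id, created_at
FROM second_table
ORDER BY user_id, created_at DESC, id DESC;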
</code></pre><p>Temporal joins require attention to detail.<h2 id=robust-solution-ctes--window-functions><a href=#robust-solution-ctes--window-functions>Robust Solution: CTEs &#38; Window Functions</a></h2><p>Before we go too far into the topic, for those looking for a solution to their current problem, below is how I would write that query today when you need something other than the <strong>first or last</strong> record in a series. For these situations, we use <a href=https://www.crunchydata.com/blog/postgres-subquery-powertools-subqueries-ctes-materialized-views-window-functions-and-lateral>CTEs and window functions</a>; there's no need to nest queries when we can abstract them for a cleaner purpose. Here is a template for the temporal joins that <code>DISTINCT ON</code> cannot handle:<pre><code class=language-sql>WITH max_second_table AS (
    SELECT
        *
    FROM (
            SELECT
                *,
                -- Use the ROW_NUMBER() window function to rank each user's
                -- records from newest to oldest. The ORDER BY clause is critical:
                -- 1. created_at DESC puts the latest record at row_order = 1.
                -- 2. id DESC serves as a reliable tie-breaker for duplicate timestamps.
                ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC, id DESC) AS row_order
            FROM second_table
        ) AS ordered_second_table
    WHERE row_order = 2
)

SELECT
    *
FROM users
LEFT JOIN max_second_table ON users.id = max_second_table.user_id;
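
-- Note: the filter in the CTE controls which occurrence you join.
-- row_order = 1 returns the latest record (equivalent to the DISTINCT ON
-- query above), row_order = 3 the third-latest, and so on.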
</code></pre><p>In this example, we are joining the <strong>second occurrence</strong> (<code>WHERE row_order = 2</code>) in the <code>second_table</code> for a user. For the university example, we used these types of queries to report on progress over time by showing the 1st, 2nd, 3rd, and Nth occurrences of events.<p>Is this actually less code than the first example? No, but it is compartmentalized with a cleaner purpose.<p>Also, introducing the <strong>primary key (<code>id</code>)</strong> in the <code>ORDER BY</code> clause provides the necessary <strong>tie-breaker logic</strong> for the sorting -- that is how we fixed the SQL issue in the opening text.<h2 id=problem-with-orms><a href=#problem-with-orms>Problem with ORMs</a></h2><p>Because of the query complexity involved, ORMs are generally not capable of handling temporal joins without complex manipulation. The ORM I'm most familiar with is <strong>ActiveRecord</strong>, part of the Ruby on Rails suite. When Rails developers encounter temporal joins, they typically resort to the <strong><a href=https://www.crunchydata.com/blog/postgresql-for-solving-n+1-queries-in-ruby-on-rails>N+1 query pattern</a></strong> from their application code like this:<pre><code class=language-ruby>users = User.all

users.each do |user|
  last_action = user.second_table.last
end
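
# A hedged sketch of doing this in one round-trip instead (assumes a
# SecondTable model backing second_table; names are illustrative, not
# from the post):
#
# last_actions = SecondTable
#   .select("DISTINCT ON (user_id) *")
#   .order("user_id, created_at DESC, id DESC")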
</code></pre><p>If you aren't running this query too frequently or over too many user records, this is generally <em>performant enough</em>. However, this approach becomes <strong>suboptimal</strong> for application performance as the user list grows because each iteration of the loop requires a network hop back and forth with the database and an object initialization in the application. While you can make ActiveRecord do this natively, the resulting code is often harder to read and maintain for the typical use case—a pattern you see in other ORMs as well.<h2 id=sample-code><a href=#sample-code>Sample Code</a></h2><p>Below is the sample SQL you can use to load data into your database to test a few of these queries. Note that <strong>Alice has two actions at the exact same timestamp</strong> to replicate the original bug scenario.<pre><code class=language-sql>CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    name TEXT
);

CREATE TABLE second_table (
    id SERIAL PRIMARY KEY,
    user_id INTEGER REFERENCES users(id),
    action_type TEXT,
    created_at TIMESTAMP WITHOUT TIME ZONE
);

INSERT INTO users (name) VALUES ('Alice'), ('Bob'), ('Charlie');

-- Alice has two actions at the exact same timestamp (The 2008 bug scenario)
INSERT INTO second_table (user_id, action_type, created_at) VALUES
(1, 'login', '2023-10-01 10:00:00'),
(1, 'page_view', '2023-10-01 10:00:00'),
(2, 'purchase', '2023-10-02 11:00:00'),
(3, 'registration', '2023-10-03 12:00:00'),
(3, 'profile_update', '2023-10-04 13:00:00');
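
-- After loading, a quick sanity check of the fixture sizes:
SELECT (SELECT count(*) FROM users)        AS user_count,   -- 3
       (SELECT count(*) FROM second_table) AS action_count; -- 5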
</code></pre><h2 id=conclusion><a href=#conclusion>Conclusion</a></h2><p>The term "temporal join" isn't a common piece of developer jargon, but the underlying pattern, retrieving the Nth related record, is critical in <a href=https://www.crunchydata.com/blog/window-functions-for-data-analysis-with-postgres>reporting and analytics</a>. It's a known pattern among people who have worked on a code base that relies heavily on SQL capabilities, typically for reporting.<p>Using the PostgreSQL feature <strong><code>DISTINCT ON</code></strong> for the simplest case, or <strong><a href=https://www.crunchydata.com/developers/playground/ctes-and-window-functions>CTEs with Window Functions</a></strong> for complex retrieval, we avoid the bugs of older SQL patterns and eliminate the performance penalty of the N+1 problem.<p>If you would like to learn more about advanced SQL patterns, check out our <a href=https://www.crunchydata.com/developers/playground>Postgres Playground</a>.</p> ]]></content:encoded>
<category><![CDATA[ Fun with SQL ]]></category>
<author><![CDATA[ Christopher.Winslett@crunchydata.com (Christopher Winslett) ]]></author>
<dc:creator><![CDATA[ Christopher Winslett ]]></dc:creator>
<guid isPermalink="false">da09e0153c614a35f860efd24fb3f98f0bff8151300e44ffb6009d17dce4c1b8</guid>
<pubDate>Fri, 24 Oct 2025 09:00:00 EDT</pubDate>
<dc:date>2025-10-24T13:00:00.000Z</dc:date>
<atom:updated>2025-10-24T13:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ Creating Histograms with Postgres ]]></title>
<link>https://www.crunchydata.com/blog/histograms-with-postgres</link>
<description><![CDATA[ Histograms are elegant tools for visualizing the distribution of values. We walk through building a reusable query for your histogram needs. ]]></description>
<content:encoded><![CDATA[ <p>Histograms were first used in a lecture in 1892 by Karl Pearson — the godfather of mathematical statistics. With the number of data presentation tools we have today, it’s hard to imagine that representing data as a graphic was once classified as “innovation”, but it was. A histogram is a graphic presentation of the distribution and frequency of data. If you haven’t seen one recently, or don’t know the word histogram off the top of your head - it is a bar chart in which each bar represents the count of data within a defined range of values. When Pearson built the first histogram, he calculated it by hand. Today we can use SQL (or even Excel) to extract this data continuously across large data sets.<p>While true statistical histograms have a bit more complexity for choosing bin ranges, for many business intelligence purposes, Postgres <code>width_bucket</code> is good enough for counting data inside bins with minimal effort.<h2 id=postgres-width_bucket-for-histograms><a href=#postgres-width_bucket-for-histograms>Postgres width_bucket for histograms</a></h2><p>Given the number of buckets and a max/min value, <code>width_bucket</code> returns the index of the bucket that a value falls into. For instance, given a minimum value of 0, a maximum value of 100, and 10 buckets, a value of 43 would fall in bucket #5: <code>select width_bucket(43, 0, 100, 10) AS bucket;</code> But is 5 really the right bucket for 43? Let’s see.<p>You can see how the values would fall using <code>generate_series</code> (shown below using <a href=https://metabase.com>Metabase</a>):<pre><code class=language-sql>SELECT value, width_bucket(value, 0, 100, 10) AS bucket FROM generate_series(0, 100) AS value;
</code></pre><p><img alt="postgres histogram 1-100" loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/3aec99f0-696b-4d9b-0c7b-8ce9fc415200/public><p>When running the query, the values 0 through 9 go into bucket 1. As you can see in the image above, <code>width_bucket</code> behaves as a step function that starts indexing at 1. In this scenario, when passed a value of 100, <code>width_bucket</code> returns 11, because the maximum is treated as an exclusive bound (i.e. the logic is minimum &#60;= value &#60; maximum, and anything at or above the maximum lands in an overflow bucket).<p>We can use the bucket value to generate more readable labels.<h2 id=auto-formatting-histogram-with-sql><a href=#auto-formatting-histogram-with-sql>Auto-formatting histogram with SQL</a></h2><p>Let’s build out a larger query that creates ranges, range labels, and formats the histogram. We will start by using a synthetic table within a CTE called <code>formatted_data</code>. We are doing it this way so that we can replace that query with new data in the future.<p>Here’s the beginning of the query (this is copy-pastable into Postgres):<pre><code class=language-sql>WITH formatted_data AS (
  SELECT * FROM (VALUES (13), (42), (18), (62), (93), (47), (51), (41), (1)) AS t (value)
)
SELECT
  WIDTH_BUCKET(value, 0, 100, 10) AS bucket,
  COUNT(value)
FROM formatted_data
  GROUP BY 1
  ORDER BY 1;
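
-- With the nine sample values this groups as:
--  bucket | count
--       1 |     1      (value 1)
--       2 |     2      (13, 18)
--       5 |     3      (41, 42, 47)
--       6 |     1      (51)
--       7 |     1      (62)
--      10 |     1      (93)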
</code></pre><p>Let’s use another CTE to define some settings for our <code>width_bucket</code>:<pre><code class=language-sql>WITH formatted_data AS (
  SELECT * FROM (VALUES (13), (42), (18), (62), (93), (47), (51), (41), (1)) AS t (value)
), bucket_settings AS (
	SELECT
		10 as bucket_count,
		0::integer AS min_value, -- can be null::integer or an integer
		100::integer AS max_value -- can be null::integer or an integer
)

SELECT
  WIDTH_BUCKET(value,
	  (SELECT min_value FROM bucket_settings),
		(SELECT max_value FROM bucket_settings),
		(SELECT bucket_count FROM bucket_settings)
	) AS bucket,
  COUNT(value)
FROM formatted_data
  GROUP BY 1
  ORDER BY 1;
</code></pre><p>In the <code>bucket_settings</code> CTE, we use <code>::integer</code> to cast any value there as an integer. We do this since we will want to compare NULL against other integers later. If we don’t cast NULLs then the SQL will fail.<p>Now, we will use a CTE called <code>calculated_bucket_settings</code> to set a dynamic range if the static range is not defined. This will let the data specify the values if they are not defined by the <code>bucket_settings</code>:<pre><code class=language-sql>WITH formatted_data AS (
  SELECT * FROM (VALUES (13), (42), (18), (62), (93), (47), (51), (41), (1)) AS t (value)
), bucket_settings AS (
	SELECT
		5 AS bucket_count,
		null::integer AS min_value, -- can be null or an integer
		null::integer AS max_value -- can be null or an integer
), calculated_bucket_settings AS (
	SELECT
		(SELECT bucket_count FROM bucket_settings) AS bucket_count,
		COALESCE(
			(SELECT min_value FROM bucket_settings),
			(SELECT min(value) FROM formatted_data)
		) AS min_value,
		COALESCE(
			(SELECT max_value FROM bucket_settings),
			(SELECT max(value) + 1 FROM formatted_data)
		) AS max_value
), histogram AS (
  SELECT
     WIDTH_BUCKET(value, min_value, max_value, (SELECT bucket_count FROM bucket_settings)) AS bucket,
     COUNT(value) AS frequency
   FROM formatted_data, calculated_bucket_settings
   GROUP BY 1
   ORDER BY 1
)

SELECT
   bucket,
   frequency,
   CONCAT(
     (min_value + (bucket - 1) * (max_value - min_value) / bucket_count)::INT,
     ' - ',
     (((min_value + bucket * (max_value - min_value) / bucket_count)) - 1)::INT) AS range
FROM histogram, calculated_bucket_settings;
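
-- Worked example of the label math for bucket 3 (min_value 1, max_value 94,
-- bucket_count 5, integer division):
--   lower bound: 1 + (3 - 1) * (94 - 1) / 5   = 1 + 37 = 38
--   upper label: (1 + 3 * (94 - 1) / 5) - 1   = 56 - 1 = 55
-- which produces the "38 - 55" range shown below.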
</code></pre><p>In the <code>calculated_bucket_settings</code> CTE, we use <code>max(value) + 1</code> as the default maximum because the range of values is treated as an exclusive range. Also, because we are working with integers, when we create the pretty label for the <code>range</code>, we subtract 1 from the maximum value for the range to reduce confusion from what would appear to be overlapping ranges. This decision fits into the “good-enough for business intelligence” caveats listed above. We could have changed the label logic to be <code>75 &#60;= value &#60; 94</code> in lieu of the subtraction, but most folks would rather see the dash than math logic for a histogram.<p>The query above will give results like the following:<pre><code class=language-sql>bucket   | frequency |  range
---------+-----------+---------
       1 |         3 | 1 - 18
       3 |         4 | 38 - 55
       4 |         1 | 56 - 74
       5 |         1 | 75 - 93
(4 rows)
</code></pre><p>Now we see that not all buckets and frequencies are represented. So, if a bucket is empty, we need to fill in its frequency with a zero. This is where SQL requires thinking in sets. We can use <code>generate_series</code> to generate all values for the buckets, then join the histogram to all values. Flipping the order of the query around makes it simpler than joining an incomplete set. In the following query, we’ve built out the buckets in the <code>all_buckets</code> CTE, then joined that to the histogram values:<pre><code class=language-sql>WITH formatted_data AS (
  SELECT * FROM (VALUES (13), (42), (18), (62), (93), (47), (51), (41), (1)) AS t (value)
), bucket_settings AS (
  SELECT
        5 AS bucket_count,
        0::integer AS min_value, -- can be null or an integer
        100::integer AS max_value -- can be null or an integer
), calculated_bucket_settings AS (
	SELECT
	  (SELECT bucket_count FROM bucket_settings) AS bucket_count,
	  COALESCE(
	          (SELECT min_value FROM bucket_settings),
	          (SELECT min(value) FROM formatted_data)
	  ) AS min_value,
	  COALESCE(
	          (SELECT max_value FROM bucket_settings),
	          (SELECT max(value) + 1 FROM formatted_data)
	  ) AS max_value
), histogram AS (
  SELECT
    WIDTH_BUCKET(value, calculated_bucket_settings.min_value, calculated_bucket_settings.max_value + 1, (SELECT bucket_count FROM bucket_settings)) AS bucket,
    COUNT(value) AS frequency
  FROM formatted_data, calculated_bucket_settings
  GROUP BY 1
  ORDER BY 1
 ), all_buckets AS (
  SELECT
    fill_buckets.bucket AS bucket,
    FLOOR(calculated_bucket_settings.min_value + (fill_buckets.bucket - 1) * (calculated_bucket_settings.max_value - calculated_bucket_settings.min_value) / (SELECT bucket_count FROM bucket_settings)) AS min_value,
    FLOOR(calculated_bucket_settings.min_value + fill_buckets.bucket * (calculated_bucket_settings.max_value - calculated_bucket_settings.min_value) / (SELECT bucket_count FROM bucket_settings)) AS max_value
  FROM calculated_bucket_settings,
	  generate_series(1, calculated_bucket_settings.bucket_count) AS fill_buckets (bucket))

 SELECT
   all_buckets.bucket AS bucket,
   CASE
   WHEN all_buckets IS NULL THEN
	   'out of bounds'
	 ELSE
     CONCAT(all_buckets.min_value, ' - ', all_buckets.max_value - 1)
   END AS range,
   SUM(COALESCE(histogram.frequency, 0)) AS frequency
 FROM all_buckets
 FULL OUTER JOIN histogram ON all_buckets.bucket = histogram.bucket
 GROUP BY 1, 2
 ORDER BY bucket;
</code></pre><p>Try modifying the values in the <code>bucket_settings</code> CTE to see how the histogram responds. Change the <code>bucket_count</code>, <code>min_value</code>, or <code>max_value</code>, and the buckets and labels adjust accordingly. If you modify the range to exclude values, then thanks to the <code>FULL OUTER JOIN</code>, you’ll see that all non-classified items are bucketed as “out of bounds”.<p>Using a presentation tool, display the histogram as a bar chart (shown below using <a href=https://metabase.com>Metabase</a>):<p><img alt="postgres histogram" loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/fb95ee22-4d7d-40c5-3d2c-74a610cbd000/public><h2 id=real-life-data-with-histograms><a href=#real-life-data-with-histograms>Real Life Data with Histograms</a></h2><p>Now that we have a nice auto-adjusting query, we can easily build a histogram from other data sets. I have a little experimental database loaded from the <a href=https://aact.ctti-clinicaltrials.org/download>database of clinical trials</a>.<p>What if we wanted to build a histogram for the count of participants in various clinical trial studies? To start, build the query that finds the number of participants for each study:<pre><code class=language-sql>SELECT
	outcomes.nct_id,
	max(outcome_counts.count) AS value
FROM outcomes
INNER JOIN outcome_counts ON outcomes.id = outcome_counts.outcome_id
WHERE param_type = 'COUNT_OF_PARTICIPANTS'
GROUP BY 1
</code></pre><p>We can take the above query, and place it in the <code>formatted_data</code> CTE:<pre><code class=language-sql>WITH formatted_data AS (
	SELECT
		outcomes.nct_id,
		MAX(outcome_counts.count) AS value
	FROM outcomes
	INNER JOIN outcome_counts ON outcomes.id = outcome_counts.outcome_id
	WHERE param_type = 'COUNT_OF_PARTICIPANTS'
	GROUP BY 1
), bucket_settings AS (
  SELECT
        20 AS bucket_count,
        null::integer AS min_value, -- can be null or an integer
        null::integer AS max_value -- can be null or an integer
), calculated_bucket_settings AS (
	SELECT
	  (SELECT bucket_count FROM bucket_settings) AS bucket_count,
	  COALESCE(
	          (SELECT min_value FROM bucket_settings),
	          (SELECT min(value) FROM formatted_data)
	  ) AS min_value,
	  COALESCE(
	          (SELECT max_value FROM bucket_settings),
	          (SELECT max(value) + 1 FROM formatted_data)
	  ) AS max_value
), histogram AS (
  SELECT
    WIDTH_BUCKET(value, calculated_bucket_settings.min_value, calculated_bucket_settings.max_value + 1, (SELECT bucket_count FROM bucket_settings)) AS bucket,
     COUNT(value) AS frequency
   FROM formatted_data, calculated_bucket_settings
   GROUP BY 1
   ORDER BY 1
 ), all_buckets AS (
   SELECT
     fill_buckets.bucket AS bucket,
     FLOOR(calculated_bucket_settings.min_value + (fill_buckets.bucket - 1) * (calculated_bucket_settings.max_value - calculated_bucket_settings.min_value) / (SELECT bucket_count FROM bucket_settings)) AS min_value,
     FLOOR(calculated_bucket_settings.min_value + fill_buckets.bucket * (calculated_bucket_settings.max_value - calculated_bucket_settings.min_value) / (SELECT bucket_count FROM bucket_settings)) AS max_value
   FROM calculated_bucket_settings,
	   generate_series(1, calculated_bucket_settings.bucket_count) AS fill_buckets (bucket))

 SELECT
   all_buckets.bucket AS bucket,
   CASE
   WHEN all_buckets IS NULL THEN
	   'out of bounds'
	 ELSE
     CONCAT(all_buckets.min_value, ' - ', all_buckets.max_value - 1)
   END AS range,
   SUM(COALESCE(histogram.frequency, 0)) AS frequency
 FROM all_buckets
 FULL OUTER JOIN histogram ON all_buckets.bucket = histogram.bucket
 GROUP BY 1, 2
 ORDER BY bucket;
</code></pre><p>The query will output the following. This is a bit undesirable because the distribution is concentrated in the first bucket:<pre><code class=language-sql> bucket |       range       | frequency
--------+-------------------+-----------
      1 | 1 - 359943        |     23261
      2 | 359944 - 719886   |         3
      3 | 719887 - 1079829  |         1
      4 | 1079830 - 1439773 |         0
      5 | 1439774 - 1799716 |         1
      6 | 1799717 - 2159659 |         0
      7 | 2159660 - 2519602 |         0
      8 | 2519603 - 2879546 |         0
      9 | 2879547 - 3239489 |         0
     10 | 3239490 - 3599432 |         0
     11 | 3599433 - 3959375 |         0
     12 | 3959376 - 4319319 |         0
     13 | 4319320 - 4679262 |         0
     14 | 4679263 - 5039205 |         0
     15 | 5039206 - 5399148 |         0
     16 | 5399149 - 5759092 |         0
     17 | 5759093 - 6119035 |         0
     18 | 6119036 - 6478978 |         0
     19 | 6478979 - 6838921 |         0
     20 | 6838922 - 7198865 |         1
(20 rows)

</code></pre><p>If you’ve loaded the data, to improve the presentation, we can adjust the <code>bucket_settings</code> CTE to modify how the buckets are defined. For instance, with this dataset, if we changed the bucket settings to:<pre><code class=language-sql>  SELECT
        20 AS bucket_count,
        0::integer AS min_value, -- can be null or an integer
        100::integer AS max_value -- can be null or an integer
</code></pre><p>It outputs a much nicer distribution of data:<pre><code class=language-sql> bucket |     range     | frequency
--------+---------------+-----------
      1 | 0 - 49        |     13584
      2 | 50 - 99       |      3612
      3 | 100 - 149     |      1720
      4 | 150 - 199     |       942
      5 | 200 - 249     |       645
      6 | 250 - 299     |       477
      7 | 300 - 349     |       338
      8 | 350 - 399     |       237
      9 | 400 - 449     |       176
     10 | 450 - 499     |       137
     11 | 500 - 549     |       150
     12 | 550 - 599     |       101
     13 | 600 - 649     |        77
     14 | 650 - 699     |        58
     15 | 700 - 749     |        61
     16 | 750 - 799     |        41
     17 | 800 - 849     |        41
     18 | 850 - 899     |        33
     19 | 900 - 949     |        36
     20 | 950 - 999     |        43
        | out of bounds |       758
</code></pre><h2 id=in-brief><a href=#in-brief>In brief</a></h2><ul><li>Postgres <code>width_bucket</code> assigns values to buckets so you can count frequencies and create histograms.<ul><li>The function assigns values to predefined buckets based on a min/max range and a bucket count.<li>By casting, you can work with data that contains some null values<li>Values that fall outside the defined range can still be counted as “out of bounds”</ul><li>By using Common Table Expressions (CTEs), you can define bucket settings dynamically with auto-adjusting bins based on the dataset.<li>Histograms aid with the visualization of data and data distribution in your set. Histograms show how frequently data points appear within specific ranges (bins), making it easier to understand patterns, trends, and outliers. Bin size does affect interpretation, so choosing the right number of bins is crucial; too few can oversimplify the data, while too many can create noise and obscure trends.</ul><p>Build an interesting histogram? Show us <a href=https://x.com/crunchydata>@crunchydata</a>! ]]></content:encoded>
<category><![CDATA[ Analytics ]]></category>
<author><![CDATA[ Christopher.Winslett@crunchydata.com (Christopher Winslett) ]]></author>
<dc:creator><![CDATA[ Christopher Winslett ]]></dc:creator>
<guid isPermalink="false">62085653255fdab2276832f69926de399f4b9c4e76871d17191be67b2a96104d</guid>
<pubDate>Fri, 04 Apr 2025 10:00:00 EDT</pubDate>
<dc:date>2025-04-04T14:00:00.000Z</dc:date>
<atom:updated>2025-04-04T14:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ 8 Steps in Writing Analytical SQL Queries ]]></title>
<link>https://www.crunchydata.com/blog/8-steps-in-writing-analytical-sql-queries</link>
<description><![CDATA[ Chris breaks down his approach to building out complex SQL step by step. ]]></description>
<content:encoded><![CDATA[ <p>It is never immediately obvious how to go from a simple SQL query to a complex one -- especially if it involves intricate calculations. One of the “dangers” of SQL is that you can create an executable query that returns the wrong data. For example, it is easy to inflate the value of a calculated field by joining to multiple rows.<p>Use Crunchy Playground to follow along with this blog post using a Postgres terminal:<p><a href="https://www.crunchydata.com/developers/playground?sql=https://gist.githubusercontent.com/Winslett/328f3332f3a54ef5ea5f70f8fe72afb3/raw/09387ac8cd5075b1f8179a25b951870b97a4e84e/fake-data.sql">Postgres Playground w/ Sample Data</a><p>Let’s take a look at a sample query. It appears to compute a summary total of invoice amounts across teams. If you look closely, you might see that the joins would inflate a team’s yearly invoice spend for each team member.<pre><code class=language-sql>SELECT
	teams.id,
	json_agg(accounts.email),
	SUM(invoices.amount)
FROM teams
	INNER JOIN team_members ON teams.id = team_members.team_id
	INNER JOIN accounts ON team_members.account_id = accounts.id
	INNER JOIN invoices ON teams.id = invoices.team_id
WHERE lower(invoices.period) > date_trunc('year', current_date)
GROUP BY 1;

</code></pre><p>The query is joining <code>invoices</code> to <code>teams</code> after already joining <code>team_members</code> to <code>teams</code>. If a team has multiple members and multiple invoices, each invoice amount could be counted multiple times in the <code>SUM(invoices.amount)</code> calculation.<h2 id=building-sql-from-the-ground-up><a href=#building-sql-from-the-ground-up>Building SQL from the ground up</a></h2><p>The above error may not be immediately obvious. This is why it’s better to start small and use building blocks.<p>Writing complex SQL isn’t as much “writing a query” as it is “building a query.” By combining the building blocks, you get the data that you think you are getting. To write a complex query, loop through the following steps until you get to the intended data:<ol><li>Using words, define the data<li>Investigate available data<li>Return the simplest data<li>Confirm the simple data<li>Augment the data with joins<li>Perform summations<li>Augment with details or aggregates<li>Debugging</ol><p>Let’s step through this above query example, getting sum aggregate totals, to learn my method for building a query.<h3 id=step-1-in-human-words-write-what-you-want><a href=#step-1-in-human-words-write-what-you-want>Step 1: In human words, write what you want</a></h3><p>Write a description, and know it is okay if it changes. Data exploration may mean the data is different than expected. But, it’s a starting point. I usually do this by adding a comment at the top of the editor:<pre><code>-- Return all teams, email addresses for the team, and the
-- year-to-date total spend
</code></pre><h3 id=step-2-investigate-the-data-in-the-tables><a href=#step-2-investigate-the-data-in-the-tables>Step 2: Investigate the data in the tables</a></h3><p>Even when familiar with the data set, I spend time ensuring the data has not changed. First, if using <code>psql</code>, list the tables:<pre><code>psql> \dt
psql> \d invoices
</code></pre><p>There are many SQL clients, and all of them should enable listing and viewing tables and table structures. To further inspect, write a simple query to sample the data:<pre><code class=language-sql>SELECT * FROM invoices;
</code></pre><p>Try this on a few different tables. By inspecting column names and column data, I can see a pattern of relationships. When exploring a dataset created by someone else, it can be difficult to determine relationships. Data isn’t always clean. Columns may be incorrectly named. "Magic strings" and "magic integers" may not make sense. Multiple application developers implement different philosophies with data structures.<p>To verify table structures, I take a two-step approach: 1) compare it to known data, and 2) ask people involved with the project. When asking a person about the structure of data, they will never respond with "yes" or "no" -- the data structure always has a story. It’s important to verify relationships -- it is possible to join two non-related fields.<h3 id=step-3-find-the-simplest-data-first><a href=#step-3-find-the-simplest-data-first>Step 3: Find the simplest data first</a></h3><p>In this scenario, the easiest data to return is the invoices. We also want to calculate the team spend for the year. First, reduce to invoices that should go into the calculations:<pre><code class=language-sql>SELECT
	*
FROM invoices
WHERE lower(period) > date_trunc('year', current_date);
</code></pre><p>Look over the data and confirm the returned rows match expected data: included and excluded. When viewing the data, add a where conditional for attributes that should be excluded. A common issue with missing rows on conditionals is NULL values. The following conditional will also exclude when <code>deleted_at</code> is <code>NULL</code>:<pre><code class=language-sql>AND deleted_at &#60 date_trunc('year', current_date)
</code></pre><p>To include <code>NULL</code> values, the conditional will need to be expanded to:<pre><code class=language-sql>AND (deleted_at &#60 date_trunc('year', current_date) OR
deleted_at IS NULL)
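-- An equivalent sketch using COALESCE (assuming deleted_at is a timestamp
-- column): substitute a sentinel so NULL values always pass the comparison
-- AND COALESCE(deleted_at, '-infinity'::timestamp) &#60 date_trunc('year', current_date)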
</code></pre><h3 id=step-4-confirm-the-simple-data><a href=#step-4-confirm-the-simple-data>Step 4: Confirm the simple data</a></h3><p>When working through complex queries that require precision, like financial reports, you may need to audit the results row by row. Step through each row and confirm the results. Then, step through a known set of good data and ensure data is not missing. Many mis-written SQL queries are found via this 2-sided verification.<h3 id=step-5-add-joins-but-do-not-add-calculations-yet><a href=#step-5-add-joins-but-do-not-add-calculations-yet>Step 5: Add joins, but do not add calculations yet</a></h3><p>Start with the most reasonable <a href=https://www.crunchydata.com/developers/playground/joins-in-postgres>joins</a> first. Because this is an example, we already know the data. In the real world, this step requires additional experimentation and data validation from team members:<pre><code class=language-sql>SELECT
	*
FROM invoices
	INNER JOIN teams ON invoices.team_id = teams.id
WHERE lower(period) > date_trunc('year', current_date);
</code></pre><p>After adding the joins, run the query and inspect the data. By joining the data, the query is returning more columns. Start limiting the response to the columns to be used. Remove the <code>*</code> and go with column names:<pre><code class=language-sql>SELECT
	invoices.period,
	invoices.amount,
	teams.id,
	teams.name
FROM invoices
	INNER JOIN teams ON invoices.team_id = teams.id
WHERE lower(period) > date_trunc('year', current_date);
</code></pre><p>Once that works, add additional joins until it breaks. In this example, experiment by adding <code>team_members</code>:<pre><code class=language-sql>SELECT
	invoices.period,
	invoices.amount,
	teams.id,
	teams.name
FROM invoices
	INNER JOIN teams ON invoices.team_id = teams.id
	INNER JOIN team_members ON teams.id = team_members.team_id
WHERE lower(period) > date_trunc('year', current_date);
</code></pre><p>But that has duplicate rows -- previously, the query returned 602 rows and now it returns 3749 rows. Why? When joining teams and team_members, the one-to-many relationship adds one row for each additional team member. In this case, we step back to go forward: remove the latest join and encapsulate the working query.<p>Common issues during this phase include:<ul><li>typos in the join conditional -- when working with tables with similar names, it is easy to insert an incorrect join condition. For instance, the following query will execute and return completely the wrong data. Can you spot the error?</ul><pre><code class=language-sql>SELECT
	invoices.period, invoices.amount, teams.id, teams.name
FROM invoices
	INNER JOIN teams ON invoices.id = teams.id
WHERE lower(period) > date_trunc('year', current_date);
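-- (The error: the join condition uses invoices.id = teams.id, pairing
--  unrelated rows whose primary keys happen to match. As in the earlier
--  queries, it should be invoices.team_id = teams.id.)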
</code></pre><p>The other question is: what kind of join should I use? Quick refresher:<ul><li><code>INNER JOIN</code> is an exclusive join. Only rows with a matching row in the joined table are returned.<li><code>LEFT JOIN</code> is a non-exclusive join. All rows from the previously requested table are returned, with the joined table’s values where a match exists (and NULL otherwise).<li><code>FULL OUTER JOIN</code> returns all rows from both tables; where no match is found, the other table’s columns return NULL.</ul><h3 id=step-6-perform-summations><a href=#step-6-perform-summations>Step 6: Perform summations</a></h3><p>Let’s rewind back to what works, and package it into a <a href=https://www.crunchydata.com/developers/playground/ctes-and-window-functions>CTE</a> that we can use as a join. As you make changes, you'll make some wrong steps -- that is common. Know how to get back to a working query. Often that requires undoing changes to a working state.<p>Once I get to a working state, then I package the bit of data into a CTE (or common table expression):<pre><code class=language-sql>WITH team_yearly_spend AS (
	SELECT
		invoices.period AS invoice_period,
		invoices.amount AS invoice_amount,
		teams.id AS team_id,
		teams.name AS team_name
	FROM invoices
		INNER JOIN teams ON invoices.team_id = teams.id
	WHERE lower(period) > date_trunc('year', current_date)
)

SELECT * FROM team_yearly_spend;
</code></pre><p>Notice the use of <code>AS</code> to declare unique names for a column. When building complex queries, I favor verbosity to limit mistakes.<p>Let's add aggregations to the CTE:<pre><code class=language-sql>WITH team_yearly_spend AS (
	SELECT
		teams.id AS team_id,
		teams.name AS team_name,
		SUM(invoices.amount) AS team_yearly_spend
	FROM invoices
		INNER JOIN teams ON invoices.team_id = teams.id
	WHERE lower(period) > date_trunc('year', current_date)
	GROUP BY 1
)

SELECT
	*
FROM team_yearly_spend;
</code></pre><h3 id=step-7-lastly-augment-data-to-include-details><a href=#step-7-lastly-augment-data-to-include-details>Step 7: Lastly, augment data to include details</a></h3><p>To include team member emails as specified at the beginning, we will join the team members to the statement outside the CTE:<pre><code class=language-sql>WITH team_yearly_spend AS (
	SELECT
		teams.id AS team_id,
		teams.name,
		SUM(invoices.amount) AS spend
	FROM invoices
		INNER JOIN teams ON invoices.team_id = teams.id
	WHERE lower(period) > date_trunc('year', current_date)
	GROUP BY 1
)

SELECT
	team_yearly_spend.team_id,
	team_yearly_spend.spend,
	COUNT(DISTINCT accounts.id) AS accounts_count,
	JSON_AGG(accounts.email) AS account_emails
FROM team_yearly_spend
LEFT JOIN team_members ON team_yearly_spend.team_id = team_members.team_id
LEFT JOIN accounts ON team_members.account_id = accounts.id
GROUP BY 1, 2
;
</code></pre><h3 id=step-8-debugging><a href=#step-8-debugging>Step 8: Debugging</a></h3><p>To debug output errors, I find it easier to remove the calculations to get to the raw data. When using a query editing tool that allows running a visually selected query (like DBeaver), I comment out the aggregations and add a <code>*</code> to return more values:<pre><code class=language-sql>-- WITH team_yearly_spend AS (
	SELECT
		teams.id AS team_id,
		teams.name,
		*
--		SUM(invoices.amount) AS spend
	FROM invoices
		INNER JOIN teams ON invoices.team_id = teams.id
	WHERE lower(period) > date_trunc('year', current_date)
--	GROUP BY 1
--)
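-- Another quick sanity check: compare the base table's row count against the
-- joined result to spot join fan-out, e.g.
-- SELECT COUNT(*) FROM invoices WHERE lower(period) > date_trunc('year', current_date);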
</code></pre><p>With this response, look for:<ul><li>rows duplicated by joins,<li>rows that should be present, yet are missing due to a bad conditional,<li>rows that are included that should be filtered out with a conditional.</ul><p>Debugging SQL queries is a simple process, but it’s not an easy process. It requires a data audit, usually best compared against a known value.<h2 id=why-is-sql-complex><a href=#why-is-sql-complex>Why is SQL complex?</a></h2><p>The schema above is typical of an application data structure designed with OLTP in mind. The SQL that we have just written can use that schema to generate values for reports or for display to application users. That is the great thing about SQL -- no matter how the underlying structure is represented, we can get the data we want out of it.<p>SQL is powerful because it’s built using simple, standardized blocks of logic.<p>Writing SQL is a non-linear process. I've never seen someone start at the top of a long SQL query and type through to the end. It is a process that involves multiple levels of extraction, verification, and summation. ]]></content:encoded>
<category><![CDATA[ Analytics ]]></category>
<author><![CDATA[ Christopher.Winslett@crunchydata.com (Christopher Winslett) ]]></author>
<dc:creator><![CDATA[ Christopher Winslett ]]></dc:creator>
<guid isPermalink="false">f746bda5b29206e87ce36bc19248848e55b8677f228b18df6d77e09e0c116490</guid>
<pubDate>Fri, 08 Nov 2024 09:30:00 EST</pubDate>
<dc:date>2024-11-08T14:30:00.000Z</dc:date>
<atom:updated>2024-11-08T14:30:00.000Z</atom:updated></item>
<item><title><![CDATA[ 4 Ways to Create Date Bins in Postgres: interval, date_trunc, extract, and to_char ]]></title>
<link>https://www.crunchydata.com/blog/4-ways-to-create-date-bins-in-postgres-interval-date_trunc-extract-and-to_char</link>
<description><![CDATA[ Chris has lots of tips and sample code for getting date based report data from Postgres. He is rolling up days, weeks, months, and quarters and even has handy functions for labeling date results in your preferred format. ]]></description>
<content:encoded><![CDATA[ <p>You followed all the best practices, your sales dates are stored in perfect timestamp format… but now you need to get reports by day, week, quarter, and month. You need to bin, bucket, and roll up sales data into easy-to-view reports. Do you need a BI tool? Not yet, actually. Your Postgres database has hundreds of functions that let you query data analytics by date. By using some good old-fashioned SQL, you have powerful analysis and business intelligence with date details on any data set.<p>In this post, I’ll walk through some of the key functions for querying data by date.</p><!--more--><p>For a summary of the best ways to store date and time in Postgres, see <a href=https://www.crunchydata.com/blog/working-with-time-in-postgres>Working with Time in Postgres</a>. We also have an <a href=https://www.crunchydata.com/developers/playground/postgres-date-functions>interactive web-based tutorial</a> with lots of sample code for working with data by date, using a sample data set of ecommerce orders.<h2 id=interval---the-swiss-army-knife-of-date-manipulation><a href=#interval---the-swiss-army-knife-of-date-manipulation>Interval - the Swiss-army knife of date manipulation</a></h2><p>The <code>interval</code> is a data type used to modify other times. For instance, an interval can be added to or subtracted from a known time. Interval is super handy and the first place you can go to quickly summarize data by date. Like a Swiss-army knife, it’s not always the best tool for the job, but it can be used in a pinch. Let’s talk about where it excels.<p>How can we run a query that returns the total sum of orders for the last 90 days? Of course, interval can be used. Without interval, we often see people using a date variable passed from an external source that has generated a date. Using <code>now() - INTERVAL '90 days'</code>, you can use the same query no matter the date. 
The other secret sauce is the use of <code>now()</code>, which returns a timestamp for the current time on the server.<pre><code class=language-sql>SELECT
  SUM(total_amount)
FROM
  orders
WHERE
  order_date >= NOW() - INTERVAL '90 days';
</code></pre><pre><code class=language-sql>    sum
-----------
 259472.99
(1 row)
</code></pre><p>Instead of using <code>now()</code>, <code>current_date</code> can be used to return a date instead of a time.<pre><code class=language-sql>SELECT
  SUM(total_amount)
FROM
  orders
WHERE
  order_date >= current_date - INTERVAL '90 days';
</code></pre><p>These two queries are different — <code>current_date</code> starts at the beginning of the day, and <code>now()</code> will include a time throughout the day. When using <code>now()</code>, the results will match only those that occurred after the current time 90 days ago.<p>Commonly, people use a shorter form for intervals using a cast, but it’s the same query:<pre><code class=language-sql>SELECT
  SUM(total_amount)
FROM
  orders
WHERE
  order_date >= NOW() - '90 days'::interval;
</code></pre><p><strong>Using interval for binning</strong><p>To create interval ranges, we can combine the use of <code>CASE</code> with <code>interval</code>. SQL’s <code>CASE</code> performs conditional logic within queries. The format for <code>CASE</code> is <code>WHEN .. THEN</code> , below is a query that executes a sample case statement:<pre><code class=language-sql>SELECT
  CASE
    WHEN false THEN 'not this'
    WHEN true THEN 'this will show'
    ELSE 'never makes it here'
  END;
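-- the query above returns 'this will show': CASE stops at the first true WHEN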
</code></pre><p>Now, let’s categorize orders into the time ranges: "30-60 days ago", "60-90 days ago"<pre><code class=language-sql>SELECT
    CASE
        WHEN order_date BETWEEN (NOW() - INTERVAL '60 days') AND (NOW() - INTERVAL '30 days')
            THEN '30-60 days ago'
        WHEN order_date BETWEEN (NOW() - INTERVAL '90 days') AND (NOW() - INTERVAL '60 days')
            THEN '60-90 days ago'
    END AS date_range,
    COUNT(*) AS total_orders,
    SUM(total_amount) AS total_sales
FROM
  orders
WHERE
  order_date BETWEEN (NOW() - INTERVAL '90 days') AND (NOW() - INTERVAL '30 days')
GROUP BY
  date_range
ORDER BY
  date_range;
</code></pre><pre><code class=language-sql>   date_range   | total_orders | total_sales
----------------+--------------+-------------
 30-60 days ago |          160 |   101754.20
 60-90 days ago |          128 |    88086.24
</code></pre><p>This may look a bit complicated, but the conditional for the statement is <code>order_date BETWEEN beginning_date_value AND ending_date_value</code>. Since <code>CASE</code> statements end after the first truthy conditional, we can simplify this a bit more:<pre><code class=language-sql>SELECT
    CASE
	    WHEN order_date >= NOW() - '30 days'::interval THEN '00-30 days ago'
	    WHEN order_date >= NOW() - '60 days'::interval THEN '30-60 days ago'
	    ELSE
		    '60-90 days ago'
	  END AS date_range,
    COUNT(*) AS total_orders,
    SUM(total_amount) AS total_sales
FROM
  orders
WHERE
  order_date >= NOW() - '90 days'::interval
GROUP BY
  date_range
ORDER BY
  date_range;
</code></pre><p>It’s best to choose a pattern depending on how explicit you want to be with your SQL queries. Using <code>BETWEEN</code> is more explicit, and may be best for teams choosing more explicit queries. The hard part about using <code>INTERVAL</code> is that recent time is greater than older time — so the <code>>=</code> may break the brains of those who haven’t used a lot of time manipulation.<p>In summary: use <code>interval</code> for binning continuous time.<h2 id=date_trunc---the-easiest-function-for-date-binning><a href=#date_trunc---the-easiest-function-for-date-binning>date_trunc - the easiest function for date binning</a></h2><p>Use <code>date_trunc</code> for binning of pre-defined time: like day, week, month, quarter, and year. Where interval logic can be complicated, <em>date_trunc</em> is dead simple.<p>At a glance, <em>date_trunc’s</em> name might indicate that it’s about formatting, but it is more powerful when combined with <code>GROUP BY</code>. <em>date_trunc</em> is an essential part of the query toolkit when working with analytics. Simple uses of date_trunc look like the following:<pre><code class=language-sql>/* show the beginning of the first day of the month */
SELECT date_trunc('month', current_date);

/* show the beginning of the first day of the week */
SELECT date_trunc('week', current_date);

/* show the beginning of the first day of the year */
SELECT date_trunc('year', current_date);

/* show the beginning of the first day of the current quarter */
SELECT date_trunc('quarter', current_date);
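
/* these compose with interval arithmetic, e.g. the first day of the previous month */
SELECT date_trunc('month', current_date) - INTERVAL '1 month';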
</code></pre><p>To generate a date bin, extract the period of time from the record’s date. For instance, let’s write a query to show the monthly number of orders and total order sales:<pre><code class=language-sql>SELECT
  date_trunc('month', order_date) AS month,
  COUNT(*) AS total_orders,
  SUM(total_amount) AS monthly_total
FROM
  orders
GROUP BY 1
ORDER BY
  month;
</code></pre><p>Results would look like:<pre><code class=language-sql>        month        | total_orders | monthly_total
---------------------+--------------+---------------
 2024-08-01 00:00:00 |           11 |       2699.82
 2024-09-01 00:00:00 |           39 |       8439.41
(2 rows)
</code></pre><p>Using <code>GROUP BY</code>, Postgres counts and sums based on the unique values returned by the <code>date_trunc</code> function. The available bins for <code>date_trunc</code> are: millennium, century, decade, year, quarter, month, week, day, hour, minute, second, millisecond.<h2 id=extract---sometimes-you-have-to-do-something-funky><a href=#extract---sometimes-you-have-to-do-something-funky>Extract - sometimes you have to do something funky</a></h2><p>Not all dates are nicely broken into days, months, years, etc. The <code>extract</code> function extracts a specific value for a date / time type. For instance, I commonly use <em>extract</em> for the following:<pre><code class=language-sql>/* returns the epoch value for a date / time    */
/* I use this to send date values to JavaScript */
SELECT extract('epoch' from current_date);

/* returns the hour from a time type */
SELECT extract('hour' from now());
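
/* extract('epoch' ...) also works on intervals, e.g. seconds since midnight */
SELECT extract('epoch' from now() - date_trunc('day', now()));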
</code></pre><p>How can this be used to bin values? For example, if you wanted to find which hours of which day of the week have the highest number and sales value of orders:<pre><code class=language-sql>SELECT
    extract('dow' from order_date) AS day_of_week,
    extract('hour' from order_date) AS hour,
    COUNT(*) AS total_orders,
    SUM(total_amount) AS monthly_total
FROM
    orders
GROUP BY 1, 2
ORDER BY 1, 2;
</code></pre><pre><code class=language-sql> day_of_week | hour | total_orders | monthly_total
-------------+------+--------------+---------------
           0 |   23 |           35 |      23631.56
           1 |    0 |           31 |      19299.88
</code></pre><p>You'll see here Sunday is <code>0</code> and Saturday is <code>6</code>.<p>Where <code>date_trunc</code> keeps the higher context, <code>extract</code> removes all context except that which is requested.<h2 id=to_char---extreme-makeover-date-edition><a href=#to_char---extreme-makeover-date-edition>to_char - extreme makeover date edition</a></h2><p>It’s awkward because <code>to_char</code> is both the most versatile and most hated function for date binning. The function will accept time / date, text, or numbers for additional formatting, so it’s not explicitly for date functions. Without fail, when I’ve used to_char, someone has told me that I could have used a better function. It can produce human-readable values quickly, but it’s unsuited for data sent for additional machine processing.<p>Here are a few examples of <code>to_char</code>:<pre><code class=language-sql>/* extract current day of week and current hour of day based on UTC */
SELECT to_char(now(), 'DayHH24');

/* extract current day of week and current hour of day based on NYC time zone */
SELECT to_char(now() AT TIME ZONE 'America/New_York' , 'DayHH24');
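
/* the FM ("fill mode") prefix trims the blank padding that Day adds */
SELECT to_char(now(), 'FMDay HH24');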
</code></pre><p>This outputs the current day of the week and the current hour based on UTC time. This breaks your brain, right? What does the “DayHH24” portion mean? Postgres documentation has a long list of <a href=https://www.postgresql.org/docs/17/functions-formatting.html#FUNCTIONS-FORMATTING-DATETIME-TABLE>reserved strings used by to_char</a>:<p>To change the presentation of a month, use to_char to extract and format the name and year:<pre><code class=language-sql>SELECT to_char(order_date, 'FMMonth YYYY') AS formatted_month,
    COUNT(*) AS total_orders,
    SUM(total_amount) AS monthly_total
FROM
    orders
GROUP BY 1
ORDER BY 1;
</code></pre><pre><code class=language-sql> formatted_month | total_orders | monthly_total
-----------------+--------------+---------------
 August 2024     |           11 |       2699.82
 September 2024  |           39 |       8439.41
</code></pre><p><strong>Escaping reserved strings in <code>to_char</code>:</strong><p>The common format for quarters in finance is “Q1” / “Q2” / “Q3” and “Q4”. Using <code>to_char</code>, we can extract the quarter for a time in that format. But, the “Q” is a reserved keyword for quarter. To print a “Q” without evaluating it, wrap it in double quotes:<pre><code class=language-sql>SELECT
    to_char(order_date, '"Q"Q-YYYY') AS formatted_quarter,
    SUM(total_amount) AS total_amount
FROM
    orders
GROUP BY 1
ORDER BY 1;
</code></pre><pre><code class=language-sql> formatted_quarter | total_amount
-------------------+--------------
 Q1-2022           |    313872.84
 Q1-2023           |    282774.15
 Q1-2024           |    287379.33
</code></pre><h2 id=summary><a href=#summary>Summary</a></h2><p>Binning is an essential tool for faceting the data for financial reports and data analysis. Dates and times are a more complex piece of information than they first appear — hours, days, months, quarters, years. So, a single date can be faceted many ways.<p>Luckily, Postgres has the functions you need to work with dates. For a quick summary:<p><strong>interval</strong> - modifies date / times by adding / subtracting<p><strong>date_trunc -</strong> truncates a date / time — essentially rounding-down to the closest value<p><strong>extract</strong> - extracts a single piece of information from a date / time (day, week, month, quarter, year)<p><strong>to_char</strong> - formats output into a specific style of date format or text string. ]]></content:encoded>
<category><![CDATA[ Analytics ]]></category>
<author><![CDATA[ Christopher.Winslett@crunchydata.com (Christopher Winslett) ]]></author>
<dc:creator><![CDATA[ Christopher Winslett ]]></dc:creator>
<guid isPermalink="false">132ba3fcf14753154c6b4688d112659ac3e430a333412042bef4f0d67927f6c2</guid>
<pubDate>Tue, 29 Oct 2024 10:30:00 EDT</pubDate>
<dc:date>2024-10-29T14:30:00.000Z</dc:date>
<atom:updated>2024-10-29T14:30:00.000Z</atom:updated></item>
<item><title><![CDATA[ Using acts_as_tenant for Multi-tenant Postgres with Rails ]]></title>
<link>https://www.crunchydata.com/blog/using-acts_as_tenant-for-multi-tenant-postgres-with-rails</link>
<description><![CDATA[ Chris walks through using the acts_as_tenant gem. He shows some example code to get started with this gem, how to migrate, and other tips for working with B2B or multi-tenant applications. ]]></description>
<content:encoded><![CDATA[ <p>Since its launch, Ruby on Rails has been a preferred open source framework for small-team B2B SaaS companies. Ruby on Rails uses a conventions-over-configuration mantra. This approach reduces common technical choices, thus elevating decisions. With this approach, the developers get an ORM (ActiveRecord), templating engine (ERB), helper methods (like <code>number_to_currency</code>), controller (ActionController), directory setup defaults (<code>app/{models,controllers,views}</code>), authentication methods (<code>has_secure_password</code>), and more.<p><a href=https://www.crunchydata.com/blog/designing-your-postgres-database-for-multi-tenancy>Multi-tenant</a> architecture is the backbone of B2B SaaS products, yet core Rails remains un-opinionated on multi-tenant implementations. Through the years, there have been many different Ruby gem implementations of multi-tenancy. Many of these gems were built for complicated situations — either adapting to scaling needs or regulated industries that require physical separation of data. Many of these gems required deep integration with your Rails application code.<h2 id=enter-acts_as_tenant><a href=#enter-acts_as_tenant>Enter acts_as_tenant</a></h2><p>With all that as context, the <code>acts_as_tenant</code> gem is super simple. <a href=https://github.com/ErwinM/acts_as_tenant><code>acts_as_tenant</code></a> has recently released version 1.0 after 12 years of development — so it’s not new. The gem implements multi-tenant best practices by augmenting Rails’ ActiveRecord ORM:<ul><li>protects developers from building queries that return other tenants’ records<li>requires a <code>tenant_id</code> on the tables for models specific to a tenant<li>adds the <code>tenant_id</code> scope to the query<li>includes ActionController, ActiveRecord, ActiveJob helpers to insert new records with the scoped tenant</ul><p>Acts_as_tenant is built for row-level multi-tenancy, and that is it. 
So, no need to manage multiple databases or schemas for data structures — it keeps it simple. One of the best things I can say about <code>acts_as_tenant</code> is that it can be implemented in an existing application code-base. Too many times, with the older multi-tenant gems, the implementation was invasive, and thus required complex refactoring.<p><strong>What it’s not:</strong> acts_as_tenant is not for account-based sharding — either schema-based or multi-cluster based sharding. It’s purely for multi-tenant safety.<h2 id=for-the-paranoid><a href=#for-the-paranoid>For the paranoid</a></h2><p>I have built a few multi-tenant apps in industries with data regulation (think finance and education). I am overly cautious when building multi-tenant apps — so this guardrail is my favorite.<p>To enforce the <code>tenant_id</code> on every ActiveRecord query within an application, add the following to an initializer file in <code>config/initializers/acts_as_tenant.rb</code>:<pre><code class=language-ruby>ActsAsTenant.configure do |config|
	config.require_tenant = true
end
</code></pre><p>Having worked in a few multi-tenant apps where showing data to another customer is consequential, I wish <code>acts_as_tenant</code> could also enforce the requirement of a <code>tenant_id</code> on raw queries. One of the apps I wrote required high-performance, large-scale data loads. We had an intermittent bug where people would be assigned to the incorrect tenant. After tracking down the bug, we found the culprit in the implementation of multiple <code>external_ids</code>:<pre><code class=language-sql>-- bug code
SELECT
  *
FROM people
WHERE tenant_id = $1 AND external_id = $2 OR other_external_id = $2;

-- correct code
SELECT
  *
FROM people
WHERE tenant_id = $1 AND (external_id = $2 OR other_external_id = $2);
</code></pre><p>The lesson: wrap your <code>OR</code> statements in parentheses. The bug code was interpreted as:<pre><code class=language-sql>(tenant_id = $1 AND external_id = $2) OR other_external_id = $2;
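
-- With ActiveRecord, the grouping is explicit (a sketch; the Person model is
-- assumed): Person.where(external_id: x).or(Person.where(other_external_id: x))
-- acts_as_tenant then ANDs its tenant_id scope around the whole relation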
</code></pre><p>When using acts_as_tenant with ActiveRecord models, you can avoid this bug. Below, you’ll see that ActiveRecord encapsulates the following:<p><img alt="active record output" loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/7e4070b3-c6bc-48e2-09fb-bd4b55f1ff00/public><p>Remember, if you choose to use raw SQL, you’ll need to keep your guard up.<h2 id=testing-from-rails-new-app><a href=#testing-from-rails-new-app>Testing from <code>rails new app</code></a></h2><p>To install from a new rails application, do the following:<ol><li>Run <code>rails new multi-tenant-app</code><li>Decide on your application’s tenant model: typically <code>Organization</code> or <code>Account</code> or <code>Team</code> or <code>School</code>. Use the underscore version of the name with <code>_id</code> appended as your tenant id for all columns, such as <code>organization_id</code> or <code>account_id</code> or <code>team_id</code> or <code>school_id</code>. Below, we will use the tenant name <code>Account</code>.<li>Add <code>gem "acts_as_tenant"</code> to <code>Gemfile</code>, and run <code>bundle install</code>.<li>Create some models:</ol><pre><code class=language-bash>rails g model Account name:string
rails g model User email:string account_id:integer
rails g model Post content:string user_id:integer account_id:integer

rails db:create &#38&#38 rails db:migrate
</code></pre><ol start=5><li>Add the following to <code>app/models/account.rb</code></ol><pre><code class=language-ruby>class Account &#60 ApplicationRecord

  has_many :users
  has_many :posts

end
</code></pre><ol start=6><li>Add the following to <code>app/models/post.rb</code>:</ol><pre><code class=language-ruby>class Post &#60 ApplicationRecord

  belongs_to :user
  acts_as_tenant :account

end
</code></pre><ol start=7><li>Add the following to <code>app/models/user.rb</code>:</ol><pre><code class=language-ruby>class User &#60 ApplicationRecord
  acts_as_tenant :account
  validates_uniqueness_to_tenant :email
end
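
# Note: the following is not code from the gem -- just a simplified sketch of
# what declaring acts_as_tenant does. With a tenant set, queries on these
# models are scoped roughly as if you had declared:
#
#   default_scope { where(account_id: ActsAsTenant.current_tenant.id) }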
</code></pre><ol start=8><li>Now, let’s experiment with the Rails REPL:</ol><pre><code class=language-bash>rails console
</code></pre><p>Then, you can run the following commands:<pre><code class=language-ruby>first_account = Account.create!(name: "First Account")
last_account = Account.create!(name: "Last Account")

ActsAsTenant.with_tenant(first_account) do
  user = User.create!(email: "test@example.com")
  post = Post.create!(user: user, content: "Lorem Ipsum")
end

ActsAsTenant.with_tenant(first_account) do
  Post.first.content # -> "Lorem Ipsum"
end

ActsAsTenant.with_tenant(last_account) do
  Post.first.nil? # -> true because we did not create a post for this tenant
end

Post.first.content # -> "Lorem Ipsum"

ActsAsTenant.configure do |config|
  config.require_tenant = true
end

Post.first.content # raises ActsAsTenant::Errors::NoTenantSet
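
# With require_tenant enabled, intentionally cross-tenant work (reports,
# admin tooling) can opt out explicitly. A sketch using the gem's
# without_tenant helper:
ActsAsTenant.without_tenant do
  Post.first.content # -> "Lorem Ipsum", scoping is disabled inside the block
end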
</code></pre><p>If you look at the queries ActiveRecord runs, you’ll see it automatically appends the <code>account_id</code> to the User and Post as they are created. And once <code>require_tenant</code> is set, the final command fails with an error.<ol start=9><li>In the console, we explicitly used <code>with_tenant</code>. acts_as_tenant has helpers for controllers as well. Depending on how your authentication and tenancy work, you can use domains, subdomains, or implicit tenancy based on the authenticated user. You’ll need to implement something like:</ol><pre><code class=language-ruby>class ApplicationController &#60 ActionController::Base
  set_current_tenant_through_filter
  before_action :require_authentication
  before_action :set_tenant

  def require_authentication
    current_user || redirect_to(new_session_path)
  end

  def current_user
    @current_user ||= if session[:user_id].present?
      User.find(session[:user_id])
    end
  end

  def current_account
    @current_account ||= current_user.try(:account)
  end

  def set_tenant
    set_current_tenant(current_account)
  end
end
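
# Alternatively, if tenants map to subdomains, the gem's subdomain helper can
# replace the filter approach above (this sketch assumes Account has a
# subdomain column):
#
#   class ApplicationController &#60 ActionController::Base
#     set_current_tenant_by_subdomain(:account, :subdomain)
#   end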
</code></pre><p>Implementing proper authentication is complex, so this is only an example. The code specific to acts_as_tenant is <code>set_current_tenant_through_filter</code>, <code>before_action :set_tenant</code>, and <code>def set_tenant</code>.<h2 id=migrating-to-acts_as_tenant><a href=#migrating-to-acts_as_tenant>Migrating to acts_as_tenant</a></h2><p>If you have an existing codebase that would benefit from acts_as_tenant, the migration is a process that can be broken into multiple steps:<ol><li><strong>Add a tenant_id column to each affected model</strong> - this step can be quite complicated. It requires schema migrations and data backfills, and the method of updating columns will depend on the size of your database.<li><strong>Add the acts_as_tenant gem, but do not set require_tenant yet</strong><li><strong>Define the tenancy for your ApplicationController using domains, subdomains, or a filter</strong><li><strong>Define the tenancy for your Active Job workers</strong><li><strong>Define tenancy for your models</strong></ol><p>With this measured approach, you can deploy each of the steps above independently, and each model change independently of the rest.<h2 id=removing-acts_as_tenant><a href=#removing-acts_as_tenant>Removing acts_as_tenant</a></h2><p>The best thing I can say about a library is: you can migrate away from it if it does not work for you. Because acts_as_tenant is not as deep an integration as past multi-tenant libraries were, it is possible to move away from it.<h2 id=summary><a href=#summary>Summary</a></h2><p>Back in the 2009-ish era, Ruby on Rails and “The Cloud” grew up together as cloud SaaS and social networks took off. Back then, network-attached storage topped out around 100 IOPS and 1TB per volume. The limited IOPS strangled database performance, and 1TB was a hard ceiling (if you did not set up RAID early). I started my career in that era. 
Due to those infrastructure limitations, multi-tenant databases would start to see issues when an application hit as little as 50 requests per second. In that era, RAM was expensive and fast disks were not available, so “sharding” was talked about at every conference.<p><em>Side note: data was also suddenly available everywhere, and many companies stored massive amounts of it hoping to figure out a business model later.</em><p>Now, in 2023, RAM is plentiful and IOPS are cheap. Scaling the database can be punted until tens of thousands of requests per second.<p>Why do I say all this? Because now we can approach multi-tenant apps and scaling more practically. A multi-tenant design can focus on data security and practical coding instead of scaling. You may never reach the point of needing distributed data stores, but a solid multi-tenant implementation lays a foundation for your application’s success.<p>The old multi-tenant Ruby gems were built for scalability. acts_as_tenant is built for practicality. ]]></content:encoded>
<category><![CDATA[ Ruby on Rails ]]></category>
<author><![CDATA[ Christopher.Winslett@crunchydata.com (Christopher Winslett) ]]></author>
<dc:creator><![CDATA[ Christopher Winslett ]]></dc:creator>
<guid isPermalink="false">154b989b0bceca71d07af0a50e0c670a7cca03cc23a3023377d73f25e9904bee</guid>
<pubDate>Wed, 20 Dec 2023 08:00:00 EST</pubDate>
<dc:date>2023-12-20T13:00:00.000Z</dc:date>
<atom:updated>2023-12-20T13:00:00.000Z</atom:updated></item></channel></rss>