<?xml version="1.0" encoding="UTF-8" ?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" version="2.0"><channel><title>Christopher Winslett | CrunchyData Blog</title>
<atom:link href="https://www.crunchydata.com/blog/author/christopher-winslett/rss.xml" rel="self" type="application/rss+xml" />
<link>https://www.crunchydata.com/blog/author/christopher-winslett</link>
<image><url>https://www.crunchydata.com/build/_assets/christopher-winslett.png-53E7CT5Z.webp</url>
<title>Christopher Winslett | CrunchyData Blog</title>
<link>https://www.crunchydata.com/blog/author/christopher-winslett</link>
<width>335</width>
<height>322</height></image>
<description>PostgreSQL experts from Crunchy Data share advice, performance tips, and guides on successfully running PostgreSQL and Kubernetes solutions</description>
<language>en-us</language>
<pubDate>Fri, 24 Oct 2025 09:00:00 EDT</pubDate>
<dc:date>2025-10-24T13:00:00.000Z</dc:date>
<dc:language>en-us</dc:language>
<sy:updatePeriod>hourly</sy:updatePeriod>
<sy:updateFrequency>1</sy:updateFrequency>
<item><title><![CDATA[ Temporal Joins ]]></title>
<link>https://www.crunchydata.com/blog/temporal-joins</link>
<description><![CDATA[ How do you return the Nth related record from joined tables in PostgreSQL? Did you know that Postgres gives you a few more options than other databases? Use DISTINCT ON and window functions to solve common temporal join challenges and avoid N+1 query problems. ]]></description>
<content:encoded><![CDATA[ <p>My first thought on seeing a temporal join in 2008 was, “Why is this query so complex?” The company I was at relied heavily on database queries, as it was a CRM and student success tracking system for colleges and universities. The query returned a filtered list of users and their last associated record from a second table. The hard part wasn’t returning the last timestamp or even performing joins; it was returning <em>only their last associated record</em> from a second table.<p>Back in 2008, we didn’t have window functions or CTEs, so the query was a series of nested subqueries that looked like this:<pre><code class=language-sql>SELECT
    *
FROM users, ( -- find the record for the last second_table by created_at and user_id
                SELECT
                    second_table.*
                FROM second_table, ( -- find the last second_table created_at per user_id
                                        SELECT
                                            user_id,
                                            max(created_at) AS created_at
                                        FROM second_table
                                        GROUP BY 1
                                    ) AS last_second_table_at
                WHERE
                    last_second_table_at.user_id = second_table.user_id
                    AND second_table.created_at = last_second_table_at.created_at
            ) AS last_second_table
WHERE users.id = last_second_table.user_id;
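
-- With the sample data below, this query returns 4 rows instead of 3:
-- Alice (user_id 1) appears twice because two of her rows share the same
-- max(created_at). You can surface the offending tie directly:
SELECT user_id, created_at, count(*)
FROM second_table
GROUP BY 1, 2
HAVING count(*) > 1;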
</code></pre><p><strong>See the Sample Code section below for the schema and data to run these queries.</strong><p>But, even that query was wrong because the second table may have records with <strong>duplicate</strong> <code>created_at</code> values. That was the source of a bug back in 2008 that resulted in duplicate rows being listed.<p>Obviously, we weren't using Postgres at the time because there has always been a simpler way to do this in Postgres using <code>DISTINCT ON</code>:<pre><code class=language-sql>SELECT DISTINCT ON (u.id)
    u.id,
    u.name,
    s.created_at AS last_action_time,
    s.action_type
FROM users u
JOIN second_table s ON u.id = s.user_id
ORDER BY u.id, s.created_at DESC, s.id DESC;
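
-- DISTINCT ON keeps only the first row per u.id under the ORDER BY, so the
-- s.id DESC tie-breaker guarantees one row per user even when created_at
-- values collide. A quick sanity check against the sample data:
SELECT DISTINCT ON (user_id) user_id, id, created_at
FROM second_table
ORDER BY user_id, created_at DESC, id DESC;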
</code></pre><p>Temporal joins require attention to detail.<h2 id=robust-solution-ctes--window-functions><a href=#robust-solution-ctes--window-functions>Robust Solution: CTEs &#38; Window Functions</a></h2><p>Before we go too far into the topic, for those looking for a solution to their current problem, below is how I would write that query today when you need something other than the <strong>first or last</strong> record in a series. For these situations, we use <a href=https://www.crunchydata.com/blog/postgres-subquery-powertools-subqueries-ctes-materialized-views-window-functions-and-lateral>CTEs and window functions</a>; there's no need to nest queries when we can abstract them for a cleaner purpose. Here is a template for the temporal joins that <code>DISTINCT ON</code> cannot handle:<pre><code class=language-sql>WITH max_second_table AS (
    SELECT
        *
    FROM (
            SELECT
                *,
                -- Use the ROW_NUMBER() window function to rank each user's
                -- records from newest to oldest. The ORDER BY clause is critical:
                -- 1. created_at DESC puts the latest record at row_order = 1.
                -- 2. id DESC serves as a reliable tie-breaker for duplicate timestamps.
                ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC, id DESC) AS row_order
            FROM second_table
        ) AS ordered_second_table
    WHERE row_order = 2
)

SELECT
    *
FROM users
LEFT JOIN max_second_table ON users.id = max_second_table.user_id;
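
-- Note: the filter in the CTE controls which occurrence you join.
-- row_order = 1 returns the latest record (equivalent to the DISTINCT ON
-- query above), row_order = 3 the third-latest, and so on.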
</code></pre><p>In this example, we are joining the <strong>second occurrence</strong> (<code>WHERE row_order = 2</code>) in the <code>second_table</code> for a user. For the university example, we used these types of queries to report on progress over time by showing the 1st, 2nd, 3rd, and Nth occurrences of events.<p>Is this actually less code than the first example? No, but it is compartmentalized with a cleaner purpose.<p>Also, introducing the <strong>primary key (<code>id</code>)</strong> in the <code>ORDER BY</code> clause provides the necessary <strong>tie-breaker logic</strong> for the sorting -- that is how we fixed the SQL issue in the opening text.<h2 id=problem-with-orms><a href=#problem-with-orms>Problem with ORMs</a></h2><p>Because of the query complexity involved, ORMs are generally not capable of handling temporal joins without complex manipulation. The ORM I'm most familiar with is <strong>ActiveRecord</strong>, part of the Ruby on Rails suite. When Rails developers encounter temporal joins, they typically resort to the <strong><a href=https://www.crunchydata.com/blog/postgresql-for-solving-n+1-queries-in-ruby-on-rails>N+1 query pattern</a></strong> from their application code like this:<pre><code class=language-ruby>users = User.all

users.each do |user|
  last_action = user.second_table.last
end
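
# A hedged sketch of doing this in one round-trip instead (assumes a
# SecondTable model backing second_table; names are illustrative, not
# from the post):
#
# last_actions = SecondTable
#   .select("DISTINCT ON (user_id) *")
#   .order("user_id, created_at DESC, id DESC")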
</code></pre><p>If you aren't running this query too frequently or over too many user records, this is generally <em>performant enough</em>. However, this approach becomes <strong>suboptimal</strong> for application performance as the user list grows because each iteration of the loop requires a network hop back and forth with the database and an object initialization in the application. While you can make ActiveRecord do this natively, the resulting code is often harder to read and maintain for the typical use case—a pattern you see in other ORMs as well.<h2 id=sample-code><a href=#sample-code>Sample Code</a></h2><p>Below is the sample SQL you can use to load data into your database to test a few of these queries. Note that <strong>Alice has two actions at the exact same timestamp</strong> to replicate the original bug scenario.<pre><code class=language-sql>CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    name TEXT
);

CREATE TABLE second_table (
    id SERIAL PRIMARY KEY,
    user_id INTEGER REFERENCES users(id),
    action_type TEXT,
    created_at TIMESTAMP WITHOUT TIME ZONE
);

INSERT INTO users (name) VALUES ('Alice'), ('Bob'), ('Charlie');

-- Alice has two actions at the exact same timestamp (The 2008 bug scenario)
INSERT INTO second_table (user_id, action_type, created_at) VALUES
(1, 'login', '2023-10-01 10:00:00'),
(1, 'page_view', '2023-10-01 10:00:00'),
(2, 'purchase', '2023-10-02 11:00:00'),
(3, 'registration', '2023-10-03 12:00:00'),
(3, 'profile_update', '2023-10-04 13:00:00');
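
-- After loading, a quick sanity check of the fixture sizes:
SELECT (SELECT count(*) FROM users)        AS user_count,   -- 3
       (SELECT count(*) FROM second_table) AS action_count; -- 5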
</code></pre><h2 id=conclusion><a href=#conclusion>Conclusion</a></h2><p>The term "temporal join" isn't a common piece of developer jargon, but the underlying pattern, retrieving the Nth related record, is critical in <a href=https://www.crunchydata.com/blog/window-functions-for-data-analysis-with-postgres>reporting and analytics</a>. It's a known pattern among people who have worked on a code base that relies heavily on SQL capabilities, typically for reporting.<p>Using the PostgreSQL feature <strong><code>DISTINCT ON</code></strong> for the simplest case, or <strong><a href=https://www.crunchydata.com/developers/playground/ctes-and-window-functions>CTEs with Window Functions</a></strong> for complex retrieval, we avoid the bugs of older SQL patterns and eliminate the performance penalty of the N+1 problem.<p>If you would like to learn more about advanced SQL patterns, check out our <a href=https://www.crunchydata.com/developers/playground>Postgres Playground</a>.</p> ]]></content:encoded>
<category><![CDATA[ Fun with SQL ]]></category>
<author><![CDATA[ Christopher.Winslett@crunchydata.com (Christopher Winslett) ]]></author>
<dc:creator><![CDATA[ Christopher Winslett ]]></dc:creator>
<guid isPermalink="false">da09e0153c614a35f860efd24fb3f98f0bff8151300e44ffb6009d17dce4c1b8</guid>
<pubDate>Fri, 24 Oct 2025 09:00:00 EDT</pubDate>
<dc:date>2025-10-24T13:00:00.000Z</dc:date>
<atom:updated>2025-10-24T13:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ Creating Histograms with Postgres ]]></title>
<link>https://www.crunchydata.com/blog/histograms-with-postgres</link>
<description><![CDATA[ Histograms are elegant tools for visualizing the distribution of values. We walk through building a reusable query for your histogram needs. ]]></description>
<content:encoded><![CDATA[ <p>Histograms were first used in a lecture in 1892 by Karl Pearson — the godfather of mathematical statistics. With the number of data presentation tools we have today, it’s hard to imagine that representing data as a graphic was once classified as “innovation”, but it was. A histogram is a graphic presentation of the distribution and frequency of data. If you haven’t seen one recently, or don’t know the word histogram off the top of your head - it is a bar chart in which each bar represents the count of data within a defined range of values. When Pearson built the first histogram, he calculated it by hand. Today we can use SQL (or even Excel) to extract this data continuously across large data sets.<p>While true statistical histograms have a bit more complexity for choosing bin ranges, for many business intelligence purposes, Postgres <code>width_bucket</code> is good enough for counting data inside bins with minimal effort.<h2 id=postgres-width_bucket-for-histograms><a href=#postgres-width_bucket-for-histograms>Postgres width_bucket for histograms</a></h2><p>Given the number of buckets and a max/min value, <code>width_bucket</code> returns the index of the bucket that a value falls into. For instance, given a minimum value of 0, a maximum value of 100, and 10 buckets, a value of 43 would fall in bucket #5: <code>select width_bucket(43, 0, 100, 10) AS bucket;</code> But is 5 really the right bucket for 43? Let’s see.<p>You can see how the values would fall using <code>generate_series</code> (shown below using <a href=https://metabase.com>Metabase</a>):<pre><code class=language-sql>SELECT value, width_bucket(value, 0, 100, 10) AS bucket FROM generate_series(0, 100) AS value;
</code></pre><p><img alt="postgres histogram 1-100" loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/3aec99f0-696b-4d9b-0c7b-8ce9fc415200/public><p>When running the query, the values 0 through 9 go into bucket 1. As you can see in the image above, <code>width_bucket</code> behaves as a step function that starts indexing at 1. In this scenario, when passed a value of 100, <code>width_bucket</code> returns 11, because the maximum is treated as an exclusive bound (i.e. the logic is minimum &#60;= value &#60; maximum, and anything at or above the maximum lands in an overflow bucket).<p>We can use the bucket value to generate more readable labels.<h2 id=auto-formatting-histogram-with-sql><a href=#auto-formatting-histogram-with-sql>Auto-formatting histogram with SQL</a></h2><p>Let’s build out a larger query that creates ranges, range labels, and formats the histogram. We will start by using a synthetic table within a CTE called <code>formatted_data</code>. We are doing it this way so that we can replace that query with new data in the future.<p>Here’s the beginning of the query (this is copy-pastable into Postgres):<pre><code class=language-sql>WITH formatted_data AS (
  SELECT * FROM (VALUES (13), (42), (18), (62), (93), (47), (51), (41), (1)) AS t (value)
)
SELECT
  WIDTH_BUCKET(value, 0, 100, 10) AS bucket,
  COUNT(value)
FROM formatted_data
  GROUP BY 1
  ORDER BY 1;
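
-- With the nine sample values this groups as:
--  bucket | count
--       1 |     1      (value 1)
--       2 |     2      (13, 18)
--       5 |     3      (41, 42, 47)
--       6 |     1      (51)
--       7 |     1      (62)
--      10 |     1      (93)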
</code></pre><p>Let’s use another CTE to define some settings for our <code>width_bucket</code>:<pre><code class=language-sql>WITH formatted_data AS (
  SELECT * FROM (VALUES (13), (42), (18), (62), (93), (47), (51), (41), (1)) AS t (value)
), bucket_settings AS (
	SELECT
		10 as bucket_count,
		0::integer AS min_value, -- can be null::integer or an integer
		100::integer AS max_value -- can be null::integer or an integer
)

SELECT
  WIDTH_BUCKET(value,
	  (SELECT min_value FROM bucket_settings),
		(SELECT max_value FROM bucket_settings),
		(SELECT bucket_count FROM bucket_settings)
	) AS bucket,
  COUNT(value)
FROM formatted_data
  GROUP BY 1
  ORDER BY 1;
</code></pre><p>In the <code>bucket_settings</code> CTE, we use <code>::integer</code> to cast any value there as an integer. We do this since we will want to compare NULL against other integers later. If we don’t cast NULLs then the SQL will fail.<p>Now, we will use a CTE called <code>calculated_bucket_settings</code> to set a dynamic range if the static range is not defined. This will let the data specify the values if they are not defined by the <code>bucket_settings</code>:<pre><code class=language-sql>WITH formatted_data AS (
  SELECT * FROM (VALUES (13), (42), (18), (62), (93), (47), (51), (41), (1)) AS t (value)
), bucket_settings AS (
	SELECT
		5 AS bucket_count,
		null::integer AS min_value, -- can be null or an integer
		null::integer AS max_value -- can be null or an integer
), calculated_bucket_settings AS (
	SELECT
		(SELECT bucket_count FROM bucket_settings) AS bucket_count,
		COALESCE(
			(SELECT min_value FROM bucket_settings),
			(SELECT min(value) FROM formatted_data)
		) AS min_value,
		COALESCE(
			(SELECT max_value FROM bucket_settings),
			(SELECT max(value) + 1 FROM formatted_data)
		) AS max_value
), histogram AS (
  SELECT
     WIDTH_BUCKET(value, min_value, max_value, (SELECT bucket_count FROM bucket_settings)) AS bucket,
     COUNT(value) AS frequency
   FROM formatted_data, calculated_bucket_settings
   GROUP BY 1
   ORDER BY 1
)

SELECT
   bucket,
   frequency,
   CONCAT(
     (min_value + (bucket - 1) * (max_value - min_value) / bucket_count)::INT,
     ' - ',
     (((min_value + bucket * (max_value - min_value) / bucket_count)) - 1)::INT) AS range
FROM histogram, calculated_bucket_settings;
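
-- Worked example of the label math for bucket 3 (min_value 1, max_value 94,
-- bucket_count 5, integer division):
--   lower bound: 1 + (3 - 1) * (94 - 1) / 5   = 1 + 37 = 38
--   upper label: (1 + 3 * (94 - 1) / 5) - 1   = 56 - 1 = 55
-- which produces the "38 - 55" range shown below.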
</code></pre><p>In the <code>calculated_bucket_settings</code> CTE, we use <code>max(value) + 1</code> as the default maximum because the range of values is treated as an exclusive range. Also, because we are working with integers, when we create the pretty label for the <code>range</code>, we subtract 1 from the maximum value for the range to reduce confusion from what would appear to be overlapping ranges. This decision fits into the “good-enough for business intelligence” caveats listed above. We could have changed the label logic to be <code>75 &#60;= value &#60; 94</code> in lieu of the subtraction, but most folks would rather see the dash than math logic for a histogram.<p>The query above will give results like the following:<pre><code class=language-sql>bucket   | frequency |  range
---------+-----------+---------
       1 |         3 | 1 - 18
       3 |         4 | 38 - 55
       4 |         1 | 56 - 74
       5 |         1 | 75 - 93
(4 rows)
</code></pre><p>Now we see that not all buckets and frequencies are represented. So, if a bucket is empty, we need to fill in its frequency with a zero. This is where SQL requires thinking in sets. We can use <code>generate_series</code> to generate all values for the buckets, then join the histogram to all values. Flipping the order of the query around makes it simpler than joining an incomplete set. In the following query, we’ve built out the buckets in the <code>all_buckets</code> CTE, then joined that to the histogram values:<pre><code class=language-sql>WITH formatted_data AS (
  SELECT * FROM (VALUES (13), (42), (18), (62), (93), (47), (51), (41), (1)) AS t (value)
), bucket_settings AS (
  SELECT
        5 AS bucket_count,
        0::integer AS min_value, -- can be null or an integer
        100::integer AS max_value -- can be null or an integer
), calculated_bucket_settings AS (
	SELECT
	  (SELECT bucket_count FROM bucket_settings) AS bucket_count,
	  COALESCE(
	          (SELECT min_value FROM bucket_settings),
	          (SELECT min(value) FROM formatted_data)
	  ) AS min_value,
	  COALESCE(
	          (SELECT max_value FROM bucket_settings),
	          (SELECT max(value) + 1 FROM formatted_data)
	  ) AS max_value
), histogram AS (
  SELECT
    WIDTH_BUCKET(value, calculated_bucket_settings.min_value, calculated_bucket_settings.max_value + 1, (SELECT bucket_count FROM bucket_settings)) AS bucket,
    COUNT(value) AS frequency
  FROM formatted_data, calculated_bucket_settings
  GROUP BY 1
  ORDER BY 1
 ), all_buckets AS (
  SELECT
    fill_buckets.bucket AS bucket,
    FLOOR(calculated_bucket_settings.min_value + (fill_buckets.bucket - 1) * (calculated_bucket_settings.max_value - calculated_bucket_settings.min_value) / (SELECT bucket_count FROM bucket_settings)) AS min_value,
    FLOOR(calculated_bucket_settings.min_value + fill_buckets.bucket * (calculated_bucket_settings.max_value - calculated_bucket_settings.min_value) / (SELECT bucket_count FROM bucket_settings)) AS max_value
  FROM calculated_bucket_settings,
	  generate_series(1, calculated_bucket_settings.bucket_count) AS fill_buckets (bucket))

 SELECT
   all_buckets.bucket AS bucket,
   CASE
   WHEN all_buckets IS NULL THEN
	   'out of bounds'
	 ELSE
     CONCAT(all_buckets.min_value, ' - ', all_buckets.max_value - 1)
   END AS range,
   SUM(COALESCE(histogram.frequency, 0)) AS frequency
 FROM all_buckets
 FULL OUTER JOIN histogram ON all_buckets.bucket = histogram.bucket
 GROUP BY 1, 2
 ORDER BY bucket;
</code></pre><p>Try modifying the values in the <code>bucket_settings</code> CTE to see how the histogram responds. Change the <code>bucket_count</code>, <code>min_value</code>, or <code>max_value</code>, and the buckets and labels adjust accordingly. If you modify the range to exclude values, then thanks to the <code>FULL OUTER JOIN</code>, you’ll see that all non-classified items are bucketed as “out of bounds”.<p>Using a presentation tool, display the histogram as a bar chart (shown below using <a href=https://metabase.com>Metabase</a>):<p><img alt="postgres histogram" loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/fb95ee22-4d7d-40c5-3d2c-74a610cbd000/public><h2 id=real-life-data-with-histograms><a href=#real-life-data-with-histograms>Real Life Data with Histograms</a></h2><p>Now that we have a nice auto-adjusting query, we can easily build a histogram from other data sets. I have a little experimental database loaded from the <a href=https://aact.ctti-clinicaltrials.org/download>database of clinical trials</a>.<p>What if we wanted to build a histogram for the count of participants in various clinical trial studies? To start, build the query that finds the number of participants for each study:<pre><code class=language-sql>SELECT
	outcomes.nct_id,
	max(outcome_counts.count) AS value
FROM outcomes
INNER JOIN outcome_counts ON outcomes.id = outcome_counts.outcome_id
WHERE param_type = 'COUNT_OF_PARTICIPANTS'
GROUP BY 1
</code></pre><p>We can take the above query, and place it in the <code>formatted_data</code> CTE:<pre><code class=language-sql>WITH formatted_data AS (
	SELECT
		outcomes.nct_id,
		MAX(outcome_counts.count) AS value
	FROM outcomes
	INNER JOIN outcome_counts ON outcomes.id = outcome_counts.outcome_id
	WHERE param_type = 'COUNT_OF_PARTICIPANTS'
	GROUP BY 1
), bucket_settings AS (
  SELECT
        20 AS bucket_count,
        null::integer AS min_value, -- can be null or an integer
        null::integer AS max_value -- can be null or an integer
), calculated_bucket_settings AS (
	SELECT
	  (SELECT bucket_count FROM bucket_settings) AS bucket_count,
	  COALESCE(
	          (SELECT min_value FROM bucket_settings),
	          (SELECT min(value) FROM formatted_data)
	  ) AS min_value,
	  COALESCE(
	          (SELECT max_value FROM bucket_settings),
	          (SELECT max(value) + 1 FROM formatted_data)
	  ) AS max_value
), histogram AS (
  SELECT
    WIDTH_BUCKET(value, calculated_bucket_settings.min_value, calculated_bucket_settings.max_value + 1, (SELECT bucket_count FROM bucket_settings)) AS bucket,
     COUNT(value) AS frequency
   FROM formatted_data, calculated_bucket_settings
   GROUP BY 1
   ORDER BY 1
 ), all_buckets AS (
   SELECT
     fill_buckets.bucket AS bucket,
     FLOOR(calculated_bucket_settings.min_value + (fill_buckets.bucket - 1) * (calculated_bucket_settings.max_value - calculated_bucket_settings.min_value) / (SELECT bucket_count FROM bucket_settings)) AS min_value,
     FLOOR(calculated_bucket_settings.min_value + fill_buckets.bucket * (calculated_bucket_settings.max_value - calculated_bucket_settings.min_value) / (SELECT bucket_count FROM bucket_settings)) AS max_value
   FROM calculated_bucket_settings,
	   generate_series(1, calculated_bucket_settings.bucket_count) AS fill_buckets (bucket))

 SELECT
   all_buckets.bucket AS bucket,
   CASE
   WHEN all_buckets IS NULL THEN
	   'out of bounds'
	 ELSE
     CONCAT(all_buckets.min_value, ' - ', all_buckets.max_value - 1)
   END AS range,
   SUM(COALESCE(histogram.frequency, 0)) AS frequency
 FROM all_buckets
 FULL OUTER JOIN histogram ON all_buckets.bucket = histogram.bucket
 GROUP BY 1, 2
 ORDER BY bucket;
</code></pre><p>The query will output the following. This is a bit undesirable because the distribution is concentrated in the first bucket:<pre><code class=language-sql> bucket |       range       | frequency
--------+-------------------+-----------
      1 | 1 - 359943        |     23261
      2 | 359944 - 719886   |         3
      3 | 719887 - 1079829  |         1
      4 | 1079830 - 1439773 |         0
      5 | 1439774 - 1799716 |         1
      6 | 1799717 - 2159659 |         0
      7 | 2159660 - 2519602 |         0
      8 | 2519603 - 2879546 |         0
      9 | 2879547 - 3239489 |         0
     10 | 3239490 - 3599432 |         0
     11 | 3599433 - 3959375 |         0
     12 | 3959376 - 4319319 |         0
     13 | 4319320 - 4679262 |         0
     14 | 4679263 - 5039205 |         0
     15 | 5039206 - 5399148 |         0
     16 | 5399149 - 5759092 |         0
     17 | 5759093 - 6119035 |         0
     18 | 6119036 - 6478978 |         0
     19 | 6478979 - 6838921 |         0
     20 | 6838922 - 7198865 |         1
(20 rows)

</code></pre><p>If you’ve loaded the data, to improve the presentation, we can adjust the <code>bucket_settings</code> CTE to modify how the buckets are defined. For instance, with this dataset, if we changed the bucket settings to:<pre><code class=language-sql>  SELECT
        20 AS bucket_count,
        0::integer AS min_value, -- can be null or an integer
        100::integer AS max_value -- can be null or an integer
</code></pre><p>It outputs a much nicer distribution of data:<pre><code class=language-sql> bucket |     range     | frequency
--------+---------------+-----------
      1 | 0 - 49        |     13584
      2 | 50 - 99       |      3612
      3 | 100 - 149     |      1720
      4 | 150 - 199     |       942
      5 | 200 - 249     |       645
      6 | 250 - 299     |       477
      7 | 300 - 349     |       338
      8 | 350 - 399     |       237
      9 | 400 - 449     |       176
     10 | 450 - 499     |       137
     11 | 500 - 549     |       150
     12 | 550 - 599     |       101
     13 | 600 - 649     |        77
     14 | 650 - 699     |        58
     15 | 700 - 749     |        61
     16 | 750 - 799     |        41
     17 | 800 - 849     |        41
     18 | 850 - 899     |        33
     19 | 900 - 949     |        36
     20 | 950 - 999     |        43
        | out of bounds |       758
</code></pre><h2 id=in-brief><a href=#in-brief>In brief</a></h2><ul><li>Postgres <code>width_bucket</code> assigns values to buckets so you can count frequencies and create histograms.<ul><li>The function assigns values to predefined buckets based on a min/max range and a bucket count.<li>By casting, you can work with data that contains some null values<li>Values that fall outside the defined range can still be counted as “out of bounds”</ul><li>By using Common Table Expressions (CTEs), you can define bucket settings dynamically with auto-adjusting bins based on the dataset.<li>Histograms aid with the visualization of data and data distribution in your set. Histograms show how frequently data points appear within specific ranges (bins), making it easier to understand patterns, trends, and outliers. Bin size does affect interpretation, so choosing the right number of bins is crucial; too few can oversimplify the data, while too many can create noise and obscure trends.</ul><p>Build an interesting histogram? Show us <a href=https://x.com/crunchydata>@crunchydata</a>! ]]></content:encoded>
<category><![CDATA[ Analytics ]]></category>
<author><![CDATA[ Christopher.Winslett@crunchydata.com (Christopher Winslett) ]]></author>
<dc:creator><![CDATA[ Christopher Winslett ]]></dc:creator>
<guid isPermalink="false">62085653255fdab2276832f69926de399f4b9c4e76871d17191be67b2a96104d</guid>
<pubDate>Fri, 04 Apr 2025 10:00:00 EDT</pubDate>
<dc:date>2025-04-04T14:00:00.000Z</dc:date>
<atom:updated>2025-04-04T14:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ 8 Steps in Writing Analytical SQL Queries ]]></title>
<link>https://www.crunchydata.com/blog/8-steps-in-writing-analytical-sql-queries</link>
<description><![CDATA[ Chris breaks down his approach to building out complex SQL step by step. ]]></description>
<content:encoded><![CDATA[ <p>It is never immediately obvious how to go from a simple SQL query to a complex one -- especially if it involves intricate calculations. One of the “dangers” of SQL is that you can create an executable query that returns the wrong data. For example, it is easy to inflate the value of a calculated field by joining to multiple rows.<p>Use Crunchy Playground to follow along with this blog post using a Postgres terminal:<p><a href="https://www.crunchydata.com/developers/playground?sql=https://gist.githubusercontent.com/Winslett/328f3332f3a54ef5ea5f70f8fe72afb3/raw/09387ac8cd5075b1f8179a25b951870b97a4e84e/fake-data.sql">Postgres Playground w/ Sample Data</a><p>Let’s take a look at a sample query. It appears to compute a summary total of invoice amounts across teams. If you look closely, you might see that the joins would inflate a team’s yearly invoice spend for each team member.<pre><code class=language-sql>SELECT
	teams.id,
	json_agg(accounts.email),
	SUM(invoices.amount)
FROM teams
	INNER JOIN team_members ON teams.id = team_members.team_id
	INNER JOIN accounts ON team_members.account_id = accounts.id
	INNER JOIN invoices ON teams.id = invoices.team_id
WHERE lower(invoices.period) > date_trunc('year', current_date)
GROUP BY 1;

</code></pre><p>The query is joining <code>invoices</code> to <code>teams</code> after already joining <code>team_members</code> to <code>teams</code>. If a team has multiple members and multiple invoices, each invoice amount could be counted multiple times in the <code>SUM(invoices.amount)</code> calculation.<h2 id=building-sql-from-the-ground-up><a href=#building-sql-from-the-ground-up>Building SQL from the ground up</a></h2><p>The above error may not be immediately obvious. This is why it’s better to start small and use building blocks.<p>Writing complex SQL isn’t as much “writing a query” as it is “building a query.” By combining the building blocks, you get the data that you think you are getting. To write a complex query, loop through the following steps until you get to the intended data:<ol><li>Using words, define the data<li>Investigate available data<li>Return the simplest data<li>Confirm the simple data<li>Augment the data with joins<li>Perform summations<li>Augment with details or aggregates<li>Debugging</ol><p>Let’s step through this above query example, getting sum aggregate totals, to learn my method for building a query.<h3 id=step-1-in-human-words-write-what-you-want><a href=#step-1-in-human-words-write-what-you-want>Step 1: In human words, write what you want</a></h3><p>Write a description, and know it is okay if it changes. Data exploration may mean the data is different than expected. But, it’s a starting point. I usually do this by adding a comment at the top of the editor:<pre><code>-- Return all teams, email addresses for the team, and the
-- year-to-date total spend
</code></pre><h3 id=step-2-investigate-the-data-in-the-tables><a href=#step-2-investigate-the-data-in-the-tables>Step 2: Investigate the data in the tables</a></h3><p>Even when familiar with the data set, I spend time ensuring the data has not changed. First, if using <code>psql</code>, list the tables:<pre><code>psql> \dt
psql> \d invoices
</code></pre><p>There are many SQL clients, and all of them should enable listing and viewing tables and table structures. To further inspect, write a simple query to sample the data:<pre><code class=language-sql>SELECT * FROM invoices;
</code></pre><p>Try this on a few different tables. By inspecting column names and column data, I can see a pattern of relationships. When exploring a dataset created by someone else, it can be difficult to determine relationships. Data isn’t always clean. Columns may be incorrectly named. "Magic strings" and "magic integers" may not make sense. Multiple application developers implement different philosophies with data structures.<p>To verify table structures, I take a two-step approach: 1) compare it to known data, and 2) ask people involved with the project. When asking a person about the structure of data, they will never respond with "yes" or "no" -- the data structure always has a story. It’s important to verify relationships -- it is possible to join two non-related fields.<h3 id=step-3-find-the-simplest-data-first><a href=#step-3-find-the-simplest-data-first>Step 3: Find the simplest data first</a></h3><p>In this scenario, the easiest data to return is the invoices. We also want to calculate the team spend for the year. First, reduce to invoices that should go into the calculations:<pre><code class=language-sql>SELECT
	*
FROM invoices
WHERE lower(period) > date_trunc('year', current_date);
</code></pre><p>Look over the data and confirm the returned rows match expected data: included and excluded. When viewing the data, add a where conditional for attributes that should be excluded. A common issue with missing rows on conditionals is NULL values. The following conditional will also exclude when <code>deleted_at</code> is <code>NULL</code>:<pre><code class=language-sql>AND deleted_at &#60 date_trunc('year', current_date)
</code></pre><p>To include <code>NULL</code> values, the conditional will need to be expanded to:<pre><code class=language-sql>AND (deleted_at &#60 date_trunc('year', current_date) OR
deleted_at IS NULL)
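-- An equivalent sketch using COALESCE (assuming deleted_at is a timestamp
-- column): substitute a sentinel so NULL values always pass the comparison
-- AND COALESCE(deleted_at, '-infinity'::timestamp) &#60 date_trunc('year', current_date)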
</code></pre><h3 id=step-4-confirm-the-simple-data><a href=#step-4-confirm-the-simple-data>Step 4: Confirm the simple data</a></h3><p>When working through complex queries that require precision, like financial reports, you may need to audit the results row by row. Step through each row and confirm the results. Then, step through a known set of good data and ensure data is not missing. Many mis-written SQL queries are found via this 2-sided verification.<h3 id=step-5-add-joins-but-do-not-add-calculations-yet><a href=#step-5-add-joins-but-do-not-add-calculations-yet>Step 5: Add joins, but do not add calculations yet</a></h3><p>Start with the most reasonable <a href=https://www.crunchydata.com/developers/playground/joins-in-postgres>joins</a> first. Because this is an example, we already know the data. In the real world, this step requires additional experimentation and data validation from team members:<pre><code class=language-sql>SELECT
	*
FROM invoices
	INNER JOIN teams ON invoices.team_id = teams.id
WHERE lower(period) > date_trunc('year', current_date);
</code></pre><p>After adding the joins, run the query and inspect the data. By joining the data, the query is returning more columns. Start limiting the response to the columns to be used. Remove the <code>*</code> and go with column names:<pre><code class=language-sql>SELECT
	invoices.period,
	invoices.amount,
	teams.id,
	teams.name
FROM invoices
	INNER JOIN teams ON invoices.team_id = teams.id
WHERE lower(period) > date_trunc('year', current_date);
</code></pre><p>Once that works, add additional joins until it breaks. In this example, experiment by adding <code>team_members</code>:<pre><code class=language-sql>SELECT
	invoices.period,
	invoices.amount,
	teams.id,
	teams.name
FROM invoices
	INNER JOIN teams ON invoices.team_id = teams.id
	INNER JOIN team_members ON teams.id = team_members.team_id
WHERE lower(period) > date_trunc('year', current_date);
</code></pre><p>But that has duplicate rows -- previously, the query returned 602 rows and now it returns 3749 rows. Why? When joining teams and team_members, the one-to-many relationship adds one row for each additional team member. In this case, we step back to go forward: remove the latest join and encapsulate the working query.<p>Common issues during this phase include:<ul><li>typos in the join conditional -- when working with tables with similar names, it is easy to insert an incorrect join condition. For instance, the following query will execute and return completely the wrong data. Can you spot the error?</ul><pre><code class=language-sql>SELECT
	invoices.period, invoices.amount, teams.id, teams.name
FROM invoices
	INNER JOIN teams ON invoices.id = teams.id
WHERE lower(period) > date_trunc('year', current_date);
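-- (The error: the join condition uses invoices.id = teams.id, pairing
--  unrelated rows whose primary keys happen to match. As in the earlier
--  queries, it should be invoices.team_id = teams.id.)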
</code></pre><p>The other question is: what kind of join should I use? Quick refresher:<ul><li><code>INNER JOIN</code> is an exclusive join. Only rows with a matching row in the joined table are returned.<li><code>LEFT JOIN</code> is a non-exclusive join. All rows from the previously requested table are returned, with the joined table’s values where a match exists (and NULL otherwise).<li><code>FULL OUTER JOIN</code> returns all rows from both tables; where no match is found, the other table’s columns return NULL.</ul><h3 id=step-6-perform-summations><a href=#step-6-perform-summations>Step 6: Perform summations</a></h3><p>Let’s rewind back to what works, and package it into a <a href=https://www.crunchydata.com/developers/playground/ctes-and-window-functions>CTE</a> that we can use as a join. As you make changes, you'll make some wrong steps -- that is common. Know how to get back to a working query. Often that requires undoing changes to a working state.<p>Once I get to a working state, then I package the bit of data into a CTE (or common table expression):<pre><code class=language-sql>WITH team_yearly_spend AS (
	SELECT
		invoices.period AS invoice_period,
		invoices.amount AS invoice_amount,
		teams.id AS team_id,
		teams.name AS team_name
	FROM invoices
		INNER JOIN teams ON invoices.team_id = teams.id
	WHERE lower(period) > date_trunc('year', current_date)
)

SELECT * FROM team_yearly_spend;
</code></pre><p>Notice the use of <code>AS</code> to declare unique names for a column. When building complex queries, I favor verbosity to limit mistakes.<p>Let's add aggregations to the CTE:<pre><code class=language-sql>WITH team_yearly_spend AS (
	SELECT
		teams.id AS team_id,
		teams.name AS team_name,
		SUM(invoices.amount) AS team_yearly_spend
	FROM invoices
		INNER JOIN teams ON invoices.team_id = teams.id
	WHERE lower(period) > date_trunc('year', current_date)
	GROUP BY 1
)

SELECT
	*
FROM team_yearly_spend;
</code></pre><h3 id=step-7-lastly-augment-data-to-include-details><a href=#step-7-lastly-augment-data-to-include-details>Step 7: Lastly, augment data to include details</a></h3><p>To include team member emails as specified at the beginning, we will join the team members to the statement outside the CTE:<pre><code class=language-sql>WITH team_yearly_spend AS (
	SELECT
		teams.id AS team_id,
		teams.name,
		SUM(invoices.amount) AS spend
	FROM invoices
		INNER JOIN teams ON invoices.team_id = teams.id
	WHERE lower(period) > date_trunc('year', current_date)
	GROUP BY 1
)

SELECT
	team_yearly_spend.team_id,
	team_yearly_spend.spend,
	COUNT(DISTINCT accounts.id) AS accounts_count,
	JSON_AGG(accounts.email) AS account_emails
FROM team_yearly_spend
LEFT JOIN team_members ON team_yearly_spend.team_id = team_members.team_id
LEFT JOIN accounts ON team_members.account_id = accounts.id
GROUP BY 1, 2
;
</code></pre><h3 id=step-8-debugging><a href=#step-8-debugging>Step 8: Debugging</a></h3><p>To debug output errors, I find it easier to remove the calculations to get to the raw data. When using a query editing tool that allows running a visually selected query (like DBeaver), I comment out the aggregations and add a <code>*</code> to return more values:<pre><code class=language-sql>-- WITH team_yearly_spend AS (
	SELECT
		teams.id AS team_id,
		teams.name,
		*
--		SUM(invoices.amount) AS spend
	FROM invoices
		INNER JOIN teams ON invoices.team_id = teams.id
	WHERE lower(period) > date_trunc('year', current_date)
--	GROUP BY 1
--)
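-- Another quick sanity check: compare the base table's row count against the
-- joined result to spot join fan-out, e.g.
-- SELECT COUNT(*) FROM invoices WHERE lower(period) > date_trunc('year', current_date);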
</code></pre><p>With this response, look for:<ul><li>rows duplicated by joins,<li>rows that should be present, yet are missing due to a bad conditional,<li>rows that are included that should be filtered out with a conditional.</ul><p>Debugging SQL queries is a simple process, but it’s not an easy process. It requires a data audit, usually best compared against a known value.<h2 id=why-is-sql-complex><a href=#why-is-sql-complex>Why is SQL complex?</a></h2><p>The schema above is typical of an application data structure designed with OLTP in mind. The SQL that we have just written can use that schema to generate values for reports or for display to application users. That is the great thing about SQL -- no matter how the underlying structure is represented, we can get the data we want out of it.<p>SQL is powerful because it’s built using simple, standardized blocks of logic.<p>Writing SQL is a non-linear process. I've never seen someone start at the top of a long SQL query and type through to the end. It is a process that involves multiple levels of extraction, verification, and summation. ]]></content:encoded>
<category><![CDATA[ Analytics ]]></category>
<author><![CDATA[ Christopher.Winslett@crunchydata.com (Christopher Winslett) ]]></author>
<dc:creator><![CDATA[ Christopher Winslett ]]></dc:creator>
<guid isPermalink="false">f746bda5b29206e87ce36bc19248848e55b8677f228b18df6d77e09e0c116490</guid>
<pubDate>Fri, 08 Nov 2024 09:30:00 EST</pubDate>
<dc:date>2024-11-08T14:30:00.000Z</dc:date>
<atom:updated>2024-11-08T14:30:00.000Z</atom:updated></item>
<item><title><![CDATA[ 4 Ways to Create Date Bins in Postgres: interval, date_trunc, extract, and to_char ]]></title>
<link>https://www.crunchydata.com/blog/4-ways-to-create-date-bins-in-postgres-interval-date_trunc-extract-and-to_char</link>
<description><![CDATA[ Chris has lots of tips and sample code for getting date based report data from Postgres. He is rolling up days, weeks, months, and quarters and even has handy functions for labeling date results in your preferred format. ]]></description>
<content:encoded><![CDATA[ <p>You followed all the best practices, your sales dates are stored in perfect timestamp format… but now you need to get reports by day, week, quarter, and month. You need to bin, bucket, and roll up sales data into easy-to-view reports. Do you need a BI tool? Not yet, actually. Your Postgres database has hundreds of functions that let you query data analytics by date. By using some good old-fashioned SQL, you have powerful analysis and business intelligence with date details on any data set.<p>In this post, I’ll walk through some of the key functions for querying data by date.</p><!--more--><p>For a summary of the best ways to store date and time in Postgres, see <a href=https://www.crunchydata.com/blog/working-with-time-in-postgres>Working with Time in Postgres</a>. We also have an <a href=https://www.crunchydata.com/developers/playground/postgres-date-functions>interactive web-based tutorial</a> with lots of sample code for working with data by date, using a sample data set of ecommerce orders.<h2 id=interval---the-swiss-army-knife-of-date-manipulation><a href=#interval---the-swiss-army-knife-of-date-manipulation>Interval - the Swiss-army knife of date manipulation</a></h2><p>The <code>interval</code> is a data type used to modify other times. For instance, an interval can be added to or subtracted from a known time. Interval is super handy and the first place you can go to quickly summarize data by date. Like a Swiss-army knife, it’s not always the best tool for the job, but it can be used in a pinch. Let’s talk about where it excels.<p>How can we run a query that returns the total sum of orders for the last 90 days? Of course, interval can be used. Without interval, we often see people using a date variable passed from an external source that has generated a date. Using <code>now() - INTERVAL '90 days'</code>, you can use the same query no matter the date. 
The other secret sauce is the use of <code>now()</code>, which returns a timestamp for the current time on the server.<pre><code class=language-sql>SELECT
  SUM(total_amount)
FROM
  orders
WHERE
  order_date >= NOW() - INTERVAL '90 days';
</code></pre><pre><code class=language-sql>    sum
-----------
 259472.99
(1 row)
</code></pre><p>Instead of using <code>now()</code>, <code>current_date</code> can be used to return a date instead of a time.<pre><code class=language-sql>SELECT
  SUM(total_amount)
FROM
  orders
WHERE
  order_date >= current_date - INTERVAL '90 days';
</code></pre><p>These two queries are different — <code>current_date</code> starts at the beginning of the day, and <code>now()</code> will include a time throughout the day. When using <code>now()</code>, the results will match only those that occurred after the current time 90 days ago.<p>Commonly, people use a shorter form for intervals using a cast, but it’s the same query:<pre><code class=language-sql>SELECT
  SUM(total_amount)
FROM
  orders
WHERE
  order_date >= NOW() - '90 days'::interval;
</code></pre><p><strong>Using interval for binning</strong><p>To create interval ranges, we can combine the use of <code>CASE</code> with <code>interval</code>. SQL’s <code>CASE</code> performs conditional logic within queries. The format for <code>CASE</code> is <code>WHEN .. THEN</code> , below is a query that executes a sample case statement:<pre><code class=language-sql>SELECT
  CASE
    WHEN false THEN 'not this'
    WHEN true THEN 'this will show'
    ELSE 'never makes it here'
  END;
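-- the query above returns 'this will show': CASE stops at the first true WHEN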
</code></pre><p>Now, let’s categorize orders into the time ranges: "30-60 days ago", "60-90 days ago"<pre><code class=language-sql>SELECT
    CASE
        WHEN order_date BETWEEN (NOW() - INTERVAL '60 days') AND (NOW() - INTERVAL '30 days')
            THEN '30-60 days ago'
        WHEN order_date BETWEEN (NOW() - INTERVAL '90 days') AND (NOW() - INTERVAL '60 days')
            THEN '60-90 days ago'
    END AS date_range,
    COUNT(*) AS total_orders,
    SUM(total_amount) AS total_sales
FROM
  orders
WHERE
  order_date BETWEEN (NOW() - INTERVAL '90 days') AND (NOW() - INTERVAL '30 days')
GROUP BY
  date_range
ORDER BY
  date_range;
</code></pre><pre><code class=language-sql>   date_range   | total_orders | total_sales
----------------+--------------+-------------
 30-60 days ago |          160 |   101754.20
 60-90 days ago |          128 |    88086.24
</code></pre><p>This may look a bit complicated, but the conditional for the statement is <code>order_date BETWEEN beginning_date_value AND ending_date_value</code>. Since <code>CASE</code> statements end after the first truthy conditional, we can simplify this a bit more:<pre><code class=language-sql>SELECT
    CASE
	    WHEN order_date >= NOW() - '30 days'::interval THEN '00-30 days ago'
	    WHEN order_date >= NOW() - '60 days'::interval THEN '30-60 days ago'
	    ELSE
		    '60-90 days ago'
	  END AS date_range,
    COUNT(*) AS total_orders,
    SUM(total_amount) AS total_sales
FROM
  orders
WHERE
  order_date >= NOW() - '90 days'::interval
GROUP BY
  date_range
ORDER BY
  date_range;
</code></pre><p>It’s best to choose a pattern depending on how explicit you want to be with your SQL queries. Using <code>BETWEEN</code> is more explicit, and may be best for teams choosing more explicit queries. The hard part about using <code>INTERVAL</code> is that recent time is greater than older time — so the <code>>=</code> may break the brains of those who haven’t used a lot of time manipulation.<p>In summary: use <code>interval</code> for binning continuous time.<h2 id=date_trunc---the-easiest-function-for-date-binning><a href=#date_trunc---the-easiest-function-for-date-binning>date_trunc - the easiest function for date binning</a></h2><p>Use <code>date_trunc</code> for binning of pre-defined time: like day, week, month, quarter, and year. Where interval logic can be complicated, <em>date_trunc</em> is dead simple.<p>At a glance, <em>date_trunc’s</em> name might indicate that it’s about formatting, but it is more powerful when combined with <code>GROUP BY</code>. <em>date_trunc</em> is an essential part of the query toolkit when working with analytics. Simple uses of date_trunc look like the following:<pre><code class=language-sql>/* show the beginning of the first day of the month */
SELECT date_trunc('month', current_date);

/* show the beginning of the first day of the week */
SELECT date_trunc('week', current_date);

/* show the beginning of the first day of the year */
SELECT date_trunc('year', current_date);

/* show the beginning of the first day of the current quarter */
SELECT date_trunc('quarter', current_date);
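
/* these compose with interval arithmetic, e.g. the first day of the previous month */
SELECT date_trunc('month', current_date) - INTERVAL '1 month';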
</code></pre><p>To generate a date bin, extract the period of time from the record’s date. For instance, let’s write a query to show the monthly number of orders and total order sales:<pre><code class=language-sql>SELECT
  date_trunc('month', order_date) AS month,
  COUNT(*) AS total_orders,
  SUM(total_amount) AS monthly_total
FROM
  orders
GROUP BY 1
ORDER BY
  month;
</code></pre><p>Results would look like:<pre><code class=language-sql>        month        | total_orders | monthly_total
---------------------+--------------+---------------
 2024-08-01 00:00:00 |           11 |       2699.82
 2024-09-01 00:00:00 |           39 |       8439.41
(2 rows)
</code></pre><p>Using <code>GROUP BY</code>, Postgres counts and sums based on the unique values returned by the <code>date_trunc</code> function. The available bins for <code>date_trunc</code> are: millennium, century, decade, year, quarter, month, week, day, hour, minute, second, millisecond.<h2 id=extract---sometimes-you-have-to-do-something-funky><a href=#extract---sometimes-you-have-to-do-something-funky>Extract - sometimes you have to do something funky</a></h2><p>Not all dates are nicely broken into days, months, years, etc. The <code>extract</code> function extracts a specific value for a date / time type. For instance, I commonly use <em>extract</em> for the following:<pre><code class=language-sql>/* returns the epoch value for a date / time    */
/* I use this to send date values to JavaScript */
SELECT extract('epoch' from current_date);

/* returns the hour from a time type */
SELECT extract('hour' from now());
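
/* extract('epoch' ...) also works on intervals, e.g. seconds since midnight */
SELECT extract('epoch' from now() - date_trunc('day', now()));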
</code></pre><p>How can this be used to bin values? For example, if you wanted to find which hours of which day of the week have the highest number and sales value of orders:<pre><code class=language-sql>SELECT
    extract('dow' from order_date) AS day_of_week,
    extract('hour' from order_date) AS hour,
    COUNT(*) AS total_orders,
    SUM(total_amount) AS monthly_total
FROM
    orders
GROUP BY 1, 2
ORDER BY 1, 2;
</code></pre><pre><code class=language-sql> day_of_week | hour | total_orders | monthly_total
-------------+------+--------------+---------------
           0 |   23 |           35 |      23631.56
           1 |    0 |           31 |      19299.88
</code></pre><p>You'll see here Sunday is <code>0</code> and Saturday is <code>6</code>.<p>Where <code>date_trunc</code> keeps the higher context, <code>extract</code> removes all context except that which is requested.<h2 id=to_char---extreme-makeover-date-edition><a href=#to_char---extreme-makeover-date-edition>to_char - extreme makeover date edition</a></h2><p>It’s awkward because <code>to_char</code> is both the most versatile and most hated function for date binning. The function will accept time / date, text, or numbers for additional formatting, so it’s not explicitly for date functions. Without fail, when I’ve used to_char, someone has told me that I could have used a better function. It can produce human-readable values quickly, but it’s unsuited for data sent for additional machine processing.<p>Here are a few examples of <code>to_char</code>:<pre><code class=language-sql>/* extract current day of week and current hour of day based on UTC */
SELECT to_char(now(), 'DayHH24');

/* extract current day of week and current hour of day based on NYC time zone */
SELECT to_char(now() AT TIME ZONE 'America/New_York' , 'DayHH24');
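
/* the FM ("fill mode") prefix trims the blank padding that Day adds */
SELECT to_char(now(), 'FMDay HH24');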
</code></pre><p>This outputs the current day of the week and the current hour based on UTC time. This breaks your brain, right? What does the “DayHH24” portion mean? Postgres documentation has a long list of <a href=https://www.postgresql.org/docs/17/functions-formatting.html#FUNCTIONS-FORMATTING-DATETIME-TABLE>reserved strings used by to_char</a>:<p>To change the presentation of a month, use to_char to extract and format the name and year:<pre><code class=language-sql>SELECT to_char(order_date, 'FMMonth YYYY') AS formatted_month,
    COUNT(*) AS total_orders,
    SUM(total_amount) AS monthly_total
FROM
    orders
GROUP BY 1
ORDER BY 1;
</code></pre><pre><code class=language-sql> formatted_month | total_orders | monthly_total
-----------------+--------------+---------------
 August 2024     |           11 |       2699.82
 September 2024  |           39 |       8439.41
</code></pre><p><strong>Escaping reserved strings in <code>to_char</code>:</strong><p>The common format for quarters in finance is “Q1” / “Q2” / “Q3” and “Q4”. Using <code>to_char</code>, we can extract the quarter for a time in that format. But, the “Q” is a reserved keyword for quarter. To print a “Q” without evaluating it, wrap it in double quotes:<pre><code class=language-sql>SELECT
    to_char(order_date, '"Q"Q-YYYY') AS formatted_quarter,
    SUM(total_amount) AS total_amount
FROM
    orders
GROUP BY 1
ORDER BY 1;
</code></pre><pre><code class=language-sql> formatted_quarter | total_amount
-------------------+--------------
 Q1-2022           |    313872.84
 Q1-2023           |    282774.15
 Q1-2024           |    287379.33
</code></pre><h2 id=summary><a href=#summary>Summary</a></h2><p>Binning is an essential tool for faceting the data for financial reports and data analysis. Dates and times are a more complex piece of information than they first appear — hours, days, months, quarters, years. So, a single date can be faceted many ways.<p>Luckily, Postgres has the functions you need to work with dates. For a quick summary:<p><strong>interval</strong> - modifies date / times by adding / subtracting<p><strong>date_trunc -</strong> truncates a date / time — essentially rounding-down to the closest value<p><strong>extract</strong> - extracts a single piece of information from a date / time (day, week, month, quarter, year)<p><strong>to_char</strong> - formats output into a specific style of date format or text string. ]]></content:encoded>
<category><![CDATA[ Analytics ]]></category>
<author><![CDATA[ Christopher.Winslett@crunchydata.com (Christopher Winslett) ]]></author>
<dc:creator><![CDATA[ Christopher Winslett ]]></dc:creator>
<guid isPermalink="false">132ba3fcf14753154c6b4688d112659ac3e430a333412042bef4f0d67927f6c2</guid>
<pubDate>Tue, 29 Oct 2024 10:30:00 EDT</pubDate>
<dc:date>2024-10-29T14:30:00.000Z</dc:date>
<atom:updated>2024-10-29T14:30:00.000Z</atom:updated></item>
<item><title><![CDATA[ Using acts_as_tenant for Multi-tenant Postgres with Rails ]]></title>
<link>https://www.crunchydata.com/blog/using-acts_as_tenant-for-multi-tenant-postgres-with-rails</link>
<description><![CDATA[ Chris walks through using the acts_as_tenant gem. He shows some example code to get started with this gem, how to migrate, and other tips for working with B2B or multi-tenant applications. ]]></description>
<content:encoded><![CDATA[ <p>Since its launch, Ruby on Rails has been a preferred open source framework for small-team B2B SaaS companies. Ruby on Rails uses a conventions-over-configuration mantra. This approach reduces common technical choices, thus elevating decisions. With this approach, the developers get an ORM (ActiveRecord), templating engine (ERB), helper methods (like <code>number_to_currency</code>), controller (ActionController), directory setup defaults (<code>app/{models,controllers,views}</code>), authentication methods (<code>has_secure_password</code>), and more.<p><a href=https://www.crunchydata.com/blog/designing-your-postgres-database-for-multi-tenancy>Multi-tenant</a> architecture is the backbone of B2B SaaS products, yet core Rails remains un-opinionated on multi-tenant implementations. Through the years, there have been many different Ruby gem implementations of multi-tenancy. Many of these gems were built for complicated situations — either adapting to scaling needs or regulated industries that require physical separation of data. Many of these gems required deep integration with your Rails application code.<h2 id=enter-acts_as_tenant><a href=#enter-acts_as_tenant>Enter acts_as_tenant</a></h2><p>With all that as context, the <code>acts_as_tenant</code> gem is super simple. <a href=https://github.com/ErwinM/acts_as_tenant><code>acts_as_tenant</code></a> has recently released version 1.0 after 12 years of development — so it’s not new. The gem implements multi-tenant best practices by augmenting Rails’ ActiveRecord ORM:<ul><li>protects developers from building queries that return other tenants’ records<li>requires a <code>tenant_id</code> on the tables for models specific to a tenant<li>adds the <code>tenant_id</code> scope to the query<li>includes ActionController, ActiveRecord, ActiveJob helpers to insert new records with the scoped tenant</ul><p>Acts_as_tenant is built for row-level multi-tenancy, and that is it. 
So, no need to manage multiple databases or schemas for data structures — it keeps it simple. One of the best things I can say about <code>acts_as_tenant</code> is that it can be implemented in an existing application code-base. Too many times, with the older multi-tenant gems, the implementation was invasive, and thus required complex refactoring.<p><strong>What it’s not:</strong> acts_as_tenant is not for account-based sharding — either schema-based or multi-cluster based sharding. It’s purely for multi-tenant safety.<h2 id=for-the-paranoid><a href=#for-the-paranoid>For the paranoid</a></h2><p>I have built a few multi-tenant apps in industries with data regulation (think finance and education). I am overly cautious when building multi-tenant apps — so this guardrail is my favorite.<p>To enforce the <code>tenant_id</code> on every ActiveRecord query within an application, add the following to an initializer file in <code>config/initializers/acts_as_tenant.rb</code>:<pre><code class=language-ruby>ActsAsTenant.configure do |config|
	config.require_tenant = true
end
</code></pre><p>Having worked in a few multi-tenant apps where showing data to another customer is consequential, I wish <code>acts_as_tenant</code> could also enforce the requirement of a <code>tenant_id</code> on raw queries. One of the apps I wrote required high-performance, large-scale data loads. We had an intermittent bug where people would be assigned to the incorrect tenant. After tracking down the bug, we found the culprit in the implementation of multiple <code>external_ids</code>:<pre><code class=language-sql>-- bug code
SELECT
  *
FROM people
WHERE tenant_id = $1 AND external_id = $2 OR other_external_id = $2;

-- correct code
SELECT
  *
FROM people
WHERE tenant_id = $1 AND (external_id = $2 OR other_external_id = $2);
</code></pre><p>The lesson: wrap your <code>OR</code> statements in parentheses. The bug code was interpreted as:<pre><code class=language-sql>(tenant_id = $1 AND external_id = $2) OR other_external_id = $2;
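
-- With ActiveRecord, the grouping is explicit (a sketch; the Person model is
-- assumed): Person.where(external_id: x).or(Person.where(other_external_id: x))
-- acts_as_tenant then ANDs its tenant_id scope around the whole relation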
</code></pre><p>When using acts_as_tenant with ActiveRecord models, you can avoid this bug. Below, you’ll see that ActiveRecord encapsulates the following:<p><img alt="active record output" loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/7e4070b3-c6bc-48e2-09fb-bd4b55f1ff00/public><p>Remember, if you choose to use raw SQL, you’ll need to keep your guard up.<h2 id=testing-from-rails-new-app><a href=#testing-from-rails-new-app>Testing from <code>rails new app</code></a></h2><p>To install from a new rails application, do the following:<ol><li>Run <code>rails new multi-tenant-app</code><li>Decide on your application’s tenant model: typically <code>Organization</code> or <code>Account</code> or <code>Team</code> or <code>School</code>. Use the underscore version of the name with <code>_id</code> appended as your tenant id for all columns, such as <code>organization_id</code> or <code>account_id</code> or <code>team_id</code> or <code>school_id</code>. Below, we will use the tenant name <code>Account</code>.<li>Add <code>gem "acts_as_tenant"</code> to <code>Gemfile</code>, and run <code>bundle install</code>.<li>Create some models:</ol><pre><code class=language-bash>rails g model Account name:string
rails g model User email:string account_id:integer
rails g model Post content:string user_id:integer account_id:integer

rails db:create &#38&#38 rails db:migrate
</code></pre><ol start=5><li>Add the following to <code>app/models/account.rb</code></ol><pre><code class=language-ruby>class Account &#60 ApplicationRecord

  has_many :users
  has_many :posts

end
</code></pre><ol start=6><li>Add the following to <code>app/models/post.rb</code>:</ol><pre><code class=language-ruby>class Post &#60 ApplicationRecord

  belongs_to :user
  acts_as_tenant :account

end
</code></pre><ol start=7><li>Add the following to <code>app/models/user.rb</code>:</ol><pre><code class=language-ruby>class User &#60 ApplicationRecord
  acts_as_tenant :account
  validates_uniqueness_to_tenant :email
end
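
# Note: the following is not code from the gem -- just a simplified sketch of
# what declaring acts_as_tenant does. With a tenant set, queries on these
# models are scoped roughly as if you had declared:
#
#   default_scope { where(account_id: ActsAsTenant.current_tenant.id) }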
</code></pre><ol start=8><li>Now, let’s experiment with the Rails REPL:</ol><pre><code class=language-bash>rails console
</code></pre><p>Then, you can run the following commands:<pre><code class=language-ruby>first_account = Account.create!(name: "First Account")
last_account = Account.create!(name: "Last Account")

ActsAsTenant.with_tenant(first_account) do
  user = User.create!(email: "test@example.com")
  post = Post.create!(user: user, content: "Lorem Ipsum")
end

ActsAsTenant.with_tenant(first_account) do
  Post.first.content # -> "Lorem Ipsum"
end

ActsAsTenant.with_tenant(last_account) do
  Post.first.nil? # -> true because we did not create a post for this tenant
end

Post.first.content # -> "Lorem Ipsum"

ActsAsTenant.configure do |config|
  config.require_tenant = true
end

Post.first.content # raises ActsAsTenant::Errors::NoTenantSet
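
# With require_tenant enabled, intentionally cross-tenant work (reports,
# admin tooling) can opt out explicitly. A sketch using the gem's
# without_tenant helper:
ActsAsTenant.without_tenant do
  Post.first.content # -> "Lorem Ipsum", scoping is disabled inside the block
end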
</code></pre><p>If you look at the queries ActiveRecord runs, you’ll see it automatically appends the <code>account_id</code> to the User and Post as they are created. And once <code>require_tenant</code> is set, the final command fails with an error.<ol start=9><li>In the console, we explicitly used <code>with_tenant</code>. acts_as_tenant has helpers for controllers as well. Depending on how your authentication and tenancy work, you can use domains, subdomains, or implicit tenancy based on the authenticated user. You’ll need to implement something like:</ol><pre><code class=language-ruby>class ApplicationController &#60 ActionController::Base
  set_current_tenant_through_filter
  before_action :require_authentication
  before_action :set_tenant

  def require_authentication
    current_user || redirect_to(new_session_path)
  end

  def current_user
    @current_user ||= if session[:user_id].present?
      User.find(session[:user_id])
    end
  end

  def current_account
    @current_account ||= current_user.try(:account)
  end

  def set_tenant
    set_current_tenant(current_account)
  end
end
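
# Alternatively, if tenants map to subdomains, the gem's subdomain helper can
# replace the filter approach above (this sketch assumes Account has a
# subdomain column):
#
#   class ApplicationController &#60 ActionController::Base
#     set_current_tenant_by_subdomain(:account, :subdomain)
#   end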
</code></pre><p>Implementing proper authentication is complex, so this is only an example. The code specific to acts_as_tenant is <code>set_current_tenant_through_filter</code>, <code>before_action :set_tenant</code>, and <code>def set_tenant</code>.<h2 id=migrating-to-acts_as_tenant><a href=#migrating-to-acts_as_tenant>Migrating to acts_as_tenant</a></h2><p>If you have an existing codebase that would benefit from acts_as_tenant, the migration is a process that can be broken into multiple steps:<ol><li><strong>Add a tenant_id column to each affected model</strong> - this step can be quite complicated. It requires schema migrations and data backfills, and the method of updating columns will depend on the size of your database.<li><strong>Add the acts_as_tenant gem, but do not set require_tenant yet</strong><li><strong>Define the tenancy for your ApplicationController using domains, subdomains, or a filter</strong><li><strong>Define the tenancy for your Active Job workers</strong><li><strong>Define tenancy for your models</strong></ol><p>With this measured approach, you can deploy each of the steps above independently, and each model change independently of the rest.<h2 id=removing-acts_as_tenant><a href=#removing-acts_as_tenant>Removing acts_as_tenant</a></h2><p>The best thing I can say about a library is: you can migrate away from it if it does not work for you. Because acts_as_tenant is not as deep an integration as past multi-tenant libraries were, it is possible to move away from it.<h2 id=summary><a href=#summary>Summary</a></h2><p>Back in the 2009-ish era, Ruby on Rails and “The Cloud” grew up together as cloud SaaS and social networks took off. Back then, network-attached storage topped out around 100 IOPS and 1TB per volume. The limited IOPS strangled database performance, and 1TB was a hard ceiling (if you did not set up RAID early). I started my career in that era. 
Due to those infrastructure limitations, multi-tenant databases would start to see issues when an application hit as little as 50 requests per second. In that era, RAM was expensive and fast disks were not available, so “sharding” was talked about at every conference.<p><em>Side note: data was also suddenly available everywhere, and many companies stored massive amounts of it hoping to figure out a business model later.</em><p>Now, in 2023, RAM is plentiful and IOPS are cheap. Scaling the database can be punted until tens of thousands of requests per second.<p>Why do I say all this? Because now we can approach multi-tenant apps and scaling more practically. A multi-tenant design can focus on data security and practical coding instead of scaling. You may never reach the point of needing distributed data stores, but a solid multi-tenant implementation lays a foundation for your application’s success.<p>The old multi-tenant Ruby gems were built for scalability. acts_as_tenant is built for practicality. ]]></content:encoded>
<category><![CDATA[ Ruby on Rails ]]></category>
<author><![CDATA[ Christopher.Winslett@crunchydata.com (Christopher Winslett) ]]></author>
<dc:creator><![CDATA[ Christopher Winslett ]]></dc:creator>
<guid isPermalink="false">154b989b0bceca71d07af0a50e0c670a7cca03cc23a3023377d73f25e9904bee</guid>
<pubDate>Wed, 20 Dec 2023 08:00:00 EST</pubDate>
<dc:date>2023-12-20T13:00:00.000Z</dc:date>
<atom:updated>2023-12-20T13:00:00.000Z</atom:updated></item></channel></rss>