<?xml version="1.0" encoding="UTF-8" ?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" version="2.0"><channel><title>CrunchyData Blog</title>
<atom:link href="https://www.crunchydata.com/blog/topic/fun-with-sql/rss.xml" rel="self" type="application/rss+xml" />
<link>https://www.crunchydata.com/blog/topic/fun-with-sql</link>
<image><url>https://www.crunchydata.com/card.png</url>
<title>CrunchyData Blog</title>
<link>https://www.crunchydata.com/blog/topic/fun-with-sql</link>
<width>800</width>
<height>419</height></image>
<description>PostgreSQL experts from Crunchy Data share advice, performance tips, and guides on successfully running PostgreSQL and Kubernetes solutions</description>
<language>en-us</language>
<pubDate>Fri, 24 Oct 2025 09:00:00 EDT</pubDate>
<dc:date>2025-10-24T13:00:00.000Z</dc:date>
<dc:language>en-us</dc:language>
<sy:updatePeriod>hourly</sy:updatePeriod>
<sy:updateFrequency>1</sy:updateFrequency>
<item><title><![CDATA[ Temporal Joins ]]></title>
<link>https://www.crunchydata.com/blog/temporal-joins</link>
<description><![CDATA[ How do you return the Nth related record from joined tables in PostgreSQL? Did you know that it has a few more options than other databases? Use DISTINCT ON and window functions to solve common temporal join challenges and avoid N+1 query problems. ]]></description>
<content:encoded><![CDATA[ <p>My first thought seeing a temporal join in 2008 was, “Why is this query so complex?” The company I was at relied heavily on database queries, as it was a CRM and student success tracking system for colleges and universities. The query returned a filtered list of users and their last associated record from a second table. The hard part about the query isn’t returning the last timestamp or even performing joins, it’s returning <em>only their last associated record</em> from a second table.<p>Back in 2008, we didn’t have window functions or CTEs, so the query algorithm was a series of nested tables that looked like this:<pre><code class=language-sql>SELECT
    *
FROM users, ( -- find the record for the last second_table by created_at and user_id
                SELECT
                    second_table.*
                FROM second_table, ( -- find the last second_table created_at per user_id
                                        SELECT
                                            user_id,
                                            max(created_at) AS created_at
                                        FROM second_table
                                        GROUP BY 1
                                    ) AS last_second_table_at
                WHERE
                    last_second_table_at.user_id = second_table.user_id
                    AND second_table.created_at = last_second_table_at.created_at
            ) AS last_second_table
WHERE users.id = last_second_table.user_id;
</code></pre><p><strong>See the Sample Code section below for the schema and data to run these queries.</strong><p>But, even that query was wrong because the second table may have records with <strong>duplicate</strong> <code>created_at</code> values. That was the source of a bug back in 2008 that resulted in duplicate rows being listed.<p>Obviously, we weren't using Postgres at the time because there has always been a simpler way to do this in Postgres using <code>DISTINCT ON</code>:<pre><code class=language-sql>SELECT DISTINCT ON (u.id)
    u.id,
    u.name,
    s.created_at AS last_action_time,
    s.action_type
FROM users u
JOIN second_table s ON u.id = s.user_id
ORDER BY u.id, s.created_at DESC, s.id DESC;
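-- With the sample data from the Sample Code section below, this returns one
-- row per user; Alice's tie at 10:00:00 is broken by id DESC (illustrative
-- output, assuming that data set):
--  id |  name   |  last_action_time   |  action_type
-- ----+---------+---------------------+----------------
--   1 | Alice   | 2023-10-01 10:00:00 | page_view
--   2 | Bob     | 2023-10-02 11:00:00 | purchase
--   3 | Charlie | 2023-10-04 13:00:00 | profile_update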
</code></pre><p>Temporal joins require attention to detail.<h2 id=robust-solution-ctes--window-functions><a href=#robust-solution-ctes--window-functions>Robust Solution: CTEs &amp; Window Functions</a></h2><p>Before we go too far into the topic, for those looking for a solution to their current problem, below is how I would write that query today when you need something other than the <strong>first or last</strong> record in a series. For these situations, we use <a href=https://www.crunchydata.com/blog/postgres-subquery-powertools-subqueries-ctes-materialized-views-window-functions-and-lateral>CTEs and window functions</a>, so there's no need to nest queries when we can abstract them for a cleaner purpose. Here is the template for the temporal joins that do not work with <code>DISTINCT ON</code>:<pre><code class=language-sql>WITH max_second_table AS (
    SELECT
        *
    FROM (
            SELECT
                *,
                -- Use the ROW_NUMBER() window function to rank each user's records:
                -- The ORDER BY clause is critical:
                -- 1. ORDER BY created_at DESC finds the latest time.
                -- 2. ORDER BY id DESC serves as a reliable tie-breaker
                ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC, id DESC) AS row_order
            FROM second_table
        ) AS ordered_second_table
    WHERE row_order = 2
)

SELECT
    *
FROM users
LEFT JOIN max_second_table ON users.id = max_second_table.user_id;
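-- Note: row_order = 1 would join each user's latest record instead; any N
-- joins the Nth-latest. Reversing the ORDER BY direction inside OVER (...)
-- selects the Nth-earliest occurrence.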
</code></pre><p>In this example, we are joining the <strong>second occurrence</strong> (<code>WHERE row_order = 2</code>) in the <code>second_table</code> for a user. For the university example, we used these types of queries to report on progress over time by showing the 1st, 2nd, 3rd, and N<sup>th</sup> occurrences of events.<p>Is this actually less code than the first example? No, but it is compartmentalized with a cleaner purpose.<p>Also, introducing the <strong>primary key (<code>id</code>)</strong> in the <code>ORDER BY</code> clause provides the necessary <strong>tie-breaker logic</strong> for the sorting -- that is how we fixed the SQL issue in the opening text.<h2 id=problem-with-orms><a href=#problem-with-orms>Problem with ORMs</a></h2><p>Due to their query complexity, ORMs are generally not capable of handling temporal joins without complex manipulation. The ORM I'm most familiar with is <strong>ActiveRecord</strong>, part of the Ruby on Rails suite. When Rails developers encounter temporal joins, they typically resort to the <strong><a href=https://www.crunchydata.com/blog/postgresql-for-solving-n+1-queries-in-ruby-on-rails>N+1 query pattern</a></strong> from their application code like this:<pre><code class=language-ruby>users = User.all

users.each do |user|
  last_action = user.second_table.last
end
</code></pre><p>If you aren't running this query too frequently or over too many user records, this is generally <em>performant enough</em>. However, this approach becomes <strong>suboptimal</strong> for application performance as the user list grows because each iteration of the loop requires a network hop back and forth with the database and an object initialization in the application. While you can make ActiveRecord do this natively, the resulting code is often harder to read and maintain for the typical use case—a pattern you see in other ORMs as well.<h2 id=sample-code><a href=#sample-code>Sample Code</a></h2><p>Below is the sample SQL you can use to load data into your database to test a few of these queries. Note that <strong>Alice has two actions at the exact same timestamp</strong> to replicate the original bug scenario.<pre><code class=language-sql>CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    name TEXT
);

CREATE TABLE second_table (
    id SERIAL PRIMARY KEY,
    user_id INTEGER REFERENCES users(id),
    action_type TEXT,
    created_at TIMESTAMP WITHOUT TIME ZONE
);

INSERT INTO users (name) VALUES ('Alice'), ('Bob'), ('Charlie');

-- Alice has two actions at the exact same timestamp (The 2008 bug scenario)
INSERT INTO second_table (user_id, action_type, created_at) VALUES
(1, 'login', '2023-10-01 10:00:00'),
(1, 'page_view', '2023-10-01 10:00:00'),
(2, 'purchase', '2023-10-02 11:00:00'),
(3, 'registration', '2023-10-03 12:00:00'),
(3, 'profile_update', '2023-10-04 13:00:00');
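
-- Illustrative check (not from the original post): the 2008-style
-- max(created_at) join matches BOTH of Alice's rows, reproducing the
-- duplicate-row bug described above.
SELECT s.*
FROM second_table s
JOIN (
    SELECT user_id, max(created_at) AS created_at
    FROM second_table
    GROUP BY 1
) m ON m.user_id = s.user_id AND m.created_at = s.created_at
WHERE s.user_id = 1;
-- Returns two rows: 'login' and 'page_view', both at 2023-10-01 10:00:00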
</code></pre><h2 id=conclusion><a href=#conclusion>Conclusion</a></h2><p>The term "temporal join" isn't a common piece of developer jargon, but the underlying pattern, retrieving the N<sup>th</sup> related record, is critical in <a href=https://www.crunchydata.com/blog/window-functions-for-data-analysis-with-postgres>reporting and analytics</a>. It's a known pattern among people who have worked on a code base that relies heavily on SQL capabilities, typically for reporting.<p>Using the PostgreSQL feature <strong><code>DISTINCT ON</code></strong> for the simplest case, or <strong><a href=https://www.crunchydata.com/developers/playground/ctes-and-window-functions>CTEs with Window Functions</a></strong> for complex retrieval, we avoid the bugs of older SQL patterns and eliminate the performance penalty of the N+1 problem.<p>If you would like to learn more about advanced SQL patterns, check out our <a href=https://www.crunchydata.com/developers/playground>Postgres Playground</a>.</p> ]]></content:encoded>
<category><![CDATA[ Fun with SQL ]]></category>
<author><![CDATA[ Christopher.Winslett@crunchydata.com (Christopher Winslett) ]]></author>
<dc:creator><![CDATA[ Christopher Winslett ]]></dc:creator>
<guid isPermalink="false">da09e0153c614a35f860efd24fb3f98f0bff8151300e44ffb6009d17dce4c1b8</guid>
<pubDate>Fri, 24 Oct 2025 09:00:00 EDT</pubDate>
<dc:date>2025-10-24T13:00:00.000Z</dc:date>
<atom:updated>2025-10-24T13:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ Crazy Idea to Postgres in the Browser ]]></title>
<link>https://www.crunchydata.com/blog/crazy-idea-to-postgres-in-the-web-browser</link>
<description><![CDATA[ We've got Postgres running in a web browser and lots of folks were curious how it was built. Joey goes through the prototyping steps. ]]></description>
<content:encoded><![CDATA[ <p>We <a href=https://www.crunchydata.com/blog/learn-postgres-at-the-playground>just launched</a> our <a href=https://www.crunchydata.com/developers/tutorials>Postgres Playground</a>. Running Postgres in the web browser was not exactly commonplace before, so naturally, people are wondering how it works.<p>It actually started as a fun weekend experiment. Here's a screenshot I saved, just moments after recovering from the initial "whoa, it's working!" effect.<p><img alt="Screenshot of PostgreSQL running in the Chrome web browser"loading=lazy src=/blog-assets/browser-pg/screenshot.webp><p>The next morning, I shared this screenshot in our internal Slack channel for web frontend engineering. Our mental gears began to turn as we imagined what might (and might not) be possible. After a bit of work, we built upon some interesting ideas, and this fun weekend hack evolved into what is now the <a href=https://www.crunchydata.com/developers/tutorials>Postgres Playground</a>!<p>This blog post focuses on how to run PostgreSQL in the web browser, but there are other interesting pieces involved in the playground as well. For example, the content in the tutorials actually lives in <a href=https://www.notion.so/>Notion</a> documents. There may be a blog post about that in the near future from one of my colleagues, so stay tuned!<h2 id=why><a href=#why>Why?</a></h2><p>I stumbled upon an interesting <a href=https://wasmer.io/posts/markdown-playgrounds-powered-by-wasm>blog post</a> about how Wasmer has a "Run in Playground" link for some Markdown fenced code blocks for a few of their <a href=https://wapm.io/>WAPM</a> packages. There was one in particular that <em>really</em> got my attention: <a href=https://wapm.io/sqlite/sqlite>SQLite</a>. On that page, there is a fenced code block with some SQL queries inside. 
If you click the "Run in Playground" button, it runs the query right there in the web browser with SQLite compiled to WebAssembly.<p>After running that SQLite query in my browser, I thought, "Can I do this with Postgres?".<p>The modern web browser is a very powerful platform, and this platform's capabilities are constantly increasing. However, WebAssembly still has some growing to do in some areas. After some quick research, I found that the web browser simply does not offer the networking features that Postgres needs. That would seem like a pretty big obstacle.<p><em><strong>However</strong>...</em><p>As I mentioned, the modern web browser is a very powerful platform. Let's just change the target platform to something other than WebAssembly, then run it in WebAssembly anyway like a rebel. 😎<h2 id=virtual-machines-in-the-browser><a href=#virtual-machines-in-the-browser>Virtual machines in the browser</a></h2><p>It's actually possible to emulate a PC <em>inside the web browser</em>! There have been quite a few implementations over the years. Some even started out in JavaScript, years before WebAssembly was an option. Here are a few that I found especially interesting:<table><thead><tr><th>Emulator<th>Architecture<tbody><tr><td><a href=https://github.com/nepx/halfix>Halfix</a><td>x86<tr><td><a href=https://bellard.org/jslinux/>JSLinux</a><td>x86 and RISC-V<tr><td><a href=https://github.com/s-macke/jor1k>jor1k</a><td>OpenRISC 1000<tr><td><a href=https://github.com/copy/v86>v86</a><td>x86<tr><td><a href=https://webvm.io/>WebVM</a><td>x86</table><p>I ended up choosing v86 for this. The author started the project in 2011, and it's still active. Many questions have been answered in the GitHub issues and discussions over the years. I'm definitely not an expert in Linux or virtualization, so being able to search the repo for answers was very helpful.<p>v86's performance also seemed to be among the best, compared to similar open source emulators that run in the browser. 
In early 2021, they merged a <a href=https://github.com/copy/v86/pull/388>Rust port + JIT</a> into the master branch, which provided a significant performance boost over the original JavaScript implementation.<h2 id=build><a href=#build>Build</a></h2><p>For this blog post, we'll use Alpine Linux. It's a lightweight Linux distribution and a very popular base for many Docker images. They also have a version with a slimmed-down kernel, optimized for virtual machines.<h3 id=install-alpine-linux><a href=#install-alpine-linux>Install Alpine Linux</a></h3><p>Note: You'll need to have QEMU installed.<p>Download the Alpine image that is optimized for VMs. We'll need the x86 (not x86_64) build.<pre><code class=language-shell>wget https://dl-cdn.alpinelinux.org/alpine/v3.16/releases/x86/alpine-virt-3.16.0-x86.iso
</code></pre><p>Create a disk image for the VM. I chose my disk size somewhat randomly, so feel free to make it larger if you'd like. You could probably make it smaller, but I'm not sure what the minimum size limit would be here.<pre><code class=language-shell>qemu-img create alpine.img 512M
</code></pre><p>Start the VM<pre><code class=language-shell>qemu-system-x86_64 \
  -m 256M \
  -cdrom alpine-virt-3.16.0-x86.iso \
  -drive file=alpine.img,format=raw
</code></pre><p>Note: Depending on your host machine, you may be able to get a performance boost by appending another argument to the command:<table><thead><tr><th>Host machine<th>Argument<tbody><tr><td>Linux with x86 CPU<td><code>-accel kvm</code><tr><td>macOS with x86 (Intel) CPU<td><code>-accel hvf</code><tr><td>macOS with Apple Silicon<td>None, but <a href=https://www.crunchydata.com/blog/postgresql-benchmarks-apple-arm-m1-macbook-pro-2020 title="Benchmark comparisons: Apple M1 vs. various x86 CPUs">your CPU is a beast anyway</a>.</table><p>When the VM has finished booting, log in as 'root'.<p>Run <code>setup-alpine</code>. For most of the setup questions, you can configure things to your liking, but these are important:<table><thead><tr><th>Question<th>Answer<tbody><tr><td><code>Which disk(s) would you like to use?</code><td><code>sda</code><tr><td><code>How would you like to use it?</code><td><code>sys</code><tr><td><code>WARNING: Erase the above disk(s) and continue?</code><td><code>y</code></table><p>After the installation has finished, use the <code>reboot</code> command to reboot the VM from the virtual hard disk image.<h3 id=install-postgres><a href=#install-postgres>Install Postgres</a></h3><p>After the VM has rebooted, log in as <code>root</code> again. We can now install and initialize Postgres.<pre><code class=language-shell>apk add postgresql --no-cache
/etc/init.d/postgresql setup
/etc/init.d/postgresql start
rc-update add postgresql
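# (Note: rc-update registers the service with OpenRC's default runlevel,
# so Postgres starts automatically when the VM boots.)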
</code></pre><p>Now, a quick smoke test:<pre><code class=language-shell>su - postgres -c 'psql -c "SELECT version();"'
</code></pre><p>If it worked, you should see something like this:<pre><code class=language-text>                                                    version
----------------------------------------------------------------------------------------------------------------
 PostgreSQL 14.4 on i586-alpine-linux-musl, compiled by gcc (Alpine 11.2.1_git20220219) 11.2.1 20220219, 32-bit
(1 row)
</code></pre><p>Note: On my machine, <code>psql</code> opened the version information in <code>less</code>, so I had to press <code>q</code> to exit that.<p>Now we can shut down the VM.<pre><code class=language-shell>poweroff
</code></pre><h2 id=run-the-vm-in-the-web-browser><a href=#run-the-vm-in-the-web-browser>Run the VM in the web browser</a></h2><ol><li>Go to <a href=https://copy.sh/v86/>v86's site</a>.<li>Scroll down to the "Setup" section.<li>For "Hard disk drive image", select the <code>alpine.img</code> file you created earlier.<li>Adjust "Memory size" to your liking. (I used 256 MB)<li>Click the "Start Emulation" button.</ol><p>Note: It will take a bit longer for the VM to boot. Running the VM in the web browser will not be as fast as it was in QEMU.<p>After it has finished booting, log in as root, then open <code>psql</code>:<pre><code class=language-shell>su - postgres -c psql
</code></pre><p>Congratulations! You are now running Postgres in your web browser.<p>Things to keep in mind:<ul><li>There is no internet access from inside the VM.<li>There is no data persistence, so changes are lost when leaving or refreshing the page.</ul> ]]></content:encoded>
<category><![CDATA[ Fun with SQL ]]></category>
<author><![CDATA[ Joey.Mezzacappa@crunchydata.com (Joey Mezzacappa) ]]></author>
<dc:creator><![CDATA[ Joey Mezzacappa ]]></dc:creator>
<guid isPermalink="false">270462b44374c1900301bdb1f6d9c8b050d7f4a97badd5db4ab40fe67fbde964</guid>
<pubDate>Wed, 24 Aug 2022 11:00:00 EDT</pubDate>
<dc:date>2022-08-24T15:00:00.000Z</dc:date>
<atom:updated>2022-08-24T15:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ Rise of the Anti-Join ]]></title>
<link>https://www.crunchydata.com/blog/rise-of-the-anti-join</link>
<description><![CDATA[ Find me all the things in set "A" that are not in set "B". Paul has some suggestions of when to use the anti-join pattern in queries with some impressive results. ]]></description>
<content:encoded><![CDATA[ <p>Find me all the things in set "A" that are not in set "B".<p><img alt="diagram of foreign key references between tables"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/4cff80f1-0c32-4d14-d8ec-2307383d3c00/public><p>This is a pretty common query pattern, and it occurs in both non-spatial and spatial situations. As usual, there are multiple ways to express this query in SQL, but only a couple queries will result in the best possible performance.<h2 id=setup><a href=#setup>Setup</a></h2><p>The non-spatial setup starts with two tables with the numbers 1 to 1,000,000 in them, then deletes two records from one of the tables.<pre><code class=language-pgsql>CREATE TABLE a AS SELECT generate_series(1,1000000) AS i
ORDER BY random();
CREATE INDEX a_x ON a (i);

CREATE TABLE b AS SELECT generate_series(1,1000000) AS i
ORDER BY random();
CREATE INDEX b_x ON b (i);

DELETE FROM b WHERE i = 444444;
DELETE FROM b WHERE i = 222222;

ANALYZE;
</code></pre><p>The spatial setup is a 2M record table of geographic names, and a 3K record table of county boundaries. Most of the geonames are inside counties (because we tend to name things on land) but some of them are not (because sometimes we name things in the ocean, or our boundaries are not detailed enough).<p><img alt="diagram of new york city with dots at points that have geographic names"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/be1cb873-7bd0-4f15-2649-6d8721634b00/public><h2 id=subqueries-no-><a href=#subqueries-no->Subqueries? No.</a></h2><p>Since the problem statement includes the words "not in", this form of the query seems superficially plausible:<pre><code class=language-pgsql>SELECT i
  FROM a
  WHERE i NOT IN (SELECT i FROM b);
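-- Aside: NOT IN also has surprising NULL semantics; if any b.i were NULL,
-- this query would return zero rows, since "i NOT IN (..., NULL)" can
-- never evaluate to true.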
</code></pre><p>Perfect! Give me everything from "A" that is not in "B"! Just what we want? In fact, running the query takes so long that I never got it to complete. The explain gives some hints.<pre><code class=language-bash>                                  QUERY PLAN
------------------------------------------------------------------------------
 Gather  (cost=1000.00..5381720008.33 rows=500000 width=4)
   Workers Planned: 2
   ->  Parallel Seq Scan on a  (cost=0.00..5381669008.33 rows=208333 width=4)
         Filter: (NOT (SubPlan 1))
         SubPlan 1
           ->  Materialize  (cost=0.00..23331.97 rows=999998 width=4)
                 ->  Seq Scan on b  (cost=0.00..14424.98 rows=999998 width=4)
</code></pre><p>Note that the subquery ends up materializing the whole second table into memory, where it is scanned over and over and over to test each key in table "A". Not good.<h2 id=except-maybe-><a href=#except-maybe->Except? Maybe.</a></h2><p>PostgreSQL supports some <a href=https://www.postgresql.org/docs/current/queries-union.html>set-based keywords</a> that allow you to find logical combinations of queries: <code>UNION</code>, <code>INTERSECT</code> and <code>EXCEPT</code>.<p>Here, we can make use of <code>EXCEPT</code>.<pre><code class=language-pgsql>SELECT a.i FROM a
EXCEPT
SELECT b.i FROM b;
</code></pre><p>The SQL still matches our mental model of the problem statement: everything in "A" <strong>except</strong> for everything in "B".<p>And it returns a correct answer in about <strong>2.3 seconds</strong>.<pre><code class=language-pgsql>   i
--------
 222222
 444444
(2 rows)
</code></pre><p>The query plan is interesting: the two tables are appended, then sorted to find duplicates, and then only the non-dupes are emitted!<pre><code class=language-bash>                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 SetOp Except  (cost=322856.41..332856.40 rows=1000000 width=8)
   ->  Sort  (cost=322856.41..327856.41 rows=1999998 width=8)
         Sort Key: "*SELECT* 1".i
         ->  Append  (cost=0.00..58849.95 rows=1999998 width=8)
               ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..24425.00 rows=1000000 width=8)
                     ->  Seq Scan on a  (cost=0.00..14425.00 rows=1000000 width=4)
               ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..24424.96 rows=999998 width=8)
                     ->  Seq Scan on b  (cost=0.00..14424.98 rows=999998 width=4)
</code></pre><p>It's a big hammer, but it works.<h2 id=anti-join-yes-><a href=#anti-join-yes->Anti-Join? Yes.</a></h2><p>The best approach is the "anti-join". One way to express an anti-join is with a special "<a href=https://www.postgresql.org/docs/current/functions-subquery.html>correlated subquery</a>" syntax:<pre><code class=language-pgsql>SELECT a.i
  FROM a
  WHERE NOT EXISTS
    (SELECT b.i FROM b WHERE a.i = b.i);
</code></pre><p>So this returns rows from "A" only where the correlated subquery against "B" returns no records.<p>It takes about <strong>850 ms</strong> on my test laptop, so <strong>3 times</strong> faster than using <code>EXCEPT</code> in this test, and gets the right answer. The query plan looks like this:<pre><code class=language-bash>                                     QUERY PLAN
------------------------------------------------------------------------------------
 Gather  (cost=16427.98..31466.36 rows=2 width=4)
   Workers Planned: 2
   ->  Parallel Hash Anti Join  (cost=15427.98..30466.16 rows=1 width=4)
         Hash Cond: (a.i = b.i)
         ->  Parallel Seq Scan on a  (cost=0.00..8591.67 rows=416667 width=4)
         ->  Parallel Hash  (cost=8591.66..8591.66 rows=416666 width=4)
               ->  Parallel Seq Scan on b  (cost=0.00..8591.66 rows=416666 width=4)
</code></pre><p>The same sentiment can be expressed without the <code>NOT EXISTS</code> construct, using only basic SQL and a <code>LEFT JOIN</code>:<pre><code class=language-pgsql>SELECT a.i
  FROM a
  LEFT JOIN b ON (a.i = b.i)
  WHERE b.i IS NULL;
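-- b.i is the join key and never NULL in "b" itself, so "b.i IS NULL" can
-- only be true when the LEFT JOIN found no matching row.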
</code></pre><p>This also takes about <strong>850 ms</strong>.<p>The <code>LEFT JOIN</code> is required to return a record for every row of "A". So what does it do if there's no record in "B" that satisfies the join condition? It returns <code>NULL</code> for the columns of "B" in the join relation for those records. That means any row with a <code>NULL</code> in a column of "B" that is normally non-<code>NULL</code> is a record in "A" that is not in "B".<h2 id=now-do-spatial><a href=#now-do-spatial>Now do Spatial</a></h2><p>The nice thing about the <code>LEFT JOIN</code> expression of the solution is that it generalizes nicely to arbitrary join conditions, like those using spatial predicate functions.<p>"Find the geonames points that are not inside counties"... OK, we will <code>LEFT JOIN</code> geonames with counties and find the records where counties are <code>NULL</code>.<pre><code class=language-pgsql>SELECT g.name, g.geonameid, g.geom
  FROM geonames g
  LEFT JOIN counties c
    ON ST_Contains(c.geom, g.geom)
  WHERE c.geom IS NULL;
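</code></pre><p>The same spatial anti-join can equivalently be phrased with <code>NOT EXISTS</code>; this sketch (not timed here) should return the same rows:<pre><code class=language-pgsql>SELECT g.name, g.geonameid, g.geom
  FROM geonames g
  WHERE NOT EXISTS (
    SELECT 1
      FROM counties c
      WHERE ST_Contains(c.geom, g.geom));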
</code></pre><p>The answer pops out in about a minute.<p><img alt="a map of new york city with dots for geographic names and the ones in the water are pink"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/b3a33f81-dc16-470f-efd3-00552eae3400/public><p>Unsurprisingly, that's about how long a standard inner join takes to associate the 2M geonames with the 3K counties, since the anti-join has to do about that much work to determine which records do not match the join condition.<h2 id=conclusion><a href=#conclusion>Conclusion</a></h2><ul><li>"Find the things in A that aren't in B" is a common use pattern.<li>The "obvious" SQL patterns might not be the most efficient ones.<li>The <code>WHERE NOT EXISTS</code> and <code>LEFT JOIN</code> patterns both result in the most efficient query plans and executions.</ul> ]]></content:encoded>
<category><![CDATA[ Spatial ]]></category>
<category><![CDATA[ Fun with SQL ]]></category>
<author><![CDATA[ Paul.Ramsey@crunchydata.com (Paul Ramsey) ]]></author>
<dc:creator><![CDATA[ Paul Ramsey ]]></dc:creator>
<guid isPermalink="false">22da035d8b818ddd351b8b7e6a7ec842186e3d2168298c1765348790f973c704</guid>
<pubDate>Mon, 15 Aug 2022 08:00:00 EDT</pubDate>
<dc:date>2022-08-15T12:00:00.000Z</dc:date>
<atom:updated>2022-08-15T12:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ Quick and Easy Postgres Data Compare ]]></title>
<link>https://www.crunchydata.com/blog/quick-and-easy-postgres-data-compare</link>
<description><![CDATA[ Brian offers a quick and efficient way to compare data between two different Postgres sources using the Postgres Foreign Data Wrapper (postgres_fdw). ]]></description>
<content:encoded><![CDATA[ <p>If you're checking archives or working with Postgres replication, data reconciliation can be a necessary task. Row counts are one of the go-to comparison methods, but counts alone do not reveal data mismatches. You could pull table data across the network and then compare each row and each field, but that is demanding on resources. Today we'll walk through a simple solution for your Postgres toolbox: using Foreign Data Wrappers to connect and compare the two source datasets. With the foreign data wrapper and a little SQL magic, we can compare data quickly and easily.<h2 id=creating-environments><a href=#creating-environments>Creating Environments</a></h2><p>To keep the environment simple enough to practice even with limited resources, we will use a single PostgreSQL cluster with two databases (<code>hrprod</code>, <code>hrreport</code>) connected via the PostgreSQL Foreign Data Wrapper. The simulation here is a production database (<code>hrprod</code>) with a reporting database (<code>hrreport</code>). Keep in mind that the source and target do not have to be within the same PostgreSQL cluster.<p>To speed up creating the environment, <a href=https://www.crunchydata.com/products/crunchy-postgresql-for-kubernetes>Crunchy Postgres for Kubernetes</a> was used and a simple PostgreSQL cluster deployed using the <a href=https://github.com/CrunchyData/postgres-operator-examples>Postgres Operator Examples</a> repository.<p>The remaining sections show only the commands performed within <code>psql</code> from the database containers.<h3 id=production-setup-hrprod><a href=#production-setup-hrprod>Production Setup (hrprod)</a></h3><p>The steps to create the simulated production database are simple: create the database, create the <code>postgres_fdw</code> extension, create the <code>employee</code> table and lastly populate the <code>employee</code> table with three rows of data.<pre><code class=language-pgsql>postgres=> create database hrprod;
CREATE DATABASE

postgres=> \c hrprod
You are now connected to database "hrprod" as user "postgres".

hrprod=> create extension postgres_fdw;
CREATE EXTENSION

hrprod=> create table employee (id int, first_name varchar(50), last_name varchar(50), department varchar(20));
CREATE TABLE

hrprod=> insert into employee (id, first_name, last_name, department) values (1,'John','Smith','explorer'),(2,'George','Washington','government'),(3,'Thomas','Edison','inventor');
INSERT 0 3
</code></pre><h3 id=reporting-setup-hrreport><a href=#reporting-setup-hrreport>Reporting Setup (hrreport)</a></h3><p>The steps are then repeated to create the simulated reporting database.<pre><code class=language-pgsql>postgres=> create database hrreport;
CREATE DATABASE

postgres=> \c hrreport
You are now connected to database "hrreport" as user "postgres".

hrreport=> create extension postgres_fdw;
CREATE EXTENSION

hrreport=> create table employee (id int, first_name varchar(50), last_name varchar(50), department varchar(20));
CREATE TABLE

hrreport=> insert into employee (id, first_name, last_name, department) values (1,'John','Smith','explorer'),(2,'George','Washington','government'),(3,'Thomas','Edison','inventor');
INSERT 0 3
</code></pre><p>With this, the setup is complete and the data in the <code>employee</code> table matches in both databases.<h2 id=data-compare><a href=#data-compare>Data Compare</a></h2><p>The compare will be performed from the reporting database side (<code>hrreport</code>). To start, a working table named <code>data_compare</code> is created. The <code>data_compare</code> table stores three pieces of information:<ul><li><code>source_name</code> column that identifies where the data came from (<code>hrprod</code> or <code>hrreport</code> in this example).<li><code>id</code> column that will store the value(s) of the primary key from the table.<li><code>hash_value</code> column that stores the hash value of all the non-key fields in the table.</ul><p>Note that if the table has a composite key, the <code>id</code> column would be populated by joining the values into a single string. The hash occurs on the source side and only the hashed value is used for the comparison, greatly reducing network traffic, transfer time, etc.<h3 id=setup-data-compare><a href=#setup-data-compare>Setup Data Compare</a></h3><p>Create the <code>data_compare</code> table in both the production (<code>hrprod</code>) and target (<code>hrreport</code>) databases.<pre><code class=language-pgsql>hrreport=> \c hrprod
You are now connected to database "hrprod" as user "postgres".

hrprod=> CREATE TABLE data_compare
        (source_name VARCHAR(140),
	    id VARCHAR(1000),
	    hash_value varchar(100)
        );
CREATE TABLE

hrprod=> \c hrreport
You are now connected to database "hrreport" as user "postgres".

hrreport=> CREATE TABLE data_compare
        (source_name VARCHAR(140),
	    id VARCHAR(1000),
	    hash_value varchar(100)
        );
CREATE TABLE
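</code></pre><p>As noted earlier, a composite key would be flattened into the single <code>id</code> column by joining the key values into one string. A sketch of what that might look like, assuming a hypothetical table keyed on <code>(company_id, emp_id)</code>:<pre><code class=language-pgsql>-- hypothetical composite-key example: concatenate the key values for id,
-- hash only the non-key fields
SELECT concat_ws('|', company_id::text, emp_id::text) id,
       md5(concat_ws('|', first_name, last_name, department)) hash_value
  FROM employee_multi e;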
</code></pre><p>An <code>INSERT</code> statement will be executed on both the source and target to populate the <code>data_compare</code> table, and then the contents of the tables are compared to identify differences. To reduce time and transfer for multiple compare passes, the <code>data_compare</code> table contents can be transferred via the foreign table or <code>pg_dump</code>, etc.<p>The following steps were used to create the foreign table.<pre><code class=language-pgsql>hrreport=> CREATE SERVER hrprod FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host 'localhost', dbname 'hrprod', port '5432');
CREATE SERVER

hrreport=> CREATE USER MAPPING FOR current_user SERVER hrprod options (user 'postgres', password 'welcome1');
CREATE USER MAPPING

hrreport=> CREATE FOREIGN TABLE hrprod_data_compare (source_name varchar(140), id varchar(1000), hash_value varchar(100)) SERVER hrprod OPTIONS (table_name 'data_compare');
CREATE FOREIGN TABLE
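</code></pre><p>As an alternative to spelling out the foreign table's columns by hand, <code>postgres_fdw</code> also supports <code>IMPORT FOREIGN SCHEMA</code>, which copies the definition from the remote side. A sketch, assuming a local schema named <code>staging</code> exists to hold it (the imported table keeps its remote name, so it cannot land in the same schema as the local <code>data_compare</code>):<pre><code class=language-pgsql>IMPORT FOREIGN SCHEMA public LIMIT TO (data_compare)
  FROM SERVER hrprod INTO staging;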

</code></pre><h3 id=perform-initial-compare><a href=#perform-initial-compare>Perform Initial Compare</a></h3><p>Populate the <code>data_compare</code> table in both the source (<code>hrprod</code>) and target (<code>hrreport</code>) databases.<pre><code class=language-pgsql>hrprod=> INSERT INTO data_compare (source_name, id, hash_value)
  (SELECT 'hrprod' source_name, id::text, md5(concat_ws('|',first_name, last_name, department)) hash_value FROM employee e);
INSERT 0 3


hrreport=> INSERT INTO data_compare (source_name, id, hash_value)
  (SELECT 'hrreport' source_name, id::text, md5(concat_ws('|',first_name, last_name, department)) hash_value FROM employee e);
INSERT 0 3
</code></pre><p>At this point we know that the data is exactly the same so let's look at the SQL that is used to perform the actual comparison.<pre><code class=language-pgsql>hrreport=> SELECT COALESCE(s.id,t.id) id,
              s.hash_value source_hash_value, t.hash_value target_hash_value,
              CASE WHEN s.hash_value = t.hash_value THEN 'equal'
                    WHEN s.id IS NULL THEN 'row not on source'
                    WHEN t.id IS NULL THEN 'row not on target'
                    ELSE 'difference'
              END compare_result
        FROM hrprod_data_compare s
            FULL JOIN data_compare t ON s.id=t.id;


 id |        source_hash_value         |        target_hash_value         |  compare_result
----+----------------------------------+----------------------------------+-------------------
 1  | 681c37a127083d90164a9f04b5f92759 | 681c37a127083d90164a9f04b5f92759 | equal
 2  | 6e181f686815319daa07c5e0e1ddcd27 | 6e181f686815319daa07c5e0e1ddcd27 | equal
 3  | 4d4eba0d792cb227d247a3b0f9f66979 | 4d4eba0d792cb227d247a3b0f9f66979 | equal
(3 rows)
</code></pre><p>The <code>compare_result</code> confirms that the two sets of data are equal. An alternate compare SQL is included at the end of this article to show various ways the data can be compared when the two <code>data_compare</code> tables are combined.<h3 id=create-an-out-of-sync-condition-and-compare><a href=#create-an-out-of-sync-condition-and-compare>Create an Out-Of-Sync Condition and Compare</a></h3><p>At this stage, three rows exist in the table and the data matches.<pre><code class=language-pgsql>hrprod=> SELECT * FROM employee;
 id | first_name | last_name  | department
----+------------+------------+------------
  1 | John       | Smith      | explorer
  2 | George     | Washington | government
  3 | Thomas     | Edison     | inventor
(3 rows)
</code></pre><p>To create the out-of-sync condition, the following changes will be performed:<ul><li>In <code>hrprod</code>, add CS Lewis with id 4, Charles Babbage with id 5, Blaise Pascal with id 6.<li>In <code>hrreport</code>, add Charles Babbage with id 4, CS Lewis with id 5, Kenny Rogers with id 7.</ul><p>Notice that the ids for CS Lewis and Charles Babbage have been swapped and a unique record added to each database (Blaise Pascal to <code>hrprod</code> and Kenny Rogers to <code>hrreport</code>). The compare should show that 3 rows match, 2 rows have differences and 2 rows are in one database but not the other.<p>Up first, changes to source (<code>hrprod</code>).<pre><code class=language-pgsql>hrprod=> INSERT INTO employee (id, first_name, last_name, department)
        VALUES (4,'CS','Lewis','author'),(5,'Charles','Babbage','math'),(6,'Blaise','Pascal','math');
INSERT 0 3

hrprod=> SELECT * FROM employee ORDER BY id;
 id | first_name | last_name  | department
----+------------+------------+------------
  1 | John       | Smith      | explorer
  2 | George     | Washington | government
  3 | Thomas     | Edison     | inventor
  4 | CS         | Lewis      | author
  5 | Charles    | Babbage    | math
  6 | Blaise     | Pascal     | math
(6 rows)
</code></pre><p>Now the changes to the target (<code>hrreport</code>).<pre><code class=language-pgsql>hrreport=> INSERT INTO employee (id, first_name, last_name, department)
        VALUES (5,'CS','Lewis','author'),(4,'Charles','Babbage','math'),(7,'Kenny','Rogers','music');
INSERT 0 3

hrreport=> SELECT * FROM employee ORDER BY id;
 id | first_name | last_name  | department
----+------------+------------+------------
  1 | John       | Smith      | explorer
  2 | George     | Washington | government
  3 | Thomas     | Edison     | inventor
  4 | Charles    | Babbage    | math
  5 | CS         | Lewis      | author
  7 | Kenny      | Rogers     | music
(6 rows)
</code></pre><p>To summarize the current state:<ul><li>Three rows that match (id=1, 2, 3)<li>Two rows that do not match (id=4, id=5)<li>Two rows that exist in one but not the other (id=6, id=7)</ul><p>Let's now clear the <code>data_compare</code> tables and perform the compare again.<pre><code class=language-pgsql>postgres=> \c hrprod
You are now connected to database "hrprod" as user "postgres".

hrprod=> DELETE FROM data_compare;
DELETE 3

hrprod=> INSERT INTO data_compare (source_name, id, hash_value)
  (SELECT 'hrprod' source_name, id::text id, md5(concat_ws('|',first_name, last_name, department)) hash_value FROM employee e);
INSERT 0 6

hrprod=> \c hrreport
You are now connected to database "hrreport" as user "postgres".

hrreport=> DELETE FROM data_compare;
DELETE 3

hrreport=> INSERT INTO data_compare (source_name, id, hash_value)
  (SELECT 'hrreport' source_name, id::text id, md5(concat_ws('|',first_name, last_name, department)) hash_value FROM employee e);
INSERT 0 6
</code></pre><p>Now for the compare and the results.<pre><code class=language-pgsql>hrreport=> SELECT COALESCE(s.id,t.id) id,
              s.hash_value source_hash_value, t.hash_value target_hash_value,
              CASE WHEN s.hash_value = t.hash_value THEN 'equal'
                    WHEN s.id IS NULL THEN 'row not on source'
                    WHEN t.id IS NULL THEN 'row not on target'
                    ELSE 'difference'
              END compare_result
        FROM hrprod_data_compare s
            FULL JOIN data_compare t ON s.id=t.id;


 id |        source_hash_value         |        target_hash_value         |  compare_result
----+----------------------------------+----------------------------------+-------------------
 1  | 681c37a127083d90164a9f04b5f92759 | 681c37a127083d90164a9f04b5f92759 | equal
 2  | 6e181f686815319daa07c5e0e1ddcd27 | 6e181f686815319daa07c5e0e1ddcd27 | equal
 3  | 4d4eba0d792cb227d247a3b0f9f66979 | 4d4eba0d792cb227d247a3b0f9f66979 | equal
 4  | bbee9d6cccbeac4e9125ec78507c4eb7 | 57acef6ed228a52b8c42f0a6c155e62b | difference
 5  | 57acef6ed228a52b8c42f0a6c155e62b | bbee9d6cccbeac4e9125ec78507c4eb7 | difference
 6  | 047742fb256df0b78cebc3fbbc3ca4ad |                                  | row not on target
 7  |                                  | 66e5e35673780bd392d2f81d589fbb52 | row not on source
 (7 rows)
</code></pre><p>The above output indicates that rows with ids 1 through 3 exist in both databases and their contents match. Rows with ids 4 and 5 exist in each database, but the contents of those rows differ. Going a step further, one can see that the two hash values are the same, just associated with the wrong ids, consistent with the swap performed above. The row with id 6 exists only on the source (<code>hrprod</code>), while the row with id 7 exists only on the target (<code>hrreport</code>). In total, there are 4 rows out of sync.<p>With the rows identified, the proper steps can be taken to sync them. One last thought: imagine for a moment that logical replication was in place between the two databases and changes were pending on the target due to lag. The INSERT into <code>data_compare</code> could be performed only on the rows flagged as out of sync to verify just those rows once the replication lag is gone.<h2 id=conclusion><a href=#conclusion>Conclusion</a></h2><p>Comparing data can be a monumental task. However, this little trick has come in handy over the years when expensive data compare software packages were not an option. There is still room for some creativity with the compare SQL to meet the exact needs of the compare. For example, only show rows that are missing from one side or the other.<p>Alternate Compare SQL:<pre><code class=language-pgsql>SELECT id, hash_value,
       count(src1) src1,
       count(src2) src2
 FROM
     ( SELECT a.*,
              1 src1,
              null src2
        FROM data_compare a
        WHERE source_name='hrprod'
        UNION ALL
        SELECT b.*,
               null src1,
               2 src2
        FROM data_compare b
        WHERE source_name='hrreport'
    ) c
 GROUP BY id, hash_value
 HAVING count(src1) <> count(src2);
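</code></pre><p>For instance, to show only the rows missing entirely from one side, group the combined <code>data_compare</code> contents by id and keep the ids that appear under a single source (a sketch, assuming both sources' rows are loaded into the one table as above):<pre><code class=language-pgsql>-- ids present under only one source_name are missing from the other side
SELECT id, min(source_name) present_only_in
  FROM data_compare
  GROUP BY id
  HAVING count(*) = 1;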
</code></pre><p>So by setting up <code>postgres_fdw</code>, hashing the non-key fields, and writing a SQL query to see if any rows are different, you can do a quick and simple Postgres data comparison. Have another solution you like for data compare? Let us know at @crunchydata. ]]></content:encoded>
<category><![CDATA[ Fun with SQL ]]></category>
<author><![CDATA[ Brian.Pace@crunchydata.com (Brian Pace) ]]></author>
<dc:creator><![CDATA[ Brian Pace ]]></dc:creator>
<guid isPermalink="false">fa294ef44d5f7c11b83007090c7eaec3f34f9a685430ae518306e3ca35d5fb1e</guid>
<pubDate>Wed, 15 Jun 2022 11:00:00 EDT</pubDate>
<dc:date>2022-06-15T15:00:00.000Z</dc:date>
<atom:updated>2022-06-15T15:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ Parquet and Postgres in the Data Lake ]]></title>
<link>https://www.crunchydata.com/blog/parquet-and-postgres-in-the-data-lake</link>
<description><![CDATA[ Have too much static data? Paul has an idea: move some of it to a data lake. He provides a walk-through of setting up Parquet with Postgres using the parquet_fdw. ]]></description>
<content:encoded><![CDATA[ <style>
    .black-box {
        background-color: black;
        color: white;
        padding: 20px;
        text-align: left;
        align-items: left;
        margin: 20px auto;
        border-radius: 10px;
        width: auto;
        height: auto;
    }
    .black-box a {
        color: white;
        text-decoration: underline;
    }
</style> <div class="black-box">
Interested in Spatial analytics? You can now connect Postgres and PostGIS to CSV, JSON, Parquet / GeoParquet, Iceberg, and more with <a href="https://www.crunchydata.com/products/warehouse">Crunchy Data Warehouse</a>.
</div><h2 id=static-data-is-different><a href=#static-data-is-different>Static Data is Different</a></h2><p>A couple weeks ago, I came across a <a href=https://retool.com/blog/how-we-upgraded-postgresql-database/>blog from Retool</a> on their experience migrating a 4TB database. They put in place some good procedures and managed a successful migration, but the whole experience was complicated by the size of the database. The size of the database was the result of a couple of very large "logging" tables: an edit log and an audit log.<p><img alt="diagram showing edit tables being small and audit tables being large"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/bef07a79-72fe-4779-e9c1-747c302de400/public><p>The thing about log tables is, they don't change much. They are append-only by design. They are also queried fairly irregularly, and the queries are often time ranged: "tell me what happened then" or "show me the activities between these dates".<p>So, one way the Retool migration could have been easier is if their log tables were constructed as time-ranged partitions. That way there'd only be one "live" table in the partition set (the one with the recent entries) and a larger collection of historical tables. The migration could move the live partition as part of the critical path, and do all the historical partitions later.<p><img alt="moving partitions to parquet with postgres"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/84ac7f95-df8a-4d7d-1bc2-8600f0007100/public><p>Even after breaking up the log tables into manageable chunks, they still remain, in aggregate, pretty big! 
The PostgreSQL <a href=https://www.postgresql.org/docs/current/ddl-partitioning.html>documentation on partitioning</a> has some harsh opinions about stale data living at the end of a partition collection:<blockquote><p>The simplest option for removing old data is to drop the partition that is no longer necessary.</blockquote><p>There's something to that! All those old historical records just fluff up your base backups, and maybe you almost never have occasion to query it.<p>Is there an alternative to dropping the tables?<h2 id=dump-your-data-in-the-lake><a href=#dump-your-data-in-the-lake>Dump Your Data in the Lake</a></h2><p>What if there was a storage option that was still durable, allowed access via multiple query tools, and could integrate transparently into your operational transactional database?<p><img alt="image of a mountain lake with mountains behind it"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/be0d5438-1a47-4158-cf7e-f0659fe77f00/public><p>How about: storing the static data in <a href=https://www.upsolver.com/blog/apache-parquet-why-use>Parquet format</a> but retaining database access to the data via the <a href=https://github.com/adjust/parquet_fdw/>parquet_fdw</a>?<p><img alt="show the different types of data: application, warm log tables in Postgres, cold logs in Parquet"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/6a09daa9-6e72-46e9-ac6e-9306f27a4700/public><p>Sounds a bit crazy, but:<ul><li>A foreign parquet table can participate in a partition along with a native PostgreSQL table.<li>A parquet file can also be consumed by <a href=https://arrow.apache.org/docs/r/reference/read_parquet.html>R</a>, <a href=https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_parquet.html>Python</a>, <a href=https://github.com/xitongsys/parquet-go>Go</a> and a host of cloud applications.<li>Modern PostgreSQL (14+) can parallelize access to foreign tables, so even collections of Parquet files can be scanned 
effectively.<li>Parquet stores data compressed, so you can get way more raw data into less storage.</ul><h2 id=wait-parquet><a href=#wait-parquet>Wait, Parquet?</a></h2><p>Parquet is a language-independent storage format, designed for online analytics, so:<ul><li>Column oriented<li>Typed<li>Binary<li>Compressed</ul><p>A standard table in PostgreSQL will be row-oriented on disk.<p><img alt="diagram of a row-oriented table with 3 columns"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/a76b0bbc-f087-49f8-bd13-114d45b54a00/public><p>This layout is good for things PostgreSQL is expected to do, like query, insert, update and delete data a "few" records at a time. (The value of "a few" can run into the hundreds of thousands or millions, depending on the operation.)<p>A Parquet file stores data column-oriented on the disk, in batches called "row groups".<p><img alt="diagram showing how parquet stores column-oriented files on disk"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/ec88fe38-72af-4ad9-fdd8-9a7306f08c00/public><p>You can see where the Parquet format gets its name: the data is grouped into little squares, like a parquet floor. One of the advantages of grouping data together, is that compression routines tend to work better on data of the same type, and even more so when the data elements have the same values.<h2 id=does-this-even-work><a href=#does-this-even-work>Does This Even Work?</a></h2><p>In a word "yes", but with some caveats: Parquet has been around for several years, but the ecosystem supporting it is still, relatively, in flux. 
New releases of the underlying C++ libraries are still coming out regularly, the <a href=https://github.com/adjust/parquet_fdw>parquet_fdw</a> is only a couple years old, and so on.<p>However, I was able to demonstrate to my own satisfaction that things were baked enough to be interesting.<h3 id=loading-data><a href=#loading-data>Loading Data</a></h3><p>I started with a handy data table of Philadelphia parking infractions, that I used in a previous <a href=https://www.crunchydata.com/blog/performance-and-spatial-joins>blog post on spatial joins</a>, and sorted the file by date of infraction, <code>issue_datetime</code>.<p><img alt="flow diagram of data going from CSV to Postgres to Parquet file"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/8f6ac3e3-8038-4c9e-28f0-ddce41938900/public></p><details><summary>Data Download and Sort</summary><pre><code class=language-bash>#
# Download Philadelphia parking infraction data
#
curl "https://phl.carto.com/api/v2/sql?filename=parking_violations&format=csv&skipfields=cartodb_id,the_geom,the_geom_webmercator&q=SELECT%20*%20FROM%20parking_violations%20WHERE%20issue_datetime%20%3E=%20%272012-01-01%27%20AND%20issue_datetime%20%3C%20%272017-12-31%27" > phl_parking_raw.csv

#
# Sort it
#
sort -k2 -t, phl_parking_raw.csv > phl_parking.csv
</code></pre><p>Sorting the data by <code>issue_datetime</code> will make queries that filter against that column go faster in the column-oriented Parquet setup.</p></details><pre><code class=language-pgsql>-- Create parking infractions table
CREATE TABLE phl_parking (
    anon_ticket_number integer,
    issue_datetime timestamptz,
    state text,
    anon_plate_id integer,
    division text,
    location text,
    violation_desc text,
    fine float8,
    issuing_agency text,
    lat float8,
    lon float8,
    gps boolean,
    zip_code text
    );

-- Read in the parking data
\copy phl_parking FROM 'phl_parking.csv' WITH (FORMAT csv, HEADER true);
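</code></pre><p>A quick count verifies the load (the exact number will depend on the date range downloaded):<pre><code class=language-pgsql>SELECT Count(*) FROM phl_parking;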
</code></pre><p>OK, so now I have an 8M record data table, good for some bulk data experiments. How big is the table?<pre><code class=language-pgsql>SELECT pg_size_pretty(pg_relation_size('phl_parking')) AS pg_table_size;
</code></pre><pre><code class=language-txt> pg_table_size
----------------
 1099 MB
</code></pre><p>Just over 1GB!<h3 id=generating-parquet><a href=#generating-parquet>Generating Parquet</a></h3><p>How do I get a Parquet file?<p>This turns out to be way harder than I expected. Most internet advice was around using Python or Spark to convert CSV files into Parquet. In the end, I used the very new (currently unreleased, coming in GDAL 3.5) <a href=https://gdal.org/drivers/vector/parquet.html>support for Parquet in the GDAL library</a>, and the <code>ogr2ogr</code> command to do the conversion.<pre><code class=language-bash>ogr2ogr -f Parquet \
  /tmp/phl_parking.parquet \
  PG:"dbname=phl host=localhost" \
  phl_parking
</code></pre><p>For these tests the Parquet file will reside on my local disk in <code>/tmp</code>, though for cloud purposes it might reside on a cloud volume, or even (with the <a href=https://github.com/pgspider/parquet_s3_fdw>right software</a>) in an object store.<pre><code class=language-shell>$ ls -lh /tmp/phl_parking.parquet
-rw-r--r--  1 pramsey  wheel   216M 29 Apr 10:44 /tmp/phl_parking.parquet
</code></pre><p>Thanks to data compression, the Parquet file is 20% the size of the database table!<h3 id=querying-parquet><a href=#querying-parquet>Querying Parquet</a></h3><p>Querying Parquet in PostgreSQL involves a number of parts, which can be challenging to build right now.<ul><li><a href=https://arrow.apache.org/install/>Apache libarrow</a>, built with Parquet support enabled.<li><a href=https://github.com/adjust/parquet_fdw/>parquet_fdw</a> itself.</ul><p>Note that <code>parquet_fdw</code> requires <code>libarrow</code> version 6, not the recently released version 7.<p>Once the FDW and supporting libraries are built, though, everything works just like other FDW extensions.<pre><code class=language-pgsql>CREATE EXTENSION parquet_fdw;

CREATE SERVER parquet_srv FOREIGN DATA WRAPPER parquet_fdw;

CREATE FOREIGN TABLE phl_parking_pq (
    anon_ticket_number integer,
    issue_datetime     timestamptz,
    state              text,
    anon_plate_id      integer,
    division           text,
    location           text,
    violation_desc     text,
    fine               float8,
    issuing_agency     text,
    lat                float8,
    lon                float8,
    gps                boolean,
    zip_code           text
    )
  SERVER parquet_srv
  OPTIONS (filename '/tmp/phl_parking.parquet',
           sorted 'issue_datetime',
           use_threads 'true');
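</code></pre><p>This is also where the earlier claim about partitioning comes in: a foreign Parquet table can be attached as one partition of a partitioned table, right next to native partitions. A hypothetical sketch (table names invented, not benchmarked here):<pre><code class=language-pgsql>CREATE TABLE phl_parking_all (LIKE phl_parking)
  PARTITION BY RANGE (issue_datetime);

-- recent, still-changing rows live in a native partition
CREATE TABLE phl_parking_2018
  PARTITION OF phl_parking_all
  FOR VALUES FROM ('2018-01-01') TO ('2019-01-01');

-- cold historical rows are served straight from the Parquet file
CREATE FOREIGN TABLE phl_parking_2012_2017
  PARTITION OF phl_parking_all
  FOR VALUES FROM ('2012-01-01') TO ('2018-01-01')
  SERVER parquet_srv
  OPTIONS (filename '/tmp/phl_parking.parquet');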
</code></pre><p>Compared to the raw table, the Parquet file is similar in performance, usually a little slower. Just blasting through a row count (when the tables are pre-cached in memory).<pre><code class=language-pgsql>-- Give native table same indexing advantage
-- as the parquet file
CREATE INDEX ON phl_parking USING BRIN (issue_datetime);

SELECT Count(*) FROM phl_parking_pq;
-- Time: 1230 ms

SELECT Count(*) FROM phl_parking;
-- Time:  928 ms
</code></pre><p>Similarly, a filter is also slightly faster on the native PostgreSQL table.<pre><code class=language-pgsql>SELECT Sum(fine), Count(1)
FROM phl_parking_pq
WHERE issue_datetime BETWEEN '2014-01-01' AND '2015-01-01';
-- Time: 692 ms

SELECT Sum(fine), Count(1)
FROM phl_parking
WHERE issue_datetime BETWEEN '2014-01-01' AND '2015-01-01';
-- Time: 571 ms
</code></pre><p>The <code>parquet_fdw</code> is very nicely implemented, and will even tell you the execution plan that will be used on the file for a given filter. For example, the previous filter involves opening about 20% of the 132 row groups in the Parquet file.<pre><code class=language-pgsql>EXPLAIN SELECT Sum(fine), Count(1)
FROM phl_parking_pq
WHERE issue_datetime BETWEEN '2014-01-01' AND '2015-01-01';

 Finalize Aggregate  (cost=6314.77..6314.78 rows=1 width=16)
   ->  Gather  (cost=6314.55..6314.76 rows=2 width=16)
         Workers Planned: 2
         ->  Partial Aggregate  (cost=5314.55..5314.56 rows=1 width=16)
               ->  Parallel Foreign Scan on phl_parking_pq  (cost=0.00..5242.88 rows=14333 width=8)
                     Filter: ((issue_datetime >= '2014-01-01 00:00:00-08')
                           AND (issue_datetime <= '2015-01-01 00:00:00-08'))
                     Reader: Single File
                     Row groups: 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
                                 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63
</code></pre><p>For plowing through the whole table and doing a summary, the Parquet query is about the same speed as the PostgreSQL query.<pre><code class=language-pgsql>SELECT issuing_agency, count(1)
FROM phl_parking_pq
GROUP BY issuing_agency;
-- Time: 3043 ms

SELECT issuing_agency, count(1)
FROM phl_parking
GROUP BY issuing_agency;
-- Time: 3103 ms
</code></pre><p>My internal model for the performance differences is that, while the Parquet format has some advantages in avoiding unnecessary reads, via row block filtering and accessing only the columns of interest, those advantages are offset by some inefficiencies in converting the raw data from parquet into the internal PostgreSQL formats.<h2 id=conclusion><a href=#conclusion>Conclusion</a></h2><p>Is that it? Well, we've seen:<ul><li>Parquet is a software-neutral format that is increasingly common in data science and the data centre.<li>Parquet access can be made transparent to PostgreSQL via the <code>parquet_fdw</code> extension.<li>Parquet storage can provide substantial space savings.<li>Parquet storage is a bit slower than native storage, but can offload management of static data from the back-up and reliability operations needed by the rest of the data.<li>Parquet storage of static data is much better than just throwing it out.</ul><p>More importantly, I think there's more to discuss:<ul><li>Can parquet files participate in partitions?<li>Can parquet files be accessed in parallel in collections?<li>Can parquet files reside in cloud object storage instead of filesystem storage?<li>Can PostgreSQL with parquet storage act like a "mid-range big data engine" to crunch numbers on large collections of static data backed by parquet?</ul><p>So far the ecosystem of Parquet tools has been dominated by the needs of data science (R and Python) and a handful of cloud OLAP systems (Apache Spark), but there's no reason PostgreSQL can't start to partake of this common cloud format goodness, and start to swim in the data lake. ]]></content:encoded>
<category><![CDATA[ Fun with SQL ]]></category>
<author><![CDATA[ Paul.Ramsey@crunchydata.com (Paul Ramsey) ]]></author>
<dc:creator><![CDATA[ Paul Ramsey ]]></dc:creator>
<guid isPermalink="false">812d001121a9a2201cb553da451450f74bc7145e7a87f3cf281b99a242142b09</guid>
<pubDate>Tue, 03 May 2022 16:00:00 EDT</pubDate>
<dc:date>2022-05-03T20:00:00.000Z</dc:date>
<atom:updated>2022-05-03T20:00:00.000Z</atom:updated></item></channel></rss>