<?xml version="1.0" encoding="UTF-8" ?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" version="2.0"><channel><title>CrunchyData Blog</title>
<atom:link href="https://www.crunchydata.com/blog/topic/fun-with-sql/rss.xml" rel="self" type="application/rss+xml" />
<link>https://www.crunchydata.com/blog/topic/fun-with-sql</link>
<image><url>https://www.crunchydata.com/card.png</url>
<title>CrunchyData Blog</title>
<link>https://www.crunchydata.com/blog/topic/fun-with-sql</link>
<width>800</width>
<height>419</height></image>
<description>PostgreSQL experts from Crunchy Data share advice, performance tips, and guides on successfully running PostgreSQL and Kubernetes solutions</description>
<language>en-us</language>
<pubDate>Fri, 24 Oct 2025 09:00:00 EDT</pubDate>
<dc:date>2025-10-24T13:00:00.000Z</dc:date>
<dc:language>en-us</dc:language>
<sy:updatePeriod>hourly</sy:updatePeriod>
<sy:updateFrequency>1</sy:updateFrequency>
<item><title><![CDATA[ Temporal Joins ]]></title>
<link>https://www.crunchydata.com/blog/temporal-joins</link>
<description><![CDATA[ How do you return the Nth related record from joined tables in PostgreSQL? Did you know that it has a few more options than other databases? Use DISTINCT ON and window functions to solve common temporal join challenges and avoid N+1 query problems. ]]></description>
<content:encoded><![CDATA[ <p>My first thought seeing a temporal join in 2008 was, “Why is this query so complex?” The company I was at relied heavily on database queries, as it was a CRM and student success tracking system for colleges and universities. The query returned a filtered list of users and their last associated record from a second table. The hard part about the query isn’t returning the last timestamp or even performing joins, it’s returning <em>only their last associated record</em> from a second table.<p>Back in 2008, we didn’t have window functions or CTEs, so the query algorithm was a series of nested tables that looked like this:<pre><code class=language-sql>SELECT
    *
FROM users, ( -- find the record for the last second_table by created_at and user_id
                SELECT
                    second_table.*
                FROM second_table, ( -- find the last second_table created_at per user_id
                                        SELECT
                                            user_id,
                                            max(created_at) AS created_at
                                        FROM second_table
                                        GROUP BY 1
                                    ) AS last_second_table_at
                WHERE
                    last_second_table_at.user_id = second_table.user_id
                    AND second_table.created_at = last_second_table_at.created_at
            ) AS last_second_table
WHERE users.id = last_second_table.user_id;
</code></pre><p><strong>See the Sample Code section below for the schema and data to run these queries.</strong><p>But, even that query was wrong because the second table may have records with <strong>duplicate</strong> <code>created_at</code> values. That was the source of a bug back in 2008 that resulted in duplicate rows being listed.<p>Obviously, we weren't using Postgres at the time because there has always been a simpler way to do this in Postgres using <code>DISTINCT ON</code>:<pre><code class=language-sql>SELECT DISTINCT ON (u.id)
    u.id,
    u.name,
    s.created_at AS last_action_time,
    s.action_type
FROM users u
JOIN second_table s ON u.id = s.user_id
ORDER BY u.id, s.created_at DESC, s.id DESC;
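-- With the sample data from the Sample Code section below, this returns one
-- row per user; Alice's tie at 10:00:00 is broken by id DESC (illustrative
-- output, assuming that data set):
--  id |  name   |  last_action_time   |  action_type
-- ----+---------+---------------------+----------------
--   1 | Alice   | 2023-10-01 10:00:00 | page_view
--   2 | Bob     | 2023-10-02 11:00:00 | purchase
--   3 | Charlie | 2023-10-04 13:00:00 | profile_update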
</code></pre><p>Temporal joins require attention to detail.<h2 id=robust-solution-ctes--window-functions><a href=#robust-solution-ctes--window-functions>Robust Solution: CTEs &amp; Window Functions</a></h2><p>Before we go too far into the topic, for those looking for a solution to their current problem, below is how I would write that query today when you need something other than the <strong>first or last</strong> record in a series. For these situations, we use <a href=https://www.crunchydata.com/blog/postgres-subquery-powertools-subqueries-ctes-materialized-views-window-functions-and-lateral>CTEs and window functions</a>, so there's no need to nest queries when we can abstract them for a cleaner purpose. Here is the template for the temporal joins that do not work with <code>DISTINCT ON</code>:<pre><code class=language-sql>WITH max_second_table AS (
    SELECT
        *
    FROM (
            SELECT
                *,
                -- Use the ROW_NUMBER() window function to rank each user's records:
                -- The ORDER BY clause is critical:
                -- 1. ORDER BY created_at DESC finds the latest time.
                -- 2. ORDER BY id DESC serves as a reliable tie-breaker
                ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC, id DESC) AS row_order
            FROM second_table
        ) AS ordered_second_table
    WHERE row_order = 2
)

SELECT
    *
FROM users
LEFT JOIN max_second_table ON users.id = max_second_table.user_id;
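-- Note: row_order = 1 would join each user's latest record instead; any N
-- joins the Nth-latest. Reversing the ORDER BY direction inside OVER (...)
-- selects the Nth-earliest occurrence.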
</code></pre><p>In this example, we are joining the <strong>second occurrence</strong> (<code>WHERE row_order = 2</code>) in the <code>second_table</code> for a user. For the university example, we used these types of queries to report on progress over time by showing the 1st, 2nd, 3rd, and N<sup>th</sup> occurrences of events.<p>Is this actually less code than the first example? No, but it is compartmentalized with a cleaner purpose.<p>Also, introducing the <strong>primary key (<code>id</code>)</strong> in the <code>ORDER BY</code> clause provides the necessary <strong>tie-breaker logic</strong> for the sorting -- that is how we fixed the SQL issue in the opening text.<h2 id=problem-with-orms><a href=#problem-with-orms>Problem with ORMs</a></h2><p>Due to their query complexity, ORMs are generally not capable of handling temporal joins without complex manipulation. The ORM I'm most familiar with is <strong>ActiveRecord</strong>, part of the Ruby on Rails suite. When Rails developers encounter temporal joins, they typically resort to the <strong><a href=https://www.crunchydata.com/blog/postgresql-for-solving-n+1-queries-in-ruby-on-rails>N+1 query pattern</a></strong> from their application code like this:<pre><code class=language-ruby>users = User.all

users.each do |user|
  last_action = user.second_table.last
end
</code></pre><p>If you aren't running this query too frequently or over too many user records, this is generally <em>performant enough</em>. However, this approach becomes <strong>suboptimal</strong> for application performance as the user list grows because each iteration of the loop requires a network hop back and forth with the database and an object initialization in the application. While you can make ActiveRecord do this natively, the resulting code is often harder to read and maintain for the typical use case—a pattern you see in other ORMs as well.<h2 id=sample-code><a href=#sample-code>Sample Code</a></h2><p>Below is the sample SQL you can use to load data into your database to test a few of these queries. Note that <strong>Alice has two actions at the exact same timestamp</strong> to replicate the original bug scenario.<pre><code class=language-sql>CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    name TEXT
);

CREATE TABLE second_table (
    id SERIAL PRIMARY KEY,
    user_id INTEGER REFERENCES users(id),
    action_type TEXT,
    created_at TIMESTAMP WITHOUT TIME ZONE
);

INSERT INTO users (name) VALUES ('Alice'), ('Bob'), ('Charlie');

-- Alice has two actions at the exact same timestamp (The 2008 bug scenario)
INSERT INTO second_table (user_id, action_type, created_at) VALUES
(1, 'login', '2023-10-01 10:00:00'),
(1, 'page_view', '2023-10-01 10:00:00'),
(2, 'purchase', '2023-10-02 11:00:00'),
(3, 'registration', '2023-10-03 12:00:00'),
(3, 'profile_update', '2023-10-04 13:00:00');
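
-- Illustrative check (not from the original post): the 2008-style
-- max(created_at) join matches BOTH of Alice's rows, reproducing the
-- duplicate-row bug described above.
SELECT s.*
FROM second_table s
JOIN (
    SELECT user_id, max(created_at) AS created_at
    FROM second_table
    GROUP BY 1
) m ON m.user_id = s.user_id AND m.created_at = s.created_at
WHERE s.user_id = 1;
-- Returns two rows: 'login' and 'page_view', both at 2023-10-01 10:00:00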
</code></pre><h2 id=conclusion><a href=#conclusion>Conclusion</a></h2><p>The term "temporal join" isn't a common piece of developer jargon, but the underlying pattern, retrieving the N<sup>th</sup> related record, is critical in <a href=https://www.crunchydata.com/blog/window-functions-for-data-analysis-with-postgres>reporting and analytics</a>. It's a known pattern among people who have worked on a code base that relies heavily on SQL capabilities, typically for reporting.<p>Using the PostgreSQL feature <strong><code>DISTINCT ON</code></strong> for the simplest case, or <strong><a href=https://www.crunchydata.com/developers/playground/ctes-and-window-functions>CTEs with Window Functions</a></strong> for complex retrieval, we avoid the bugs of older SQL patterns and eliminate the performance penalty of the N+1 problem.<p>If you would like to learn more about advanced SQL patterns, check out our <a href=https://www.crunchydata.com/developers/playground>Postgres Playground</a>.</p> ]]></content:encoded>
<category><![CDATA[ Fun with SQL ]]></category>
<author><![CDATA[ Christopher.Winslett@crunchydata.com (Christopher Winslett) ]]></author>
<dc:creator><![CDATA[ Christopher Winslett ]]></dc:creator>
<guid isPermalink="false">da09e0153c614a35f860efd24fb3f98f0bff8151300e44ffb6009d17dce4c1b8</guid>
<pubDate>Fri, 24 Oct 2025 09:00:00 EDT</pubDate>
<dc:date>2025-10-24T13:00:00.000Z</dc:date>
<atom:updated>2025-10-24T13:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ Crazy Idea to Postgres in the Browser ]]></title>
<link>https://www.crunchydata.com/blog/crazy-idea-to-postgres-in-the-web-browser</link>
<description><![CDATA[ We've got Postgres running in a web browser and lots of folks were curious how it was built. Joey goes through the prototyping steps. ]]></description>
<content:encoded><![CDATA[ <p>We <a href=https://www.crunchydata.com/blog/learn-postgres-at-the-playground>just launched</a> our <a href=https://www.crunchydata.com/developers/tutorials>Postgres Playground</a>. Running Postgres in the web browser was not exactly commonplace before, so naturally, people are wondering how it works.<p>It actually started as a fun weekend experiment. Here's a screenshot I saved, just moments after recovering from the initial "whoa, it's working!" effect.<p><img alt="Screenshot of PostgreSQL running in the Chrome web browser"loading=lazy src=/blog-assets/browser-pg/screenshot.webp><p>The next morning, I shared this screenshot in our internal Slack channel for web frontend engineering. Our mental gears began to turn as we imagined what might (and might not) be possible. After a bit of work, we built upon some interesting ideas, and this fun weekend hack evolved into what is now the <a href=https://www.crunchydata.com/developers/tutorials>Postgres Playground</a>!<p>This blog post focuses on how to run PostgreSQL in the web browser, but there are other interesting pieces involved in the playground as well. For example, the content in the tutorials actually lives in <a href=https://www.notion.so/>Notion</a> documents. There may be a blog post about that in the near future from one of my colleagues, so stay tuned!<h2 id=why><a href=#why>Why?</a></h2><p>I stumbled upon an interesting <a href=https://wasmer.io/posts/markdown-playgrounds-powered-by-wasm>blog post</a> about how Wasmer has a "Run in Playground" link for some Markdown fenced code blocks for a few of their <a href=https://wapm.io/>WAPM</a> packages. There was one in particular that <em>really</em> got my attention: <a href=https://wapm.io/sqlite/sqlite>SQLite</a>. On that page, there is a fenced code block with some SQL queries inside. 
If you click the "Run in Playground" button, it runs the query right there in the web browser with SQLite compiled to WebAssembly.<p>After running that SQLite query in my browser, I thought, "Can I do this with Postgres?".<p>The modern web browser is a very powerful platform, and this platform's capabilities are constantly increasing. However, WebAssembly still has some growing to do in some areas. After some quick research, I found that the web browser simply does not offer the networking features that Postgres needs. That would seem like a pretty big obstacle.<p><em><strong>However</strong>...</em><p>As I mentioned, the modern web browser is a very powerful platform. Let's just change the target platform to something other than WebAssembly, then run it in WebAssembly anyway like a rebel. 😎<h2 id=virtual-machines-in-the-browser><a href=#virtual-machines-in-the-browser>Virtual machines in the browser</a></h2><p>It's actually possible to emulate a PC <em>inside the web browser</em>! There have been quite a few implementations over the years. Some even started out in JavaScript, years before WebAssembly was an option. Here are a few that I found especially interesting:<table><thead><tr><th>Emulator<th>Architecture<tbody><tr><td><a href=https://github.com/nepx/halfix>Halfix</a><td>x86<tr><td><a href=https://bellard.org/jslinux/>JSLinux</a><td>x86 and RISC-V<tr><td><a href=https://github.com/s-macke/jor1k>jor1k</a><td>OpenRISC 1000<tr><td><a href=https://github.com/copy/v86>v86</a><td>x86<tr><td><a href=https://webvm.io/>WebVM</a><td>x86</table><p>I ended up choosing v86 for this. The author started the project in 2011, and it's still active. Many questions have been answered in the GitHub issues and discussions over the years. I'm definitely not an expert in Linux or virtualization, so being able to search the repo for answers was very helpful.<p>v86's performance also seemed to be among the best, compared to similar open source emulators that run in the browser. 
In early 2021, they merged a <a href=https://github.com/copy/v86/pull/388>Rust port + JIT</a> into the master branch, which provided a significant performance boost over the original JavaScript implementation.<h2 id=build><a href=#build>Build</a></h2><p>For this blog post, we'll use Alpine Linux. It's a lightweight Linux distribution and a very popular base for many Docker images. They also have a version with a slimmed-down kernel, optimized for virtual machines.<h3 id=install-alpine-linux><a href=#install-alpine-linux>Install Alpine Linux</a></h3><p>Note: You'll need to have QEMU installed.<p>Download the Alpine image that is optimized for VMs. We'll need the x86 (not x86_64) build.<pre><code class=language-shell>wget https://dl-cdn.alpinelinux.org/alpine/v3.16/releases/x86/alpine-virt-3.16.0-x86.iso
</code></pre><p>Create a disk image for the VM. I chose my disk size somewhat randomly, so feel free to make it larger if you'd like. You could probably make it smaller, but I'm not sure what the minimum size limit would be here.<pre><code class=language-shell>qemu-img create alpine.img 512M
</code></pre><p>Start the VM<pre><code class=language-shell>qemu-system-x86_64 \
  -m 256M \
  -cdrom alpine-virt-3.16.0-x86.iso \
  -drive file=alpine.img,format=raw
</code></pre><p>Note: Depending on your host machine, you may be able to get a performance boost by appending another argument to the command:<table><thead><tr><th>Host machine<th>Argument<tbody><tr><td>Linux with x86 CPU<td><code>-accel kvm</code><tr><td>macOS with x86 (Intel) CPU<td><code>-accel hvf</code><tr><td>macOS with Apple Silicon<td>None, but <a href=https://www.crunchydata.com/blog/postgresql-benchmarks-apple-arm-m1-macbook-pro-2020 title="Benchmark comparisons: Apple M1 vs. various x86 CPUs">your CPU is a beast anyway</a>.</table><p>When the VM has finished booting, log in as 'root'.<p>Run <code>setup-alpine</code>. For most of the setup questions, you can configure things to your liking, but these are important:<table><thead><tr><th>Question<th>Answer<tbody><tr><td><code>Which disk(s) would you like to use?</code><td><code>sda</code><tr><td><code>How would you like to use it?</code><td><code>sys</code><tr><td><code>WARNING: Erase the above disk(s) and continue?</code><td><code>y</code></table><p>After the installation has finished, use the <code>reboot</code> command to reboot the VM from the virtual hard disk image.<h3 id=install-postgres><a href=#install-postgres>Install Postgres</a></h3><p>After the VM has rebooted, log in as <code>root</code> again. We can now install and initialize Postgres.<pre><code class=language-shell>apk add postgresql --no-cache
/etc/init.d/postgresql setup
/etc/init.d/postgresql start
rc-update add postgresql
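# (Note: rc-update registers the service with OpenRC's default runlevel,
# so Postgres starts automatically when the VM boots.)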
</code></pre><p>Now, a quick smoke test:<pre><code class=language-shell>su - postgres -c 'psql -c "SELECT version();"'
</code></pre><p>If it worked, you should see something like this:<pre><code class=language-text>                                                    version
----------------------------------------------------------------------------------------------------------------
 PostgreSQL 14.4 on i586-alpine-linux-musl, compiled by gcc (Alpine 11.2.1_git20220219) 11.2.1 20220219, 32-bit
(1 row)
</code></pre><p>Note: On my machine, <code>psql</code> opened the version information in <code>less</code>, so I had to press <code>q</code> to exit that.<p>Now we can shut down the VM.<pre><code class=language-shell>poweroff
</code></pre><h2 id=run-the-vm-in-the-web-browser><a href=#run-the-vm-in-the-web-browser>Run the VM in the web browser</a></h2><ol><li>Go to <a href=https://copy.sh/v86/>v86's site</a>.<li>Scroll down to the "Setup" section.<li>For "Hard disk drive image", select the <code>alpine.img</code> file you created earlier.<li>Adjust "Memory size" to your liking. (I used 256 MB)<li>Click the "Start Emulation" button.</ol><p>Note: It will take a bit longer for the VM to boot. Running the VM in the web browser will not be as fast as it was in QEMU.<p>After it has finished booting, log in as root, then open <code>psql</code>:<pre><code class=language-shell>su - postgres -c psql
</code></pre><p>Congratulations! You are now running Postgres in your web browser.<p>Things to keep in mind:<ul><li>There is no internet access from inside the VM.<li>There is no data persistence, so changes are lost when leaving or refreshing the page.</ul> ]]></content:encoded>
<category><![CDATA[ Fun with SQL ]]></category>
<author><![CDATA[ Joey.Mezzacappa@crunchydata.com (Joey Mezzacappa) ]]></author>
<dc:creator><![CDATA[ Joey Mezzacappa ]]></dc:creator>
<guid isPermalink="false">270462b44374c1900301bdb1f6d9c8b050d7f4a97badd5db4ab40fe67fbde964</guid>
<pubDate>Wed, 24 Aug 2022 11:00:00 EDT</pubDate>
<dc:date>2022-08-24T15:00:00.000Z</dc:date>
<atom:updated>2022-08-24T15:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ Rise of the Anti-Join ]]></title>
<link>https://www.crunchydata.com/blog/rise-of-the-anti-join</link>
<description><![CDATA[ Find me all the things in set "A" that are not in set "B". Paul has some suggestions of when to use the anti-join pattern in queries with some impressive results. ]]></description>
<content:encoded><![CDATA[ <p>Find me all the things in set "A" that are not in set "B".<p><img alt="diagram of foreign key references between tables"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/4cff80f1-0c32-4d14-d8ec-2307383d3c00/public><p>This is a pretty common query pattern, and it occurs in both non-spatial and spatial situations. As usual, there are multiple ways to express this query in SQL, but only a couple queries will result in the best possible performance.<h2 id=setup><a href=#setup>Setup</a></h2><p>The non-spatial setup starts with two tables with the numbers 1 to 1,000,000 in them, then deletes two records from one of the tables.<pre><code class=language-pgsql>CREATE TABLE a AS SELECT generate_series(1,1000000) AS i
ORDER BY random();
CREATE INDEX a_x ON a (i);

CREATE TABLE b AS SELECT generate_series(1,1000000) AS i
ORDER BY random();
CREATE INDEX b_x ON b (i);

DELETE FROM b WHERE i = 444444;
DELETE FROM b WHERE i = 222222;

ANALYZE;
</code></pre><p>The spatial setup is a 2M record table of geographic names, and a 3K record table of county boundaries. Most of the geonames are inside counties (because we tend to name things on land) but some of them are not (because sometimes we name things in the ocean, or our boundaries are not detailed enough).<p><img alt="diagram of new york city with dots at points that have geographic names"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/be1cb873-7bd0-4f15-2649-6d8721634b00/public><h2 id=subqueries-no-><a href=#subqueries-no->Subqueries? No.</a></h2><p>Since the problem statement includes the words "not in", this form of the query seems superficially plausible:<pre><code class=language-pgsql>SELECT i
  FROM a
  WHERE i NOT IN (SELECT i FROM b);
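-- Aside: NOT IN also has surprising NULL semantics; if any b.i were NULL,
-- this query would return zero rows, since "i NOT IN (..., NULL)" can
-- never evaluate to true.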
</code></pre><p>Perfect! Give me everything from "A" that is not in "B"! Just what we want? In fact, running the query takes so long that I never got it to complete. The explain gives some hints.<pre><code class=language-bash>                                  QUERY PLAN
------------------------------------------------------------------------------
 Gather  (cost=1000.00..5381720008.33 rows=500000 width=4)
   Workers Planned: 2
   ->  Parallel Seq Scan on a  (cost=0.00..5381669008.33 rows=208333 width=4)
         Filter: (NOT (SubPlan 1))
         SubPlan 1
           ->  Materialize  (cost=0.00..23331.97 rows=999998 width=4)
                 ->  Seq Scan on b  (cost=0.00..14424.98 rows=999998 width=4)
</code></pre><p>Note that the subquery ends up materializing the whole second table into memory, where it is scanned over and over and over to test each key in table "A". Not good.<h2 id=except-maybe-><a href=#except-maybe->Except? Maybe.</a></h2><p>PostgreSQL supports some <a href=https://www.postgresql.org/docs/current/queries-union.html>set-based keywords</a> that allow you to find logical combinations of queries: <code>UNION</code>, <code>INTERSECT</code> and <code>EXCEPT</code>.<p>Here, we can make use of <code>EXCEPT</code>.<pre><code class=language-pgsql>SELECT a.i FROM a
EXCEPT
SELECT b.i FROM b;
</code></pre><p>The SQL still matches our mental model of the problem statement: everything in "A" <strong>except</strong> for everything in "B".<p>And it returns a correct answer in about <strong>2.3 seconds</strong>.<pre><code class=language-pgsql>   i
--------
 222222
 444444
(2 rows)
</code></pre><p>The query plan is interesting: the two tables are appended, then sorted to find duplicates, and then only the non-dupes are emitted!<pre><code class=language-bash>                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 SetOp Except  (cost=322856.41..332856.40 rows=1000000 width=8)
   ->  Sort  (cost=322856.41..327856.41 rows=1999998 width=8)
         Sort Key: "*SELECT* 1".i
         ->  Append  (cost=0.00..58849.95 rows=1999998 width=8)
               ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..24425.00 rows=1000000 width=8)
                     ->  Seq Scan on a  (cost=0.00..14425.00 rows=1000000 width=4)
               ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..24424.96 rows=999998 width=8)
                     ->  Seq Scan on b  (cost=0.00..14424.98 rows=999998 width=4)
</code></pre><p>It's a big hammer, but it works.<h2 id=anti-join-yes-><a href=#anti-join-yes->Anti-Join? Yes.</a></h2><p>The best approach is the "anti-join". One way to express an anti-join is with a special "<a href=https://www.postgresql.org/docs/current/functions-subquery.html>correlated subquery</a>" syntax:<pre><code class=language-pgsql>SELECT a.i
  FROM a
  WHERE NOT EXISTS
    (SELECT b.i FROM b WHERE a.i = b.i);
</code></pre><p>So this returns rows from "A" only where the correlated subquery against "B" returns no records.<p>It takes about <strong>850 ms</strong> on my test laptop, so <strong>3 times</strong> faster than using <code>EXCEPT</code> in this test, and gets the right answer. The query plan looks like this:<pre><code class=language-bash>                                     QUERY PLAN
------------------------------------------------------------------------------------
 Gather  (cost=16427.98..31466.36 rows=2 width=4)
   Workers Planned: 2
   ->  Parallel Hash Anti Join  (cost=15427.98..30466.16 rows=1 width=4)
         Hash Cond: (a.i = b.i)
         ->  Parallel Seq Scan on a  (cost=0.00..8591.67 rows=416667 width=4)
         ->  Parallel Hash  (cost=8591.66..8591.66 rows=416666 width=4)
               ->  Parallel Seq Scan on b  (cost=0.00..8591.66 rows=416666 width=4)
</code></pre><p>The same sentiment can be expressed without the <code>NOT EXISTS</code> construct, using only basic SQL and a <code>LEFT JOIN</code>:<pre><code class=language-pgsql>SELECT a.i
  FROM a
  LEFT JOIN b ON (a.i = b.i)
  WHERE b.i IS NULL;
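-- b.i is the join key and never NULL in "b" itself, so "b.i IS NULL" can
-- only be true when the LEFT JOIN found no matching row.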
</code></pre><p>This also takes about <strong>850 ms</strong>.<p>The <code>LEFT JOIN</code> is required to return a record for every row of "A". So what does it do if there's no record in "B" that satisfies the join condition? It returns <code>NULL</code> for the columns of "B" in the join relation for those records. That means any row with a <code>NULL</code> in a column of "B" that is normally non-<code>NULL</code> is a record in "A" that is not in "B".<h2 id=now-do-spatial><a href=#now-do-spatial>Now do Spatial</a></h2><p>The nice thing about the <code>LEFT JOIN</code> expression of the solution is that it generalizes nicely to arbitrary join conditions, like those using spatial predicate functions.<p>"Find the geonames points that are not inside counties"... OK, we will <code>LEFT JOIN</code> geonames with counties and find the records where counties are <code>NULL</code>.<pre><code class=language-pgsql>SELECT g.name, g.geonameid, g.geom
  FROM geonames g
  LEFT JOIN counties c
    ON ST_Contains(c.geom, g.geom)
  WHERE c.geom IS NULL;
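</code></pre><p>The same spatial anti-join can equivalently be phrased with <code>NOT EXISTS</code>; this sketch (not timed here) should return the same rows:<pre><code class=language-pgsql>SELECT g.name, g.geonameid, g.geom
  FROM geonames g
  WHERE NOT EXISTS (
    SELECT 1
      FROM counties c
      WHERE ST_Contains(c.geom, g.geom));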
</code></pre><p>The answer pops out in about a minute.<p><img alt="a map of new york city with dots for geographic names and the ones in the water are pink"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/b3a33f81-dc16-470f-efd3-00552eae3400/public><p>Unsurprisingly, that's about how long a standard inner join takes to associate the 2M geonames with the 3K counties, since the anti-join has to do about that much work to determine which records do not match the join condition.<h2 id=conclusion><a href=#conclusion>Conclusion</a></h2><ul><li>"Find the things in A that aren't in B" is a common use pattern.<li>The "obvious" SQL patterns might not be the most efficient ones.<li>The <code>WHERE NOT EXISTS</code> and <code>LEFT JOIN</code> patterns both result in the most efficient query plans and executions.</ul> ]]></content:encoded>
<category><![CDATA[ Spatial ]]></category>
<category><![CDATA[ Fun with SQL ]]></category>
<author><![CDATA[ Paul.Ramsey@crunchydata.com (Paul Ramsey) ]]></author>
<dc:creator><![CDATA[ Paul Ramsey ]]></dc:creator>
<guid isPermalink="false">22da035d8b818ddd351b8b7e6a7ec842186e3d2168298c1765348790f973c704</guid>
<pubDate>Mon, 15 Aug 2022 08:00:00 EDT</pubDate>
<dc:date>2022-08-15T12:00:00.000Z</dc:date>
<atom:updated>2022-08-15T12:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ Quick and Easy Postgres Data Compare ]]></title>
<link>https://www.crunchydata.com/blog/quick-and-easy-postgres-data-compare</link>
<description><![CDATA[ Brian offers a quick and efficient way to compare data between two different Postgres sources using the Postgres Foreign Data Wrapper (postgres_fdw). ]]></description>
<content:encoded><![CDATA[ <p>If you're checking archives or working with Postgres replication, data reconciliation can be a necessary task. Row counts are one of the go-to comparison methods, but counts alone do not reveal data mismatches. You could pull table data across the network and then compare each row and each field, but that is demanding on resources. Today we'll walk through a simple solution for your Postgres toolbox: using Foreign Data Wrappers to connect and compare the two source datasets. With the foreign data wrapper and a little SQL magic, we can compare data quickly and easily.<h2 id=creating-environments><a href=#creating-environments>Creating Environments</a></h2><p>To keep the environment simple enough to practice even with limited resources, we will use a single PostgreSQL cluster with two databases (<code>hrprod</code>, <code>hrreport</code>) connected via the PostgreSQL Foreign Data Wrapper. The simulation here is a production database (<code>hrprod</code>) with a reporting database (<code>hrreport</code>). Keep in mind that the source and target do not have to be within the same PostgreSQL cluster.<p>To speed up creating the environment, <a href=https://www.crunchydata.com/products/crunchy-postgresql-for-kubernetes>Crunchy Postgres for Kubernetes</a> was used and a simple PostgreSQL cluster deployed using the <a href=https://github.com/CrunchyData/postgres-operator-examples>Postgres Operator Examples</a> repository.<p>The remaining sections show only the commands performed within <code>psql</code> from the database containers.<h3 id=production-setup-hrprod><a href=#production-setup-hrprod>Production Setup (hrprod)</a></h3><p>The steps to create the simulated production database are simple: create the database, create the <code>postgres_fdw</code> extension, create the <code>employee</code> table and lastly populate the <code>employee</code> table with three rows of data.<pre><code class=language-pgsql>postgres=> create database hrprod;
CREATE DATABASE

postgres=> \c hrprod
You are now connected to database "hrprod" as user "postgres".

hrprod=> create extension postgres_fdw;
CREATE EXTENSION

hrprod=> create table employee (id int, first_name varchar(50), last_name varchar(50), department varchar(20));
CREATE TABLE

hrprod=> insert into employee (id, first_name, last_name, department) values (1,'John','Smith','explorer'),(2,'George','Washington','government'),(3,'Thomas','Edison','inventor');
INSERT 0 3
</code></pre><h3 id=reporting-setup-hrreport><a href=#reporting-setup-hrreport>Reporting Setup (hrreport)</a></h3><p>The steps are then repeated to create the simulated reporting database.<pre><code class=language-pgsql>postgres=> create database hrreport;
CREATE DATABASE

postgres=> \c hrreport
You are now connected to database "hrreport" as user "postgres".

hrreport=> create extension postgres_fdw;
CREATE EXTENSION

hrreport=> create table employee (id int, first_name varchar(50), last_name varchar(50), department varchar(20));
CREATE TABLE

hrreport=> insert into employee (id, first_name, last_name, department) values (1,'John','Smith','explorer'),(2,'George','Washington','government'),(3,'Thomas','Edison','inventor');
INSERT 0 3
</code></pre><p>With this, the setup is complete and the data in the <code>employee</code> table matches in both databases.<h2 id=data-compare><a href=#data-compare>Data Compare</a></h2><p>The compare will be performed from the reporting database side (<code>hrreport</code>). To start, a working table named <code>data_compare</code> is created. The <code>data_compare</code> table stores three pieces of information:<ul><li><code>source_name</code> column that identifies where the data came from (<code>hrprod</code> or <code>hrreport</code> in this example).<li><code>id</code> column that will store the value(s) of the primary key from the table.<li><code>hash_value</code> column that stores the hash value of all the non-key fields in the table.</ul><p>Note that if the table has a composite key, the <code>id</code> column would be populated by joining the values into a single string. The hash occurs on the source side and only the hashed value is used for the comparison, greatly reducing network traffic, transfer time, etc.<h3 id=setup-data-compare><a href=#setup-data-compare>Setup Data Compare</a></h3><p>Create the <code>data_compare</code> table in both the production (<code>hrprod</code>) and target (<code>hrreport</code>) databases.<pre><code class=language-pgsql>hrreport=> \c hrprod
You are now connected to database "hrprod" as user "postgres".

hrprod=> CREATE TABLE data_compare
        (source_name VARCHAR(140),
	    id VARCHAR(1000),
	    hash_value varchar(100)
        );
CREATE TABLE

hrprod=> \c hrreport
You are now connected to database "hrreport" as user "postgres".

hrreport=> CREATE TABLE data_compare
        (source_name VARCHAR(140),
	    id VARCHAR(1000),
	    hash_value varchar(100)
        );
CREATE TABLE
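</code></pre><p>As noted earlier, a composite key would be flattened into the single <code>id</code> column by joining the key values into one string. A sketch of what that might look like, assuming a hypothetical table keyed on <code>(company_id, emp_id)</code>:<pre><code class=language-pgsql>-- hypothetical composite-key example: concatenate the key values for id,
-- hash only the non-key fields
SELECT concat_ws('|', company_id::text, emp_id::text) id,
       md5(concat_ws('|', first_name, last_name, department)) hash_value
  FROM employee_multi e;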
</code></pre><p>An <code>INSERT</code> statement will be executed on both the source and target to populate the <code>data_compare</code> table, and then the contents of the tables are compared to identify differences. To reduce time and transfer for multiple compare passes, the <code>data_compare</code> table contents can be transferred via the foreign table or <code>pg_dump</code>, etc.<p>The following steps were used to create the foreign table.<pre><code class=language-pgsql>hrreport=> CREATE SERVER hrprod FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host 'localhost', dbname 'hrprod', port '5432');
CREATE SERVER

hrreport=> CREATE USER MAPPING FOR current_user SERVER hrprod options (user 'postgres', password 'welcome1');
CREATE USER MAPPING

hrreport=> CREATE FOREIGN TABLE hrprod_data_compare (source_name varchar(140), id varchar(1000), hash_value varchar(100)) SERVER hrprod OPTIONS (table_name 'data_compare');
CREATE FOREIGN TABLE
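</code></pre><p>As an alternative to spelling out the foreign table's columns by hand, <code>postgres_fdw</code> also supports <code>IMPORT FOREIGN SCHEMA</code>, which copies the definition from the remote side. A sketch, assuming a local schema named <code>staging</code> exists to hold it (the imported table keeps its remote name, so it cannot land in the same schema as the local <code>data_compare</code>):<pre><code class=language-pgsql>IMPORT FOREIGN SCHEMA public LIMIT TO (data_compare)
  FROM SERVER hrprod INTO staging;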

</code></pre><h3 id=perform-initial-compare><a href=#perform-initial-compare>Perform Initial Compare</a></h3><p>Populate the <code>data_compare</code> table in both the source (<code>hrprod</code>) and target (<code>hrreport</code>) databases.<pre><code class=language-pgsql>hrprod=> INSERT INTO data_compare (source_name, id, hash_value)
  (SELECT 'hrprod' source_name, id::text, md5(concat_ws('|',first_name, last_name, department)) hash_value FROM employee e);
INSERT 0 3


hrreport=> INSERT INTO data_compare (source_name, id, hash_value)
  (SELECT 'hrreport' source_name, id::text, md5(concat_ws('|',first_name, last_name, department)) hash_value FROM employee e);
INSERT 0 3
</code></pre><p>At this point we know that the data is exactly the same so let's look at the SQL that is used to perform the actual comparison.<pre><code class=language-pgsql>hrreport=> SELECT COALESCE(s.id,t.id) id,
              s.hash_value source_hash_value, t.hash_value target_hash_value,
              CASE WHEN s.hash_value = t.hash_value THEN 'equal'
                    WHEN s.id IS NULL THEN 'row not on source'
                    WHEN t.id IS NULL THEN 'row not on target'
                    ELSE 'difference'
              END compare_result
        FROM hrprod_data_compare s
            FULL JOIN data_compare t ON s.id=t.id;


 id |        source_hash_value         |        target_hash_value         |  compare_result
----+----------------------------------+----------------------------------+-------------------
 1  | 681c37a127083d90164a9f04b5f92759 | 681c37a127083d90164a9f04b5f92759 | equal
 2  | 6e181f686815319daa07c5e0e1ddcd27 | 6e181f686815319daa07c5e0e1ddcd27 | equal
 3  | 4d4eba0d792cb227d247a3b0f9f66979 | 4d4eba0d792cb227d247a3b0f9f66979 | equal
(3 rows)
</code></pre><p>The <code>compare_result</code> confirms that the two sets of data are equal. An alternate compare SQL is included at the end of this article to show various ways the data can be compared when the two <code>data_compare</code> tables are combined.<h3 id=create-an-out-of-sync-condition-and-compare><a href=#create-an-out-of-sync-condition-and-compare>Create an Out-Of-Sync Condition and Compare</a></h3><p>At this stage, three rows exist in the table and the data matches.<pre><code class=language-pgsql>hrprod=> SELECT * FROM employee;
 id | first_name | last_name  | department
----+------------+------------+------------
  1 | John       | Smith      | explorer
  2 | George     | Washington | government
  3 | Thomas     | Edison     | inventor
(3 rows)
</code></pre><p>To create the out-of-sync condition, the following changes will be performed:<ul><li>In <code>hrprod</code>, add CS Lewis with id 4, Charles Babbage with id 5, Blaise Pascal with id 6.<li>In <code>hrreport</code>, add Charles Babbage with id 4, CS Lewis with id 5, Kenny Rogers with id 7.</ul><p>Notice that the ids for CS Lewis and Charles Babbage have been swapped and a unique record added to each database (Blaise Pascal to <code>hrprod</code> and Kenny Rogers to <code>hrreport</code>). The compare should show that 3 rows match, 2 rows have differences and 2 rows are in one database but not the other.<p>Up first, changes to source (<code>hrprod</code>).<pre><code class=language-pgsql>hrprod=> INSERT INTO employee (id, first_name, last_name, department)
        VALUES (4,'CS','Lewis','author'),(5,'Charles','Babbage','math'),(6,'Blaise','Pascal','math');
INSERT 0 3

hrprod=> SELECT * FROM employee ORDER BY id;
 id | first_name | last_name  | department
----+------------+------------+------------
  1 | John       | Smith      | explorer
  2 | George     | Washington | government
  3 | Thomas     | Edison     | inventor
  4 | CS         | Lewis      | author
  5 | Charles    | Babbage    | math
  6 | Blaise     | Pascal     | math
(6 rows)
</code></pre><p>Now the changes to the target (<code>hrreport</code>).<pre><code class=language-pgsql>hrreport=> INSERT INTO employee (id, first_name, last_name, department)
        VALUES (5,'CS','Lewis','author'),(4,'Charles','Babbage','math'),(7,'Kenny','Rogers','music');
INSERT 0 3

hrreport=> SELECT * FROM employee ORDER BY id;
 id | first_name | last_name  | department
----+------------+------------+------------
  1 | John       | Smith      | explorer
  2 | George     | Washington | government
  3 | Thomas     | Edison     | inventor
  4 | Charles    | Babbage    | math
  5 | CS         | Lewis      | author
  7 | Kenny      | Rogers     | music
(6 rows)
</code></pre><p>To summarize the current state:<ul><li>Three rows that match (id=1, 2, 3)<li>Two rows that do not match (id=4, id=5)<li>Two rows that exist in one but not the other (id=6, id=7)</ul><p>Let's now clear the <code>data_compare</code> tables and perform the compare again.<pre><code class=language-pgsql>postgres=> \c hrprod
You are now connected to database "hrprod" as user "postgres".

hrprod=> DELETE FROM data_compare;
DELETE 3

hrprod=> INSERT INTO data_compare (source_name, id, hash_value)
  (SELECT 'hrprod' source_name, id::text id, md5(concat_ws('|',first_name, last_name, department)) hash_value FROM employee e);
INSERT 0 6

hrprod=> \c hrreport
You are now connected to database "hrreport" as user "postgres".

hrreport=> DELETE FROM data_compare;
DELETE 3

hrreport=> INSERT INTO data_compare (source_name, id, hash_value)
  (SELECT 'hrreport' source_name, id::text id, md5(concat_ws('|',first_name, last_name, department)) hash_value FROM employee e);
INSERT 0 6
</code></pre><p>Now for the compare and the results.<pre><code class=language-pgsql>hrreport=> SELECT COALESCE(s.id,t.id) id,
              s.hash_value source_hash_value, t.hash_value target_hash_value,
              CASE WHEN s.hash_value = t.hash_value THEN 'equal'
                    WHEN s.id IS NULL THEN 'row not on source'
                    WHEN t.id IS NULL THEN 'row not on target'
                    ELSE 'difference'
              END compare_result
        FROM hrprod_data_compare s
            FULL JOIN data_compare t ON s.id=t.id;


 id |        source_hash_value         |        target_hash_value         |  compare_result
----+----------------------------------+----------------------------------+-------------------
 1  | 681c37a127083d90164a9f04b5f92759 | 681c37a127083d90164a9f04b5f92759 | equal
 2  | 6e181f686815319daa07c5e0e1ddcd27 | 6e181f686815319daa07c5e0e1ddcd27 | equal
 3  | 4d4eba0d792cb227d247a3b0f9f66979 | 4d4eba0d792cb227d247a3b0f9f66979 | equal
 4  | bbee9d6cccbeac4e9125ec78507c4eb7 | 57acef6ed228a52b8c42f0a6c155e62b | difference
 5  | 57acef6ed228a52b8c42f0a6c155e62b | bbee9d6cccbeac4e9125ec78507c4eb7 | difference
 6  | 047742fb256df0b78cebc3fbbc3ca4ad |                                  | row not on target
 7  |                                  | 66e5e35673780bd392d2f81d589fbb52 | row not on source
 (7 rows)
</code></pre><p>The above output indicates that rows with ids 1 through 3 exist in both databases and their contents match. Rows with ids 4 and 5 exist in each database, but the contents of those rows differ. Going a step further, one can see that the two hash values are the same, just associated with the wrong ids, consistent with the swap performed above. The row with id 6 exists only on the source (<code>hrprod</code>), while the row with id 7 exists only on the target (<code>hrreport</code>). In total, there are 4 rows out of sync.<p>With the rows identified, the proper steps can be taken to sync them. One last thought: imagine for a moment that logical replication was in place between the two databases and changes were pending on the target due to lag. The INSERT into <code>data_compare</code> could be performed only on the rows flagged as out of sync to verify just those rows once the replication lag is gone.<h2 id=conclusion><a href=#conclusion>Conclusion</a></h2><p>Comparing data can be a monumental task. However, this little trick has come in handy over the years when expensive data compare software packages were not an option. There is still room for some creativity with the compare SQL to meet the exact needs of the compare. For example, only show rows that are missing from one side or the other.<p>Alternate Compare SQL:<pre><code class=language-pgsql>SELECT id, hash_value,
       count(src1) src1,
       count(src2) src2
 FROM
     ( SELECT a.*,
              1 src1,
              null src2
        FROM data_compare a
        WHERE source_name='hrprod'
        UNION ALL
        SELECT b.*,
               null src1,
               2 src2
        FROM data_compare b
        WHERE source_name='hrreport'
    ) c
 GROUP BY id, hash_value
 HAVING count(src1) <> count(src2);
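</code></pre><p>For instance, to show only the rows missing entirely from one side, group the combined <code>data_compare</code> contents by id and keep the ids that appear under a single source (a sketch, assuming both sources' rows are loaded into the one table as above):<pre><code class=language-pgsql>-- ids present under only one source_name are missing from the other side
SELECT id, min(source_name) present_only_in
  FROM data_compare
  GROUP BY id
  HAVING count(*) = 1;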
</code></pre><p>So by setting up <code>postgres_fdw</code>, hashing the non-key fields, and writing a SQL query to see if any rows are different, you can do a quick and simple Postgres data comparison. Have another solution you like for data compare? Let us know at @crunchydata. ]]></content:encoded>
<category><![CDATA[ Fun with SQL ]]></category>
<author><![CDATA[ Brian.Pace@crunchydata.com (Brian Pace) ]]></author>
<dc:creator><![CDATA[ Brian Pace ]]></dc:creator>
<guid isPermalink="false">fa294ef44d5f7c11b83007090c7eaec3f34f9a685430ae518306e3ca35d5fb1e</guid>
<pubDate>Wed, 15 Jun 2022 11:00:00 EDT</pubDate>
<dc:date>2022-06-15T15:00:00.000Z</dc:date>
<atom:updated>2022-06-15T15:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ Parquet and Postgres in the Data Lake ]]></title>
<link>https://www.crunchydata.com/blog/parquet-and-postgres-in-the-data-lake</link>
<description><![CDATA[ Have too much static data? Paul has an idea: move some of it to a data lake. He provides a walk-through of setting up Parquet with Postgres using the parquet_fdw. ]]></description>
<content:encoded><![CDATA[ <style>
    .black-box {
        background-color: black;
        color: white;
        padding: 20px;
        text-align: left;
        align-items: left;
        margin: 20px auto;
        border-radius: 10px;
        width: auto;
        height: auto;
    }
    .black-box a {
        color: white;
        text-decoration: underline;
    }
</style> <div class="black-box">
Interested in Spatial analytics? You can now connect Postgres and PostGIS to CSV, JSON, Parquet / GeoParquet, Iceberg, and more with <a href="https://www.crunchydata.com/products/warehouse">Crunchy Data Warehouse</a>.
</div><h2 id=static-data-is-different><a href=#static-data-is-different>Static Data is Different</a></h2><p>A couple weeks ago, I came across a <a href=https://retool.com/blog/how-we-upgraded-postgresql-database/>blog from Retool</a> on their experience migrating a 4TB database. They put in place some good procedures and managed a successful migration, but the whole experience was complicated by the size of the database. The size of the database was the result of a couple of very large "logging" tables: an edit log and an audit log.<p><img alt="diagram showing edit tables being small and audit tables being large"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/bef07a79-72fe-4779-e9c1-747c302de400/public><p>The thing about log tables is, they don't change much. They are append-only by design. They are also queried fairly irregularly, and the queries are often time ranged: "tell me what happened then" or "show me the activities between these dates".<p>So, one way the Retool migration could have been easier is if their log tables were constructed as time-ranged partitions. That way there'd only be one "live" table in the partition set (the one with the recent entries) and a larger collection of historical tables. The migration could move the live partition as part of the critical path, and do all the historical partitions later.<p><img alt="moving partitions to parquet with postgres"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/84ac7f95-df8a-4d7d-1bc2-8600f0007100/public><p>Even after breaking up the log tables into manageable chunks, they still remain, in aggregate, pretty big! 
The PostgreSQL <a href=https://www.postgresql.org/docs/current/ddl-partitioning.html>documentation on partitioning</a> has some harsh opinions about stale data living at the end of a partition collection:<blockquote><p>The simplest option for removing old data is to drop the partition that is no longer necessary.</blockquote><p>There's something to that! All those old historical records just fluff up your base backups, and maybe you almost never have occasion to query it.<p>Is there an alternative to dropping the tables?<h2 id=dump-your-data-in-the-lake><a href=#dump-your-data-in-the-lake>Dump Your Data in the Lake</a></h2><p>What if there was a storage option that was still durable, allowed access via multiple query tools, and could integrate transparently into your operational transactional database?<p><img alt="image of a mountain lake with mountains behind it"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/be0d5438-1a47-4158-cf7e-f0659fe77f00/public><p>How about: storing the static data in <a href=https://www.upsolver.com/blog/apache-parquet-why-use>Parquet format</a> but retaining database access to the data via the <a href=https://github.com/adjust/parquet_fdw/>parquet_fdw</a>?<p><img alt="show the different types of data: application, warm log tables in Postgres, cold logs in Parquet"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/6a09daa9-6e72-46e9-ac6e-9306f27a4700/public><p>Sounds a bit crazy, but:<ul><li>A foreign parquet table can participate in a partition along with a native PostgreSQL table.<li>A parquet file can also be consumed by <a href=https://arrow.apache.org/docs/r/reference/read_parquet.html>R</a>, <a href=https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_parquet.html>Python</a>, <a href=https://github.com/xitongsys/parquet-go>Go</a> and a host of cloud applications.<li>Modern PostgreSQL (14+) can parallelize access to foreign tables, so even collections of Parquet files can be scanned 
effectively.<li>Parquet stores data compressed, so you can get way more raw data into less storage.</ul><h2 id=wait-parquet><a href=#wait-parquet>Wait, Parquet?</a></h2><p>Parquet is a language-independent storage format, designed for online analytics, so:<ul><li>Column oriented<li>Typed<li>Binary<li>Compressed</ul><p>A standard table in PostgreSQL will be row-oriented on disk.<p><img alt="diagram of a row-oriented table with 3 columns"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/a76b0bbc-f087-49f8-bd13-114d45b54a00/public><p>This layout is good for things PostgreSQL is expected to do, like query, insert, update and delete data a "few" records at a time. (The value of "a few" can run into the hundreds of thousands or millions, depending on the operation.)<p>A Parquet file stores data column-oriented on the disk, in batches called "row groups".<p><img alt="diagram showing how parquet stores column-oriented files on disk"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/ec88fe38-72af-4ad9-fdd8-9a7306f08c00/public><p>You can see where the Parquet format gets its name: the data is grouped into little squares, like a parquet floor. One of the advantages of grouping data together, is that compression routines tend to work better on data of the same type, and even more so when the data elements have the same values.<h2 id=does-this-even-work><a href=#does-this-even-work>Does This Even Work?</a></h2><p>In a word "yes", but with some caveats: Parquet has been around for several years, but the ecosystem supporting it is still, relatively, in flux. 
New releases of the underlying C++ libraries are still coming out regularly, the <a href=https://github.com/adjust/parquet_fdw>parquet_fdw</a> is only a couple years old, and so on.<p>However, I was able to demonstrate to my own satisfaction that things were baked enough to be interesting.<h3 id=loading-data><a href=#loading-data>Loading Data</a></h3><p>I started with a handy data table of Philadelphia parking infractions, that I used in a previous <a href=https://www.crunchydata.com/blog/performance-and-spatial-joins>blog post on spatial joins</a>, and sorted the file by date of infraction, <code>issue_datetime</code>.<p><img alt="flow diagram of data going from CSV to Postgres to Parquet file"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/8f6ac3e3-8038-4c9e-28f0-ddce41938900/public></p><details><summary>Data Download and Sort</summary><pre><code class=language-bash>#
# Download Philadelphia parking infraction data
#
curl "https://phl.carto.com/api/v2/sql?filename=parking_violations&format=csv&skipfields=cartodb_id,the_geom,the_geom_webmercator&q=SELECT%20*%20FROM%20parking_violations%20WHERE%20issue_datetime%20%3E=%20%272012-01-01%27%20AND%20issue_datetime%20%3C%20%272017-12-31%27" > phl_parking_raw.csv

#
# Sort it
#
sort -k2 -t, phl_parking_raw.csv > phl_parking.csv
</code></pre><p>Sorting the data by <code>issue_datetime</code> will make queries that filter against that column go faster in the column-oriented Parquet setup.</p></details><pre><code class=language-pgsql>-- Create parking infractions table
CREATE TABLE phl_parking (
    anon_ticket_number integer,
    issue_datetime timestamptz,
    state text,
    anon_plate_id integer,
    division text,
    location text,
    violation_desc text,
    fine float8,
    issuing_agency text,
    lat float8,
    lon float8,
    gps boolean,
    zip_code text
    );

-- Read in the parking data
\copy phl_parking FROM 'phl_parking.csv' WITH (FORMAT csv, HEADER true);
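</code></pre><p>A quick count verifies the load (the exact number will depend on the date range downloaded):<pre><code class=language-pgsql>SELECT Count(*) FROM phl_parking;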
</code></pre><p>OK, so now I have an 8M record data table, good for some bulk data experiments. How big is the table?<pre><code class=language-pgsql>SELECT pg_size_pretty(pg_relation_size('phl_parking')) AS pg_table_size;
</code></pre><pre><code class=language-txt> pg_table_size
----------------
 1099 MB
</code></pre><p>Just over 1GB!<h3 id=generating-parquet><a href=#generating-parquet>Generating Parquet</a></h3><p>How do I get a Parquet file?<p>This turns out to be way harder than I expected. Most internet advice was around using Python or Spark to convert CSV files into Parquet. In the end, I used the very new (currently unreleased, coming in GDAL 3.5) <a href=https://gdal.org/drivers/vector/parquet.html>support for Parquet in the GDAL library</a>, and the <code>ogr2ogr</code> command to do the conversion.<pre><code class=language-bash>ogr2ogr -f Parquet \
  /tmp/phl_parking.parquet \
  PG:"dbname=phl host=localhost" \
  phl_parking
</code></pre><p>For these tests the Parquet file will reside on my local disk in <code>/tmp</code>, though for cloud purposes it might reside on a cloud volume, or even (with the <a href=https://github.com/pgspider/parquet_s3_fdw>right software</a>) in an object store.<pre><code class=language-shell>$ ls -lh /tmp/phl_parking.parquet
-rw-r--r--  1 pramsey  wheel   216M 29 Apr 10:44 /tmp/phl_parking.parquet
</code></pre><p>Thanks to data compression, the Parquet file is 20% the size of the database table!<h3 id=querying-parquet><a href=#querying-parquet>Querying Parquet</a></h3><p>Querying Parquet in PostgreSQL involves a number of parts, which can be challenging to build right now.<ul><li><a href=https://arrow.apache.org/install/>Apache libarrow</a>, built with Parquet support enabled.<li><a href=https://github.com/adjust/parquet_fdw/>parquet_fdw</a> itself.</ul><p>Note that <code>parquet_fdw</code> requires <code>libarrow</code> version 6, not the recently released version 7.<p>Once the FDW and supporting libraries are built, though, everything works just like other FDW extensions.<pre><code class=language-pgsql>CREATE EXTENSION parquet_fdw;

CREATE SERVER parquet_srv FOREIGN DATA WRAPPER parquet_fdw;

CREATE FOREIGN TABLE phl_parking_pq (
    anon_ticket_number integer,
    issue_datetime     timestamptz,
    state              text,
    anon_plate_id      integer,
    division           text,
    location           text,
    violation_desc     text,
    fine               float8,
    issuing_agency     text,
    lat                float8,
    lon                float8,
    gps                boolean,
    zip_code           text
    )
  SERVER parquet_srv
  OPTIONS (filename '/tmp/phl_parking.parquet',
           sorted 'issue_datetime',
           use_threads 'true');
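</code></pre><p>This is also where the earlier claim about partitioning comes in: a foreign Parquet table can be attached as one partition of a partitioned table, right next to native partitions. A hypothetical sketch (table names invented, not benchmarked here):<pre><code class=language-pgsql>CREATE TABLE phl_parking_all (LIKE phl_parking)
  PARTITION BY RANGE (issue_datetime);

-- recent, still-changing rows live in a native partition
CREATE TABLE phl_parking_2018
  PARTITION OF phl_parking_all
  FOR VALUES FROM ('2018-01-01') TO ('2019-01-01');

-- cold historical rows are served straight from the Parquet file
CREATE FOREIGN TABLE phl_parking_2012_2017
  PARTITION OF phl_parking_all
  FOR VALUES FROM ('2012-01-01') TO ('2018-01-01')
  SERVER parquet_srv
  OPTIONS (filename '/tmp/phl_parking.parquet');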
</code></pre><p>Compared to the raw table, the Parquet file is similar in performance, usually a little slower. Just blasting through a row count (when the tables are pre-cached in memory).<pre><code class=language-pgsql>-- Give native table same indexing advantage
-- as the parquet file
CREATE INDEX ON phl_parking USING BRIN (issue_datetime);

SELECT Count(*) FROM phl_parking_pq;
-- Time: 1230 ms

SELECT Count(*) FROM phl_parking;
-- Time:  928 ms
</code></pre><p>Similarly, a filter is also slightly faster on the native PostgreSQL table.<pre><code class=language-pgsql>SELECT Sum(fine), Count(1)
FROM phl_parking_pq
WHERE issue_datetime BETWEEN '2014-01-01' AND '2015-01-01';
-- Time: 692 ms

SELECT Sum(fine), Count(1)
FROM phl_parking
WHERE issue_datetime BETWEEN '2014-01-01' AND '2015-01-01';
-- Time: 571 ms
</code></pre><p>The <code>parquet_fdw</code> is very nicely implemented, and will even tell you the execution plan that will be used on the file for a given filter. For example, the previous filter involves opening about 20% of the 132 row groups in the Parquet file.<pre><code class=language-pgsql>EXPLAIN SELECT Sum(fine), Count(1)
FROM phl_parking_pq
WHERE issue_datetime BETWEEN '2014-01-01' AND '2015-01-01';

 Finalize Aggregate  (cost=6314.77..6314.78 rows=1 width=16)
   ->  Gather  (cost=6314.55..6314.76 rows=2 width=16)
         Workers Planned: 2
         ->  Partial Aggregate  (cost=5314.55..5314.56 rows=1 width=16)
               ->  Parallel Foreign Scan on phl_parking_pq  (cost=0.00..5242.88 rows=14333 width=8)
                     Filter: ((issue_datetime >= '2014-01-01 00:00:00-08')
                           AND (issue_datetime <= '2015-01-01 00:00:00-08'))
                     Reader: Single File
                     Row groups: 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
                                 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63
</code></pre><p>For plowing through the whole table and doing a summary, the Parquet query is about the same speed as the PostgreSQL query.<pre><code class=language-pgsql>SELECT issuing_agency, count(1)
FROM phl_parking_pq
GROUP BY issuing_agency;
-- Time: 3043 ms

SELECT issuing_agency, count(1)
FROM phl_parking
GROUP BY issuing_agency;
-- Time: 3103 ms
</code></pre><p>My internal model for the performance differences is that, while the Parquet format has some advantages in avoiding unnecessary reads, via row block filtering and accessing only the columns of interest, those advantages are offset by some inefficiencies in converting the raw data from parquet into the internal PostgreSQL formats.<h2 id=conclusion><a href=#conclusion>Conclusion</a></h2><p>Is that it? Well, we've seen:<ul><li>Parquet is a software-neutral format that is increasingly common in data science and the data centre.<li>Parquet access can be made transparent to PostgreSQL via the <code>parquet_fdw</code> extension.<li>Parquet storage can provide substantial space savings.<li>Parquet storage is a bit slower than native storage, but can offload management of static data from the back-up and reliability operations needed by the rest of the data.<li>Parquet storage of static data is much better than just throwing it out.</ul><p>More importantly, I think there's more to discuss:<ul><li>Can parquet files participate in partitions?<li>Can parquet files be accessed in parallel in collections?<li>Can parquet files reside in cloud object storage instead of filesystem storage?<li>Can PostgreSQL with parquet storage act like a "mid-range big data engine" to crunch numbers on large collections of static data backed by parquet?</ul><p>So far the ecosystem of Parquet tools has been dominated by the needs of data science (R and Python) and a handful of cloud OLAP systems (Apache Spark), but there's no reason PostgreSQL can't start to partake of this common cloud format goodness, and start to swim in the data lake. ]]></content:encoded>
<category><![CDATA[ Fun with SQL ]]></category>
<author><![CDATA[ Paul.Ramsey@crunchydata.com (Paul Ramsey) ]]></author>
<dc:creator><![CDATA[ Paul Ramsey ]]></dc:creator>
<guid isPermalink="false">812d001121a9a2201cb553da451450f74bc7145e7a87f3cf281b99a242142b09</guid>
<pubDate>Tue, 03 May 2022 16:00:00 EDT</pubDate>
<dc:date>2022-05-03T20:00:00.000Z</dc:date>
<atom:updated>2022-05-03T20:00:00.000Z</atom:updated></item></channel></rss>