<?xml version="1.0" encoding="UTF-8" ?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" version="2.0"><channel><title>Brian Pace | CrunchyData Blog</title>
<atom:link href="https://www.crunchydata.com/blog/author/brian-pace/rss.xml" rel="self" type="application/rss+xml" />
<link>https://www.crunchydata.com/blog/author/brian-pace</link>
<image><url>https://www.crunchydata.com/build/_assets/brian-pace.png-N7TMYWMV.webp</url>
<title>Brian Pace | CrunchyData Blog</title>
<link>https://www.crunchydata.com/blog/author/brian-pace</link>
<width>1478</width>
<height>1913</height></image>
<description>PostgreSQL experts from Crunchy Data share advice, performance tips, and guides on successfully running PostgreSQL and Kubernetes solutions</description>
<language>en-us</language>
<pubDate>Mon, 10 Feb 2025 10:30:00 EST</pubDate>
<dc:date>2025-02-10T15:30:00.000Z</dc:date>
<dc:language>en-us</dc:language>
<sy:updatePeriod>hourly</sy:updatePeriod>
<sy:updateFrequency>1</sy:updateFrequency>
<item><title><![CDATA[ Postgres Parallel Query Troubleshooting ]]></title>
<link>https://www.crunchydata.com/blog/postgres-parallel-query-troubleshooting</link>
<description><![CDATA[ Brian has some tips for working with parallel queries, especially if a specific query should be queued and wait for parallel workers. ]]></description>
<content:encoded><![CDATA[ <p>Postgres' ability to execute <a href=https://www.crunchydata.com/blog/parallel-queries-in-postgres>queries in parallel</a> is a powerful feature that can significantly improve query performance, especially on large datasets. However, like all resources, parallel workers are finite. When there aren't enough available workers, Postgres may downgrade a parallel query to a serial (non-parallel) execution. This behavior is reasonable unless the downgraded query's response time falls well outside what the application requires.<p>While helping our clients with Oracle to PostgreSQL migrations, we frequently encounter query downgrades as a challenge. Oracle 11.2 introduced a feature called "Parallel Statement Queuing." This feature prevents downgrades by queuing parallel queries until enough parallel PX servers are available to handle the request.<p>This post explores how parallel queries work, what triggers downgrades, and how you can monitor and optimize parallel worker usage to prevent performance bottlenecks. We'll also explore a sample solution that mirrors Oracle's Parallel Statement Queuing feature.<h2 id=how-parallel-queries-work><a href=#how-parallel-queries-work>How Parallel Queries Work</a></h2><p>When PostgreSQL executes a query in parallel, it divides the work of one or more query nodes (tasks) across multiple processes called parallel workers. These workers cooperate to process parts of the data simultaneously, reducing query time for operations like scans, joins, and aggregations. The database allocates parallel workers up to the maximum defined by the max_parallel_workers configuration setting. 
If parallel workers cannot be allocated, the query is downgraded to serial execution.<h2 id=causes-of-parallel-query-downgrades><a href=#causes-of-parallel-query-downgrades>Causes of Parallel Query Downgrades</a></h2><p>There are a few key reasons why a parallel query may be downgraded:<ul><li><strong>Exhausted Worker Pool</strong> PostgreSQL has a limit on the total number of parallel workers it can spawn, controlled by the max_parallel_workers parameter. If this limit is reached, new parallel queries cannot get the workers they need and may fall back to serial execution.<li><strong>Per-Query Worker Limit</strong> Even if there are available workers, each query is subject to the max_parallel_workers_per_gather setting. If this threshold is met or exceeded, additional queries must either run with reduced parallelism or downgrade to serial.<li><strong>Busy Workload</strong> In a busy system where many queries are requesting parallel workers, competition for resources may lead PostgreSQL to downgrade some queries to avoid overloading the system.<li><strong>Optimizer Stats</strong> Statistics on the table and indexes can lead the optimizer to choose a serial execution path over a parallel path.</ul><h2 id=simulating-the-issue><a href=#simulating-the-issue>Simulating the Issue</a></h2><h3 id=create-a-large-table><a href=#create-a-large-table>Create a Large Table</a></h3><p>The following SQL will create a large table that will be queried to simulate the benefits of parallel query and the impact when a query is downgraded.<pre><code class=language-sql>CREATE TABLE large_table AS
SELECT generate_series(1, 10000000) AS id,
       random() * 1000 AS value;
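
-- Not part of the original example: refreshing planner statistics is a
-- prudent extra step, since stale statistics (the "Optimizer Stats" cause
-- discussed above) can steer the planner toward a serial path.
ANALYZE large_table;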
</code></pre><p>Run a Parallel Query:<pre><code class=language-sql>SET max_parallel_workers_per_gather = 4;

EXPLAIN (ANALYZE)
SELECT * FROM large_table WHERE value > 500 ORDER BY id DESC;
</code></pre><p>The query above with 4 parallel workers runs in less than 3 seconds. Below is the execution plan returned from the EXPLAIN ANALYZE:<pre><code>Gather Merge  (cost=233180.33..827680.90 rows=4965151 width=12) (actual time=1683.415..2367.503 rows=5000337 loops=1)
  Workers Planned: 4
  Workers Launched: 4
  ->  Sort  (cost=232180.27..235283.49 rows=1241288 width=12) (actual time=1653.855..1743.384 rows=1000067 loops=5)
        Sort Key: id DESC
        Sort Method: external merge  Disk: 25640kB
        Worker 0:  Sort Method: external merge  Disk: 25136kB
        Worker 1:  Sort Method: external merge  Disk: 25616kB
        Worker 2:  Sort Method: external merge  Disk: 25496kB
        Worker 3:  Sort Method: external merge  Disk: 25536kB
        ->  Parallel Seq Scan on large_table  (cost=0.00..85327.28 rows=1241288 width=12) (actual time=0.014..191.271 rows=1000067 loops=5)
              Filter: (value > '500'::double precision)
              Rows Removed by Filter: 999933
Planning Time: 0.215 ms
Execution Time: 2511.247 ms
(15 rows)
</code></pre><p>Let's execute the query again, this time disabling parallel query to simulate a downgrade.<pre><code class=language-sql>SET max_parallel_workers_per_gather = 0;

EXPLAIN (ANALYZE)
SELECT * FROM large_table WHERE value > 500 ORDER BY id DESC;
</code></pre><p>With parallel query disabled, the query response time is now just over 10 seconds. In this simple example, an extra 7 seconds may not seem like a big deal. However, imagine that the difference in a real-world example is not 7 seconds but 7 minutes. This is the type of performance degradation we want to prevent. The output below shows the non-parallel execution plan.<pre><code>QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Sort  (cost=816132.49..828545.37 rows=4965151 width=12) (actual time=10314.880..10699.870 rows=5000337 loops=1)
  Sort Key: id DESC
  Sort Method: external merge  Disk: 127368kB
  ->  Seq Scan on large_table  (cost=0.00..179069.14 rows=4965151 width=12) (actual time=0.012..691.074 rows=5000337 loops=1)
        Filter: (value > '500'::double precision)
        Rows Removed by Filter: 4999663
Planning Time: 0.199 ms
Execution Time: 10872.894 ms
(8 rows)
</code></pre><h2 id=how-to-detect-downgraded-parallel-queries><a href=#how-to-detect-downgraded-parallel-queries>How to Detect Downgraded Parallel Queries</a></h2><p>When a parallel query is downgraded to serial execution, it can result in longer query times. Fortunately, PostgreSQL provides several ways to identify such downgrades.<h3 id=using-explain-analyze><a href=#using-explain-analyze>Using EXPLAIN (ANALYZE)</a></h3><p>Reviewing the execution plans above, notice that when the query was downgraded, the plan lacked parallel nodes.<p>If the query is serial, the output will look like this:<pre><code>Seq Scan on large_table  (cost=0.00..179069.14 rows=4965151 width=12) (actual time=0.012..691.074 rows=5000337 loops=1)
</code></pre><p>But a parallel query would have entries like:<pre><code>Parallel Seq Scan on large_table  (cost=0.00..85327.28 rows=1241288 width=12) (actual time=0.014..191.271 rows=1000067 loops=5)
</code></pre><h3 id=monitoring-parallel-workers-in-use><a href=#monitoring-parallel-workers-in-use>Monitoring Parallel Workers in Use</a></h3><p>You can query the pg_stat_activity view to see the number of running parallel workers:<pre><code class=language-sql>SELECT COUNT(1) AS running_workers
FROM pg_stat_activity
WHERE backend_type = 'parallel worker';
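
-- A further sketch (an assumption, not from the original post): group the
-- busy workers by the leader backend they serve (leader_pid requires
-- Postgres 13 or later).
SELECT leader_pid, count(*) AS workers
  FROM pg_stat_activity
 WHERE backend_type = 'parallel worker'
 GROUP BY leader_pid;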
</code></pre><p>Additionally, you can compare it against the total allowed workers:<pre><code class=language-sql>SELECT current_setting('max_parallel_workers')::int AS max_workers;
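
-- Going a step further, a hedged PL/pgSQL sketch of a wait-for-headroom
-- retry loop (names and thresholds here are illustrative assumptions, not
-- a published implementation): wait up to 60 seconds for parallel-worker
-- headroom before running the statement.
DO $$
DECLARE
  max_workers int := current_setting('max_parallel_workers')::int;
  running     int;
BEGIN
  FOR i IN 1..60 LOOP
    SELECT count(*) INTO running
      FROM pg_stat_activity
     WHERE backend_type = 'parallel worker';
    EXIT WHEN running < max_workers;  -- headroom available, proceed
    PERFORM pg_sleep(1);              -- otherwise wait and recheck
  END LOOP;
  -- run the parallel statement here
END $$;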
</code></pre><p>If <code>running_workers</code> reaches or exceeds <code>max_workers</code>, new parallel queries may be downgraded.<h3 id=solutions-for-parallel-query-downgrades><a href=#solutions-for-parallel-query-downgrades>Solutions for parallel query downgrades</a></h3><p>One solution that we have implemented with customers is a retry function. This function mirrors the behavior of Oracle’s Parallel Statement Queuing. The function checks that a certain percentage of parallel workers is available before executing the desired statement. If no workers are available, the function sleeps and rechecks for a specified period of time. An example of this procedure can be found <a href=https://gist.github.com/cbrianpace/55e454f04b0c4feb799aef30f1b416c1>here</a>.<h2 id=conclusion><a href=#conclusion>Conclusion</a></h2><p>Parallel queries in PostgreSQL offer great performance benefits, but they rely on the availability of parallel workers. When workers are in short supply, the database gracefully downgrades queries to ensure system stability, though at the cost of performance. By understanding how parallel query downgrades occur, you can better manage parallel workloads and minimize the impact on your system.<p>Careful tuning, monitoring, and query optimization will help you get the most out of PostgreSQL's parallel query feature—without running into unexpected slowdowns. ]]></content:encoded>
<category><![CDATA[ Production Postgres ]]></category>
<author><![CDATA[ Brian.Pace@crunchydata.com (Brian Pace) ]]></author>
<dc:creator><![CDATA[ Brian Pace ]]></dc:creator>
<guid isPermalink="false">18ecbb6e49749e2140402dcc3f2d5af1e92dfef8ca698467a58c77c939655e46</guid>
<pubDate>Mon, 10 Feb 2025 10:30:00 EST</pubDate>
<dc:date>2025-02-10T15:30:00.000Z</dc:date>
<atom:updated>2025-02-10T15:30:00.000Z</atom:updated></item>
<item><title><![CDATA[ PostgreSQL Snapshots and Backups with pgBackRest in Kubernetes ]]></title>
<link>https://www.crunchydata.com/blog/postgresql-snapshots-and-backups-with-pgbackrest-in-kubernetes</link>
<description><![CDATA[ Backups are out and snapshots are in! Brian breaks down everything you need to know for how to use pgBackRest with snapshots in a Kubernetes environment. ]]></description>
<content:encoded><![CDATA[ <p>Backups are dead. Now that I have your attention, let me clarify. Traditional backups have earned a solid reputation for their reliability over time. However, they are dead in the sense that a backup is useless until it's restored—essentially "resurrected." In this post, we'll explore best practices for managing PostgreSQL snapshots and backups using <a href=https://pgbackrest.org/>pgBackRest</a>. We will then provide some guidance on how to apply these techniques in Kubernetes using the Postgres Operator (PGO) from Crunchy Data. Whether you're overseeing a production environment, handling replicas, or refreshing lower environments, understanding how to effectively manage snapshots is key.</p><!--more--><h2 id=creating-snapshots><a href=#creating-snapshots>Creating snapshots</a></h2><p>There are two effective methods for creating snapshots, but before we dive into those, let's address a common but ill-advised solution.<h3 id=you-shouldnt-snapshot-the-primary-postgresql-instance><a href=#you-shouldnt-snapshot-the-primary-postgresql-instance>You shouldn't snapshot the primary PostgreSQL instance</a></h3><p>When working with PostgreSQL, it's crucial to avoid taking snapshots of the primary instance or of running replicas for a few reasons:<ul><li><strong>Volume Overhead</strong>: Snapshotting the primary instance can impose unnecessary overhead on the underlying volume, potentially affecting performance.<li><strong>Risk of Corruption</strong>: If the database contains a corrupt block, it can propagate to the snapshots, compromising the integrity of your backups and hindering data recovery.<li><strong>Backup Label Management</strong>: To snapshot a running instance, you need to execute pg_backup_start and pg_backup_stop. 
The output of the stop command must be stored, and the appropriate content injected into the backup_label file if the clone is used.</ul><p>To avoid these issues, I recommend two alternative approaches.<h3 id=option-1-delta-restores-with-pgbackrest><a href=#option-1-delta-restores-with-pgbackrest>Option 1: Delta restores with pgBackRest</a></h3><p>The first and preferred approach is to use pgBackRest for delta restores. When you snapshot a PostgreSQL instance, there's a risk of corrupt blocks being included, endangering your snapshots. pgBackRest adds a layer of protection by checking for corrupt blocks during the backup. If the previous backup contains a questionable block or any other error, the snapshot is skipped. Additionally, pgBackRest verifies the block during restoration, providing two layers of protection for your snapshots. To start, create a persistent volume claim that will be used for the delta restore. This PVC will be mounted to the restore job each time and will also be the PVC against which the snapshot is taken after each restore. The restore job should follow these high-level steps:<ul><li>Mount the delta restore PVC<li>Check the last pgBackRest backup for errors (abort if errors are found)<li>Perform a checksum on $PGDATA/backup_label<li>Execute the delta restore with pgBackRest<li>Verify the backup_label checksum after the restore matches the previous checksum (if unchanged, end the job)<li>Snapshot the delta restore PVC<li>Repeat after each pgBackRest backup or as per the desired schedule</ul><h3 id=option-2-using-a-standby-replica><a href=#option-2-using-a-standby-replica>Option 2: Using a standby replica</a></h3><p>The second approach involves using a standby replica. The advantage here is that snapshots can be taken without waiting for a backup, allowing for increased snapshot frequency. 
A job can be submitted to perform the snapshot, following these high-level steps:<ul><li>Ensure the replica is up-to-date by comparing the source LSN to the last applied LSN in the replica<li>Shut down the PostgreSQL standby replica (setting spec.shutdown to true in the Postgres Cluster manifest if using the Postgres Operator)<li>Snapshot the replica's PVC<li>Restart the PostgreSQL replica (setting spec.shutdown to false in the Postgres Cluster manifest if using the Postgres Operator)<li>Verify that replication has resumed correctly</ul><h2 id=consuming-snapshots><a href=#consuming-snapshots>Consuming snapshots</a></h2><p>Now let's use the Postgres Operator (PGO) from Crunchy Data to automate the process of using the snapshots. A common scenario involves refreshing a User Acceptance Test (UAT) database from production. Here's how to do it:<h3 id=identifying-existing-snapshots><a href=#identifying-existing-snapshots>Identifying existing snapshots</a></h3><p>The first step is to identify the snapshot we want to use. We can list available snapshots using kubectl:<pre><code class=language-shell>kubectl get volumesnapshot -n crunchy-snap -o=custom-columns=NAME:.metadata.name,STATUS:.status.readyToUse

NAME                                 STATUS
acmeprod-replica-snapshot-20240830   true
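
# Optional follow-up (a sketch using standard kubectl flags; adjust names to
# your environment): confirm the snapshot's restore size before cloning.
kubectl get volumesnapshot acmeprod-replica-snapshot-20240830 -n crunchy-snap \
  -o=custom-columns=NAME:.metadata.name,SIZE:.status.restoreSize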
</code></pre><h3 id=creating-a-postgresql-clone-from-a-snapshot><a href=#creating-a-postgresql-clone-from-a-snapshot>Creating a PostgreSQL clone from a snapshot</a></h3><p>Once we've identified the snapshot, the next step is to create a new Persistent Volume Claim (PVC) from it. This is where a Kubernetes Operator like the PGO can really add value. By simply specifying the desired end state to the Postgres Operator, the operator handles the details.<pre><code class=language-yaml>apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: acmeuat
spec:
  # Postgres Clone Operation Instructions
  dataSource:
    volumes:
      pgDataVolume:
        pvcName: acmeuat-replica-snapshot-restore
  image: registry.crunchydata.com/crunchydata/crunchy-postgres:ubi8-16.3-2
  port: 5432
  postgresVersion: 16
  instances:
    - name: 'uat'
      replicas: 1
      dataVolumeClaimSpec:
        accessModes:
          - 'ReadWriteOnce'
        # Identify the snapshot to be used
        dataSource:
          apiGroup: snapshot.storage.k8s.io
          kind: VolumeSnapshot
          name: acmeprod-replica-snapshot-20240830
        dataSourceRef:
          apiGroup: snapshot.storage.k8s.io
          kind: VolumeSnapshot
          name: acmeprod-replica-snapshot-20240830
        resources:
          requests:
            storage: 100Gi
        storageClassName: ssd-csi
        # Name the volume the same as the pvcName specified under spec.dataSource.
        volumeName: acmeuat-replica-snapshot-restore
  patroni:
    dynamicConfiguration:
      postgresql:
        parameters:
          shared_buffers: 512MB
          work_mem: 10MB
  backups:
    pgbackrest:
      image: registry.crunchydata.com/crunchydata/crunchy-pgbackrest:ubi8-2.51-2
      repos:
        - name: repo1
          volume:
            volumeClaimSpec:
              accessModes:
                - 'ReadWriteOnce'
              resources:
                requests:
                  storage: 100Gi
</code></pre><p>Submitting this configuration instructs the Postgres Operator to create a new cloned environment using the storage snapshot.<h4 id=instructions-for-clone-operations><a href=#instructions-for-clone-operations>Instructions for clone operations</a></h4><p>The spec.dataSource section tells the Operator that this is a clone operation and specifies which PVC will contain the staged PostgreSQL data. These clones include everything necessary to bring the PostgreSQL instance online and recover to a consistent state. This is possible due to two main scenarios: either the source was a pgBackRest delta restore with the archive-copy option used during backup, or the snapshot was taken from a cleanly shutdown PostgreSQL instance.<h4 id=instructions-for-snapshot-operations><a href=#instructions-for-snapshot-operations>Instructions for snapshot operations</a></h4><p>In spec.instances[0].dataVolumeClaimSpec, two sections guide the Postgres Operator to create a persistent volume claim based on a specific VolumeSnapshot: spec.instances[0].dataVolumeClaimSpec.dataSource and spec.instances[0].dataVolumeClaimSpec.dataSourceRef. In our example, both reference the VolumeSnapshot we identified earlier (acmeprod-replica-snapshot-20240830). Finally, the Operator is instructed to name the newly created persistent volume claim the same as the pvcName specified under spec.dataSource—in this case, acmeuat-replica-snapshot-restore.<h3 id=additional-considerations-for-using-snapshots><a href=#additional-considerations-for-using-snapshots>Additional considerations for using snapshots</a></h3><p>If we wanted to roll the cloned copy forward to a specific point in time, we could include a pgBackRest section under spec.dataSource. This would require pgBackRest to use an object storage solution as one of its repositories. Here's an example:<pre><code class=language-yaml>dataSource:
  volumes:
    pgDataVolume:
      pvcName: acmeuat-replica-snapshot-restore
  pgbackrest:
    options:
      - --type=time
      - --target="2024-08-30 12:30:00"
    configuration:
      - secret:
          name: s3-confuat
    stanza: db
    repo:
      name: repo2
      s3:
        bucket: 'acmeprod-pgbackrest-repo'
        endpoint: 's3.openshift-storage.svc:443'
        region: 'us'
</code></pre><p>Ideally, the storage provider would snapshot the existing snapshot, mounting it back to avoid moving data. However, depending on the provider, data might still be copied internally, which is faster than moving it across different infrastructures. In either case, there are still many advantages to using snapshots.<h2 id=conclusion><a href=#conclusion>Conclusion</a></h2><p>Effectively managing PostgreSQL snapshots and backups requires a strategic approach. By using delta restores with pgBackRest or leveraging a standby replica, you can reduce risks and enhance your backup strategy. Whether you're managing production databases or refreshing environments, these methods offer a reliable and efficient solution. Using a Kubernetes Operator like the PGO from Crunchy Data simplifies the process of consuming snapshots across various use cases. Both snapshot options discussed provide "virtual full copies" of the database, which are efficient in terms of disk usage—allowing multiple "full copies" while consuming disk space only for the changes between snapshots. If you're interested in trying these methods or need assistance with setting up snapshot jobs, feel free to reach out. You can get started with these examples with <a href=https://www.crunchydata.com/products/crunchy-postgresql-for-kubernetes>Crunchy Postgres for Kubernetes</a> using the <a href=https://access.crunchydata.com/documentation/postgres-operator/latest/quickstart>quickstart</a>. Stay tuned for more, as the Crunchy Data engineering team has exciting plans for further automating snapshots in the near future. ]]></content:encoded>
<category><![CDATA[ Kubernetes ]]></category>
<author><![CDATA[ Brian.Pace@crunchydata.com (Brian Pace) ]]></author>
<dc:creator><![CDATA[ Brian Pace ]]></dc:creator>
<guid isPermalink="false">5c19b7547e6af16189e3e7513410772ebedbcfb1a97c1fcce2090bbc374a6f6d</guid>
<pubDate>Thu, 05 Sep 2024 11:00:00 EDT</pubDate>
<dc:date>2024-09-05T15:00:00.000Z</dc:date>
<atom:updated>2024-09-05T15:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ Introducing pgCompare: The Ultimate Multi-Database Data Comparison Tool ]]></title>
<link>https://www.crunchydata.com/blog/introducing-pgcompare-the-ultimate-multi-database-data-comparison-tool</link>
<description><![CDATA[ Crunchy Data is excited to release a new open-source tool called pgCompare that compares data across PostgreSQL, Oracle, MySQL, and MSSQL databases. This can be used for migrations, data syncs, and any other project where you need to cross-compare data sets. ]]></description>
<content:encoded><![CDATA[ <p>In the evolving world of data management, ensuring consistency and accuracy across multiple database systems is paramount. Whether you're migrating data, synchronizing systems, or performing routine audits, the ability to compare data across different database platforms is crucial. Enter <a href=https://github.com/CrunchyData/pgCompare><strong>pgCompare</strong></a>, an open-source tool designed to simplify and enhance the process of data comparison across PostgreSQL, Oracle, MySQL, and MSSQL databases.<p>The key features of pgCompare:<ul><li><strong>Multi-Database support</strong>: pgCompare stands out with its ability to connect and compare data across four major database systems: PostgreSQL, Oracle, MySQL, and MSSQL. This multi-database support is crucial for organizations managing a variety of database technologies.<li><strong>Comparison report</strong>: pgCompare generates detailed reports highlighting the differences between datasets. These reports include information about missing records, mismatched values, and summary statistics, enabling users to quickly identify and address inconsistencies.<li><strong>Stored results</strong>: Results are stored in a Postgres database for tracking historical compares, current status, and alerting.<li><strong>Flexible comparison options</strong>: Users can customize their comparisons with various options such as transforming data and excluding specific columns. This flexibility ensures that comparisons are tailored to meet specific requirements.<li><strong>Performance and scalability</strong>: Built with performance in mind, pgCompare efficiently handles large datasets with minimal impact to source and target systems. 
Its flexible architecture ensures that it can meet the needs of both small and large datasets.</ul><h2 id=getting-started-with-pgcompare><a href=#getting-started-with-pgcompare>Getting Started with pgCompare</a></h2><p>pgCompare is an open-source tool, free to use for anyone, and getting started with pgCompare is simple. The tool can be downloaded from the official git repository, <a href=https://github.com/CrunchyData/pgCompare>https://github.com/CrunchyData/pgCompare</a>, where users will find detailed documentation and tutorials to help them configure and run their first comparisons. With its robust feature set and ease of use, pgCompare is set to become an indispensable tool for database professionals.<p>pgCompare runs as an application in the location of your choice, either a local machine or a remote one closer to your data store. pgCompare creates a separate Postgres database for running the queries to fetch data from your remote data stores. You’ll configure the details for your comparison in the <code>dc_table</code>.<p>After compiling the Java source code (see readme for details), the first step is to copy the <code>pgcompare.properties.sample</code> file to <code>pgcompare.properties</code> and make the necessary edits for the repository, target, and source databases. With the properties file in place, use pgCompare to initialize the repository.<pre><code class=language-shell>java -jar pgcompare.jar --init
</code></pre><p>There is a sample table available in the database directory of the git repository. If you do not have tables already in place, deploy the HR.EMP table to the source and target database of your choice.<p>The last step before executing a compare is to register the tables with the pgCompare repository. To do this, simply execute pgCompare with the discovery flag followed by the schema it should perform the discovery against (hr in this example).<pre><code class=language-shell>java -jar pgcompare.jar --discovery hr
</code></pre><p>To compare the databases, you’ll run something like this:<pre><code class=language-shell>java -jar pgcompare.jar --batch=0
</code></pre><p>The summary output of the compare will appear at the end of the job:<pre><code>Reconciliation Complete:  Table = emp; Equal = 21; Not Equal = 1; Missing Source = 1; Missing Target = 0
Processed 1 tables
Table Summary: Table = emp                           ; Status = out-of-sync ; Equal =                  21; Not Equal =                   1; Missing Source =                   1; Missing Target =                   0
Run Summary:  Elapsed Time (seconds) = 7; Total Rows Processed = 23; Total Out-of-Sync = 2; Through-put (rows/per second) = 3
</code></pre><p>Finally, if there are out-of-sync rows, you can get details on each row and perform a revalidation using the check option:<pre><code class=language-shell>java -jar pgcompare.jar --batch=0 --check
</code></pre><p>The details on out-of-sync rows will appear as the check is performed:<pre><code>Primary Key: {"eid":23}
  Out-of-Sync:  PK = {"eid": 23};  Differences = [{"LAST_NAME":{"source":"Runner","target":"Pace"}}]
Primary Key: {"eid":22}
  Out-of-Sync:  PK = {"eid": 22};  Differences = ["Missing Source"]
</code></pre><h2 id=use-cases-for-pgcompare><a href=#use-cases-for-pgcompare>Use Cases for pgCompare</a></h2><h3 id=data-migration><a href=#data-migration>Data Migration</a></h3><p>When migrating data from one database platform to another, ensuring that all records have been accurately transferred is critical. For example, the Crunchy database migration team uses this tool to validate data during Oracle to Postgres migrations. It is also useful to create a data validation artifact that verifies data consistency before decommissioning old systems.<h3 id=data-synchronization><a href=#data-synchronization>Data Synchronization</a></h3><p>For organizations that run multiple databases concurrently, maintaining synchronization between these systems is essential. The demand for active/active configuration continues to grow. These solutions use logical replication, which introduces risk. To control this risk with a compensating control, pgCompare helps by regularly checking and syncing data across different databases.<h3 id=regulatory-compliance><a href=#regulatory-compliance>Regulatory Compliance</a></h3><p>Many industries require regular audits to ensure data accuracy and compliance with regulations. pgCompare simplifies the auditing process by providing clear and detailed comparison reports. Auditors and regulators always require evidence that data divergence is not occurring. The output from pgCompare is useful in meeting this requirement.<h3 id=quality-assurance><a href=#quality-assurance>Quality Assurance</a></h3><p>In development and testing environments, pgCompare can be used to verify that data remains consistent across various stages of application development and deployment. 
If testing is performed against incorrect or outdated data, it could add risk to production releases.<h2 id=why-pgcompare-is-a-game-changer><a href=#why-pgcompare-is-a-game-changer>Why pgCompare is a game changer</a></h2><p>The traditional methods of data comparison often involve manual processes or scripts that are prone to errors and require significant maintenance. Many solutions rely on comparing row counts, which does not prove the data is indeed equal. pgCompare revolutionizes this process by providing a reliable, automated solution that reduces the risk of errors and saves valuable time.<ul><li><strong>Efficiency</strong>: Automating data comparison reduces the time and effort required for manual checks, allowing database administrators and data engineers to focus on more strategic tasks.<li><strong>Accuracy</strong>: By leveraging advanced algorithms, pgCompare ensures precise identification of discrepancies, enhancing data integrity.<li><strong>Integration</strong>: With support for multiple databases, pgCompare seamlessly integrates into diverse IT environments, making it a versatile tool for any organization.</ul><p>In a world where data accuracy and consistency are paramount, pgCompare offers a reliable, efficient, and scalable solution for comparing data across PostgreSQL, Oracle, MySQL, and MSSQL databases. Whether you're a database administrator, data engineer, or IT manager, pgCompare is the tool you need to ensure your data remains consistent and reliable.<p>Embrace the future of data comparison with pgCompare and transform the way you manage your multi-database environment. ]]></content:encoded>
<category><![CDATA[ Production Postgres ]]></category>
<author><![CDATA[ Brian.Pace@crunchydata.com (Brian Pace) ]]></author>
<dc:creator><![CDATA[ Brian Pace ]]></dc:creator>
<guid isPermalink="false">1a4f9aab08f68c1929fd78e2c32433d75d753a451dfdfeb227bb95b7a9fc2c12</guid>
<pubDate>Fri, 31 May 2024 06:00:00 EDT</pubDate>
<dc:date>2024-05-31T10:00:00.000Z</dc:date>
<atom:updated>2024-05-31T10:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ The Rest is History: Investigations of WAL History Files ]]></title>
<link>https://www.crunchydata.com/blog/the-rest-is-history-investigations-of-wal-history-files</link>
<description><![CDATA[ Brian steps through reviewing WAL history files and a sample recovery scenario. ]]></description>
<content:encoded><![CDATA[ <p>PostgreSQL uses the concept of a timeline to identify a series of WAL records in space and time. Each timeline is identified by a number, expressed as a decimal in some places and hexadecimal in others. A new timeline is generated each time a database is recovered using point-in-time recovery, and sometimes during standby/replica promotion.<p>A common mistake is to assume that a higher timeline number is synonymous with the most recent data. While the highest timeline points to the latest incarnation of the database, it doesn't guarantee that the database indeed holds the most useful data from an application standpoint. To discern whether that holds, a closer examination of the Write-Ahead Logging (WAL) history files is essential, unraveling the messages they convey.<p>In this discussion, we will explore a recovered database and trace the narrative embedded in the history files. By the conclusion, you will have gained a deeper insight into how these history files work within Postgres, empowering you to answer questions about recovery processes and the database's historical journey (or may I call it the 'family tree').<h2 id=assessing-current-state><a href=#assessing-current-state>Assessing current state</a></h2><p>Let's begin by gaining insights into the current status of the database. The information obtained from the <code>pg_controldata</code> output indicates that the database is currently on timeline 11. Take note of the latest checkpoint Write-Ahead Logging (WAL) file, identified as '0000000B0000000100000039'. See my previous post on <a href=https://www.crunchydata.com/blog/postgres-wal-files-and-sequuence-numbers>WAL file naming and numbering</a>.<p>It's worth noting that timelines are sometimes expressed in decimal form, as in the case of 11, and at other times in hexadecimal form, such as '0000000B'. 
While this dual representation can be confusing at first, knowing when and where each form is used makes things much clearer.<pre><code>$ pg_controldata
...
pg_control last modified:             Tue 06 Feb 2024 03:10:53 PM EST
Latest checkpoint location:           1/39000060
Latest checkpoint's REDO location:    1/39000028
Latest checkpoint's REDO WAL file:    0000000B0000000100000039
Latest checkpoint's TimeLineID:       11
...
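
# The TimeLineID above is shown in decimal, while WAL segment and history
# file names embed it in hexadecimal; a quick cross-check in any POSIX shell:
printf '%08X\n' 11        # -> 0000000B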
</code></pre><p>Another crucial aspect to consider is examining the contents of the pg_wal directory to identify the existing history files. If the current database was not the primary when timeline 11 was created, it may only have the latest history file present. In this case, the server we are investigating was and is the primary Postgres instance.<pre><code>$ ls -1 $PGDATA/pg_wal/*.history
00000003.history
0000000A.history
0000000B.history
</code></pre><p>At first glance, one might assume that timeline 11 (remember, the history and segment files use hexadecimal) has a family tree that stems from timelines 10 (0000000A) and 3 (00000003). As with any assumption, we must verify it before accepting it as fact. To do this, let's take a look at the contents of the history file for timeline 11.<pre><code>$ cat 0000000B.history

1	0/710000A0	no recovery target specified

2	0/72000000	no recovery target specified

3	1/22000CE0	before 2024-02-03 00:37:49.381764-05

10	1/230000A0	no recovery target specified
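
# Each line reads: parent timeline, switch LSN, reason. The fields are
# tab-separated, so the fork point taken from timeline 10, for example,
# can be pulled out with awk:
awk '$1 == 10 {print $2}' 0000000B.history        # -> 1/230000A0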
</code></pre><p>Looking at the contents of the history file, we see the 'family tree' of timeline 11. Timeline 11 was created from timeline 10 at LSN 1/230000A0. Timeline 10 was created from timeline 3 at LSN 1/22000CE0. Wait! What about timelines 4 - 9? What do 'no recovery target specified' and 'before 2024-02-03 00:37:49.381764-05' mean? These are the right questions to be asking. Let's continue our quest.<h2 id=history-file-contents><a href=#history-file-contents>History file contents</a></h2><p>Starting from the top of timeline 11's history file, we read down the list to see the family tree. The 'no recovery target specified' lets us know that the timeline was, more than likely, created from a promotion (<code>select pg_promote()</code> for example). Timeline 10, on the other hand, was created using a point-in-time recovery performed against timeline 3. Where does the timestamp come from? Was that the timestamp of the last transaction? More great questions. Let us explore those.<p>The first thing we need to do is examine the contents of the WAL segment that timeline 10 was created from. In this case, timeline 10 was created from timeline 3 at LSN 1/22000CE0. <a href=https://www.crunchydata.com/blog/postgres-wal-files-and-sequuence-numbers>Translating the LSN</a> into the exact WAL segment gives us 000000030000000100000022. The number in the LSN before the slash is the 'high number', while the leading digits after the slash are the low number. These two numbers, prefixed with the timeline, give us the WAL segment name (remember, in hexadecimal). Below is an extract from <code>pg_waldump</code> of this segment.<pre><code>$ pg_waldump 000000030000000100000022
rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 1/22000028, prev 1/21000138, desc: RUNNING_XACTS nextXid 1035 latestCompletedXid 1034 oldestRunningXid 1035
rmgr: Heap        len (rec/tot):     61/  1666, tx:       1035, lsn: 1/22000060, prev 1/22000028, desc: DELETE xmax: 1035, off: 4, infobits: [KEYS_UPDATED], flags: 0x00, blkref #0: rel 1663/5/16684 blk 0 FPW
rmgr: Heap2       len (rec/tot):   1460/  1460, tx:       1035, lsn: 1/220006E8, prev 1/22000060, desc: MULTI_INSERT ntuples: 1, flags: 0x02, offsets: [5], blkref #0: rel 1663/5/16684 blk 0
rmgr: Heap2       len (rec/tot):     63/    63, tx:       1035, lsn: 1/22000CA0, prev 1/220006E8, desc: PRUNE snapshotConflictHorizon: 1034, nredirected: 0, ndead: 4, nunused: 0, redirected: [], dead: [1, 2, 3, 12], unused: [], blkref #0: rel 1663/5/16684 blk 0
rmgr: Transaction len (rec/tot):     34/    34, tx:       1035, lsn: 1/22000CE0, prev 1/22000CA0, desc: COMMIT 2024-02-03 00:37:49.381764 EST
rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 1/22000D08, prev 1/22000CE0, desc: RUNNING_XACTS nextXid 1036 latestCompletedXid 1035 oldestRunningXid 1036
rmgr: XLOG        len (rec/tot):     24/    24, tx:          0, lsn: 1/22000D40, prev 1/22000D08, desc: SWITCH
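
# The segment name dumped above can be derived from the timeline and the
# LSN with shell arithmetic (assuming the default 16MB WAL segment size):
tli=3; high=$((0x1)); low=$((0x22000CE0))
printf '%08X%08X%08X\n' "$tli" "$high" "$((low >> 24))"   # -> 000000030000000100000022
printf '%X\n' "$((low & 0xFFFFFF))"                       # -> CE0, the offset within the segment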

</code></pre><p>LSN 1/22000CE0 is where timeline 10 was created. Looking at the above, we see that position holds a commit with a timestamp of '2024-02-03 00:37:49.381764 EST' (the last set of characters from the LSN is the offset within the WAL segment). The WAL history file is telling us that timeline 10 was created prior to this commit ('before ...'). Whatever transaction was committed at CE0 will not be in timeline 10.<p>So why is this timestamp relevant? To understand this, let me provide some background. The timestamp passed to pgBackRest for the point-in-time recovery was '2024-02-03 00:30:10 EST'. The commit at CE0 was the first transaction after our recovery target time. Thus, the history file shows 'before 2024-02-03 00:37:49.381764 EST'.<p>One last word before we move on. The last piece of information the history file has for us is that timeline 11 was created from timeline 10. Based on the 'no recovery target specified', we can safely assume this was a promotion-type event, or that there was a recovery and no more WAL segments were known or available.<h2 id=missing-timelines><a href=#missing-timelines>Missing timelines</a></h2><p>What about timelines 4 - 9? Don't worry, I have not forgotten this question. To get the history files for those timelines, we need to go to the pgBackRest repository. To retrieve them, I execute a command similar to the following:<pre><code>$ pgbackrest archive-get 00000004.history --stanza=rhino /app/pgdata/rhino.16/pg_wal/00000004.history
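
# The same retrieval for timelines 4 through 9 can be scripted; echoing the
# commands first makes for a safe dry run (stanza and path as above):
for t in 4 5 6 7 8 9; do
    f=$(printf '%08X.history' "$t")
    echo pgbackrest archive-get "$f" --stanza=rhino "/app/pgdata/rhino.16/pg_wal/$f"
done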

</code></pre><p>The above is repeated for timelines 4 - 9. We will start our investigation from timeline 9's history file. Below is the content:<pre><code>$ cat 00000009.history
1	0/710000A0	no recovery target specified

2	0/72000000	no recovery target specified

3	3/DA0000A0	no recovery target specified

4	3/DB0000A0	no recovery target specified

5	3/DC000000	no recovery target specified

6	5/2E0000A0	no recovery target specified

7	5/300000A0	no recovery target specified

8	5/570000A0	no recovery target specified

</code></pre><p>Looking at the above, we see that there was a normal progression (meaning no recoveries) from timelines 1 - 9. This does not mean that timeline 4, for example, does not have any updates past LSN 3/DB0000A0. That is a different topic for a different blog post. If we could graph our timelines, it would look something like this:<p><img alt="wal history tree" loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/ba555588-5228-4355-6ae1-aa92f6a33f00/public><p>Our journey has taken us pretty far to this point. However, we need to answer the opening question about which timeline has the latest data. First, we need to know how long timelines 10 and 11 have existed. To do this, we will once again use <code>pg_waldump</code> to dump the starting WAL segment for timeline 10 (hint: it will be the same segment name with a different timeline prefix). Take a look at the content of this segment:<pre><code>$ pg_waldump 0000000A0000000100000022
rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 1/22000028, prev 1/21000138, desc: RUNNING_XACTS nextXid 1035 latestCompletedXid 1034 oldestRunningXid 1035
rmgr: Heap        len (rec/tot):     61/  1666, tx:       1035, lsn: 1/22000060, prev 1/22000028, desc: DELETE xmax: 1035, off: 4, infobits: [KEYS_UPDATED], flags: 0x00, blkref #0: rel 1663/5/16684 blk 0 FPW
rmgr: Heap2       len (rec/tot):   1460/  1460, tx:       1035, lsn: 1/220006E8, prev 1/22000060, desc: MULTI_INSERT ntuples: 1, flags: 0x02, offsets: [5], blkref #0: rel 1663/5/16684 blk 0
rmgr: Heap2       len (rec/tot):     63/    63, tx:       1035, lsn: 1/22000CA0, prev 1/220006E8, desc: PRUNE snapshotConflictHorizon: 1034, nredirected: 0, ndead: 4, nunused: 0, redirected: [], dead: [1, 2, 3, 12], unused: [], blkref #0: rel 1663/5/16684 blk 0
rmgr: XLOG        len (rec/tot):     42/    42, tx:          0, lsn: 1/22000CE0, prev 1/22000CA0, desc: END_OF_RECOVERY tli 10; prev tli 3; time 2024-02-06 13:25:33.877840 EST
...
</code></pre><p>According to the above, the recovery of the database completed at 1:25 PM EST on 2/6. That means the database (timelines 10 and 11) now contains roughly 2 hours of application changes (assuming the application resumed immediately after the restore). Now, let's compare this to the last WAL segment in timeline 9:<pre><code>$ pg_waldump 000000090000000500000057
...
rmgr: Transaction len (rec/tot):     34/    34, tx:       1661, lsn: 5/570032B0, prev 5/57002B28, desc: COMMIT 2024-02-06 13:23:51.110938 EST
rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 5/570032D8, prev 5/570032B0, desc: RUNNING_XACTS nextXid 1662 latestCompletedXid 1661 oldestRunningXid 1662
...
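
# Putting numbers to the gap: hours between the point where timeline 10
# forked from timeline 3 (2024-02-03 00:37:49 EST) and timeline 9's last
# commit (2024-02-06 13:23:51 EST). GNU date assumed:
a=$(date -d '2024-02-03 00:37:49 EST' +%s)
b=$(date -d '2024-02-06 13:23:51 EST' +%s)
echo $(( (b - a) / 3600 ))        # -> 84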
</code></pre><p>We can determine the latest WAL segment by reviewing the Postgres logs and/or the pgBackRest repository. In this example, WAL segment 000000090000000500000057 was the latest. This segment was restored, and the above <code>pg_waldump</code> shows the last committed transaction was on 2/6/24 at 1:23:51 PM EST. This means that timeline 9 holds roughly 84 hours of application changes, determined by taking the difference between the 'before' timestamp in timeline 11's history file (the point where timeline 10 was forked from timeline 3) and the last committed transaction in timeline 9.<p>Back to our question. Does timeline 11 have the latest data? Maybe. It has, at best, a few hours of more recently processed data, but accepting it means losing roughly 84 hours of data.<h2 id=conclusion><a href=#conclusion>Conclusion</a></h2><p>The timeline history files, plus some handy investigative work, tell us the story of the database's 'family tree'. If you are called upon to restore the database, you can now make wise choices about which timeline contains the data most useful to the business.<p>Before performing a restore or reinitializing a standby/replica, here are some useful steps to help you determine which timeline has the most useful data from a business perspective:<ul><li>look at pg_controldata to see what timeline your database is currently on<li>look at the WAL history files present in $PGDATA/pg_wal/*.history<li>if necessary, restore missing *.history files from your backup repository/location<li>look at when the timelines were created<li>examine the contents of the WAL segment(s)</ul> ]]></content:encoded>
<category><![CDATA[ Production Postgres ]]></category>
<author><![CDATA[ Brian.Pace@crunchydata.com (Brian Pace) ]]></author>
<dc:creator><![CDATA[ Brian Pace ]]></dc:creator>
<guid isPermalink="false">3977c6ee371aa06ed12b5929cca01b6b57e81b6ee2beb51b0279bac2cb2248d0</guid>
<pubDate>Fri, 23 Feb 2024 08:00:00 EST</pubDate>
<dc:date>2024-02-23T13:00:00.000Z</dc:date>
<atom:updated>2024-02-23T13:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ Active Active in Postgres 16 ]]></title>
<link>https://www.crunchydata.com/blog/active-active-postgres-16</link>
<description><![CDATA[ Postgres 16 is out and it has some exciting updates to logical replication. Using a new WAL origin filter, you can avoid loopback transactions. Brian lays out the updates and how to get started with an active active cluster. ]]></description>
<content:encoded><![CDATA[ <p>Support for logical replication arrived in Postgres just over five years ago with Postgres 10. Since then it's had a steady stream of improvements, but logical replication has mostly been limited to migrating data or unidirectional change data capture workflows. With Postgres 16 freshly released today, Postgres now has a better foundation to leverage logical replication for active-active setups.<h3 id=what-is-logical-replication-and-active-active><a href=#what-is-logical-replication-and-active-active>What is logical replication and active-active?</a></h3><p>If you're unfamiliar with the concept of logical replication or what active-active means, we've got you covered.<p><a href=https://www.crunchydata.com/blog/data-to-go-postgres-logical-replication>Logical replication</a> is a method of replicating data changes based on the logical contents of the database, rather than at the physical level (bytes on disk). In simplified terms, you can think of it as <code>INSERT</code>, <code>UPDATE</code>, and <code>DELETE</code> statements. Logical replication allows you to selectively replicate tables, specific columns, or even specific rows based on defined replication rules. This flexibility makes logical replication ideal for scenarios where you need to replicate only a subset of the data or perform transformations during replication.<p>Active-active replication - when referring to databases - is the ability to write to any of two (or more) Postgres instances, with each holding a full, live set of data. Active-active is generally appealing for improving availability, but it brings complexity that can be a significant tradeoff. 
To date, using logical replication bi-directionally has been difficult and, at best, inefficient.<h3 id=transaction-loop-back><a href=#transaction-loop-back>Transaction loop back</a></h3><p>Prior to Postgres 16, making this work at all required special processing to prevent transaction loop back.<p>Transaction loop back occurs when a transaction is replicated from the source to the target and then replicated back to the source. Postgres 16 has a feature that solves this problem. When creating a subscription, the subscriber asks the publisher to ignore transactions that were applied via the replication apply process. This is possible due to the <strong>origin messages in the WAL stream</strong>.<p>If you're still with us up to here, let's dig in and actually work on setting this up with Postgres 16.<h3 id=origin-filter><a href=#origin-filter>Origin filter</a></h3><p>The WAL stream contains information referred to in the documentation as 'origin messages'. These messages identify the source of a transaction: local, or from an apply process. Let's take a look at the following to gain some insight into these messages.<p>Below is an excerpt from pg_waldump for a local transaction:<pre><code class=language-bash>rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 0/47000028, prev 0/46000A40, desc: RUNNING_XACTS nextXid 900 latestCompletedXid 899 oldestRunningXid 900
rmgr: Heap        len (rec/tot):    114/   114, tx:        900, lsn: 0/47000060, prev 0/47000028, desc: HOT_UPDATE off 17 xmax 900 flags 0x10 ; new off 18 xmax 0, blkref #0: rel 1663/5/24792 blk 0
rmgr: Transaction len (rec/tot):     46/    46, tx:        900, lsn: 0/470000D8, prev 0/47000060, desc: COMMIT 2023-06-20 16:43:03.908882 EDT
</code></pre><p>Now let's compare it with the COMMIT line from the logical replication apply process:<pre><code class=language-bash>rmgr: Heap        len (rec/tot):     54/    54, tx:        901, lsn: 0/47000108, prev 0/470000D8, desc: LOCK off 18: xid 901: flags 0x00 LOCK_ONLY EXCL_LOCK KEYS_UPDATED , blkref #0: rel 1663/5/24792 blk 0
rmgr: Heap        len (rec/tot):    117/   117, tx:        901, lsn: 0/47000140, prev 0/47000108, desc: HOT_UPDATE off 18 xmax 901 flags 0x10 KEYS_UPDATED ; new off 19 xmax 901, blkref #0: rel 1663/5/24792 blk 0
rmgr: Transaction len (rec/tot):     65/    65, tx:        901, lsn: 0/470001B8, prev 0/47000140, desc: COMMIT 2023-06-20 16:43:17.412369 EDT; origin: node 1, lsn 6/A95C2780, at 2023-06-20 16:43:17.412675 EDT
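
# Replicated commits are easy to pick out of a saved capture by that origin
# marker; with this dump written to a file (the name is illustrative):
grep -c 'origin: node' waldump.out        # -> 1, only the COMMIT carries an origin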
</code></pre><p>Notice the origin message in the COMMIT entry. This indicates that the transaction originated from 'node 1' at source LSN '6/A95C2780'. With Postgres 16, setting the 'origin=none' flag on the subscriber instructs the publisher to only send messages that do not have this origin information, indicating the transaction was performed locally.<h3 id=sample-environment><a href=#sample-environment>Sample environment</a></h3><p>Let's do a quick test of setting up active-active replication. Start by creating two Postgres 16 instances. Set the following Postgres parameters to configure each instance for logical replication:<pre><code class=language-bash>- wal_level = 'logical'
- max_worker_processes = 16
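
# Related subscriber-side settings worth reviewing (the values shown are
# the Postgres defaults, not recommendations):
- max_logical_replication_workers = 4
- max_sync_workers_per_subscription = 2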
</code></pre><p>Setting the WAL level to logical enables logical decoding. Since we are adding several processes on both sides to extract and replay changes, I also increased max_worker_processes so these workers do not interfere with other activities. After setting the above parameters, restart Postgres. For this example, the two Postgres instances will be referred to as pg1 and pg2.<p>In pg1, execute the following to configure the sample database objects.<pre><code class=language-pgsql>CREATE SEQUENCE emp_eid_seq
START 1
INCREMENT 2;

CREATE TABLE emp (eid int NOT NULL DEFAULT nextval('emp_eid_seq') primary key,
first_name varchar(40),
last_name varchar(40),
email varchar(100),
hire_dt timestamp);

INSERT INTO emp (FIRST_NAME,LAST_NAME,EMAIL,HIRE_DT) VALUES ('John', 'Doe', 'johndoe@example.com', '2021-01-15 09:00:00'),
('Jane', 'Smith', 'janesmith@example.com', '2022-03-20 14:30:00'),
('Michael', 'Johnson', 'michaelj@example.com', '2020-12-10 10:15:00'),
('Emily', 'Williams', 'emilyw@example.com', '2023-05-05 08:45:00'),
('David', 'Brown', 'davidbrown@example.com', '2019-11-25 11:20:00'),
('Sarah', 'Taylor', 'saraht@example.com', '2022-09-08 13:00:00'),
('Robert', 'Anderson', 'roberta@example.com', '2021-07-12 16:10:00'),
('Jennifer', 'Martinez', 'jenniferm@example.com', '2023-02-18 09:30:00'),
('William', 'Jones', 'williamj@example.com', '2020-04-30 12:45:00'),
('Linda', 'Garcia', 'lindag@example.com', '2018-06-03 15:55:00');
</code></pre><p>In pg2, a slightly different script is used to prepare the database objects.<pre><code class=language-pgsql>CREATE SEQUENCE emp_eid_seq
START 2
INCREMENT 2;

CREATE TABLE emp (eid int NOT NULL DEFAULT nextval('emp_eid_seq') primary key,
first_name varchar(40),
last_name varchar(40),
email varchar(100),
hire_dt timestamp);
</code></pre><p>Notice special design considerations are already taking shape. To avoid primary key conflicts, pg1 generates odd primary key values (<code>START 1 INCREMENT 2</code>) and pg2 will use even ones (<code>START 2 INCREMENT 2</code>).<p>The last setup piece is to create a user for replication on both systems.<pre><code class=language-pgsql>CREATE ROLE repuser WITH REPLICATION LOGIN PASSWORD 'welcome1';
GRANT ALL ON ALL TABLES IN SCHEMA public TO repuser;
</code></pre><h3 id=publication><a href=#publication>Publication</a></h3><p>Using a publish/subscribe model, changes captured in one Postgres instance (publisher) can be replicated to multiple Postgres instances (subscribers). Using the command below create a publisher on each instance.<p>pg1:<pre><code class=language-pgsql>CREATE PUBLICATION hrpub1
FOR TABLE emp;
</code></pre><p>pg2:<pre><code class=language-pgsql>CREATE PUBLICATION hrpub2
FOR TABLE emp;
</code></pre><p>The publication name could have been the same for each side, but having different names will help us later on as we measure latency using a custom heartbeat table.<h3 id=subscription><a href=#subscription>Subscription</a></h3><p>With the publishers ready, the next step is to create the subscribers. By default, logical replication starts with an initial snapshot on the publisher and copies the data to the subscriber. Since we are doing bi-directional replication, we will allow the initial snapshot to copy from pg1 to pg2 (<code>copy_data = true</code> on pg2's subscription), but we do not need the reverse copy to happen and therefore disable it on pg1's subscription (<code>copy_data = false</code>).<p>pg1:<pre><code class=language-pgsql>CREATE SUBSCRIPTION hrsub1
  CONNECTION 'host=pg2 port=5432 user=repuser password=welcome1 dbname=postgres'
  PUBLICATION hrpub2
  WITH (origin = none, copy_data = false);
</code></pre><p>pg2:<pre><code class=language-pgsql>CREATE SUBSCRIPTION hrsub2
  CONNECTION 'host=pg1 port=5432 user=repuser password=welcome1 dbname=postgres'
  PUBLICATION hrpub1
  WITH (origin = none, copy_data = true);
</code></pre><p>The key is the origin setting in the subscription (origin = none). The default for origin is 'any', which instructs the publisher to send all transactions to the subscriber regardless of their source. For bi-directional replication this is bad. With the setting of 'any', an update performed on pg1 would be replicated to pg2 (so far so good). That replicated transaction would then be captured and sent back to pg1, and so forth. This is what we call transaction loopback.<p>By setting origin to none, the subscriber requests that the publisher only send changes that have no origin, thus ignoring replicated transactions. Now, Postgres is ready for bi-directional logical replication.<p>After a few seconds, verify that the initial copy of the emp table has occurred between pg1 and pg2.<pre><code class=language-pgsql>SELECT * FROM emp WHERE eid=1;
UPDATE emp SET first_name='Bugs', last_name='Bunny' WHERE eid=1;
</code></pre><p>pg2:<pre><code class=language-pgsql>SELECT * FROM emp WHERE eid=1;
SELECT * FROM emp WHERE eid=3;
UPDATE emp SET first_name='Road', last_name='Runner' WHERE eid=3;
</code></pre><h3 id=dont-get-too-carried-away><a href=#dont-get-too-carried-away>Don't get too carried away</a></h3><p>Setting up bi-directional replication is easy, but not without risk. Before you go open a pull request against prod, there are many things to consider like monitoring, restrictions, change volume, application behavior, backup and recovery, data reconciliation, etc. Let's do a quick exercise to demonstrate application behavior that results in data integrity issues. For the following example open a database connection to both pg1 and pg2.<p>Start a transaction in each session using the following and note the value of email and last_name.<pre><code class=language-pgsql>BEGIN;
SELECT * FROM emp WHERE eid=1;
</code></pre><p>In pg1 update the email address of the employee with EID = 1 but do not commit.<pre><code class=language-pgsql>UPDATE emp SET email='bugs.bunny@acme.com' WHERE eid=1;
</code></pre><p>In pg2, update the last name but do not commit.<pre><code class=language-pgsql>UPDATE emp SET last_name='Jones' WHERE eid=1;
</code></pre><p>The expectation after committing is that last_name will be equal to Jones and email will be <code>bugs.bunny@acme.com</code>. Commit the transaction in pg1 and then in pg2. What happens?<pre><code class=language-pgsql>--pg1
SELECT * FROM emp WHERE eid=1;
 eid | first_name | last_name |          email          |       hire_dt
-----+------------+-----------+-------------------------+---------------------
   1 | Bugs       | Jones     | johndoe@example.com     | 2022-09-25 16:04:47
(1 row)

--pg2
SELECT * FROM emp WHERE eid=1;
 eid | first_name | last_name |        email        |       hire_dt
-----+------------+-----------+---------------------+---------------------
   1 | Bugs       | Bunny     | bugs.bunny@acme.com | 2022-09-25 16:04:47
(1 row)
</code></pre><p>Now both rows are out of sync. In pg1, the update of the email was lost and in pg2 the update of last_name was lost. This happens because the entire row is sent over during logical replication and not just the fields that were updated. In such cases, even eventual consistency is not possible.<h3 id=conclusion><a href=#conclusion>Conclusion</a></h3><p>Logical replication with PostgreSQL offers a flexible and powerful solution for replicating data changes across multiple database instances. Its ability to selectively replicate data, scalability features, and high availability options make it a valuable tool for various use cases. New features in Postgres 16 take an already powerful feature and make it even better. Bi-directional replication is now within reach using native replication. However, one must plan and test to maintain data integrity and consistency.<p>If you want to test it out, try <a href=https://www.crunchydata.com/products/crunchy-bridge>Crunchy Bridge</a>. The default WAL level is logical and we're running Postgres 16. ]]></content:encoded>
<category><![CDATA[ Production Postgres ]]></category>
<author><![CDATA[ Brian.Pace@crunchydata.com (Brian Pace) ]]></author>
<dc:creator><![CDATA[ Brian Pace ]]></dc:creator>
<guid isPermalink="false">1b6e19b21fb11b6b78c8b187f13c2f7e9c20369111dad50d01158d03eeb06103</guid>
<pubDate>Thu, 14 Sep 2023 09:00:00 EDT</pubDate>
<dc:date>2023-09-14T13:00:00.000Z</dc:date>
<atom:updated>2023-09-14T13:00:00.000Z</atom:updated></item></channel></rss>