<?xml version="1.0" encoding="UTF-8" ?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" version="2.0"><channel><title>Craig Kerstiens | CrunchyData Blog</title>
<atom:link href="https://www.crunchydata.com/blog/author/craig-kerstiens/rss.xml" rel="self" type="application/rss+xml" />
<link>https://www.crunchydata.com/blog/author/craig-kerstiens</link>
<image><url>https://www.crunchydata.com/build/_assets/craig-kerstiens.png-JNJT356V.webp</url>
<title>Craig Kerstiens | CrunchyData Blog</title>
<link>https://www.crunchydata.com/blog/author/craig-kerstiens</link>
<width>1512</width>
<height>1378</height></image>
<description>PostgreSQL experts from Crunchy Data share advice, performance tips, and guides on successfully running PostgreSQL and Kubernetes solutions</description>
<language>en-us</language>
<pubDate>Mon, 11 Aug 2025 13:00:00 EDT</pubDate>
<dc:date>2025-08-11T17:00:00.000Z</dc:date>
<dc:language>en-us</dc:language>
<sy:updatePeriod>hourly</sy:updatePeriod>
<sy:updateFrequency>1</sy:updateFrequency>
<item><title><![CDATA[ Indexing JSONB in Postgres ]]></title>
<link>https://www.crunchydata.com/blog/indexing-jsonb-in-postgres</link>
<description><![CDATA[ Level up with JSONB by adding GIN indexes. Also review when not to use a GIN index and how to leverage expression indexes. ]]></description>
<content:encoded><![CDATA[ <p>Postgres is amazing for many reasons. One area that doesn't get nearly enough attention is datatypes. Postgres has a rich set of datatypes, and one important one for devs to be especially excited about is JSONB.<p>JSONB is structured and indexable. The B in JSONB stands for binary (or as we like to think of it, B is for better), which means data is pre-parsed as it is stored. How do you get the most out of JSONB from a retrieval perspective? Enter Postgres' rich indexing support.<h2 id=postgres-index-types><a href=#postgres-index-types>Postgres index types</a></h2><p>Most databases have a standard single index type: B-Tree. B-Tree is a balanced tree structure, and the classic index type you learn about in a CS degree. When you run a standard <code>CREATE INDEX</code>, a B-Tree index is what is created. This works for standard <code>WHERE</code> clauses that target that value.<p>But Postgres has other index types including:<ul><li>GIN - Generalized Inverted Index<li>GiST - Generalized Search Tree<li>SP-GiST - Space-Partitioned Generalized Search Tree<li>BRIN - Block Range Index</ul><p>The go-to Postgres index, the B-Tree, isn’t well suited for JSON documents, or at least not in the way you might think, because of the nature of nested structures. So how do you index your JSONB to query it more efficiently? Enter GIN indexes.<h2 id=gin-indexes-for-jsonb><a href=#gin-indexes-for-jsonb>GIN indexes for JSONB</a></h2><p>Instead of indexing the entire JSONB document as a unit, a GIN index breaks it apart and indexes the keys and values inside. Think of it as creating a giant lookup table under the hood. If you have a document like this:<pre><code class=language-json>{
  "status": "active",
  "plan": "pro"
}
</code></pre><p>A GIN index on this document will store entries like:<pre><code>status => active

plan => pro
</code></pre><p>This makes GIN ideal for answering questions like:<pre><code class=language-sql>SELECT *
FROM my_table
WHERE data @> '{"status": "active"}';
</code></pre><p>The <code>@></code> operator (JSONB containment) is GIN indexable, and GIN can quickly find documents where that key value pair exists.<p>Creating the index is straightforward:<pre><code class=language-sql>CREATE INDEX idx_data_gin
ON my_table
USING gin (data);
</code></pre><p>You can also be more specific with a partial or expression index if you only need to index a subset of keys:<pre><code class=language-sql>CREATE INDEX idx_status_gin
ON my_table
USING gin ((data->'status'));
</code></pre><h3 id=what-queries-use-the-gin-index><a href=#what-queries-use-the-gin-index>What queries use the GIN index?</a></h3><p>The key here is: not all JSONB queries will benefit from a GIN index. Queries that can use GIN include:<ul><li>Containment: <code>data @> '{"plan": "pro"}'</code><li>Key existence: <code>data ? 'status'</code><li>Any key match: <code>data ?| array['plan', 'tier']</code><li>All keys match: <code>data ?&amp; array['plan', 'status']</code></ul><p>These are operator-based queries that map well to the inverted index structure.<h3 id=what-jsonb-queries-dont-use-gin><a href=#what-jsonb-queries-dont-use-gin>What JSONB queries don’t use GIN?</a></h3><p>Here’s where you can sometimes get caught off guard. You added a GIN index, set up the query in your app, and ended up with the worst of both worlds: an index being maintained but slow queries because they can't use the index.<p>GIN indexes won’t help with:<ul><li>Path-based navigation: <code>data->'user'->>'email' = 'craig@example.com'</code><li>Comparisons within the JSONB: <code>(data->>'age')::int > 30</code><li>Regex or pattern matches inside values: <code>data->>'name' ILIKE 'craig%'</code></ul><h3 id=maintenance-with-gin-and-jsonb><a href=#maintenance-with-gin-and-jsonb>Maintenance with GIN and JSONB</a></h3><p>While GIN indexes are powerful, they have a larger write overhead than standard B-tree indexes. This overhead becomes especially apparent if you're frequently updating large JSONB columns. Frequent large updates can lead to index bloat, where the index contains too many references to dead rows and becomes inefficient.<p>Due to the potential for bloat, actively monitoring the health of your GIN indexes is key. You can manage this by periodically running the <code>REINDEX CONCURRENTLY</code> command to rebuild the index and reclaim wasted space. 
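For example, a periodic rebuild of the GIN index created above might look like this:<pre><code class=language-sql>-- Rebuild the index without blocking reads and writes (Postgres 12 and up)
REINDEX INDEX CONCURRENTLY idx_data_gin;
</code></pre><p>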
We also recommend using tools like the <code>pgstattuple</code> extension to check the index's status and identify bloat before it becomes a significant issue.<h2 id=expression-indexes-for-jsonb><a href=#expression-indexes-for-jsonb>Expression indexes for JSONB</a></h2><p>For some cases, creating a typical B-tree expression index can help with JSON that doesn’t fit the GIN use cases. Creating an expression index involves defining an index not just on a column itself, but on the result of an expression computed from that column. For instance:<pre><code class=language-sql>CREATE INDEX idx_orders_total ON orders (((details->>'order_total')::numeric));
</code></pre><p>This builds an index on the orders table. It works by first accessing the details jsonb column, extracting the value associated with the order_total key as text using the ->> operator, and then casting that text value to a numeric type. Postgres then builds a standard B-tree index on these resulting numeric values. This can be efficient for range scans and sorting on values inside the JSONB.<p>Keep in mind that a requirement for using an expression index is that the WHERE clause of your query must exactly match the expression used to define the index. For the index created above, a query like <code>WHERE (details->>'order_total')::numeric > 100</code> would use the index, but a slightly different query, such as <code>WHERE (details->>'order_total')::float > 100</code>, would not.<p>Strict matching means expression indexes are great for optimizing well-defined, static queries that are embedded in your application's code, but not queries that change often.<h2 id=best-practices-for-jsonb-indexing><a href=#best-practices-for-jsonb-indexing>Best practices for JSONB indexing</a></h2><ul><li>Use GIN for containment-style lookups, especially if you don't know the full schema ahead of time.<li>Don’t GIN the whole JSONB column if you only ever query specific keys—use expression indexes or partial indexes instead.<li>Combining GIN with traditional B-tree indexes (i.e. expression indexes) on structured columns is the key to keeping performance predictable.</ul><p>JSONB is a powerful tool in the PostgreSQL toolbox, but to unlock its performance potential, you have to understand the indexing story. GIN indexes are your best friend when you need to query inside documents, but they're not a silver bullet. Knowing when and how to use them is key.<p>If you're working with JSONB-heavy workloads in production, this is one of those "measure twice, index once" situations. 
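One way to check how your indexes are actually being used is through the built-in statistics views; as a quick sketch against the my_table example from earlier:<pre><code class=language-sql>-- Per-index scan counts; idx_scan = 0 suggests an unused index
SELECT indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE relname = 'my_table'
ORDER BY idx_scan;
</code></pre><p>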
If you’re on Crunchy Bridge or managing a large Postgres fleet, having observability into which indexes are getting used (and which aren’t) is just as important.<p><img alt="postgres index types" loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/9caa8f10-c587-4fe3-e714-6c26c954e100/public> ]]></content:encoded>
<author><![CDATA[ Craig.Kerstiens@crunchydata.com (Craig Kerstiens) ]]></author>
<dc:creator><![CDATA[ Craig Kerstiens ]]></dc:creator>
<guid isPermalink="false">1687685494c0ad96e74ea1bc8a5f954b5bbeeb55afab4322bae236f07396f1b4</guid>
<pubDate>Mon, 11 Aug 2025 13:00:00 EDT</pubDate>
<dc:date>2025-08-11T17:00:00.000Z</dc:date>
<atom:updated>2025-08-11T17:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ Citus: The Misunderstood Postgres Extension ]]></title>
<link>https://www.crunchydata.com/blog/citus-the-misunderstood-postgres-extension</link>
<description><![CDATA[ What applications and use cases make the most sense for Citus. ]]></description>
<content:encoded><![CDATA[ <p>Citus is in a small class of the most advanced Postgres extensions that exist. While there are many Postgres extensions out there, few have as many hooks into Postgres or change the storage and query behavior in such a dramatic way. Many who come to Citus arrive with the wrong assumptions about what it does. Citus turns Postgres into a sharded, distributed, horizontally scalable database (that's a mouthful), but it does so for very specific purposes.<p>Citus, in general, is a fit for these types of applications, and only these types:<ul><li><strong>Sharding a multitenant application</strong>: a SaaS/B2B style app, where data is never joined between customers<li><strong>High data volume analytics with a limited user-facing query surface</strong>: specifically where the dashboards are hand-curated with minimal levers and knobs for the user to change (i.e. customers cannot generate arbitrary queries)</ul><p>Use cases that are not a great fit for Citus:<ul><li>Applications lacking rigid control over the queries sent to the database<li>Geographic residency goals or requirements; Citus is distributed for scale, not distributed for edge.</ul><p>Let's look closer at each of the two use cases that Citus is a good fit for.<h2 id=multitenantsaas-applications><a href=#multitenantsaas-applications>Multitenant/SaaS applications</a></h2><p>Multitenant or SaaS applications typically follow a pattern: 1) tenant data is siloed and does not intermingle with any other tenant's data, and 2) a "tenant" is a larger entity like a "team" or "organization".<p>An example of this could be Salesforce. Within Salesforce you have the notion of an organization, and the organization has accounts, customers, and opportunities within them. When you create a Salesforce account, all of your customers and opportunities are solely yours — data is not shared with other Salesforce organizations.<p>For these types of applications, Citus distributes the data for each tenant into a shard. 
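As a sketch of what that looks like in practice (assuming an <code>accounts</code> table with an <code>org_id</code> tenant column):<pre><code class=language-sql>-- Tell Citus to shard the table by the tenant id
SELECT create_distributed_table('accounts', 'org_id');
</code></pre><p>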
Citus handles the splitting of data by creating placement groups that keep related data grouped together, and placing the data within shards on specific nodes. A physical node may contain multiple shards. Let me restate that to understand Citus at a high level:<ul><li><strong>physical node</strong>: the physical container that holds shards<li><strong>shard</strong>: a logical container for data; resides on a physical node, and can be moved between physical nodes<li><strong>placement group</strong>: uses a hash-based algorithm to assign a tenant id to a shard</ul><p>Regarding shards: while it is possible to split a large shard, it is easier to start with the proper configuration. Getting scaling right in the beginning makes things easier later, because moving full shards is simpler than splitting them once they already exist, though that is possible.<p>In a very basic Citus cluster, you might have something that looks like:<p><img alt="simple citus" loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/749ce551-e328-4a8b-0803-d87c44be3300/public><p>Within Citus, multitenant/SaaS applications can work well because sharding is at the core of what Citus does. In the case of a tenant application, the tenant id becomes the shard key. When you shard all the tables on the same key, Citus places each table on the same physical node. Then, queries with joins execute locally on that node, which is faster.<p>Alternatively, poor shard key planning would require joining data across the network. This shuffling of data is detrimental to performance within databases – especially distributed ones. For multitenant/SaaS, leveraging Citus requires the tenant id as a column on every table.<p>In a simpler design, the accounts, customers, and opportunities tables may have only a primary key and a foreign key reference to their parent relationship. In Citus, we need to turn those into composite primary keys that leverage both the tenant id and the foreign key. 
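A sketch of what that might look like for the customer table (column names here are illustrative):<pre><code class=language-sql>CREATE TABLE customer (
  org_id      bigint NOT NULL,  -- tenant id, also the shard key
  customer_id bigint NOT NULL,
  account_id  bigint NOT NULL,  -- reference to the parent account
  email       text,
  first_name  text,
  last_name   text,
  -- the composite key includes the tenant id so rows stay co-located
  PRIMARY KEY (org_id, customer_id)
);
</code></pre><p>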
Extending the above diagram, if we were to now create accounts, customers, and opportunities tables as sharded tables with Citus, we'd have something that roughly resembles the following:<p><img alt="multitenant citus" loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/2d0df3d6-dda3-4602-5827-26958e2dac00/public><p>To speed query performance, include a where condition for the tenant id (below <code>org_id</code>) in all queries as well — this ensures that Citus knows how to push down the join to that single node. A query for open opportunities for a specific tenant might look something like:<pre><code class=language-sql>SELECT customer.email, customer.first_name, customer.last_name, opportunity.amount, opportunity.notes
FROM opportunity,
     customer,
     account
WHERE customer.org_id = account.org_id
  AND opportunity.org_id = account.org_id
  AND opportunity.account_id = customer.account_id
  AND account.org_id = 4;
</code></pre><p>Citus would then quietly re-write this query to target the appropriate sharded tables, and effectively execute the query against only the relevant shards:<p><img alt="multitenant citus 2" loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/f9f20655-c035-4645-a528-2d1c0a4ab000/public><p>Now, there is a bit more to <a href=https://www.crunchydata.com/blog/designing-your-postgres-database-for-multi-tenancy>designing multitenant apps</a> to work with Citus. For example, universal data can be placed in reference tables that can be distributed across all nodes, or local tables that can live solely within the coordinator. For the bulk of a Citus multitenant workload, tables will:<ol><li>Contain your shard key<li>Be indexed using a composite key on shard key + foreign key<li>Be distributed based on the shard key / tenant id<li>Be queried using the shard key / tenant id</ol><p>Let's shift to the other common use case for Citus: what Citus defines as real-time dashboards or analytics.<h2 id=real-time-analytics-with-citus><a href=#real-time-analytics-with-citus>Real-time analytics with Citus</a></h2><p>Where the multitenant use case leverages the shard separation of Citus, here you're looking to leverage the parallelism of Citus.<p>Real-time analytics is indeed a bit vague. It is often some kind of event data that is high volume and is presented as a dashboard, report, monitoring, or alerting. Query patterns are often aggregating in some form; while there may be joins, they happen at a lower level then bubble up to a higher level for aggregation.<p>When operating on a small volume of data, you don't necessarily need Citus — plain old Postgres can work just fine. 
With high data volume, Postgres is not as suited for analytics (unless you're talking Crunchy Data Warehouse, which is optimized for OLAP workloads – <a href=https://www.crunchydata.com/products/warehouse>see more here</a>).<p>With the multitenant/SaaS example, we wanted the query to be pushed down to a single node and operate within a single physical node. With real-time analytics, we want the opposite: queries execute across all the nodes using as many cores as available within the cluster.<p>Let's make this a little more concrete. Start with the idea of a Google Analytics type of event analytics — similar to what is talked about in the Citus docs. Here we may have something like:<pre><code class=language-sql>CREATE TABLE http_request (
  site_id INT,
  ingest_time TIMESTAMPTZ DEFAULT now(),

  url TEXT,
  request_country TEXT,
  ip_address TEXT,

  status_code INT,
  response_time_msec INT
);
</code></pre><p>Let's jump ahead, past how we shard the data, to the query itself. The query gives a better idea of how Citus works in these situations. Let's build a query to return how many 404s and 200s came from the country "Australia", along with the average response time for each:<pre><code class=language-sql>SELECT
  status_code,
  COUNT(*) AS request_count,
  AVG(response_time_msec) AS average_response_time_msec
FROM http_request
WHERE request_country = 'Australia'
  AND status_code IN (200, 404)
GROUP BY status_code;
</code></pre><p>This query will run on every single shard. To process the query as fast as possible, the number of shards should match the number of cores available. If you end up with something like 16 shards in a single node, you'd ideally want 16 cores or even more (to handle additional concurrency). The query will be executed as smaller composable building blocks.<p>Citus processing the count of 404s and 200s is easy. It runs the query as a count on the nodes, then the coordinator calculates the sum of counts. We simply get: the sum of <code>count(*)</code> where country = "Australia" for each status code.<p>But! To calculate the average response time we need to get the count from each shard as well as the sum of the <code>response_time_msec</code> values. From there, Citus recombines all those back on the coordinator. Citus has each shard sending 4 values back (versus all the raw data), and does the final math on the coordinator.<p><img alt="analytics citus" loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/04c7c141-c53b-4ca0-a7e7-544e2b97eb00/public><p>This results in fast aggregations across large datasets. <strong>But</strong> if you haven't thought ahead yet, this only works for very specific queries. Counts and averages are great. If you're looking to do something like median, that gets a little harder. You need the full data set to compute a perfect median. (For now we're setting aside that there are probabilistic approaches to getting approximate results that work quite well. Algorithms like t-digest or KLL can work if you're okay with approximate or inexact answers.)<p>The other big piece is that your queries need to be constructed so that Citus can push any joins down as locally as possible. While our example in this case is a very basic one, most applications still have data they need to join. 
This can work on Citus, but you still need to put thought into making joins as low level as possible — similar to the multitenant app.<p>Within the "real-time analytics" model you need the following in order to be successful:<ol><li>Ability to push joins down vs. joins that move data between nodes<li>Heavy aggregation or roll-up workload<li>Control over crafting the queries that are created</ol><h3 id=concurrency-and-connections><a href=#concurrency-and-connections>Concurrency and connections</a></h3><p>The one "gotcha" of the real-time analytics use case is concurrency. In our simple example of querying <code>http_request</code>, it's great if you only have 4 shards. But in a world of 64 shards spread across 4 nodes, you have 16 shards per node. This means a single query to Postgres could open 16 connections to each node. One weak area of Postgres is connection management and scaling connections, so we recommend and support PgBouncer out of the box across all our products.<h2 id=designing-up-front-for-citus><a href=#designing-up-front-for-citus>Designing up front for Citus</a></h2><p>The biggest success factor with Citus will be your use case. If it is a fit, the more greenfield the application, the better your chances. Existing applications can absolutely be retrofitted to work with Citus, but it often takes some data maneuvering, schema modifications, and query modifications. As with many technologies, if Citus is the right tool for you then "Awesome!", you should absolutely use it. If you have questions about whether Citus may or may not be a fit, reach out to us <a href=https://www.crunchydata.com/contact>@crunchydata</a>. We've helped a number of customers successfully adopt Citus; in other cases, we've helped our customers be successful on different paths. While Citus is very powerful, it is a special purpose tool. ]]></content:encoded>
<category><![CDATA[ Analytics ]]></category>
<category><![CDATA[ Crunchy Data Warehouse ]]></category>
<author><![CDATA[ Craig.Kerstiens@crunchydata.com (Craig Kerstiens) ]]></author>
<dc:creator><![CDATA[ Craig Kerstiens ]]></dc:creator>
<guid isPermalink="false">ca14df7cb65f84286592c6d2b47bb0b7e057db2b9952dced6d71900688dc7310</guid>
<pubDate>Tue, 18 Mar 2025 09:50:00 EDT</pubDate>
<dc:date>2025-03-18T13:50:00.000Z</dc:date>
<atom:updated>2025-03-18T13:50:00.000Z</atom:updated></item>
<item><title><![CDATA[ A change to ResultRelInfo - A Near Miss with Postgres 17.1 ]]></title>
<link>https://www.crunchydata.com/blog/a-change-to-relresultinfo-a-near-miss-with-postgres-17-1</link>
<description><![CDATA[ A new point version was released on Nov 14th for 17.1, 16.5, 15.9, and others. This included an update to the Postgres ABI potentially breaking extensions. Craig digs into the change and what you need to know. ]]></description>
<content:encoded><![CDATA[ <p><span style="background-color: #DCD2CC">Version 17.2 of PostgreSQL has now been released, which rolls back the changes to ResultRelInfo. See the <a href=https://www.postgresql.org/about/news/postgresql-172-166-1510-1415-1318-and-1222-released-2965/>release notes</a> for more details.</span><p>Since its inception, <a href=https://www.crunchydata.com/about>Crunchy Data</a> has released new builds and packages of Postgres on the day community packages are released. Yesterday's minor version release was the first time we made the decision to press pause on a release. Why did we not release it immediately? There appeared to be a very real risk of breaking existing installations. Let's back up and walk through a near miss of Postgres release day.<p>Yesterday when Postgres 17.1 was released there appeared to be breaking changes in the Application Binary Interface (ABI). The ABI is the contract that exists between PostgreSQL and its extensions. Initial reports showed that a number of extensions could be affected, triggering <a href=https://www.postgresql.org/message-id/CABOikdNmVBC1LL6pY26dyxAS2f%2BgLZvTsNt%3D2XbcyG7WxXVBBQ%40mail.gmail.com>warning sirens</a> around the community. In other words, if you were to upgrade from 17.0 to 17.1 and use these extensions, you could be left with a non-functioning Postgres database. 
Further investigation showed that <em>TimescaleDB</em> and <em>Apache AGE</em> were the primarily affected extensions; if you are using them, you should hold off on upgrading to the latest minor release for now, or ensure you rebuild the extension against the latest PostgreSQL release in coordination with your upgrade.<p>The initial list of extensions for those curious:<table><thead><tr><th align=center>Affected<th align=left>Unaffected<tbody><tr><td align=center>Apache AGE<td align=left>HypoPG<tr><td align=center>TimescaleDB<td align=left>pg-query<tr><td align=center><td align=left>Citus<tr><td align=center><td align=left>pglast<tr><td align=center><td align=left>pglogical<tr><td align=center><td align=left>pgpool2<tr><td align=center><td align=left>ogr-fdw<tr><td align=center><td align=left>pg-squeeze<tr><td align=center><td align=left>mysql-fdw</table><p>First, a little bit on Postgres releases. Postgres releases major versions each year, and minor versions roughly every three months. Major versions introduce bigger changes that result in catalog changes, and major version upgrades are intended to be treated with caution. Minor version releases, in contrast, are intended to be only security and bug fix related. They are meant to be able to drop in and continue working within the same existing major version line.<h2 id=about-the-postgres-abi-and-postgres-extension><a href=#about-the-postgres-abi-and-postgres-extension>About the Postgres ABI and Postgres Extensions</a></h2><p>The Postgres ABI, Application Binary Interface, refers to the binary-level interface between Postgres and compiled extensions, modules, or clients that interact with it. The ABI includes various <strong>structs</strong> that define key components of the system's internal workings. These structs represent how PostgreSQL manages and manipulates data, query execution, and memory. 
They typically include things like:<ul><li>System catalogs<li>Function signatures<li>Data structure layouts</ul><h3 id=why-does-the-abi-matter><a href=#why-does-the-abi-matter>Why Does the ABI Matter?</a></h3><p>Developers of extensions ensure their extensions are compatible with the Postgres ABI. Changes to the ABI between major versions necessitate recompiling any extensions to prevent runtime issues.<p>ABI compatibility is typically not maintained across major versions. For instance, an extension compiled for PostgreSQL 14 will likely need to be recompiled for PostgreSQL 15 because ABI changes can occur.<p>PostgreSQL typically aims to maintain compatibility for extensions across minor versions. This means if you build an extension for PostgreSQL 15.1, it should work for 15.2. However, this is not always the case. The nuances of PostgreSQL ABI guarantees have been a sufficiently hot topic that they produced new <a href=https://www.postgresql.org/message-id/E1sZ5TL-0020gU-3t%40gemulon.postgresql.org>documentation</a> on the subject back in July.<p>Yesterday there was a major struct change in 17.1.<h3 id=with-us-so-far-lets-go-deeper><a href=#with-us-so-far-lets-go-deeper>With us so far? Let’s go deeper</a></h3><p>Within a PostgreSQL extension there is C code that includes header files from PostgreSQL itself. When the extension is compiled, functions from those headers are represented as abstract symbols in the binary. The symbols are linked to the actual implementations of the functions, based on the function names, when the extension is loaded. That way, an extension compiled against PostgreSQL 17.0 can usually still be loaded into PostgreSQL 17.1, as long as the function names and signatures from the headers do not change (i.e. the application binary interface or "ABI" is stable).<p>The header files also declare structs that are passed to functions (as pointers). Strictly speaking, the struct definitions are also part of the ABI, but there is more subtlety around that. 
After compilation, structs are mostly defined by their size and offsets of fields, so for instance a name change does not affect ABI (though does affect API). A size change does affect ABI, a little.<pre><code class=language-c>typedef struct ResultRelInfo
{
	NodeTag		type;

        /*... (130 other lines) ...*/

	/* updates do LockTuple() before oldtup read; see README.tuplock */
	bool		ri_needLockTagTuple;

} ResultRelInfo;
</code></pre><p>Most of the time, PostgreSQL allocates structs on the heap using a macro that looks at the compile-time size of the struct ("makeNode") and initializes the bytes to 0. The discrepancy that arose in 17.1 is that a new boolean was added to the ResultRelInfo struct, which <strong>increased its size from 376 bytes to 384</strong>.<p>What happens next depends on who calls makeNode. If it's PostgreSQL 17.1 code, then it uses the new size. If it's an extension compiled against 17.0, then it uses the old size. When it calls a PostgreSQL function with a pointer to a block allocated using the old size, the PostgreSQL function still assumes the new size and may write past the allocated block.<p>That is in general quite problematic. It could lead to bytes being written into an unrelated section of memory, or the program crashing. When running tests, PostgreSQL has internal checks (asserts) to detect that situation and throw warnings.<p>So, in general this particular change in the struct does not actually affect the allocation size (small allocations are rounded up to power-of-two chunk sizes). There may be uninitialized bytes, but that is usually resolved by calling InitResultRelInfo. The issue primarily causes warnings in tests / assert-enabled builds for extensions that allocate ResultRelInfo, though only when running those tests using the new PostgreSQL version with an extension binary that was compiled against the old PostgreSQL versions.<h3 id=did-we-lose-you-yet-and-so-whats-the-result><a href=#did-we-lose-you-yet-and-so-whats-the-result>Did we lose you yet, and so what’s the result?</a></h3><p>Unfortunately, that's not the end of the story. Extensions that rely heavily on ResultRelInfo (like TimescaleDB) can do some things that suffer from the size change. For instance, in one of TimescaleDB's <a href=https://github.com/timescale/timescaledb/blob/2.17.2/src/nodes/hypertable_modify.c#L1245>code paths</a>, it needs to find the index of a ResultRelInfo pointer in an array, and to do so it does pointer math. 
This array was allocated by PostgreSQL (384 bytes), but the Timescale binary assumes 376 bytes, and the result is a nonsense number, which then hits an assert failure or segmentation fault.<p>To be clear, the code here is not really at fault. The contract with PostgreSQL was simply not quite as assumed. For developers of Postgres extensions, that's an interesting lesson for all of us.<p>It's quite possible that there are other issues like this in other extensions. TimescaleDB is quite popular and thus subject to broader testing that identified the issue. That said, as investigation has progressed over the past 24 hours, most extensions built against this header do seem to be safe. Another advanced extension is Citus, and from our investigation the Citus extension does seem safe.<h2 id=what-should-you-do><a href=#what-should-you-do>What should you do?</a></h2><p><strong>If you’re a Crunchy Data customer you do not need to worry</strong>. If you’re using Crunchy Data Postgres on any platform (Crunchy Bridge, Crunchy Postgres for Kubernetes), our build, release, and certification procedures worked as anticipated and appropriate mitigations were applied to any of our software releases. We are fortunate to have a fantastic build and release team that is largely behind the scenes but ensures issues like this are handled. 
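For anyone assessing their own exposure, a quick first step is checking which extensions are installed in each database:<pre><code class=language-sql>-- List installed extensions and their versions
SELECT extname, extversion FROM pg_extension ORDER BY extname;
</code></pre><p>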
If you’re a community Postgres user, or have packaged your own extensions, it is worth reading the pgsql-hackers thread in order to understand which extensions have been determined to potentially be impacted, and to understand the potential mitigations for the affected version upgrades below:<ul><li>17.0 -> 17.1<li>16.4 (and earlier) -> 16.5<li>15.8 (and earlier) -> 15.9<li>14.13 (and earlier) -> 14.14<li>13.16 (and earlier) -> 13.17<li>12.20 (and earlier) -> 12.21</ul><p>In short:<ul><li><p>If you are using the TimescaleDB extension, Timescale is <a href=https://x.com/michaelfreedman/status/1857148280167174455>recommending</a> that users do not perform the minor version upgrades at this time.<li><p>If you are using extensions that are indicated as potentially impacted within the pgsql-hackers list thread, additional caution is warranted before upgrading (though our own <a href=https://x.com/marcoslot/status/1857403646134153438>Marco Slot</a> has confirmed that Citus is not impacted)<li><p>If you are compiling Postgres extensions from source, make sure your extensions have been compiled against the latest point version, 17.1<li><p>If you are developing or installing custom Postgres extensions, it is worth taking the time to understand the impact of this particular issue and the Postgres ABI ‘commitments’.<p>Ultimately the default guidance of performing Postgres minor version upgrades stands, and the impact of this issue was not as broad as was initially feared. The Postgres community once again provided a timely minor version release to address a collection of CVEs and fixes, and the community promptly responded to a report of potential issues. The release processes of the ecosystem of Postgres providers worked as intended and it appears any potential impact was largely averted.<p>That said, software is hard, and databases in particular are tricky. 
As Postgres extensions grow in popularity these risks will continue to emerge and it is helpful to understand these details or ensure when selecting who is supporting you on your database they understand these issues.</ul> ]]></content:encoded>
<category><![CDATA[ Postgres 17 ]]></category>
<author><![CDATA[ Craig.Kerstiens@crunchydata.com (Craig Kerstiens) ]]></author>
<dc:creator><![CDATA[ Craig Kerstiens ]]></dc:creator>
<guid isPermalink="false">455f5668fc452c9083751a8b6a1e7f75eacc374196b14cd6808e41185a36b328</guid>
<pubDate>Fri, 15 Nov 2024 10:30:00 EST</pubDate>
<dc:date>2024-11-15T15:30:00.000Z</dc:date>
<atom:updated>2024-11-15T15:30:00.000Z</atom:updated></item>
<item><title><![CDATA[ pg_parquet: An Extension to Connect Postgres and Parquet ]]></title>
<link>https://www.crunchydata.com/blog/pg_parquet-an-extension-to-connect-postgres-and-parquet</link>
<description><![CDATA[ Crunchy Data is excited to release a new extension so you can write Postgres data to Parquet or pull data from Parquet into Postgres. Craig has the details and sample code. ]]></description>
<content:encoded><![CDATA[ <p>Today, we’re excited to release <a href=https://github.com/CrunchyData/pg_parquet/>pg_parquet</a> - an open source Postgres extension for working with Parquet files. The extension reads and writes Parquet files on local disk or in S3 natively from Postgres. With pg_parquet you're able to:<ul><li>Export tables or queries from Postgres to Parquet files<li>Ingest data from Parquet files to Postgres<li>Inspect the schema and metadata of existing Parquet files</ul><p>Code is available at: <a href=https://github.com/CrunchyData/pg_parquet/>https://github.com/CrunchyData/pg_parquet/</a>.<p>Read on for more background on why we built pg_parquet or jump below to get a walkthrough of working with it.</p><!--more--><h2 id=why-pg_parquet><a href=#why-pg_parquet>Why pg_parquet?</a></h2><p>Parquet is a great columnar file format that provides efficient compression of data. Working with data in Parquet makes sense when you're sharing data between systems. You might be archiving older data, or need a format suitable for analytics as opposed to transactional workloads. While there are plenty of tools for working with Parquet, Postgres users have been left to figure things out on their own. Now, thanks to pg_parquet, Postgres and Parquet easily and natively work together. Better yet, you can work with Parquet without needing yet another data pipeline to maintain.<p><strong>Wait, what is Parquet?</strong> Apache Parquet is an open-source, standard, column-oriented file format that grew out of the Hadoop era of big data. Parquet organizes data within a file in a way that is optimized for SQL queries. In the world of data lakes, Parquet is ubiquitous.<h2 id=using-pg_parquet><a href=#using-pg_parquet>Using pg_parquet</a></h2><p>By extending the Postgres <code>COPY</code> command, pg_parquet lets you efficiently copy data to and from Parquet, on your local server or in S3.<pre><code class=language-sql>-- Copy a query result into a Parquet file on the Postgres server
COPY (SELECT * FROM table) TO '/tmp/data.parquet' WITH (format 'parquet');

-- Copy a query result into Parquet in S3
COPY (SELECT * FROM table) TO 's3://mybucket/data.parquet' WITH (format 'parquet');

-- Load data from Parquet in S3
COPY table FROM 's3://mybucket/data.parquet' WITH (format 'parquet');
</code></pre><p>Let's take an example products table, but not just a basic version, one that has composite Postgres types and arrays:<pre><code class=language-sql>-- create composite types
CREATE TYPE product_item AS (id INT, name TEXT, price float4);
CREATE TYPE product AS (id INT, name TEXT, items product_item[]);

-- create a table with complex types
CREATE TABLE product_example (
    id int,
    product product,
    products product[],
    created_at TIMESTAMP,
    updated_at TIMESTAMPTZ
);

-- insert some rows into the table
INSERT INTO product_example values (
    1,
    ROW(1, 'product 1', ARRAY[ROW(1, 'item 1', 1.0), ROW(2, 'item 2', 2.0), NULL]::product_item[])::product,
    ARRAY[ROW(1, NULL, NULL)::product, NULL],
    now(),
    '2022-05-01 12:00:00-04'
);

-- copy the table to a parquet file
COPY product_example TO '/tmp/product_example.parquet' (format 'parquet', compression 'gzip');

-- truncate, then copy the parquet file back into the table
-- (COPY FROM appends rows, so truncating first gives an exact round trip)
TRUNCATE product_example;
COPY product_example FROM '/tmp/product_example.parquet';

-- show table
SELECT * FROM product_example;
</code></pre><h2 id=inspecting-parquet-files><a href=#inspecting-parquet-files>Inspecting Parquet files</a></h2><p>In addition to copying data in and out of parquet, you can explore existing Parquet files to start to understand their structure.<pre><code class=language-sql>-- Describe a parquet schema
SELECT name, type_name, logical_type, field_id
FROM parquet.schema('s3://mybucket/data.parquet');
┌──────────────┬────────────┬──────────────┬──────────┐
│     name     │ type_name  │ logical_type │ field_id │
├──────────────┼────────────┼──────────────┼──────────┤
│ arrow_schema │            │              │          │
│ name         │ BYTE_ARRAY │ STRING       │        0 │
│ s            │ INT32      │              │        1 │
└──────────────┴────────────┴──────────────┴──────────┘
(3 rows)

-- Retrieve parquet detailed metadata including column statistics
SELECT row_group_id, column_id, row_group_num_rows, row_group_bytes
FROM parquet.metadata('s3://mybucket/data.parquet');
┌──────────────┬───────────┬────────────────────┬─────────────────┐
│ row_group_id │ column_id │ row_group_num_rows │ row_group_bytes │
├──────────────┼───────────┼────────────────────┼─────────────────┤
│            0 │         0 │                100 │             622 │
│            0 │         1 │                100 │             622 │
└──────────────┴───────────┴────────────────────┴─────────────────┘
(2 rows)

-- Retrieve parquet file metadata such as the total number of rows
SELECT created_by, num_rows, format_version
FROM parquet.file_metadata('s3://mybucket/data.parquet');
┌────────────┬──────────┬────────────────┐
│ created_by │ num_rows │ format_version │
├────────────┼──────────┼────────────────┤
│ pg_parquet │      100 │ 1              │
└────────────┴──────────┴────────────────┘
(1 row)
</code></pre><h2 id=parquet-and-the-cloud><a href=#parquet-and-the-cloud>Parquet and the cloud</a></h2><p>If you’re working with object storage to manage your Parquet files – likely S3 or something S3-compatible – pg_parquet works there too. If you configure your <code>~/.aws/credentials</code> and <code>~/.aws/config</code> files, pg_parquet will automatically use those credentials, allowing you to copy to and from your cloud object storage.<pre><code>$ cat ~/.aws/credentials
[default]
aws_access_key_id = AKIAIOSFODNN7EXAMPLE
aws_secret_access_key = wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

$ cat ~/.aws/config
[default]
region = eu-central-1
</code></pre><p>Being able to directly access object storage via the <code>COPY</code> command is very useful for archival, analytics, importing data written by other applications, and moving data between servers.<h2 id=in-conclusion><a href=#in-conclusion>In conclusion</a></h2><p>Postgres has long been trusted for transactional workloads, but we believe in the very near future it will be equally capable for <a href=https://www.crunchydata.com/products/crunchy-bridge-for-analytics>analytics</a>. We’re excited to release <a href=https://github.com/CrunchyData/pg_parquet/>pg_parquet</a> as one more step towards making Postgres the <strong>only</strong> database you need. ]]></content:encoded>
<category><![CDATA[ Analytics ]]></category>
<author><![CDATA[ Craig.Kerstiens@crunchydata.com (Craig Kerstiens) ]]></author>
<dc:creator><![CDATA[ Craig Kerstiens ]]></dc:creator>
<guid isPermalink="false">5499e60580bc1ed679f66b5ef575dca5fd3d510feff53b228b9134c3398c2a0c</guid>
<pubDate>Thu, 17 Oct 2024 10:30:00 EDT</pubDate>
<dc:date>2024-10-17T14:30:00.000Z</dc:date>
<atom:updated>2024-10-17T14:30:00.000Z</atom:updated></item>
<item><title><![CDATA[ Row Level Security for Tenants in Postgres ]]></title>
<link>https://www.crunchydata.com/blog/row-level-security-for-tenants-in-postgres</link>
<description><![CDATA[ Craig shows you how to use Row Level Security to isolate and secure data in a multi-tenant application. ]]></description>
<content:encoded><![CDATA[ <p>Row-level security (RLS) in Postgres is a feature that allows you to control which rows a user is allowed to access in a particular table. It enables you to define security policies at the row level based on certain conditions, such as user roles or specific attributes in the data. Most commonly this is used to limit access based on the database user connecting, but it can also be handy to ensure data safety for multi-tenant applications.<h2 id=creating-tables-with-row-level-security><a href=#creating-tables-with-row-level-security>Creating tables with row level security</a></h2><p>We're going to assume our tenants in this case are part of an organization, and we have an events table with events that always correspond to an organization:<pre><code class=language-sql>CREATE TABLE organization (
    -- note: uuid_generate_v4() requires: CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
    org_id UUID DEFAULT uuid_generate_v4() PRIMARY KEY,
    name VARCHAR(255) UNIQUE,
    created_at TIMESTAMPTZ default now(),
    deleted_at TIMESTAMPTZ
);

CREATE TABLE events (
  org_id UUID,
  event_type TEXT,
  event_value INT,
  occurred_at TIMESTAMPTZ default now()
);
</code></pre><p>We're going to turn on RLS on our events table.<pre><code class=language-sql>ALTER TABLE events ENABLE ROW LEVEL SECURITY;
</code></pre><p>And then set a policy that is enforced based on the connected database user.<pre><code class=language-sql>CREATE POLICY event_isolation_policy
  ON events
  USING (org_id::TEXT = current_user);
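
-- For this policy to match, each tenant needs a database role named after
-- its org id (a sketch; the UUID is the example used later in this post):
CREATE ROLE "c4062d63-335e-4631-b03c-504d0eb88122" LOGIN;
GRANT SELECT ON events TO "c4062d63-335e-4631-b03c-504d0eb88122";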
</code></pre><p>Now if you had an org id of <code>c4062d63-335e-4631-b03c-504d0eb88122</code> and created a database user with that login, when they connected to the database and queried the events table they’d see only their own events.<h2 id=using-session-variables><a href=#using-session-variables>Using Session Variables</a></h2><p>The above works great when you’re giving out raw database access. But creating a new database user and connecting with that unique user for each new request that comes in is a lot of overhead, and it takes away many of the tools that exist for managing and scaling Postgres connections.<p>Instead of a user per customer, what we’re going to do is set a session variable when we connect.<pre><code class=language-sql>CREATE POLICY event_session_user
  ON events
  TO application
  USING (org_id = NULLIF(current_setting('rls.org_id', TRUE), '')::uuid);
</code></pre><p>Now when you connect, you’ll set the value in your session and can then query the <code>events</code> table:<pre><code class=language-sql>SET rls.org_id = 'c4062d63-335e-4631-b03c-504d0eb88122';
SELECT * FROM events;
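
-- If the variable is unset, NULLIF turns the empty string into NULL and the
-- policy matches nothing, so a forgotten SET fails closed rather than open:
RESET rls.org_id;
SELECT * FROM events;  -- no rows for the application role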
</code></pre><p>For example, in a web application, you might set the <code>rls.org_id</code> variable based on the current user's authentication. Use a function to get the current org_id from the request and then set it on the connection.<pre><code class=language-python>@app.before_request
def before_request():
    org_id = get_current_org_id(request)
    g.db = psycopg2.connect(database="mydb")
    with g.db.cursor() as cur:
        cur.execute("SELECT set_config('rls.org_id', %s, false)", (org_id,))
</code></pre><p>With this setup, any queries executed within the request will automatically be filtered based on the current tenant ID, ensuring that each tenant can only access their own data.<p>Keep in mind that when <a href=https://www.crunchydata.com/blog/designing-your-postgres-database-for-multi-tenancy>designing your app for multi-tenancy</a>, ideally you have that org_id in every table.<p><img alt="sample multi-tenant schema" loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/7488c2e3-d2f8-4a06-7e21-c1fe6f928d00/public><h2 id=conclusion><a href=#conclusion>Conclusion</a></h2><p>Postgres Row-Level Security provides a powerful mechanism for securing multi-tenant applications, allowing you to enforce data isolation and privacy at the row level. By defining policies based on tenant IDs, org IDs, or other criteria, you can ensure that each tenant can only access their own data, enhancing the overall security of your application.<p>Check out our browser based tutorial for learning more about <a href=https://www.crunchydata.com/developers/playground/row-level-security>Row Level Security</a> and session variables. ]]></content:encoded>
<category><![CDATA[ Tutorial ]]></category>
<author><![CDATA[ Craig.Kerstiens@crunchydata.com (Craig Kerstiens) ]]></author>
<dc:creator><![CDATA[ Craig Kerstiens ]]></dc:creator>
<guid isPermalink="false">0617264416b334ca8f7607ff82a4baf2dd1ef6416da15f25740bd6343e1f682b</guid>
<pubDate>Wed, 03 Apr 2024 09:00:00 EDT</pubDate>
<dc:date>2024-04-03T13:00:00.000Z</dc:date>
<atom:updated>2024-04-03T13:00:00.000Z</atom:updated></item></channel></rss>