<?xml version="1.0" encoding="UTF-8" ?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" version="2.0"><channel><title>CrunchyData Blog</title>
<atom:link href="https://www.crunchydata.com/blog/topic/ruby-on-rails/rss.xml" rel="self" type="application/rss+xml" />
<link>https://www.crunchydata.com/blog/topic/ruby-on-rails</link>
<image><url>https://www.crunchydata.com/card.png</url>
<title>CrunchyData Blog</title>
<link>https://www.crunchydata.com/blog/topic/ruby-on-rails</link>
<width>800</width>
<height>419</height></image>
<description>PostgreSQL experts from Crunchy Data share advice, performance tips, and guides on successfully running PostgreSQL and Kubernetes solutions</description>
<language>en-us</language>
<pubDate>Wed, 20 Dec 2023 08:00:00 EST</pubDate>
<dc:date>2023-12-20T13:00:00.000Z</dc:date>
<dc:language>en-us</dc:language>
<sy:updatePeriod>hourly</sy:updatePeriod>
<sy:updateFrequency>1</sy:updateFrequency>
<item><title><![CDATA[ Using acts_as_tenant for Multi-tenant Postgres with Rails ]]></title>
<link>https://www.crunchydata.com/blog/using-acts_as_tenant-for-multi-tenant-postgres-with-rails</link>
<description><![CDATA[ Chris walks through using the acts_as_tenant gem. He shows some example code to get started with this gem, how to migrate, and other tips for working with B2B or multi-tenant applications. ]]></description>
<content:encoded><![CDATA[ <p>Since its launch, Ruby on Rails has been a preferred open source framework for small-team B2B SaaS companies. Ruby on Rails follows a convention-over-configuration mantra: by settling common technical choices up front, it frees teams to focus on the decisions that matter. Out of the box, developers get an ORM (ActiveRecord), a templating engine (ERB), helper methods (like <code>number_to_currency</code>), a controller layer (ActionController), directory defaults (<code>app/{models,controllers,views}</code>), authentication methods (<code>has_secure_password</code>), and more.<p><a href=https://www.crunchydata.com/blog/designing-your-postgres-database-for-multi-tenancy>Multi-tenancy</a> is the backbone of B2B SaaS products, yet core Rails remains unopinionated about how to implement it. Over the years, there have been many different Ruby gem implementations of multi-tenancy. Many of these gems were built for complicated situations — either adapting to scaling needs or to regulated industries that require physical separation of data — and many required deep integration with your Rails application code.<h2 id=enter-acts_as_tenant><a href=#enter-acts_as_tenant>Enter acts_as_tenant</a></h2><p>With all that as context, the <code>acts_as_tenant</code> gem is refreshingly simple. <a href=https://github.com/ErwinM/acts_as_tenant><code>acts_as_tenant</code></a> recently released version 1.0 after 12 years of development — so it’s not new. The gem implements multi-tenant best practices by augmenting Rails’ ActiveRecord ORM. It:<ul><li>protects developers from building queries that return other tenants’ records<li>requires a <code>tenant_id</code> column on the tables backing tenant-specific models<li>automatically adds the <code>tenant_id</code> scope to queries<li>includes ActionController, ActiveRecord, and ActiveJob helpers to insert new records under the scoped tenant</ul><p>Acts_as_tenant is built for row-level multi-tenancy, and that is it. 
So, there is no need to manage multiple databases or schemas — it keeps things simple. One of the best things I can say about <code>acts_as_tenant</code> is that it can be adopted by an existing application codebase. Too many times, with the older multi-tenant gems, the implementation was invasive and required complex refactoring.<p><strong>What it’s not:</strong> acts_as_tenant is not for account-based sharding, whether schema-based or multi-cluster. It’s purely for multi-tenant safety.<h2 id=for-the-paranoid><a href=#for-the-paranoid>For the paranoid</a></h2><p>I have built a few multi-tenant apps in industries with data regulation (think finance and education). I am overly cautious when building multi-tenant apps — so this guardrail is my favorite.<p>To enforce the <code>tenant_id</code> on every ActiveRecord query within an application, add the following to an initializer file at <code>config/initializers/acts_as_tenant.rb</code>:<pre><code class=language-ruby>ActsAsTenant.configure do |config|
	config.require_tenant = true
end
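
# With require_tenant set, any query that is not scoped to a tenant raises
# an error instead of silently returning cross-tenant rows. Illustrative:
#
#   Post.count                                       # raises ActsAsTenant::Errors::NoTenantSet
#   ActsAsTenant.with_tenant(account) { Post.count } # fine: scoped to one account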
</code></pre><p>Having worked in a few multi-tenant apps where showing data to another customer is consequential, I wish an enforced <code>tenant_id</code> requirement like this were the default for every query. One of the apps I wrote required high-performance, large-scale data loads. We had an intermittent bug where people would be assigned to the incorrect tenant. After tracking down the bug, we traced the incident to the handling of multiple <code>external_ids</code>:<pre><code class=language-sql>-- buggy code
SELECT
  *
FROM people
WHERE tenant_id = $1 AND external_id = $2 OR other_external_id = $2;

-- correct code
SELECT
  *
FROM people
WHERE tenant_id = $1 AND (external_id = $2 OR other_external_id = $2);
</code></pre><p>The lesson: wrap your <code>OR</code> conditions in parentheses. The buggy code was interpreted as:<pre><code class=language-sql>(tenant_id = $1 AND external_id = $2) OR other_external_id = $2;
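-- AND binds tighter than OR, so the unparenthesized predicate matches a
-- row from ANY tenant whenever other_external_id = $2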
</code></pre><p>When using acts_as_tenant with ActiveRecord models, you avoid this bug entirely. Below, you’ll see that ActiveRecord wraps the conditions for you:<p><img alt="active record output" loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/7e4070b3-c6bc-48e2-09fb-bd4b55f1ff00/public><p>Remember, if you choose to use raw SQL, you’ll need to keep your guard up.<h2 id=testing-from-rails-new-app><a href=#testing-from-rails-new-app>Testing from <code>rails new app</code></a></h2><p>To install in a new Rails application, do the following:<ol><li>Run <code>rails new multi-tenant-app</code><li>Decide on your application’s tenant model: typically <code>Organization</code> or <code>Account</code> or <code>Team</code> or <code>School</code>. Use the underscored version of the name with <code>_id</code> appended as your tenant id column, such as <code>organization_id</code> or <code>account_id</code> or <code>team_id</code> or <code>school_id</code>. Below, we will use the tenant name <code>Account</code>.<li>Add <code>gem "acts_as_tenant"</code> to <code>Gemfile</code>, and run <code>bundle install</code>.<li>Create some models:</ol><pre><code class=language-bash>rails g model Account name:string
rails g model User email:string account_id:integer
rails g model Post content:string user_id:integer account_id:integer

rails db:create &#38&#38 rails db:migrate
</code></pre><ol start=5><li>Add the following to <code>app/models/account.rb</code></ol><pre><code class=language-ruby>class Account &#60 ApplicationRecord

  has_many :users
  has_many :posts

end
</code></pre><ol start=6><li>Add the following to <code>app/models/post.rb</code>:</ol><pre><code class=language-ruby>class Post &#60 ApplicationRecord

  belongs_to :user
  acts_as_tenant :account

end
</code></pre><ol start=7><li>Add the following to <code>app/models/user.rb</code>:</ol><pre><code class=language-ruby>class User &#60 ApplicationRecord
  acts_as_tenant :account
  validates_uniqueness_to_tenant :email
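  # validates_uniqueness_to_tenant scopes the uniqueness check to the tenant,
  # so the same email can exist once per Account rather than once globally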
end
</code></pre><ol start=8><li>Now, let’s experiment with the Rails REPL:</ol><pre><code class=language-bash>rails console
</code></pre><p>Then, you can run the following commands:<pre><code class=language-ruby>first_account = Account.create!(name: "First Account")
last_account = Account.create!(name: "Last Account")

ActsAsTenant.with_tenant(first_account) do
  user = User.create!(email: "test@example.com")
  post = Post.create!(user: user, content: "Lorem Ipsum")
end

ActsAsTenant.with_tenant(first_account) do
  Post.first.content # -> "Lorem Ipsum"
end

ActsAsTenant.with_tenant(last_account) do
  Post.first.nil? # -> true because no posts exist for this tenant
end

Post.first.content # -> "Lorem Ipsum"

ActsAsTenant.configure do |config|
  config.require_tenant = true
end

Post.first.content # -> ActsAsTenant::Errors::NoTenantSet (ActsAsTenant::Errors::NoTenantSet)
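
# If you genuinely need to cross tenants (admin tooling, maintenance), the
# gem also offers an explicit escape hatch (verify against your gem version):
#
#   ActsAsTenant.without_tenant do
#     Post.count # counts posts across every tenant
#   end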
</code></pre><p>When looking at the queries run by ActiveRecord, you’ll see it automatically appends the <code>account_id</code> to the User and Post records it creates. Later, after we set <code>require_tenant</code>, the next command fails with an error.<ol start=9><li>In the console, we explicitly used <code>with_tenant</code>. acts_as_tenant has helpers for the controller as well. Depending on how your authentication and tenancy work, you can use domains, subdomains, or implicit tenancy based on the authenticated user. From here, you’ll need to implement something like:</ol><pre><code class=language-ruby>class ApplicationController &#60 ActionController::Base
  set_current_tenant_through_filter
  before_action :require_authentication
  before_action :set_tenant

  def require_authentication
    current_user || redirect_to(new_session_path)
  end

  def current_user
    @current_user ||= if session[:user_id].present?
      User.find(session[:user_id])
    end
  end

  def current_account
    @current_account ||= current_user.try(:account)
  end

  def set_tenant
    set_current_tenant(current_account)
  end
end
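
# Background jobs need the tenant too. A hand-rolled sketch (not an
# acts_as_tenant API; assumes each job is enqueued with its Account as
# the first argument):
class ApplicationJob &#60 ActiveJob::Base
  around_perform do |job, block|
    ActsAsTenant.with_tenant(job.arguments.first, &#38block)
  end
end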
</code></pre><p>Implementing proper authentication is complex, so this is simply an example. The code specific to acts_as_tenant is <code>set_current_tenant_through_filter</code>, <code>before_action :set_tenant</code>, and <code>def set_tenant</code>.<h2 id=migrating-to-acts_as_tenant><a href=#migrating-to-acts_as_tenant>Migrating to acts_as_tenant</a></h2><p>If you have an existing codebase that would benefit from acts_as_tenant, the migration is a process that can be broken into multiple steps:<ol><li><strong>Add a tenant_id column to each affected model</strong> - this step can be quite complicated. It requires data migrations and data updates, and the method of updating columns will depend on the size of your database.<li><strong>Add the acts_as_tenant gem, but do not set require_tenant yet</strong><li><strong>Define the tenancy for your ApplicationController using either domains, subdomains, or a filter</strong><li><strong>Define the tenancy for your ActiveJob classes</strong><li><strong>Define tenancy for your models</strong></ol><p>By taking this measured approach, you can deploy each of the steps above independently, and you can deploy each model change independently of the entire migration.<h2 id=removing-acts_as_tenant><a href=#removing-acts_as_tenant>Removing acts_as_tenant</a></h2><p>The best thing I can say about a library is: you can migrate away from it if it does not work for you. Because acts_as_tenant does not require the deep integration of past multi-tenant libraries, it is possible to move away from it.<h2 id=summary><a href=#summary>Summary</a></h2><p>Back in the 2009-ish era, Ruby on Rails and “The Cloud” grew up together as cloud SaaS and social networks took off. Back then, the maximum performance of network-attached storage was 100 IOPS, and size maxed out at 1TB. The IOPS strangled database performance, and 1TB was a hard limit (unless you set up RAID early). I started my career in that era. 
Due to infrastructure limitations, multi-tenant databases would start to see issues when an application hit as little as 50 requests per second. In that era, RAM was expensive and disk performance was not available, so “sharding” was talked about at all the conferences.<p><em>Side note: data was also suddenly available everywhere, and there were companies storing massive amounts of data hoping to figure out a business model later.</em><p>Now, in 2023, RAM is plentiful and IOPS are available. Scaling the database can be punted until tens of thousands of requests per second.<p>Why do I say all this? Because now we can approach multi-tenant apps and scaling more practically. Multi-tenancy can focus on data security and practical coding instead of scaling. You may never reach the point of needing distributed data stores, but a solid multi-tenant implementation creates foundational success for your application.<p>The old multi-tenant Ruby gems were built for scalability. acts_as_tenant is built for practicality. ]]></content:encoded>
<category><![CDATA[ Ruby on Rails ]]></category>
<author><![CDATA[ Christopher.Winslett@crunchydata.com (Christopher Winslett) ]]></author>
<dc:creator><![CDATA[ Christopher Winslett ]]></dc:creator>
<guid isPermaLink="false">154b989b0bceca71d07af0a50e0c670a7cca03cc23a3023377d73f25e9904bee</guid>
<pubDate>Wed, 20 Dec 2023 08:00:00 EST</pubDate>
<dc:date>2023-12-20T13:00:00.000Z</dc:date>
<atom:updated>2023-12-20T13:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ Use Github Actions on Pull Requests to Automate Postgres on Crunchy Bridge ]]></title>
<link>https://www.crunchydata.com/blog/use-github-action-on-pull-requests-to-automate-postgres-on-crunchy-bridge</link>
<description><![CDATA[ Automate Postgres with your review apps! Chris offers up some sample code for GitHub actions and getting a test Postgres database created, getting the connection string to your review app, and closing it down. ]]></description>
<content:encoded><![CDATA[ <p>Automating pull requests to deploy staging applications is a game changer for large teams shipping quality products. Using <a href=https://www.crunchydata.com/products/crunchy-bridge>Crunchy Bridge</a>’s CLI or API, you can easily automate the entire process for these staging deployments. The simplest workflow would look something like the following:<p><img alt="Crunchy Bridge review apps diagram" loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/455c396d-98fb-4c22-bb27-84140b0a9400/public><p>In this example, during the “Create Postgres Cluster” step, we’ll create a hobby-0 Postgres cluster. Then, when the PR is closed, the cluster will be torn down. We keep it simple for this example, but you can expand the capabilities depending on your use case.<p>Teams that like to have an anonymized dataset for staging can use the Crunchy Bridge CLI to <a href=https://docs.crunchybridge.com/api/cluster#create-cluster-fork>fork the production cluster</a>, then run an anonymization process on the forked cluster. Teams that run PRs often could keep an anonymized cluster available to be forked as well. You can also create an empty database and add an automated process to load a seed file.<h2 id=naming-convention><a href=#naming-convention>Naming convention</a></h2><p>Since naming is among the two hardest things in computer science, let’s tackle it early. I like to keep my cloud tidy, so I give my Postgres clusters predictable names. Names can be added to the cluster at the time of creation. I like to name the automation clusters with something like:<pre><code>{github repository name}-merge-{pull request id}
</code></pre><p>If you have a repository with name <code>Wayne-Enterprises/Batmans_Code</code> and a PR with id <code>7</code>, it will end up with a cluster like this:<pre><code>batmans-code-merge-7
</code></pre><p>Later, you’ll see this in the code we use to generate the name. Of course, you can change this to whatever you like.<h2 id=prepping-your-crunchy-bridge-account><a href=#prepping-your-crunchy-bridge-account>Prepping your Crunchy Bridge account</a></h2><p>Next, log into your Crunchy Bridge account, and go to My Account → API Keys. Add an API key for this account:<p><img alt="crunchy api keys" loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/75f8afbb-177c-4942-d34e-e67de50dc100/public><p>Since I like to be tidy, I also use separate production and developer teams on my cloud services. So, I create a developer team:<p><img alt="crunchy team creation" loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/98776f6a-78c9-4c2a-7444-bbf2d4c9e600/public><p>After creating this team, grab the team’s id from the URL, as you’ll need it in a moment. In the URL, you’ll see something that looks like <code>https://crunchybridge.com/teams/gqa4owetwbdfvfacpdxhgf2qmu/dashboard</code>. From it, grab the string between “teams/” and “/dashboard”. For the above URL, it would be <a href=https://crunchybridge.com/teams/gqa4owetwbdfvfacpdxhgf2qmu/dashboard><code>gqa4owetwbdfvfacpdxhgf2qmu</code></a> — but yours will be different. This is the team id we will use in a GitHub Secret in a moment.<h2 id=adding-github-actions-secrets><a href=#adding-github-actions-secrets>Adding GitHub Actions Secrets</a></h2><p>GitHub secrets allow you to add sensitive values to your GitHub Actions without revealing them to the world. When the GitHub Action runs, you can use the syntax <code>${{ secrets.CRUNCHY_BRIDGE_API_KEY }}</code> to request the sensitive value at that time. Note: it’s important that your actions never print the secrets, or they can be viewed by anyone with access to the repository.<p>To add a GitHub Secret, go to the repository, then click Settings → Secrets and Variables → Actions. 
From there, add the secrets for <code>CRUNCHY_BRIDGE_API_KEY</code> and <code>CRUNCHY_BRIDGE_STAGING_TEAM_ID</code> that were created in the previous section.<p>Once complete, you should see the following:<p><img alt="github secrets add"loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/f604dfdd-d1d0-4ad0-0bf6-39e2cff54f00/public><h2 id=adding-the-workflow-file><a href=#adding-the-workflow-file>Adding the workflow file</a></h2><p>Now, add a file to your repository at <code>.github/workflows/crunchy-bridge-review-cluster.yml</code>:<pre><code class=language-yaml>name: Crunchy Bridge Review Cluster

on:
  pull_request:
    types: [opened, reopened, closed]

permissions:
  contents: read

jobs:
  launch:
    if: ${{ github.event.pull_request.state == 'open' }}
    runs-on: ubuntu-latest
    name: Launch Crunchy Bridge Review Cluster

    steps:
      - name: Create Crunchy Bridge Review Cluster
        run: |
          export CRUNCHY_BRIDGE_CLUSTER_NAME=$(echo "$GITHUB_REPOSITORY-$GITHUB_REF_NAME" | sed 's/^[^\/]\+\///' | sed 's/[^0-9A-Za-z-]\+/-/g' | tr '[:upper:]' '[:lower:]')
          export CB_API_KEY=${{ secrets.CRUNCHY_BRIDGE_API_KEY }}

          wget https://github.com/CrunchyData/bridge-cli/releases/download/v3.4.0/cb-v3.4.0_linux_amd64.zip
          unzip cb-v3.4.0_linux_amd64.zip

          (./cb list | grep $CRUNCHY_BRIDGE_CLUSTER_NAME) || ./cb create --platform aws --region us-east-1 --plan hobby-0 --team ${{ secrets.CRUNCHY_BRIDGE_STAGING_TEAM_ID }} --storage 10 --name $CRUNCHY_BRIDGE_CLUSTER_NAME --version 16

          ./cb uri $CRUNCHY_BRIDGE_CLUSTER_NAME

          for i in $(seq 1 120)
          do
            (./cb info $CRUNCHY_BRIDGE_CLUSTER_NAME | grep 'state: ready') &#38&#38 exit 0
            echo -n '.'
            sleep 5
          done

          exit 1
  teardown:
    if: ${{ github.event.pull_request.state == 'closed' }}
    runs-on: ubuntu-latest
    name: Delete Crunchy Bridge Review Cluster

    steps:
      - name: Delete Crunchy Bridge Test Cluster
        run: |
          export CRUNCHY_BRIDGE_CLUSTER_NAME=$(echo "$GITHUB_REPOSITORY-$GITHUB_REF_NAME" | sed 's/^[^\/]\+\///' | sed 's/[^0-9A-Za-z-]\+/-/g' | tr '[:upper:]' '[:lower:]')
          export CB_API_KEY=${{ secrets.CRUNCHY_BRIDGE_API_KEY }}

          wget https://github.com/CrunchyData/bridge-cli/releases/download/v3.4.0/cb-v3.4.0_linux_amd64.zip
          unzip cb-v3.4.0_linux_amd64.zip

          (./cb list | grep $CRUNCHY_BRIDGE_CLUSTER_NAME) &#38&#38 ./cb destroy $CRUNCHY_BRIDGE_CLUSTER_NAME --confirm || exit 0
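
# Optional hardening (an addition, not required for this example): a
# workflow-level concurrency group serializes runs per pull request so
# rapid open/close events cannot race each other:
#
# concurrency:
#   group: crunchy-bridge-review-${{ github.event.pull_request.number }}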
</code></pre><p>To trigger the workflow, commit the file, push to GitHub, and create a pull request. The file is fairly simple. It sets environment variables for the name of the cluster to be created and the Crunchy Bridge API key. Then, it downloads the <a href=https://docs.crunchybridge.com/quickstart/cli>Crunchy Bridge CLI</a>, called <code>cb</code>, and either creates or deletes the cluster depending on the state of the pull request.<h2 id=sending-the-connection-uri-to-the-application><a href=#sending-the-connection-uri-to-the-application>Sending the connection URI to the application</a></h2><p>To get the Postgres URI, run the following; it will return the full URI string for the cluster:<pre><code class=language-bash>export CB_API_KEY=${{ secrets.CRUNCHY_BRIDGE_API_KEY }}
wget https://github.com/CrunchyData/bridge-cli/releases/download/v3.4.0/cb-v3.4.0_linux_amd64.zip
unzip cb-v3.4.0_linux_amd64.zip
./cb uri $CRUNCHY_BRIDGE_CLUSTER_NAME
</code></pre><p>If you were deploying to Heroku, the final line would look like:<pre><code class=language-bash>heroku config:set DATABASE_URL=$(./cb uri $CRUNCHY_BRIDGE_CLUSTER_NAME)
</code></pre><p>If you were deploying to your own stack, you could write the connection string into <code>config/database.yml</code>:<pre><code class=language-bash>cat &#60&#60 EOF | tee config/database.yml
default: &#38default
  adapter: postgresql
  encoding: unicode
  pool: 5

staging:
  url: $(./cb uri $CRUNCHY_BRIDGE_CLUSTER_NAME)
EOF
</code></pre><p>Generally, use the <code>$(./cb uri $CRUNCHY_BRIDGE_CLUSTER_NAME)</code> command to retrieve the value to write to your environment’s connection of choice.<h2 id=cli-v-api><a href=#cli-v-api>CLI v. API</a></h2><p>In this tutorial, we chose to keep it simple with the CLI. Previously, for another Ruby on Rails application, I had written this interaction as a Rake task. In that scenario, I had built a process highly specific to that application and Ruby on Rails. The CLI allowed me to use the tools at hand, and I did not have to mess with language libraries. Crunchy Bridge has an amazingly <a href=https://docs.crunchybridge.com/api-concepts/getting-started>powerful API</a>. If you need custom functionality and would like to orchestrate it with your language of choice, check out the API.<h2 id=protecting-production><a href=#protecting-production>Protecting production</a></h2><p>When automating review applications, I’m always watching to make sure I’m automating the correct database. For that reason, Crunchy Bridge has “<a href=https://docs.crunchybridge.com/concepts/cluster-settings#protected-setting>Cluster Protection</a>” to keep your production clusters from being a victim of an automation failure. To turn it on, go to your cluster, then Settings → General → Danger Zone → Cluster Protection.<p><img alt="cluster protection" loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/f3c174ed-5073-40a0-b949-f76a9c2a6400/public><h2 id=quick-summary><a href=#quick-summary>Quick summary</a></h2><ul><li>Automate your database creation with your GitHub Actions and review apps! Seriously, they're amazing.<li>Name your provisioned clusters predictably and save your API key in GitHub secrets.<li>Create a custom workflow file that runs upon pull request to have a GitHub Action create your new database and wait until it's ready. 
Also within that workflow file, specify that when the pull request is closed, the database will automatically be removed.<li>Send the connection string to your review application and start testing.</ul> ]]></content:encoded>
<category><![CDATA[ Crunchy Bridge ]]></category>
<category><![CDATA[ Ruby on Rails ]]></category>
<author><![CDATA[ Christopher.Winslett@crunchydata.com (Christopher Winslett) ]]></author>
<dc:creator><![CDATA[ Christopher Winslett ]]></dc:creator>
<guid isPermaLink="false">98635e90a3c03bef9e0a659043b7fb43a60c559ae14afa8329fde047d4991e5b</guid>
<pubDate>Thu, 07 Dec 2023 08:00:00 EST</pubDate>
<dc:date>2023-12-07T13:00:00.000Z</dc:date>
<atom:updated>2023-12-07T13:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ Ruby on Rails Neighbor Gem for AI Embeddings ]]></title>
<link>https://www.crunchydata.com/blog/ruby-on-rails-neighbor-gem-for-ai-embeddings</link>
<description><![CDATA[ Thinking about using pgvector to power some AI data in your Rails app? Chris walks through the very handy Neighbor gem and how it helps for vector data types and ActiveRecord. ]]></description>
<content:encoded><![CDATA[ <p>Over the past 12 months, AI has taken over budgets and initiatives. Postgres is a popular store for AI embedding data because it can store, calculate, optimize, and scale using the <a href=https://www.crunchydata.com/blog/whats-postgres-got-to-do-with-ai>pgvector extension</a>. A recent addition to the Ruby on Rails ecosystem, the neighbor gem, makes working with pgvector and Rails even better.<h4 id=background-on-ai-in-postgres><a href=#background-on-ai-in-postgres>Background on AI in Postgres</a></h4><p>An “embedding” is a set of floating point values that represent the characteristics of a thing (nothing new, we’ve had these since the 70s). Using the OpenAI API or any of their competitors, you can send over blocks of text, images, and PDFs, and OpenAI will return an embedding with 1536 values representing the characteristics. With the <code>pgvector</code> extension, you can store that embedding in a vector column type on Postgres. Then, using nearest neighbor calculations, you can find the most similar objects. For a deeper review of <a href=https://www.crunchydata.com/blog/topic/ai>AI with Postgres</a>, see my previous posts in this series.<h2 id=the-neighbor-gem><a href=#the-neighbor-gem>The neighbor gem</a></h2><p>By default, Ruby on Rails does not know about the "vector" data type. If you've used Ruby on Rails + Postgres + pgvector, you've probably written SQL queries in your migrations and implemented some other janky code. The <a href=https://github.com/ankane/neighbor>neighbor gem</a> will remove the janky code and take you back to a native ActiveRecord experience.<p>At a minimum, all you have to do is add the following to your <code>Gemfile</code>:<pre><code class=language-ruby>gem 'neighbor'
</code></pre><p>Side note: I can't overstate the impact <a href=https://github.com/ankane>Andrew Kane</a> has had on embedding data in Postgres. He's also making it easy for developers to use those vector data types with Ruby on Rails and Node.<h2 id=fixed-schema-dump><a href=#fixed-schema-dump>Fixed schema dump</a></h2><p>The biggest risk of not using Neighbor is that ActiveRecord will create a broken <code>db/schema.rb</code> file. Because ActiveRecord does not understand the <code>vector</code> data type, running <code>rails db:schema:dump</code> will not fail outright; instead, it silently omits any table with that data type and leaves this error in your <code>db/schema.rb</code>:<pre><code class=language-ruby># Could not dump table "recipe_embeddings" because of following StandardError
#   Unknown type 'vector(1536)' for column 'embedding'
</code></pre><p>With Neighbor, you'll get a fully-functional schema like the following:<pre><code class=language-ruby>create_table "recipe_embeddings", primary_key: "recipe_id", id: :bigint, default: nil, force: :cascade do |t|
    t.vector "embedding", limit: 1536, null: false
    t.datetime "created_at", null: false
    t.datetime "updated_at", null: false
    t.index ["embedding"], name: "recipe_embeddings_embedding", opclass: :vector_l2_ops, using: :hnsw
    t.index ["recipe_id"], name: "index_recipe_embeddings_on_recipe_id"
end
</code></pre><p>Notice that Neighbor also understands the <a href=https://www.crunchydata.com/blog/hnsw-indexes-with-postgres-and-pgvector><code>hnsw</code> index type</a> released with pgvector 0.5.<p><strong>Side note</strong>: for projects that go all-in on Postgres, I opt to use the following to dump to a <code>db/structure.sql</code>:<pre><code>SCHEMA_FORMAT=sql rails db:schema:dump
</code></pre><h2 id=easier-migrations--data-type-handling><a href=#easier-migrations--data-type-handling>Easier migrations + data type handling</a></h2><p>Without Neighbor, ActiveRecord is not informed of the <code>vector</code> type in migrations either. With Neighbor, a typical migration looks something like the following:<pre><code class=language-ruby>create_table :recipe_embeddings, primary_key: [:recipe_id] do |t|
  t.references :recipe, null: false, foreign_key: true
  t.vector :embedding, limit: 1536, null: false

  t.timestamps
end
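
# Note: the vector column type requires the pgvector extension; if your
# database does not have it enabled yet, enable it in a prior migration:
#   enable_extension "vector"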
</code></pre><p>Additionally, you get improved handling of the vector data type. Without Neighbor, working with embedding data required <code>to_s</code> to manipulate the values when inserting into Postgres. But with Neighbor, it simplifies to a native process:<pre><code class=language-ruby>RecipeEmbedding.create!(recipe_id: Recipe.last.id, embedding: [-0.078427136, 0.0014401458, ...])
</code></pre><p>But, wait! There's more …<h2 id=the-nearest_neighbor-method><a href=#the-nearest_neighbor-method>The <code>nearest_neighbor</code> method</a></h2><p>After you add the <code>embedding</code> column to a table, you can use <code>has_neighbors</code> to define your nearest neighbor queries:<pre><code class=language-ruby>class RecipeEmbedding &#60 ApplicationRecord
  has_neighbors :embedding
end
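
# For intuition (plain Ruby, not the gem): "euclidean" is the straight-line
# L2 distance between two vectors
a = [1.0, 2.0]
b = [4.0, 6.0]
Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 }) # => 5.0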
</code></pre><p>Then, you can find the nearest neighbors like so:<pre><code class=language-ruby>recipe_embedding.nearest_neighbors(:embedding, distance: "euclidean").first
</code></pre><p>The distance calculations include <code>euclidean</code> and <code>cosine</code>.<h2 id=conclusion><a href=#conclusion>Conclusion</a></h2><p>Launching a project to use embeddings with Ruby on Rails?<p>Step 1: use the neighbor gem<p>Step 2: provision your database on <a href=https://www.crunchydata.com/products/crunchy-bridge>Crunchy Bridge</a> with pgvector<p>Step 3: profit ]]></content:encoded>
<category><![CDATA[ AI ]]></category>
<category><![CDATA[ Ruby on Rails ]]></category>
<author><![CDATA[ Christopher.Winslett@crunchydata.com (Christopher Winslett) ]]></author>
<dc:creator><![CDATA[ Christopher Winslett ]]></dc:creator>
<guid isPermaLink="false">aa4e8c25d1d0a137f5d8f6dfd0e3d8bda9165c7e81aa6b2a31bb44bbb24980b1</guid>
<pubDate>Fri, 03 Nov 2023 09:00:00 EDT</pubDate>
<dc:date>2023-11-03T13:00:00.000Z</dc:date>
<atom:updated>2023-11-03T13:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ Postgres Goodies in Ruby on Rails 7.1 ]]></title>
<link>https://www.crunchydata.com/blog/postgres-goodies-in-ruby-on-rails-7-1</link>
<description><![CDATA[ We are excited about some of the Active Record updates with Rails 7.1! Chris reviews some of the notable new features for working with Postgres including async queries, composite primary keys, native support for CTEs, unlogged tables, and syntax normalization. ]]></description>
<content:encoded><![CDATA[ <p>I just spent last week at Rails World in Amsterdam and had a blast digging back into the Rails and Active Record world. In conversations with developers over the week, I had some notable takeaways from the newest version of Ruby on Rails that I had to get written up.<p>A quick summary before we dig in:<ul><li><p><strong>async queries</strong>: send long-running queries to the background while the rest of the code runs along, great for pages with multiple long-running queries that can be run in parallel<li><p><strong>composite primary keys</strong>: native support for using two or more columns as a primary key<li><p><strong><dfn>common table expressions</dfn></strong> (<abbr>CTEs</abbr>): native integration for defining a subquery for use later in the statement<li><p><strong>unlogged tables</strong>: native support for disabling Postgres’ WAL logging on a table (mostly for test environments), so that you get better performance on tests that use the database<li><p><strong>value normalization</strong>: a native, universal syntax for normalizing values (like downcasing an email column) instead of using <code>before_validation</code></ul><h2 id=expansion-of-async-queries><a href=#expansion-of-async-queries>Expansion of Async queries</a></h2><p>A feature (not a bug IMO) of Ruby is that it has traditionally been used as a blocking (i.e. not-asynchronous) language. While it does have asynchronous capabilities, in the typical use case people do not have to grok asynchronous workflows to use it effectively.<p>In the 7.0 release, Active Record added <code>load_async</code> for loading whole objects. 
In 7.1, asynchronous queries have been enabled for aggregations and for full SQL queries using <code>async_find_by_sql</code>.<p>First, you'll need to define the <code>async_query_executor</code> in your environment files (<code>config/environments/{development, production}.rb</code>).<p>To use a global setting, use something like the following:<pre><code class=language-ruby>config.active_record.async_query_executor = :global_thread_pool
config.active_record.global_executor_concurrency = 5

</code></pre><p>To use a per-database setting, use something like the following and set <code>min_threads</code> + <code>max_threads</code> in the <code>database.yml</code>:<p><code>config/environments/{development, production}.rb</code><pre><code class=language-ruby>config.active_record.async_query_executor = :multi_thread_pool

</code></pre><p><code>config/database.yml</code><pre><code class=language-yaml>development:
  adapter: postgresql
  pool: 5
  max_threads: 5
  min_threads: 5
</code></pre><p>After setting one of those configurations, we can see how the queries work asynchronously:<pre><code class=language-ruby>irb> u = User.async_find_by_sql("SELECT *, pg_sleep(3) FROM users") # this query will sleep for 3 seconds for each record in your database
=> #&#60ActiveRecord::Promise status=pending>

</code></pre><p>Then, sometime later you can use the <code>value</code> syntax to retrieve the value:<pre><code class=language-ruby># … sometime later …
irb> u.value
=> {returned results}
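The promise-then-block pattern itself is plain Ruby. Here is a minimal, framework-free sketch of what this style of code does — the `async_query` helper below is hypothetical, not the Rails API:

```ruby
# A toy stand-in for ActiveRecord::Promise: run the "query" in a
# background thread, hand back a handle, and block only when asked.
def async_query(&work)
  thread = Thread.new(&work)   # work starts immediately in the background
  -> { thread.value }          # calling this blocks until the result is ready
end

promise = async_query { sleep 0.1; [{ "id" => 1, "name" => "Ada" }] }
# ... build other parts of the response here ...
rows = promise.call            # blocks (briefly) if the work isn't done yet
```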

</code></pre><p>Once you call the <code>.value</code> method, if the query has returned, it will return instantly. If the query has not returned, processing is blocked until the query is complete. This is great for dashboards, charts, and reports that generate more complex queries. Send the query to the background, and let it process while you complete other elements of the request.<h2 id=composite-primary-keys><a href=#composite-primary-keys>Composite primary keys</a></h2><p>Composite primary keys have been noticeably absent from Ruby on Rails for a while -- unless you chose the <a href=https://github.com/composite-primary-keys/composite_primary_keys>CPK gem</a>. Rails 7.1 added two native methods for implementing primary keys: database level and application level.<p><strong>Database Level Composite Primary Keys:</strong> When defining the table in the database migration, you can pass an array of column names to the <code>primary_key</code> attribute. For databases capable of composite primary keys (like Postgres), Active Record will infer the composite key from the schema:<pre><code class=language-ruby>    create_table :user_accounts, primary_key: [:user_id, :account_id] do |t|
      t.belongs_to(:user, foreign_key: true)
      t.belongs_to(:account, foreign_key: true)

      t.string :role

      t.timestamps
    end

</code></pre><p>After running this migration, if you run <code>\d user_accounts</code> on your Postgres database, you’ll see the following line for the composite key:<pre><code class=language-text>"user_accounts_pkey" PRIMARY KEY, btree (user_id, account_id)
</code></pre><p>Then, when querying with the composite keys, you do the following:<pre><code class=language-ruby>UserAccounts.find([1, 2]) # where user_id = 1 and account_id = 2
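Conceptually, a composite key is just a two-part lookup key. A plain-Ruby sketch with illustrative data only:

```ruby
# Rows indexed by [user_id, account_id], mirroring a composite primary key
rows = {
  [1, 2] => { role: "admin" },
  [1, 3] => { role: "member" },
}

rows[[1, 2]][:role]  # => "admin", analogous to finding by [1, 2]
```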
</code></pre><p>In the hypothetical use-case above, we use the composite primary keys for a join table between users and accounts. In the past, I would typically use an <code>id</code> column with a unique constraint on the <code>user_id</code> and <code>account_id</code> values. Now, with Postgres, we can use the composite primary key for the row.<p><strong>Application Level Composite Primary Keys:</strong> Rails documentation calls this a "virtual primary key". And, you’ll want to know that it’s a bit more restrictive than the composite primary keys above because it enforces an explicit foreign key definition on relationships:<pre><code class=language-ruby>class UserAccounts &#60 ActiveRecord::Base
  query_constraints :user_id, :account_id

  belongs_to :user, foreign_key: :user_id
  belongs_to :account, foreign_key: :account_id
end

</code></pre><p>You have to be explicit on the foreign_key definitions of the relationship, else it tries to find <code>user_id</code> and <code>account_id</code> on every related model.<p>My recommendation is to use Postgres and the native composite primary keys in a database.<h2 id=native-support-for-ctes><a href=#native-support-for-ctes>Native Support for CTEs</a></h2><p>A "<abbr>CTE</abbr>" is a "<dfn>Common Table Expression</dfn>". A <a href=https://www.crunchydata.com/blog/postgres-subquery-powertools-subqueries-ctes-materialized-views-window-functions-and-lateral>CTE</a> is a named subquery defined before the main SQL statement. Below is an example using native SQL that would find the latest event for each user on an account:<pre><code class=language-pgsql>WITH latest_event_per_user AS (
	SELECT
		user_id,
		MAX(event_logs.created_at) AS last_created_at
	FROM event_logs
	WHERE
		event_logs.account_id = 1
	GROUP BY 1
)

SELECT
	event_logs.user_id,
	event_logs.name,
	event_logs.created_at
FROM event_logs
	INNER JOIN latest_event_per_user ON
		event_logs.user_id = latest_event_per_user.user_id AND
		event_logs.created_at = latest_event_per_user.last_created_at;

</code></pre><p>When would you use something like this? Above is a query that returns the latest events for each user on an account. Another common usage of CTEs is when generating the data for charts and reports. Most of the time, generating this data at the SQL level is much faster than generating it using application-level logic. Application logic would require some type of N + 1 query, which can be avoided with a more expressive SQL query.<p>To support CTEs, Active Record added the <code>.with()</code> method for chaining queries. The <code>with()</code> method accepts an object, which is quite nice when using a large block. Writing the same query from above in Active Record would look like the following:<pre><code class=language-ruby>latest_event_per_user = EventLog.
	where(account_id: 1).
	group(:user_id).
	select(:user_id, "max(event_logs.created_at) AS last_created_at")

el = EventLog.
	with(my_cte: latest_event_per_user).
	joins("JOIN my_cte ON event_logs.user_id = my_cte.user_id AND event_logs.created_at = my_cte.last_created_at")
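To make the query's semantics concrete, the same "latest event per user" logic can be written over an in-memory array in plain Ruby — illustrative data only, not the Active Record API:

```ruby
event_logs = [
  { user_id: 1, name: "login", created_at: 10 },
  { user_id: 1, name: "click", created_at: 20 },
  { user_id: 2, name: "login", created_at: 5 },
]

# The CTE: latest created_at per user
latest = event_logs.group_by { |e| e[:user_id] }
                   .transform_values { |events| events.map { |e| e[:created_at] }.max }

# The join: keep only the events matching that latest timestamp
latest_events = event_logs.select { |e| e[:created_at] == latest[e[:user_id]] }
# => the "click" event for user 1 and the "login" event for user 2
```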

</code></pre><p>My recommendation: if you are SQL-nerd enough to use CTEs, consider using <code>ActiveRecord::Base.connection.execute()</code> to execute the raw SQL. The one time I could see using this Active Record CTE syntax is if you need the chaining capabilities due to conditional query creation.<h2 id=support-for-unlogged-tables-test-env-only><a href=#support-for-unlogged-tables-test-env-only>Support for <a href=https://www.crunchydata.com/blog/postgresl-unlogged-tables>unlogged tables</a> (test env only!)</a></h2><p>Postgres in test environments does not need the persistence that Postgres needs in production. Enter <code>UNLOGGED TABLE</code>, which skips Postgres durability (WAL logging) for a table in exchange for better performance:<pre><code class=language-ruby># config/environments/test.rb

ActiveSupport.on_load(:active_record_postgresqladapter) do
  self.create_unlogged_tables = true
end

</code></pre><h2 id=activerecordbasenormalizes><a href=#activerecordbasenormalizes>ActiveRecord::Base.normalizes</a></h2><p>Most modern databases (including Postgres) are case-sensitive by default. And, users have been known to randomly capitalize values. So, it's best to sanitize values. If you've written a Rails application, you've probably written something like the following:<pre><code class=language-ruby>class User &#60 ApplicationRecord
	before_validation do
	  self.email = self.email.strip.downcase
	end
end

</code></pre><p>Now, we have a native way to do this with <code>normalizes</code>:<pre><code class=language-ruby>class User &#60 ApplicationRecord
	normalizes :email, with: -> given_value { given_value.strip.downcase }
end
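The `->` in that example is an ordinary Ruby lambda, so the normalization logic can be exercised on its own:

```ruby
# The same strip-and-downcase normalization as a standalone lambda
normalize_email = ->(given_value) { given_value.strip.downcase }

normalize_email.call("  Ada@Example.COM ")  # => "ada@example.com"
```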

</code></pre><p>For those who don't know about the <code>-></code> syntax, this is a lambda (an anonymous function) with a single argument called <code>given_value</code>.<h2 id=summary><a href=#summary>Summary</a></h2><p>The PostgreSQL, Active Record, and Ruby on Rails communities continue to show investment in features to make this a strong stack for data-heavy production applications. ]]></content:encoded>
<category><![CDATA[ Ruby on Rails ]]></category>
<author><![CDATA[ Christopher.Winslett@crunchydata.com (Christopher Winslett) ]]></author>
<dc:creator><![CDATA[ Christopher Winslett ]]></dc:creator>
<guid isPermalink="false">56676b90cb4a47aa568a33d598d7acbc764b10170116f592740727e91583562b</guid>
<pubDate>Mon, 16 Oct 2023 09:00:00 EDT</pubDate>
<dc:date>2023-10-16T13:00:00.000Z</dc:date>
<atom:updated>2023-10-16T13:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ Solving N+1 Postgres queries for Ruby on Rails apps ]]></title>
<link>https://www.crunchydata.com/blog/postgresql-for-solving-n+1-queries-in-ruby-on-rails</link>
<description><![CDATA[ Chris has some tips for working with Ruby on Rails and ActiveRecord and using better SQL to improve performance and avoid N+1 queries. ]]></description>
<content:encoded><![CDATA[ <p>Crunchy Data is getting ready to be at RailsConf 2023 in Atlanta next week and we’ve been thinking about our Rails and ActiveRecord users and customers. One of the easiest ways to improve query performance using an ORM is to lean on as much SQL as you can. I’m going to walk through some of the ActiveRecord basics and how to use some smart SQL to work around N+1 query problems.<h2 id=the-easy-crud-basics-with-activerecord><a href=#the-easy-crud-basics-with-activerecord>The easy CRUD Basics with ActiveRecord</a></h2><p>What do I mean by "<abbr>CRUD</abbr>"? It's short-hand for <dfn>create-read-update-delete</dfn>. For instance, ORMs make it so nice to do any of the following.<p>Insert a record:<pre><code class=language-ruby>batman = User.create(name: "Batman", email: "batman@wayne-enterprises.com")
</code></pre><p>Find a record:<pre><code class=language-ruby>user = User.find(batman.id)
</code></pre><p>Update a record:<pre><code class=language-ruby>user.update(email: "batman@retired.com")
</code></pre><p>Destroy a record:<pre><code class=language-ruby>user.destroy
</code></pre><p>ORMs can even manage relationships and joins:<pre><code class=language-ruby>batman.sidekicks = [robin]
User.find(batman.id).sidekicks
</code></pre><p>The above would obviously return Robin.<p>Sometime in the 1970s, superheroes switched from a one-to-one hero-to-sidekick ratio to having multiple side-kicks, or functioning as a group. Then, the Marvel Universe started introducing groupings of superheroes. The Marvel Universe of superheroes is like teenager group chats -- not every superhero likes every other superhero.<p>Hulk and Iron Man -- you don't want them in the same room together, unless you have to.<p>But, I digress.<h2 id=sql-superpowers-on-vs-where><a href=#sql-superpowers-on-vs-where>SQL Superpowers: ON vs. WHERE</a></h2><p>This type of grouping relationship necessary for managing superheroes is what ties ORMs into knots. Anytime you want to append a <em>conditional join</em>, they get quite messy.<p>Below is what I mean when I say <em>conditional join</em>: it is a join that filters with conditions in the <code>ON</code> clause:<pre><code class=language-pgsql>SELECT
	*
FROM table
LEFT JOIN other_table ON conditional_1 AND conditional_2
</code></pre><p>This query will return all rows of <code>table</code>; for rows where <code>conditional_1</code> or <code>conditional_2</code> is false, the <code>other_table</code> columns come back empty. So, results look something like this:<pre><code class=language-text> table.id | other_table.conditional_1 | other_table.conditional_2 |
----------|---------------------------|---------------------------|--
        1 |                      true |                      true |
        2 |                           |                           |
</code></pre><p>If we put the conditional in <code>WHERE</code> instead of the <code>ON</code>, and ran this query:<pre><code class=language-pgsql>SELECT
	*
FROM table
LEFT JOIN other_table ON conditional_1
WHERE conditional_2
</code></pre><p>Then it only returns results where all conditions are met:<pre><code class=language-text> table.id | other_table.conditional_1 | other_table.conditional_2 |
----------|---------------------------|---------------------------|--
        1 |                      true |                      true |
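The ON-vs-WHERE distinction can be mimicked over in-memory rows in plain Ruby — a sketch with illustrative data, not SQL:

```ruby
table       = [{ id: 1 }, { id: 2 }]
other_table = [{ table_id: 1, conditional: true }]

# LEFT JOIN ... ON conditional: every left row survives;
# unmatched left rows simply pair with nil
on_result = table.map do |t|
  [t, other_table.find { |o| o[:table_id] == t[:id] && o[:conditional] }]
end

# Moving the conditional to WHERE drops left rows with no match at all
where_result = on_result.reject { |_t, o| o.nil? }

on_result.length     # => 2, like the LEFT JOIN ... ON result
where_result.length  # => 1, like the WHERE result
```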
</code></pre><p>If you notice, there is only a single row returned. The usage of the <code>WHERE</code> conditional filters out the entire second row.<p>So, sometimes, filters need to be in the join’s <code>ON</code> clause, instead of being in the <code>WHERE</code> clause.<h2 id=an-orm-in-knots><a href=#an-orm-in-knots>An ORM in knots</a></h2><p>Using Rails’ ActiveRecord ORM, let's return a list of superheroes, then if they are in a chat group owned by Hulk, return those groups as well. We would probably start with something like this:<pre><code class=language-ruby>User
	.left_joins(group_users: :groups)
	.where(groups: {owner_id: hulk.id})
</code></pre><p>This would generate a query that looks something like this:<pre><code class=language-pgsql>SELECT
	*
FROM users
LEFT OUTER JOIN group_users ON group_users.user_id = users.id
LEFT OUTER JOIN groups ON group_users.group_id = groups.id
WHERE
	groups.owner_id = [[hulk_user_id]]
</code></pre><p>This has the problem we defined before: it filters out all rows that do not return true for the conditional. So, it's not returning all users, it's only returning users who belong to a group that is owned by Hulk.<p>Iron Man would be mad. He'd probably even threaten to take his toys and go home, until someone told him it was just a bug in the software.<h3 id=a-false-positive-unless-><a href=#a-false-positive-unless->A false positive, unless …</a></h3><p>With ActiveRecord, there appears to be a way to do this, but it's a false positive. Using a SQL fragment runs the query that we want:<pre><code class=language-ruby>users = User
	.joins(:group_users)
	.joins(ActiveRecord::Base.sanitize_sql_array(["LEFT JOIN groups ON group_users.group_id = groups.id AND groups.owner_id = ?", hulk.id]))
</code></pre><p>But, when accessing the object's relationships, we get all related rows, not the ones we want (i.e. the conditional join did not stick):<pre><code class=language-ruby>users.first.group_users.map(&:group) # => all groups, unfiltered
</code></pre><p>In Rails 6.1, the <code>strict_loading</code> functionality was added that makes this join behave properly. Run the same Ruby code above, append <code>strict_loading</code>, and this will prevent additional lazy loading.<pre><code class=language-ruby>users.strict_loading.first.group_users # => filtered groups
</code></pre><h2 id=should-we-settle-with-n--1><a href=#should-we-settle-with-n--1>Should we settle with N + 1?</a></h2><p>The typical alternative is to just settle with N + 1 from the controller or the template. It's an attempt to solve data retrieval shortcomings using application level code:<pre><code class=language-ruby>&#60% users.each do |user| %>
	&#60%= user.name %>
  &#60% user.group_users.includes(:group).where(group_id: params[:group_id]).each do |group_user| %>
    &#60%= group_user.group.name %>
  &#60% end %>
&#60% end %>
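The cost of that loop is easy to count. Assuming one SQL query per lazy association load, the query count grows linearly with the number of users on the page:

```ruby
# Hypothetical page rendering 50 users, each lazily loading its groups
user_count  = 50
query_count = 1 + user_count  # 1 for the users query, plus N for the groups
query_count                   # => 51 queries for a single page render
```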
</code></pre><p>Of course, this works … but, it does not scale. It will be fast in development, and it will run fast with small data sets. But, it runs a query for each user record. If the application grows, the loop above will run an additional query for each user displayed.<p>There is a better way.<h2 id=lets-just-use-sql-instead><a href=#lets-just-use-sql-instead>Let's just use SQL instead</a></h2><p>First, we'll use the quick-and-dirty method. It will call some of the code internals for ActiveRecord.<p>Let's use <code>ActiveRecord::Base.connection.execute</code> to run the SQL. We'll also use <code>ActiveRecord::Base.sanitize_sql_array</code> to safely inject values into the SQL query.<pre><code class=language-pgsql>results = ActiveRecord::Base.connection.execute(ActiveRecord::Base.sanitize_sql_array([&#60&#60SQL, hulk_user_id]))
SELECT
	users.id AS id,
	users.name AS name,
	groups.name AS group_name
FROM users
LEFT OUTER JOIN group_users ON group_users.user_id = users.id
LEFT OUTER JOIN groups ON group_users.group_id = groups.id
	AND	groups.owner_id = ?
ORDER BY users.name
SQL
</code></pre><p>Then, in the view, the following code can be used to iterate over the returned values:<pre><code class=language-ruby>&#60% results.each do |row| %>
  &#60%= row["id"] %>
  &#60%= row["name"] %>
  &#60%= row["group_name"] || "--" %>
&#60% end %>
</code></pre><h3 id=clean-it-up-to-make-it-a-little-nicer><a href=#clean-it-up-to-make-it-a-little-nicer>Clean it up to make it a little nicer</a></h3><p>To clean up the code a bit when running multiple SQL queries, I typically do something like this. I searched for a modern Ruby Gem to handle this type of issue, but none were immediately obvious as being stable and maintained.<ol><li><p>Store one file per query in the <code>app/models/sql</code> directory. The query above would be stored in a file called <code>app/models/sql/all_users_and_groups_with_specific_owner.sql</code> and would look like this:<pre><code class=language-pgsql>SELECT
	users.id AS id,
	users.name AS name,
	groups.name AS group_name
FROM users
LEFT OUTER JOIN group_users ON group_users.user_id = users.id
LEFT OUTER JOIN groups ON group_users.group_id = groups.id
	AND	groups.owner_id = ?
ORDER BY users.name
</code></pre><li><p>Then, we can have a model that handles these queries for us. Save the following to <code>app/models/sql.rb</code>:<pre><code class=language-ruby>class Sql
	def self.run(sql_name, *arguments)
		sql = File.read(File.join(Rails.root, 'app', 'models', 'sql', sql_name + '.sql'))
		sanitized_sql = ActiveRecord::Base.sanitize_sql_array([sql, *arguments])
		ActiveRecord::Base.connection.execute(sanitized_sql)
	end
end
</code></pre><li><p>Then, when running a SQL command, just do the following:<pre><code class=language-ruby>result = Sql.run("all_users_and_groups_with_specific_owner", hulk_user_id)
</code></pre></ol><p>This method puts the SQL query into a space away from the rest of our code. Then, in that SQL file, we can include comments to help our future-selves read the SQL and know why we are using it.<h3 id=what-about-database-lock-in><a href=#what-about-database-lock-in>What about database lock-in?</a></h3><p>By querying with raw SQL, you will be locked into a database. However, once raw SQL becomes necessary for performance, it is best to favor database lock-in -- the alternative being slower, generic database interactions.<p>Once you decide on the database for the long-haul, there is no better database than open-source 100% native Postgres.<h2 id=summary><a href=#summary>Summary</a></h2><ul><li>ActiveRecord is awesome for getting started with databases in your Rails application, but performance-wise, there can be some limits.<li>N+1 queries are a common issue with ActiveRecord or an ORM, and can become more of a hindrance as the application scales.<li>Writing SQL and embedding that as a model is an easy way to add SQL to your application. You’ll be locked into PostgreSQL for the long haul, but that’s ok, there’s no better database for a Rails production application.</ul><p>See you next week at RailsConf! ]]></content:encoded>
<category><![CDATA[ Ruby on Rails ]]></category>
<author><![CDATA[ Christopher.Winslett@crunchydata.com (Christopher Winslett) ]]></author>
<dc:creator><![CDATA[ Christopher Winslett ]]></dc:creator>
<guid isPermalink="false">7334071a04be2936e49a48a70c51d203a4d015837f31255707dc008a2c16b266</guid>
<pubDate>Fri, 21 Apr 2023 09:00:00 EDT</pubDate>
<dc:date>2023-04-21T13:00:00.000Z</dc:date>
<atom:updated>2023-04-21T13:00:00.000Z</atom:updated></item></channel></rss>