<?xml version="1.0" encoding="UTF-8" ?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" version="2.0"><channel><title>CrunchyData Blog</title>
<atom:link href="https://www.crunchydata.com/blog/topic/ai/rss.xml" rel="self" type="application/rss+xml" />
<link>https://www.crunchydata.com/blog/topic/ai</link>
<image><url>https://www.crunchydata.com/card.png</url>
<title>CrunchyData Blog</title>
<link>https://www.crunchydata.com/blog/topic/ai</link>
<width>800</width>
<height>419</height></image>
<description>PostgreSQL experts from Crunchy Data share advice, performance tips, and guides on successfully running PostgreSQL and Kubernetes solutions</description>
<language>en-us</language>
<pubDate>Thu, 26 Dec 2024 08:30:00 EST</pubDate>
<dc:date>2024-12-26T13:30:00.000Z</dc:date>
<dc:language>en-us</dc:language>
<sy:updatePeriod>hourly</sy:updatePeriod>
<sy:updateFrequency>1</sy:updateFrequency>
<item><title><![CDATA[ Name Collision of the Year: Vector ]]></title>
<link>https://www.crunchydata.com/blog/name-collision-of-the-year-vector</link>
<description><![CDATA[ Elizabeth digs into the history and various uses of the vector. ]]></description>
<content:encoded><![CDATA[ <p>I can’t get through a Zoom call, a conference talk, or an afternoon scroll through LinkedIn without hearing about vectors. Do you feel like the term vector is everywhere this year? It is. <strong>Vector</strong> actually means several different things, and it's confusing. Vector can mean AI embedding data, GIS locations, digital graphics, a type of query execution, and more. The terms and uses are related, sure. They all stem from the same original concept. However, their practical applications are quite different. So “Vector” is my choice for this year’s name collision of the year.<p>In this post I want to break down the vector: its history, how vectors were used in the past, and how they evolved into what they are today (with examples!).<h2 id=the-original-vector><a href=#the-original-vector>The original vector</a></h2><p>The idea that vectors are based on goes back to the 1600s, when René Descartes developed the Cartesian XY coordinate system to represent points in space. Descartes didn't use the word vector, but he did develop a numerical representation of location and direction. Numerical location is the foundational concept of the vector - used for measuring spatial relationships.<p>The first use of the term vector was in the 1840s by an Irish mathematician named William Rowan Hamilton. Hamilton defined a vector as a quantity with both magnitude and direction in three-dimensional space. He used it to describe geometric directions and distances, like arrows in 3D space. Hamilton combined vectors with scalars in his quaternions to solve problems of rotation in three-dimensional space.<p><img alt=image.png loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/0d864a5e-a64f-4c85-36ea-8a5938420900/public><p>The word Hamilton chose, vector, comes from the Latin <strong>vehere</strong>, meaning ‘to carry’ - a vector is a ‘carrier’ (yes, the same origin as the word vehicle). 
We assume Hamilton chose this Latin origin to emphasize the idea of a vector carrying a point from one location to another.<p>There’s a <a href=https://www.amazon.com/Vector-Surprising-Story-Mathematical-Transformation/dp/0226821102>book about the history of vectors</a> published just this year, and a <a href=https://www.siam.org/publications/siam-news/articles/the-curious-history-of-vectors-and-tensors/>nice summary here</a>. I’ve already let Santa know this is on my list this year.<h2 id=mathematical-vectors><a href=#mathematical-vectors>Mathematical vectors</a></h2><p>Building upon Hamilton’s work, vectors have been used extensively in linear algebra, both before and after the advent of computational math. If it has been 20 years since you took a math class, here’s a quick refresher.<p>Linear algebra is a branch of mathematics that focuses on vectors, matrices, and arrays of numbers. Here’s a super simple mathematical vector equation. We have two points on an XY coordinate system, point A at (1, 2) and point B at (4, 6). The vector from A to B is shown in the diagram below; the final solution is (3, 4).<p><img alt="basic math vector" loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/42e42598-d0a7-44d7-5126-b0d67de19c00/public><p>Much more complicated linear algebra is used in solving systems of linear differential equations. Vector equations have practical use cases in physics and engineering for things we use every day, like heat conduction, fluids, and electrical circuits.<h2 id=computer-science-vectors><a href=#computer-science-vectors>Computer science vectors</a></h2><p>Early computer scientists made heavy use of the vector in a variety of ways. A computational vector can be similar to the example above, or even just a simple fixed-size numeric array whose numbers have related values. 
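<p>The A-to-B example above works out in a couple of lines of Python (a quick illustrative sketch, nothing more):

```python
import math

# Vector from point A to point B: subtract componentwise
def displacement(a, b):
    return tuple(bi - ai for ai, bi in zip(a, b))

A = (1, 2)
B = (4, 6)
v = displacement(A, B)   # the (3, 4) vector from the diagram
length = math.hypot(*v)  # magnitude of the vector, via the Pythagorean theorem
```

Subtracting the coordinates componentwise gives the (3, 4) vector, and the Pythagorean theorem gives its length.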
In early computer programming, simple operations like addition or subtraction would be applied to a set of vectors.<p>A basic example of this could be financial portfolio analysis, where you have two vectors: portfolio weights (v1), showing the proportion of investment in different stocks, and market impact adjustments (v2), corrections based on current market values. This C code sample calculates the adjusted weights for each stock in the portfolio by adding the two vectors.<pre><code class=language-C>#include &#60stdio.h>

#define STOCKS 8

typedef float Portfolio[STOCKS];

int main() {
    // Portfolio weights (in percentages, out of 100)
    Portfolio portfolioWeights = {10.0, 20.0, 15.0, 25.0, 5.0, 10.0, 10.0, 5.0};
    // Market impact adjustments (positive or negative percentages)
    Portfolio marketAdjustments = {0.5, -0.3, 1.0, -0.5, 0.2, -0.1, 0.0, 0.7};
    Portfolio adjustedWeights;

    // Perform vector addition
    for (int i = 0; i &#60 STOCKS; i++) {
        adjustedWeights[i] = portfolioWeights[i] + marketAdjustments[i];
    }

    // Print adjusted weights
    printf("Adjusted Portfolio Weights: &#60");
    for (int i = 0; i &#60 STOCKS; i++) {
        printf("%s%.1f%%", i > 0 ? ", " : "", adjustedWeights[i]);
    }
    printf(">\n");

    return 0;
}
</code></pre><p>Modern computer science builds on similar concepts of organizing and processing collections. The <code>std::vector</code> in C++ and <code>Vec&#60T></code> in Rust are general-purpose dynamic arrays. They can hold virtually any data type and help manage or compute over collections of elements.<h2 id=graphics-and-vectors><a href=#graphics-and-vectors>Graphics and vectors</a></h2><p>Vector graphics were used in early arcade and video game development. Think of something like Spacewar! or Asteroids. Vectors were used to draw lines and shapes, like ships and stars.<p>Here’s a super simple example of how vectors could be used to draw a triangle.<pre><code class=language-C>// Stand-in for a real drawing routine; expands to nothing here
#define DrawLine(pt1, pt2)

typedef struct Point {
    int x, y;
} Point;

typedef struct Line {
    Point start;
    Point end;
} Line;

Line lines[3] = {
    {{0, 0}, {100, 100}},  // Line 1
    {{100, 100}, {200, 50}}, // Line 2
    {{200, 50}, {0, 0}}    // Line 3
};

// Loop through these points to draw our triangle on the screen.
int main()
{
    for (int i = 0; i &#60 3; i++)
    {
        DrawLine(lines[i].start, lines[i].end);
    }
    return 0;
}
</code></pre><p>These early xy arrays and computerized graphics paved the way for modern computer graphics, which make use of vectors in even more advanced ways. When you play a modern 3D video game, the characters, objects, and movement you see on screen are powered by linear algebra vectors.<p>The <strong>Graphics Processing Unit (GPU)</strong> is a specialized processor developed in the 1990s and improved on in the decades since. GPUs handle the millions of vector operations required to create 3D graphics in real time. GPUs are now used for far more than 3D graphics. Vector-based assembly operations can operate on a contiguous block of memory, doing the same operation across different chunks of memory.<p><strong>Scalable vector graphics (SVG)</strong><p>SVGs are 2D vector graphics that have become a de facto image format in web design and development. The SVG standard allows graphics to be described with a series of numbers representing shapes and paths that work across devices and web browsers. SVG graphics display logos, icons, charts, and animations. Their popularity took off in the mid 2010s and continues to grow thanks to their performance and lightweight nature.<p>An SVG uses however many numbers it takes to describe the object it represents. A simple SVG with a few shapes might be dozens of numbers; a more complex SVG, like one for a detailed icon or map, might include thousands.<p>Here’s what the SVG of the <a href=https://www.crunchydata.com/>Crunchy Data</a> hippo logo looks like:<pre><code class=language-jsx>&#60svg
	id="aad9811e-aeeb-4dae-a064-7d889077489a"
	data-name="Layer 6"
	xmlns="http://www.w3.org/2000/svg"
	viewBox="0 0 1407.15 1158.38"
>
	&#60path
		d="M553.21,651l124.3,122.4-154.9-89Zm-304.5-496.6-54.6,148.9L35.71,415.19,6.81,523.49l-6.5,67.9,83.1,65.2h0l208.7-10.3,114.1-155.7,3.6-166,199.3-200.5-104.7-41.9Zm0,0,360.4-30.3m-104.7-41.9-114.1,61.4-130.7,213.5-105.5,150.5-70.8,149m322.9-166-145.9-135.4-222.5,62.1M294.21,642l-140.1-135.1L1,586.39m36.1-171.2,116.3,91,190.8-73.1m-95.5-278.7L259.61,357m150.1-32.4-19.4-181m218.8-19.5,14.7,196.7-59.5,137.4-49.1,104-92.7,47.2-128.8,35.9,139.8,39.3L621.21,632l62.4-196.3,16.7-174.4-92.4-136.9M621.21,632l-215-141.5,26.7,194-349.6-28m617-395.2-294.1,229.3,215,141.5m-217.1,50.2,8.6,306.7-17.5,35.7,6.1,52.8,101.7-4.8,63.5-63.9,6-47.9L588.41,792h0l89.2-18.4,97.2,23.4,84.2,19.7-2.1,46.5,10.5,30.4-19,28.9,28.1,1.9,1.6-.8,6,105.5-15.1,40.1,25.3,88.7,132.1-33-6.1-50.6,65.5-306.8,49.5-12.2,57-43,29,41.1,2.4,88.3,5.8,61.8-18.6,46.2,23.5,38.7,96.5-12.4,44.3-43.5-21.1-28.8,13.8-216.9,4-65.5,34.6-116.4-23.4-120.4-332.8-215.1L842,135l-151.2,47.5m119.9,84.8-202.4-143.1m202.4,143.1L849,552.39l134.2-214.2ZM1164,453.09l-180.8-115-42.6,277Zm-486.5,320.4,263-158.4L849,552.39Zm133.2-506.2-110.6-4-4.6,48.5,115-42.3m-133,504-154.9-89,65.7,107.4Zm170.3-25.9,35.1,87,57.6-219.4Zm117.7,83.3-25-215.8-57.6,219.4Zm-24.9-215.8,25,215.8,120.2-63.5Zm12.7,418.8,94-83.9-81.9-119.1Zm-105.5-285.6-170.3,25.2,200,47.7ZM1164,453.09l-70.6,270.3,141.1-114Zm70.5,156.3,77.8-132.8L1195,262.89Zm-251.3-271.3,180.8,115,31.1-190.2Zm67.1-168.8-67.1,168.8,211.9-75.2ZM842,135l-151.2,47.5,359.5-13.9Zm244.2,633.2,7.2-44.8m167.2-63.1,51.8-183.7-77.9,132.8Zm0,0-26.1-50.9-99.3,145.8Zm0,0,84.1-88.7-32.4-95Zm84.1-88.7-84.1,88.7,42.4-7.6Zm-22.6-226.7-9.8,131.7,32.4,95Zm0,0,22.6,226.7,62-69Zm46.3,339.3-65.3-30.2,56.7,161.5Zm-114.7,122.3,77.3-31.9-28.1-121.8Zm49.2-153.7,28.1,121.8,28.9,40.9Zm69.3-32.3-27.5-48.9,23.7,112.6ZM1331,774.59l-4.7,123.7,33.6-82.7Zm-93.9,213.3,94.5-12.7-5.4-78.4Zm16.6-181.4-30,35.1,13.4,139.9,63.4-138.2Zm0,0-33.1-115.9,3.1,150.6Zm-32.8-115.2,82.2-37.2m-73.5,249.3,7.6,84.6m94.5-12.8,43.7-42.9-49.1-35.5Zm-5.8
-79.2,29.1,7.3m-942.3,85.6-11.4,88.5,63.4-55.8Zm51.2,31.9,38.7,52.5,63.8-64.5Zm556,53.9-66.6-40.8-59.2,123.9Zm-431.6-282.8-112.2,70.4-11.4,159.3Zm-178.6,89.3,2.9,107.7,63.5-126.6Zm238-729.1,40.7-57.4L702,45.29l-13.6-32L650.11.49l-13.6,2.6-31.2,41.3-10.3,73,14.1,6.7ZM650,.49l-48.6,74.7,81.4-45.9Zm32.7,28.4L702,45.19m-19.1-15.3,5.5,64.8L647.31,110l-38.2,14.1m0,0-7.7-48.9m87-61.9-5.5,16.6L650,.59m-269.3,116-4.1-59.1-45-22.9-43.7,26.8,2.7,42.8,11.5,35.3M346.21,81l-14.6-46.5-41,69.7L346.21,81l-43.8,58.5m74.2-82.1L346.21,81l34.5,35.6m486.4,777.9,10.9,29m4.9-90.7-15.6,60.6,10.7,30.1Zm-407,32,46.7-180.3-112.9,196.7m23.2-196.6,89.7-.1,30.6-33.4M744.81,394l-10.6,113.9L849,552.39Zm-75.5,84.8L621.21,632l113.1-124.1Zm64.9,29.1-56.7,265.6m0,0,27.2-133.3-83.6-8.1Zm68.1-380.1-59.2,18m9-99.7,49.4,82.3,65.7-124.6Zm-289.2,178.9,277.3-54.9m200.3,594.7,31-31.4,50.7-168.1m-82.6,1.9,31.9,166.1,38.5,34.9M1331,774.59l-30.4,68.7,25.8,53.5M287.91,61.39l23.9,6.7"
		fill="none"
		stroke="currentColor"
		stroke-linejoin="bevel"
	/>
&#60/svg>
</code></pre><h2 id=gis-vector-data><a href=#gis-vector-data>GIS vector data</a></h2><p>In modern computational GIS, vectors are used to represent geometric data types like points, line-strings, and polygons. As in any other x, y, z coordinate system, the vectors refer to specific global points or objects. There are quite a few different spatial reference systems that can be used. The vectors are typically stored in <a href=https://www.crunchydata.com/solutions/postgis>PostGIS</a> using Well-Known Binary (WKB), a standardized binary encoding for geometries. Vectorization also powers many of the key functions in modern geospatial data processing, like intersections, distance calculations, joins, and proximity analysis.<p>Here’s the vector binary for (imho) the best BBQ restaurant in the world:<pre><code class=language-bash> restaurant_name |                        geom
-----------------+----------------------------------------------------
Gates Bar B Q    | 0101000020E610000082E673EE76A557C007B47405DB884340
</code></pre><h2 id=ai-vectors><a href=#ai-vectors>AI Vectors</a></h2><p>AI vectors emerged from the mathematical and computational foundations of vectors that I covered above. Through advancements in hardware and in machine learning algorithms, vectors can be used as a system to describe virtually anything. Large Language Models (LLMs) convert data like text, images, or other inputs into vectors through a process called embedding. LLMs use layers of neural networks to process the embeddings in a specific context. So the vectors numerically represent relationships between objects within the context they were created with.<p>You’ve probably heard of the <code>pgvector</code> extension that is used for storing and querying AI-related embedding data. <a href=https://www.crunchydata.com/blog/topic/ai>pgvector</a> adds a custom data type <code>vector</code> for storing fixed-length arrays of floating-point numbers, with up to 16,000 dimensions.<p>My colleague Karen Jex gives a great talk about AI embeddings called “<a href="https://www.youtube.com/watch?v=XUMVumOzA3M">What’s the Opposite of a Corn Dog</a>”. The vector embedding for a corn dog from an OpenAI menu dataset is an array of a staggering 1,536 numbers. Here’s a snippet.<pre><code class=language-sql>-- vector of a Corn Dog
[0.0045576594,-0.00088141876,-0.014024569,-0.011641564,0.0038251784,0.010306821,-0.01265076,-0.013672978,-0.01582159,-0.041670028,0.0044274405,.........0.040185533,-0.010463083,0.004326521,-0.019571891,0.01853014,0.025770308,-0.017787892,0.0018572462]
</code></pre><p>In AI and machine learning, a vector is an ordered list of numbers that can represent literally anything. Really, what “AI” is doing is turning anything and everything into a vector and then comparing that vector with other vectors in the same embedding space.<h2 id=vectorized-queries><a href=#vectorized-queries>Vectorized queries</a></h2><p>As computational vectors have become so popular along with machine learning, the underlying methods and CPU hardware for processing vector data are now used to process other kinds of data.<p>There are several databases on the market now, like <a href=https://www.crunchydata.com/solutions/postgres-with-duckdb>DuckDB</a>, BigQuery, Snowflake, and <a href=https://www.crunchydata.com/products/warehouse>Crunchy Data Warehouse</a>, that make use of vectorized query execution to speed up analytics queries. Vectorized query engines break a query into batch operations over chunks of data of the same type. In a way, they’re treating columns of data like mathematical vectors. This can be much faster than reading data row by row, with the speedup coming from parallelization and effective CPU and IO usage.<p><img alt="vectorized queries.png" loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/63909705-bb59-4405-9514-cf792eed9600/public><p>The values processed with vectorized execution are treated as vectors in the sense that they’re contiguous batches of data elements. Surprisingly, they do not need to represent mathematical vectors; they can be any kind of data that fits the processing model.<h2 id=vectors-are-everywhere><a href=#vectors-are-everywhere>Vectors are everywhere!</a></h2><p>Vectors are everywhere, and they can mean virtually anything in a computerized context - especially now with AI, everything is or can be a vector.<p>Vectors and their uses are one of the main characters in the story of modern computing. 
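<p>That “comparing” step is usually a distance or similarity computation. Here’s a toy cosine-similarity sketch in Python, with made-up 3-dimensional vectors standing in for real 1,536-dimensional embeddings (names and values are invented for illustration):

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths
    dot = sum(ui * vi for ui, vi in zip(u, v))
    lu = math.sqrt(sum(ui * ui for ui in u))
    lv = math.sqrt(sum(vi * vi for vi in v))
    return dot / (lu * lv)

# Toy "embeddings": similar concepts point in similar directions
dogs    = (0.9, 0.1, 0.0)
puppies = (0.8, 0.2, 0.1)
trucks  = (0.0, 0.1, 0.9)

closer = cosine_similarity(dogs, puppies)
farther = cosine_similarity(dogs, trucks)
```

Higher similarity means the vectors point in more similar directions; pgvector exposes the same idea through its distance operators.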
They are an evolution from pen-and-ink math to modern ML algorithms. The beauty of the vector lies in its infinitely flexible numeric representation: from simple concepts like a point on the globe, to computerized graphics and animation, to AI embeddings for any text or image.<h3 id=vector-use-summary><a href=#vector-use-summary>Vector use summary:</a></h3><p><img alt="vector uses.png" loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/44a5573e-6d89-4285-2259-546f8a1c4900/public><p>Attributions<p><a href=https://old.maa.org/press/periodicals/convergence/mathematical-treasure-hamilton-s-lectures-on-quaternions>Hamilton’s Lectures on Quaternions</a> ]]></content:encoded>
<category><![CDATA[ Spatial ]]></category>
<category><![CDATA[ AI ]]></category>
<author><![CDATA[ Elizabeth.Christensen@crunchydata.com (Elizabeth Christensen) ]]></author>
<dc:creator><![CDATA[ Elizabeth Christensen ]]></dc:creator>
<guid isPermaLink="false">21260185d81ce54e8d5d72f33634008d11788d1ac6743c61c403369394487b5b</guid>
<pubDate>Thu, 26 Dec 2024 08:30:00 EST</pubDate>
<dc:date>2024-12-26T13:30:00.000Z</dc:date>
<atom:updated>2024-12-26T13:30:00.000Z</atom:updated></item>
<item><title><![CDATA[ Smarter Postgres LLM with Retrieval Augmented Generation ]]></title>
<link>https://www.crunchydata.com/blog/smarter-postgres-llm-with-retrieval-augmented-generation</link>
<description><![CDATA[ As a follow up to Paul's last post on an OpenAI connector to Postgres, Paul shows you how to add new data for your LLM queries to make them more accurate. ]]></description>
<content:encoded><![CDATA[ <p>"Retrieval Augmented Generation" (RAG) is a useful technique in working with large language models (LLMs) to <strong>improve accuracy</strong> when dealing with <strong>facts</strong> in a restricted domain of interest.<p>Asking an LLM about Shakespeare: works pretty well. The model was probably fed a lot of Shakespeare in training.<p>Asking it about holiday time off rules from the company employee manual: works pretty badly. The model may have ingested a few manuals in training, <strong>but not yours</strong>!<p>Is there a way around this LLM limitation? Yes! We can create a separate table of facts and feed the relevant rows to the LLM alongside our questions, building our own RAG system.<h2 id=example><a href=#example>Example</a></h2><p>For this example, we will build a query function that can accurately answer questions about Star Trek: The Next Generation (TNG) episodes.<ul><li>We run the LLM locally using <a href=https://ollama.com>Ollama</a><li>We access the LLM using the <a href=https://github.com/pramsey/pgsql-openai>OpenAI API access extension for Postgres</a> functions described in our <a href=https://www.crunchydata.com/blog/accessing-large-language-models-from-postgresql>last blog post</a> on sentiment analysis.</ul><p>The <a href=https://ai.meta.com/blog/meta-llama-3/>Llama3-8B</a> model does not know a lot about Star Trek when queried directly.<p><img alt loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/f66f9dc0-4a44-4f14-8303-6a9a655adf00/public><pre><code class=language-sql>SELECT openai.prompt(
  'You are a science fiction expert!',
  'What is the Star Trek episode where Deanna and her
   mother are kidnapped?'
);
</code></pre><pre><code>I believe you might be thinking of the Star Trek:
The Next Generation episode "Dark Page", which is
Season 5, Episode 9. In this episode,
Lwaxana Troi (Deanna's mother) visits the
Enterprise and plans to attend a ceremony on Betazed
with her husband. However, she is kidnapped by a
group of space pirates who want to use her as bait
to lure her son Jori-Troi back onto their ship.
</code></pre><p>This is... a pure hallucination. It is "trekish", but it names the wrong episode, invents the plot, and references characters that don't exist. We need a way to inject more <strong>facts</strong> into the LLM query.<h2 id=new-raw-data><a href=#new-raw-data>New raw data</a></h2><p>To improve our TNG trivia bot, we will build a chunk database using plot summaries of TNG episodes. This database table will be a source of facts we can feed the LLM.<pre><code class=language-sql>CREATE TABLE tng (
    title text,
    plot text
    );

COPY tng (title, plot)
    FROM PROGRAM 'curl https://raw.githubusercontent.com/pramsey/pgsql-openai/refs/heads/main/examples/rag/tng.txt'
    WITH (
        FORMAT csv,
        DELIMITER E'\t'
        );
</code></pre><p>Each row of the <code>tng</code> table contains a title and episode number slug, and a plot summary for the episode. Unfortunately this data set does not include all 178 episodes.<h2 id=embedding><a href=#embedding>Embedding</a></h2><p>One of the most magical aspects of LLM technology is how the data are modeled under the covers: just a collection of tokens in an extremely high-dimensional space (1,500 dimensions or more).<p>You can take a phrase or a paragraph and hand it to a model and ask for an "embedding", and it will spit back a single high-dimensional vector that <strong>uniquely characterizes</strong> it.<p>Amazingly, paragraphs that discuss the same concepts have embedding vectors that are "close" to each other in the embedding space. <a href=https://www.technologyreview.com/2015/09/17/166211/king-man-woman-queen-the-marvelous-mathematics-of-computational-linguistics/>Really</a>!<p><img alt loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/617f011f-f2bd-440e-3418-931b8d188d00/public><p>As a very simple example, the vector for "puppies" will be close to the vector for "dogs" and also (in a different direction) close to the vector for "kittens".<h2 id=searching-embeddings-with-pgvector><a href=#searching-embeddings-with-pgvector>Searching Embeddings with pgvector</a></h2><p><a href=https://github.com/pgvector/pgvector>pgvector</a> is a PostgreSQL extension that adds a "vector" data type that can handle the really high dimensionality used by LLM models, as well as some index schemes for quickly searching large collections of those vectors.<pre><code class=language-sql>-- Enable pgvector
CREATE EXTENSION vector; -- the pgvector extension is named "vector"

-- Add an embedding column to the table
ALTER TABLE tng
    ADD COLUMN vec vector;

-- Populate the column with embeddings from an LLM model
UPDATE tng
    SET vec = openai.vector(title || ' -- ' || plot)::vector;
</code></pre><p>Now we have an embedding for every episode summary. Is the embedding of the episode we are looking for "close" to the embedding of the trivia question?<pre><code class=language-sql>SELECT title
FROM tng
ORDER BY vec &#60-> (SELECT openai.vector('What is the Star Trek episode where Deanna and her mother are kidnapped?')::vector)
LIMIT 5
</code></pre><pre><code>                         title
--------------------------------------------------------
 Star Trek: The Next Generation, Ménage à Troi (#3.24)
 Star Trek: The Next Generation, Cost of Living (#5.20)
 Star Trek: The Next Generation, The Loss (#4.10)
 Star Trek: The Next Generation, Manhunt (#2.19)
 Star Trek: The Next Generation, Unification I (#5.7)
</code></pre><p>There it is, and it's even the first entry! <a href=https://en.wikipedia.org/wiki/M%C3%A9nage_%C3%A0_Troi>Ménage à Troi</a> is in fact the episode where Deanna and her mother are kidnapped (by the duplicitous Ferengi!).<h2 id=augmenting-the-query-with-our-new-data><a href=#augmenting-the-query-with-our-new-data>Augmenting the query with our new data</a></h2><p>For this example, our query has been a question about TNG: "What is the Star Trek episode where Deanna and her mother are kidnapped?"<p>We can augment our query by bundling together all the title and plot summary information in the "related" records we found in the last section, and feeding them to the LLM along with the query text.<p><img alt loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/a58ef560-8629-4c0f-e5e4-c36d12262200/public><p>Let's automate the whole chain in one PL/pgSQL function:<ul><li>Look up the embedding vector for the query string.<li>Find the 5 closest entries to that embedding vector.<li>Pull the 5 plot summaries and titles together into one lump of context.<li>Run the query string against the LLM along with the context lump.</ul><pre><code class=language-sql>CREATE OR REPLACE FUNCTION trektrivia(query_text TEXT)
    RETURNS TEXT
    LANGUAGE 'plpgsql' AS $$
DECLARE
    query_embedding VECTOR;
    context_chunks TEXT;
BEGIN
    -- Step 1: Get the embedding vector for the query text
    query_embedding := openai.vector(query_text)::VECTOR;

    -- Step 2: Find the 5 closest plot summaries to the query embedding
    -- Step 3: Lump together results into a context lump
    SELECT string_agg('Episode: { Title: ' || title || ' } Summary: { ' || plot || ' }', E'\n\n\n') INTO context_chunks
    FROM (
        SELECT plot, title
        FROM tng
        ORDER BY vec &#60-> query_embedding
        LIMIT 5
    ) AS similar_plots;

    -- Step 4: Run the query against the LLM with the augmented context
    RETURN openai.prompt(context_chunks, query_text);
END;
$$;
</code></pre><h2 id=running-the-rag><a href=#running-the-rag>Running the RAG</a></h2><p>Now we can run the RAG query and see if we get a better answer!<pre><code class=language-sql>SELECT trektrivia('What is the Star Trek episode where Deanna and her mother are kidnapped?');
</code></pre><pre><code> The answer is: Star Trek: The Next Generation - "Menage à Troi"
 (Season 3, Episode 24)
 In this episode, Counselor Deanna Troi's mother, Lwaxana,
 is kidnapped by the Ferengi along with Commander William Riker,
 and they demand that Captain Picard declare his love for
 Lwaxana in exchange for her safe release.
</code></pre><p>Exactly correct! With the right facts in the context, the LLM was able to compose a coherent and factual answer to the question.<h2 id=conclusion><a href=#conclusion>Conclusion</a></h2><p>There is no doubt that using RAG can increase the quality of LLM answers, though as always the answers should be taken with a grain of salt. This example was built with an 8B parameter model running locally, so the extra context made a big difference. Against a frontier model, it probably would not.<p>Also, it is still possible to get wrong answers from this RAG system; they just tend to be somewhat less wrong. RAG is not a panacea for eliminating hallucination, unfortunately.<h2 id=useful-links><a href=#useful-links>Useful Links</a></h2><ul><li><a href=https://github.com/pgvector/pgvector>pgvector</a><li><a href=https://ollama.com>Ollama</a><li><a href=https://github.com/pramsey/pgsql-openai>pgsql-openai</a><li><a href=https://stackoverflow.blog/2024/06/06/breaking-up-is-hard-to-do-chunking-in-rag-applications/>Chunking in RAG applications</a><li>Enterprise Line Art from <a href=https://patents.google.com/patent/USD307923S/en>US Patent 307923S</a></ul> ]]></content:encoded>
<category><![CDATA[ AI ]]></category>
<author><![CDATA[ Paul.Ramsey@crunchydata.com (Paul Ramsey) ]]></author>
<dc:creator><![CDATA[ Paul Ramsey ]]></dc:creator>
<guid isPermaLink="false">94c49abae0acb1aa373fc7a1234e67b3e21523c95540f8af8dc23e91ea67089c</guid>
<pubDate>Mon, 09 Dec 2024 08:30:00 EST</pubDate>
<dc:date>2024-12-09T13:30:00.000Z</dc:date>
<atom:updated>2024-12-09T13:30:00.000Z</atom:updated></item>
<item><title><![CDATA[ Accessing Large Language Models from PostgreSQL ]]></title>
<link>https://www.crunchydata.com/blog/accessing-large-language-models-from-postgresql</link>
<description><![CDATA[ Paul shows you how to use a new OpenAI API extension for working with LLMs in Postgres. ]]></description>
<content:encoded><![CDATA[ <p>Large language models (LLMs) provide some truly unique capabilities that no other software does, but they are notoriously finicky to run, requiring large amounts of RAM and compute.<p>That means that mere mortals are reduced to two possible paths for experimenting with LLMs:<ul><li>Use a cloud-hosted service like <a href=https://platform.openai.com/docs/overview>OpenAI</a>. You get the latest models and best servers, at the price of a few micro-pennies per token.<li>Use a small <a href=https://ollama.com>locally hosted</a> model. You get the joy of using your own hardware, and only pay for the electricity.</ul><p>Amazingly, you can take <strong>either</strong> approach and use the same access API to hit the LLM services, because the OpenAI API has become a bit of an industry standard.<h2 id=openai-access-extension><a href=#openai-access-extension>OpenAI access extension</a></h2><p>Knowing this, it makes sense to build a basic <a href=https://github.com/pramsey/pgsql-openai>OpenAI API access extension</a> in PostgreSQL to make using the API quick and easy. 
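<p>It helps to see the payload shape involved. Roughly, an OpenAI-style chat completion request is a small JSON document POSTed to the <code>/chat/completions</code> path under the API URI. This Python sketch only builds the JSON; the model name and messages are placeholder values and no request is actually sent:

```python
import json

# Shape of an OpenAI-style chat completion request, roughly what an
# API client assembles before POSTing to {api_uri}/chat/completions.
request_body = {
    "model": "llama3.1:latest",  # or an OpenAI model name like gpt-4o-mini
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Say hello."},
    ],
}

payload = json.dumps(request_body)  # the JSON body sent over HTTP
```

The response comes back as JSON too, which is why native JSON support in the database is all that's needed to parse out the answer.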
The extension we built for this post has three functions:<ul><li><code>openai.models()</code> returns a list of models being served by the API.<li><code>openai.prompt(context text, prompt text)</code> returns the text answer to the prompt, evaluated using the context.<li><code>openai.vector(prompt text)</code> returns the vector embedding of the prompt text.</ul><p>The OpenAI API just accepts JSON queries over HTTP and returns JSON responses, so we have everything we need to build a client extension, combining native PostgreSQL <a href=https://www.postgresql.org/docs/current/datatype-json.html>JSON support</a> with the <a href=https://github.com/pramsey/pgsql-http>http extension</a>.<p>There are two ways to get the extension functions:<ul><li>You can install the extension if you have system access to your database.<li>Or you can just load the <a href=https://github.com/pramsey/pgsql-openai/blob/main/openai--1.0.sql>openai--1.0.sql</a> file, since it is 100% PL/pgSQL code. Just remember to <code>CREATE EXTENSION http</code> first, because the API extension depends on the <a href=https://github.com/pramsey/pgsql-http>http extension</a>.</ul><h3 id=local-or-remote><a href=#local-or-remote>Local or remote</a></h3><p>The API extension determines what API endpoint to hit and what models to use by reading a handful of global variables.<p>Using OpenAI is easy.<ul><li>Sign up for an API key.<li>Set up the key, URI, and model variables.</ul><pre><code class=language-sql>SET openai.api_key = 'your_api_key_here';
SET openai.api_uri = 'https://api.openai.com/v1/';
SET openai.prompt_model = 'gpt-4o-mini';
SET openai.embedding_model = 'text-embedding-3-small';
</code></pre><p>Using a local <a href=https://ollama.com>Ollama</a> model is also pretty easy.<ul><li><p>Download <a href=https://ollama.com>Ollama</a>.<li><p>Verify you can run <code>ollama</code><ul><li>then <code>ollama pull llama3.1:latest</code><li>and <code>ollama pull mxbai-embed-large</code></ul><li><p>Set up the session keys</ul><pre><code class=language-sql>SET openai.api_uri = 'http://127.0.0.1:11434/v1/';
SET openai.api_key = 'none';
SET openai.prompt_model = 'llama3.1:latest';
SET openai.embedding_model = 'mxbai-embed-large';
</code></pre><p>If you want to use the same setup over multiple sessions, use the <code>ALTER DATABASE dbname SET variable = value</code> command to make the values persistent.<h2 id=testing-with-sentiment-analysis><a href=#testing-with-sentiment-analysis>Testing with sentiment analysis</a></h2><p>Assuming you have your system set up, you should be able to run the <code>openai.models()</code> function and get a result. Using <a href=https://ollama.com>Ollama</a>, your result should look a bit like this.<pre><code class=language-sql>SELECT * FROM openai.models();
</code></pre><pre><code>            id            | object |       created       | owned_by
--------------------------+--------+---------------------+----------
 mxbai-embed-large:latest | model  | 2024-11-04 20:48:39 | library
 llama3.1:latest          | model  | 2024-07-25 22:45:02 | library
</code></pre><p>LLMs have made sentiment analysis almost ridiculously easy. The main problem is just convincing the model to restrict its summary of the input to a single indicative value, rather than a fully-written-out summary.<p>For a simple example, imagine a basic feedback form. We get freeform feedback from customers and have the LLM analyze the sentiment in a trigger on INSERT or UPDATE.<pre><code class=language-sql>CREATE TABLE feedback (
    feedback text, -- freeform comments from the customer
    sentiment text -- positive/neutral/negative from the LLM
    );
</code></pre><p>The trigger function is just a call into the <code>openai.prompt()</code> function with an appropriately restrictive context, to coerce the model into only returning a single word answer.<pre><code class=language-sql>--
-- Step 1: Create the trigger function
--
CREATE OR REPLACE FUNCTION analyze_sentiment() RETURNS TRIGGER AS $$
DECLARE
    response TEXT;
BEGIN
    -- Use openai.prompt to classify the sentiment as positive, neutral, or negative
    response := openai.prompt(
        'You are an advanced sentiment analysis model. Read the given feedback text carefully and classify it as one of the following sentiments only: "positive", "neutral", or "negative". Respond with exactly one of these words and no others, using lowercase and no punctuation',
        NEW.feedback
    );

    -- Set the sentiment field based on the model's response
    NEW.sentiment := response;

    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

--
-- Step 2: Create the trigger to execute the function before each INSERT or UPDATE
--
CREATE TRIGGER set_sentiment
    BEFORE INSERT OR UPDATE ON feedback
    FOR EACH ROW
    EXECUTE FUNCTION analyze_sentiment();
</code></pre><p>Once the trigger function is in place, new entries to the feedback form are automatically given a sentiment analysis as they arrive.<pre><code class=language-sql>INSERT INTO feedback (feedback)
    VALUES
        ('The food was not well cooked and the service was slow.'),
        ('I loved the bisque but the flan was a little too mushy.'),
        ('This was a wonderful dining experience, and I would come again,
          even though there was a spider in the bathroom.');

SELECT * FROM feedback;
</code></pre><pre><code>-[ RECORD 1 ]-----------------------------------------------------
feedback  | The food was not well cooked and the service was slow.
sentiment | negative

-[ RECORD 2 ]-----------------------------------------------------
feedback  | I loved the bisque but the flan was a little too mushy.
sentiment | positive

-[ RECORD 3 ]-----------------------------------------------------
feedback  | This was a wonderful dining experience, and I would
            come again, even though there was a spider in
            the bathroom.
sentiment | positive
</code></pre><h2 id=conclusion><a href=#conclusion>Conclusion</a></h2><p>Before LLMs, sentiment analysis involved multiple moving parts. Now we can just ask the black box of the LLM for an answer, using plain English to put parameters around the request.<p>Local model runners like <a href=https://ollama.com>Ollama</a> provide a cost-effective way to test, and maybe even deploy, capable mid-sized models like <a href=https://ai.meta.com/blog/meta-llama-3/>Llama3-8B</a>. ]]></content:encoded>
<category><![CDATA[ AI ]]></category>
<author><![CDATA[ Paul.Ramsey@crunchydata.com (Paul Ramsey) ]]></author>
<dc:creator><![CDATA[ Paul Ramsey ]]></dc:creator>
<guid isPermalink="false">f6a0bb7c81c6b2403981b392641e571e65c765f0850c2af85993f50116d4d618</guid>
<pubDate>Wed, 13 Nov 2024 09:30:00 EST</pubDate>
<dc:date>2024-11-13T14:30:00.000Z</dc:date>
<atom:updated>2024-11-13T14:30:00.000Z</atom:updated></item>
<item><title><![CDATA[ Ruby on Rails Neighbor Gem for AI Embeddings ]]></title>
<link>https://www.crunchydata.com/blog/ruby-on-rails-neighbor-gem-for-ai-embeddings</link>
<description><![CDATA[ Thinking about using pgvector to power some AI data in your Rails app? Chris walks through the very handy Neighbor gem and how it helps for vector data types and ActiveRecord. ]]></description>
<content:encoded><![CDATA[ <p>Over the past 12 months, AI has taken over budgets and initiatives. Postgres is a popular store for AI embedding data because it can store, calculate, optimize, and scale using the <a href=https://www.crunchydata.com/blog/whats-postgres-got-to-do-with-ai>pgvector extension</a>. A recently introduced gem to the Ruby on Rails ecosystem, the neighbor gem, makes working with pgvector and Rails even better.<h4 id=background-on-ai-in-postgres><a href=#background-on-ai-in-postgres>Background on AI in Postgres</a></h4><p>An “embedding” is a set of floating point values that represent the characteristics of a thing (nothing new, we’ve had these since the 70s). Using the OpenAI API or any of their competitors, you can send over blocks of text, images, and pdfs, and OpenAI will return an embedding with 1536 values representing the characteristics. With the <code>pgvector</code> extension, you can store that embedding in a vector column type on Postgres. Then, using nearest neighbor calculations, you can then find the most-similar objects. For a deeper review of <a href=https://www.crunchydata.com/blog/topic/ai>AI with Postgres</a>, see my previous posts in this series.<h2 id=the-neighbor-gem><a href=#the-neighbor-gem>The neighbor gem</a></h2><p>By default, Ruby on Rails does not know about the "vector" data type. If you've used Ruby on Rails + Postgres + pgvector, you've probably written SQL queries in your migrations, and implemented some other janky-code. The <a href=https://github.com/ankane/neighbor>neighbor gem</a> will remove the janky-code, and take you back to a native ActiveRecord experience.<p>At a minimum, all you have to do is add the following to you <code>Gemfile</code>:<pre><code class=language-ruby>gem 'neighbor'
</code></pre><p>Side note: I can't overstate the impact <a href=https://github.com/ankane>Andrew Kane</a> has had on embedding data in Postgres. He's also making it easy for developers to use those vector data types with Ruby on Rails and Node.<h2 id=fixed-schema-dump><a href=#fixed-schema-dump>Fixed schema dump</a></h2><p>The biggest risk of not using Neighbor is that ActiveRecord will create a broken <code>db/schema.rb</code> file. Because ActiveRecord does not understand the <code>vector</code> data type, running <code>rails db:schema:dump</code> will, instead of failing, omit any table with that data type and leave this error in your <code>db/schema.rb</code>:<pre><code class=language-ruby># Could not dump table "recipe_embeddings" because of following StandardError
#   Unknown type 'vector(1536)' for column 'embedding'
</code></pre><p>With Neighbor, you'll get a fully-functional schema like the following:<pre><code class=language-ruby>create_table "recipe_embeddings", primary_key: "recipe_id", id: :bigint, default: nil, force: :cascade do |t|
    t.vector "embedding", limit: 1536, null: false
    t.datetime "created_at", null: false
    t.datetime "updated_at", null: false
    t.index ["embedding"], name: "recipe_embeddings_embedding", opclass: :vector_l2_ops, using: :hnsw
    t.index ["recipe_id"], name: "index_recipe_embeddings_on_recipe_id"
end
</code></pre><p>Notice that Neighbor also understands the <a href=https://www.crunchydata.com/blog/hnsw-indexes-with-postgres-and-pgvector><code>hnsw</code> index type</a> released with pgvector 0.5.<p><strong>Side note</strong>: for projects that go all-in on Postgres, I opt to use the following to dump to a <code>db/structure.sql</code>:<pre><code>SCHEMA_FORMAT=sql rails db:schema:dump
</code></pre><h2 id=easier-migrations--data-type-handling><a href=#easier-migrations--data-type-handling>Easier migrations + data type handling</a></h2><p>Without Neighbor, ActiveRecord is not informed of the vector data type in migrations either. With Neighbor, a typical migration would look something like the following:<pre><code class=language-ruby>create_table :recipe_embeddings, primary_key: [:recipe_id] do |t|
  t.references :recipe, null: false, foreign_key: true
  t.vector :embedding, limit: 1536, null: false

  t.timestamps
end
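
# Note: the vector type requires the pgvector extension to be enabled in
# the database, e.g. in an earlier migration: enable_extension "vector"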
</code></pre><p>Additionally, you get improved handling of the vector data type. Without Neighbor, working with embedding data required <code>to_s</code> to manipulate the values when inserting into Postgres. But, with Neighbor, it simplifies to a native process:<pre><code class=language-ruby>RecipeEmbedding.create!(recipe_id: Recipe.last.id, embedding: [-0.078427136, 0.0014401458, ...])
</code></pre><p>But, wait! There's more …<h2 id=the-nearest_neighbor-method><a href=#the-nearest_neighbor-method>The <code>nearest_neighbor</code> method</a></h2><p>After you add the <code>embedding</code> column to a table, you can use <code>has_neighbors</code> to define your nearest neighbor queries:<pre><code class=language-ruby>class RecipeEmbedding &#60 ApplicationRecord
  has_neighbors :embedding
end
</code></pre><p>Then, you can find the nearest neighbors like so:<pre><code class=language-ruby>recipe_embedding.nearest_neighbors(:embedding, distance: "euclidean").first
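
# or, for the five closest matches using cosine distance:
# recipe_embedding.nearest_neighbors(:embedding, distance: "cosine").first(5)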
</code></pre><p>The distance calculations include <code>euclidean</code> and <code>cosine</code>.<h2 id=conclusion><a href=#conclusion>Conclusion</a></h2><p>Launching a project to use embeddings with Ruby on Rails?<p>Step 1: use the neighbor gem<p>Step 2: provision your database on <a href=https://www.crunchydata.com/products/crunchy-bridge>Crunchy Bridge</a> with pgvector<p>Step 3: profit ]]></content:encoded>
<category><![CDATA[ AI ]]></category>
<category><![CDATA[ Ruby on Rails ]]></category>
<author><![CDATA[ Christopher.Winslett@crunchydata.com (Christopher Winslett) ]]></author>
<dc:creator><![CDATA[ Christopher Winslett ]]></dc:creator>
<guid isPermalink="false">aa4e8c25d1d0a137f5d8f6dfd0e3d8bda9165c7e81aa6b2a31bb44bbb24980b1</guid>
<pubDate>Fri, 03 Nov 2023 09:00:00 EDT</pubDate>
<dc:date>2023-11-03T13:00:00.000Z</dc:date>
<atom:updated>2023-11-03T13:00:00.000Z</atom:updated></item>
<item><title><![CDATA[ HNSW Indexes with Postgres and pgvector ]]></title>
<link>https://www.crunchydata.com/blog/hnsw-indexes-with-postgres-and-pgvector</link>
<description><![CDATA[ pgvector for Postgres recently got a big upgrade with the new HNSW indexes. Chris explains what they are, how to use them, and what the accuracy and performance implications are for adding these. ]]></description>
<content:encoded><![CDATA[ <p>Postgres’ <a href=https://github.com/pgvector/pgvector>pgvector extension</a> recently added HNSW as a new index type for vector data. This levels up the database for vector-based embeddings output by AI models. A few months ago, we had written about approximate nearest neighbor <a href=https://www.crunchydata.com/blog/pgvector-performance-for-developers>pgvector performance using the available list-based indexes</a>. Now, with the addition of HNSW, pgvector can use the latest graph based algorithms to approximate nearest neighbor queries. As with all things databases, there are trade-offs, so don’t throw away the list-based methodologies — and don’t throw away the techniques we discussed in <a href=https://www.crunchydata.com/blog/scaling-vector-data-with-postgres>scaling vector data</a>.<h2 id=tldr><a href=#tldr>TL;DR</a></h2><p>HNSW is cutting edge for all vector based indexing. To build an HNSW index, run something like the following:<pre><code class=language-pgsql>CREATE INDEX ON recipes
USING hnsw (embedding vector_l2_ops)
WITH (m = 4, ef_construction = 10);
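-- m and ef_construction are optional; if omitted, pgvector uses its
-- defaults (m = 16, ef_construction = 64)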
</code></pre><p>These indexes will:<ul><li>use approximations (not precision)<li>be more performant than list-based indexes<li>require longer index build times<li>and require more memory</ul><p>Tradeoffs:<ul><li>Indexes will take longer to build depending on the values of <em>m</em> and <em>ef_construction</em>. Increasing these values slows the index build drastically without improving query performance, though it may increase the accuracy of responses.<li>To search for more than 40 nearest neighbors, increase the <code>SET hnsw.ef_search = x;</code> value, where <code>x</code> is the number of nearest neighbors you want to return.<li>Not all queries will work with HNSW. As we said in the <a href=https://www.crunchydata.com/blog/pgvector-performance-for-developers>vector performance blog post</a>, use <code>EXPLAIN</code> to ensure your query is using the index. If it is not using the index, simplify your query until it is, then build back to your complexity.</ul><h2 id=what-is-hnsw><a href=#what-is-hnsw>What is HNSW?</a></h2><p>HNSW is short for Hierarchical Navigable Small World. But, HNSW isn’t just one algorithm — it’s kind of like 3 algorithms in a trench coat. The first algorithm was <a href=https://www.iiis.org/CDs2011/CD2011IDI/ICTA_2011/PapersPdf/CT175ON.pdf>first presented in a paper in 2011</a>. It used graph topology to find the vertex (or element) with the local minimum nearest neighbor. Then, a couple more papers were published, but the <a href=https://www.researchgate.net/publication/262334462_Scalable_Distributed_Algorithm_for_Approximate_Nearest_Neighbor_Search_Problem_in_High_Dimensional_General_Metric_Spaces>one in 2014 combined use of multiple small worlds (i.e. hierarchy) + probability distributions + graph traversal</a> to approximate nearest neighbors. Think of all of these algorithms as layers of graphs that get progressively more detailed.<p>Let’s walk through a bit of an illustration.
This illustration is important for understanding how tuning parameters affect performance and resource usage later. First off, imagine points spread throughout a 1536-dimensional universe, where we want to find the nearest 5 points. That’s tough to think about, right? Instead, cut that down to a 2-dimensional plane. Imagine the following image as the known population of values. Place your finger somewhere on this image and then imagine trying to find the nearest 5 neighbors from this population:<p><img alt="hnsw starting" loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/0e6d9d67-99ee-4a9c-17c0-c4afe73b9300/public><p>If you were querying nearest neighbors from a list in a database, would you iterate over all known points and measure distance? If so, that is the most costly algorithm. How can we optimize this? First, let’s create a sparse population of just 12 points:<p><img alt="hnsw 2" loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/572431a0-0198-4b46-5aa5-7c9be8596b00/public><p>Place your finger in the same place on the image above and find the 5 closest points. Next, let’s add an additional layer that connects the points from the sparse layer to a denser layer constructed as a graph of points.<p><img alt="hnsw 3" loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/04e1ba74-ba99-40b4-09e7-ab4784b45600/public><p>From the top layer in the diagram above, take your 10 closest points, and trace the arrows to the 10 closest points in the bottom layer. Then, traverse all connecting lines (in graph theory, they are called edges) to connecting points (vertices). While comparing distance, only keep a list of the 10 closest -- discard any that are farther away than the top-10.
(This 10 value will be associated with the <em>ef</em> values we discuss later.)<p>Then, let’s add another layer, and do the same thing one more time:<p><img alt="hnsw 4" loading=lazy src=https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/633ec1bf-af3a-4ece-7322-79275e960600/public><p>As you can see, the illustration gets a bit messy, but the idea still comes through, I hope. The next layer repeats the same algorithm with denser data. Navigate to the points connecting to the densest layer, traverse their graphs, and run the top-10 algorithm once again. Once those are compared, extract the top-5 from the list.<p>You can see where the “Hierarchical Navigable Small Worlds” name comes from. It’s a hierarchy in that the series of data gets denser and denser. It’s navigable in that you can move from sparse to dense. These multiple small worlds lead to fewer calculations to get to the approximate answer. Previously, I said that it’s actually 3 algorithms in a trench coat: the complete HNSW algorithm arrived by each paper piggy-backing on the success of the prior.<p>The code for HNSW is pretty amazing. It’s clean enough that even non-C programmers can follow along to see the logic of building and scanning indexes. See these links for a peek at the underlying code:<ul><li><a href=https://github.com/pgvector/pgvector/blob/master/src/hnswbuild.c>Index Build</a><li><a href=https://github.com/pgvector/pgvector/blob/a8e257e1f1aaf4c8c9019dcf4ac41bea98a41fff/src/hnswutils.c>Index Utils</a><li><a href=https://github.com/pgvector/pgvector/blob/a8e257e1f1aaf4c8c9019dcf4ac41bea98a41fff/src/hnswscan.c>Index Scan</a></ul><p>After you understand the HNSW thesis, you can go back and read the <a href=https://github.com/pgvector/pgvector/blob/a8e257e1f1aaf4c8c9019dcf4ac41bea98a41fff/src/hnswutils.c#L546><code>HnswSearchLayer</code> function</a> for fun.
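<p>To make the layer-search bookkeeping concrete, here is a rough sketch in Python of a single-layer search (a conceptual illustration only, not pgvector’s C implementation; the graph is a plain adjacency dict, distances are Euclidean, and <code>ef</code> is the size of the bounded candidate list):<pre><code class=language-python>import heapq
import math

def search_layer(graph, coords, query, entry, ef):
    # Greedy best-first search over one graph layer, keeping a bounded
    # list of the ef closest points seen so far (the "list of the 10
    # closest" from the illustration above).
    def dist(p):
        return math.dist(coords[p], query)
    visited = {entry}
    candidates = [(dist(entry), entry)]  # min-heap: explore closest first
    best = [(-dist(entry), entry)]       # max-heap of the ef closest so far
    while candidates:
        d, point = heapq.heappop(candidates)
        if d > -best[0][0]:
            break                        # nothing closer left to explore
        for neighbor in graph[point]:    # traverse edges (connecting lines)
            if neighbor in visited:
                continue
            visited.add(neighbor)
            nd = dist(neighbor)
            if ef > len(best) or -best[0][0] > nd:
                heapq.heappush(candidates, (nd, neighbor))
                heapq.heappush(best, (-nd, neighbor))
                if len(best) > ef:
                    heapq.heappop(best)  # discard the farthest
    return sorted((-d, p) for d, p in best)
</code></pre><p>Running this per layer, from sparsest to densest, with the previous layer’s results as the entry points, gives the full hierarchical search.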
Additionally, see how the <a href=https://github.com/pgvector/pgvector/blob/a8e257e1f1aaf4c8c9019dcf4ac41bea98a41fff/src/hnswutils.c#L674>HNSW implementation calculates and caches distances</a>.<h2 id=the-advantages-of-hnsw><a href=#the-advantages-of-hnsw>The advantages of HNSW</a></h2><p>HNSW is much faster to query than the traditional list-based query algorithm. This performance comes from the use of graphs and layers, which reduces the number of distance comparisons being run. And because there are fewer distance comparisons, we can run more queries concurrently as well.<h2 id=tradeoffs-for-hnsw><a href=#tradeoffs-for-hnsw>Tradeoffs for HNSW</a></h2><p>The most obvious trade-off for HNSW indexes is that they are approximations. But this is no different from any existing vector index; the only exact method is a table scan of comparisons. If you need exact answers, it is best to run the non-indexed query that calculates the distance for each row.<p>The second trade-off for HNSW indexes is that they can be expensive to build. The two largest contributing variables are the size of the dataset and the complexity of the index. For moderate datasets of over 1M rows, it can take 6 minutes to build even the simplest of indexes. During that time, the database will use all the RAM it has available in <code>maintenance_work_mem</code>, while redlining the CPU. Long story short, test it on a production-size dataset before embarking.<p>The third trade-off for HNSW indexes is that they are sizable — the index for 1M rows of AI embeddings can be 8GB or larger. For performance reasons, you’ll want all of this index in memory. HNSW is fast because it uses resources.<h2 id=index-tuning-values><a href=#index-tuning-values>Index tuning values</a></h2><p>In the illustrations above, we showed how the index progressed through executing a query. But how are these indexes built? Think of the HNSW index build as a massive query that pre-calculates a large number of distances.
Index tuning is all about how the database limits the algorithms that build those indexes. Go back to the initial illustration and ask the question “how do we build an HNSW index from the data set?”<p>Points are saved to the top and middle layer based on probability: 1% are saved to the top layer, and 5% are saved to the middle layer. To build the index, the database loops through all values. As it loops to the next value, the algorithm uses the previously built index and the same algorithm described above to place the value within the graph. When building, each point needs to be on the graph, thus each point needs to connect to the nearest points it can find. On large datasets, it would be impossible to scan all rows and connect them in a graph to their closest neighbors within a reasonable time — thus use the following limits to constrain the build process.<h3 id=m><a href=#m>m</a></h3><p><em>m</em> is the number of connections to nearest neighbors (vertices) made per point per layer within the index. When building the graph indexes, the database is seeking nearest neighbors to build out the edges, and <em>m</em> is the maximum count for that layer. By limiting the number of connections per point, the index limits the total number of connections between points at that layer. Thus, the build time improves with smaller values of <em>m</em>. All you have to know is that, in the original paper, as <em>m</em> approaches infinity, the approach creates “graphs with polylogarithmic complexity”.<h3 id=ef_construction><a href=#ef_construction>ef_construction</a></h3><p><em>ef_construction</em> is the candidate list size used during the index build. Above, in the illustration, I was telling you to keep a list of the 10 closest, and discard any outside those 10. During the index build, the database walks through each record, placing that record’s values within the index structure. Later records use the index built so far to more quickly position the currently processed record.
As the index build process moves through the layers, it keeps a list of the closest values from the prior layer. During the build, the list is sorted by distance and truncated at the length of <em>ef_construction</em>’s value. Once the value is placed within the graph, this list will be truncated to the length of <em>m</em>. The relationship between <em>ef_construction</em> and <em>m</em> is the reason <em>ef_construction</em> is required to be at least 2x the value of <em>m</em>. The larger the value for <em>ef_construction</em>, the slower the index build.<p>What are the best values for <em>m</em> and <em>ef_construction</em>? In our tests, we have confirmed the statements from <a href=https://arxiv.org/pdf/1603.09320.pdf>the original paper</a>:<blockquote><p>The only meaningful construction parameter left for the user is M. A reasonable range of M is from 5 to 48. Simulations show that smaller M generally produces better results for lower recalls and/or lower dimensional data, while bigger M is better for high recall and/or high dimensional data.</blockquote><p>And for <em>ef_construction</em>:<blockquote><p>Construction speed/index quality tradeoff is controlled via the efConstruction parameter. (…) Further increase of the efConstruction leads to little extra performance but in exchange of significantly longer construction time.</blockquote><p>So, long story short, keep the numbers relatively small because the quality improvement isn’t worth the performance hit.<h2 id=query-tuning-values><a href=#query-tuning-values>Query tuning values</a></h2><h3 id=ef_search><a href=#ef_search>ef_search</a></h3><p>This value functions the same as the <em>ef_construction</em> value, except for queries. This is a query-time parameter that limits the number of nearest neighbors maintained in the candidate list. Because of this, <em>ef_search</em> functions as two limiters: the maximum number of records returned and a limit on accuracy.
If you are trying to return 100 nearest neighbors and the <em>ef_search</em> value is set to 40 (which is the default), then the query will only be capable of returning 40 rows. <em>ef_search</em> also limits accuracy, as nested graphs will not be traversed beyond that count, leaving fewer data points for comparison.<p>A smaller <em>ef_search</em> value will result in faster queries at the risk of inaccuracy. You can set it for a session as below, or use <code>SET LOCAL</code> to constrain it to a transaction.<pre><code class=language-pgsql>SET hnsw.ef_search = 5;
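
-- or, scoped to a single transaction:
BEGIN;
SET LOCAL hnsw.ef_search = 5;
-- run the nearest neighbor query here
COMMIT;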
</code></pre><h2 id=using-hnsw---a-code-sample><a href=#using-hnsw---a-code-sample>Using HNSW - A code sample</a></h2><p>For this code sample, we will continue to use the SQL code found within the <a href=https://github.com/CrunchyData/Postgres-AI-Tutorial>Postgres AI Tutorial</a>. Pull down that file, and load it into a Postgres database with the pgvector extension. If you do not have a database with the pgvector extension, try <a href=https://crunchybridge.com/login>Crunchy Bridge for your Postgres hosting</a> and install the extension there. To load the file, run:<pre><code class=language-bash>cat recipe-tracker.sql | psql 'postgres://user:password@host:port/database'
</code></pre><p>This schema is a set of recipes from the Armed Services Recipe list. We have categorized these recipes using OpenAI, as described in the <a href=https://www.crunchydata.com/blog/whats-postgres-got-to-do-with-ai>Postgres + AI blog post in this series</a>. Then, connect to your Postgres database and run this query:<pre><code class=language-pgsql>SELECT
   name
FROM
   recipes
ORDER BY
   embedding &#60-> (
   SELECT
      embedding
   FROM
      recipes
   WHERE
      id = 151 ) 		-- 151 is the primary key for chocolate chip cookies
      LIMIT 5;
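
-- &#60-> is pgvector's Euclidean (L2) distance operator, which pairs with
-- the vector_l2_ops operator class used for the HNSW indexes in this post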
</code></pre><p>The response should be the following:<pre><code class=language-text>name
------------------------------
 Cookies, chocolate chip
 Cookies, crisp chocolate
 Cookies, chocolate drop
 Bar, toffee, crisp
 Cookies, peanut butter
</code></pre><p>If you prepend <code>EXPLAIN</code> to the select statement, you’ll see that the query iterated over 720 rows for the sort step:<pre><code class=language-text>QUERY PLAN
-----------------------------------------------------------------------------------------------
 Limit  (cost=2754.87..2754.89 rows=10 width=30)
   InitPlan 1 (returns $0)
     ->  Index Scan using recipes_pkey on recipes recipes_1  (cost=0.28..8.31 rows=1 width=18)
           Index Cond: (id = 151)
   ->  Sort  (cost=2746.56..2748.36 rows=720 width=30)
         Sort Key: ((recipes.embedding &#60-> $0))
         ->  Seq Scan on recipes  (cost=0.00..2731.00 rows=720 width=30)
(7 rows)
</code></pre><p>Let’s create an index to see how much we can reduce the table scans:<pre><code class=language-pgsql>CREATE INDEX
ON recipes USING hnsw (embedding vector_l2_ops) WITH (m = 4, ef_construction = 10);
</code></pre><p>Now, if you run the same query again, you’ll see that the response is the same as above. With larger data sets, the query may return slightly different rows due to the effect of approximation:<pre><code class=language-text>name
--------------------------
 Cookies, chocolate chip
 Cookies, crisp chocolate
 Cookies, chocolate drop
 Bar, toffee, crisp
 Cookies, peanut butter
</code></pre><p>To confirm that the query is using the index, note that the plan lists an index scan using the <code>recipes_embedding_idx</code> index on the next-to-last row:<pre><code class=language-text>QUERY PLAN
--------------------------------------------------------------------------------------------------
 Limit  (cost=100.49..118.22 rows=5 width=30)
   InitPlan 1 (returns $0)
     ->  Index Scan using recipes_pkey on recipes recipes_1  (cost=0.28..8.31 rows=1 width=18)
           Index Cond: (id = 151)
   ->  Index Scan using recipes_embedding_idx on recipes  (cost=92.18..2645.18 rows=720 width=30)
         Order By: (embedding &#60-> $0)
</code></pre><p>As noted in the TL;DR, the query planner is not perfect with HNSW, and prefers a simpler query. If we run a similar query but include a CTE, the HNSW index is not used by the planner:<pre><code class=language-pgsql>EXPLAIN WITH random_recipe AS
(
   SELECT
      id,
      embedding
   FROM
      recipes
   WHERE
      recipes.id = 151 LIMIT 5
)
SELECT
   recipes.id,
   recipes.name
FROM
   recipes,
   random_recipe
WHERE
   recipes.id != random_recipe.id
ORDER BY
   recipes.embedding &#60-> random_recipe.embedding LIMIT 5;
</code></pre><p>Long story short, the simpler the better for HNSW usage.<h2 id=hnsw-indexes-and-scaling><a href=#hnsw-indexes-and-scaling>HNSW indexes and scaling</a></h2><p>HNSW indexes are much more performant than the older list-based indexes. They also use more resources. Concurrency is improved, but many of the processes we laid out in the <a href=https://www.crunchydata.com/blog/scaling-vector-data-with-postgres>Scaling PGVector blog post</a> are still applicable.<ol><li>Physical separation of data: because of the build requirements of the indexes, continue to host your vector data on a physically separate database.<li>Caching: if you are running the same queries many times with very little change to the data, consider using caching.<li>Dimensional Reduction: dimensional reduction is even better with HNSW indexes. If your dataset is one that works well with dimensional reduction, you can benefit from faster build times, smaller indexes, and improved query times.</ol> ]]></content:encoded>
<category><![CDATA[ AI ]]></category>
<author><![CDATA[ Christopher.Winslett@crunchydata.com (Christopher Winslett) ]]></author>
<dc:creator><![CDATA[ Christopher Winslett ]]></dc:creator>
<guid isPermalink="false">0ddc75c65375e5a3145db61d1013188814f8881d91c603a569ddfc58e7055642</guid>
<pubDate>Fri, 01 Sep 2023 09:00:00 EDT</pubDate>
<dc:date>2023-09-01T13:00:00.000Z</dc:date>
<atom:updated>2023-09-01T13:00:00.000Z</atom:updated></item></channel></rss>