
Understanding RAG

Vidit Kushwaha
Feb 11, 2026 · 4 min read

Large Language Models (LLMs) are impressive, but they have a knowledge cutoff and a tendency to hallucinate (confidently stating falsehoods).

Retrieval-Augmented Generation (RAG) solves this by giving the model an "open-book exam" capability, allowing it to reference specific, up-to-date data before answering.


Key Concepts: Tokens and Embeddings

Before diving into the RAG workflow, we must understand how AI "reads" text.


1. What is a Token?

A token is the basic unit of text for an AI. It can be:

  • a whole word
  • a part of a word
  • punctuation

Rule of Thumb

  • 100 tokens ≈ 75 words

Context Window

If a model has a 128k token limit, it can "read" about 96,000 words at once.
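The arithmetic behind that estimate is just the rule of thumb applied to the context window, as this small sketch shows:

```python
# Rough word-capacity estimate from a token limit,
# using the ~100 tokens ≈ 75 words rule of thumb.
def approx_words(token_limit: int) -> int:
    return int(token_limit * 75 / 100)

print(approx_words(128_000))  # 96000
```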


2. Encoding vs. Embedding

While both convert text to numbers, they serve different purposes:

Encoding

  • A simple transformation into a machine-readable format
  • Example: ASCII or unique IDs
  • Has no semantic meaning

Embedding

  • A sophisticated form of encoding
  • Text is converted into a dense vector (a list of numbers like [0.12, -0.45, ...])

These vectors act as coordinates in a high-dimensional space.

Semantic Meaning

Words with similar meanings (e.g., "dog" and "puppy") are placed close together in this space.
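The contrast can be made concrete with a toy sketch. The IDs and three-dimensional vectors below are hand-made for illustration; a real embedding model produces vectors with hundreds of dimensions:

```python
import math

# Encoding: arbitrary IDs. Numeric closeness (0 vs. 1) carries no meaning.
vocab_ids = {"dog": 0, "puppy": 1, "car": 2}

# Embedding: dense vectors where similar meanings sit close together.
# These toy vectors are invented for illustration only.
embeddings = {
    "dog":   [0.9, 0.1, 0.0],
    "puppy": [0.8, 0.2, 0.1],
    "car":   [0.0, 0.9, 0.4],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# "dog" vs. "puppy" scores much higher than "dog" vs. "car".
print(cosine(embeddings["dog"], embeddings["puppy"]))  # close to 1.0
print(cosine(embeddings["dog"], embeddings["car"]))    # close to 0.0
```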

The RAG Workflow: Step-by-Step

RAG can be broken down into two main phases: Preparation (indexing your data) and Inference (answering the user).

Phase 1: Preparing the Knowledge Base

To make data searchable, we must process it into a format the AI can navigate.

1. Gather & Preprocess

Convert files (PDFs, HTML, Docx) into clean text.
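For HTML, the cleanup step can be sketched with Python's standard library alone; a real pipeline would use dedicated parsers for PDFs and Docx files:

```python
from html.parser import HTMLParser

# Minimal HTML-to-text extractor using only the standard library.
class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    # Drop whitespace-only fragments and join the rest into clean text.
    return " ".join(s.strip() for s in parser.parts if s.strip())

print(html_to_text("<p>Hello <b>world</b></p>"))  # Hello world
```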

2. Chunking

LLMs can only read a limited amount of text at once. We break large documents into smaller, manageable "chunks" so that retrieval returns focused, relevant passages rather than entire documents.
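A minimal chunker splits on word boundaries with a small overlap so context is not lost at chunk edges. This is one common simple strategy, not the only one:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping word-based chunks.

    chunk_size and overlap are in words; chunk_size must exceed overlap.
    """
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]
```

With `chunk_size=4, overlap=1`, a ten-word document yields chunks whose last word repeats as the next chunk's first word, preserving continuity across boundaries.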

3. Embedding & Indexing

Each chunk is passed through an embedding model to create a vector. These vectors are stored in a Vector Database (like ChromaDB, FAISS, or Milvus).
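The indexing step can be sketched in memory. The `embed` function below is a hypothetical stand-in (letter frequencies, normalized to unit length); in practice it would be a call to a real embedding model, and the list would be a vector database:

```python
import math

# Hypothetical stand-in for a real embedding model: letter-frequency
# vectors, normalized to unit length. Real models output hundreds of
# dimensions learned from data.
def embed(text: str) -> list[float]:
    text = text.lower()
    counts = [text.count(c) for c in "abcdefghijklmnopqrstuvwxyz"]
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]

# The "index": each entry pairs a chunk with its vector.
chunks = ["Dogs are loyal pets.", "The stock market fell today."]
index = [(chunk, embed(chunk)) for chunk in chunks]
```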


Phase 2: The Retrieval & Generation Process

When a user asks a question, the following happens in milliseconds:

1. Prompt Embedding

The user’s question is converted into a vector using the same embedding model used for the data.

2. Vector Search

The system calculates the distance (often using Cosine Similarity) between the question vector and the stored data vectors. The "closest" chunks are the most relevant.
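The search itself reduces to scoring every stored vector against the query vector and keeping the top matches. A minimal sketch, using a tiny hand-made index for illustration:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], index: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the k chunks whose vectors are most similar to the query."""
    ranked = sorted(index, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# Toy two-dimensional index and query, for illustration only.
index = [("dog facts", [0.9, 0.1]), ("car facts", [0.1, 0.9])]
print(top_k([0.8, 0.2], index, k=1))  # ['dog facts']
```

Real vector databases avoid this brute-force scan by using approximate nearest-neighbor indexes, but the scoring idea is the same.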

3. Augmenting the Prompt

The system constructs a new "super-prompt" containing:

  • Instructions: "Use only the provided context to answer."
  • Retrieved Context: The relevant chunks found in the database.
  • User Question: The original query.
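Assembling those three parts is plain string formatting. A minimal sketch (the exact wording and layout of the template vary by application):

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Combine instructions, retrieved context, and the user question."""
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "Use only the provided context to answer.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

print(build_prompt("Are dogs loyal?", ["Dogs are loyal pets."]))
```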

4. Generation

The LLM reads the context and generates a grounded, accurate response.

Why Use RAG?

1. Up-to-Date Information

You don't need to retrain a massive model — just update your vector database.

2. Reduced Hallucinations

By forcing the model to cite its sources from the retrieved text, the "confident lying" common in LLMs is significantly reduced.

3. Data Privacy

You can run RAG on "closed" data (internal company documents) without leaking that data into the public training sets of base models.

Comparison of Distance Metrics

To find the "closest" data, developers typically use one of two mathematical approaches:

Cosine Similarity

  • Measures the angle between two vectors.
  • Best for: semantic similarity (meaning).

Euclidean Distance

  • Measures the straight-line distance between points.
  • Best for: when the magnitude of the data matters.
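The difference is easy to see with two vectors that point in the same direction but differ in magnitude: cosine similarity treats them as identical, while Euclidean distance sees them as far apart.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def euclidean_distance(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Same direction, different magnitudes.
a, b = [1.0, 1.0], [10.0, 10.0]
print(cosine_similarity(a, b))   # 1.0 — the angle between them is zero
print(euclidean_distance(a, b))  # large — the points are far apart
```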

