Large Language Models (LLMs) are impressive, but they have a knowledge cutoff and a tendency to hallucinate (confidently stating falsehoods).
Retrieval-Augmented Generation (RAG) solves this by giving the model an "open-book exam" capability, allowing it to reference specific, up-to-date data before answering.

Before diving into the RAG workflow, we must understand how AI "reads" text.
A token is the basic unit of text for an AI. It can be:
- A whole word ("dog")
- A piece of a word ("ing", "pre")
- A punctuation mark
If a model has a 128k token limit, it can "read" about 96,000 words at once.
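The ~96,000-word figure comes from the common rule of thumb that one token corresponds to roughly 0.75 English words (the exact ratio varies by tokenizer and language). A quick sketch of that back-of-the-envelope estimate:

```python
def estimate_words(token_count: int, words_per_token: float = 0.75) -> int:
    """Rough heuristic: one token is ~0.75 English words.
    The ratio is an approximation and varies by tokenizer."""
    return int(token_count * words_per_token)

# A 128k-token context window holds roughly 96,000 words of text.
print(estimate_words(128_000))
```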
While both tokens and embeddings represent text as numbers, they serve different purposes. Tokens are integer IDs used to feed text into the model; an embedding is a list of floating-point numbers (e.g., [0.12, -0.45, ...]) that captures meaning. These vectors act as coordinates in a high-dimensional space.
Words with similar meanings (e.g., "dog" and "puppy") are placed close together in this space.
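"Closeness" in that space is usually measured with cosine similarity. A minimal sketch using toy 3-dimensional vectors (real embedding models output hundreds or thousands of dimensions; the values below are made up for illustration):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" — not output from a real model.
dog   = [0.90, 0.80, 0.10]
puppy = [0.85, 0.75, 0.15]
car   = [0.10, 0.20, 0.95]

print(cosine_similarity(dog, puppy))  # close to 1.0 (similar meaning)
print(cosine_similarity(dog, car))    # much lower (unrelated meaning)
```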
RAG can be broken down into two main phases: Preparation (indexing your data) and Inference (answering the user).

To make data searchable, we must process it into a format the AI can navigate.
First, convert source files (PDFs, HTML, DOCX) into clean text.
Next, because LLMs have limits on how much they can read at once, we break large documents into smaller, manageable "chunks" so that retrieval stays precise.
Each chunk is passed through an embedding model to create a vector. These vectors are stored in a Vector Database (like ChromaDB, FAISS, or Milvus).
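The chunking and indexing steps above can be sketched as follows. This is a minimal illustration, not production code: `embed` is a stand-in for a real embedding model, and a plain Python list stands in for a vector database like ChromaDB, FAISS, or Milvus.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping word-based chunks. The overlap helps
    preserve context that would otherwise be cut at a chunk boundary."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks

def index_document(text: str, embed, store: list) -> None:
    """Embed each chunk and keep (vector, chunk) pairs in the store.
    `embed` stands in for a real embedding model; `store` is a plain
    list standing in for a vector database."""
    for chunk in chunk_text(text):
        store.append((embed(chunk), chunk))
```

Chunk size and overlap are tuning knobs: smaller chunks give more precise retrieval, larger ones carry more context.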
When a user asks a question, the following happens in milliseconds:
The user’s question is converted into a vector using the same embedding model used for the data.
The system calculates the distance (often using Cosine Similarity) between the question vector and the stored data vectors. The "closest" chunks are the most relevant.
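A bare-bones version of that search step, assuming the store holds (vector, chunk) pairs as built during indexing (a real vector database would use an approximate-nearest-neighbor index instead of a full scan):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def retrieve(question_vector: list[float], store: list, k: int = 3) -> list[str]:
    """Return the k chunks whose vectors are closest to the question vector.
    `store` is a list of (vector, chunk_text) pairs."""
    ranked = sorted(store,
                    key=lambda pair: cosine_similarity(question_vector, pair[0]),
                    reverse=True)
    return [chunk for _, chunk in ranked[:k]]
```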
The system constructs a new "super-prompt" containing:
- The retrieved chunks (the context)
- The user's original question
- An instruction such as "Use only the provided context to answer."

The LLM reads the context and generates a grounded, accurate response.
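A minimal sketch of that prompt assembly (the exact wording and layout are a matter of prompt design, not a fixed standard):

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble the "super-prompt": instruction, retrieved context,
    and the user's original question."""
    context = "\n\n".join(chunks)
    return (
        "Use only the provided context to answer.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```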

To teach the system new information, you don't need to retrain a massive model; you just update your vector database.
By forcing the model to cite its sources from the retrieved text, the "confident lying" common in LLMs is significantly reduced.
You can run RAG on "closed" data (internal company documents) without leaking that data into the public training sets of base models.
To find the "closest" data, developers typically use one of two mathematical approaches:
| Metric | Description | Best For |
|---|---|---|
| Cosine Similarity | Measures the angle between two vectors. | Semantic similarity (meaning). |
| Euclidean Distance | Measures the straight-line distance between points. | When the magnitude of the data matters. |
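The table's distinction can be seen in a toy comparison: a vector scaled to twice its length points in the same direction (cosine similarity stays 1.0), yet its Euclidean distance from the original is nonzero. The vectors here are made up for illustration:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Angle-based: ignores vector magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def euclidean_distance(a: list[float], b: list[float]) -> float:
    """Straight-line distance: magnitude differences show up."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

v = [1.0, 2.0]
w = [2.0, 4.0]  # same direction as v, twice the magnitude

print(cosine_similarity(v, w))   # 1.0 — identical direction/"meaning"
print(euclidean_distance(v, w))  # sqrt(5) ≈ 2.24 — magnitude gap is visible
```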
